Who Owns History?
Brewster Kahle’s Internet Archive
David Womack
The history of the Internet is located between an auto-body shop and a set of new condos at the corner of 6th and Mission in San Francisco. The building, which one of the tenants describes as “an old sweatshop,” does have a front door, but Brewster Kahle, founder of the Internet Archive, prefers to use the freight elevator that opens off the alley. One hundred and fifty or so PCs sit nested in cables in a large loft on the second floor. These machines contain 10 billion web pages full of the promises of politicians, satellite images of Siberia, chats on knitting, chemical formulas for homemade explosives, and every other type of information—both profound and profane—human beings might conceivably produce, stretching back to 1996.
Nineteen ninety-six was when Brewster Kahle set out to do the preposterous—to archive the entire Internet not just once, but over and over again. He has now created the largest and most complete archive of the Internet in existence. Taken together, these machines hold well over a hundred terabytes of data—four times more information than is contained in the Library of Congress. Kahle is currently archiving every public website approximately once every two months. Until October 2001, when Kahle made the archives free and open to the public via an application he named “the Wayback machine,” these machines hummed away in relative obscurity. Now, the machines and the information they contain are at the forefront of a debate that is determining the future of the web’s history.
Seven million three hundred thousand new web pages are published every day, adding up to 250 megabytes of information produced a year for every man, woman, and child on earth.[1] Yet although information is now being created at an unprecedented pace, it is forgotten almost as quickly. Web pages, on average, exist for only about 100 days.[2] Unlike mass-consumable printed material, web pages (even large, “dense” ones) often disappear without a trace. They “go dark.” The templates and databases are discarded, dismantled, or overwritten. Without the Archive, millions of pages of information would have disappeared completely. As the pace of information accelerates, we have fewer resources to understand the present and are remembering less and less about the past. Without this information we have no means to learn from our successes and failures. “Paradoxically,” Kahle has written, “with the explosion of the internet we live in a digital dark age.”
Archiving the Internet is, conceptually at least, remarkably straightforward. Search engines such as Google, Yahoo, and Lycos rely on programs called “spiders” to gather the information that feeds their indexes. A spider is a program that visits remote sites and automatically downloads their contents for indexing, creating a kind of snapshot of each page of the site. When you do a web search, it is the index, rather than the great wide web itself, that is being referenced. Search engines are interested in only the latest information; as new information becomes available, old entries are overwritten. The Archive collects these old snapshots. Kahle’s genius was to recognize the value of this information and then to build a container large enough to hold it.
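The difference between a search index and an archive can be sketched in a few lines of Python. This is only an illustration of the idea described above, not the Archive's actual software: the class names `SearchIndex` and `Archive`, the `crawl` function, and the `fetch` callable are all hypothetical, and pages are fetched from an in-memory dictionary rather than the live web.

```python
from collections import defaultdict
from datetime import date


class SearchIndex:
    """Hypothetical index that keeps only the latest snapshot of each URL."""

    def __init__(self):
        self._latest = {}

    def store(self, url, captured_on, content):
        # A newer crawl simply replaces whatever was stored before.
        self._latest[url] = (captured_on, content)

    def get(self, url):
        return self._latest.get(url)


class Archive:
    """Hypothetical archive that keeps every snapshot of each URL."""

    def __init__(self):
        self._snapshots = defaultdict(list)

    def store(self, url, captured_on, content):
        # Nothing is overwritten; each crawl adds one more dated snapshot.
        self._snapshots[url].append((captured_on, content))
        self._snapshots[url].sort(key=lambda snap: snap[0])

    def history(self, url):
        return list(self._snapshots.get(url, []))


def crawl(fetch, urls, captured_on, stores):
    """Visit each URL once and hand the snapshot to every store.

    `fetch` is any callable that returns the page content for a URL.
    """
    for url in urls:
        content = fetch(url)
        for store in stores:
            store.store(url, captured_on, content)


# Usage: the same two crawls leave one entry in the index, two in the archive.
pages = {"example.org": "first version"}
index, archive = SearchIndex(), Archive()
crawl(pages.get, ["example.org"], date(1996, 10, 1), [index, archive])
pages["example.org"] = "second version"
crawl(pages.get, ["example.org"], date(1996, 12, 1), [index, archive])
```

After both crawls, the index knows only the December page, while the archive can still return the October one.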
Few other people could have tackled this project and even fewer would have chosen to. After graduating from MIT in 1982, Kahle founded several companies that did groundbreaking work in information storage and retrieval. It was the sale of a company called WAIS (Wide Area Information Server) Inc. in 1995 that provided the initial capital to start the Archive. Since then Kahle has brought in notable partners such as the Library of Congress, the National Science Foundation, and the Smithsonian. While these partnerships provide some cover in the legal battles that are beginning to threaten both the project and Kahle personally, the ten-million-dollar annual budget continues to come primarily out of Kahle’s pocket. The payoff, Kahle insists, is personal. “My definition of a life well lived,” he told me, “is to be of service to others.” Kahle loses money on every megabyte. While the business model may seem distinctly dot-com, the goal hearkens back to the Renaissance. Kahle’s aim, he says, is “universal access to human knowledge.”
Human knowledge seems a strange way to describe the information in the Archive. On 9 September, the President made the following remarks about terrorism: “We know we can’t make the world risk-free but we can reduce the risks we face, and we have to take the fight to the terrorists ... rallying a world coalition with zero tolerance [and] by improving security in our airports and our airplanes.” The year was 1996 and the president was Bill Clinton, as quoted on www.whitehouse.gov. Or go back to October of 2000 and read Enron.com on respect: “We treat others as we would like to be treated ourselves. Ruthlessness, callousness and arrogance do not belong here.” Or, perhaps, a statement of shareholder value from WorldCom from the same period: “WorldCom has a strong track record of creating shareholder value ... The opportunities for future growth are superb.” If your tastes run towards the macabre, you can travel back to 1997 and read the statement on suicide by the members of Heaven’s Gate before they hitched a last ride on Hale-Bopp: “The true meaning of ‘suicide’ is to turn against the Next Level when it is being offered.”
“Isn’t it the coolest thing around?” says Christopher A. Lee, chairman of the Electronic Records Section of the Society of American Archivists. Lee points out that, in addition to making governments and corporations accountable for past statements, the Archive is a valuable record of a society in transition. Millions of individuals who have posted personal homepages to the web have created an unprecedented resource for historians of the future. “A lot of social historians would say a website that says, ‘Here’s a picture of me, here’s a little bit about my cat’ tells us so many important things about how people were using the Internet at a particular point in time.”[3] The Internet Archive has allowed researchers to analyze the web in unprecedented ways.[4] The Archive is not just a collection of pages and sites; it also captures the links between them. When you go to an archived site from 1996 and click on a link, the link brings up another site, also from 1996. The user can move laterally across the web, from archived site to archived site, experiencing a unique 360-degree perspective of a moment in time. With a few clicks of the mouse, the historian can achieve an effect that would have taken thousands of hours of research before the Archive.
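One way to picture how a link on a 1996 page can lead to another page from 1996 is to resolve each link against the capture dates held for its target and pick the one closest to the date being viewed. The sketch below assumes that nearest-date rule for illustration; the article does not say how the Archive actually resolves links, and the function names and sample dates are hypothetical.

```python
from bisect import bisect_left
from datetime import date


def nearest_snapshot(capture_dates, wanted):
    """Return the capture date closest to `wanted`.

    `capture_dates` must be sorted in ascending order. Returns None
    when there are no captures at all.
    """
    if not capture_dates:
        return None
    pos = bisect_left(capture_dates, wanted)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = capture_dates[max(pos - 1, 0):pos + 1]
    return min(candidates, key=lambda d: abs(d - wanted))


def follow_link(captures_by_url, target_url, viewing_date):
    """Resolve a link on an archived page to a capture of its target.

    `captures_by_url` maps each URL to its sorted list of capture dates.
    """
    return nearest_snapshot(captures_by_url.get(target_url, []), viewing_date)


# Usage: a reader viewing a page captured in December 1996 clicks a link.
captures_by_url = {
    "b.example": [date(1996, 11, 3), date(1999, 5, 1), date(2002, 1, 15)],
}
resolved = follow_link(captures_by_url, "b.example", date(1996, 12, 20))
```

Here the link resolves to the November 1996 capture rather than the later ones, so the reader stays within the same moment in time.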
Kahle believes that the Archive is particularly important in the wake of the dot-com bust. “Archiving technology transitions is very important because, initially at least, technology is often disappointing,” he says. “But by looking at the disappointments, you get an idea of what people wanted from the technology in the first place. What were their dreams for the technology? From this perspective, the failures can be just as interesting as the successes.”
But not everyone wants us to remember. “Brewster is taking an extraordinary personal risk, because this is potentially a criminal offense,” says Lawrence Lessig, a Stanford University professor of law who recently argued the groundbreaking intellectual property case, Eldred v. Ashcroft, before the US Supreme Court. Kahle has found an unlikely enemy in the institutions whose business it is to inform the public. The New York Times, Wall Street Journal, and other major news and entertainment organizations have had their sites removed from the Archive. This is because they sell what Kahle wants to give away for free. The New York Times charges $2.95 for access to each article in its online archives, articles that were initially free on the site. This business model assumes that only the latest version of the site is available to the public. And, although the articles themselves may be preserved in the Times’ private archive, the context in which each article appeared is lost. It is impossible to see, for instance, what the New York Times website looked like on 11 September 2001.
Another assault on the Archive comes from the Church of Scientology. Unlike the media organizations, the Church is not seeking to exclude its own sites from the Archive, but rather the sites of its critics. The Church insisted that the Archive remove pages that quoted from Church materials, claiming that such pages violate its copyright. Because of the way the Archive stores information, it is impossible to go in and remove single pages or sections of text. Fearing one of the Church’s infamous legal assaults, the Archive has been forced to remove entire sites against the will of their owners because of a single questionable paragraph. In an email to the Archive, the owner of one site says he was “puzzled and disturbed” to try to access his site on the Archive only to get a message that said that the site had been removed at the “owners’ request.”
The government, too, has removed historically significant information from the public domain. Following 11 September, there was a massive effort to purge information that might be in any way construed to be of value to terrorists. The entire Nuclear Regulatory Commission domain was removed from the Archive, including safety reports on American nuclear power plants. Sites that contained information on water supplies and chemical formulas were also removed from the Archive. In an effort to deprive terrorists of information, the government has deprived current and future citizens of an understanding of how the government communicated on important issues in difficult times. Perhaps the hole in history is an indication of its own.
The most democratic publishing vehicle in history, the Internet, is also the most fragile. Not only do websites depend on the collaboration of intricate and rapidly aging systems but they also depend on the whim of the author, the sensitivities of the subject, and the censor’s nervous cursor. The same qualities that have enabled the web to host a billion voices also threaten their survival: websites are easy to put up, take down, and change. The question is whether there will be any public record of these changes, or whether we, like the ever-changing web itself, must be trapped in the present and accept as true and inevitable whatever appears upon our flickering screens for lack of any comparison. In his novel 1984, George Orwell predicted this dilemma with disturbing accuracy: “Within twenty years at the most,” he reflected, “the huge and simple question, ‘Was life better before the Revolution than it is now?’ would have ceased once and for all to be answerable.”
[1] Peter Lyman and Hal R. Varian, “How Much Information,” 2000. Retrieved on 9 January 2003 from www.sims.berkeley.edu/how-much-info [link defunct. Eds.].
[2] L. A. Lorek, “Site Lets Surfers Explore Net’s Past,” San Antonio Express News, 16 June 2002.
[3] John Schwartz, “New Economy: A Library of Web Pages that Warms the Cockles of the Wired Heart and Beats the Library of Congress for Sheer Volume,” The New York Times, 29 October 2001.
[4] “A University of Maryland professor has been using the archive to index Hungarian texts, while researchers from Xerox’s Palo Alto research center have used it to find out whether the dominance of English on the Web is killing off less widely used languages. Thanks to the archive, we now know that there are 1.5 million Hungarian language pages on the web. Xerox researchers also found 201 other languages represented on the web and thriving in the digital universe.” Douglas F. Gray, “Archiving the Net All the Wayback to 1996,” Infoworld Daily News, 26 October 2001. Available at www.arnnet.com.au/article/41189/archiving_net_-_all_wayback_1996.
David Womack is the director of new media at the American Institute of Graphic Arts.