End of Bush's Term: Will It Disappear from the Web?

By Hunter Stern, CDL Technical Writer

Will the Homeland Security and No Child Left Behind websites disappear on January 20, 2009? The answer might surprise you. That date marks the beginning of a new presidential administration and the end of the current one, putting much of the online material related to the current administration's policies and initiatives at risk. According to the Washington Post, “Many federal agency records exist only in digital form and are in danger of disappearing when the administration changes” (August 20, 2008).

The University of California community, not to mention scholars the world over, requires perpetual access to these online materials in the normal conduct of research, teaching, and learning. Even without a change in administration, government records stored in digital form are notoriously volatile. Web pages on government sites have an average life span of only 44 days.

To ensure that the historical record of the current administration is not lost, a partnership of government and nonprofit agencies has taken responsibility for its preservation. The University of California's California Digital Library (CDL), in partnership with the Library of Congress, the Government Printing Office (GPO), the Internet Archive (IA), and the University of North Texas Libraries (UNTL), is planning the harvest and archival storage of more than 100 million US government web pages from the second George W. Bush administration. This effort will involve a comprehensive harvest of the .gov domain as well as focused Web harvests of specific government agencies. The goal is to conduct a broad capture of all Federal government Web sites and a deep capture of specific high-priority sites chosen by the project’s curators. Each partner plays a critical role in the project.

The California Digital Library, a recipient of a Library of Congress National Digital Information Infrastructure and Preservation Program (NDIIPP) grant, leads the Web-at-Risk project, a goal of which is “to develop tools that enable librarians and archivists to capture, curate, preserve, and provide access to web-based government and political information.”  These tools will be put to use doing deep crawls of specific government agencies ranked as priority sites by the project’s curators.  In addition to CDL, UNTL will be responsible for conducting deep crawls.

The broad crawl will be the responsibility of the Internet Archive, a nonprofit group providing universal and permanent access to digital information for educators, researchers, and the general public. IA will use its advanced Web-crawling software, called Heritrix, to capture the intended sites.

To prioritize the vast list of URLs included in the scope of the crawl, the University of North Texas has designed a software tool that allows curators to nominate URLs for harvest and tag them with numeric rankings.
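The nominate-and-rank idea can be pictured with a minimal Python sketch. This is an illustration only, not the UNT tool itself: the class and method names, the averaging of rankings, the assumption that a higher number means higher priority, and the placeholder URLs are all hypothetical. The only detail taken from the project description is that the crawl is scoped to the .gov domain.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class Nomination:
    """One URL proposed for harvest, with the rankings curators gave it."""
    url: str
    rankings: list[int] = field(default_factory=list)

    @property
    def score(self) -> float:
        # Assumption: rankings are averaged; a URL with no rankings scores 0.
        return sum(self.rankings) / len(self.rankings) if self.rankings else 0.0


class NominationList:
    """Hypothetical store of curator nominations, keyed by URL."""

    def __init__(self) -> None:
        self._by_url: dict[str, Nomination] = {}

    def nominate(self, url: str, ranking: int) -> None:
        # Only accept URLs inside the .gov domain, the stated crawl scope.
        host = urlsplit(url).hostname or ""
        if not host.endswith(".gov"):
            raise ValueError(f"outside crawl scope: {url}")
        self._by_url.setdefault(url, Nomination(url)).rankings.append(ranking)

    def prioritized(self) -> list[str]:
        # Assumption: higher average ranking is harvested first;
        # ties are broken alphabetically by URL.
        ordered = sorted(self._by_url.values(), key=lambda n: (-n.score, n.url))
        return [n.url for n in ordered]


# Placeholder URLs, not real nominations from the project.
nominations = NominationList()
nominations.nominate("https://agency-one.gov/", 2)
nominations.nominate("https://agency-two.gov/reports", 5)
nominations.nominate("https://agency-one.gov/", 4)
print(nominations.prioritized())
# ['https://agency-two.gov/reports', 'https://agency-one.gov/']
```

Averaging lets several curators weigh in on the same URL without one ranking overwriting another; a real tool might combine rankings differently.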

The Library of Congress, which has preserved congressional Web sites since December 2003, will focus on developing the overall harvesting plan.  The GPO and the libraries in its Federal Depository Library Program will assist in the curation process.

For more information on the End of Term project, contact Patricia Cruse (patricia.cruse@ucop.edu), Director, Digital Preservation Program.