Preserving the cultural heritage of our nation is a task of incalculable importance. Artifacts of cultural heritage provide the raw material for researchers and historians and are the narrative upon which our national identity is built.
The Web has increasingly become the public forum for publications by non-profit agencies and research groups as well as a locus of grass-roots reaction to historical events. Due to the volatility, diversity, and growing volume of the Web, the resulting narrative has become increasingly difficult to preserve. The goal of the CDL Web Archiving Program is to create tools to help librarians capture and preserve these materials, to participate in collaborative efforts to preserve web publications on a large scale, and to develop policies and services to assist librarians and archivists in this new realm of collection development.
The two primary services of CDL's Web Archiving Program are the Web Archiving Service (WAS) used by curators to capture and preserve web content, and the Web Archives created with WAS and hosted by CDL.
The Web Archiving Service (WAS) is a web-based application providing librarians and scholars with the means to preserve web content. WAS is built using both open source and locally developed technology and its design was heavily influenced by user feedback. WAS allows curators to easily preserve at-risk materials and to build publicly accessible archives. The resulting archives not only provide a snapshot of each website in time, but also allow researchers to explore those resources in ways they could not do on the live web. See the CDL Web Archiving Service site for further information.
CDL provides access to a wide range of archives created by many different institutions using the Web Archiving Service. These are available at http://webarchives.cdlib.org, and include archives of the State of California Government agencies, local government agencies from Orange County, San Diego, Los Angeles and more. Also included are archives of Middle Eastern political organizations, American left-wing organizations, and web content from events such as the 2007 Southern California Wildfires and the 2003 California Recall Elections. The archives represent the culmination of the Web-at-Risk grant, funded by the National Digital Information Infrastructure and Preservation Program, and led by the California Digital Library.
CDL's Web Archiving Program also contributes to collaborative web archiving projects, such as the End-of-Term Harvest of the U.S. Federal Government web sites during the 2008-2009 change of administrations, and the K-12 Web archiving projects. Details about these and other collaborative archiving projects are available.
The CDL web archiving program staff work consistently to raise awareness of, and advocate for, the importance of preserving web content at UC campuses, in professional organizations, and in library publications. In addition, the CDL works closely with a number of national and international organizations devoted to web archiving efforts and contributes to the development of standards for this emerging field.
The International Internet Preservation Consortium
The International Internet Preservation Consortium (IIPC) is a group of institutions that fund and participate in projects and working groups to develop tools and standards for the emerging field of web archiving. The IIPC’s specific goals are:
- To enable the collection, preservation and long-term access of a rich body of Internet content from around the world.
- To foster the development and use of common tools, techniques and standards for the creation of international archives.
- To be a strong international advocate for initiatives and legislation that encourage the collection, preservation and access to Internet content.
- To encourage and support libraries, archives, museums and cultural heritage institutions everywhere to address Internet content collecting and preservation.
The California Digital Library (CDL) has been a member of IIPC since 2007. As an IIPC member, the CDL has contributed to the development of the WARC file format, the emerging standard for captured web content.
The National Digital Information Infrastructure and Preservation Program
The California Digital Library has been part of the National Digital Information Infrastructure and Preservation Program (NDIIPP) collaborative partnership since 2005, when it was awarded a grant for the Web-at-Risk project. Several NDIIPP partners are directly involved in developing web archiving solutions for libraries, ensuring both that a variety of strategies can be explored and that collaborative solutions can be developed where possible. The CDL co-authored the “BagIt” specification used by the Library of Congress to transfer large quantities of data produced by all of the NDIIPP projects.
The Internet Archive
The CDL’s web archiving team works in close communication with the Internet Archive and has contributed technical documentation to the Internet Archive’s open source tools, which are widely used in the web archiving community.
WARC
The CDL staff contributed to the development of the Web ARChive (WARC) file format. WARC is a more advanced version of the ARC file format. ARC files are "archives" of other files collected during a web crawl.
BagIt
The CDL staff co-authored the BagIt specification. BagIt is a hierarchical file package format suitable for the exchange of generalized archival content via the network or hard-disk. The "bag" has just enough structure to safely enclose its payload but does not require deep knowledge about its internal semantics.
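To make the "just enough structure" idea concrete, here is a minimal sketch in Python of what a bag can look like on disk. It is an illustration only, not CDL's or the Library of Congress's tooling. It assumes the layout described in the published BagIt specification (a `data/` payload directory, a `bagit.txt` declaration, and a checksum manifest), and it assumes the version line and the SHA-256 manifest used in later versions of the specification; earlier drafts used other version numbers. The function names `make_bag` and `verify_bag` are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Mapping


def make_bag(bag_dir: Path, payload: Mapping[str, bytes]) -> None:
    """Write a minimal bag: data/ payload, bagit.txt, and a SHA-256 manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name, content in sorted(payload.items()):
        # Payload files are stored as opaque bytes; the bag never inspects them.
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")

    # The bag declaration: version and tag-file encoding (values assumed here).
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True
```

A receiver can run `verify_bag` after a transfer to confirm that every payload file arrived intact, without needing to understand what the files contain.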
The Web Archiving Program grew from a 2003 Mellon-funded study conducted by the CDL to evaluate the impact of the web as a medium of publication for government information. The final report for that study, “Web-Based Government Information: Evaluating Solutions for Capture, Curation, and Preservation,” served as the basis for CDL’s 2005 “Web-at-Risk” grant proposal. Research and assessment have continued to be a strong focus of the Web Archiving Program, whether evaluating promising technologies or drawing on user-centered design practices for the tools we build.
Web Archiving Service
Learn more about the tools the CDL has created to capture, curate and preserve web content.
Web-at-Risk Grant
Learn more about the 4.5-year grant effort to develop tools, policies and standards for web archiving.
Digital Preservation Program
The Web Archiving Program is part of the CDL’s more comprehensive Digital Preservation Program, and draws from the practices and technologies developed by the Digital Preservation Group.
Web-Based Government Information: Evaluating Solutions for Capture, Curation, and Preservation