
Web-based Government Information Project, a Mellon Funded Initiative

Web-based government information is increasingly at risk of being lost due to its volatility, diversity, and growing volume. The California Digital Library's government information project is studying ways to preserve these valuable materials.

See sections below:

Project Summary

The California Digital Library, with support from the Andrew W. Mellon Foundation, is conducting a cost-benefit review of technologies and approaches appropriate for the capture, curation, and persistent management of web-based documents of US state and federal governments. The project's ultimate goal is to outline the requirements for stable and sustainable digital collection building for this genre of information. To achieve this, the project is engaged in several interrelated sets of activities, including:

  1. Review the scope of the domain of web-based government information, focusing on the development of meaningful data such as size, diversity of formats and rate of growth.
  2. Identify promising technologies and projects related to the capture, management, and preservation of this material; survey the managers of these technologies and projects to build an information base for the project.
  3. Capture a subset of web documents from government sites to serve as a test bed for analyzing the technical requirements for building collections of web-based government documents.

These activities will provide the basis for a Report that will both describe our findings and discuss their implications for libraries that wish to develop persistent collections of web-based government documents to meet scholars' information needs. The Report will be developed in consultation with two principal sets of stakeholders: faculty and graduate students whose research depends upon this genre of information, and librarians who select, build, and manage print as well as digital collections. Input to the Report will be sought from several sources, including: (1) appropriate University of California bodies such as the Systemwide Library and Scholarly Information Advisory Committee (specifically its Collection Management Planning Group) and the Systemwide Operations and Planning Group; (2) related projects such as CRL's Political Communications Web Archiving project and the Stanford Libraries' LOCKSS project; and (3) stakeholder groups such as directors of libraries participating in the Federal Depository Library Program as well as directors of libraries in the Digital Library Federation and the Association of Research Libraries. For more information, see the complete project description. [RTF]

Participants and Partners

Project Team
  • Daniel Greenstein, Principal Investigator. Greenstein is Associate Vice-Provost for Academic Initiatives, University Librarian for Systemwide Library Planning and Scholarly Information, and Director of the California Digital Library.
  • Patricia Cruse, Project Coordinator. Cruse is Manager of Government Information Initiatives, CDL.
  • Chuck Eckman, Content Specialist. Eckman is Principal Government Documents Librarian at Stanford University.
  • John Kunze. Kunze is a Senior Development Programmer.
  • John Ober, Technical Coordinator. Ober is CDL Associate University Librarian, Education and Strategic Innovation and Acting Director, CDL Technologies.

Partners and Collaborators

Reports

Environmental Scan, Preliminary Survey Results; Interim Report [PDF]

In conjunction with the goal of building an information base derived from existing practices in capturing and preserving web-based government information, project staff initiated two sets of interviews. The first set is being conducted with leaders in the digital preservation and government information fields to help shape the nature of our inquiry. The second set includes individuals who are directly associated with projects addressing some aspect of the problem.

Although staff are still conducting these surveys, we believe that the information gained from early respondents points to a clear set of common themes and experiences. This interim report is intended to share these results with a broader community.

Theory and Content Neutral Preservation Sketch [PDF]

Project Phases

Phase I: [November—January 2002]

Work Package 1: A technical survey of approaches available for capturing, curating, and preserving web-based government information. The survey will include:

  • specific definitions of three primary terms: capture, curate, and preserve,
  • an evaluation of the candidate technologies for capture, curation, and preservation, and
  • a technical analysis that will include a qualitative cost-benefit analysis of options.

Work Package 2: An investigation of current projects that are involved in the capture, curation, and preservation of web-based government documents. This includes two surveys:

  • Web-based Government Information Survey A: a high level survey that asks key individuals at federal, state, international, and non-profit institutions "who is doing what?" [PDF]
  • Web-based Government Information Survey B: a project level survey that will ask key questions of those directly involved in projects [PDF]

Work Package 3: An empirical assessment of the .gov domain, which will examine:

  • the size of the .gov domain,
  • the rate of change,
  • the file types, and
  • the overlap with print materials (representative sample only)
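As an illustration only, the first three measurements could be computed from two crawl snapshots of the same sites. The sketch below is not the project's method: it assumes each snapshot is a mapping from URL to a content hash, and the function names (file_type, summarize) and the example.gov URLs are hypothetical. Overlap with print materials requires manual comparison and is not covered.

```python
from collections import Counter
from urllib.parse import urlparse
import posixpath


def file_type(url):
    """Return the lowercase file extension of a URL path, or "(none)"."""
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lower()
    return ext if ext else "(none)"


def summarize(earlier, later):
    """Compare two crawl snapshots (dicts of URL -> content hash).

    Assumption: a document counts as changed when its hash differs
    between snapshots, and as removed when its URL no longer appears.
    """
    added = later.keys() - earlier.keys()
    removed = earlier.keys() - later.keys()
    common = earlier.keys() & later.keys()
    changed = {url for url in common if earlier[url] != later[url]}
    # Share of earlier documents that were changed or removed.
    rate = (len(changed) + len(removed)) / len(earlier) if earlier else 0.0
    return {
        "size": len(later),
        "added": len(added),
        "removed": len(removed),
        "changed": len(changed),
        "change_rate": rate,
        "file_types": Counter(file_type(url) for url in later),
    }


if __name__ == "__main__":
    # Hypothetical snapshots, for illustration only.
    first = {
        "http://example.gov/a.html": "h1",
        "http://example.gov/b.pdf": "h2",
        "http://example.gov/c/": "h3",
    }
    second = {
        "http://example.gov/a.html": "h1-revised",
        "http://example.gov/b.pdf": "h2",
        "http://example.gov/d.PDF": "h4",
    }
    print(summarize(first, second))
```

Counting by URL extension is a rough proxy for file type; a real assessment would more likely rely on the content type reported by the server.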

Work Package 4: Develop a test bed of .gov materials. The project group is working with the Stanford Digital Libraries Project group to capture web-based government information and with the San Diego Supercomputer Center to store and index the following .gov sites:

  • Federal Sites:
    • U.S. Department of State and related Bureaus,
    • U.S. Department of the Interior and related Bureaus,
    • U.S. Senate, and
    • U.S. Environmental Protection Agency.
  • California State Sites:
    • California Energy Commission,
    • California State Water Resources Control Board, and
    • California Legislative Analyst's Office.

Phase II: [February—June 2002]

This phase will begin by convening a group of users of government information, which will include government information librarians, faculty, and graduate students. The meeting will be informed by the work completed in the previous phase. The information gathered at the meeting, along with findings from Phase I, will be reported in a document that lays out the main findings and discusses their implications for research libraries and other libraries that wish to develop persistent collections of web-based government documents. The report will help to further define the scope and nature of the problem that research libraries need to address.

Phase III: [July—September 2003]

The primary objective of the last phase of the project is to report our findings and seek input from a broader community.