
WAS Service Update: April – May 2012 http://was.cdlib.org

WAS Service Description

The Web Archiving Service (WAS) enables librarians, archivists and researchers to capture, curate and preserve websites and web‐published materials.   WAS makes it easy to build web archives, with scheduling and other tools to help manage your archive.  You control public access to your archives and can configure the appearance and navigation of each archive.  We also provide collection development consultation and help desk support for web archiving questions.

WAS Service Manager

Tracy Seneca tracy.seneca@ucop.edu or washelp@ucop.edu

Additional WAS Information (training materials, videos, guides, etc.) More information about WAS is available at <was.cdlib.org> or by sending an inquiry to <washelp@ucop.edu>

WAS Harvest and Collection Activity, April – May 2012

• 102 archives actively collected
• 2,313 sites collected
• 2.9 TB of data collected

Recent Enhancements, News, and Activities

International Internet Preservation Consortium General Assembly The IIPC General Assembly took place in early May in Washington DC.  CDL participated as a long-standing member and is actively engaged in the IIPC’s Access Working Group.  The consortium is composed of a growing range of national libraries, academic institutions and other organizations engaged in web archiving.  The IIPC actively promotes the development of open source tools, standards and best practices in web archiving.

The Broad Value of Web Archives: General Use  The opening day of the IIPC assembly was devoted to an open session of talks on researcher use of web-archived data.  These talks presented direct researcher use cases for data derived from web archives.  A WAS News posting on researcher Kalev Leetaru’s work on the mutability of White House press releases on the web is available at http://cdlib.org/2012/05/31/11716, and further findings from these presentations will be shared in the upcoming weeks.  The slides for these sessions are available at http://netpreserve.org/events/2012ga.php, and video interviews with the researchers who presented will be made available in the future.

2012 General Assembly Outcomes

Harvesting Working Group Tracy Seneca, CDL’s Web Archiving Service Manager, will co-chair the IIPC Harvesting Working Group.  This working group has been at the heart of the IIPC’s activity since its inception in 2003.  The HWG guides the development priorities and process for Heritrix, the harvester behind the Web Archiving Service, the Internet Archive, the Archive-It service, the NetArchive Suite, the Web Curator Tool, and the majority of locally built web harvesting systems.  Heritrix currently stands at a crossroads in its development, with a need to foster more truly collaborative community development of future enhancements to the crawler.  This will be a noteworthy responsibility, and it highlights the significant role that CDL plays in the IIPC.
Tracy will be stepping down from her prior role in the Access Working Group, and the IIPC Steering Committee has indicated that a curator from a UC campus would be welcome to take up involvement in that working group as part of CDL’s membership in the organization.  We will be communicating with UC WAS administrators and curators about prospects for participation.  We would love for UC web archivists to have a chance to be involved in broader initiatives!

Twittervane One of the initiatives of the Access Working Group via the British Library has been the Twittervane project, which aims to augment collection development for event archives by drawing site nominations based on trending Twitter topics.  This is a similar approach to that used by CDL to identify key Gulf Oil Spill websites when we pulled in RSS feeds from sites tagged in Delicious.  Selecting sites for event archives can be notoriously time-consuming, so any tools for extending and managing the site nomination process are welcome!  At this time, Twittervane has only been used in the context of the British Library’s selection process, but it will be made available to the community as the project wraps up.  We will keep you posted as Twittervane becomes more widely accessible.  Slides with further background are at: (http://netpreserve.org/events/dc_ga/02_Tuesday_IIPC/Hockx-Yu.pdf)

2012 Olympics Nominations Nominations are open for the 2012 Summer Olympics Project.  The goal of this project is to lay the groundwork for interoperability between diverse web archives.  That foundation will be a series of archives on the Olympics using a shared descriptive vocabulary.  Because each participating nation may be bound by varying rules concerning copyright and legal deposit, the collection will be a fully international effort.  Nearly 1,000 websites have been nominated at this point, and CDL is one of the organizations participating in site harvesting.  To review the nominations or contribute your own, visit the Olympics 2012 nomination tool:  http://digital2.library.unt.edu/nomination/olympics2012/  You need only provide your name and email address to contribute.  You can supply a site name, URL and description, and select a relevant sport, nation, language or subject.

Columbia University Meeting On May 10th and 11th, the Columbia University Library hosted “Web Archiving Policies and Practices in the U.S.”  This was a unique opportunity for U.S. organizations, largely academic institutions, to share their practices, obstacles and visions.  Participants included curators using a range of tools, among them WAS partners from UC Irvine, the University of Michigan and NYU.  The session’s goals, participants and program are available at https://webarch.cul.columbia.edu/.  While no formal plans were set for this group to meet again, it was agreed that U.S. academic institutions have a rich range of unique issues to share and collaborate upon outside the structure of the IIPC.  One immediate outcome for Web Archiving Service users was the proposal of a WAS Users’ Group meeting in conjunction with the upcoming Society of American Archivists meeting in August in San Diego.  Arrangements for this meeting are in the works, and UC WAS users who are not formally signed up for SAA will be welcome to attend this adjunct meeting.  Details will be provided in the next monthly newsletter, in the WAS News blog, and via email to the all WAS Users list.

Pilot Testing of WAS Metadata Export One forthcoming feature in WAS is an export, in XML format, of all curatorial metadata for each site in any given archive.  The purpose of this report is to provide a way to integrate access to archived websites with other relevant discovery systems, such as topical digital archives and library catalogs.  The record for each archived site points to a “site details” screen.  Here is a sample screen for a site in the UCSF California Tobacco Control Archive: http://webarchives.cdlib.org/site/sw1qn6068n

We are currently working with pilot testers from UC Davis, UC San Francisco and the University of Michigan.  These organizations have varied use cases for these records, so early pilot tests will provide us with a strong foundation for a wide range of needs.  We’re also grateful to Peter Filardo at NYU and Yvonne Wilson at UC Irvine for early input that helped us establish the requirements for this feature.

Terry Reese of the University of Oregon, the developer behind the freely available MarcEdit tool, has created a custom crosswalk to transform our WAS report into MARC format records.  When the feature is released, we will provide access to that crosswalk file as well as documentation for installing it and producing records from the WAS report.

Service Monitoring and Availability Check the WAS system status page: http://www.cdlib.org/contact/system.html