Digital Preservation News

October 17, 2007 Author: Patricia Cruse

By Trisha Cruse, CDL Director of Digital Preservation

The CDL Digital Preservation Group has been busy with a variety of exciting activities, reported below.

Release 4 of the Web Archiving Service
On September 18th the Web Archiving Group released a new version of the Web Archiving Service – special thanks to Tracy Seneca, Scott Fisher, Margaret Low, Erik Hetzner, Mark Reyes, and Mike Wooldridge for getting this release out the door. So far the group has received very positive feedback from users on the service’s functionality and the user interface. We are also extremely pleased with the performance; we are up to 500 captures with relatively few hiccups.

We have also put together an overview of the service that is available on YouTube <http://tinyurl.com/2tdrwq>. This brief overview explains why the content targeted for this project is at risk, how we plan to address that risk in the Web Archiving Service, and which collections our curators are working on. Warning: the YouTube video quality is a bit sketchy, so we have also made this presentation available in a high-quality video format; contact tracy.seneca at ucop dot edu for further information.

A kinder and gentler ARK page
Thanks to Kirsten Neilsen and John Kunze, there is now a kinder, gentler introduction to ARK identifiers on Inside CDL <http://www.cdlib.org/inside/diglib/ark/>. Don’t know what an ARK is? Then definitely take a look. Our hope is that this will help others recognize and appreciate the true beauty and splendor of ARKs. The new page has already been re-purposed in a German “technology watch” newsletter <http://www.kim-forum.org/techwatch/kim-dini-technology-watch-report1_2007.pdf>, the very first edition of a bi-annual publication from the Interoperable Metadata Center for Excellence and the German Networked Information Initiative.

Tidal wave of web data knocking on our door
For the past several years the Digital Preservation group has been working with Andreas Paepcke and Hector Garcia-Molina at Stanford University on web crawling activities. Their research group has a wealth of experience collecting web data, and while CDL’s Digital Preservation group was getting its “web crawling sea legs” we asked Stanford’s group to collect data on our behalf. Over the years Stanford has collected over 100 TB of data covering .gov sites, elections, Hurricane Katrina, the Virginia Tech tragedy, and more. However, they have been using a different crawler than Heritrix, the crawler used by the Web Archiving Service (WAS). As a consequence, their crawler output is incompatible with most web archiving services, including ours. The good news is that they have recently created a tool that converts their crawler’s output into a format that CDL’s service can understand. Erik Hetzner, Mike Wooldridge, and Scott Fisher are just beginning to experiment with it, and we are hoping for a positive outcome.

Contributing to the community by documenting Heritrix
As mentioned above, our Web Archiving Service uses Heritrix, the Internet Archive’s (IA) open-source, extensible, web-scale, archival-quality web crawler project. “Heritrix” (often misspelled heretrix, heratrix, heritix, etc.) is an archaic word for “heiress”, which the IA chose because the project seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations. One of the challenges of using Heritrix is that there is a dearth of documentation. Over the next several months Hunter Stern, CDL’s technical writer, will be working with Heritrix programmers at CDL and IA to better document the crawler. This collaboration will help us tremendously and benefit the crawler community as well.

Moving big data: Mass Transit Project
Over the past couple of years the Digital Preservation Group has been working with the campuses to move large chunks of content into the Digital Preservation Repository (DPR). In the process we have encountered a few speed bumps. The issues are two-fold but related: the files are large and the network transfer rates have been unaccountably slow. Though we have made progress toward resolving this, we still need to identify the best transfer tools and to monitor our networks to make sure there are no log jams and that they can be used at their full potential bandwidth. The goal is to make the best use of our Internet2 pathways between the campuses and the data centers for the benefit of all CDL projects.

The Digital Preservation group has embarked on two efforts to speed up movement of large files into the DPR. First, we are collaborating with the San Diego Supercomputer Center (SDSC) to understand how to transfer data across the network more quickly and efficiently. Second, we are implementing (on a trial basis) a method of pulling large numbers of external data objects into a kind of preservation holding tank in order to reduce the impact of network speed and latency on the overall DPR ingest process. We are very excited about the collaboration with SDSC, and Kirsten Neilsen will be leading the project for CDL. We’re calling the project “Mass Transit” and there is a project wiki <http://masstransit.sdsc.edu/>.

If you would like additional information on any of these projects, please contact Trisha Cruse (patricia.cruse@ucop.edu).