Five things I learned at IIPC

May 16, 2013 Author: Rosalie Lack

I recently attended the International Internet Preservation Consortium (IIPC) General Assembly (http://netpreserve.org/general-assembly/2013/program). The IIPC is a consortium of libraries, academic institutions and other organizations engaged in web archiving. The IIPC’s General Assembly included three days of member meetings and two days of meetings open to the public. The theme of the public conference day was: Scholarly Access to Web Archives: Progress, Requirements, and Challenges.

Ahmed Alsum from the Web Science and Digital Libraries Research Group at Old Dominion University posted a comprehensive summary of the GA. As you can see from his summary, there were many great presentations and discussions. It was very hard to choose just five things to share, but here they are:

1. Dark(ish) Archives

Because of copyright and privacy issues, many of the national libraries in Europe cannot provide online, public access to their web archives. They can only allow access in the library and many do not even allow printing in the library. So, how do you raise the awareness of web archiving when no one can see the archives?! There was much discussion about creating site lists/registries for the sites in these archives – some felt this would only lead to disappointment when the user finds out that they have to travel to the archive to see the materials. Sound familiar? Yes, finding aids. And YES, they are extremely useful!

Harvesting tools.
Source: WikiMedia Commons http://commons.wikimedia.org/wiki/File:TequilaToolsMuseum.JPG

2. Common tools support is critical

Most IIPC members are using the same suite of open source tools: Heritrix and Open Source Wayback. There was a lot of concern about the development path for these tools. At member breakout sessions where future paths were discussed and prioritized, there was a clear message that tools are important. The IIPC steering committee quickly responded by supporting tool management as a top priority for the organization. Be on the lookout for updates and concrete plans shortly.

3. National and University Library of Slovenia is innovative

Besides being wonderful conference hosts, the National and University Library of Slovenia is doing some innovative work when it comes to web archiving. They demo’d a prototype (see screenshots: http://netpreserve.org/sites/default/files/resources/Predstavitev_07.pdf) of a tool for end users to engage and interact with web archives. It includes features such as gathering and saving sites, annotating sites, tagging sites, and crowdsourcing metadata. The next generation of web archives is here!

4. Researchers’ use of web archives

There were several informative presentations by researchers about how they are using web archives. Sophie Geibeil, an historian from Aix-Marseille-Université, uses the archives to study the untold story of North African immigration. Megan Dougherty (http://www.luc.edu/soc/academics_facultystaff_doughteryM.shtml), a social scientist from Loyola University, was less interested in site content than in taking an anthropological view of the sites; that is, studying their social aspects: how people share sites, interact with sites, etc. Niels Brügger, from Netlab at Aarhus University, discussed Netlab’s various research projects in the areas of digital humanities and internet studies, including RESAW and FUTARC (http://www.netlab.dk/projects/p6-fundamental-tools-for-web-archive-research-futarc/). Helen Hockx-Yu, from the UK Web Archive at the British Library, presented UK Web Archives in the eyes of scholars. She made the case for thinking of the archives not as documents but rather as large datasets for data mining and analysis.

5. Dancing into the future?

David Rosenthal provided historical context for the early days of web crawling and also laid out some future challenges. Not surprisingly, the web as mainly HTML links is rapidly becoming a thing of the past. Turns out all the current, problematic areas (including rich media, database-driven features, dynamically generated URIs, etc.) remain challenging, and now add to that the fact that the new web is more and more JavaScript. Is there no rest for the weary? David did leave us with a light at the end of the tunnel. He talked about recent work by the Institut National de L’Audiovisuel (INA) in Paris. The team there created a live archive proxy that shows great promise for capturing some of the more problematic content. Also, there is Memento, which aggregates web archives, each collected in slightly different ways by different institutions, and so moves toward covering all the bases.

Fred Astaire.
Source: WikiMedia Commons http://commons.wikimedia.org/wiki/File:AdeleFred1921.jpg

David also had a great analogy for one of the challenges of preserving the web today; he says it is like “preserving theatre or dance” because as we view the web it changes to become an individual experience based on who we are, displaying customized ads and other personalized content. As he put it: “Every performance is a unique and unrepeatable interaction between the performers, in this case a vast collection of dynamically changing databases, and the audience. Actually, it is even worse. Preserving the Web is like preserving a dance performed billions of times, each time for an audience of one, who is also the director of their individual performance.” (Source: http://blog.dshr.org/2013/04/talk-on-harvesting-future-web-at.html)

Overall, it was an excellent, thought-provoking conference. Clearly there are lots of challenges ahead for web archiving, but so many more opportunities.