Inside CDL

CDLINFO Newsletter, December 14, 2006, Vol. 9, No. 20

CONTENTS

  1. Mass Digitization: Open Content Alliance and the UC Libraries
  2. UC Electronic Resources Management System (ERMS) Project Update
  3. XTF Update
  4. New Citation Report provides H Index
  5. Retrospective Bibliographic Searching in the Life Sciences

1. Mass Digitization: Open Content Alliance and the UC Libraries

On December 6, 2006, Microsoft released the beta of Live Search Books (http://books.live.com), providing a new portal for access to UC libraries’ books scanned by the Internet Archive (IA) for the Open Content Alliance. An initial review of Microsoft’s service was provided by CNET (http://news.com.com/Microsoft+releasing+book+search+in+beta/2100-1038_3-6141162.html).

Microsoft’s Live Search Books provides a window into scanned books that is as serendipitously fruitful as article indexes are for searching the content of scholarly articles. It searches every page of the scanned books and returns a link to the page that contains your search phrase.

At this site (http://search.live.com/results.aspx?q=&scope=books), search on “Adolph Sutro” (mayor of San Francisco from 1895–1897, mining engineer, philanthropist) and you find not only important information about his role in the beginnings of San Francisco, his Nevada mining enterprises, and his interactions with President Benjamin Harrison, but also two poems lauding him, one by Carrie Walter and another by Joaquin Miller.

Search on the “Golden Gate Bridge” to see that landmark remembered in oral histories, documented by its creators, and praised in poem and song.

The opportunities for uncovering unknown connections are endless, and will only grow as Microsoft continues to digitize more historic titles.

Update on the Mass Digitization Projects

The Open Content Alliance (OCA) (http://www.opencontentalliance.org/) is one of two mass digitization projects now underway within the UC libraries. (The other is Google, about which more will be forthcoming in future articles as its workflow and scope unfold.) With the approval of the University Librarians, the UC libraries became one of the earliest contributing members of the OCA. OCA is a coordinating body whose purpose is to build open access electronic collections and make them available through the Internet Archive (IA) (http://www.archive.org/index.php). UC library books scanned with Microsoft funding for the Open Content Alliance are now available through both the Internet Archive interface and Microsoft Live Search Books (beta).

As OCA contributors, the UC libraries are providing out-of-copyright, public domain materials and content for which the UC Regents hold copyright. The University of Toronto libraries are among the many libraries contributing content to the OCA (for a complete contributor list see http://www.opencontentalliance.org/contributors.html). UC hosts two Internet Archive (IA) scanning facilities, one at NRLF (which came online in April 2006) and one at SRLF (which came online in August 2006). A third IA-operated scanning site resides at the University of Toronto.

Under the direction of Brewster Kahle, the IA is the organization that provides the technology and staff for the scanning service. IA servers in San Francisco host the resulting files. In the case of scanned images of UC materials, the digital files will also become part of the UC libraries Digital Preservation Repository (DPR). Files will include JPEG 2000, PDF, fully searchable OCR, and meta.xml.

CDL is investigating the implications of integrating the content generated through the OCA and Google projects into our UC library access systems and will be consulting with UC library advisory groups as the issues are better defined. Content scanned by Google will be available through WorldCat, and discussions are underway to provide OCA-scanned materials through OCLC as well.

The two OCA funding sources (Yahoo! and Microsoft) requested that IA initially scan thousands of books that can broadly be defined as reflecting Americana. CDL has created lists of titles (known as picklists) by searching the Melvyl Catalog with a combination of date limits, subject headings, and broad classification codes. These lists are drawing from UC’s systemwide library book collections managed at NRLF and SRLF, from the UC Berkeley and UCLA main libraries, and from the Bancroft Library’s and UCLA YRL’s Special Collections.

With advice from SOPAG members and AULs from across the system, several UC librarians were identified to help define those searches to retrieve the widest range of materials. This subject approach depends upon cataloging consistency and completeness through decades of librarianship on different campuses. Librarians will recognize that this makes any such search far from perfect! But it has identified thousands of books (including oral histories from the Bancroft’s Regional Oral History Office) so far, many of which have been digitized and can be viewed at: http://www.archive.org/search.php?query=collection%3A%28cdl%29

Books are non-invasively scanned. A small test set of 800 Berkeley mathematics books was digitized initially to affirm that the process does no harm to the original volumes. IA designed and manufactured special scanning stations, called Scribes. These hold the book face up, open at a 90-degree angle. Carefully trained operators manually turn pages, check that metadata is correct, and replace the books on carts for return to their shelving locations.

Staff at the two RLFs have been actively and creatively involved in this project so far, devising workflows, troubleshooting, and ensuring that all scanned books are returned to their rightful homes. Many UC librarians have offered excellent advice as CDL staff have wrestled with devising lists of books that meet the criteria for scanning. This systemwide teamwork enables UC to take advantage of this timely opportunity to add new levels of access to our priceless collections.

Mass Digitization Collection Advisory Committee

Recognizing the need to formalize the content selection process as we continue to move forward on both the OCA and Google mass digitization projects, CDL obtained approval from the ULs for the formation of a Mass Digitization Collection Advisory Committee (MDCAC).

MDCAC’s charge will include developing an internal process for the review, identification, and selection of collections for scanning across the UC libraries; developing criteria for evaluating potential collections for scanning; communicating with CDL staff, UC bibliographer consortial groups, HOPS members, and HOTS members as needed for advice and assistance pertaining to technical and programmatic issues as recommendations are developed for collection scanning; and advising the SOPAG Collection Development Committee (CDC) on issues about collection development for mass digitization projects and recommending collections for their review and approval. The CDC has proposed members for this committee, who will be appointed in the near future.

We wish to express our deep thanks to all of the UC librarians and CDL staff who have helped and will continue to assist in this great effort. Congratulations on reaching this milestone!



2. UC Electronic Resources Management System (ERMS) Project Update

The following is a summary of recent ERMS project activities, and plans for the coming months.

Ex Libris product manager Ted Koppel visited the CDL in November. As a result of this meeting, CDL is in the process of upgrading our test environment to the beta version of Ex Libris’ ERMS product, Verde 2.0. We are aiming for this upgrade to be completed in January.

Key points
  • This is a continuation of what we started last year since it is an upgrade of our test environment only. When we are done testing, we will not be saving the test data.
  • This installation is planned to include a central instance (CDL) and up to 4 campus instances. The campus instances will be determined based on our need to test interoperability between Verde and SFX (which underlies our UC-eLinks service).
  • We plan to open this up to the SOPAG ERMS implementation team to use in their work.
Benefits
  • Campuses will be able to interact with the latest version and experience the workflow.
  • Campuses will be able to analyze their data and processes against the latest version of the software.
  • CDL will be able to test the consortial functionality that has resulted from a first round of joint development with Ex Libris -- functionality that is required to meet UC’s needs.
  • CDL will be able to experiment with data loading, and test other specific concerns.
The systemwide ERMS Implementation team is making progress on overall project guidelines:
  • Identifying policy issues that will arise and creating an overall guidelines document that will begin our best practices documentation,
  • Working on defining a minimal data set that we all agree each campus should strive for,
  • Identifying data elements that require authority, and charting how that authority is currently determined on each campus. The end goal will be to have agreed-upon systemwide authority guidelines settled on ahead of time, for elements where these are necessary.

CDL is continuing discussion with other consortial institutions who are also implementing Verde 2.0 or plan to implement it. If, after further investigations with these consortia, and after completing our own testing, we discover no major issues that would prevent us from making use of Verde 2.0, we will proceed on a path to install the Production version.

CDL will be testing early in the year, and we are hopeful that implementation of a production version can proceed mid-year 2007.

For further information, see http://www.cdlib.org/inside/projects/erms/



3. XTF Update

CDL announces the latest XTF release: version 1.9. The main feature of the new XTF release is greatly improved documentation. Almost all features are now fully documented, allowing users to take better advantage of the system.

Users of XTF version 1.9 will also find the following new features:
  • Stylesheets with a real HTTP redirect, to send the user's browser to a different URL
  • Improved full-text scoring, file handling and numeric data searching
  • New query operator: multi-field AND, that requires *all* terms to be present, but in *any* of the listed fields. Default stylesheets now use this for a basic "keyword" search.
  • New query operator: orNear. This is like a typical OR query, except that when multiple terms are present in the same metadata field, their proximity is taken into account when scoring.
  • Experimental dynamic FRBR mode... see docs/experimental.html for details.
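
To make the multi-field AND behavior described above concrete, here is a minimal Python sketch of its semantics only. It is not XTF code or XTF query syntax; the function name, the record layout, and the sample values are hypothetical and chosen just for illustration.

```python
import re


def multi_field_and(record, terms, fields):
    """Return True if every term occurs in at least one of the listed fields.

    This mirrors the idea of a multi-field AND: all terms are required,
    but each term may be found in any of the fields being searched.
    """
    words = set()
    for name in fields:
        # Tokenize each field into lowercase words (hypothetical, simplified).
        words.update(re.findall(r"\w+", record.get(name, "").lower()))
    return all(term.lower() in words for term in terms)


# Hypothetical metadata record, for illustration only.
record = {
    "title": "Sutro Tunnel history",
    "creator": "Adolph Sutro",
    "subject": "Nevada mining",
}

# "sutro" is in the title and "nevada" is in the subject, so this matches.
print(multi_field_and(record, ["sutro", "nevada"], ["title", "subject"]))
# "bridge" appears in none of the listed fields, so this does not match.
print(multi_field_and(record, ["sutro", "bridge"], ["title", "subject"]))
```

A search that required all terms within a single field would miss the first query, since no one field contains both words. That is why this operator suits a basic "keyword" search.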

As always, please let us know if you encounter any bugs or problems with this release.

We are also pleased to report that XTF continues to attract users across the globe. Most recently, it has been deployed at the Grupo de Estudos em Direito das Telecomunicações, where it is being used to run searches on Brazilian acts of telecommunication law.

View the site at: http://www.gds.nmi.unb.br:8080/xtf/search

(terms to try: fust, universalização, "zona rural")



4. New Citation Report provides H Index

By Beth Weil (UC Berkeley), Web of Science Resource Liaison

The latest quarterly software update for Web of Science introduced a new feature called the Citation Report. It captures citation activity and identifies citation trends. The Citation Report enables you to instantly create formatted reports for any General Search (author, topic) of up to 10,000 records. A breakdown of the citation history for each record is also available and can be exported for further analysis.

The report shows the number of articles published/year and the number of cites/year for a particular author or any general search. It also calculates something called the h index. This metric is useful because it discounts the disproportionate weight of highly cited papers or papers that have not yet been cited. The h index was developed by J.E. Hirsch and published in Proceedings of the National Academy of Sciences of the United States of America 102 (46): 16569-16572 November 15 2005.
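
For readers curious about the arithmetic: as Hirsch defines it, an author has index h if h of his or her papers have each been cited at least h times. The short Python sketch below illustrates that definition only. It is not how Web of Science computes the value, and the function name and sample citation counts are made up for the example.

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Hypothetical citation counts for one author's six papers.
example_counts = [25, 8, 5, 3, 3, 0]
print(h_index(example_counts))  # prints 3
```

In this example the single paper with 25 citations raises the index no more than a paper with 3 citations would, and the uncited paper does not lower it.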

Follow these steps to use this tool:

a) Go to Web of Science http://isiknowledge.com/wos
b) Choose the General Search button
c) Do an Author or Topic search or another search that results in fewer than 10,000 records.

The Citation Report button appears on the right-hand side of the screen, just below the Analyze Result button.

It is hoped that this measure might replace some of the importance placed on the impact factor of specific journals. However, the usefulness of the measure is quite variable in different disciplines.



5. Retrospective Bibliographic Searching in the Life Sciences

By Beth Weil (UC Berkeley), Web of Science Resource Liaison

UC has just obtained access to the Biosis backfile (Biosis Archive), which comprehensively covers the biological sciences literature from 1926 to 1968. The current Biosis file covers 1969 to the present. At the present time we have the following resources for retrospective searching:

Biosis: 1926-present
PubMed: 1951-present
Web of Science: 1900-present

With such a wealth of resources, where is the best place to start?

The Biosis data is indexed by
  • Broad subject heading (e.g., Genetics, Ecology, Cell Biology)
  • Genus/species (if present)
  • Organism classifier (usually at the Family or Order level)
  • Broad taxonomic groups (birds, mammals, plants, bacteria)

for all articles back to 1926. The majority of the records also have abstracts. (Review articles, short papers and conference proceedings generally do not have abstracts.) This makes it a very rich resource for searching the literature of the biological sciences.

Because the abbreviations used for a journal title varied quite a bit during these early years, links provided by UC-eLinks are particularly unreliable. Please ask your reference librarian for help in locating these early works. We have many of them, but UC-eLinks frequently cannot find them.

Web of Science

Web of Science started to include abstracts and keywords supplied by the author around 1991. Before that date, subject searching was limited to only words in the title of the article. Web of Science is particularly known for the benefits of its cited reference searching and the ability to follow a concept or idea forward and backward in the literature. However, since UC researchers have access to Web of Science and Biosis through the same Web of Knowledge platform, it is now best to start your search in Biosis and then move to Web of Science via one click to follow the cited references.

PubMed

PubMed currently goes back to 1951. The Old Medline records (1951-1965) in PubMed do not have abstracts and were indexed with subject headings that differ from the MeSH vocabulary currently in use. So you will probably need to expand the list of synonyms you use in your search to comprehensively search this portion of the database. NLM has begun an OLDMEDLINE subject heading-to-MeSH heading mapping project. This project maps the original subject headings assigned to the citations when they appeared in the print indexes to the current MeSH vocabulary. Subject retrieval is better than in Web of Science. However, during this time period, Medline only covered clinical medicine and the basic medical sciences. Biosis is highly recommended for coverage of any non-medical areas and actually does a good job in medicine as well.




Subscription information for the CDLINFO newsletter is available at: http://www.cdlib.org/inside/news/cdlinfo/