UC Libraries Mass Digitization Projects

Frequently Asked Questions

  1. What is mass digitization?
  2. What are the University of California’s mass digitization goals?
  3. What mass digitization projects are currently underway at UC?
  4. What books are being scanned?
  5. Who is doing the digitizing?
  6. How and where is the digitizing being done?
  7. What will happen to the books after digitization?
  8. Are there standards regarding the quality of the scans?
  9. What rights to the digitized content does UC have in the projects; will access be limited in any way?
  10. How will our patrons be able to access these texts (e.g., through MELVYL, local catalogs, a webpage, or any search engine)?
  11. Will access be different for the general public than it is for our faculty, staff and students?
  12. What can users do with the texts?
  13. How will patrons be impacted while books are being digitized?
  14. What happened to the Microsoft project?

Last Updated: June 4, 2009

The Basics:

1. What is mass digitization?

The goal of mass digitization is not to create individual collections but to digitize the books in the world’s libraries on a grand scale: ideally, every book ever printed.  Millions of books from the UC Libraries will be scanned through our participation in mass digitization projects.  To do this economically and with some speed, mass digitization is based on the efficient photographing of books, page by page, and subjecting those images to optical character recognition (OCR) software to produce searchable text.  Human intervention is reduced to a minimum, so the OCR output is generally used without undergoing additional revision.  Also, only limited structural markup, such as page numbers, tables of contents, and indices, is included.
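
The step from OCR output to searchable text can be sketched as a small inverted index.  This is a minimal illustration only, assuming the OCR text is already available as one string per page; the function names and the sample pages are hypothetical and do not describe the software used in either project.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to the sorted page numbers where it occurs.

    `pages` is a dict of page number -> raw OCR text (hypothetical input shape).
    """
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


def search(index, term):
    """Return the page numbers containing `term`, or an empty list."""
    return index.get(term.lower(), [])


# Made-up sample pages standing in for uncorrected OCR output.
sample_pages = {
    1: "Early histories of the settlement of California",
    2: "The settlement of the West",
}
sample_index = build_index(sample_pages)
```

Because the OCR output is used without revision, a word that the OCR software misread would be indexed exactly as it was read, and a search for the correct spelling would not find that page.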

2. What are the University of California’s mass digitization goals?

Mass digitization projects expand the UC Libraries’ ability to give faculty, students and the public access to information and support our exploration of new service models.  These projects are designed to:

  • Enhance student and faculty research.  Mass digitization of these materials increases awareness of the rich materials in our collections and enhances access.
  • Enable scholars to trace the evolution of ideas and perform other sophisticated textual analysis more easily by indexing the full text and making it searchable by computer, supporting scholarship in new ways.
  • Fulfill UC’s public service mission.  Many books of enduring general interest that are in the public domain, including classic works of literature as well as rarer items such as early histories of the settlement of California and the West, can now be read by anyone, anywhere, anytime.
  • Preserve and protect our collections.  In earthquake- and fire-prone California, digitizing the books in our collections may also help protect the university from catastrophic loss should disaster someday strike our libraries.

Mass digitization will allow the UC Libraries to explore new questions and service models including but not limited to the following areas:

  • Enhanced Discovery and Access:  how will improved access to mass digitized materials in our print collections support the research, teaching, and private study needs of students, faculty, and other library users?
  • Collection Management: can mass digitization help support our efforts to manage campus print collections and build more effective shared print collections?
  • New Services to Users:  what new service opportunities and/or research paradigms are enabled by massively digitizing our library collections?
  • Curating through Collaboration: will participation in mass digitization projects help create access for our users to third-party materials not currently available through our own libraries?
  • Funding Reallocation: to what extent can the digital reformatting of our own collections of public domain works obviate or lessen the need to allocate funds toward licensing online collections of these same materials from commercial providers?

3. What mass digitization projects are currently underway at UC?

The UC Libraries are currently participating in two mass digitization projects: Google Book Search and the Open Content Alliance.  These are non-exclusive agreements, and the UC Libraries may enter into agreements with other digitization projects as they arise.

Google Book Search: http://books.google.com/
In the Google Book Search project, books and serials (both in-copyright and in the public domain) in all languages are being scanned.  The Google-University of California contract targets scanning 2.5 million volumes over a period of six years.

Open Content Alliance: http://www.archive.org/details/americana  and http://www.archive.org/details/texts
The Open Content Alliance (OCA) represents the collaborative efforts of an international group of cultural, technology, nonprofit, and governmental organizations that are building a permanent archive of multilingual digitized text and multimedia content in the public domain.  The UC Libraries’ earliest forays into mass digitization were through the OCA.  Scanning has been project driven, with funding provided by various organizations including the California Digital Library, Microsoft, the Sloan Foundation and Yahoo.  These books are available on the Internet Archive website.

The OCA is in the process of reorganization into a successor organization called the Open Knowledge Commons.  UC plans to continue to participate.  More details will be shared when available.

4. What books are being scanned?

Books of American literature, mathematics, and other subjects from both general and special collections have been selected by UC librarians in consultation with sponsors.

The Scanning Process:

5. Who is doing the digitizing?

Google is scanning books and serials for the Google Book Search project.  The Internet Archive serves as digitization agent for the Open Content Alliance.  Books are not destroyed during the digitization operation in either of these projects.

6. How and where is the digitizing being done?

The Google and OCA projects both employ non-destructive scanning technology.
Books scanned through the Google Book Search project are being digitized offsite in a Google-managed facility.  Books scanned through the Open Content Alliance are digitized at Internet Archive facilities including the SRLF facility within the UC Libraries system.

7. What will happen to the books after digitization?

All books are returned to their home locations after digitization. Books are generally returned to the shelf within two to three weeks.

8. Are there standards regarding the quality of the scans?

CDL and UC Berkeley Library staff have been actively engaged in deliberations with technical staff at Google and the Internet Archive regarding quality standards for the two scanning projects.  CDL will be performing automated quality assurance (QA) checking using a tool developed at Harvard to verify that files conform to the relevant ISO format specification.  Partner Libraries for the mass digitization projects have developed technical specifications for image compression to ensure efficient and high-quality long-term storage of the derived page images.
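
As a rough idea of what an automated format check involves, the sketch below tests whether a file begins with the signature of a known image format.  It assumes, purely for illustration, that page images are JPEG 2000 (JP2) files; this FAQ does not name the format or the QA tool, and a signature test is far simpler than full conformance validation.  The function names and input shape are hypothetical.

```python
# The 12-byte signature box that opens a JP2 (JPEG 2000 Part 1) file.
# Assumption: JP2 is used here only as an example format.
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")


def has_jp2_signature(data: bytes) -> bool:
    """Return True if the bytes begin with the JP2 signature box."""
    return data.startswith(JP2_SIGNATURE)


def failing_files(files: dict) -> list:
    """Return the sorted names of files whose contents lack the signature.

    `files` maps a file name to its raw bytes (hypothetical input shape).
    """
    return sorted(
        name for name, data in files.items() if not has_jp2_signature(data)
    )
```

A batch report of failing file names, rather than a stop at the first failure, suits a workflow in which very large numbers of page images are checked without human review.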

9. What rights to the digitized content does UC have in the projects; will access be limited in any way?

All contracts specify that UC digital images will be available to the UC Libraries to download and manage.  The UC Libraries’ digital copy is subject to certain rights and restrictions regarding use and distribution.  The University of California’s use or ability to display the downloaded copies of the full text of all books is subject to the restrictions of copyright law.  Full-text searching will be possible for all of the digitized books, but some scanned books will not be completely viewable due to copyright restrictions.  Specifics include:

Google

  • UC Libraries have the right to use the UC Libraries digital copy at the University’s sole discretion, subject to copyright law, as part of the services offered to University Library patrons (including all individuals and organizations served from the UC Libraries websites).
  • UC Libraries must implement technological measures to restrict automated access by crawlers, robots, spiders etc. to the UC Libraries digital copy.
  • UC Libraries may not permit downloading for commercial purposes.
  • UC Libraries may not knowingly permit the automated downloading and redistribution of the UC Libraries digital copy by third parties.  UC Libraries must develop methods for ensuring that substantial portions of the UC Libraries digital copy are not downloaded from the UC Libraries website or otherwise disseminated in bulk.
  • UC Libraries are permitted to distribute no more than 10% of the UC Libraries digital copy to other libraries and educational institutions for non-commercial, research, scholarly, or academic purposes (but not any portion of image coordinates).
  • UC Libraries are permitted to distribute all or any portion of public domain works contained in the UC Libraries digital copy (but not any portion of image coordinates) to other research libraries for use by those libraries’ authorized students, faculty, and staff for research, scholarly, or academic purposes.
  • Image coordinates, which link words in the OCR’d full text to specific locations on the viewable page, may not be shared with any entity.
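
To make the term concrete, the sketch below models image coordinates as a bounding box attached to each OCR word and shows the coordinates being dropped before a text is shared.  The `OcrWord` structure, its field names, and the sample values are hypothetical; the FAQ does not describe the actual file format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class OcrWord:
    text: str
    # (x, y, width, height) of the word on the page image; None once removed.
    box: Optional[Tuple[int, int, int, int]] = None


def strip_coordinates(words: List[OcrWord]) -> List[OcrWord]:
    """Return copies of the words with the text kept and the coordinates dropped."""
    return [OcrWord(text=word.text, box=None) for word in words]


# Made-up sample data for one line of a page.
page = [
    OcrWord("Mass", (120, 80, 64, 18)),
    OcrWord("digitization", (190, 80, 140, 18)),
]
shareable = strip_coordinates(page)
```

The function returns new objects rather than modifying the originals, so the local copy keeps its coordinates for uses such as highlighting a search term on the page image.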

Open Content Alliance

  • There are no restrictions on access or redistribution placed on the UC Libraries digital copy.

10. How will our patrons be able to access these texts (e.g., through MELVYL, local catalogs, a webpage, or any search engine)?

UC Libraries patrons can currently access UC Libraries scanned books via the digitization partner websites and the Melvyl Catalog listed below:

Google Book Search
(Currently there is no ability to browse or search the UC Libraries subset.)

Internet Archive

(The UC Libraries subset can be searched at http://www.archive.org/details/university_of_california_libraries )

Melvyl Catalog: http://melvyl.cdlib.org/F/?func=file&file_name=find-b&local_base=U-CDL90
(Links to Google books in UC and all partner collections)

UC Libraries books and serials made available through Google Book Search that are in copyright are subject to limited views.  Google indexes the full text, but does not serve or display the full-sized digital image or make it available for printing or download unless Google has permission or a license from the copyright owner to do so.  Therefore, UC Libraries patrons will have limited access to books and serials in copyright made available through Google Book Search.  Books in the public domain are generally fully viewable and downloadable.

Ever since UC entered into its first mass digitization partnership, UC Libraries have been studying the options for resource discovery through MELVYL and campus OPACs.  The Bibliographic Services Task Force, as well as HOTS and HOPS, have been involved in discussions of the various options for discovery of the mass digitized content.  After a great deal of analysis in this area, CDL put forward to the UC University Librarians the recommendation to pursue discovery of mass digitized content via the UC-OCLC WorldCat Local (WCL) pilot as the quickest and most resource-efficient way of exposing this content.  The WCL pilot is designed to elicit a great deal of information about end user behavior and expectations; the opportunity to gain an understanding of user behavior and needs specifically regarding our mass digitized content is an added benefit.

Following discussion and approval of this recommendation by the University Librarians at their June 2007 meeting, the UC-OCLC Implementation team charged and staffed a Mass Digitized Content task group.  This task group’s charge is to work with OCLC to provide discovery and delivery of the UC mass digitized content in the Next Generation Melvyl WCL pilot.

OCLC has a dedicated program, E-content synchronization, which is focused first on loading the Google records.  It is designed to expand at a later date to include additional mass digitized content.  The current plan for the pilot is to expose all of UC’s Google content to each of the campus-branded WCL OPACs, as well as the campus-wide WCL OPAC.  OCLC plans to create a separate online record for each print record and attach a UC-wide OCLC symbol to the online record.

It is important to note that the mass digitized bibliographic records are not yet available from OCLC; OCLC’s work plan to have them ready by the time the WCL pilot is launched in April 2008 has been delayed.  OCLC currently has no plans to develop a workflow to return the additional records created by OCLC to the campuses.  Additionally, there is no mechanism to select a particular campus’s print contributions to the project, should a campus decide it wanted only its own material exposed in its OPAC.

Depending upon assessment and feedback from the WCL pilot, any number of options for exposing content in local OPACs can be explored, if desired, including loading records locally or investigating other linking options.  Any option other than discovery through the pilot would involve significant campus and CDL resources.

HOPS requests that linking to mass digitization items as part of the WorldCat Local pilot be thoroughly evaluated before considering other options to load records into Melvyl or local catalogs.  HOTS members have unanimously indicated that they are not interested in receiving the mass digitization records at this time because of the high volume of records created in this project (anticipated to be more than 3 million records) and the effect they would have on campus ILSs.  As of February 2008 many hundreds of thousands of books have been digitized, representing only a fraction of the total that is expected.  If a campus library wishes to receive the records, new workflows will have to be thought through and worked out according to OCLC’s schedule.  Additionally, each campus will have to work out a mechanism for suppressing these records from going to the existing Melvyl Catalog, which is not designed to handle this increased amount of data.  Although each campus could approach OCLC to work out how to get records back locally, should the campuses desire it, CDL might be able to play a brokering role starting in the second quarter of 2008.

As of March 2008, a Google Books API was deployed so that links to Google Books could be displayed in the Melvyl Catalog. The API went live April 25.  Depending on the copyright status of the book, Google will return a full text view, a limited preview, a snippet, or record view.
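
The four outcomes described above can be pictured as a small lookup from view level to the link text a catalog displays.  This is a sketch under stated assumptions: the `ViewLevel` names and the link labels are hypothetical, and the code does not call or reproduce the Google Books API or describe how the Melvyl Catalog is implemented.

```python
from enum import Enum


class ViewLevel(Enum):
    """The four outcomes named in this FAQ (identifiers are hypothetical)."""
    FULL = "full text view"
    LIMITED = "limited preview"
    SNIPPET = "snippet"
    RECORD = "record view"


# Hypothetical link labels a catalog might show for each level.
LINK_LABELS = {
    ViewLevel.FULL: "Read the full text at Google Books",
    ViewLevel.LIMITED: "Preview selected pages at Google Books",
    ViewLevel.SNIPPET: "Search inside this book at Google Books",
    ViewLevel.RECORD: "View book information at Google Books",
}


def link_label(level: ViewLevel) -> str:
    """Return the catalog link text for a given view level."""
    return LINK_LABELS[level]
```

Keeping the labels in one table means every view level must have an entry, so a book never appears in the catalog with a link that promises more access than its copyright status allows.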

Availability through the pilot will be a timely opportunity to evaluate WorldCat Local as a discovery vehicle for the mass digitized materials before planning significant local projects to expose this material locally.

11. Will access be different for the general public than it is for our faculty, staff and students?

The general public and UC faculty, staff and students will have the same access to digitized UC Libraries books when searching and browsing the websites listed below and the Melvyl Catalog:

Google Book Search
Internet Archive and Open Library
Melvyl Catalog

In the future, should the UC Libraries establish a UC website providing access to UC Libraries digital copies of books scanned by Google, Microsoft and the Open Content Alliance, the use and distribution of these digital copies will be subject to the contractual rights and restrictions outlined in question 9.

12. What can users do with the texts?

The Google Book Search website and the Internet Archive website both permit users to:

  • Search across books
  • If the book is in the public domain, view the book online and download, save, and print a PDF version to read

Google Book Search permits users to:

  • Browse
  • Search within the book
  • Create and share annotations
  • Buy the book
  • Find it in a library
  • Learn about the publisher
  • Find more information on the “About the Book” page, including reviews, other editions, references from web pages, books and scholarly works, and maps of places mentioned in the book.

Google Book Search provides several views of the books based on specific privileges:

  • Full View: books in the public domain can be viewed, downloaded, and printed in their entirety
  • Limited View: available if the publisher or author has given permission for users to see a limited number of pages from the book as a preview
  • Snippet View: shows information about the book plus a few snippets (a few sentences that display the search term in context)
  • No preview: provides basic information about the book, including whether a user’s search term appears anywhere in the book, even if it is not in the title or index

Internet Archive’s site offers:

  • A variety of viewing options, including the Flip Book viewer, which recreates the experience of reading a printed book
  • Keyword browsing
  • The ability to write a review

The Open Library site offers:

  • Search across books
  • Advanced search

Additional services may be available from these partner websites as they are continually evolving and being enhanced.

13. How will patrons be impacted while books are being digitized?

The mass digitization projects are designed to affect patrons as little as possible.  All of the digitization partners have agreed to a maximum check-out period of two weeks.  In practice, the turnaround time for a given book in both projects is closer to one week.

14. What happened to the Microsoft project?

Microsoft ceased funding for their Live Search Books program in May 2008.  Close to 150,000 public domain UC books were scanned with Microsoft funding during our participation in the program.  All books digitized via the Microsoft project are available via Internet Archive.

Internet Archive http://www.archive.org/details/university_of_california_libraries
