Harvesting System
The Publishing Group’s harvesting system performs the following tasks within the eScholarship publishing platform:
- Locate and copy ("harvest") each item from Berkeley Electronic Press’ EdiKit authoring system, and assign it a UC number. This process brings together the metadata (e.g., title, author, date), PDF full text, and any supplementary files.
- Extract the text from each PDF file, discarding formatting information, colors, images, and other non-text content. The extracted text is used for keyword searching. If the PDF consists of scanned page images rather than digital text, this step may require Optical Character Recognition (OCR).
- Classify and index the text. Classification analyzes each item and assigns it one or more disciplines based on a computer model our team constructed from hundreds of hand-classified papers. Indexing creates an electronic "card catalog" so users can efficiently search the whole collection by keyword.
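To make the "card catalog" idea concrete, here is a minimal sketch of a keyword index over extracted text. It is an illustration only, not how XTF works internally; the function names and the item ids in the usage note are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each keyword to the set of item ids whose text contains it.

    `documents` is a dict of {item_id: extracted_text}.
    """
    index = defaultdict(set)
    for item_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(item_id)
    return index


def search(index, query):
    """Return the sorted ids of items that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    # index.get() avoids creating empty entries for unknown words.
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)
```

For example, after `idx = build_index({"item1": "Ocean currents and climate", "item2": "Climate policy in California"})`, calling `search(idx, "climate policy")` looks up each word once instead of rescanning every item's text.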
All of these activities can run at the same time, coordinated by a central controller. The control system design is novel in its simplicity:
- Every piece of information about an item is stored in a single folder, eliminating the need for a cumbersome central database. The folder is easy to find because it is named for the item.
- When one task finishes with an item, the controller decides what to do next. In this fashion, tasks don't interact with each other and can remain simple and independent. Complex and brittle synchronization strategies are entirely avoided.
- Because control data is stored in plain text, our staff can check on an item or the entire system by reading the files; backing up or moving individual items involves nothing more than tried-and-true file utilities.
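The control design described above can be sketched roughly as follows. This is a simplified, single-process illustration under stated assumptions, not the actual controller: the stage names, the `status.txt` file name, and all function names are hypothetical, and the real system runs tasks concurrently.

```python
import tempfile
from pathlib import Path

# Hypothetical stage order; the real system's task names may differ.
STAGES = ["harvest", "extract", "classify", "index"]
STATUS_FILE = "status.txt"  # hypothetical name for the plain-text control file


def completed_stages(item_dir):
    """Read the item's status file and return the stages already finished."""
    status = Path(item_dir) / STATUS_FILE
    if not status.exists():
        return []
    return [line.strip() for line in status.read_text().splitlines() if line.strip()]


def next_stage(item_dir):
    """Controller decision: the first stage not yet recorded as done, or None."""
    done = set(completed_stages(item_dir))
    for stage in STAGES:
        if stage not in done:
            return stage
    return None


def mark_done(item_dir, stage):
    """Append the finished stage to the item's status file, one stage per line."""
    item_dir = Path(item_dir)
    item_dir.mkdir(parents=True, exist_ok=True)
    with open(item_dir / STATUS_FILE, "a") as handle:
        handle.write(stage + "\n")


def run_item(item_dir, tasks):
    """Run the remaining stages in order; tasks never call one another."""
    stage = next_stage(item_dir)
    while stage is not None:
        tasks[stage](Path(item_dir))  # each task sees only the item's folder
        mark_done(item_dir, stage)
        stage = next_stage(item_dir)
```

Because the only shared state in this sketch is a text file inside the item's folder, checking on an item amounts to reading that file, and an interrupted run resumes from the first stage not yet listed.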
Further information:
- Harvesting: Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
- File-based storage: Pairtrees for Object Storage
- Text extraction: Poppler library
- Classification: Keyphrase Extraction Algorithm (KEA)
- Indexing and search: eXtensible Text Framework (XTF)
Last updated: December 14, 2012
Document owner: Justin Gonder
