The eScholarship controller coordinates the processing of objects from the point of submission to the point of publication within the eScholarship system, including:
- Harvesting: Locate and copy ("harvest") each item from the eScholarship submission and publishing system. This process brings together the metadata (e.g., title, author, date), PDF full text, and any supplementary files.
- Text Extraction: Extract text from each PDF file (eliminating formatting information, colors, images, etc.). This text will be used for keyword searching.
- OCR Processing: Apply Optical Character Recognition (OCR) when text extraction alone is not enough, for instance when the PDF file consists of scanned sheets of paper.
- Indexing: Create an electronic "card catalog" so users can efficiently search the whole collection by keyword.
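The source does not say how the system decides that a PDF needs OCR. One plausible approach, sketched here purely as an assumption, is to check how little text the extraction step produced: a scanned document typically yields almost none. The function name `needs_ocr` and the threshold of 50 characters per page are hypothetical choices for illustration.

```python
def needs_ocr(extracted_text: str, page_count: int,
              min_chars_per_page: int = 50) -> bool:
    """Guess whether a PDF is a scan, based on how little text was extracted.

    Hypothetical heuristic: the threshold is an assumption, not a value
    taken from the eScholarship system.
    """
    if page_count <= 0:
        return False  # nothing to process, so nothing to OCR

    # Count only visible characters; extraction from scans often yields
    # nothing but form feeds and blank lines.
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible / page_count < min_chars_per_page
```

A three-page scan that yields only page breaks (`"\f\f\n"`) would be flagged for OCR, while a three-page document with a thousand extracted characters would not.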
All of these activities can run at the same time, coordinated by a central controller. The control system design is novel in its simplicity:
- Every piece of information about an item is stored in a single folder, eliminating the need for a cumbersome central database. The folder is easy to find because it is named for the item.
- When one task finishes with an item, the controller decides what to do next. In this fashion, tasks don't interact with each other and can remain simple and independent. Complex and brittle synchronization strategies are entirely avoided.
- Because control data is stored in plain text, our staff can check on an item or the entire system by reading the files; backing up or moving individual items involves nothing more than tried-and-true file utilities.
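The design points above can be illustrated with a minimal sketch. It assumes each item folder holds a plain-text file listing the stages already completed, one per line; the file name `status.txt`, the stage names, and the helper functions are all hypothetical and are not taken from the actual controller.

```python
from pathlib import Path
from typing import List, Optional

# Hypothetical stage order, loosely following the task list above
# (OCR is left out because it only applies to some items).
STAGES = ["harvest", "extract", "index"]
STATUS_FILE = "status.txt"  # assumed name for the plain-text control file


def completed_stages(item_dir: Path) -> List[str]:
    """Read the stages recorded as finished in the item's own folder."""
    status = item_dir / STATUS_FILE
    if not status.exists():
        return []
    return [line.strip() for line in status.read_text().splitlines()
            if line.strip()]


def next_stage(item_dir: Path) -> Optional[str]:
    """Controller decision: the first stage not yet recorded as done."""
    done = set(completed_stages(item_dir))
    for stage in STAGES:
        if stage not in done:
            return stage
    return None  # every stage has finished for this item


def mark_done(item_dir: Path, stage: str) -> None:
    """Record a finished stage by appending one line of plain text."""
    item_dir.mkdir(parents=True, exist_ok=True)
    with (item_dir / STATUS_FILE).open("a") as f:
        f.write(stage + "\n")
```

In this sketch the tasks never call one another: each one finishes, the result is recorded in the item's folder, and only the controller consults `next_stage`. Because the state is an ordinary text file, it can be inspected with any text viewer and copied with ordinary file utilities.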
The system uses the following standards and tools:
- Harvesting: Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
- File-based storage: Pairtrees for Object Storage
- Text extraction: Poppler library
- Indexing and search: eXtensible Text Framework (XTF)
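To show how a Pairtree makes an item's folder "easy to find because it is named for the item", here is a simplified sketch of the identifier-to-path mapping. Pairtree splits an identifier into two-character segments that become nested directories under `pairtree_root`. The full specification also defines how to escape special characters; this sketch skips that step and accepts only plain ASCII letters and digits. The function name `pairtree_path` is hypothetical.

```python
from pathlib import PurePosixPath


def pairtree_path(identifier: str,
                  root: str = "pairtree_root") -> PurePosixPath:
    """Map a simple identifier to its Pairtree directory path.

    Simplified sketch: character escaping from the Pairtree
    specification is not implemented, so other characters are rejected.
    """
    if not (identifier.isascii() and identifier.isalnum()):
        raise ValueError("sketch supports only ASCII letters and digits")

    # Split into two-character segments; a trailing odd character
    # becomes a one-character segment of its own.
    pairs = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]
    return PurePosixPath(root, *pairs)


print(pairtree_path("abcd"))   # pairtree_root/ab/cd
print(pairtree_path("abcde"))  # pairtree_root/ab/cd/e
```

Because the path is computed from the identifier alone, no lookup table or database is needed to locate an item's folder.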