Download the narrative of the final project report (PDF)
Download the appendices (not included with the narrative):
The aim of this Extension to the Melvyl Recommender Project was to carry out deeper explorations into the most interesting and promising questions raised during the original project, and to add obvious missing pieces of functionality. The principal area of investigation was the impact of adding full-text objects to what had previously been a metadata-only index.
Full Text Extension Supplementary Report (PDF)
Browse the Relvyl prototype, which incorporates many of the features explored over the course of the project.
The use of a text-based discovery system, XTF, with its built-in relevance ranking capability, proved to be a promising approach. Performance on a series of simple load tests suggests that the system is capable of scaling to support millions of records and hundreds of concurrent users.
Experiments with index-based spelling correction were similarly positive. Starting with an existing index-based spelling correction algorithm and applying a number of optimizations, we met the goal of producing the right correction for a misspelled word (on the first try) 90% of the time.
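To make the idea of index-based spelling correction concrete, here is a minimal sketch. It is not the algorithm or the optimizations used in the project: it simply assumes that the search index can supply a table of terms with their frequencies (the hypothetical `term_freqs` below) and picks the closest indexed term by edit distance, breaking ties in favor of more frequent terms.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def suggest(word, term_freqs, max_distance=2):
    """Return the best correction for `word`, or None if nothing is close.

    `term_freqs` is a hypothetical mapping of indexed term -> frequency.
    A word already present in the index is returned unchanged.
    """
    word = word.lower()
    if word in term_freqs:
        return word
    # Sort key: smaller distance first, then higher frequency, then term.
    candidates = [(edit_distance(word, term), -freq, term)
                  for term, freq in term_freqs.items()]
    candidates = [c for c in candidates if c[0] <= max_distance]
    return min(candidates)[2] if candidates else None


# Illustrative data only; these frequencies are invented.
term_freqs = Counter({"library": 120, "literary": 40, "catalog": 75})
```

With this toy index, `suggest("librery", term_freqs)` returns `"library"`, since that term is one substitution away. A production system would avoid scanning the whole vocabulary for each query, for example by looking up candidates through character n-grams.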
Although they were not a central focus of the project, we conducted a shallow initial investigation of two strategies for improving navigation through large record sets: faceted browsing, and grouping results based on the Functional Requirements for Bibliographic Records (FRBR). In both cases, initial experiments suggest that delving more deeply into these areas will result in better service to patrons.
Our investigation of enhanced relevance ranking considered whether returning result sets using content-based relevance ranking, optionally boosted by weights based on circulation and holdings data, would improve the ability of patrons to complete typical academic tasks. A task-based user assessment showed that in general, academic users do prefer relevance ranked result sets to those that are unranked (current catalogs are typically unranked); preferences differed by level of subject area expertise. Limitations due to the design of the study prevented us from making a strong statement as to which of the three ranked methods that we tested will best serve the greatest number of patrons.
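As an illustration of what "content-based relevance ranking, optionally boosted by weights based on circulation and holdings data" could look like, the sketch below multiplies a content score by a boost that grows logarithmically with usage counts. The formula, the weight values, and the `Record` fields are all assumptions made for this example; the summary does not specify how the project combined these signals.

```python
import math
from dataclasses import dataclass


@dataclass
class Record:
    title: str
    content_score: float  # relevance score from the text index (assumed given)
    circulation: int      # number of checkouts (hypothetical field)
    holdings: int         # number of holding libraries (hypothetical field)


def boosted_score(rec, circ_weight=0.1, holdings_weight=0.05):
    """Content score scaled by a log-damped usage boost (assumed formula)."""
    boost = (1.0
             + circ_weight * math.log1p(rec.circulation)
             + holdings_weight * math.log1p(rec.holdings))
    return rec.content_score * boost


def rank(records, **weights):
    """Sort records from highest to lowest boosted score."""
    return sorted(records, key=lambda r: boosted_score(r, **weights),
                  reverse=True)


# Invented sample records for illustration.
records = [
    Record("Obscure monograph", content_score=1.0, circulation=0, holdings=0),
    Record("Standard textbook", content_score=0.9, circulation=500, holdings=40),
]
```

Setting both weights to zero reduces this to plain content-based ranking, so the same function can produce both a boosted and an unboosted ordering for comparison. The logarithm keeps heavily circulated items from overwhelming the content score.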
We explored two major strategies for generating recommendations: an approach based on the mining of circulation data (i.e., "patrons who checked this out also checked out..."), and an approach based on similarities in the content of bibliographic records ("more like this..."). A task-based user assessment of the former method showed that patrons are enthusiastic about using an online library catalog with a recommendation service; testing confirmed that recommendations were successful in supporting academic tasks. Moreover, the recommendation service was useful as a query expansion tool, suggesting alternative search strategies when users were boxed in by small or single result sets.
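The circulation-based approach can be sketched as a simple co-occurrence count: for each item, tally how often other items were checked out by the same patrons. This is a minimal illustration under assumed inputs, not the project's implementation; the `checkouts` mapping, the item identifiers, and the tie-breaking rule are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(checkouts):
    """Count, for each item, how often other items share a patron with it.

    `checkouts` maps an (anonymized) patron id to the items that patron
    checked out. Duplicate checkouts by one patron are counted once.
    """
    co = defaultdict(Counter)
    for items in checkouts.values():
        for a, b in permutations(set(items), 2):
            co[a][b] += 1
    return co


def recommend(item, co, top_n=3):
    """Return up to `top_n` items most often checked out alongside `item`."""
    ranked = sorted(co.get(item, {}).items(),
                    key=lambda kv: (-kv[1], kv[0]))  # count desc, then id
    return [other for other, _ in ranked[:top_n]]


# Invented circulation data for illustration.
checkouts = {
    "p1": ["A", "B", "C"],
    "p2": ["A", "B"],
    "p3": ["A", "D"],
    "p4": ["B", "C"],
}
co = build_cooccurrence(checkouts)
```

Here `recommend("A", co)` returns `["B", "C", "D"]`: item B shares two patrons with A, while C and D share one each. An item with no circulation history yields an empty list, which is one reason a content-based "more like this" method is a useful complement.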
Plans for future work consist of a mix of shorter- and longer-term initiatives that extend the work done to date. Shorter-term, more discrete activities include support for multi-word spelling correction; incorporating persistent personalization into the prototype as a building block for additional recommending work; and an exploratory effort to identify potential applications and stumbling blocks associated with retrieval in a mixed metadata/full-text environment. Longer-term tasks include additional work on automated strategies for grouping and clustering to better support search and presentation of very large data sets; extended work on recommending techniques; and investment in user-centered design and integration of new services.
Funding for this project was provided by the Andrew W. Mellon Foundation. The UCLA and UC Berkeley libraries, the Research Libraries Group and the Online Computer Library Center supplied circulation and holdings data used in relevance ranking and recommending experiments.
About a dozen CDL staff members were involved as team members in this project, participating in implementation or assessment activities or offering their expertise as advisors. This team met regularly over the course of the project.
Many other individuals at CDL contributed significantly to this project by facilitating or carrying out discrete tasks, including data acquisition and analysis, scripting, and systems support:
Annita Auyang, Database Administrator
Rebecca Doherty, Data Integrity Coordinator
Erik Hetzner, Digital Ingest Programmer
Sean O'Hara, Systems Architect
Raymund Ramos, Systems Architect
Michael Russell, Development Programmer
Virginia Sinclair, Bibliographic Analyst
Randy Lai, Digital Ingest Programmer