Data Identifiers for Sharing, Citing, and Archiving your Data

There are many aspects to organizing your data. You'll want to consider using public identifier schemes if you want to share or cite your datasets. You may want your identifier scheme to make it easy to reference distinct versions and components of a dataset. You'll also want to archive your datasets where other people can access them.

Public data identifiers should be:

  • "actionable" (you can "click" on them in a web browser),
  • globally unique, and
  • persistent.

In today's Internet environment, this means data identifiers should fit inside URLs (URIs that start with "http://") and be managed well enough, through a combination of stable storage and identifier redirection, to remain actionable over the long term. There are many different identifier schemes to choose from, but by far the most important factor in long-term data sharing will be stable data storage and well-managed identifier redirection.

Another important factor is which URL hostname to use. This is the domain name at the beginning of a URL (right after the "http://") that determines where URL "resolution" starts, for example, daac.ornl.gov. Any URL can be thought of as resolving either directly to its target (via the URL's hostname) or indirectly through one or more "redirects" to a final target URL. An identifier that doesn't contain a hostname may implicitly use a well-known hostname as the starting point for resolution, for example, dx.doi.org for DOIs, handle.net for Handles, and n2t.net for ARKs.
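
As a rough illustration of indirect resolution, the sketch below follows a chain of redirects held in a plain Python dictionary until it reaches a URL with no further redirect. The function name follow_redirects, the redirect table, and the example.org URLs are hypothetical. A real resolver does this through HTTP redirect responses rather than an in-memory table.

```python
def follow_redirects(url, redirects, max_hops=10):
    """Return (final_url, path) after following entries in `redirects`.

    `redirects` maps a URL to the URL it redirects to. A URL that is
    not a key in the table is treated as the final target.
    """
    path = [url]
    while url in redirects:
        # Allow at most `max_hops` redirects before giving up.
        if len(path) > max_hops:
            raise RuntimeError("too many redirects")
        url = redirects[url]
        # A URL seen earlier in the chain means the redirects loop.
        if url in path:
            raise RuntimeError("redirect loop at " + url)
        path.append(url)
    return url, path


# Hypothetical redirect table: an ARK at n2t.net points to a repository
# landing page, which in turn points to the current storage location.
REDIRECTS = {
    "http://n2t.net/ark:/12345/x1": "http://repo.example.org/datasets/1",
    "http://repo.example.org/datasets/1": "http://storage.example.org/files/1.csv",
}

final_url, path = follow_redirects("http://n2t.net/ark:/12345/x1", REDIRECTS)
```

Here the first URL resolves indirectly through two redirects, while any URL missing from the table resolves directly to itself.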

Because persistent (long-term) identifiers tend to be opaque strings (e.g., a string of digits) that reveal little or nothing about the nature of the identified object, it is also important for you to maintain metadata associated with the object. Among the most important pieces of metadata to maintain is the target URL, which ensures that the identifier remains actionable. If you don't maintain the target URL for whatever identifier scheme you choose, the identifier will break.
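
To make the role of the target URL concrete, here is a minimal sketch of an identifier record kept in memory. The class names, field names, identifier string, and example.org URLs are all assumptions made for illustration; they do not describe the interface of any real identifier service.

```python
from dataclasses import dataclass, field


@dataclass
class IdentifierRecord:
    """Metadata kept for one opaque identifier (hypothetical structure)."""
    target_url: str
    metadata: dict = field(default_factory=dict)


class IdentifierRegistry:
    """Maps identifiers that never change to target URLs that may."""

    def __init__(self):
        self._records = {}

    def register(self, identifier, target_url, **metadata):
        if identifier in self._records:
            raise ValueError("identifier already registered: " + identifier)
        self._records[identifier] = IdentifierRecord(target_url, dict(metadata))

    def update_target(self, identifier, new_target_url):
        # Only the target changes; the identifier itself stays the same.
        self._records[identifier].target_url = new_target_url

    def resolve(self, identifier):
        return self._records[identifier].target_url


registry = IdentifierRegistry()
registry.register(
    "ark:/12345/x1",  # hypothetical identifier
    "http://old-host.example.org/data/1",
    title="Example dataset",
)

# The dataset moves to new storage: update the target, keep the identifier.
registry.update_target("ark:/12345/x1", "http://new-host.example.org/data/1")
current_target = registry.resolve("ark:/12345/x1")
```

If update_target were never called after the move, the identifier would still resolve to the old location, which is how identifiers break in practice.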

Here are some identifier schemes:

  • ARK (Archival Resource Key) — a URL with extra features allowing you to ask for descriptive and archival metadata, and to recognize certain kinds of relationships between identifiers. It is used by memory organizations such as libraries, archives, and museums. Resolution depends on HTTP redirection and can be managed through an API or a user interface. Does not call for a browser plug-in or usage fees.
  • DOI (Digital Object Identifier) — an identifier that becomes actionable when embedded in a URL. It has become popular in academic journal publishing. Resolution depends on HTTP redirection and the Handle identifier protocol, and can be managed through an API or a user interface. A browser plug-in can save you from typing "http://dx.doi.org" in front of it. Annual fees apply to each DOI.
  • Handle — an identifier that becomes actionable when embedded in a URL. A browser plug-in can save you from typing "http://handle.net" in front of it. Resolution depends on HTTP redirection and the Handle protocol, and can be managed through an API or a user interface. Annual fees apply to each local Handle server.
  • InChI (IUPAC International Chemical Identifier) — a non-actionable identifier for chemical substances that can be used in printed and electronic data sources, thus enabling easier linking of diverse data compilations.
  • LSID (Life Sciences Identifier) — a kind of URN that identifies a biologically significant resource, including species names, concepts, occurrences, genes or proteins, or data objects that encode information about them. Like other URNs, it becomes actionable when embedded in a URL.
  • NCBI (National Center for Biotechnology Information) ACCESSION — a non-actionable number used by NCBI.
  • PURL (Persistent Uniform Resource Locator) — a URL that is always redirected through a hostname (often purl.org). Resolution depends on HTTP redirection and can be managed through an API or a user interface. Does not call for a browser plug-in or usage fees.
  • URL (Uniform Resource Locator) — the typical "address" of web content. It is a kind of URI (Uniform Resource Identifier) that begins with "http://" and consists of a string of characters used to identify or name a resource on the Internet. Such identification enables interaction with representations of the resource over a network, typically the World Wide Web, using the HTTP protocol. Well-managed URL redirection can make URLs as persistent as any identifier. Resolution depends on HTTP redirection and can be managed through an API or a user interface. There are no usage fees.
  • URN (Uniform Resource Name) — an identifier that becomes actionable when embedded in a URL. Resolution depends on HTTP redirection and the DDDS protocol, and can be managed through an API or a user interface. A browser plug-in can save you from typing a hostname in front of it. There are no usage fees.
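
Several of the schemes above become actionable only when embedded in a URL. The following sketch shows one way that embedding could be done for identifiers written as "scheme:value". The function name to_actionable_url, the "doi:", "hdl:", and "ark:" prefixes, and the sample identifiers are assumptions for illustration, and the hostnames are the default resolvers mentioned in this page rather than a complete or authoritative list.

```python
# Default resolver hostnames, as mentioned in this page (illustrative only).
DEFAULT_RESOLVERS = {
    "doi": "dx.doi.org",
    "hdl": "handle.net",
    "ark": "n2t.net",
}


def to_actionable_url(identifier):
    """Embed a 'scheme:value' identifier in a URL at its default resolver."""
    # Identifiers that are already URLs need no embedding.
    if identifier.startswith(("http://", "https://")):
        return identifier

    scheme, sep, value = identifier.partition(":")
    scheme = scheme.lower()
    if not sep or scheme not in DEFAULT_RESOLVERS:
        raise ValueError("no default resolver known for: " + identifier)

    host = DEFAULT_RESOLVERS[scheme]
    if scheme == "ark":
        # ARKs keep their "ark:" label inside the URL path.
        return "http://" + host + "/ark:" + value
    return "http://" + host + "/" + value


# Hypothetical identifiers, not real datasets.
doi_url = to_actionable_url("doi:10.1234/example")
ark_url = to_actionable_url("ark:/12345/x1")
```

Non-actionable identifiers such as InChI strings or NCBI accession numbers have no entry in the table, so the function rejects them instead of guessing a hostname.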

CDL provides an identifier service called EZID that offers several choices of identifier. EZID enables you to take control of the management and distribution of your datasets, share and get credit for them, and build your reputation for the collection and documentation of research. By making data resources easier to access, re-use, and verify, EZID helps you to build on previous work, conduct new research, and avoid duplicating previous work.


Credit to the University of Virginia's Scientific Data Consulting Group and the MIT Libraries for permission to use and adapt their data management planning pages, and to members of the UC3 community. Please send us any comments about these guidelines.

Creative Commons License

Last updated: February 06, 2014
Document owner: Perry Willett