Creating your data

What type of data will be produced?

Establish a clear understanding of the nature of your data. Do you have, for example, numerical data, image data, text sequences or modeling data? How much data will be produced? How quickly will you add more data? How often will they be changed? Knowing exactly what kind of data you have, and how much, will inform many decisions about storage, backups and more. For instance, image data typically requires a lot of storage space, so you'll want to decide which of your images, if not all, you want to retain, and where such large datasets can be housed. You'll want to be sure to know your organization's capacity for storage and backups.

There are many aspects to organizing your data. If you plan to share or cite your data, you'll want to use a systematic naming scheme, put your datasets where other people can access them, and give your datasets identifiers that can be referenced easily.

Data identifiers for sharing, citing and archiving your data

The information at the beginning of this document will help you organize your datasets for your own use. But you'll want to consider the ramifications of naming choices if you want to share or cite your data, especially if people will want access to distinct versions of a dataset or components of a dataset. You'll want to put your datasets where other people can access them and give them, as well as any appropriate versions and components, data identifiers that can be referenced easily. Data identifiers should be:

  • "actionable" (you can "click" on them in a web browser),
  • globally unique, and
  • persistent.

In today's Internet environment, this means data identifiers should fit inside URLs (a kind of URI, here one that starts with "http://") and be managed well enough, through a combination of stable storage and identifier redirection, to remain actionable over the long term. There are many different identifier schemes to choose from, but by far the most important factor in long-term data sharing will be stable data storage and well-managed identifier redirection.

Another important factor is which URL hostname to use. This is the domain name at the beginning of a URL (right after the "http://") that determines where URL "resolution" starts, for example, daac.ornl.gov. Any URL can be thought of as resolving either directly to its target (via the URL's hostname) or indirectly through one or more "redirects" to a final target URL. An identifier that doesn't contain a hostname may implicitly use a well-known hostname as the starting point for resolution, for example, dx.doi.org for DOIs, handle.net for Handles, and n2t.net for ARKs.
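As a minimal sketch of how an identifier without a hostname can be made actionable, the Python function below prepends a well-known resolver hostname. The function name, the prefix labels, and the resolver table are illustrative assumptions rather than part of any official tool; the hostnames assumed here are dx.doi.org for DOIs, hdl.handle.net (the public proxy under handle.net) for Handles, and n2t.net for ARKs.

```python
# Hypothetical helper: the prefix labels and resolver table below are
# illustrative assumptions, not an official API.
RESOLVERS = {
    "doi": "dx.doi.org",
    "hdl": "hdl.handle.net",
    "ark": "n2t.net",
}


def to_actionable_url(identifier: str) -> str:
    """Return a URL that a browser can resolve for the given identifier."""
    # An identifier that is already a URL carries its own hostname.
    if identifier.startswith(("http://", "https://")):
        return identifier

    scheme, sep, rest = identifier.partition(":")
    host = RESOLVERS.get(scheme.lower())
    if not sep or host is None:
        raise ValueError(f"no known resolver for {identifier!r}")

    if scheme.lower() == "ark":
        # ARKs keep their "ark:" label inside the URL path.
        return f"http://{host}/ark:{rest}"
    return f"http://{host}/{rest}"
```

With a made-up DOI, `to_actionable_url("doi:10.1234/example")` returns `http://dx.doi.org/10.1234/example`. Resolution then proceeds from that hostname through whatever redirects the identifier's owner maintains.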

Because persistent (long-term) identifiers tend to be opaque strings (e.g., a string of digits) that reveal little or nothing about the nature of the identified object, it is also important for you to maintain metadata associated with the object. Among the most important pieces of metadata for you to maintain is the target URL, which keeps the identifier actionable. If you don't maintain the target URL for whatever identifier scheme you choose, the identifier will break.
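The relationship between an opaque identifier, its metadata, and its target URL can be sketched with a small in-memory registry. This is only an illustration under stated assumptions: the class and method names are hypothetical, the identifier and example.org URLs are made up, and a real identifier service would store these records on a managed server.

```python
from dataclasses import dataclass, field


@dataclass
class IdentifierRecord:
    """One opaque identifier with the metadata that keeps it usable."""
    identifier: str
    target_url: str
    metadata: dict = field(default_factory=dict)


class Registry:
    """Hypothetical in-memory stand-in for an identifier service."""

    def __init__(self):
        self._records = {}

    def register(self, identifier, target_url, **metadata):
        self._records[identifier] = IdentifierRecord(identifier, target_url, metadata)

    def resolve(self, identifier):
        # The identifier stays actionable only while this URL is current.
        return self._records[identifier].target_url

    def describe(self, identifier):
        return self._records[identifier].metadata

    def move(self, identifier, new_target_url):
        # When the data moves, update the target; the identifier itself
        # never changes, so existing citations keep working.
        self._records[identifier].target_url = new_target_url


registry = Registry()
registry.register(
    "ark:/99999/fk4demo",                      # made-up identifier
    "http://old.example.org/soil-2013.csv",    # made-up location
    title="Soil moisture readings",
)
registry.move("ark:/99999/fk4demo", "http://new.example.org/soil-2013.csv")
```

After the move, `registry.resolve("ark:/99999/fk4demo")` returns the new location, while the identifier and its descriptive metadata are unchanged.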

Here are some identifier schemes:

  • ARK (Archival Resource Key) — a URL with extra features allowing you to ask for descriptive and archival metadata, and to recognize certain kinds of relationships between identifiers. It is used by memory organizations such as libraries, archives, and museums. Resolution depends on HTTP redirection and can be managed through an API or a user interface. Does not call for a browser plug-in or usage fees.
  • DOI (Digital Object Identifier) — an identifier that becomes actionable when embedded in a URL. It has become popular in academic journal publishing. Resolution depends on HTTP redirection and the Handle identifier protocol, and can be managed through an API or a user interface. A browser plug-in can save you from typing "http://dx.doi.org" in front of it. Annual fees apply to each DOI.
  • Handle — an identifier that becomes actionable when embedded in a URL. A browser plug-in can save you from typing "http://handle.net" in front of it. Resolution depends on HTTP redirection and the Handle protocol, and can be managed through an API or a user interface. Annual fees apply to each local Handle server.
  • InChI (IUPAC International Chemical Identifier) — a non-actionable identifier for chemical substances that can be used in printed and electronic data sources, thus enabling easier linking of diverse data compilations.
  • LSID (Life Sciences Identifier) — a kind of URN that identifies a biologically significant resource, including species names, concepts, occurrences, genes or proteins, or data objects that encode information about them. Like other URNs, it becomes actionable when embedded in a URL.
  • NCBI (National Center for Biotechnology Information) ACCESSION — a non-actionable number used by NCBI.
  • PURL (Persistent Uniform Resource Locator) — a URL that is always redirected through a hostname (often purl.org). Resolution depends on HTTP redirection and can be managed through an API or a user interface. Does not call for a browser plug-in or usage fees.
  • URL (Uniform Resource Locator) — the typical "address" of web content. It is a kind of URI (Uniform Resource Identifier) that begins with "http://" and consists of a string of characters used to identify or name a resource on the Internet. Such identification enables interaction with representations of the resource over a network, typically the World Wide Web, using the HTTP protocol. Well-managed URL redirection can make URLs as persistent as any identifier. Resolution depends on HTTP redirection and can be managed through an API or a user interface. There are no usage fees.
  • URN (Uniform Resource Name) — an identifier that becomes actionable when embedded in a URL. Resolution depends on HTTP redirection and the DDDS protocol, and can be managed through an API or a user interface. A browser plug-in can save you from typing a hostname in front of it. There are no usage fees.

CDL provides an identifier service called EZID that offers several choices of identifier. EZID enables you to take control of the management and distribution of your datasets, share and get credit for them, and build your reputation for the collection and documentation of research. By making data resources easier to access, re-use, and verify, EZID helps you to build on previous work, conduct new research, and avoid duplicating previous work.

How will you document your data?

In order for your data to be used properly by you, your colleagues, and other researchers in the future, the data must be documented. Data documentation (which includes metadata) enables you to describe the content, formats, and internal relationships of your data in detail and will enable other researchers to find, use and properly cite your data.

It is critical to start documenting your data at the very beginning of your research project, before data collection begins. Doing so will make documentation easier and reduce the likelihood that you will forget aspects of your data later in the research project.

Researchers can choose among various metadata standards. Some metadata standards are designed for the purpose of documenting the contents of files, others for documenting the technical characteristics of files, and yet others for expressing relationships between files within a set of data. It is important to establish a metadata strategy that is capable of describing your data and satisfying your data management needs. For assistance in defining an adequate metadata strategy, please contact uc3@ucop.edu.

Below are some general aspects of your data that you should document, regardless of your discipline. At minimum, store this documentation in a "readme.txt" file, or the equivalent, with the data itself. You can also reference a published article that may contain some of this information.

General overview

  • Title: Name of the dataset or research project that produced it
  • Creator: Names and addresses of the organizations or people who created the data; preferred format for personal names is surname first (e.g., Smith, Jane)
  • Identifier: Unique number used to identify the data, even if it is just an internal project reference number
  • Date: Key dates associated with the data, including: project start and end date; release date; time period covered by the data; and other dates associated with the data lifespan, such as maintenance cycle, update schedule; preferred format is yyyy-mm-dd, or yyyy.mm.dd-yyyy.mm.dd for a range
  • Method: How the data were generated, listing equipment and software used (including model and version numbers), formulae, algorithms, experimental protocols, and other things one might include in a lab notebook
  • Processing: How the data have been altered or processed (e.g., normalized)
  • Source: Citations to data derived from other sources, including details of where the source data is held and how it was accessed
  • Funder: Organizations or agencies who funded the research

Content description

  • Subject: Keywords or phrases describing the subject or content of the data
  • Place: All applicable physical locations
  • Language: All languages used in the dataset
  • Variable list: All variables in the data files, where applicable
  • Code list: Explanation of codes or abbreviations used in either the file names or the variables in the data files (e.g., '999 indicates a missing value in the data')

Technical description

  • File inventory: All files associated with the project, including extensions (e.g., 'NWPalaceTR.WRL', 'stone.mov')
  • File formats: Formats of the data, e.g., FITS, SPSS, HTML, JPEG, etc.
  • File structure: Organization of the data file(s) and layout of the variables, where applicable
  • Version: Unique date/time stamp and identifier for each version
  • Checksum: A digest value computed for each file that can be used to detect changes; if a recomputed digest differs from the stored digest, the file must have changed
  • Necessary software: Names of any special-purpose software packages required to create, view, analyze, or otherwise use the data

Access

  • Rights: Any known intellectual property rights, statutory rights, licenses, or restrictions on use of the data
  • Access information: Where and how your data can be accessed by other researchers
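
The file inventory and checksum entries above lend themselves to automation. The following Python sketch, offered as one possible approach rather than a prescribed method, assembles the text of a "readme.txt" from a dictionary of documentation fields and lists each data file with a digest. The function names are hypothetical, and SHA-256 is an assumed choice of digest algorithm; the guidelines do not name one.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Compute a SHA-256 digest without loading the whole file into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_readme(fields: dict, data_files: list) -> str:
    """Return readme text: one 'Field: value' line each, then a file inventory."""
    lines = [f"{name}: {value}" for name, value in fields.items()]
    lines.append("File inventory (SHA-256 checksum, file name):")
    for path in data_files:
        lines.append(f"  {sha256_of(path)}  {path.name}")
    return "\n".join(lines) + "\n"


def changed_files(stored: dict, directory: Path) -> list:
    """Return the file names whose recomputed digest differs from the stored one."""
    return [
        name
        for name, digest in stored.items()
        if sha256_of(directory / name) != digest
    ]
```

A typical use would be to call `build_readme` once when a dataset version is finalized, write the result next to the data, and run `changed_files` later against the stored digests to confirm that nothing has been altered.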

How much data will the project produce?

To avoid being under- or over-prepared, it is wise to estimate the growth rate of your data. Are you manually collecting and recording data? Are you using observational instruments and computers to collect data? Is data collection highly iterative? From the start of the project to its conclusion, how much do you expect the data store to increase over regular intervals, say every month or every 90 days? How much data do you anticipate collecting and generating by the end of your project?
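
A rough estimate can be worked out with simple arithmetic. The sketch below assumes linear growth (a fixed amount added per interval) and an optional number of stored copies to account for backups; both the function name and the linear model are illustrative assumptions, and projects whose collection rate accelerates would need a different model.

```python
def project_storage(initial_gb, growth_gb_per_interval, intervals, copies=1):
    """Project storage needs, assuming the same amount is added every interval.

    Returns one value per interval boundary, starting with the initial size.
    `copies` multiplies each value to cover backups or replicas.
    """
    if intervals < 0 or copies < 1:
        raise ValueError("intervals must be >= 0 and copies must be >= 1")
    return [
        (initial_gb + growth_gb_per_interval * i) * copies
        for i in range(intervals + 1)
    ]
```

For example, with made-up figures of 50 GB at the start and 20 GB added per month, `project_storage(50, 20, 24)[-1]` gives 530 GB after 24 months, or 1590 GB if three copies are kept (`copies=3`).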

How often will the data change or be updated?

The answer to this question affects how you organize the data as well as the level of versioning you will need to undertake. Keeping track of rapidly changing datasets can be a challenge, so it is imperative that you begin with a plan to carry you through the entire data management process.
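
One way to keep track of frequently changing data is to label every saved version with a date/time stamp plus a short content digest. The sketch below is an assumption about how such labels might be built, not a standard: the function name, the label layout, and the 12-character digest prefix are all hypothetical choices.

```python
import hashlib
from datetime import datetime, timezone


def version_label(content: bytes, when: datetime) -> str:
    """Build a label such as '2014-02-06T120000Z_<digest prefix>'.

    `when` should be a timezone-aware datetime; it is converted to UTC so
    that labels sort chronologically regardless of where they were made.
    """
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H%M%SZ")
    fingerprint = hashlib.sha256(content).hexdigest()[:12]
    return f"{stamp}_{fingerprint}"


def is_new_version(content: bytes, previous_label: str) -> bool:
    """Report whether the content differs from the previously labeled version."""
    previous_fingerprint = previous_label.rsplit("_", 1)[-1]
    return hashlib.sha256(content).hexdigest()[:12] != previous_fingerprint
```

Comparing digests before saving avoids creating a new version when nothing has actually changed, which keeps the version history of a rapidly updated dataset manageable.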

Credit to MIT Libraries for permission to use and adapt their pages and to members of the UC3 community.
Please send us any comments about these guidelines.

Creative Commons License

Last updated: February 06, 2014
Document owner: Perry Willett