Data documentation during the project

Tip: Why is this topic important
» Understanding, analysing and reusing data depends on how the data has been documented, structured, named and otherwise described
» Including metadata (data about the data used in a project) ensures that the data can be properly utilised, both within and beyond your own project
» Interpretation of project results requires an understanding of the data provenance/data lineage, i.e. where the data originates from and how it has been processed
» Data documentation should start as early as possible, ideally in the form of accompanying structured metadata, while ensuring that the data is accessible

About this chapter

This chapter includes information about how metadata and other accompanying information will be handled in the active phase of the project.

Question-specific guidance

How will you connect data and respective metadata/data documentation?

Metadata is data about data, providing the necessary context that allows understanding or use of data. Providing this information in a structured way facilitates data reuse. Metadata can be descriptive (e.g. title, data content, date of creation), structural (e.g. explaining file organisation), inform about data provenance (e.g. data origin, versions), administrative (e.g. access permissions), legal (e.g. data license), or technical (e.g. data format, tools and software). A metadata standard is a predefined way of describing data.

Often there will be multiple ways in which data and metadata can be linked within a project. Basic descriptive techniques will be relevant to many projects: these include structured and consistent naming of files and folders, using a README-file to provide information, using embedded metadata in files, or using a separate metadata-file (a sidecar file for each file in the dataset). More advanced techniques that may be relevant include using a database system for linking metadata and data, or establishing a data/variable dictionary for the data in the project.
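As an illustration of the sidecar-file technique described above, the following is a minimal sketch in Python. The `.meta.json` suffix, the field names and the function names are assumptions made for this example, not a prescribed format; adapt them to the metadata standard you choose.

```python
import json
from datetime import date
from pathlib import Path


def sidecar_path(data_file: Path) -> Path:
    """Return the sidecar path for a data file, e.g. obs.csv -> obs.csv.meta.json."""
    return data_file.with_name(data_file.name + ".meta.json")


def write_sidecar(data_file: Path, title: str, creator: str, description: str) -> Path:
    """Write a small JSON metadata file next to the data file."""
    record = {
        "file": data_file.name,
        "title": title,
        "creator": creator,
        "description": description,
        "documented_on": date.today().isoformat(),
    }
    target = sidecar_path(data_file)
    target.write_text(json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8")
    return target


def undocumented_files(folder: Path, pattern: str = "*.csv") -> list[Path]:
    """List data files in a folder that do not yet have a sidecar file."""
    return sorted(p for p in folder.glob(pattern) if not sidecar_path(p).exists())
```

Running `undocumented_files` on a project folder at regular intervals is one simple way to notice data files that were added without documentation.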

For more information, for instance on what to write in a README-file, please consult the relevant chapters in resources such as the RDMkit for life sciences, the CESSDA Data Management Guide, or The Turing Way handbook.

Supplementary info: Almost all computer systems record some system metadata for files, for example the creation date and who has edit or read-only access. Most systems also allow users to add their own metadata that can be embedded in files, such as descriptive tags.
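To show what such system metadata looks like in practice, here is a small sketch using only the Python standard library. The function name and the selection of fields are assumptions for this example. It reports the modification time rather than the creation time, because creation time is not available in a portable way on all operating systems.

```python
import stat
from datetime import datetime, timezone
from pathlib import Path


def system_metadata(path: Path) -> dict:
    """Collect a few file-system attributes that the operating system records."""
    info = path.stat()
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "name": path.name,
        "size_bytes": info.st_size,
        "modified_utc": modified.isoformat(timespec="seconds"),
        # Symbolic permission string in the style of "ls -l"
        "permissions": stat.filemode(info.st_mode),
    }
```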

Do suitable metadata standards exist for the data?

A metadata standard/schema is a predefined set of attributes to describe data in a clear and consistent way. Metadata standards/schemas can be generic or discipline-specific, and research communities have worked together to define what kind of metadata is needed when research of a certain kind is performed and described. Metadata standards/schemas are structured and machine-readable. Complete metadata helps to organise data during the project, and is necessary for data archiving.

Most research data repositories implement specific standards, and the use of a particular archive often leads to the use of a particular metadata standard. It is therefore useful to investigate suitable data repositories and respective metadata standards (or “Minimal Information Standards”) early in the research process to make sure relevant metadata is collected when it first becomes available.
To enter information about “Minimal Information Standards” that will be applied, standards from the FAIRsharing registry of standards can be selected in the Wizard.

Some examples of metadata standards:

DataverseNO institutional archives
- The metadata standard in DataverseNO combines generic and discipline-specific elements. The user guide provides detailed guidance.
Life sciences
- The European Nucleotide Archive (ENA) requires that all samples conform to a defined checklist of expected metadata values, and provides checklists for different types of samples. The ELIXIR Norway helpdesk may assist with data archiving.
Language sciences
- Clarino requires that CMDI metadata are provided. The repository may assist in producing the CMDI metadata.
- TROLLing is part of DataverseNO, with similar metadata requirements, described in the deposit guidelines.

Supplementary info: When unsure about relevant metadata standards within your field, the Dublin Core standard defines a minimum set of values and is embedded in many more comprehensive standards.
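A metadata record based on Dublin Core can be checked with a few lines of code. The sketch below lists the fifteen elements of the Dublin Core Metadata Element Set; the `REQUIRED` subset and the function name are assumptions that stand in for a project-specific policy, since Dublin Core itself does not make any element mandatory.

```python
# The fifteen elements of the Dublin Core Metadata Element Set
DC_ELEMENTS = frozenset({
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
})

# Hypothetical project policy: elements every record should fill in
REQUIRED = ("title", "creator", "date", "rights")


def check_record(record: dict) -> dict:
    """Report keys that are not Dublin Core elements and required elements left empty."""
    unknown = sorted(key for key in record if key not in DC_ELEMENTS)
    missing = [key for key in REQUIRED if not record.get(key)]
    return {"unknown": unknown, "missing": missing}
```

For example, a record with only `title`, `creator` and a non-standard `colour` key would be reported with `colour` as unknown and `date` and `rights` as missing.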

Will you use existing vocabularies/ontologies/terminologies to describe the data?

Using defined terms ensures that your data is described consistently, reducing ambiguity and enabling interoperability across systems and disciplines. Controlled vocabularies provide standardised terms, while ontologies add structure by defining hierarchies and relationships between concepts. Please consider which controlled vocabularies, ontologies, or terminologies are relevant within your field of research and can be applied to increase precision when describing the research data.

To enter information on vocabularies/ontologies/terminologies that will be applied, vocabularies/ontologies/terminologies from the FAIRsharing registry of standards can be selected in the Wizard.

For some disciplines, look-up services can help identify relevant vocabularies/ontologies/terminologies. When in doubt about relevance, please look for usage by others within your field, for example in published journal articles or in connection with published datasets.
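Once a vocabulary has been chosen, values in the data can be checked against it automatically. The following is a minimal sketch; the `HABITAT_TERMS` list is a made-up placeholder, not taken from any published vocabulary, and would be replaced by terms exported from the vocabulary you actually use.

```python
from difflib import get_close_matches

# Hypothetical term list standing in for a real controlled vocabulary
HABITAT_TERMS = ("forest", "grassland", "wetland", "marine")


def check_terms(values, vocabulary):
    """Map each value that is not in the vocabulary to its closest suggested term, if any."""
    problems = {}
    for value in values:
        term = value.strip().lower()
        if term not in vocabulary:
            # Returns an empty list when no term is sufficiently similar
            problems[value] = get_close_matches(term, vocabulary, n=1)
    return problems
```

With the placeholder list, the value `wetlands` would be flagged with the suggestion `wetland`, while `Forest` passes because values are compared in lower case.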

Some examples of disciplinary vocabularies/ontologies/terminologies:

Life sciences
- Darwin Core to describe information about biological diversity
- Gene Ontology for annotation of genes, gene products and sequences
Biomedical science
- Medical Subject Headings (MeSH) are used for indexing, cataloguing, and searching for biomedical and health-related information and documents
- Human Phenotype Ontology to describe phenotypic features encountered in human hereditary and other diseases
Geography
- Marine Regions aims to create a standard, relational list of geographic names, coupled with information and maps of the geographic location of these features
Social sciences
- European Language Social Science Thesaurus (ELSST)

How are the rights to the collected data distributed?

Before data collection starts, discuss and agree among project members on usage rights and potential intellectual property rights. Defining rights and providing licenses to collected data will often reduce the potential for later conflicts around internal and external use (and reuse) of research data within and after the project period.

Not all data are covered by the Copyright Act. Some data may take the form of databases, which may qualify for protection in their own right. If the data counts as a database, the institution will often hold the rights to the database. However, this does not exclude usage rights for the researchers.

If intellectual property rights are defined through a contract/agreement, make sure to refer to it in relation to the involved organisations in the chapter ‘Legal and ethical aspects’.

If the data is owned or copyrighted by external bodies, select that option and elaborate in the follow-up question and in the next question on “use restrictions”.
Please note that there is no fair use clause in the Norwegian Copyright Act, so using data from secondary sources may restrict future sharing. This can be described in the chapter ‘Archiving and publishing data’.

If there is a consortium agreement or rights are arranged in another way, please make sure to list any relevant contracts or agreements.

Are there any use restrictions for these data?

Are there any limitations on the use of the data, such as use restricted to research on certain types of diseases, or sharing only within certain geographical boundaries?
If applicable, describing data use in a formalised way greatly improves data reusability. Explicitly stating usage permissions or restrictions is recommended over applying a restrictive data license. Data licenses are addressed in the next question.

Examples of use definition:

Data Use Ontology (DUO) is an international standard, which provides codes to represent data use restrictions for controlled access datasets
Open Digital Rights Language (ODRL) is a policy expression language
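As a rough illustration of a formalised use description, the sketch below attaches DUO codes to a dataset record. The three codes and labels are given as commonly listed in the Data Use Ontology and should be verified against the current DUO release before use; the record layout, the function name and the dataset identifier in the usage note are assumptions for this example.

```python
# Small excerpt of DUO terms; verify codes and labels against the current DUO release
DUO_TERMS = {
    "DUO:0000042": "general research use",
    "DUO:0000007": "disease specific research",
    "DUO:0000022": "geographical restriction",
}


def describe_use_conditions(dataset_id: str, codes: list[str], notes: str = "") -> dict:
    """Build a use-condition record, rejecting codes that are not in the excerpt above."""
    unknown = [code for code in codes if code not in DUO_TERMS]
    if unknown:
        raise ValueError(f"Unknown DUO codes: {unknown}")
    return {
        "dataset": dataset_id,
        "use_conditions": [{"code": c, "label": DUO_TERMS[c]} for c in codes],
        # Free text for details the codes cannot express, e.g. which disease or region
        "notes": notes,
    }
```

A call such as `describe_use_conditions("my-dataset", ["DUO:0000007", "DUO:0000022"], notes="...")` would then express both a disease-specific and a geographical restriction in one machine-readable record.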

Further reading:

FAIRCookbook: Permitted uses of data
Article: Alter, G., Gonzalez-Beltran, A., Ohno-Machado, L., & Rocca-Serra, P. (2020). The Data Tags Suite (DATS) model for discovering data access and use requirements. GigaScience, 9(2), giz165. doi: 10.1093/gigascience/giz165

Will a license be assigned to the data as early as possible?

It is not always clear to everyone in the project (and beyond) what can and cannot be done with a data set. Being clear about reuse conditions and assigning data a license is one requirement of the FAIR principles.

It is helpful to associate each data set with a license as early as possible in the project, and the license should be stored together with the data at all times. A data license should ideally be as free as possible: any restriction like ‘only for non-commercial use’ or ‘attribution required’ may have undesired implications and may reduce reusability, and thereby the number of citations. If possible, use a machine-readable and machine-actionable license.

Supplementary info: Attribution requirements can lead to inconvenient license stacking and thus limit reuse. Similarly, restricting commercial use can have unintended consequences.
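One way to keep a machine-readable license together with the data is to store a small license file in the dataset folder. The sketch below uses SPDX license identifiers for two Creative Commons licenses; the file name `license.json`, the record layout and the function name are assumptions for this example, and the choice of license remains a project decision.

```python
import json
from pathlib import Path

# SPDX identifiers mapped to the canonical license URLs
LICENSES = {
    "CC0-1.0": "https://creativecommons.org/publicdomain/zero/1.0/",
    "CC-BY-4.0": "https://creativecommons.org/licenses/by/4.0/",
}


def attach_license(dataset_dir: Path, spdx_id: str) -> Path:
    """Write a machine-readable license record into the dataset folder."""
    if spdx_id not in LICENSES:
        raise ValueError(f"License not in the project list: {spdx_id}")
    record = {"license": spdx_id, "url": LICENSES[spdx_id]}
    target = dataset_dir / "license.json"
    target.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return target
```

Because the file travels with the folder, the license stays attached when the data is copied, archived or shared.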
