Collect & annotate data

Reusing data and research software (anything from simple scripts to full libraries) is difficult if you don’t know what they contain and how they were created. It is therefore important to document the data and software of your project, including their metadata, and to keep track of changes through version control.

Metadata

Descriptive metadata is indispensable for the preservation, retrieval and re-use of datasets and research software. 

Metadata about data provides answers to questions concerning the person creating the data, the subject of the data, the type of file(s), geographic information and other aspects. 

Metadata relevant for research software includes the programming language, the operating system on which the code can run, the version of the software, the license, etc. The CodeMeta project provides the CodeMeta generator, a tool that helps to collect minimal metadata and export it as a JSON file.
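As an illustration of what such a JSON file can contain, the sketch below assembles a minimal CodeMeta-style record in Python. The project name, version and license URL are hypothetical placeholders, and the record covers only the fields mentioned above; the CodeMeta generator itself remains the reference for the full set of terms.

```python
import json


def build_codemeta(name, version, language, operating_system, license_url):
    """Return a minimal CodeMeta-style record as a dictionary."""
    return {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": name,
        "version": version,
        "programmingLanguage": language,
        "operatingSystem": operating_system,
        "license": license_url,
    }


# All values below are hypothetical placeholders for illustration.
record = build_codemeta(
    name="example-analysis",
    version="0.1.0",
    language="Python",
    operating_system="Linux",
    license_url="https://spdx.org/licenses/MIT",
)

# Serialise the record; by convention this text is saved as codemeta.json
# in the root of the code repository.
codemeta_json = json.dumps(record, indent=2)
print(codemeta_json)
```

Keeping the record in a single file next to the code means the metadata is versioned together with the software it describes.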

There are multiple types of metadata, for example:

  • Embedded metadata
  • Additional data documentation
  • Findable metadata in data repositories
  • Disciplinary metadata standards

Embedded metadata

Sometimes metadata is embedded directly within data files. Some scientific instruments record metadata about the files automatically, either in the document properties or within the files themselves. Some examples include:

  • FASTQ files - these are plain-text files used in life sciences (bioinformatics in particular) which store nucleotide sequences together with information about each sequence
  • TIFF files - these files often contain additional information about images and how they were recorded
  • FITS files - this is a file standard widely used in astronomy to store images and tables. FITS files contain headers with metadata about the data
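To make the idea of embedded metadata concrete, the following sketch reads the header line of each record in a FASTQ file using only the Python standard library. It assumes the common four-line record layout (header, sequence, separator, quality string) with no wrapped lines and no blank lines; the read names and the "instrument" and "lane" annotations are invented for the example.

```python
from io import StringIO


def read_fastq_headers(handle):
    """Yield (identifier, description, sequence_length) for each record.

    Assumes every record occupies exactly four lines.
    """
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with "+"
        handle.readline()  # quality string, not needed here
        if not header.startswith("@"):
            raise ValueError(f"Expected '@' at start of record, got: {header!r}")
        # The identifier runs up to the first space; the rest is free text.
        identifier, _, description = header[1:].rstrip("\n").partition(" ")
        yield identifier, description, len(sequence)


# Hypothetical in-memory file standing in for a real FASTQ file.
example = StringIO(
    "@read1 instrument=demo lane=1\n"
    "GATTACA\n"
    "+\n"
    "IIIIIII\n"
    "@read2 instrument=demo lane=2\n"
    "ACGT\n"
    "+\n"
    "IIII\n"
)

headers = list(read_fastq_headers(example))
for identifier, description, length in headers:
    print(identifier, description, length)
```

For real projects a dedicated library is usually a better choice than a hand-written parser, since FASTQ variants differ in how they use the header line.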

Additional data documentation

Metadata about the data and research software can also be recorded by creating dedicated README files.

For datasets these are usually .txt documents with necessary information about the data, which are stored next to the data files. 4TU.Research Data offers useful guidance on how to create README files.

For research software, a README file can be created as part of a code repository hosted on a platform like GitHub or GitLab. You can find a README file template here, which additionally complies with the TUD Research Software Policy.

Some researchers, especially in the social sciences, create codebooks which explain the dataset and provide information such as codes (abbreviations or labels for categorical variables) used in the dataset, field and label descriptions. The Data Documentation Initiative provides useful guidance on how to create a codebook for your data.
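A codebook can be as simple as a table that maps each code to its label. The sketch below shows one possible machine-readable form in Python; the variable name, codes and labels are hypothetical and do not follow any particular Data Documentation Initiative template.

```python
import csv
from io import StringIO

# Hypothetical codebook: variable name -> description and value labels.
codebook = {
    "employment": {
        "description": "Employment status of the respondent",
        "codes": {1: "Employed", 2: "Unemployed", 9: "No answer"},
    },
}


def decode(variable, value):
    """Translate a coded value into its human-readable label."""
    return codebook[variable]["codes"].get(value, "Unknown code")


def codebook_to_csv(book):
    """Flatten the codebook into CSV text, one row per code."""
    buffer = StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["variable", "description", "code", "label"])
    for variable, entry in book.items():
        for code, label in entry["codes"].items():
            writer.writerow([variable, entry["description"], code, label])
    return buffer.getvalue()


print(decode("employment", 2))
print(codebook_to_csv(codebook))
```

Storing the resulting CSV next to the dataset lets both people and scripts interpret the coded values without access to the original survey.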

Documentation in an Electronic Lab Notebook

An electronic laboratory notebook (commonly known as an ELN or a digital lab notebook) is a software system that helps scientists document their research, keep it reproducible and share information more easily. Electronic lab notebooks provide a text editor for writing notes in a way that replicates a paper notebook, along with other functionality: spreadsheet tools for calculations and for formatting tables and graphs, protocol templates for documenting standard procedures, laboratory inventories for documenting samples, reagents and apparatus, and collaboration tools for sharing experimental information. TU Delft has a subscription to two electronic lab notebook tools: eLABJournal and RSpace.

Findable metadata in data repositories

When you publish data or research software in a data repository, they become findable. Findability is ensured through the use of metadata: information such as the title of the dataset or research software, the names of the authors, keywords, institutional affiliation, etc. Data repositories usually adhere to metadata standards. For example, 4TU.Research Data uses the DataCite metadata schema as well as additional Dublin Core metadata.
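Before submitting to a repository, it can help to check that the basic descriptive fields are filled in. The sketch below does this for the properties that the DataCite schema treats as mandatory (identifier, creator, title, publisher, publication year and resource type). The key names are simplified for illustration and are not the exact element names of the schema, and the draft record is hypothetical.

```python
# Simplified stand-ins for the mandatory DataCite properties (assumed names).
REQUIRED_FIELDS = (
    "identifier",
    "creators",
    "title",
    "publisher",
    "publication_year",
    "resource_type",
)


def missing_fields(record):
    """Return the required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Hypothetical draft record with two fields still to be completed.
draft = {
    "identifier": "10.1234/example-doi",
    "creators": ["Doe, Jane"],
    "title": "Example measurement dataset",
    "publisher": "",
    "resource_type": "Dataset",
}

print(missing_fields(draft))
```

In practice the repository's submission form enforces these fields; a check like this is mainly useful when metadata is prepared in bulk or generated by a script.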

Disciplinary metadata standards

Some disciplines have community-agreed metadata standards, which define the minimal information which is needed to understand and re-use research data.

FAIRsharing is a registry where you can find disciplinary standards for data.

Version control

If you work on your data for a period of time or you develop software, it is useful to introduce some kind of version management to be able to properly track the changes.

There are two main systems available at TU Delft for version control: GitLab and Subversion.

GitLab

GitLab is a version control system provided by TU Delft with a backup facility, particularly useful for working with code and software. External collaborators can be given access. You can create a GitLab repository yourself at https://gitlab.tudelft.nl

More information and a form to request access for external users can be found at Top Desk.

Subversion

Subversion is a version control system provided by TU Delft with a backup facility, particularly useful for working with data. Access can be controlled by the repository owner.

A Subversion repository can be requested through Top Desk.

Alternatives

Other tools that provide version control for code include GitHub and Gitea. Colleagues from the TU Delft Digital Competence Centre have developed a useful flyer to guide you through the available options.

For comprehensive training material on Git, please check this GitLab repository.

Support

For advice, contact your Faculty Data Steward.
