Collect & annotate data

Reusing data is difficult if you don’t know what the data is about and how it was created. Therefore, it is important to provide appropriate data about your data (metadata) and also to keep track of changes to your data (version control).

Metadata

Descriptive metadata is indispensable for the preservation, retrieval and re-use of datasets. It provides answers to questions concerning the person creating the data, the subject of the data, the type of file(s), geographic information and other aspects. 

There are multiple types of metadata, for example:

  • Embedded metadata

  • Additional data documentation

  • Discovery metadata in data repositories

  • Disciplinary metadata

Embedded metadata
Sometimes metadata can be embedded directly within data files. Some scientific instruments record metadata about the files automatically, either in the document properties or within the files themselves. Some examples include:

  • FASTQ files - these are text-based files used in the life sciences (bioinformatics in particular) which store nucleotide sequences together with their quality scores; each record starts with an identifier line that can carry further information about the read

  • TIFF files - these files often contain additional information about images and how they were recorded

  • FITS files - this is a file standard widely used in astronomy to store images and tables. FITS files contain headers with metadata describing the data
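
To make the idea of embedded metadata more concrete, the following is a minimal sketch of how the identifier line of a FASTQ record could be read with plain Python. It assumes the common four-line record layout (header, sequence, separator, quality); the function name parse_fastq and the example header fields are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str   # header text after '@', up to the first space
    description: str  # rest of the header line; may be empty
    sequence: str     # nucleotide sequence
    quality: str      # per-base quality string


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming the common four-line layout.

    Blank lines are skipped and an incomplete trailing record is ignored.
    """
    content = [line.rstrip("\n") for line in lines if line.strip()]
    for start in range(0, len(content) - 3, 4):
        header, sequence, separator, quality = content[start:start + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(
                f"Malformed FASTQ record starting at non-blank line {start + 1}"
            )
        identifier, _, description = header[1:].partition(" ")
        yield FastqRecord(identifier, description, sequence, quality)


# Hypothetical record; the header fields are invented for this example.
example_lines = [
    "@read1 instrument=hypothetical-sequencer run=42\n",
    "GATTACA\n",
    "+\n",
    "IIIIIII\n",
]
records = list(parse_fastq(example_lines))
print(records[0].identifier, "|", records[0].description)
```

In practice, established bioinformatics libraries are usually a better choice than a hand-written parser; the sketch only shows where the embedded information sits in the file.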

Additional data documentation
Metadata about the data can also be recorded outside of the actual data files. The most common way of doing this is to create dedicated README files. These are usually plain-text (.txt) documents containing the necessary information about the data, and they are stored next to the data files. 4TU.Research Data offers useful guidance on how to create README files.
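
As a rough illustration, a README can also be generated from a few structured fields so that every dataset is documented in the same way. The sketch below is only one possible layout: the function build_readme, the section names and all example values are assumptions made for this illustration and are not taken from the 4TU.Research Data guidance.

```python
def build_readme(title, authors, description, files, licence):
    """Return the text of a simple plain-text README.

    `files` maps each file name to a one-line explanation of its content.
    """
    lines = [
        f"Title: {title}",
        f"Authors: {', '.join(authors)}",
        "",
        "Description:",
        description,
        "",
        "Files:",
    ]
    lines += [f"  {name}: {purpose}" for name, purpose in files.items()]
    lines += ["", f"Licence: {licence}"]
    return "\n".join(lines) + "\n"


# Hypothetical dataset details, invented for this example.
readme_text = build_readme(
    title="Example temperature measurements",
    authors=["A. Researcher", "B. Colleague"],
    description="Hourly temperature readings from a hypothetical test set-up.",
    files={
        "measurements.csv": "raw readings, one row per hour",
        "calibration.txt": "notes on sensor calibration",
    },
    licence="CC BY 4.0",
)
print(readme_text)
```

The resulting text can be saved as README.txt next to the data files.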

Some researchers, especially in the social sciences, create code books which explain the dataset and provide information such as code, field and label descriptions. The Data Documentation Initiative provides useful guidance on how to create a code book for your data.
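
A code book can be as simple as a table with one row per variable. The following sketch writes such a table as CSV text using Python's standard csv module. The column names (variable, label, values) and the example survey variables are assumptions for illustration only; they do not follow any particular Data Documentation Initiative template.

```python
import csv
import io

# Hypothetical survey variables, invented for this example.
codebook = [
    {
        "variable": "age_group",
        "label": "Age group of the respondent",
        "values": "1=18-29; 2=30-49; 3=50 or older",
    },
    {
        "variable": "commute",
        "label": "Main mode of transport to work",
        "values": "1=bicycle; 2=public transport; 3=car; 9=no answer",
    },
]


def codebook_to_csv(entries):
    """Serialise code book entries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["variable", "label", "values"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


codebook_csv = codebook_to_csv(codebook)
print(codebook_csv)
```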

Discovery metadata in data repositories
When you upload your data to a data repository, the repository will also make your data discoverable. Discoverability is ensured through the use of discovery metadata: information such as the title of the dataset, the names of the authors, keywords and institutional affiliation. Data repositories usually adhere to metadata standards. For example, 4TU.Research Data uses the DataCite metadata schema as well as additional Dublin Core metadata.
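
To give an impression of what discovery metadata looks like, the sketch below describes a dataset with the properties that the DataCite schema treats as mandatory (identifier, creator, title, publisher, publication year and resource type) and checks that none of them is missing. The key names are simplified for readability and are not the official DataCite serialisation; the DOI and all other values are hypothetical, and the helper missing_fields is an assumption made for this example.

```python
# Simplified stand-ins for the mandatory DataCite properties (assumed names).
REQUIRED_FIELDS = (
    "identifier",
    "creators",
    "title",
    "publisher",
    "publication_year",
    "resource_type",
)


def missing_fields(record):
    """Return the required fields that are absent or empty in `record`."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Hypothetical dataset description; the DOI is a placeholder, not a real one.
dataset_record = {
    "identifier": "10.1234/example-doi",
    "creators": ["A. Researcher", "B. Colleague"],
    "title": "Example temperature measurements",
    "publisher": "4TU.Research Data",
    "publication_year": 2024,
    "resource_type": "Dataset",
    "keywords": ["temperature", "sensor data"],  # optional, aids discovery
}

print(missing_fields(dataset_record))
print(missing_fields({"title": "Only a title"}))
```

In a real deposit the repository's upload form collects these fields for you; the check merely shows which information is needed before a dataset can be described properly.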

Disciplinary metadata standards
Some disciplines have community-agreed metadata standards that define the minimum information needed to understand and re-use research data.

FAIRsharing is a registry where you can find disciplinary standards for data.

Version control

If you work on your data for a period of time, it is useful to introduce some kind of version management to be able to properly track the changes.

There are two main systems available at TU Delft for version control: GitLab and Subversion.

GitLab
GitLab is a version control system provided by TU Delft, with a backup facility, that is particularly useful for working with code and software. External collaborators can be given access. You can create a GitLab repository yourself at https://gitlab.tudelft.nl.
More information and a form to request access for external users can be found at Top Desk.

Subversion
Subversion is a version control system provided by TU Delft, with a backup facility, that is particularly useful for working with data. Access can be controlled by the repository owner.
A Subversion repository can be requested through Top Desk.

Support

For advice on metadata and version control, contact your Faculty Data Steward.