Do’s and Don’ts of Metadata Formatting, Part 1

By Weiming Hu and Scott G. Daniel

[Paragraph to introduce metadata.]

In our lab we process multiple research projects a month, each with its own metadata. Because we like to process runs as quickly as possible, good metadata is important. Bad metadata does more than slow the process down: in the worst case, samples are switched or metadata is missing entirely. From our experience, we’ve compiled a list of helpful “Do’s” that will prevent the worst from happening and help ensure a successful project. Side note: fixing metadata has become such a time drain for members of our lab that our software engineer, Charlie Bushman, has built some of these checks into our metadata checker software: https://github.com/PennChopMicrobiomeProgram/CHOP_metadata_checker/

Do’s:

Keep sample IDs short and meaningful

Meaningful sample IDs are always a plus when working with metadata. They help you identify patterns at a glance, catch obvious errors, and establish a good order for the samples when they're sorted alphabetically by ID. A good example of a sample ID is “s1wtd1”, which gives the subject ID (s1), treatment (wt), and time point (d1).
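
To show why a structured ID pays off, here is a minimal Python sketch that splits an ID like “s1wtd1” back into its parts. The pattern and the function name parse_sample_id are our own illustration and assume every ID follows the subject/treatment/time point layout described above; they are not part of any existing tool.

```python
import re

# Hypothetical layout: "s" + subject number, lowercase treatment code,
# then "d" + day number (for example "s1wtd1").
SAMPLE_ID_PATTERN = re.compile(
    r"(?P<subject>s[0-9]+)(?P<treatment>[a-z]+?)(?P<time_point>d[0-9]+)"
)

def parse_sample_id(sample_id):
    """Return the parts of a sample ID as a dict, or None if it does not fit."""
    match = SAMPLE_ID_PATTERN.fullmatch(sample_id)
    return match.groupdict() if match else None

print(parse_sample_id("s1wtd1"))
# {'subject': 's1', 'treatment': 'wt', 'time_point': 'd1'}
```

An ID that does not fit the layout returns None, which makes typos easy to spot before the analysis starts.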

A lengthy sample ID can crash some analysis software or cause it to create output filenames with a truncated version of the ID. When running bioinformatics pipelines, it is a bit of a nightmare to have file names that don't match the sample IDs in your metadata. We find that a reasonable length for sample IDs is about 40 characters or fewer.
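
A length check like this is easy to automate. The sketch below is only an illustration: the 40-character limit is the rule of thumb from this post rather than a hard limit of any particular tool, and the function name and example IDs are made up.

```python
MAX_ID_LENGTH = 40  # rule of thumb from this post, not a limit of a specific tool

def find_long_ids(sample_ids, max_length=MAX_ID_LENGTH):
    """Return the sample IDs that are longer than max_length characters."""
    return [sid for sid in sample_ids if len(sid) > max_length]

# Made-up example IDs: one short, one far too long.
ids = ["s1wtd1", "subject_0001_wildtype_treatment_group_day_01_replicate_A"]
for sid in find_long_ids(ids):
    print(f"Too long ({len(sid)} characters): {sid}")
```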

Invest time up front to create stable sample IDs

It is always a good idea to pick your sample IDs at the beginning and keep them throughout the whole analysis. Changing sample names in the middle of the analysis is a major issue because you will most likely have to write an extra 10 lines (if not more) of code to fix and match the samples. Even worse, if you have multiple columns for your sample names (e.g. “New_sample_id”, “old_sample_id”, “original_sample_id”, etc.), you will most likely forget, five years later, which column holds the true sample name.

Use ASCII characters where possible

R can handle most characters when importing from Excel or tabular files, but it has a hard time interpreting special characters (e.g. ⧫, ◣) that can only be inserted into the file rather than typed. This includes Greek characters (αβω) too. The result can be sample names that diverge from the file names.
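
Non-ASCII characters can be hard to spot by eye, so a quick scan before submitting can help. This is a minimal Python sketch with made-up example IDs; the function name find_non_ascii is ours.

```python
def find_non_ascii(sample_ids):
    """Map each offending sample ID to the list of its non-ASCII characters."""
    problems = {}
    for sid in sample_ids:
        bad_chars = [ch for ch in sid if not ch.isascii()]
        if bad_chars:
            problems[sid] = bad_chars
    return problems

# Made-up example IDs: the second one contains a Greek alpha.
print(find_non_ascii(["s1wtd1", "s1αd1"]))
```

The same scan works on any other text column, such as treatment labels or free-text notes.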

Use ISO format for dates

Short of writing an entire blog post on the best date format, we recommend consistently using the ISO 8601 format (YYYY-MM-DD). A consistent ISO date format can save you a lot of time when you are working in R and trying to extract, sort, and categorize your items by date.
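
As a rough illustration, the Python sketch below checks whether a value is a valid YYYY-MM-DD date. The function name is_iso_date is our own. The regular expression enforces the zero-padded layout, and datetime.strptime then rejects impossible dates such as February 30.

```python
import re
from datetime import datetime

# Exactly four digits, two digits, two digits, separated by hyphens.
ISO_DATE_LAYOUT = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

def is_iso_date(value):
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE_LAYOUT.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True

for value in ["2021-01-05", "01/05/2021", "2021-1-5", "2021-02-30"]:
    print(value, is_iso_date(value))
```

One practical benefit of this format is that ISO dates sort chronologically even when they are treated as plain text, which is not true of formats like MM/DD/YYYY.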

Double check dates when exporting from Excel

Dates sometimes turn into seemingly nonsensical numbers when the date formatting is lost in Excel (for example, 44197 instead of 2021-01-01), so always double-check that your dates are correct before importing into R.
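
Those numbers are usually Excel's internal day counts, so they can often be recovered. The Python sketch below assumes the workbook uses Excel's default 1900 date system and that the dates fall after February 1900; under that assumption, day counts are measured from 1899-12-30. The function names are our own, and the five-digit test is only a heuristic, so check a few converted values against the original spreadsheet.

```python
from datetime import date, timedelta

# Assumption: Excel's default 1900 date system, dates after February 1900.
EXCEL_EPOCH = date(1899, 12, 30)

def looks_like_excel_serial(value):
    """Heuristic: a bare five-digit number in a date column is suspicious."""
    return value.isdigit() and len(value) == 5

def excel_serial_to_date(serial):
    """Convert an Excel day count (int or digit string) to a date."""
    return EXCEL_EPOCH + timedelta(days=int(serial))

for value in ["2021-01-01", "44197"]:
    if looks_like_excel_serial(value):
        print(value, "->", excel_serial_to_date(value).isoformat())
```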

Add a data dictionary for rich metadata

To help other people understand the abbreviations created for a project, a data dictionary really helps. It could save you the trouble of multiple emails back-and-forth, so please do it!
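
A data dictionary does not need to be elaborate. The Python sketch below uses made-up column names and descriptions (they are purely illustrative, not taken from a real project) and checks that every metadata column has an entry.

```python
# Hypothetical data dictionary: column name -> plain-language description.
data_dictionary = {
    "sample_id": "Unique sample identifier, e.g. s1wtd1",
    "subject_id": "Identifier of the subject the sample came from",
    "collection_date": "Date the sample was collected (YYYY-MM-DD)",
}

def undocumented_columns(columns, dictionary):
    """Return the metadata columns that have no data dictionary entry."""
    return [col for col in columns if col not in dictionary]

# Hypothetical metadata header with one column nobody has described yet.
metadata_columns = ["sample_id", "subject_id", "collection_date", "abx_grp"]
print(undocumented_columns(metadata_columns, data_dictionary))
```

A column with a cryptic name and no entry is exactly the kind of thing that triggers those back-and-forth emails.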

In summary, we hope that these tips help you format your metadata. It’s important to note that this is not an exercise in cruelty: following these guidelines saves time for both us, the sequencing center, and you, the metadata submitter. Please leave a comment if we’ve forgotten anything!

Stay tuned for Part 2: the “Don’ts” of metadata formatting (including some tips on how to fix those issues). Part 2 will be helpful to biodata scientists who have the task of cleaning metadata.