Acquisition and Management of Data

Responsible Conduct of Research (RCR)
Case Study Modules

Preface

The acquisition and management of data are vital to the research record. The acquisition of data begins with the execution of a research plan, which in turn relies on a scientific premise and an experimental design, ideally one that considers variables, statistical power, and an authentication of key biological or chemical resources – in short, the elements of reproducibility. The management of data requires a complete and accurate representation of the data, a full accounting of protocols and the logic underlying them, a means of authenticating results by co-workers and others seeking replication, and a protection of such information from loss and inappropriate intrusion.

Background

There is no element of RCR training more fundamental to the process of research itself than the acquisition and management of data. The topic is, in fact, not limited to RCR – acquisition of data occupies a central position in the concept of rigorous experimental design, and management of data is central to the concepts of reproducibility and transparency, all components of training in scientific rigor and reproducibility (SRR).

The management of data devolves to the laboratory notebook. The notebook, as a record of data, interpretation, and decisions, is central to one’s ability to recall, communicate, and substantiate one’s work. It is a vital tool whose importance cannot be overstated. The methods by which a notebook is best maintained, therefore, are given considerable attention in this module.

There are currently no formal policies at Penn covering the types or format of record keeping. The only policy of relevance, which pertains to BGS students directly, is to have laboratory notebooks inspected at each dissertation committee meeting. This procedure is copied here:

All dissertation level students must bring their lab notebooks to their dissertation committee meetings. If a student has previously met with the dissertation committee, then he or she should bring the notebooks used since the last committee meeting. The dissertation committee should make a point of reviewing notebook data at each meeting. There is no expectation that lab notebooks be reviewed in their entirety. However, the dissertation committee should feel confident that the student's lab data are complete and well managed. Graduate groups may wish to impose additional requirements, such as assigning a particular committee member responsibility for reviewing lab notebooks or arranging for the notebooks to be reviewed in greater detail by a committee member outside of the meetings. In any event, each graduate group’s dissertation committee meeting report forms will be amended so as to document the committee’s review of lab notebooks.

Policies and Guidelines

All BGS students must be able to effectively manage data utilizing laboratory notebooks as the medium for keeping records. This is true regardless of the nature of the research – whether it is conducted at the bench entirely, utilizes computational methods alone, or employs any of the other myriad forms of data acquisition.

The laboratory notebook provides a record of primary and secondary data. It can, and should, provide as well a record of collaboration, interpretation, and decisions. It allows authentication of work by outside parties and proves ownership in claims to discovery. The notebook often represents the critical bridge between different studies occurring at different times in the lab, allowing researchers to identify, replicate, and build upon previous findings – including both positive and negative results. Scientific rigor, reproducibility, and progress all depend critically on one’s ability to maintain an appropriate and effective laboratory notebook.

A one-size-fits-all notebook is neither feasible nor appropriate, given the diversity in types of research. Guidelines for major classes of notebooks are provided in this document; however, proper implementation of these guidelines relies on the judgment of the student and principal investigator, usually with input from the thesis committee.

Section 1 of this document covers traditional bound notebooks, in which entries are typically made by hand with several options available for storage of primary and secondary data.

Section 2 covers ‘electronic notebooks’ specifically developed to replace traditional notebooks as stand-alone or web-based programs.

Section 3 covers a form of computational notebook in which a researcher performs an analysis using a computer program, a statistical package, or a website server, or otherwise makes a scientific inference using computational methods and/or ‘Big Data.’

Section 4 covers a form of computational notebook in which a researcher constructs a new computational tool used for the analysis of data as a resource that others could utilize.

It is important to understand that notebooks (and other products of research) are the property of the University, not of the student or principal investigator.

General guidelines

All laboratory notebooks, regardless of type, should be:

As permanent and, subsequent to each entry, as immutable as possible.
Entries can be annotated but never altered.
Complete.
It is imperative with primary data to document everything, to document everything as soon as possible, and to document everything whether it’s ‘good’ or ‘bad,’ ‘right’ or ‘wrong,’ even if an experiment is terminated prematurely. Omitting data is dishonest.
Well-organized and easily interpreted by others.
The notebook should be easily read and understood by others, including principal investigators, other lab members, members of the thesis committee, and Penn- or non-Penn-affiliated auditors tasked with monitoring and authentication. Any such reader should be able to: i) locate any piece/set of primary data and identify the experimental conditions under which it was collected; ii) identify and locate all primary data associated with a given experiment, project, analysis, or publication; and iii) determine how data are used to form decisions regarding next steps.
Always accessible to the principal investigator.
The principal investigator must be able to examine a notebook at any point in time. The principal investigator is tasked with the responsibility of authenticating, by virtue of a ‘signature’ and date, and on a frequent basis, that he or she has evaluated incremental entries and that they are compatible with the dictates of the agreed-upon system relating to management of the data.
Secure.
Notebooks and other records should be kept secure from unauthorized access, theft, and destruction through i) storage that safeguards access and ii) periodic, securely stored backups.

Section 1: Guidelines for traditional bound notebooks

The advantages of bound, page-numbered notebooks are compelling, especially at the level of authentication.
A table of contents is invaluable.
Entries should be made in chronological order and dated. No page should be skipped.
Entries should be clear and legible. They should be made with permanent non-erasable ink, never in pencil. Lines should be drawn through mistaken entries without making the entry completely illegible; correction fluid should not be used.
Entries for any single experiment should include date, purpose, materials, protocol, results, discussion, and next steps.
Entries must include primary, unedited data, and should include as well any derived data, tables, calculations, and graphs. Should primary or secondary data not be easily handwritten or affixed, or should inclusion be otherwise difficult or unwieldy, the data should be deposited in files, physical or electronic, but with clear identification in the notebook and an organization that befits that of the notebook.
If word-processing is used in place of handwritten entries, printouts should be affixed permanently to pages of the notebook. Printouts from other software programs should be treated likewise.
Backups of traditional notebooks take the form of copies, physical or electronic. The site of storage should be distinct from that of the notebooks themselves, but with due attention toward preventing unauthorized access. Only copies of notebooks, never originals, can be removed from the immediate environs of the laboratory, and only with permission of the PI.

Section 2: Guidelines for electronic notebooks used in lieu of traditional notebooks

Several commercially available electronic notebooks (ELNs) are described in the appendix. Important features to consider include verifiability, security, accessibility, organization flexibility, searchability, and version control.

An official procedure for the lab’s electronic record-keeping process should be defined and communicated by the principal investigator to all users.
The location, organization, and nature of electronic records for each user should be clearly identified.
The nature of entries with regard to content, how decisions are made, and how data are selected should conform to those recommended above for bound notebooks. Similarly, primary or secondary data that do not lend themselves to entry, or for which image manipulation must be closely monitored, should be deposited in files, physical or electronic, with clear identification in the notebook.
Verifiability is a key issue for ELNs. Due consideration should be given to timestamps, electronic signatures, and the ability to convert entries to read-only (no editing) after creation. In this sense, the date and content of primary electronic records should never be altered.
Any corrections, addenda, or correspondence relating to primary electronic records, or any other form of annotation, should be made without alteration of original entries, again with consideration given to timestamps and write-protection.
The PI should have full access in real time to the notebook. Others for whom access might be provided as needed include other lab members, members of the thesis committee, and those charged with auditing responsibilities.
Regarding storage and protection:
- Regular (daily) backup of all records should be mandated, and the process and oversight of this should be clearly prescribed and regularly monitored for compliance.
- Data on laptops, portable hard drives, and other portable media should be encrypted.
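
As an illustration of the verifiability points above (timestamps, and annotation without alteration of original entries), the following Python sketch keeps entries in an append-only log in which each record carries a UTC timestamp and a SHA-256 hash that also covers the previous record. It is a minimal sketch, not a description of how any particular ELN works; the function names `append_entry` and `verify_log` and the record fields are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(record: dict) -> str:
    """Hash the record's fields (excluding its own hash) in a stable order."""
    payload = json.dumps(
        {k: v for k, v in record.items() if k != "hash"}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list, author: str, text: str, annotates=None) -> dict:
    """Append a timestamped entry. A correction is a new entry that points
    at the entry it annotates (by index) instead of editing that entry."""
    record = {
        "index": len(log),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "author": author,
        "text": text,
        "annotates": annotates,
        "prev_hash": log[-1]["hash"] if log else None,
    }
    record["hash"] = _digest(record)
    log.append(record)
    return record


def verify_log(log: list) -> bool:
    """Return True if the chain is intact: no entry edited, dropped from
    the middle, or reordered."""
    prev_hash = None
    for position, record in enumerate(log):
        if record["index"] != position or record["prev_hash"] != prev_hash:
            return False
        if record["hash"] != _digest(record):
            return False
        prev_hash = record["hash"]
    return True


log = []
append_entry(log, "student", "Ran gel 3; lanes 2 and 5 smeared.")
append_entry(log, "student", "Correction: lane 4, not lane 5.", annotates=0)
print(verify_log(log))
```

A chain of this kind shows whether earlier entries were changed after the fact, but it cannot by itself show that the most recent entries were removed or that the clock was correct; an independent signature or witness is still needed for that.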

Section 3: Guidelines for computational notebooks involving existing, validated computational tools and/or 'Big Data'

A computational notebook of the type discussed here and in Section 4, in spirit, emulates that of a traditional notebook: a record kept of the time, method, and manner in which specific experiments and studies are performed, along with results, interpretations of the results, and next steps.

However, there are some differences between the two that should be noted:

Version control. Computational work often involves many rounds of iterative development, for example in writing and debugging code, developing algorithms, developing scripts, and prototyping steps in computational pipelines. Although digital media allow one to simply ‘update’ existing files, special attention must be given to maintaining date-stamped version control sufficient to document progress, ensure reproducibility, and, if necessary, identify and fix ‘bugs’.

Pointers to large data types (public and private); quality control and data management. Computational work can also utilize large amounts of data, either generated by the host lab or downloaded from publicly available repositories. This raises several additional considerations: i) where the raw data are housed/stored and ultimately made available, ii) how the processed and summarized data are stored and ultimately made available, and iii) what pipeline/scripts (and their versions) are utilized to process, filter, and/or quality control the data. Each of these is important for future analysis, and should be documented for reasons of reproducibility.
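
One way to document where data are housed and exactly which files were used is a small manifest that records, for each file, its source, size, and checksum, together with the pipeline version. The Python sketch below is one possible layout under stated assumptions: `build_manifest`, the field names, and the example URL are hypothetical, and the demonstration file is a throwaway stand-in for real data.

```python
import hashlib
import json
import tempfile
from datetime import date
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(files: dict, pipeline_version: str) -> dict:
    """`files` maps a local path to the place the file was obtained from."""
    return {
        "pipeline_version": pipeline_version,
        "recorded_on": date.today().isoformat(),
        "files": [
            {
                "path": str(path),
                "source": source,
                "bytes": Path(path).stat().st_size,
                "sha256": sha256_of(Path(path)),
            }
            for path, source in sorted(files.items())
        ],
    }


# Demonstration on a throwaway file; the URL is a placeholder, not a real data set.
with tempfile.TemporaryDirectory() as workdir:
    raw = Path(workdir) / "counts.tsv"
    raw.write_bytes(b"gene\tcount\nA\t10\nB\t3\n")
    manifest = build_manifest(
        {str(raw): "https://example.org/placeholder/counts.tsv"}, "v0.1"
    )
    print(json.dumps(manifest, indent=2))
```

Re-computing the checksums later and comparing them with the stored manifest is a quick check that the raw files have not changed since the analysis was run.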

Examples of a computational notebook utilizing existing, validated computational tools and/or ‘Big Data’ include IPython/Jupyter notebooks + GitHub; R Markdown (knitr, Shiny); GitHub repos for UNIX pipelines.
This type of computational notebook, in its most basic form, can be conceptualized as a “pipeline”: it begins with an input, the input is processed in some way, and a desired output results. The notebook should contain sufficient details such that the owner (imagine 3 months in the future) could read, recapitulate the workflow exactly, and obtain the output that was previously reported.
The pipeline should be encapsulated in a document with the following instruction set:
1. State the central scientific question and a short rationale. The objective should be clearly stated.
2. Identify who is responsible, the date of creation, and the date of the last adjustment to the pipeline.
3. In a preamble, list any requirements that are necessary to execute the documented steps in the pipeline from start to end. These include, but are not limited to: programs (and their versions), scripts, and/or data files (with versions, when and where they were obtained, and any data use agreements), and IRB protocol approvals as appropriate.
4. If there are agreements or requests that limit the use of data for the purpose of analysis, these agreements should be clearly stated and appropriately referenced. If there are specific contacts for access, those contacts should be listed in the preamble section.
5. "Metadata," that is, additional documents with specific codings, dictionaries, or variable tables or listings that are required for the analysis or interpretation of the data, should also be specified and listed in the preamble.
6. The notebook should begin with a set of easily obtainable (downloaded) files that have not been previously modified.
7. If files are modified, the notebook should include, after the preamble, the steps that create the modified versions of those files used for analysis. The modifications could include reformatting or quality control.
8. List each step that is executed in the pipeline (UNIX command, computer program call, etc.), moving from the beginning of the document to the end of the document.
9. At each step in the workflow, include human-comprehensible comment(s) on what is done and the rationale.
10. At various steps along the way, the pipeline should also produce a set of summaries for intermediate outputs, which can be externally verified. This helps ensure that the pipeline reproduces at intermediate steps what was previously obtained.
11. It is often the case that figure or table “montages” are constructed for primary display in published manuscripts. If the pipeline produces data that go to a figure that is stand-alone or part of a figure montage, the pipeline should also include the scripts or commands used to generate the preliminary/final figures (or tables) that result, before any other modification (e.g., with Adobe tools).
12. It is also standard that, for large-scale analysis involving a cluster computing environment, many jobs are run and there will be a need for “batch” scripts or programs. In this case, provide the scripts that were executed, along with specific example instances of what the batch script does.
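
The instruction set above might be laid out as follows. This is a minimal sketch under stated assumptions, not a prescribed template: the scientific question, the owner, the data values, and the step functions are all hypothetical, and a real pipeline would read downloaded, unmodified files rather than the inline sample used here.

```python
"""
Question : Do treated samples show higher values than controls?       (item 1)
Owner    : A. Student; created YYYY-MM-DD; last changed YYYY-MM-DD    (item 2)
Requires : Python 3, standard library only; input file samples.csv    (item 3)
Data use : no restrictions on this illustrative data set              (item 4)
Metadata : column `group` is coded C = control, T = treated           (item 5)
"""
import csv
import io
import statistics

# Item 6: start from the unmodified input. An inline sample stands in for the file.
RAW = "sample,group,value\ns1,C,1.0\ns2,C,1.4\ns3,T,2.1\ns4,T,\ns5,T,2.5\n"


def load(raw_text):
    """Step 1: parse the raw CSV exactly as obtained."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def quality_filter(rows):
    """Step 2 (item 7): drop rows with a missing value; nothing else is changed."""
    return [row for row in rows if row["value"] != ""]


def group_means(rows):
    """Step 3: mean value per group, the quantity the question asks about."""
    groups = {}
    for row in rows:
        groups.setdefault(row["group"], []).append(float(row["value"]))
    return {name: statistics.mean(values) for name, values in sorted(groups.items())}


def summarize(label, rows):
    """Item 10: a small summary that can be checked against an earlier run."""
    print(f"[{label}] rows={len(rows)}")


# Items 8 and 9: the steps run in order, each documented where it is defined.
rows = load(RAW)
summarize("loaded", rows)
kept = quality_filter(rows)
summarize("after quality filter", kept)
means = group_means(kept)
print(means)
```

Keeping the preamble, the steps, and the intermediate summaries in one file means that a reader can run it from top to bottom and compare the row counts at each stage with those reported earlier.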
Version control systems are strongly encouraged as a means of ensuring that the timestamps attached to workbooks and markdown documents are authentic and verifiable.
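
One way to tie an analysis output to the version-control record is to store, alongside the output, the commit it was produced from. The sketch below assumes Git is the version control system in use and asks the `git` command line for the current commit; the function name `current_commit` and the `provenance` fields are hypothetical. If the script is run outside a repository, or Git is not installed, the commit is recorded as None rather than guessed.

```python
import subprocess
from datetime import datetime, timezone


def current_commit(repo_dir="."):
    """Return the hash of the checked-out commit, or None if unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


provenance = {
    "run_at": datetime.now(timezone.utc).isoformat(),
    "commit": current_commit(),
}
print(provenance)
```

Commit dates are set on the committer's own machine, so pushing regularly to a shared server adds an independent record of when the work existed.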
Data files should be in text (not binary) and/or other non-proprietary formats. If data files are produced in proprietary software, copies should be exported into a universally readable format, so that they remain readable if software becomes incompatible or the license is not available.
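
As a small illustration of keeping data in a universally readable form, the sketch below writes a table to CSV together with a JSON file describing the columns, using only the Python standard library. The function name `export_table`, the file names, and the example values are hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path


def export_table(rows, columns, out_dir, stem):
    """Write rows as UTF-8 CSV plus a JSON file describing each column."""
    out_dir = Path(out_dir)
    data_path = out_dir / f"{stem}.csv"
    meta_path = out_dir / f"{stem}.columns.json"
    with data_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(columns))
        writer.writeheader()
        writer.writerows(rows)
    meta_path.write_text(json.dumps(columns, indent=2), encoding="utf-8")
    return data_path, meta_path


columns = {"well": "plate well identifier", "od600": "optical density at 600 nm"}
rows = [{"well": "A1", "od600": 0.42}, {"well": "A2", "od600": 0.57}]

with tempfile.TemporaryDirectory() as workdir:
    data_path, meta_path = export_table(rows, columns, workdir, "growth_curve")
    exported = data_path.read_text(encoding="utf-8")
    print(exported)
```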

Section 4: Guidelines for computational notebooks that are the code and documentation for a novel computational algorithm, statistic, or experimental methodology

Please note the discussion above in Section 3 regarding ‘version control’ and ‘pointers to large data types/quality control and data management.’

Examples of computational notebooks that provide the code and documentation for a novel computational algorithm, statistic, or experimental methodology are repositories kept under version control, for instance on GitHub or with Subversion.
This type of notebook can be viewed essentially as a repository of code that performs a specific analytical job. The notebook should therefore comprise these components:
- The raw code that will be compiled (or interpreted) to perform the intended analytical tasks.
- Documentation explaining how to use the executable.
- Suitable test case(s), including as appropriate script(s) to re-run the tests, input data, and output data, sufficient to verify correct installation/reproducibility of code when executed by a third party.
This is to say, essentially: the raw code (with comments), plus documentation about the use of the program, plus test case(s) to verify the code runs as expected, constitutes the “notebook” that is preserved for posterity.
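
The three components can be seen together in miniature below: commented code, usage documentation (here a docstring), and test cases with fixed inputs and expected outputs that a third party can re-run. The sketch is in Python; the statistic chosen (GC content of a DNA sequence) and the names `gc_content`, `TEST_CASES`, and `run_tests` are illustrative assumptions, not part of these guidelines.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence.

    Usage: gc_content("ATGC") returns 0.5. Case is ignored. An empty
    sequence, or one containing characters other than A, C, G, T,
    raises ValueError rather than returning a misleading number.
    """
    bases = sequence.upper()
    if not bases or set(bases) - set("ACGT"):
        raise ValueError("sequence must be non-empty and contain only A, C, G, T")
    return (bases.count("G") + bases.count("C")) / len(bases)


# Test cases: fixed inputs stored with their expected outputs.
TEST_CASES = [("ATGC", 0.5), ("GGCC", 1.0), ("atat", 0.0)]


def run_tests() -> bool:
    """Re-run every stored test case; True means the results are reproduced."""
    for sequence, expected in TEST_CASES:
        if gc_content(sequence) != expected:
            return False
    try:
        gc_content("ATXG")  # invalid input must be rejected
    except ValueError:
        return True
    return False


print(run_tests())
```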
Precompiled executables without the raw code are insufficient. Raw code that is not appropriately commented is also insufficient.
Developed code must be kept in an environment that allows version control, so that previous versions of the code are retained and the latest versions are provided. This could be a version control system such as Git (hosted, for example, on GitHub) or Subversion, but other options are potentially available.
Pipelines can be developed to address specific scientific questions that ultimately utilize these programs. Such pipelines constitute the aforesaid Section 3 notebooks.

Resources:

NIH Guidance on Lab Notebooks (Guidelines for SCIENTIFIC RECORD KEEPING in the Intramural Research Program at the NIH): https://oir.nih.gov/sites/default/files/uploads/sourcebook/documents/ethical_conduct/guidelines-scientific_recordkeeping.pdf

Online book about Lab Notebooks (Writing the Laboratory Notebook by Howard M. Kanare, ACS, 1985): http://www.chem.tamu.edu/rgroup/dunbar/Writing%20a%20Laboratory%20Notebook-1.pdf

Schnell (2015), Ten Simple Rules for a Computational Biologist’s Laboratory Notebook: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004385

Guerrero et al. (2016) Analysis and Implementation of an Electronic Laboratory Notebook in a Biomedical Research Institute: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0160428

Dirnagl & Przesdzing (2016) A pocket guide to electronic laboratory notebooks in the academic life sciences: https://f1000research.com/articles/5-2/v1

http://researchdata.ox.ac.uk

https://sites.google.com/a/sheffield.ac.uk/rdm_links/technical/capture/eln

K. Palozola, a graduate student in BGS, described her experiences about computational notebook keeping: http://tinyurl.com/jpvvgna

Appendix:

Software relating to electronic notebooks (ELNs, to be used in lieu of traditional, bound notebooks):

In evaluating the software listed below, it is important to pay close attention to these features:

Verifiability.
Are timestamps provided? How?
Are electronic signatures used?
Can entries be changed to read-only (no editing) after creation? Can this change be made permanent and/or verified?
Security.
If the files are on a local server, is there an effective firewall? How are backups made? Are the backed-up files secure?
If the files are part of a Cloud-based system, how is their security ensured? Is encryption used to transmit information to and from the server? How long does the platform keep the files? How are backups obtained, and how long are they maintained?
What is the privacy policy? How stable is the company or developer?
Accessibility.
Is it available on only one platform (e.g., Evernote only on Mac devices)?
How is access to the files controlled? Who has access, and with what capabilities? For example, can the student and PI both access and edit the files? What about other lab members? People outside the lab, such as the Thesis Committee?
Where do the files reside? For example, are they on a local server that is accessible only from within the lab, or in the Cloud accessible anywhere (also see “Security” above)?
Organizational flexibility.
How well can different kinds of information be organized?
Can data/images be included? If so, are there storage limitations?
Can annotations/discussions be added?
Searchability.
Can annotations/discussions be included in searches?
Version control.
Particularly for algorithm development, are there built-in tools for version control?

LabArchives (https://researchnotebooks.upenn.edu/)

Penn has licensed LabArchives, and provides it to users cost-free.
The LabArchives notebook is accessible using any web browser.
Offers features such as templates and widgets.
Allows linking to other applications, sharing pages and notebooks, and commenting by others.
Information is stored in a cloud-based, secure network.
The platform provides permanency and auditability, as well as signing and witnessing.

Labguru (https://www.labguru.com)

Advanced tagging system for easy search, track recording from batch number to concentration
For keeping track of chemicals, plasmids, primers, cells, etc.
24/7 accessibility to all data
Easy to identify materials used with specific experiments and associated data files
Experiments – link experiments to their results regardless of format (images/data/notes)
Track project history and planned research track
Cons: not very intuitive, project view too complex, no free license, expensive monthly subscription

SciNote (https://scinote.net)

Very user-friendly and quick to set up
Unique experimental workflow
Open source license
Free account with unlimited project users
Downside: cannot draw chemical molecules
Good for molecular biology (gel pictures, protocols, etc.)
Not that useful for storing electrophysiological data
Runs on multiple platforms, including phones; free up to 1 GB
Additional space costs extra

Benchling (https://benchling.com/academic)

Very user friendly and quick to set up
Useful DNA tools, templates for sequence mapping (CRISPR, other scientific features)
Tons of entry templates (one for writing protocols, one for data entry, etc.)
Organization is logical: multiple projects and experiments can be worked on simultaneously.
Accessible from any computer; tables can be pasted from Excel files and protocols linked to experiments
Free account with 10 GB of storage space
Everything can be printed as a PDF for saving hard copies; changed entries are timestamped, and Dropbox integration is available
Good customer support for desktop/app versions
Cons: tied to a single user; report structure is not flexible; may need more storage

Evernote (https://evernote.com)

MAC ONLY
Enables the creation of multiple notebooks in a collaboration effort
Can be edited by other users that have access from creator
Create daily notes on experimental progress / tasks / meeting notes
Can sync across multiple devices
Search function to find notes by topic/date/etc.
Good for data analysis (Linux)
Future lab members could replicate entire experiments from accessing the files

Google Docs (https://docs.google.com)

Flexible for creating documents
Good source for sharable documentation that can be read and edited

Microsoft Excel / Word

Excel: good for daily experimental results and parameters
Word: good for keeping a log of experimental procedures

Asana (https://asana.com)

An experimental timeline visible for all users

Software relating to computational notebooks:

GitHub: Useful for code keeping, version control.
Shiny, knitr packages in R: Useful for workflows, document keeping.
Jupyter or IPython: Useful “markdown”-style notebooks for developing computational workflows and pipelines. These can also be uploaded to GitHub.
Google Docs: Cloud solution to document activities, can handle images.

Case Studies

Access to case studies requires a PennKey. They are available here.

Comments/Resources

Two textbooks provide excellent discussions of data acquisition and management. Both are electronically accessible through the Biomedical Library. These are:

Scientific Integrity, F.L. Macrina, 4th ed.
Responsible Conduct of Research, A.E. Shamoo and D.B. Resnik, 3rd ed.