NIH Data Management & Sharing Plans


Policy overview

Effective January 25, 2023, the National Institutes of Health (NIH) has a new Policy for Data Management and Sharing. This replaces the previous Data Sharing Policy of 2003. The main difference is that all competing grant or contract proposals (including renewals) to the NIH that generate scientific data must now include a data management & sharing (DMS) plan as part of the application.

DMS plans describe robust details of project data management and sharing during the entire funding period and for a minimum of 3 years after the end of the award. Project data need to be shared no later than the time of an associated publication or end of award (for unpublished data), whichever comes first.

DMS plans are not included in the peer-review process, and instead are assessed and finalized with Program Officers during the just-in-time award process. The DMS plan will become a Term and Condition of the Notice of Award, and failure to comply may result in an enforcement action, including additional special terms and conditions or termination of the award, and may affect future funding decisions. Questions will be added to the Research Performance Progress Report (RPPR) to help determine compliance with DMS plans, which the NIH will assess annually.

The NIH Data Sharing website contains a wealth of information about the policy's implementation, including a Frequently Asked Questions (FAQ) section to address common issues and concerns. Investigators are encouraged to reach out directly to the NIH Institute or Center (IC) Program Staff with questions about data sharing for specific programs.

Elements of a data management and sharing plan

A data management and sharing (DMS) plan consists of the following 6 elements:

  • Element 1. Data Type
  • Element 2. Related Tools, Software and/or Code
  • Element 3. Standards
  • Element 4. Data Preservation, Access, and Associated Timelines
  • Element 5. Access, Distribution, or Reuse Considerations
  • Element 6. Oversight of Data Management and Sharing

Sample plans and templates

DMS plans should provide descriptive details for the 6 elements listed above, be no more than 2 to 3 pages in length, and not include hyperlinks or URLs. Sample language for Emory investigators is outlined below.

DMPTool

DMPTool offers guidance for specific funders' data management requirements, including the NIH. It also includes sample plans and templates for preparing your own data management plan. Use your Emory NetID and password to log in at dmptool.org.

NIH Sample Plans

Several NIH institutes and centers have provided sample DMS plans for investigators to consult when writing plans, following the optional DMS plan formatted template.

Budgeting for data management and sharing plans

Costs associated with a DMS plan should be included in the budget and budget justification of proposals. Allowable costs include:

  • Curating data
  • Developing supporting documentation
  • Formatting data according to accepted community standards, or for transmission to and storage at a selected repository for long-term preservation and access
  • De-identifying data
  • Preparing metadata to foster discoverability, interpretation, and reuse
  • Local data management considerations, such as unique and specialized information infrastructure necessary to provide local management and preservation (for example, before deposit into an established repository).

Unallowable costs include:

  • Infrastructure costs that are included in institutional overhead
  • Costs associated with the routine conduct of research, including costs associated with collecting or gaining access to research data.
  • Costs that are double charged or inconsistently charged as both direct and indirect costs

Please refer to the NIH scientific data sharing guidelines for more detailed information on budgeting for data management and sharing.


Sample language and data sharing scenarios

The following sample language and data sharing scenarios are outlined for Emory investigators to consider including in their plans:

Data type and Standards

Plan elements 1 and 3 are highly specific to the type of research project being conducted and at times the particular funding opportunity or institute’s expectations.

Sample Language: Data type (plan element 1)

A. Types and amount of scientific data expected to be generated in the project: Summarize the types and estimated amount of scientific data expected to be generated in the project.

Example answer: This project will produce _________ [Data type, e.g., imaging, sequencing, experimental measurements] data generated/obtained from __________ [Data modality, e.g., instrument, method, survey, experiment, data source]. Data will be collected from ___ [number] of research participants/specimens/experiments, generating ___ [number] datasets totaling approximately ___ [amount of data] in size. The following data files will be used or produced in the course of the project: ______ [list input data files, intermediate files, and final, post-processed files]. Raw data will be transformed by ____ [analysis, method], and the subsequent processed dataset used for statistical analysis. To protect research participant identities, ___________ [e.g., individual, aggregated, summarized] data will be made available for sharing.

If working with human subjects, consider adding: Data collection will be performed at clinical sites in the ____ [location] area(s) with ____ [population(s) being studied; e.g., T2 diabetes].

B. Scientific data that will be preserved and shared, and the rationale for doing so: Describe which scientific data from the project will be preserved and shared and provide the rationale for this decision.

Example answer: Based on _______ [ethical, legal, technical] considerations, only the following data produced in the course of the project will be preserved and shared: ____ [list subsets of the data to be shared]
OR
All data produced in the course of the project will be preserved and shared.

If working with human subjects, consider adding: The final dataset will include _______[e.g., self-reported demographic and behavioral data from interviews with participants and laboratory data from blood and urine specimens provided]. We will share de-identified individual-participant level (IPD) data. Appropriate measures such as _______ [describe specific de-identification practices to be used] will be used for data de-identification and sharing, and informed consent forms will reflect those plans.

C. Metadata, other relevant data, and associated documentation: Briefly list the metadata, other relevant data, and any associated documentation (e.g., study protocols and data collection instruments) that will be made accessible to facilitate interpretation of the scientific data.

Example answer: To facilitate interpretation of the data, ______[e.g., data dictionary, metadata, documentation, statistical analysis plans, bench protocols, data collection instruments] will be created, shared, and associated with the relevant datasets.

If working with clinical trials, consider adding: In addition to ______ [individual participant data (IPD) dataset being shared by restricted access and/or aggregate data being shared openly], the researcher will share the ______ [describe any other elements of the final data package not already addressed]. Documentation and support materials will be compatible with the clinicaltrials.gov Protocol Registration Data Elements.

Sample Language: Standards (plan element 3)

State what common data standards will be applied to the scientific data and associated metadata to enable interoperability of datasets and resources, and provide the name(s) of the data standards that will be applied and describe how these data standards will be applied to the scientific data generated by the research proposed in this project. If applicable, indicate that no consensus standards exist.

Example answer: Data will be stored in common and open formats, such as ____ for our ____ data. Information needed to make use of this data [e.g., the meaning of variable names, codes, information about missing data, other metadata, etc.] along with references to the sources of those standardized names and metadata items will be included wherever applicable.

If there are formal data standards for some/all of the data: Whenever possible, we will use ______ [common data elements, standardized survey instruments, etc.] to structure and organize our data. Our ____ data will be structured and described using the ____ standard, which has been widely adopted in the ____ community. [Add additional information about this standard, if applicable - e.g., implementation in data repositories, utility in combining/reusing datasets]

If there are not formal standards: Formal standards for ____ data have not yet been widely adopted. However, our data and other materials will be structured and described according to best practices which are as follows: [list appropriate best practices].
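The documentation described above (variable names, codes, information about missing data) is often collected in a data dictionary. The following is a minimal sketch, assuming the shared data are in a CSV file with a header row; the function names `infer_type` and `build_data_dictionary` and the sample column names are hypothetical, and the type inference is deliberately simple. It only produces a skeleton: the descriptions still have to be written by the study team.

```python
import csv
import io


def infer_type(values):
    """Guess a simple type label from the non-missing string values."""
    non_missing = [v for v in values if v != ""]
    if not non_missing:
        return "empty"
    for caster, label in ((int, "integer"), (float, "float")):
        try:
            for v in non_missing:
                caster(v)
            return label
        except ValueError:
            continue
    return "string"


def build_data_dictionary(csv_text):
    """Return one skeleton dictionary entry per column of the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    dictionary = []
    for name in reader.fieldnames or []:
        # Treat absent or whitespace-only cells as missing.
        values = [(row[name] or "").strip() for row in rows]
        dictionary.append({
            "variable": name,
            "type": infer_type(values),
            "n_missing": sum(1 for v in values if v == ""),
            "description": "",  # to be filled in by the study team
        })
    return dictionary


# Hypothetical example data, for illustration only.
sample_csv = "participant_id,age,hba1c\nP1,54,7.2\nP2,,6.8\n"
for entry in build_data_dictionary(sample_csv):
    print(entry)
```

A skeleton like this can be exported alongside the dataset and completed by hand with variable meanings, code lists, and references to any standardized names used.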

Domain-specific data repository

The NIH strongly encourages the use of domain-specific or data-specific repositories for data sharing, to maximize findability and reusability of data. Sometimes, particular Institutes, Centers, and Offices (ICOs), Funding Opportunity Announcements (FOAs), or Program Officers require award recipients to utilize specific scientific data repositories or shared archives. Investigators should be sure to check the submission instructions for any such requirement or guidance.

Choosing a domain-specific or data-specific repository that currently exists:

  • See list of NIH-supported repositories for more information
  • Check for data submission instructions from the NIH institute, center, or office, and use available boilerplate for designated / recommended repository
  • Incorporate specific guidelines for genomics datasets, and for clinical trials, when applicable
  • Describe specific details as to how the repository meets the criteria of Element 4 “Data Preservation, Access, and Associated Timelines” and Element 5 “Access, Distribution, or Reuse Considerations” of the DMS plan if utilizing a consortium- or community-supported data repository or collaborative share.
  • Include any provisions and approach for storing datasets in a local, intermediate storage or repository location, accessible to project team members for analysis and curation while research outputs are not published yet.

Sample Language: Data Preservation, Access, and Associated Timelines (plan element 4)

A. Repository where scientific data and metadata will be archived: Provide the name of the repository(ies) where scientific data and metadata arising from the project will be archived (see Selecting a Data Repository).

Example answer: All dataset(s) that can be shared will be deposited in _________ [Add appropriate NIH-supported data repositories] OR ________ [Add appropriate discipline- or data-specific repository, or consortium data repository].

B. How scientific data will be findable and identifiable: Describe how the scientific data will be findable and identifiable, i.e., via a persistent unique identifier or other standard indexing tools.

Example answer: The _________ [repository name] provides metadata, persistent identifiers [insert whether DOI, handles, other], and long-term access. This repository is supported by ________[Insert funder/organization] and dataset(s) are available under a _______ [Insert license information]
OR
The _________ [repository name] provides metadata, persistent identifiers [insert whether DOI, handles, other], and long-term access. This repository is supported by ________[Insert funder/organization] and dataset(s) are available through a request process __________ [Insert information about request process]

C. When and how long the scientific data will be made available: Describe when the scientific data will be made available to other users (i.e., no later than time of an associated publication or end of the performance period, whichever comes first) and for how long data will be available.

Example answer: Shared data generated from this project will be made available as soon as possible, and no later than the time of publication or the end of the funding period, whichever comes first. The duration of preservation and sharing of the data will be a minimum of _____[duration] years after the end of the funding period. [Include any embargo provision and approach.]

Generalist repository: Emory Dataverse

If there isn't a preferred data repository in your domain or discipline, consider whether your data meets the Emory Dataverse generalist repository criteria:

  • small to medium datasets (i.e., less than 1 TB total)
  • doesn't contain internal, confidential, or proprietary data, or personally identifying information about human subjects

Emory Dataverse is Emory's open data generalist repository, offered through a partnership between Emory and UNC’s Odum Institute:

  • Data deposited with the Emory Dataverse is made available through a web-accessible repository at no cost to depositors or users.
  • Emory Dataverse provides persistent access to data in keeping with the FAIR data sharing principles. Each dataset in Dataverse is assigned a Digital Object Identifier (DOI) for reliable citation and linking.
  • For datasets that cannot be deposited into Dataverse due to size or sensitivity limitations, Dataverse can still be used to publish de-identified versions of the datasets, metadata and access/request information associated with datasets stored elsewhere, or data replication protocols.
  • Dataverse provides the option to restrict access to a deposited dataset for embargo or other research protection reasons, until investigators are ready to release the dataset to the public.

Sample Language: Data Preservation, Access, and Associated Timelines (plan element 4)

Example answer: All dataset(s) that can be shared will be deposited in the Emory Dataverse. The Emory Dataverse provides metadata, persistent identifiers (DOIs), and long-term access. Emory Dataverse is the open generalist data repository supported by Emory University in partnership with the Odum Institute Data Archive at UNC Chapel Hill. Shared data generated from this project will be made available as soon as possible, and no later than the time of publication or the end of the funding period, whichever comes first.

Sample Language: Access, Distribution, or Reuse Considerations (plan element 5)

Example answer: There are no anticipated factors or limitations that will affect the access, distribution or reuse of the scientific data generated by the proposal. Controlled access will not be used. The data that is shared will be shared by unrestricted download.

Other generalist repositories

When data doesn't meet the criteria for Emory Dataverse, or for other reasons, another generalist repository may be preferred. See the NIH guidance on selecting a repository with desirable characteristics and the list of generalist repositories.

Note that since Emory does not subscribe to other generalist repositories at this time, additional subscription and deposit fees may be incurred by the project.

Sample Language: Data Preservation, Access, and Associated Timelines (plan element 4)

A. Repository where scientific data and metadata will be archived: Provide the name of the repository(ies) where scientific data and metadata arising from the project will be archived (see Selecting a Data Repository).

Example answer: All dataset(s) that can be shared will be deposited in _________ [Add appropriate generalist data repository].

B. How scientific data will be findable and identifiable: Describe how the scientific data will be findable and identifiable, i.e., via a persistent unique identifier or other standard indexing tools.

Example answer: The _________ [repository name] provides metadata, persistent identifiers [insert whether DOI, handles, other], and long-term access. This repository is supported by ________[Insert funder/organization, if applicable] and dataset(s) are available under a _______ [Insert license information]
OR
The _________ [repository name] provides metadata, persistent identifiers [insert whether DOI, handles, other], and long-term access. This repository is supported by ________[Insert funder/organization, if applicable] and dataset(s) are available through a request process __________ [Insert information about request process]

C. When and how long the scientific data will be made available: Describe when the scientific data will be made available to other users (i.e., no later than time of an associated publication or end of the performance period, whichever comes first) and for how long data will be available.

Example answer: Shared data generated from this project will be made available as soon as possible, and no later than the time of publication or the end of the funding period, whichever comes first. The duration of preservation and sharing of the data will be a minimum of _____[duration] years after the end of the funding period. [Include any embargo provision and approach.]

Sample Language: Access, Distribution, or Reuse Considerations (plan element 5)

Example answer: There are no anticipated factors or limitations that will affect the access, distribution or reuse of the scientific data generated by the proposal. Controlled access will not be used. The data that is shared will be shared by unrestricted download.

Sensitive data: Considerations and Controlled Access

Data collected or generated for a research project may be too sensitive to be shared readily, especially in the case of human subject studies involving electronic protected health information (ePHI) or other types of regulated / restricted studies. Legal, ethical or other considerations may preclude sharing the original datasets publicly or may require the establishment of Data Use Agreements and controlled-access repositories and data request processes.

Approaches to sharing datasets associated with these types of studies include:

  • Reliably anonymizing original data into a de-identified dataset that can be safely shared and submitted to a domain, community, or generalist data repository, using methods such as Honest Broker de-identification to comply with HIPAA Safe Harbor requirements, date shifting, concept codification of data, and natural language processing of notes.
  • Aggregating individual-level data into population-level datasets that only provide counts and other summarized data features.
  • Publishing metadata-only records of scientific datasets in a domain, consortium or generalist repository, with clear instructions and reliable processes to request the data via a Data Use Agreement, identified data custodians, and sustainable storage and distribution mechanisms.
  • Storing datasets in a consortium or institutional “data enclave” with governed request and access controls.
  • Publishing all details of methods, protocols, query/generation parameters by which similar data can be reproduced, or obtained from original source with a valid authorization (such as highly restricted / classified datasets).
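Two of the approaches above, date shifting and aggregation into population-level counts, can be illustrated with a short sketch. This is an illustration under stated assumptions, not a vetted de-identification procedure: the function names, the 180-day shift window, the keyed-hash method for deriving offsets, and the minimum cell size of 5 are all hypothetical choices, and none of this by itself establishes HIPAA compliance. Any real workflow should be reviewed with the IRB and the appropriate data custodians.

```python
import hashlib
import hmac
from collections import Counter
from datetime import date, timedelta


def shift_days(participant_id, secret, max_shift=180):
    """Derive a stable per-participant offset in [-max_shift, max_shift] days.

    A keyed hash (HMAC) makes the offset reproducible for the data custodian
    who holds the secret, without storing a lookup table.
    """
    digest = hmac.new(secret.encode(), participant_id.encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % (2 * max_shift + 1) - max_shift


def shift_date(d, participant_id, secret, max_shift=180):
    """Shift a date by the participant's offset; intervals within a participant are preserved."""
    return d + timedelta(days=shift_days(participant_id, secret, max_shift))


def aggregate_counts(records, key, min_cell=5):
    """Count records per category, suppressing small cells (reported as None)."""
    counts = Counter(r[key] for r in records)
    return {k: (n if n >= min_cell else None) for k, n in counts.items()}


# Hypothetical example values, for illustration only.
visit_1 = shift_date(date(2021, 3, 1), "P001", "demo-secret")
visit_2 = shift_date(date(2021, 6, 15), "P001", "demo-secret")
print((visit_2 - visit_1).days)  # same interval as between the original dates

records = [{"site": "A"}] * 6 + [{"site": "B"}] * 2
print(aggregate_counts(records, "site"))
```

Because every date for a given participant moves by the same offset, time intervals between events remain usable for analysis, while small-cell suppression reduces the risk that a count points to a handful of individuals.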

Note that some of the approaches listed above may incur skilled services costs for data curation, de-identification, aggregation or legal agreement activities, and/or for solution development, data hosting and distribution. Consult the IT Service Catalog or contact dataplans@emory.edu with questions.

Sample Language: Access, Distribution, or Reuse Considerations (plan element 5)

A. Factors affecting subsequent access, distribution, or reuse of scientific data: NIH expects that in drafting Plans, researchers maximize the appropriate sharing of scientific data. Describe and justify any applicable factors or data use limitations affecting subsequent access, distribution, or reuse of scientific data related to informed consent, privacy and confidentiality protections, and any other considerations that may limit the extent of data sharing. See Frequently Asked Questions for examples of justifiable reasons for limiting sharing of data.

Example answer: Due to _______ [ethical/legal/technical considerations], access/distribution/reuse of the resulting scientific data will be limited and approved/monitored by ________ [describe the approach to limiting access/distribution/reuse].

B. Whether access to scientific data will be controlled: State whether access to the scientific data will be controlled (i.e., made available by a data repository only after approval).

Example answer for researchers selecting controlled access repositories:
Given the sensitive nature of the dataset, data will be made available in ________ [repository name], which restricts access to the data to ______ [describe restriction, e.g. to qualified investigators with an appropriate research question and approved data use agreement (DUA)]. Data can be accessed by _____ [describe data repository access methods and measures].  

C. Protections for privacy, rights, and confidentiality of human research participants: If generating scientific data derived from humans, describe how the privacy, rights, and confidentiality of human research participants will be protected (e.g., through de-identification, Certificates of Confidentiality, and other protective measures). For additional guidance related to safeguarding human subjects research, see the Emory IRB website.

Example answer for researchers working with human subjects data: [This subsection applies to studies involving human research participants. Other studies can generally skip.]
In order to ensure participant consent for data sharing, IRB paperwork and informed consent documents will include language describing plans for data management and sharing of data, describing the motivation for sharing, and explaining that personal identifying information will be removed. To protect participant privacy and confidentiality, shared data will be de-identified using the ______ methods [describe de-identification method, noting any other applicable laws or policies such as HIPAA].

Large data volumes: Considerations and Options

Certain research projects may acquire or generate volumes of data that are too large to transfer and/or store in a domain, consortium or generalist repository.

Approaches to sharing voluminous datasets include:

  • Publishing summary, processed/analyzed, or other aggregate-level datasets derived from the raw datasets in a domain or generalist repository, and making raw data available upon request with an associated request process and data custodian
  • Publishing metadata-only records of scientific datasets in a domain, consortium, or generalist repository, with clear instructions and reliable processes to request the data, identified data custodians, and sustainable storage and distribution mechanisms.
  • Storing datasets in a public, consortium, or institutional data repository with public or governed access, such as a link to a Cloud storage location, or website to download the data
  • Providing associated query, analysis tools and compute infrastructure as well, perhaps with an end user or collaboration agreement to manage access and utilization of resources
  • Publishing all details of methods, protocols, query/generation parameters by which similar data can be reproduced, or obtained from original source of data.

Note that the approaches to sharing high volumes of data or making them available may incur additional data storage, transfer and processing costs. Consult the IT Service Catalog or contact dataplans@emory.edu with questions.

Sample Language: Access, Distribution, or Reuse Considerations (plan element 5)

A. Factors affecting subsequent access, distribution, or reuse of scientific data: NIH expects that in drafting Plans, researchers maximize the appropriate sharing of scientific data. Describe and justify any applicable factors or data use limitations affecting subsequent access, distribution, or reuse of scientific data related to informed consent, privacy and confidentiality protections, and any other considerations that may limit the extent of data sharing. See Frequently Asked Questions for examples of justifiable reasons for limiting sharing of data.

Example answer: There are no anticipated factors or limitations that will affect the access, distribution or reuse of the scientific data generated by the proposal.
OR
Due to _______ [technical considerations], access/distribution/reuse of the resulting scientific data will be limited and approved/monitored by ________ [describe the approach to limiting access/distribution/reuse].

B. Whether access to scientific data will be controlled: State whether access to the scientific data will be controlled (i.e., made available by a data repository only after approval).

Example answer (Researchers who are not using controlled access repositories can skip this section or state): Controlled access will not be used. The data that is shared will be shared by unrestricted download.

For researchers selecting controlled access repositories: Given the nature of the dataset, data will be made available in ________ [repository name], which restricts access to the data to ______ [describe restriction, e.g. to qualified investigators with an appropriate research question and approved data use agreement (DUA)]. Data can be accessed by _____ [describe data repository access methods and measures].

Tools, Software and Code

Although not mandated by the 2023 NIH data sharing policy, investigators are encouraged to make analysis tools, software and code publicly available in association with the datasets that these tools generated or analyzed.

If desired, options for sharing tools, software and code include*:

  • Storing code files together with the datasets deposited in the selected discipline-specific or generalist repository
  • Providing a link or detailed instructions to access the tools, software or code in the metadata for the datasets deposited in the selected discipline-specific or generalist repository (“Related publications” field in Dataverse)
  • Storing the code in a protected code repository solution such as Emory GitHub, and packaging the code upon request
  • Storing the code in public GitHub or another code repository
  • Developing a Web site/application that can be accessed by the community, hosted on a public server, consortium or 3rd party managed environment, or an institutional solution such as AWS at Emory
  • Providing access to a computational environment with the tools to run analysis on shared datasets, especially when data are voluminous, code is compute-intensive, and/or datasets are too sensitive to be disseminated outside of an enclave-type of environment.

*Before sharing any tool, software or code, investigators should ensure that it is reusable and well tested, and stripped of any sensitive content such as database connection strings, account logins/passwords, user information, API keys, and sensitive data values such as hospital names, dates, etc.
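As a rough illustration of the pre-release check described in the note above, the following sketch scans source text for a few patterns that often indicate embedded credentials. The patterns, the labels, and the `scan_text` function are hypothetical and far from exhaustive; a match is only a prompt for manual review, and a clean result does not prove the code is free of sensitive content.

```python
import re

# Hypothetical, non-exhaustive patterns for common kinds of embedded secrets.
PATTERNS = {
    "password": re.compile(r"\b(password|passwd|pwd)\s*[=:]\s*\S+", re.IGNORECASE),
    "api_key": re.compile(r"\b(api[_-]?key|secret|token)\s*[=:]\s*\S+", re.IGNORECASE),
    # scheme://user:password@host style connection strings
    "connection_string": re.compile(r"\b\w+://[^\s/:@]+:[^\s/@]+@\S+"),
}


def scan_text(text):
    """Return (line_number, label) pairs for lines that match a pattern."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings


# Made-up sample input, for illustration only.
sample = (
    'db = "postgresql://user:hunter2@db.example.org/study"\n'
    "x = 1\n"
    'API_KEY = "abc123"\n'
)
for lineno, label in scan_text(sample):
    print(f"line {lineno}: possible {label}")
```

A check like this is best run over every file before deposit, and again over the version history if the code is shared from a repository, since removed secrets can persist in earlier commits.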

Note that some of these tool-sharing options may require specialized infrastructure, specialized skillsets, development and maintenance costs, as well as associated hosting, computational and storage costs. Consult the IT Service Catalog or contact dataplans@emory.edu with questions.

Sample Language: Related Tools, Software and/or Code (plan element 2)

State whether specialized tools, software, and/or code are needed to access or manipulate shared scientific data, and if so, provide the name(s) of the needed tool(s) and software and specify how they can be accessed.

Example answer:

If no specialized tools are needed to access or manipulate the data: _____ [Data type - Imaging data, survey data, etc.] data will be made available in _____ [csv, txt, dicom, etc.] format and will not require the use of specialized tools to be accessed or manipulated.

If specialized tools (open source or proprietary) are needed to access or manipulate the data: _____ [Data type - Imaging data, survey data, etc.] data will be made available in _____ format, which requires the use of specialized tools, such as _____ [include list of tools] to be accessed and manipulated.

These tools will be shared openly via ____.
OR
These tools are fee-based, proprietary software, or can only run on a specific environment. Alternative access to the data will be provided by [describe the strategy for other sites to see or work with the data - potential strategies include committing to provide links to file viewers, or exporting files to a nonproprietary format for limited use and reuse].

Commercial or proprietary data: Considerations and Options

State that the dataset being used is proprietary/commercial and cannot be shared. Make sure to clearly justify why the data cannot be shared, e.g., the terms of a data use agreement or license. State how it can be purchased or accessed via the vendor, company, organization, etc.

Commercialization or technology transfer options

Oversight of Data Management and Sharing

At Emory, the oversight of individual project-related data management and sharing plans is the responsibility of the PI and/or named delegate(s) in the project team.

Sample Language: Oversight of Data Management and Sharing (plan element 6)

Describe how compliance with this Plan will be monitored and managed, frequency of oversight, and by whom at your institution (e.g., titles, roles).

Example answer: Lead PI ____[name]___, ORCID: __[ORCID ID]___, will be responsible for the day-to-day oversight of lab/team data management activities and data sharing. Broader issues of DMS Plan compliance oversight and reporting will be handled by the PI and Co-I team as part of general [campus(es)] stewardship, reporting, and compliance processes.

Additional Resources at Emory

Rigor and Reproducibility Seminar Series

Sponsored by the Emory Libraries, Office of Information Technology, and the WHSC Data Science Initiative, the Rigor and Reproducibility seminar series hosts regular webinars covering topics related to data management and sharing, including overviews of the new NIH Policy in Fall 2021 and Fall 2022.

Apps and Software Reviewed for Research

A list of software and services that have been reviewed and approved by Emory for processing and storing Identifiable Information as electronic data, including ePHI. Even though these services have been reviewed, you are still responsible for ensuring that your use of them meets all of Emory’s applicable IT security and HIPAA policies, as well as any applicable rules of behavior.

Boilerplate Language Library

Emory Facilities and Resources Boilerplate Language Library is a centralized collection of template language describing resources available at Emory for investigators writing grant proposals, progress reports and other documents that need to highlight Emory's institutional research environment.

Need Help?

Have questions about research data management at Emory to comply with NIH's policy? Email dataplans@emory.edu for assistance, to request a review of your draft data management & sharing plan, or get help selecting an appropriate data repository.
