Carnegie science can be a data-intensive endeavour.  Managing that data in an effective and efficient manner can be a challenge.  IT is here to help.  IT can:

  • Help you plan for how to handle your project's data storage needs
  • Provide data storage solutions to retain and preserve scientific data
  • Recommend workflows to ensure data is being processed and stored efficiently

Continue reading to see how IT works with data and can help you manage your project's data.



Scientific Data Classification

IT classifies scientific data into the following categories (a worked example follows the list):

  • Original Data: Scientific data acquired or generated by Carnegie instruments, codes, and/or scientific activities.  Generally, this is data before any processing.
  • Foreign Data: Scientific data where the authoritative source is outside of Carnegie.  Data preservation and maintenance are handled by that external authoritative source (e.g., NOAA, NASA, universities).
  • Original Code: Source code written and maintained by Carnegie scientists.
  • Project Codes: Source code and/or programs used in the processing of data.  This is the combination of any Original Code plus code downloaded or received from other sources.
  • Active Dataset: A copy of a subset of Original Data plus any Foreign Data that is to be processed by the Project Codes.
  • Final Products: End results for the project.  Typically this is the data that will be published and accessible by the public.  This may also include data that is not intended for the public, but may be made available to other scientists or granting agencies on-demand.
  • Temporary and Intermediate Data: Data generated by Project Codes while processing Active Datasets that is not a Final Product.
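
To make these categories concrete, here is a minimal sketch of how one hypothetical project's files might map onto the classification.  Every lab, project, and file name below is illustrative, not a required layout:

    # Hypothetical example only: one project's files sorted into the
    # categories above.  All names and paths are made up for illustration.
    project_files = {
        "Original Data":    "/Carnegie/DPB/Data/smith-lab/instrument/run_001.raw",
        "Foreign Data":     "/scratch/smith/run_001/noaa_sst_daily.nc",  # authoritative copy lives at NOAA
        "Original Code":    "process.py",                                # written in-house, kept on Carnegie GitHub
        "Project Codes":    ["process.py", "downloaded_toolkit/"],       # in-house code plus external code
        "Active Dataset":   "/scratch/smith/run_001/input/",             # working copy staged for processing
        "Final Products":   "/Carnegie/DPB/Data/smith-lab/run_001/final/",
        "Temporary and Intermediate Data": "/scratch/smith/run_001/tmp/",
    }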


Storage Systems and Services

IT provides several storage systems and services that you can leverage in managing your scientific data:

  • Data
    • Purpose: Data Archival
    • Backups: Local Snapshots + Off-Site Mirror
    • Snapshot Schedule: 4x 15-minute, 24x hourly, 7x daily, 4x weekly, 12x monthly (point-in-time copies covering roughly the past hour, day, week, month, and year)
    • Off-Site Mirror Schedule: Daily, last 30 successful backups
  • Google Drive
    • Purpose: Data Sharing and Archival
    • Backups: Global Datacenter Replication + Google Vault
    • Deleted File Retention Period: 30 days
    • Google Vault Retention Period: 7 years
    • Note: Data processing code cannot read/write data directly from Google Drive; data must first be staged to local storage (see the sketch after this list)
  • Scratch/Temp Space (/scratch on HPC clusters, /tmp on stand-alone systems, T: on Windows)
    • Purpose: High-Speed Storage for Active Computation
    • Backups: NONE
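
Because processing code cannot operate on Google Drive directly, anything kept there has to be staged to local storage before a job runs, and results pushed back afterward.  Below is a minimal sketch of one way to do that with rclone; the tool choice and the remote name "gdrive" are assumptions, not an IT-supported configuration, so please check with IT for the supported tooling:

    import subprocess

    # Sketch: stage inputs from Google Drive to local scratch, process them
    # locally, then push final products back.  Assumes rclone is installed
    # and a remote named "gdrive" is configured (both hypothetical).
    def stage_from_drive(remote_path: str, local_path: str) -> None:
        subprocess.run(["rclone", "copy", f"gdrive:{remote_path}", local_path],
                       check=True)

    def push_to_drive(local_path: str, remote_path: str) -> None:
        subprocess.run(["rclone", "copy", local_path, f"gdrive:{remote_path}"],
                       check=True)

    stage_from_drive("run_001/input", "/scratch/smith/run_001/input")
    # ... run processing against the local copy, never against Drive ...
    push_to_drive("/scratch/smith/run_001/final", "run_001/final")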


Scientific Data Management

For each type of data, there are best practices for managing it to ensure critical data is protected, processing is efficient, and storage is used efficiently:

  • Original Data
    • Recommended Retention: Permanent
    • Storage: Local Archive + Offsite Disaster Recovery
    • Recommended Server/Service: Lab or Project folder on Data, or a Project-Specific Server
  • Foreign Data
    • Recommended Retention: Active Processing
    • Storage: High-Speed Scratch or Temp space
    • Recommended Server/Service: Project folder on the system's Scratch/Temp space
    • Note: Please ensure that Foreign Data is not stored on Carnegie file servers and/or archival systems.  Preservation and maintenance of Foreign Data (including the financial burden of these activities) are the responsibility of the authoritative source.  Storing Foreign Data on Carnegie file servers and/or archival systems wastes costly storage resources.
  • Original Code
    • Recommended Retention: Permanent
    • Storage: Local Archive + Offsite Disaster Recovery
    • Recommended Server/Service: Carnegie GitHub (Alternative: Lab or Project folder on Data)
  • Project Codes
    • Recommended Retention: Permanent
    • Storage: Local Archive + Offsite Disaster Recovery
    • Recommended Server/Service: Lab or Project folder on Data
    • Note: During active processing, these codes may perform better if a copy is stored on and executed from High-Speed Scratch or Temp space.
  • Active Dataset
    • Recommended Retention: Active Processing
    • Storage: High-Speed Scratch or Temp Space
    • Recommended Server/Service: Project folder on the system's Scratch/Temp space
    • Note: It's also best practice to remove the Active Dataset from the Scratch/Temp space after processing is complete; see the cleanup sketch after this list.  This ensures this High-Speed (aka EXPENSIVE) storage space is available for the next job/project and is not being used by idle data.
  • Final Products
    • Recommended Retention: Permanent
    • Storage: Local Archive + Offsite Disaster Recovery
    • Recommended Server/Service: Lab or Project folder on Data
  • Temporary and Intermediate Data
    • Recommended Retention: Job Run
    • Storage: High-Speed Scratch or Temp Space
    • Recommended Server/Service: /scratch on HPC clusters, /tmp on stand-alone systems, T: drive on Windows systems
    • Note: Do not keep temporary or intermediate data longer than is absolutely necessary.
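
As the notes above suggest, the common thread for scratch-resident data (Active Datasets and Temporary/Intermediate Data) is prompt cleanup.  Here is a minimal cleanup sketch, assuming the hypothetical job directory used in the earlier example and assuming Final Products have already been copied back to Data:

    import shutil
    from pathlib import Path

    # Cleanup sketch: once Final Products are safely back on Data, remove
    # the job's scratch directory (Active Dataset plus temporary and
    # intermediate files) so the expensive high-speed storage is free for
    # the next job.  The path is a hypothetical example.
    job_dir = Path("/scratch/smith/run_001")
    if job_dir.exists():
        shutil.rmtree(job_dir)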


Overall Strategy

When doing computation on Carnegie's computing infrastructure, the recommended data storage strategy for DPB and DGE researchers is as follows:

  • Original files (i.e., any data, settings files, etc.), along with any code, should be stored on Data in your lab's folder.  Data is mounted on all Carnegie systems on Stanford's campus.
    • On Linux systems, Data can be found under either /Carnegie/DPB/Data or /Carnegie/DGE/Data
    • On Windows systems, Data is mapped to the S: drive
  • For a new project, copy the set of raw files and code from Data to a folder on the system's "scratch" space
    • On Linux systems, this is either /lustre/scratch, /nfs/scratch, or /tmp (listed in order of preference)
    • On Windows systems, scratch space is mapped to the T: drive  
  • Any temporary/intermediate output from your code should also be written to a folder in the "scratch" space
  • Nothing in "scratch" spaces will be backed up, so any final products from your jobs, or any modifications to your code that you want to keep, should be copied back to your lab's folder on Data.
  • Files on Data are regularly backed up, both via local snapshots and via syncs to off-site disaster recovery storage at the Department of Embryology.

Your goal should be to structure your project so that in the event that files on any "scratch" space are lost, you can recopy and rerun all necessary components of your project from what is securely backed up on Data.
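
Putting the strategy together, here is a minimal end-to-end sketch for a Linux system.  Only the mount points (/Carnegie/DPB/Data and /lustre/scratch) come from this document; the lab name, project layout, and processing script are hypothetical:

    import shutil
    import subprocess
    from pathlib import Path

    data    = Path("/Carnegie/DPB/Data/smith-lab/run_001")   # backed up
    scratch = Path("/lustre/scratch/smith/run_001")          # fast, NOT backed up

    # 1. Copy raw files and code from Data to scratch for processing.
    shutil.copytree(data / "raw",  scratch / "raw")
    shutil.copytree(data / "code", scratch / "code")

    # 2. Run the job; temporary/intermediate output stays on scratch.
    (scratch / "tmp").mkdir(exist_ok=True)
    (scratch / "final").mkdir(exist_ok=True)
    subprocess.run(["python", str(scratch / "code" / "process.py"),
                    "--input",  str(scratch / "raw"),
                    "--tmpdir", str(scratch / "tmp"),
                    "--outdir", str(scratch / "final")],
                   check=True)

    # 3. Copy final products (and any code changes) back to Data, then
    #    free the scratch space for the next job.
    shutil.copytree(scratch / "final", data / "final", dirs_exist_ok=True)
    shutil.copytree(scratch / "code",  data / "code",  dirs_exist_ok=True)
    shutil.rmtree(scratch)

(Note that shutil.copytree's dirs_exist_ok flag requires Python 3.8 or newer.)  If the scratch copy is ever lost, everything needed to rerun the job can be recopied from Data, which is exactly the property this strategy is designed to guarantee.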