Archiving data
Data archiving procedure (for EPSRC-funded researchers)
Requirements
The EPSRC requires that all publications with a publication date on or after 1st May 2015 include a statement describing how to access the underlying research data. This means the data must be publicly available and easy to understand.
The Thom Group uses the university data repository (http://www.data.cam.ac.uk/repository) for this purpose. For every paper, there should be a corresponding directory with a metadata file describing its contents.
Required information
When the data was created (yyyy-mm-dd)
How the data was created:
- If you used publicly available software, it is sufficient to state the name and version of the software, e.g. Q-Chem (Version 4.0.1, Q-Chem, Inc., Pittsburgh, PA (2007) www.q-chem.com)
- If you used your own code or software which is not publicly available, you should include a copy of that code or, in the worst case scenario, details of the software and its creators, e.g. ‘Data produced using The Thom Group's qcmagic script’
How the data was plotted. Including the code you used to plot the data is sufficient, as long as it is clear how it can be run (i.e. it does not refer to data files which are missing or in a different directory). A sketch of a metadata entry covering these points is given below.
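For illustration only, a single metadata entry covering all three points might look something like the following. The figure label, file names and software details here are made up, not prescribed:

Figure 2 (NePlateaux)
WHEN:    2014-08-15
HOW:     Q-Chem (Version 4.0.1, Q-Chem, Inc., Pittsburgh, PA (2007) www.q-chem.com), input file ne_plateaux.in
PLOTTED: plot_ne_plateaux.py (Python 2.7), reads only ne_plateaux.dat in the same directory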
Archive as you go
You can make the archiving stage much easier by preparing for it as you research, rather than trying to clean up messy data and unintelligible code at the end. Here are some tips:
WHEN CREATING DATA:
Organise the data files logically. If it is possible to organise them by figure, then do so. This might not be possible if your publication plots the same data in lots of different ways; in that case, make sure the way you organise your data is obvious to others (avoid meaningless names and acronyms, e.g. H2-sto_trial1_TEST5.dat). Keep an index noting when and how the data was created:
WHEN: yyyy-mm-dd
HOW: software used, version, input file
WHEN PLOTTING DATA:
Make your plots easy to reproduce (ideally, one plotting script per figure). Make sure your script only imports data it is going to use. As with data creation, keep an index of the version and date (e.g. Python 2.7, 2014-08-15). A sketch of such a script is given below.
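As a sketch only, a self-contained one-figure plotting script might look like the following. The data file name, axis labels and output file name are placeholders, and numpy and matplotlib are assumed to be available:

# plot_figure1.py -- Python 2.7, 2014-08-15
# Plots a single figure and reads only the data it actually uses.
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data file, kept in the same directory as this script.
data = np.loadtxt("ne_plateaux.dat")

plt.plot(data[:, 0], data[:, 1])
plt.xlabel("Iteration")
plt.ylabel("Projected energy / Hartree")
plt.savefig("figure1.pdf")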
Example
Click this link to see an example of properly indexed data in the university repository.
Using the Thom Group create_data_dump script
If you have your data organised like the above example, you might be able to use our create_data_dump.py script to index the data for you! Here is what you need for it to work:
- The absolute path to the .tex file (e.g. /home/rc566/Papers/iCCMC/iCCMC.tex)
- The .tex file must refer to directories in its own directory (e.g. \includegraphics{NePlateaux/NePlateaux})
- The script you used to plot the data must end in ".py"
- The data files which the plotting script refers to must end in ".dat" or ".out", or include the string "FCIMCStats" (this can be altered in figure_classes.py)
To run the script, navigate to the directory it is stored in, then use: python create_data_dump.py <path_to_tex_file>
e.g. python create_data_dump.py /home/rc566/Papers/iCCMC/iCCMC.tex
If you navigate to the directory containing your .tex file, you'll find a data_dump directory has appeared. There is one directory for each figure inside (e.g. labelled figure #), and a metadata.tex file with an index. You will need to find and index output files yourself (e.g. neci.out).
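For orientation, the resulting layout might look roughly like this; the directory and file names follow the iCCMC example above, and the exact figure-directory names depend on your paper:

iCCMC/
    iCCMC.tex
    data_dump/
        metadata.tex
        figure 1/   (plotting script and data files for figure 1)
        figure 2/   ...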
NB: The script uses git to find the date for each file.
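For reference, one common way to ask git for the last commit date of a file from Python is sketched below; this illustrates the general idea only and is not necessarily how create_data_dump.py does it internally:

import subprocess

def git_date(path):
    # Date (yyyy-mm-dd) of the last commit that touched 'path', according to git.
    out = subprocess.check_output(
        ["git", "log", "-1", "--format=%ad", "--date=short", "--", path])
    return out.decode().strip()

If a file has never been committed, git has no date for it.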
NB2: The script was written by an enthusiastic music student, not a computer scientist. You might want to check that it has done everything correctly after running it!