Latest revision as of 11:22, 14 September 2015

Data archiving procedure (for EPSRC-funded researchers)

Requirements

The EPSRC requires that all publications with a publication date on or after 1st May 2015 include a statement describing how to access the underlying research data. This means the data must be publicly available and easy to understand.

The Thom Group uses the university data repository (http://www.data.cam.ac.uk/repository) for this purpose. For every paper, there should be a corresponding directory in the repository with a metadata file describing its contents.


Required information

When the data was created (yyyy-mm-dd)

How the data was created:

  • If you used publicly available software, it is sufficient to state the name and version of the software, e.g. Q-Chem (Version 4.0.1, Q-Chem, Inc., Pittsburgh, PA (2007), www.q-chem.com).
  • If you used your own code, or software which is not publicly available, you should include a copy of that code or, failing that, details of the software and its creators, e.g. ‘Data produced using the Thom Group's qcmagic script’.

How the data was plotted. Including the code you used to plot the data is sufficient, as long as it is clear how it can be run (i.e. it does not refer to data files which are missing or in a different directory).
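For concreteness, one entry in such a metadata index might look like the following (the file names, versions, and layout here are hypothetical, not a prescribed format):

```
Figure 1 (NePlateaux/NePlateaux.pdf)
WHEN:    2014-08-15
HOW:     Q-Chem (Version 4.0.1, Q-Chem, Inc., Pittsburgh, PA (2007),
         www.q-chem.com); input file NePlateaux.in
PLOTTED: NePlateaux.py (Python 2.7), reads NePlateaux.dat
```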


Archive as you go

You can make the archiving stage much easier by preparing for it as you research, rather than trying to clean up messy data and unintelligible code at the end. Here are some tips:

WHEN CREATING DATA:

Organise the data files logically. If it is possible to organise them by figure, do so. This might not be possible if your publication plots the same data in lots of different ways; in that case, make sure the way you organise your data is obvious to others (avoid meaningless acronyms, e.g. H2-sto_trial1_TEST5.dat). Keep an index noting when and how the data was created:

WHEN: yyyy-mm-dd
HOW: software used, version, input file
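For example, a by-figure layout for a paper might look like this sketch (the second figure directory and the index file names are hypothetical):

```
iCCMC/
    iCCMC.tex
    NePlateaux/
        NePlateaux.dat      <- raw data
        NePlateaux.py       <- plotting script
        index.txt           <- WHEN/HOW entries
    BeLines/
        BeLines.dat
        BeLines.py
        index.txt
```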

WHEN PLOTTING DATA:

Make your plots easy to use (ideally, one plotting script per figure). Make sure your script only imports data it is going to use. As with data creation, keep an index of the version and date (e.g. Python 2.7, 2014-08-15).
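A minimal plotting script in this spirit might look like the sketch below. The file names are hypothetical, and the demo writes its own tiny .dat file so it is self-contained; in practice the data file already exists and the commented-out matplotlib lines do the plotting:

```python
# NePlateaux.py -- plotting script for one figure (file names hypothetical).
# Index entry for this script: Python 3, 2014-08-15.

def load_columns(path):
    """Read a whitespace-separated .dat file into a tuple of float columns."""
    with open(path) as f:
        rows = [[float(x) for x in line.split()]
                for line in f if line.strip() and not line.startswith("#")]
    return tuple(zip(*rows))  # transpose rows -> columns

# Demo data so the sketch is runnable; in practice the .dat file already exists.
with open("NePlateaux.dat", "w") as f:
    f.write("# step  energy\n1 -128.1\n2 -128.5\n3 -128.6\n")

# Import only the data this figure actually uses.
steps, energies = load_columns("NePlateaux.dat")

# The plot itself would go here, e.g. with matplotlib:
#   import matplotlib.pyplot as plt
#   plt.plot(steps, energies); plt.savefig("NePlateaux.pdf")
print(steps)     # (1.0, 2.0, 3.0)
print(energies)  # (-128.1, -128.5, -128.6)
```

Keeping each script down to one figure, one data file, and no stray imports is what makes the archive easy to check later.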

Example

See this example of properly indexed data in the university repository: https://www.repository.cam.ac.uk/handle/1810/250382


Using the Thom Group create_data_dump script

If you have your data organised like the above example, you might be able to use our create_data_dump.py script to index the data for you! Here is what you need for it to work:

  • The absolute path to the .tex file (e.g. /home/rc566/Papers/iCCMC/iCCMC.tex)
  • The .tex file must refer to directories within its own directory (e.g. \includegraphics{NePlateaux/NePlateaux})
  • The script you used to plot the data must end in ".py"
  • The data files which the plotting script refers to must end in ".dat" or ".out", or include the string "FCIMCStats" (this can be altered in figure_classes.py)
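The scanning logic behind these requirements can be sketched as follows. This is an illustration only, not the actual create_data_dump.py; the regex and function names here are assumptions:

```python
import re

# Pattern for \includegraphics[...]{dir/figure}; the braces hold a path
# relative to the .tex file's own directory, as required above.
INCLUDE_RE = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]+)\}")

def figure_dirs(tex_source):
    """Return the directory part of every \\includegraphics path found."""
    return [path.split("/")[0] for path in INCLUDE_RE.findall(tex_source)]

def is_data_file(name):
    """Mirror the data-file rule: .dat, .out, or containing 'FCIMCStats'."""
    return name.endswith((".dat", ".out")) or "FCIMCStats" in name

tex = r"\includegraphics[width=\linewidth]{NePlateaux/NePlateaux}"
print(figure_dirs(tex))            # ['NePlateaux']
print(is_data_file("neci.out"))    # True
print(is_data_file("notes.txt"))   # False
```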

To run the script, navigate to the directory it is stored in, then use:

  python create_data_dump.py <path_to_tex_file>

e.g.

  python create_data_dump.py /home/rc566/Papers/iCCMC/iCCMC.tex

If you navigate to the directory containing your .tex file, you'll find that a data_dump directory has appeared. Inside it there is one directory for each figure (labelled figure #, where # is the figure number), and a metadata.tex file with an index. You will still need to find and index output files (e.g. neci.out) yourself.

NB: The script uses git to find the date for each file.
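That lookup amounts to asking git for the date of the last commit touching each file. A self-contained sketch of the idea, assuming git is installed (it builds a throwaway repository just for the demo):

```python
import os
import subprocess
import tempfile

def git(*args, cwd):
    """Run a git command in the given directory and return its stdout."""
    return subprocess.run(["git", *args], cwd=cwd, check=True,
                          capture_output=True, text=True).stdout.strip()

# Throwaway repository so the sketch runs anywhere git is installed.
repo = tempfile.mkdtemp()
git("init", cwd=repo)
git("config", "user.email", "demo@example.com", cwd=repo)
git("config", "user.name", "Demo", cwd=repo)
with open(os.path.join(repo, "NePlateaux.dat"), "w") as f:
    f.write("1 -128.1\n")
git("add", "NePlateaux.dat", cwd=repo)
git("commit", "-m", "add data", cwd=repo)

# The lookup itself: date of the last commit that touched the file.
date = git("log", "-1", "--format=%ad", "--date=short", "--",
           "NePlateaux.dat", cwd=repo)
print(date)  # yyyy-mm-dd of that commit
```

One consequence of this approach: files that have never been committed have no date for the script to find, so commit your data before running it.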

NB2: The script was written by an enthusiastic music student, not a computer scientist. You might want to check that it has done everything correctly after running it!