Research Data Management

What is an RDM Plan?

Research Data Management (RDM) requires a plan, agreed with the research council at the point of funding, that describes how data will be managed effectively.
  • Institutional RDM Policies
  • Data Handling
  • Data Policy

Each institution has policies on how RDM Plans will be implemented. For example, important experimental results may have to be repatriated to the home institution and deposited in particular repositories. Note that research councils may have additional requirements on top of this, although the institutional policy should already cover most of those requirements. Information on these policies is currently being gathered together, and links to your home institution's policies will be provided for reference.

Complying with the Policy

Each N8 institution currently has an RDM plan in place or in progress. Technical measures may be available from your institution to help you comply, and more information will be added here as these policies and technical measures develop.

Leaving your Institution

If you are leaving your institution you may be required to take specific actions to ensure compliance with your institutional RDM policies. You should also inform your local N8 HPC helpdesk in good time so that the team can assist with compliance.

Cross-Institutional Research

If you are engaging in cross-institutional research you should either have a research data management plan which takes this into account, including the ownership of the data, or look to develop such a plan and agreement. It is useful to consider these issues at the earliest opportunity. Your local institution and the N8 support staff can offer advice on how to do this.

Data Handling

Handling data is a challenging area. You should first ensure that your handling of data complies with policy and research data management requirements, but there are additional practices which will help keep your data organised and support those requirements.

Security

Ensure that your data is visible only to those you want to see it. It is safest to keep write access more restricted than read access, to ensure that people do not accidentally change data that you rely on. The Linux command chmod can be used to change read and write permissions, and is based on the concepts of owner (user), group, and other (world), plus all (all of the above), e.g.,

chmod a+r somefile.txt

will make the file somefile.txt readable by anyone (provided the directory it is in is also readable and traversable). You can use

chmod a+rx .

to make the current directory readable and traversable.

To make a file writable by everyone you would do

chmod a+w somefile.txt

The concept of groups is relatively coarse-grained, but if you do

ls -l somefile.txt

you will see information including the owner (e.g. yrkat) and group (e.g. yrk). Unix groups can be set up to aid sharing with the people you wish to share with. Bear in mind that this requires the system administrators to set up new groups, add people to them, and maintain their membership, and it requires that you assign a file to the appropriate group for sharing. For example, if you wish to share somefile.txt with a group called somefileusers then you first need to change the group of the file via

chgrp somefileusers somefile.txt

then make it visible (e.g. readable) via

chmod g+r somefile.txt

Options for classes of users are

  • u - user (owner)
  • g - group
  • o - other
  • a - all (user, group, and other)

and permission options are

  • x - make a directory traversable, or an executable file executable
  • r - read
  • w - write
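
These can be combined. As a minimal sketch (the filename is illustrative), the following gives the group read and write access to a file while removing all access for others:

chmod g+rw,o-rwx somefile.txt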

Note that if you do chmod u-r somefile.txt you will make the file unreadable even to yourself.

Note that there are some files (e.g. those related to security keys, SSH, SSL, etc.) which you should never make visible to others.
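
For example, SSH expects private keys to be readable only by their owner. As a sketch, assuming the common default key location (yours may differ):

chmod go-rwx ~/.ssh/id_rsa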

Recording Information About Data Usage

It is generally good practice to ensure that your Sun Grid Engine job scripts record information about the datasets you use, for example checksum information or version numbers. Note that it is not recommended for every job to create a checksum (e.g. using cksum), but at the start of an array of jobs, or of long-running jobs, it may be appropriate. This can be done by creating a small Sun Grid Engine job to compute the checksum, and then making the array job depend on that job. It is particularly useful given the potential expiry of data in /nobackup, as you can refer back to previous job information to ensure you obtain the appropriate copy of an expired dataset. It is also good practice in that it allows reproducibility of results.
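
A minimal sketch of this pattern, assuming Grid Engine's qsub and illustrative file paths and job names:

#!/bin/bash
# checksum.sh - records a checksum of the dataset in the job output
cksum /nobackup/myproject/data.txt

submitted so that the array job only starts once the checksum job has completed:

qsub -N record_checksum checksum.sh
qsub -hold_jid record_checksum -t 1-100 array_job.sh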

Storage of large data sets

These fall into two main categories, reference datasets from an external location, or large intermediate results sets from an analysis you have done.

General advice

If you are using a large data set then it should be stored on /nobackup. The N8 system does not offer an archival service for large data sets, so the primary data sets you use should exist elsewhere, and should carry sufficient versioning information that you can obtain a fresh copy if you require it (e.g. should /nobackup fail for some reason). Your job scripts should record the dataset and version being used, to allow you to obtain a fresh copy of the exact same data should you need to rerun experiments (see the section above on Recording Information About Data Usage for more hints on this).

If you expect not to use a large data set for a while (and do not expect anyone else to use it), then please remove it to help ensure that there is sufficient space in /nobackup. You will need to weigh how long it would take to copy or recreate the data set against how often you use it, to determine when to remove it. /nobackup will ultimately time out and remove content that has not been changed or accessed, but you should also be proactive in managing your data. It is not worth being too aggressive about removing data, however, as you may find you then spend too long downloading fresh versions. Be aware that data you do not use in /nobackup will ultimately be timed out, and take this into account when scheduling your jobs.
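
When judging whether to remove a data set, it can help to check its size and when it was last accessed. A sketch, with an illustrative path:

du -sh /nobackup/myproject/refdata
ls -lu /nobackup/myproject/refdata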

Recreation of intermediate datasets

To do this you should ensure you record all the relevant information about how you created the intermediate dataset: the job scripts used; the source code (if available), along with details of the compiler versions used, the libraries linked against, and so on, or else the binary program (or both); details of the primary data set; and how long it took to create. The creation time can be used to determine the compute cost of deleting the intermediate dataset. If you cannot exactly recreate an intermediate data set (e.g. the process used to create it is non-deterministic) then you should consult the data management plan your project uses to determine whether you should keep the intermediate dataset.
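
A minimal sketch of recording such provenance alongside the results (all names, paths, and values are illustrative):

{
  echo "created: $(date -u)"
  echo "job script: run_analysis.sh"
  echo "compiler: $(gcc --version | head -n 1)"
  echo "primary data set: refdata version 2.1"
  echo "creation wall time: 6h"
} > PROVENANCE.txt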

Checking for existence of datasets in job scripts

Given that /nobackup will potentially time out content you have not used for a while, your job scripts should include a check for the existence of the data you need before running: failure to find data can, if you are running a large array job under Grid Engine, create potentially tens of thousands of error emails. Various schemes to check for the existence of data are available, e.g. use the

if [ -f somefile.txt ]; then
    # ... main part of job ...
else
    echo "failed to find data"
    exit 1
fi

construction in Linux shell scripts to test for the existence of a file (similar constructions exist for checking for directories, and in Perl, Python, etc.). If the required data does not exist then your job script should send a message to you (not the system admins) and exit gracefully.

In general a more useful pattern is to check whether the data exists, and if it does not, automatically fetch the copy you want, provided you can obtain it from a public location that you can authenticate to easily. How you do this depends on how you can obtain a fresh copy of the data (e.g. using wget or curl). If you do this you should check that the download of the data actually succeeded before continuing with your job script. A useful pattern in a shell script is to check the value of $?, which is the return value of the last command run, e.g.,

wget http://www.thing.com/data
if [ $? != 0 ]; then
    echo "oops - download failed"
    exit 1
fi

Note that normally a successful command in Linux will return a value of 0.
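
Putting these together, a sketch of the check-or-fetch pattern (the filename and URL are illustrative):

DATAFILE=data
if [ ! -f "$DATAFILE" ]; then
    wget "http://www.thing.com/$DATAFILE"
    if [ $? != 0 ]; then
        echo "download of $DATAFILE failed"
        exit 1
    fi
fi
# ... main part of job ...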

Revision Control

Revision control (such as git or subversion) is useful for controlling source code and job scripts. However, it is not generally a good candidate for controlling large amounts of data.

For source control it is more appropriate to use a source code repository at your home institution that allows you to easily use integrated development environments (IDEs) on your standard institutional end-user machine/desktop.

If you are debugging and modifying code on the N8 system then it makes sense to check out the code from the standard location of the repository you use, make modifications, and then check those back into the repository from the N8 system.
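
As a sketch of that workflow with git (the repository URL is hypothetical):

git clone https://git.example.ac.uk/myproject/code.git
cd code
# ... edit, build, and test on the N8 system ...
git commit -am "fix issue found while running on N8 HPC"
git push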

Ownership

Storage of the data on the N8 HPC system does not affect the ownership of the data. You are asked to engage with N8 HPC to allow it to operate on your behalf and that of your institution, and further guidance on how to do this will be made available soon.

Appropriate Use of Data Sets

You should ensure you have the right to use the data you wish to on the N8 HPC facility, and that you are not infringing any copyright, licensing, or other agreements. In some instances the data sets you have may be restricted to use within your own institutional boundaries, and you must seek permission from the rights owners to allow use on N8 HPC. In cases where a data set licence restricts it to a particular group of individuals, the N8 HPC helpdesk can create a Unix group to act as an access list. You must ensure that the data is available only to that group. If the data is sensitive then please see below.
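
Once such a group exists, a sketch of restricting a dataset directory to its members (the group name and path are illustrative):

chgrp -R somedatausers /nobackup/myproject/licensed_data
chmod -R g+rX,o-rwx /nobackup/myproject/licensed_data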

Sensitive Data

At the present time, sensitive data is not appropriate for use on N8 HPC. If your data contains identifiable individuals, or is commercially or otherwise sensitive, then please contact your N8 HPC helpdesk for further advice.

Required Acknowledgements

Please acknowledge N8 HPC in any datasets created on the facility, and in any use of data on N8 HPC. A reference to cite is available.
