Data Curation: an overview and a handy “how to”

Data curation is complex. Researchers have reservations about it because of the perceived threat of intellectual property theft. They also question the usefulness of placing raw data in a repository: if the raw data is undecipherable to others, entering it is a waste of time. Institutions have concerns over the cost of data curation, and librarians see it as a gargantuan task with no reasonable path or solution (Purdue, n.d.). But regardless of what anyone thinks about data curation or data management, funding agencies are starting to require researchers and universities to come up with solutions to this problem.

The National Science Foundation (NSF) brought the problem of data curation to a head in January of 2011 by requiring researchers to include data management plans in their grant proposals. NSF also required researchers to share their data (generally via a publicly accessible repository). This was not a new request; data management plans had been a part of the NSF’s policy for a number of years, but in 2011 the agency started enforcing the requirement. To complicate things further, NSF is not the only funding agency enforcing this type of requirement. The National Institutes of Health also requires researchers to enact data management plans for grants exceeding $500,000 (MIT Libraries, 2011). Many other funding agencies are sure to follow.

Within librarianship, the task of keeping libraries relevant is a constant topic of discussion. In a world of Google Books, Kindles and the seemingly endless resources that the internet provides, how can libraries and librarians remain relevant? Rick Anderson highlights this issue in his article “Away From the ‘Icebergs.’” He makes the case that in the past, libraries held a monopoly on access to information and patrons were forced to seek the library out. Now that information is everywhere, librarians must start actively seeking their patrons out. Librarians must serve patrons in the areas where patrons are comfortable and active: on the web, at work, in study and in play (2006). With this sentiment in mind, librarians can use data curation to assist researchers in their daily work lives by helping to solve the data management problem. Information management is the librarian’s domain, and it would be a shame not to take advantage of this opportunity to remain relevant and necessary in the lives of patrons. Librarians must take an active role in solving the data problem.

Both the idea of data curation and the term itself are relatively new. In the past, a researcher would store data away in their office and publish only their results. While it was suggested that researchers deposit their data into a repository where other researchers could access it, this suggestion was not enforced and therefore not followed. And because there was no incentive to curate one’s data, and the task seemed very difficult with very little return on the investment, no widespread standard for how data curation is done ever emerged. In the past, data was not curated; it was ignored.

There are several individual examples of data management. MIT provides detailed guidelines for creating a data management plan, provides a list of data repositories and has its own DSpace repository for its professors to use (MIT Libraries, 2011). The University of Edinburgh also has a detailed plan, with guidelines on creating data management proposals and on how to store, back up and share data with other researchers (University of Edinburgh, 2011). Cornell University has a similar data management plan, but rather than directing researchers to outside repositories, Cornell provides a repository of its own (eCommons). Like MIT’s repository, eCommons is powered by DSpace, archive software geared to serve colleges and universities (Cornell University, 2011).

But how can libraries get involved and provide shape to this massive, non-standardized process? Purdue University Libraries worked with the Graduate School of Library and Information Science at the University of Illinois at Urbana-Champaign to initiate the Data Curation Project and toolkit. They created a data curation profile that could be used to interview researchers (similar to a reference interview). The profile and interview are based on the requirements for a proper data management plan (such as the one required for an NSF grant, referenced above). Once this profile is complete, a researcher will have met most (if not all) of the elements needed for a data management plan.

The role of the librarian is relatively simple, but boundaries must be set early. The librarian acts as a guide to facilitate the data management process; the librarian is not a data entry clerk. The librarian’s role in the interview is to gain a clear picture of the researcher’s data needs. Once the interview and profile are complete, the librarian compiles the information in a usable format, as seen in the completed profiles on Purdue’s data curation website (Purdue, n.d.). From this point, the librarian can help the researcher find a repository that best suits his or her needs. Generally, it is the researcher’s responsibility to have the data deposited into the repository.

In the interview, it is also advisable for the librarian to have a base knowledge of what kind of study the researcher is working on, as this will aid in creating the profile. A librarian does not have to be an expert in genomics, but it would certainly help to know that genomics is a field that relates to DNA sequencing and genetic mapping. It would further help to have an idea of what the researcher is working on specifically: is the researcher trying to determine how DNA is affected by disease, or how DNA mutates naturally? This matters because some data repositories are subject specific. The more you know, the better you can assist the researcher in creating a data management plan and finding an appropriate repository that fits their unique data requirements.

Finally, finding the right data repository is possibly the most important and most difficult task the librarian will face in this process. It is hard not only because each repository operates differently, but also because there are countless repositories and it is impossible to memorize how each one works. Some repositories, like Dryad, are incredibly simple to use. I found Dryad on MIT’s list of data repositories (MIT, 2011); the Dryad site is polished, easy to navigate and has a link to a simple how-to video tutorial. You simply create an account, make appropriate selections regarding embargo, ownership, property rights, etc., and describe the data appropriately to give it context and relevance. Dryad is a great repository, but it is only useful for basic and applied biological sciences. Not only is it easy to put data into Dryad, it is also easy to search for data in Dryad and find data that is relatively easy to understand (even for a layperson). For example, I searched under the term “animal,” found a list, clicked on the first dataset and found a link to a photo of a bone from a newly discovered dinosaur.

In contrast to the ease and simplicity of Dryad, there is SIMBAD, an astronomical database. I also found SIMBAD through MIT’s list of data repositories. SIMBAD is clunky and looks like it was designed in the 1990s. After searching the site for over an hour and reviewing each (dated) tutorial, I could not find instructions on how to publish new data using SIMBAD. After much frustration, I noticed that the bottom of the page indicates that “SIMBAD on the Web has all the functionalities provided by XSimbad, after the addition of list queries in March 2001. XSimbad software is thus not maintained any more.” (CDS, 2001) While it may seem moot to discuss a repository that is no longer active, this is the type of hit-and-miss search process a librarian would undergo when reviewing data repositories, and it is frustrating. My experience with SIMBAD also illustrates the sometimes temporary nature of data repositories, and this must be considered. If the data life cycle indicates that the data must be maintained for a lifetime, how are you going to manage the data if the repository you used becomes obsolete?

Even though using outside repositories comes with risks and headaches, sometimes the benefits outweigh the risks. This would be the case with Dryad. Using lists like the ones found on MIT’s website (2011) will at minimum reduce the legwork and serve as a great first step toward finding an appropriate data repository for your researcher. The Distributed Data Curation Center (D2C2), powered by Purdue University Libraries, is another great tool for finding repositories by subject. D2C2’s list contains over 50 repositories, and SIMBAD is not one of them.

Alternatively, some universities have moved to creating their own data repositories. Cornell University and MIT have used DSpace to create theirs. DSpace is open source software geared toward serving colleges, universities and other members of academia like museums, archives and libraries. DSpace is free, can take less than a day to install and configure, and is relatively easy to use. However, it does require the user or administrator to manage the software and create the general setup and features. DSpace is also configured to be automatically picked up by search engines like Google (assuming the datasets are in published/released format). It is also possible to set up DSpace with embargoes, copyright provisions and security features to ease researchers’ concerns about intellectual property theft. For the computer savvy (or at least the technologically adventurous), DSpace is an excellent way to manage data (DSpace, 2011).

No matter what data repository you use, system failures can occur and software will become obsolete (just like it did with SIMBAD), so it is necessary to back up your data. If your data must be preserved for a long period of time, you must take the time frame into consideration. In five years, it is likely that information on a hard drive could still be retrieved. In ten or twenty years, the hard drives we have today will be obsolete. Imagine trying to get information off of a three-and-a-half-inch floppy disk. That isn’t impossible, but it does take work and extra equipment. Now imagine trying to get information off of an eight-inch floppy, if the information is even still readable. It is best to use three different backup methods to preserve data. The methods should include both remote and local storage, and the copies should be kept in different locations. Examples of appropriate backup methods include a hard drive, internet backup products like MozyPro (2011), the free storage that many universities offer their professors, or simply keeping copies at different physical locations like home and work. It is also prudent to make the passwords to these copies accessible to a trusted confidant. Finally, if the data package is not too large, it is best to store data in a non-compressed format for preservation purposes.
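Keeping three copies only helps if the copies still match the original. One way to check, offered as a minimal sketch rather than a prescribed method, is to compare a checksum of each backup against the original file. The Python below uses only the standard library; the function names (sha256_of, verify_copies) and the file paths in the usage comment are my own hypothetical choices, not part of any repository or backup product mentioned above.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large datasets do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copies(original, copies):
    """Compare each backup copy against the original file.

    Returns a dict mapping each copy's path (as a string) to
    "ok", "mismatch", or "missing".
    """
    expected = sha256_of(original)
    report = {}
    for copy in copies:
        copy = Path(copy)
        if not copy.is_file():
            report[str(copy)] = "missing"
        elif sha256_of(copy) == expected:
            report[str(copy)] = "ok"
        else:
            report[str(copy)] = "mismatch"
    return report


# Example usage (hypothetical paths, shown for illustration only):
#   report = verify_copies(
#       "C:/research/dataset.csv",
#       ["E:/backup/dataset.csv", "//university-share/me/dataset.csv"],
#   )
#   for path, status in report.items():
#       print(path, status)
```

A check like this only works for copies you can read as ordinary files (an external drive, a mounted university share); an internet backup service would need its own restore or verification step. Running the check on a schedule, say once a semester, is one way to catch a failing drive before the other copies are also gone.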

In conclusion, data curation is more than just storing data in a safe place. To meet grant requirements, researchers must create data management plans that include plans to publish the data in a format that other researchers can access. Librarians can play a valued and irreplaceable role in data curation by working with researchers to create a data profile and ultimately a data management plan. From there, librarians will have the information needed to match the researcher with the best repository for the dataset, or in some cases, librarians can assist in creating a data repository for their institution (as at Cornell and MIT). What I appreciate about this process, as it is outlined, is that it can be as simple or as complex as the individual institution wants or needs it to be. A librarian can complete the profile/interview, match the researcher with a repository and be finished; this model takes no more than 3-6 hours of effort. Or a librarian can complete the profile/interview, create a data management plan including the specific requirements of the grant, find an appropriate repository, and assist the researcher with long-term data preservation throughout the life cycle of the data. Regardless of how much or how little a library is able to participate in the data curation process, it is necessary that libraries take an active role in this endeavor. In a world where information is everywhere, the roles of libraries and librarians are changing. We must adapt and take on new challenges like data curation or we will become obsolete.


