DataONE Focus Group

Held Thursday 2 October during the DataONE All Hands Meeting in Albuquerque, New Mexico. Attendees included Matt Jones, Dave Vieglais, Carly Strasser, Patricia Cruse, Amber Budden, Rebecca Koskela, Bill Michener, Ben Leinfelder, Stephanie Wright, Amy Hodge, Paolo, David Bloom, Bruce Grant, Bruce Wilson, Gail Steinhart, …

Questions

  1. What do researchers want to know about their own data?
  2. What do researchers want to know about other people’s data?
  3. Which metrics are most useful for judging impact? Quality?
  4. What do repositories want to know about the data they hold?
  5. What decisions might be informed by that information?
  6. What metrics are repositories already looking at?
  7. What privacy issues are repositories concerned about?
  8. For each potential metric, what sources should we draw from?
  9. Are there other initiatives that we should be aware of?
  10. Besides researchers, data publishers, and repositories, what other communities might find value in data-level metrics (DLMs)?

Responses

1. What do researchers want to know about their own data?

  • Who’s using it
    • Their name, credentials, contact info
    • What other data are people who use your data downloading?
    • What data do people who use your data have in the system?
  • How exactly they are using it - purpose
    • Are they going to publish something based on it?
    • Do they have a project that makes use of the data?
    • Are they using it in research? teaching? in government/decision making?
    • Using all of it or subsets?
    • Combining with other datasets?
    • Is it being used by an organization trying to promote bad science? (i.e., inappropriately)
  • If a paper was published using the data
  • If someone is selling your data
  • Where a dataset was downloaded from - there might be other copies; ideally they would have the same identifier but they may not.
  • Are the same people citing datasets and papers?

2. What do researchers want to know about other people’s data?

  • Are there big gaps or missing data?
  • If others downloaded it, did they think it was good?
  • Links to publications associated with the data
    • Has a publication come from the data?
    • Publications that cite the dataset
  • Data quality
    • Quality assurance records
    • Scripts used for quality control
  • Restrictions on its use
  • Who do I know that has used the dataset?
  • How often has it been downloaded? By whom? By people I trust?

3. Which metrics are most useful for judging impact? Quality?

  • Publications
    • How many publications has the dataset led to?
    • How many publications are immediately associated with the dataset?
    • How many publications have been derived from the dataset?
    • Publications by people who own the data versus publications by others (self-citation)
    • Quality of the publications
  • Data use
    • Number of downloads, and on what time scale
    • How many different disciplines have used the dataset?
    • Use in education (many downloads in a short period from one institution suggest educational use, especially when tracked to .edu addresses; see the sketch after this list)
    • Use in decision-making
  • Has the dataset been peer-reviewed? (quality)
    • The effort that went into peer review can indicate impact
    • Ted Habermann's quality rubrics for data/metadata; if a dataset has gone through something like that, it's a good indication of quality
  • Reproducibility: has it been reproduced? Has someone documented a reproduction of the dataset?
  • Does it lead to legislation, land management, policy, etc.?
  • Who funded the data collection?
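
One way to operationalize the educational-use heuristic above, as a minimal sketch: the log format, the .edu check, the 24-hour window, and the threshold of ten downloads are all illustrative assumptions, not values the group specified.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def flag_educational_use(downloads, window=timedelta(hours=24), threshold=10):
    """Flag institutions whose download pattern suggests classroom use:
    many downloads of one dataset from the same .edu domain in a short
    window. `downloads` is a list of (timestamp, hostname) pairs for a
    single dataset -- a hypothetical log shape, not a real repository API.
    """
    by_institution = defaultdict(list)
    for ts, host in downloads:
        if host.endswith(".edu"):
            # Collapse subdomains, e.g. "cs.unm.edu" -> "unm.edu".
            by_institution[".".join(host.split(".")[-2:])].append(ts)

    flagged = set()
    for institution, times in by_institution.items():
        times.sort()
        start = 0
        for end in range(len(times)):
            # Shrink the window so it spans at most `window` of time.
            while times[end] - times[start] > window:
                start += 1
            if end - start + 1 >= threshold:
                flagged.add(institution)
                break
    return flagged

# Twelve downloads within eleven minutes from one campus -> flagged.
log = [(datetime(2014, 10, 2, 9, m), "cs.unm.edu") for m in range(12)]
print(flag_educational_use(log))  # {'unm.edu'}
```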

4. What do repositories want to know about the data they hold?

  • Its impact
  • Whether there are any security issues
  • Usage
    • Downloads, page visits
    • Knowing what community is downloading the data
    • Usage in the primary literature
    • Usage in science outside the literature
    • Usage in education
  • Where the data were cited (a hard reference) versus where they were referred to (a softer indication)
  • How often people are depositing data into the repository
  • Number of datasets
  • Number of people depositing datasets
  • Click streams within sessions - how users navigate through the system (see the sketch after this list)
  • Relationships among datasets - cousins, derived, etc.; it would be great to have a visual of this
  • Relationships between dataset usage: links between things and how they propagate (publications back to datasets, back to people, etc.)
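
To make the click-stream item above concrete, here is a minimal sessionization sketch; the event shape and the 30-minute inactivity timeout are common conventions assumed for illustration, not details from the discussion.

```python
from datetime import datetime, timedelta

def sessionize(events, timeout=timedelta(minutes=30)):
    """Group click events into per-user sessions split on inactivity.
    `events` is a list of (user_id, timestamp, page) tuples -- a
    hypothetical log shape. Returns each session's page path, i.e.
    how a user navigated through the system.
    """
    events = sorted(events, key=lambda e: (e[0], e[1]))
    sessions = []
    last_user = last_time = None
    for user, ts, page in events:
        # A new user, or a gap longer than `timeout`, starts a new session.
        if user != last_user or ts - last_time > timeout:
            sessions.append([])
        sessions[-1].append(page)
        last_user, last_time = user, ts
    return sessions

# Two visits by one user, two hours apart -> two separate sessions.
t0 = datetime(2014, 10, 2, 9, 0)
clicks = [("u1", t0, "/search"),
          ("u1", t0 + timedelta(minutes=5), "/dataset/42"),
          ("u1", t0 + timedelta(hours=2), "/dataset/42/download")]
print(sessionize(clicks))  # [['/search', '/dataset/42'], ['/dataset/42/download']]
```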

5. What repository decisions might be informed by that information?

  • When data isn't being used, why not?
  • Return on investment of developing a repository
  • Deaccession (varies depending on the repository's mission)
  • Ability to get outside funding
  • Developing internal cost models for charging users for data/services
  • Ability to present views of the repository that are more customized for the user, e.g., user pages that describe how the data are used
  • Collection development decisions
  • Scoping repository submissions to limit what is accepted in the future
  • Defining service tiers
  • Defining priorities
  • University-level impact of data

6. What metrics are repositories already looking at?

  • Related publications
  • Publication citation counts
  • Usage stats - downloads

7. What privacy issues are repositories concerned about?

  • European privacy laws / international privacy laws
  • Violations of HIPAA
  • Other organizations sucking data out of the repository (e.g., mirrors of the ORNL DAAC running in China mean ORNL doesn't get the download counts)

8. For each potential metric, what sources should we draw from?

  • Textbooks, for educational use
  • Primary literature, publications
  • Assessments that reference multiple datasets
  • National assessments
  • Google searches for links to a dataset; a link will show up in a syllabus, for example
  • Download stats from repositories; feeds from institutional repositories (IRs)
  • Funding bodies
  • Data Citation Index

9. What other projects should we know about?

  • Thomson Reuters Data Citation Index
  • RDA working groups - one on the article/data connection
  • ImpactStory
  • Figshare
  • Elsevier - has a data-related program with a web page soliciting input on which metrics people want (emdp.elseiver)
  • ThinkUp - measures the impact of social media; a commercial endeavour

10. Who are the stakeholders?

  • Researchers
  • Data repositories
  • Librarians
  • Funders
  • Policy makers
  • University administrators
  • Offices of research
  • Publishers
  • Federal agencies - which data are most trustworthy?