DataONE Focus Group

Held Thursday 2 October during the DataONE All Hands Meeting in Albuquerque, New Mexico. Attendees included Matt Jones, Dave Vieglais, Carly Strasser, Patricia Cruse, Amber Budden, Rebecca Koskela, Bill Michener, Ben Leinfelder, Stephanie Wright, Amy Hodge, Paolo, David Bloom, Bruce Grant, Bruce Wilson, Gail Steinhart, …

Questions

  1. What do researchers want to know about their own data?
  2. What do researchers want to know about other people’s data?
  3. Which metrics are most useful for judging impact? Quality?
  4. What do repositories want to know about the data they hold?
  5. What decisions might be informed by that information?
  6. What metrics are repositories already looking at?
  7. What privacy issues are repositories concerned about?
  8. For each potential metric, what sources should we draw from?
  9. Are there other initiatives that we should be aware of?
  10. Besides researchers, data publishers, and repositories, what other communities might find value in data-level metrics (DLMs)?

Responses

1. What do researchers want to know about their own data?

  • Who’s using it
    • Their name, credentials, contact info
    • What other data are people who use your data downloading?
    • What data do people who use your data have in the system?
  • How exactly they are using it - purpose
    • Are they going to publish something based on it?
    • Do they have a project that makes use of the data?
    • Are they using it in research? teaching? in government/decision making?
    • Using all of it or subsets?
    • Combining with other datasets?
    • Is it being used by an organization trying to promote bad science? (i.e., inappropriately)
  • If a paper was published using the data
  • If someone is selling your data
  • Where a dataset was downloaded from - there might be other copies; ideally they would have the same identifier but they may not.
  • Are the same people citing datasets and papers?

2. What do researchers want to know about other people’s data?

  • Are there big gaps or missing data?
  • If others downloaded it, did they think it was good?
  • Links to publications associated with the data
    • Has a publication come from the data?
    • Publications that cite the dataset
  • Data quality
    • Quality assurance records
    • Scripts used for quality control
  • Restrictions on its use
  • Who do I know that has used the dataset?
  • How often has it been downloaded? By whom? By people I trust?

3. Which metrics are most useful for judging impact? Quality?

  • Publications
    • How many publications has the dataset led to?
    • How many publications are immediately associated with the dataset?
    • How many publications have been derived from the dataset?
    • Publications by people who own the data versus publications by others (self-citation)
    • Quality of the publications
  • Data use
    • Number of downloads, and on what time scale
    • How many different disciplines have used the dataset?
    • Use in education (many downloads in a short period from one institution suggest educational use, especially when tracked to .edu addresses; see the sketch after this list)
    • Use in decision-making
  • Has the dataset been peer-reviewed? (quality)
    • The effort that went into peer review can indicate impact
    • Ted Habermann's quality rubrics for data/metadata; if a dataset has gone through something like that, it's a good indication of quality
  • Reproducibility: has it been reproduced? Has someone documented a reproduction of the dataset?
  • Does it lead to legislation, land management, policy, etc.?
  • Who funded the data collection?
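
One way to operationalize the educational-use heuristic above, as a minimal sketch: the log format, the .edu check, the 24-hour window, and the threshold of ten downloads are all illustrative assumptions, not values the group specified.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def flag_educational_use(downloads, window=timedelta(hours=24), threshold=10):
    """Flag institutions whose download pattern suggests classroom use:
    many downloads of one dataset from the same .edu domain in a short
    window. `downloads` is a list of (timestamp, hostname) pairs for a
    single dataset -- a hypothetical log shape, not a real repository API.
    """
    by_institution = defaultdict(list)
    for ts, host in downloads:
        if host.endswith(".edu"):
            # Collapse subdomains, e.g. "cs.unm.edu" -> "unm.edu".
            by_institution[".".join(host.split(".")[-2:])].append(ts)

    flagged = set()
    for institution, times in by_institution.items():
        times.sort()
        start = 0
        for end in range(len(times)):
            # Shrink the window so it spans at most `window` of time.
            while times[end] - times[start] > window:
                start += 1
            if end - start + 1 >= threshold:
                flagged.add(institution)
                break
    return flagged

# Twelve downloads within eleven minutes from one campus -> flagged.
log = [(datetime(2014, 10, 2, 9, m), "cs.unm.edu") for m in range(12)]
print(flag_educational_use(log))  # {'unm.edu'}
```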

4. What do repositories want to know about the data they hold?

  • Its impact
  • Whether there are any security issues
  • Usage
    • Downloads, page visits
    • Knowing what community is downloading the data
    • Usage in the primary literature
    • Usage in science outside the literature
    • Usage in education
  • Where the data were cited (a hard reference) versus where they were referred to (a softer indication)
  • How often people are depositing data into the repository
  • Number of datasets
  • Number of people depositing datasets
  • Click streams within sessions - how users navigate through the system (see the sketch after this list)
  • Relationships among datasets - cousins, derived, etc.; it would be great to have a visual of this
  • Relationships between dataset usage: links between things and how they propagate (publications back to datasets, back to people, etc.)
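
To make the click-stream item above concrete, here is a minimal sessionization sketch; the event shape and the 30-minute inactivity timeout are common conventions assumed for illustration, not details from the discussion.

```python
from datetime import datetime, timedelta

def sessionize(events, timeout=timedelta(minutes=30)):
    """Group click events into per-user sessions split on inactivity.
    `events` is a list of (user_id, timestamp, page) tuples -- a
    hypothetical log shape. Returns each session's page path, i.e.
    how a user navigated through the system.
    """
    events = sorted(events, key=lambda e: (e[0], e[1]))
    sessions = []
    last_user = last_time = None
    for user, ts, page in events:
        # A new user, or a gap longer than `timeout`, starts a new session.
        if user != last_user or ts - last_time > timeout:
            sessions.append([])
        sessions[-1].append(page)
        last_user, last_time = user, ts
    return sessions

# Two visits by one user, two hours apart -> two separate sessions.
t0 = datetime(2014, 10, 2, 9, 0)
clicks = [("u1", t0, "/search"),
          ("u1", t0 + timedelta(minutes=5), "/dataset/42"),
          ("u1", t0 + timedelta(hours=2), "/dataset/42/download")]
print(sessionize(clicks))  # [['/search', '/dataset/42'], ['/dataset/42/download']]
```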

5. What repository decisions might be informed by that information?

  • When data isn't being used, why not?
  • Return on investment of developing a repository
  • Deaccession (varies depending on the repository's mission)
  • Ability to get outside funding
  • Developing internal cost models for charging users for data/services
  • Ability to present views of the repository that are more customized for the user, e.g., user pages that describe how the data are used
  • Collection development decisions
  • Scoping repository submissions to limit what is accepted in the future
  • Defining service tiers
  • Defining priorities
  • University-level impact of data

6. What metrics are repositories already looking at?

  • Related publications
  • Publication citation counts
  • Usage stats - downloads

7. What privacy issues are repositories concerned about?

  • European privacy laws / international privacy laws
  • Violations of HIPAA
  • Other organizations sucking data out of the repository (e.g., mirrors of the ORNL DAAC running in China mean ORNL doesn't get the download counts)

8. For each potential metric, what sources should we draw from?

  • Textbooks, for educational use
  • Primary literature, publications
  • Assessments that reference multiple datasets
  • National assessments
  • Google searches for links to a dataset; a link will show up in a syllabus, for example
  • Download stats from repositories; feeds from institutional repositories (IRs)
  • Funding bodies
  • Data Citation Index

9. What other projects should we know about?

  • Thomson Reuters Data Citation Index
  • RDA working groups - one on the article/data connection
  • ImpactStory
  • Figshare
  • Elsevier - has a data-related program with a web page soliciting input on which metrics people want (emdp.elseiver)
  • ThinkUp - measures the impact of social media; a commercial endeavour

10. Who are the stakeholders?

  • Researchers
  • Data repositories
  • Librarians
  • Funders
  • Policy makers
  • University administrators
  • Offices of research
  • Publishers
  • Federal agencies - which data are most trustworthy?