We used the G8 National Action Plan’s definition of “High-Value Datasets,” recognizing that needs and opportunities are different at the state level than at the national level. We created GitHub issues to select each representative dataset, and in the ensuing discussion we selected the most appropriate state-level dataset to represent that class of data.
Each dataset is evaluated on 12 metrics. Each is worth 5 points, except for “Exists” (which is worth 0 points), “Free” (which is worth 15 points), and “Machine-readable” (which is worth 40 points). Machine readability is worth far more points because it is the essence of open data, and being without cost is worth somewhat more because of its corresponding importance (though not essential nature) to the free flow of data. There is a good argument to be made that open licensing is also quite valuable, but there is inadequate evidence that U.S. states’ claiming copyright on these datasets is a practical obstacle to openness to warrant equal scoring at this time; often, states seem to claim copyright accidentally, via boilerplate copyright statements on their websites. All of these criteria add up to 100 points, upon which grading is based.
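The arithmetic behind the 100-point scale can be sketched as follows. This is an illustration only; the criterion labels here are paraphrased from the descriptions below, not the report’s official names:

```python
# Illustrative weights for the 12 metrics described in this section.
# Nine criteria at 5 points, plus "Free" (15) and "Machine-readable" (40),
# with "Exists" as a zero-point prerequisite, total 100.
WEIGHTS = {
    "Exists": 0,             # prerequisite, worth no points
    "Digital": 5,
    "Public": 5,
    "Free": 15,              # weighted up: cost impedes the free flow of data
    "Online": 5,
    "Machine-readable": 40,  # weighted up: the essence of open data
    "Downloadable in bulk": 5,
    "No copyright claim": 5,
    "Updated regularly": 5,
    "In a data repository": 5,
    "Verifiable integrity": 5,
    "Complete": 5,
}

assert sum(WEIGHTS.values()) == 100  # grading is based on a 100-point scale

def score(dataset):
    """Sum the weights of every criterion the dataset satisfies."""
    return sum(WEIGHTS[c] for c, passed in dataset.items() if passed)
```

For example, a dataset that is free and machine-readable but fails every other criterion would score 55 of 100 points.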
If no data is available, only information (that is, PDF, HTML, Word, etc.), then that information will be inventoried, but recorded as not being machine-readable. However, if any data is available, the data will be inventoried. So if, for example, a legislature has a website with detailed PDFs about every bill and legislator, but provides data only as a single CSV file with vanishingly little information, the CSV will be inventoried, not the PDFs.
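The selection rule above can be expressed as a short decision function. This is a sketch only; the format lists are illustrative, not an exhaustive enumeration of what counts as data versus information:

```python
# Illustrative rule: prefer machine-readable data over information-only
# formats, however sparse the data may be. Format lists are examples.
DATA_FORMATS = {"csv", "json", "sqlite", "xls", "xlsx", "dbf", "sgml", "xml"}
INFO_FORMATS = {"pdf", "html", "doc", "docx"}

def choose_inventory_target(available_formats):
    """Return the formats to inventory, preferring any data over information."""
    data = [f for f in available_formats if f in DATA_FORMATS]
    if data:
        return data
    return [f for f in available_formats if f in INFO_FORMATS]
```

Given a site offering detailed PDFs plus one sparse CSV, this rule inventories only the CSV.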
The grading scale is as follows:
Does this data exist at all?
Does the data exist as electronic data, or is it only on paper?
May the public have access to this data, or is it protected or confidential in some way?
Does the state sell this data, or do they give it away without charge? Any charge, no matter how small, yields a rating of “no.” If information is provided freely on the website, but data is sold (e.g., business entity records can be browsed on the web, but bulk data must be purchased), then this will be given a rating of “no.”
Can it be downloaded over the internet, or must it be transported on removable storage media?
Is this data, or is it mere information? For example, a PDF is not machine-readable data. On the other hand, CSV, JSON, SQLite, Excel, FoxPro, and SGML are all machine-readable data.
Is all of the data that comprises this dataset available in downloadable files that can be retrieved automatically (e.g., wget https://www.example.gov/addresses.csv) at stable, predictable URLs? Or must a human click an “export” button, track down updated filenames, or interface with an API?
Does the state claim copyright over this data? A copyright statement, generally found in the footer of the dataset’s web page, qualifies as a copyright restriction, absent a specific disclaimer of that claim. On the other hand, if the agency’s website is silent about copyright, the dataset is assumed to be without copyright restrictions (because states’ data is generally in the public domain, silence indicates public domain status).
Is this data updated frequently enough that its content doesn’t become stale? This period of time can vary enormously between types of data.
Is this dataset found in the state’s open data repository (e.g., data.example.gov)? If the state does not have a repository, this is an automatic “no.”
Is any mechanism provided to ensure that a downloaded copy of a dataset is unchanged from the master file provided by the state? This might mean providing a data authentication service like Data Seal, or it might mean that the site is served up over HTTPS (preventing man-in-the-middle attacks).
Does this dataset provide the minimum amount of information necessary to be viably useful to third parties? See below for details as to what this means for each type of data.
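The verification criterion above can be satisfied by something as simple as a published checksum. A minimal sketch of how a consumer might confirm a downloaded copy is unchanged, assuming the state publishes a SHA-256 digest alongside the file (the file name and contents here are stand-ins, not a real state dataset):

```shell
# Hypothetical example: the state publishes addresses.csv alongside a
# SHA-256 digest. A consumer recomputes the digest locally and compares.
printf 'id,address\n1,100 Main St\n' > addresses.csv   # stand-in for the download

# The state would publish this digest file next to the dataset:
sha256sum addresses.csv > addresses.csv.sha256

# Verification: prints "addresses.csv: OK" and exits zero if the copy
# matches; prints FAILED and exits non-zero if it has been altered.
sha256sum -c addresses.csv.sha256
```

Serving the files over HTTPS accomplishes something similar in transit, but a published digest lets a copy be re-verified at any later time.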
We created GitHub issues to define each dataset, where we researched existing standards and, when they didn’t exist, surveyed best practices, consulting with relevant experts when available. The goal was to set a baseline for the minimum data necessary for a dataset to be considered complete.
What follows are the evaluation criteria for each dataset.
Evaluation begins by looking at OpenCorporates’ scores for individual states.
Search terms used to find this dataset include “[state] businesses,” “[state] corporations,” and “[state] corporate registry.”
Search terms used to find this dataset include “[state] incarceration data,” “[state] prison data,” and “[state] prisoner statistics.”
Search terms used to find this dataset include “[state]-owned real estate,” “[state]-owned property,” “[state]-owned lands,” “[state]-owned buildings,” and “[state] government leases.”
Evaluation begins by referring to the National Conference of State Legislatures’ “Statewide Transparency and Spending Websites” and to U.S. PIRG’s “Following the Money” report (page 57), both of which are already comprehensive censuses of each state’s spending transparency website.
Search terms used to find this dataset include “[state] checkbook,” “[state] spending data,” and “[state] expenditures.”
Evaluation begins by checking OpenAddresses’ data sources, to see if there is a state-level data source in their inventory. OpenAddresses is conducting an ongoing census of every source of address points in the United States.
Search terms used to find this dataset include “[state] address points” and “[state] address database.”
Evaluation begins by referring to Open States’ “Open Legislation Data Report Card,” which is a comprehensive census of state-level legislation data.
Search terms used to find this dataset are narrowed to the site in question (e.g., “site:legis.example.gov”), and include terms like “download,” “data,” and “filetype:csv.”
Search terms used to find this dataset include “[state] restaurant inspections” and “[state] food safety.”
The search term “[state] population projections” generally yields this dataset.
This uses the Model Minimum Uniform Crash Criteria, which is a fairly detailed standard. The basic elements are:
Evaluation includes looking at the National Highway Traffic Safety Administration’s State Data System overview, which lists all states that are sharing crash data with the federal government. Participating states are known to have the data, whether or not they share it with the public.
Search terms used to find this dataset include “[state] crash data” and “[state] crash statistics.”