Census Methodology

Selection of Datasets

We used the G8 National Action Plan’s definition of “High-Value Datasets,” recognizing that needs and opportunities are different at the state level than at the national level. We created GitHub issues to select each representative dataset, and in the ensuing discussion we selected the most appropriate state-level dataset to represent that class of data.

Evaluation Criteria

Each dataset is evaluated by 12 metrics. Each is worth 5 points, except for “Exists” (which is worth 0 points), “Free” (which is worth 15 points), and “Machine-readable” (which is worth 40 points). Machine readability is worth far more points because that is the essence of open data, and being without cost is worth somewhat more because of its corresponding importance (though not essential nature) to the free flow of data. There is a good argument to be made that open licensing is also quite valuable, but there is not yet adequate evidence that U.S. states’ copyright claims on these datasets are a practical obstacle to openness, so open licensing does not warrant equal scoring at this time. Often, states seem to claim copyright accidentally, via boilerplate copyright statements on websites. All of these criteria add up to 100 points, upon which grading is based.
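The point values described above can be sketched as a small lookup table. This is only an illustration: the dictionary keys are hypothetical short names for the twelve criteria, and the assumption is that each criterion is answered with a simple yes or no.

```python
# Point values taken from the text: nine criteria at 5 points each,
# plus "Exists" (0), "Free" (15), and "Machine-readable" (40).
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "exists": 0,
    "digitized": 5,
    "public": 5,
    "free": 15,
    "online": 5,
    "machine_readable": 40,
    "bulk": 5,
    "no_copyright": 5,
    "up_to_date": 5,
    "in_repository": 5,
    "verifiable": 5,
    "complete": 5,
}


def score(answers):
    """Sum the points for every criterion answered True (a "yes")."""
    return sum(points for key, points in WEIGHTS.items() if answers.get(key, False))


all_yes = {key: True for key in WEIGHTS}
not_machine_readable = dict(all_yes, machine_readable=False)

print(score(all_yes))               # 100
print(score(not_machine_readable))  # 60
```

A dataset that satisfies every criterion except machine readability would score 60 under these weights, which shows how heavily that single criterion counts.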

If no data is available, but only information (that is, as PDF, HTML, Word, etc.), that information will be inventoried, but recorded as not being machine-readable. However, if any data is available, the data will be inventoried instead. So if, for example, a legislature has a website with detailed PDFs about every bill and legislator, but only provides data as a single CSV file with vanishingly little information, the CSV will be inventoried, not the PDFs.

The grading scale is as follows:

Grade   Score
A+      97–100
A       93–96
A-      90–92
B+      87–89
B       83–86
B-      80–82
C+      77–79
C       73–76
C-      70–72
D+      67–69
D       63–66
D-      60–62
F       0–59
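
The grading scale above can be expressed as a lookup from score to letter grade. The following is a minimal sketch, assuming scores are whole numbers from 0 to 100; the function and constant names are hypothetical.

```python
# Lowest score that earns each grade, from the grading scale above,
# ordered from highest to lowest.
GRADE_FLOORS = [
    (97, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
    (0, "F"),
]


def letter_grade(score):
    """Return the letter grade for a score between 0 and 100."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for floor, grade in GRADE_FLOORS:
        if score >= floor:
            return grade


print(letter_grade(100))  # A+
print(letter_grade(60))   # D-
print(letter_grade(59))   # F
```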

Data Exists

Does this data exist at all?

Digitized

Does the data exist as electronic data, or is it only on paper?

Available Publicly

May the public have access to this data, or is it protected or confidential in some way?

Free (Without Cost)

Does the state sell this data, or do they give it away without charge? Any charge, no matter how small, yields a rating of “no.” If information is provided freely on the website, but data is sold (e.g., business entity records can be browsed on the web, but bulk data must be purchased), then this will be given a rating of “no.”

Available Online

Can it be downloaded over the internet, or must it be transported on removable storage media?

Machine-Readable

Is this data, or is it mere information? For example, a PDF is not machine-readable data. On the other hand, CSV, JSON, SQLite, Excel, FoxPro, and SGML are all machine-readable data.
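
As a rough first pass, the distinction between data and information can be approximated from a file’s extension. The sketch below is an assumption-laden illustration: the mapping from the formats named in this methodology to file extensions (for example, .dbf for FoxPro) is a guess for demonstration purposes, and a real evaluation would still require opening the file.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions assumed to correspond to the formats named in the text.
MACHINE_READABLE = {".csv", ".json", ".sqlite", ".xls", ".xlsx", ".dbf", ".sgml"}
INFORMATION_ONLY = {".pdf", ".html", ".htm", ".doc", ".docx"}


def is_machine_readable(url):
    """Return True, False, or None (unknown) based on the URL's file extension."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix in MACHINE_READABLE:
        return True
    if suffix in INFORMATION_ONLY:
        return False
    return None  # the extension alone does not settle the question


print(is_machine_readable("https://www.example.gov/addresses.csv"))  # True
print(is_machine_readable("https://www.example.gov/report.PDF"))     # False
print(is_machine_readable("https://www.example.gov/api/data"))       # None
```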

Available in Bulk

Is all of the data that comprises this dataset available in downloadable files that can be retrieved automatically (e.g., wget https://www.example.gov/addresses.csv) at stable, predictable URLs? Or must a human click an “export” button, track down updated filenames, or interface with an API?

No Copyright Restrictions

Does the state claim copyright over this data? A copyright statement, generally found in the footer of the dataset’s web page, qualifies as a copyright restriction absent a specific disclaimer of that copyright claim. On the other hand, if the agency’s website is silent about copyright, the dataset is assumed to be without copyright restrictions (as states’ data is generally in the public domain, silence indicates public domain status).

Up-to-Date

Is this data updated frequently enough that its content doesn’t become stale? This period of time can vary enormously between types of data.

In the State Repository

Is this dataset found in the state’s open data repository (e.g., data.example.gov)? If the state does not have a repository, this is an automatic “no.”

Verifiable

Is any mechanism provided to ensure that a downloaded copy of a dataset is unchanged from the master file provided by the state? This might mean providing a data authentication service like Data Seal, or it might mean that the site is served up over HTTPS (preventing man-in-the-middle attacks).
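
One common way to confirm that a downloaded copy is unchanged, not named in this methodology but in the same spirit, is to compare a cryptographic checksum of the file against one published by the state. The sketch below assumes the state publishes a SHA-256 checksum alongside the file; the function names and the sample file contents are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, published_hex):
    """Compare a local file's digest to a published SHA-256 checksum."""
    return sha256_of_file(path) == published_hex.strip().lower()


# Demonstration with a temporary file standing in for a downloaded dataset.
contents = b"id,name\n1,Example LLC\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(contents)
    demo_path = tmp.name

published = hashlib.sha256(contents).hexdigest()  # what the state would publish
unchanged = matches_published_checksum(demo_path, published)
tampered = matches_published_checksum(demo_path, "0" * 64)
os.remove(demo_path)

print(unchanged)  # True
print(tampered)   # False
```

A checksum only helps if it is obtained through a trusted channel, which is one reason serving the site over HTTPS matters.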

Complete

Does this dataset provide the minimum amount of information necessary to be viably useful to third parties? See below for details as to what this means for each type of data.

Evaluation Criteria for Completeness

We created GitHub issues to define each dataset, where we researched existing standards and, when they didn’t exist, surveyed best practices, consulting with relevant experts when available. The goal was to set a baseline for the minimum data necessary for a dataset to be considered complete.

What follows are the evaluation criteria for each dataset.

Companies

GitHub Discussion

  • ID
  • name
  • address
  • registered agent
  • registered agent's address
  • date of incorporation
  • status
  • state formed
  • officers
  • shares (if applicable)

Evaluation begins by looking at OpenCorporates’ scores for individual states.

Search terms used to find this dataset include “[state] businesses,” “[state] corporations,” and “[state] corporate registry.”
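
The Companies field list above can be turned into a mechanical first check of a CSV file’s header row. This is a minimal sketch under stated assumptions: the column names are hypothetical normalized versions of the fields listed, “shares” is left out because it is required only where applicable, and real state files will use their own column names that an evaluator would need to map by hand.

```python
import csv
import io

# Hypothetical column names for the required Companies fields.
REQUIRED_FIELDS = [
    "id",
    "name",
    "address",
    "registered_agent",
    "registered_agent_address",
    "date_of_incorporation",
    "status",
    "state_formed",
    "officers",
]


def missing_fields(csv_text, required=REQUIRED_FIELDS):
    """Return the required fields that do not appear in the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    present = {column.strip().lower() for column in header}
    return [field for field in required if field not in present]


sample = "ID,Name,Address,Status\n1,Example LLC,1 Main St,active\n"
print(missing_fields(sample))
# ['registered_agent', 'registered_agent_address', 'date_of_incorporation',
#  'state_formed', 'officers']
```

An empty result would suggest, though not prove, that the dataset meets the completeness baseline for this category.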

Incarceration

GitHub Discussion

  • number of inmates in each facility
  • at what percentage of capacity the facility is operating
  • numerical and percent change in population from same time period of previous year

Search terms used to find this dataset include “[state] incarceration data,” “[state] prison data,” and “[state] prisoner statistics.”

Real Estate

GitHub Discussion

  • agency
  • property name
  • location identifier (e.g., address, coordinates, or general geographic descriptor)
  • land acreage (if applicable)
  • use type

Search terms used to find this dataset include “[state]-owned real estate,” “[state]-owned property,” “[state]-owned lands,” “[state]-owned buildings,” and “[state] government leases.”

Checkbook

GitHub Discussion

  • agency
  • payee
  • amount
  • date
  • check/voucher number
  • description
  • expense category

Evaluation begins by referring to the National Conference of State Legislatures’ “Statewide Transparency and Spending Websites” and to U.S. PIRG’s “Follow the Money” report (page 57), each of which is already a comprehensive census of the states’ spending transparency websites.

Search terms used to find this dataset include “[state] checkbook,” “[state] spending data,” and “[state] expenditures.”

Address Points

GitHub Discussion

  • address: numeric component
  • address: street name
  • coordinates

Evaluation begins by checking OpenAddresses’ data sources, to see if there is a state-level data source in their inventory. OpenAddresses is conducting an ongoing census of every source of address points in the United States.

Search terms used to find this dataset include “[state] address points” and “[state] address database.”

Legislation

GitHub Discussion

  • legislators
    • legislator name
    • chamber (if extant)
    • party
    • district identifier
    • date sworn in
  • legislation
    • session
    • bill number
    • bill patron(s)
    • bill catch line (if extant)
    • bill summary (if extant)
    • bill text
    • bill status

Evaluation begins by referring to Open States’ “Open Legislation Data Report Card,” which is a comprehensive census of state-level legislation data.

Search terms used to find this dataset are narrowed to the site in question (e.g., “site:legis.example.gov”), and include terms like “download,” “data,” and “filetype:csv.”

Restaurant Inspections

GitHub Discussion

  • business ID
  • business name
  • business address
  • inspection date
  • inspection outcome (score, rating, action, etc.)

Search terms used to find this dataset include “[state] restaurant inspections” and “[state] food safety.”

Population Projections

GitHub Discussion

  • locality name
  • locality FIPS code or GNIS ID
  • estimate as of date X

Evaluation begins by referring to the U.S. Census Bureau’s “State-Produced Population Projections,” which links to the website of each state that produces projections.

The search term “[state] population projections” generally yields this dataset.

Vehicle Crashes

GitHub Discussion

This uses the Model Minimum Uniform Crash Criteria, which is a fairly detailed standard. The basic elements are:

  • crash information
  • vehicle information
  • driver information

Evaluation includes looking at the National Highway Traffic Safety Administration’s State Data System overview, which lists all states that are sharing crash data with the federal government. Participating states are known to have the data, whether or not they share it with the public.

Search terms used to find this dataset include “[state] crash data” and “[state] crash statistics.”