The Troubled Ethics of Open Data

A screenshot of the Cook County Medical Examiner Maps application, which allows users to access and filter records about primary causes of death in Cook County, Illinois
The white-painted wooden crosses showed up in June, on the front stairs of a house around the corner from mine in Chicago. Honoring a life and mourning the loss thereof, they quietly announced a family’s trauma. For days following, new faces came and went, providing support to relatives and making the necessary arrangements. Without the tradition of the crosses, neighbors who didn’t know these people would have just seen the occasion as another family gathering. In moments like these, privacy can allow people to process and suffer at a pace that’s healthy for them.

The week after the crosses appeared, I was in the midst of research about public health in Chicago’s neighborhoods. Using a tool published by the Cook County Medical Examiner’s office, I could access and filter records about primary causes of death that were of interest to my work, like cardiovascular disease. Before applying filters, though, I generally liked to use the tool’s mapping features to get a look at overall causes of death in each of the neighborhoods I analyzed, color-coded to reflect natural deaths and various categories of human-caused deaths like accidents and homicides. The big picture helped form an initial understanding of a neighborhood’s age makeup, nutritional health, and other contributing factors to mortality. While browsing through my own neighborhood, I ran across a little dot representing that address with the crosses around the corner. When I hovered over it, the following was spelled out in clear, cold letters:

Gender: FEMALE

Age: 12

Manner Of Death: SUICIDE

Primary Cause: HANGING

In 2013, the U.S. presidential administration of Barack Obama created an Open Data Policy for federal agencies, requiring those agencies to “collect or create information in a way that supports downstream information processing and dissemination activities”. In other words, data about the daily operations of the federal government would be made more accessible to the public than ever before. This policy came into existence as an outgrowth of a then-decade-old movement toward “open government”, closely linked to what is now most commonly known as civic tech.

At that time, many national, regional, and local governments around the world were taking similar steps to establish open data publication platforms, which were intended to increase government transparency as well as ease the handoff of data between different offices within a single level of government. Technology-related civic advocacy organizations like the U.K.’s mySociety, in Taiwan, and Code for America combined advocacy for collaboratively-built government technology with pushes for “open by default” policies for government-held data. The idea at the core of this open data proliferation was that the data collected and used by our governments was ultimately “our” data, since it reflected our civic interactions, our public services, and the overall administration of our societies.

Open data policies have been celebrated by researchers, journalists, and community activists. Thousands of governments now actively publish data on topics ranging from lobbyist registrations to property transfers. A survey conducted in 2018 by Open Knowledge International and the Sunlight Foundation found that 265 U.S. cities now meet some criteria for open data release, and information published on open data portals has been used to reveal corruption, make legislative meetings more accessible, and even create works of civically-engaged art. In many cases, proactive publication of government data has saved municipal resources, too, since that same data would previously have had to be obtained by the public through a laborious public records request process (a process that had to be duplicated each time a new individual or organization needed data access).

The usefulness of open data across a wide variety of disciplines has made it an easy subject for enthusiasm in civic circles. Today, advocacy for data openness has been embraced wholeheartedly by many groups both inside and outside of government. However, this enthusiasm has bordered on fanatical at times, expressed in ways that ignore many inherent complications of the release of government-held information.

Published data often lacks the local context or domain knowledge that it was created with. Especially in jurisdictions large enough to hire administrative staff whose job it is to maintain an open data portal, the people responsible for data publication are often multiple steps removed from the people responsible for data creation. When it comes time to analyze open data in this environment, it can be hard to establish the trustworthiness of a “raw” dataset. Data about police-resident interactions very rarely includes information about how heavily each police precinct is patrolled, for instance, so anybody drawing conclusions from the “raw” data would see an over-representation of interactions in heavily patrolled areas. Data about anything related to municipal complaint systems tends to be skewed toward wealthy, white areas or rapidly gentrifying areas, giving an outsize impression of a problem in places typically already well covered by municipal resources. And wealthy civic stakeholders often have the time and resources to utilize public data in ways that marginalized communities don’t, so this can become a self-perpetuating problem when in-the-know residents use skewed data to advocate for further targeting of resources in their area, draining those same resources from areas with greater actual need.

Biases of collection and interpretation sometimes lead well-meaning volunteers and organizations to create data-related solutions that don’t effectively meet the needs of their targeted communities. Open data is celebrated as a shared base upon which objective civic plans can be built, but such a framing ignores that seemingly subjective knowledge from communities can and should be an important component of planning. And beyond that, data is always tainted in some way by subjectivity. This is all well known in civic tech circles, but those circles often get stuck in a mode of action that sees data publication as an ultimate end goal, rather than as a starting point for much larger conversations about how data is used by governments and how it can be leveraged equitably by communities.

These issues aren’t new, nor are they unique to data that is published publicly. Data held internally by governments doesn’t become more or less biased upon publication, of course, and existing government uses of data can be highly biased in their own right. Data-informed solutions require local context in order to be effective, but they also require significant work to understand and mitigate biased decisions with respect to that context. One unique challenge posed by open data is not that data is released or used in the first place, but that the act of data release is governed differently than it is under more established freedom-of-information processes. These traditional public records requests tend to be steered by criteria related to the potential harms of records release, with the intent to provide data to the public after vetting. Freedom-of-information processes are far from perfect: some agencies perpetually stonewall the release of any data, some are inept at redacting sensitive information when requests are approved, and so on. But they are generally more robust than open data publication processes, which are newer and often not subject to the same legally required oversight. This means more data is published that perhaps shouldn’t have been, like the home addresses of public employees. Or, like the gruesome manner of death of a 12-year-old girl who lived around the corner from me.

Fundamentally, tools like the Cook County Medical Examiner Maps demonstrate that sometimes what the open data community thinks of as “our” data may not ethically belong to all of us, but instead may belong to a single person or group of people who have interacted with a civic process. And when we center openness as the end goal of a movement, we ignore those ugly realities of injudicious data publication.

I know people on the team that built Cook County Medical Examiner Maps, and many other open data tools like it that contain highly personal information. I know that their aims are to provide data that can be made useful and improve communities, and I know from my own experience that tools like these can be very helpful to research efforts or social impact projects. But all of that potential benefit doesn’t erase the need to think seriously about harms that can come from publication of certain data, and the need to establish strict data governance within the teams that manage open datasets. Civic technologists have a responsibility to fight for transparency in a way that is cognizant of the dangers that certain forms of transparency can bring, and we need to do better than just openness for openness’s sake.