Protecting Queer Communities Through Data

This article originated as a talk given at Chi Hack Night on June 25th, 2019. It has since been modified for a Metis Data Science track lecture, for the 2019 Design for America Summit, and for a variety of other venues. A video of the original talk is available here.

It goes without saying that longstanding marginalization has led to misinterpretation and misunderstanding of LGBT+ people. In a learned environment of fear, the vast majority of queer people have stayed silent about their identities in one way or another. I came out as pansexual during high school, but was re-closeted when job changes forced my family to move to Alabama. The caution with which I approached my own sexuality during that time persists to this day, and I don’t actively discuss that part of me very often. There are many ways that it remains alienating and even dangerous to not be the magic combination of cisgender and heterosexual. This is amplified among already marginalized groups, like LGBT+ communities of color or neurodivergent queer communities.

The end result of all this hiding, all of this fear, is that even researchers’ best understanding of queer identities and the social patterns of queer groups is highly flawed. The loudest voices tend to characterize a group, and a lack of granular data leads those characterizations to be taken as a given for that group as a whole. And in addition to that, the very terminology we use to describe our identities is highly fluid, adapting as we become more open as a society to exploring natural differences in sexual attraction and gender. Because the words we use to describe many identities are so new, differing self-descriptions around the world also complicate research.

The fact that we know very little about LGBT+ communities poses challenges that reach far beyond queer-related research fields. On a basic, day-to-day level, queer voices are often quiet in the workplaces where products and services are built, including digital systems. A recent survey found that only 30% of LGBT+ people are out in their workplace, and many don’t feel comfortable speaking up on topics related to their identities. As a consequence, most systems are not designed with LGBT+ users in mind. This is especially true of some civic systems, like driver’s license databases. When we do explicitly consider LGBT+ experiences, we often do so through a stereotypical lens, or make assumptions about communities whose unique needs are often invisible to us.


Gender selectors are very common on the digital tools that we use, and gender is one thing that people in many countries are accustomed to having collected about them by countless entities: governments, health providers, online retailers, and more all require it before we access their services.

As non-cisgender identities have become more commonly recognized and gained acceptance, the traditional binary options for gender offered by systems have started to change. At the moment, it’s very common to find interfaces that allow users to designate themselves as “Other” instead of either of the two traditional options for gender. While this technically captures all other possible identities, it lumps a huge diversity of genders under a single alienating term. This has big ramifications for data use after entry, since any content or functions that might be customized based on gender are essentially rendered meaningless by a data collection approach that puts every non-cis identity in a single bucket. It shows only a minimum level of care and understanding of additional gender identities, and does nothing to forward our understanding of just how common certain identities are among a system’s users.

There are multiple common approaches to gender selection in digital systems, but they all hold flaws of some sort for LGBT+ users.

The opposite approach is one taken by a growing number of companies and organizations like Facebook, spurred in part by an LGBT+ terminology resource published by GLAAD and Refinery29. In an attempt to be exhaustive, Facebook currently offers dozens of gender options for users to identify themselves. This attempt is flawed for a number of reasons, chief among them being that what’s considered exhaustive at any given time rapidly falls out of date. Since we’re finding new ways to describe gender and sexuality every day as we begin to discuss these things more openly, there’s really no such thing as an exhaustive approach to identity.

Something that better reflects the needs of our diverse LGBT+ communities is freeform text entry of a user’s preferred gender and sexuality, rather than a predefined set of selectors. Of course, if this information is displayed publicly there is potential for abuse by people who like to make discriminatory jokes about gender identity, and any such system would need to put steps in place to prevent such abuse. Additionally, any data scientist could tell you that working with free text can make data analysis much more challenging than working with a clear, limited set of options. But this is all for good reason, because it produces a granularity of data that’s simply impossible if we try to put everyone in boxes that they may not fit in. This is what an inclusive data practice looks like.
Freeform text entry is an inclusive approach to collecting gender data, but comes with challenges of its own.
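To make the analysis trade-off concrete, here is a minimal sketch of how free-text responses might be tallied without forcing anyone into a predefined box. The function names and the sample responses are invented for illustration, and the cleanup is deliberately light: it only trims whitespace and ignores capitalization, so a person's own wording is never mapped onto somebody else's category list.

```python
from collections import Counter


def normalize(response: str) -> str:
    """Light-touch cleanup: trim, collapse whitespace, and case-fold.

    The user's own wording is never mapped onto a fixed category list.
    """
    return " ".join(response.split()).casefold()


def tally_responses(responses: list[str]) -> Counter:
    """Count self-descriptions after cleanup, skipping blank answers."""
    counts: Counter = Counter()
    for raw in responses:
        cleaned = normalize(raw)
        if cleaned:  # an empty answer means "declined", not "other"
            counts[cleaned] += 1
    return counts


# Hypothetical responses, invented for illustration only.
sample = ["Non-binary", "  non-binary ", "Genderfluid", "woman", "Woman", ""]
counts = tally_responses(sample)
```

Even this small step shows where the extra analytical work goes: deciding which differences between answers are noise (stray spaces, capitalization) and which are meaningful distinctions that should be left alone.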

But I’m not here to write about inclusive data practices. I believe that inclusion is often a misleading goal in the design of technical systems, and can cause more harm than good. And that’s where we start to talk about protection, rather than inclusion.


Inclusion-based frameworks fail to understand that forced participation in a flawed data system can be harmful. It isn’t enough to include or represent varying identities when the resulting functions are premised on bad assumptions about those identities, or in systems where data is improperly secured.
Inclusively collected data is almost always still subject to the same stereotype-based decision processes and content curation processes as non-inclusively collected data. Whether a human is in the mix of making content decisions or not, the data we feed into a system and the ways we act on that data frequently come to align themselves with existing social hierarchies. This can be as simple as creating content bubbles, in which customized experiences on a digital platform mean that LGBT+ people are highly likely to see LGBT+-related content and non-LGBT+ people never have to interact with LGBT+-related content. It can be as devastating as presumptions of medical needs or limitations based on a person’s identity, like the longstanding bans around the world on gay men donating blood or the misassignment of sexual health services to somebody whose genitalia are different than what a rigid system presumes.

And the rigidity of our systems itself makes them prone to what Dr. Anna Lauren Hoffmann of the University of Washington refers to as “data violence”. We primarily design systems that assume that something so supposedly innate about a person as their gender is never-changing, which isn’t borne out by reality. Immutability can be traumatizing for somebody who has discovered something new about themselves and wants to reflect that discovery in their social media profiles, their medical information, their driver’s license, and more. Data systems need to be just as adaptable as their users, but users can’t be expected to constantly update information about things like their gender and sexuality everywhere they’ve ever entered it. When systems fall out of date with the people using them, those people can be re-traumatized by reminders of names and identities that they no longer hold.

The discussion of re-traumatization naturally extends to the subject of data republication. I am willing to bet that more than a handful of the people reading this piece work for organizations that use data not originally sourced from them, or organizations that give or sell their users’ data to others. There is a whole industry of data brokers out there whose work involves scraping public records, buying user information from companies that directly collect it, and combining datasets as much as possible to get a full picture of a particular data subject. Or, in clearer terms, a human being.

Problem is, when a user at the original source of that data changes their information or revokes permission altogether, that data modification or revocation rarely makes its way down the line to third parties who are now spreading out-of-date information to their clients. Data brokers in the U.S., at least, exist in a highly unregulated space, and the information they sell about people is generally an awful combination of poorly-vetted, improperly consented, and secretively sourced. Many data brokers refuse to tell potential clients where they get people’s information from in the first place because they don’t want those clients to collect that information for free on their own. Those barriers to clarity of sourcing mean that whoever uses information from a data broker cannot be sure of the correctness of that data, let alone the ethics of the collection process. This monster creates situations in which people using an online service for the first time are confronted with autofilled information with names they no longer use, are targeted with advertisements for a gender that doesn’t align with theirs, and are unable to escape constant reminders of traumatic past identities.
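For contrast, here is a rough sketch of what it could look like if a change or revocation did travel down the line. Everything in it is hypothetical: the `SharingLedger` and `Recipient` classes are invented for this example, and it assumes cooperative recipients that honor update and delete requests, which is exactly what today's data broker ecosystem does not guarantee.

```python
from dataclasses import dataclass, field


@dataclass
class Recipient:
    """A hypothetical third party that holds a copy of shared records."""
    name: str
    copies: dict[str, dict] = field(default_factory=dict)

    def receive(self, subject_id: str, record: dict) -> None:
        # Overwrite any earlier copy so stale values do not linger.
        self.copies[subject_id] = dict(record)

    def delete(self, subject_id: str) -> None:
        self.copies.pop(subject_id, None)


@dataclass
class SharingLedger:
    """Tracks which recipients hold each person's data so changes can follow it."""
    records: dict[str, dict] = field(default_factory=dict)
    shared_with: dict[str, list[Recipient]] = field(default_factory=dict)

    def share(self, subject_id: str, recipient: Recipient) -> None:
        recipient.receive(subject_id, self.records[subject_id])
        self.shared_with.setdefault(subject_id, []).append(recipient)

    def update(self, subject_id: str, **changes) -> None:
        # Apply the change locally, then push the fresh record downstream.
        self.records[subject_id].update(changes)
        for recipient in self.shared_with.get(subject_id, []):
            recipient.receive(subject_id, self.records[subject_id])

    def revoke(self, subject_id: str) -> None:
        # Ask every known recipient to delete, then delete the local copy.
        for recipient in self.shared_with.pop(subject_id, []):
            recipient.delete(subject_id)
        self.records.pop(subject_id, None)


# Illustrative usage with made-up identifiers.
ledger = SharingLedger()
ledger.records["u1"] = {"name": "Old Name"}
broker = Recipient("example-broker")
ledger.share("u1", broker)
ledger.update("u1", name="New Name")
name_after_update = broker.copies["u1"]["name"]
ledger.revoke("u1")
```

The point of the sketch is the bookkeeping: a system can only propagate a correction or a revocation if it recorded who received the data in the first place.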

And that’s just the start of it. If our goal were to align somebody’s own experience with their self-conception, this conversation would be relatively simple. But unfortunately, even if we created a world in which every piece of data about a person matched their identity, we’d still be placing queer users in grave danger. If we collect gender or sexuality information, we must protect that information at all costs. Hacked databases, confusing user privacy settings, or malicious sharing of information meant to remain on one platform can all be used to harm LGBT+ people. There is a long history of queer folks being put in danger by those who try to out their sexuality or gender publicly when they themselves aren’t out to friends, family, or coworkers. And in many cases, our digital systems enable that kind of outing. On a larger scale, the legal rights that queer-opposed governments assert over the data on digital systems sometimes lead to situations in which data about gender and sexuality is collected and then used to directly persecute people who are considered deviants in their local sociopolitical context.


Which is to say, this article isn’t theoretical.

You may have read the story of a pastor in Tennessee who called for the execution of LGBT+ people. That pastor also happened to be a sheriff’s officer in his county, and said unequivocally that he believed it was part of the duty of police organizations to prosecute gay, bisexual, and transgender people. Imagine how he or an officer alongside him would treat somebody whose gender marker on their driver’s license didn’t appear to match their gender expression when pulled over.

A 28-year-old man in Fresno, California was arrested for claiming that he would cause a gay nightclub in the city to change its name to Pulse because “you will share the same fate”, referring to the mass killing of 49 LGBT+ people in Orlando in 2016. Imagine what he might have done if he had purchased a dataset from a data broker that contained the content of dating profiles for people in Fresno. Such datasets are readily available.

A man in St. Louis, Missouri sent emails to the organizers of its annual Pride celebration claiming that he would come heavily armed and “kill every gay person I can”. Imagine if somebody like that was able to gain access to an improperly secured database on an ecommerce website that primarily catered to gay customers.

All of this happened in the span of just two weeks during June, Pride Month.

We may feel that we have come a long way, especially those of us who live in large international cities in which people can generally be open about their identities, but let’s not kid ourselves. It can still be physically dangerous to not be heterosexual or cisgender in any city, and in any country. Digital industries bear responsibility for encoding and republishing information that can be used to target marginalized communities, and subfields like marketing analytics continue to normalize dangerous practices. We have to have a moral reckoning with what we consider acceptable in computing fields, in cases when gender and sexuality data isn’t essential.


Which brings me back to the freeform text entry option for gender selectors. We can do better than that.

How about we stop collecting? Instead of putting the burden on a user to fully understand the risks of sharing their highly personal information, let’s put the burden on ourselves to treat that information right. If we have no strong reason to collect it, or can’t guarantee its safety, we shouldn’t collect it. Is the danger to your LGBT+ users worth the ability to roughly guess whether somebody is buying a purse for themselves or as a gift, or to assume you know what kind of movie they want to watch?
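One way to turn "stop collecting" into a default rather than a good intention is to make every field justify itself before it is stored. The sketch below is only an illustration under assumed names: `FieldPolicy`, `POLICY`, and `minimize` are hypothetical, and the example fields and purposes are made up. A field is kept only if someone has written down a concrete purpose for it and its storage has been reviewed; everything else is dropped before it ever reaches a database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldPolicy:
    """Hypothetical policy entry: why a field is collected and whether storage is vetted."""
    purpose: str          # empty string means nobody could state a concrete need
    storage_vetted: bool  # True only if the team has reviewed how the field is secured


# Illustrative policy table; field names and purposes are invented for this sketch.
POLICY = {
    "email": FieldPolicy(purpose="order receipts", storage_vetted=True),
    "gender": FieldPolicy(purpose="", storage_vetted=False),
    "sexuality": FieldPolicy(purpose="", storage_vetted=False),
}


def minimize(submission: dict) -> dict:
    """Keep only fields with a stated purpose and vetted storage; drop the rest.

    Fields missing from POLICY are dropped too, so collection is opt-in per field.
    """
    kept = {}
    for name, value in submission.items():
        policy = POLICY.get(name)
        if policy and policy.purpose and policy.storage_vetted:
            kept[name] = value
    return kept


stored = minimize({"email": "a@example.com", "gender": "genderfluid", "shoe_size": 9})
```

With this policy table, only the email address survives. The gender answer is discarded because nobody could state why it was needed, which is the question the paragraph above asks teams to answer honestly.
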
We have a responsibility to build safe technology. And to do so, we not only need to confront the limitations of the inclusiveness framework, but we need to intimately involve LGBT+ technologists and users in our work at every step of the process. Like any other marginalized community, queer people need to be front and center in conversations about how technology can help or harm us. Otherwise, software teams are destined to continue building systems that will alienate us, hurt us, or even get us killed when we could be crafting a digital world that works better for everybody.
I take pride in the fact that I’m able to openly publish an article like this, discussing topics that may have been taboo just a couple short decades ago. I take pride in the fact that many of you reading this care deeply about these issues, and want to make a positive difference through your own work. But pride without concrete action is meaningless. So I call on you: take pride in the act of creating systems that keep my community safe. Take pride in sleeping at night with the knowledge that you’ve done everything you can to include and to protect. And carry that pride forward so that others can learn to do the same.