Protecting Queer Communities Through Data
This
article originated as a talk given at Chi Hack Night on June 25th,
2019. It has since been modified for a Metis Data Science track lecture,
for the 2019 Design for America Summit, and for a variety of other
venues. A video of the original talk is available here.
It
goes without saying that longstanding marginalization has led to
misinterpretation and misunderstanding of LGBT+ people. In a learned
environment of fear, the vast majority of queer people have stayed
silent about their identities in one way or another. I came out as
pansexual during high school, but was re-closeted when job changes
forced my family to move to Alabama. The caution with which I approached
my own sexuality during that time persists to this day, and I don’t
actively discuss that part of me very often. There are many ways that it
remains alienating and even dangerous to not be the magic combination
of cisgender and heterosexual. This is amplified among already
marginalized groups, like LGBT+ communities of color or neurodivergent
queer communities.
The
end result of all this hiding, all of this fear, is that even
researchers’ best understanding of queer identities and the social
patterns of queer groups is highly flawed. The loudest voices tend to
characterize a group, and a lack of granular data leads those
characterizations to be taken as a given for that group as a whole. On top of that, the very terminology we use to describe our
identities is highly fluid, adapting as we become more open as a society
to exploring natural differences in sexual attraction and gender.
Because the words we use to describe many identities are so new,
differing self-descriptions around the world also complicate research.
The
fact that we know very little about LGBT+ communities poses challenges that reach far beyond queer-related research fields. On a basic, day-to-day
level, queer voices are often quiet in the workplaces where products and
services are built, including digital systems. A recent survey found
that only 30% of LGBT+ people are out in their workplace,
and many don’t feel comfortable speaking up on topics related to their
identities. As a consequence, most systems are not designed with LGBT+
users in mind. This is especially true of some civic systems, like
driver’s license databases. When we do explicitly consider LGBT+ experiences, we often do so through a stereotypical lens, or we presume to know the needs of communities whose actual needs are often invisible to us.
THE EVOLUTION OF GENDER IN DIGITAL SYSTEMS
Gender selectors are very common on the digital tools that we use, and gender data is something that people in many countries are accustomed to having collected about them by countless entities: governments, health providers, online retailers, and more all require it before we can access their services.
As
non-cisgender identities have become more commonly recognized and
gained acceptance, the traditional binary options for gender offered by
systems have started to change. At the moment, it’s very common to find
interfaces that allow users to designate themselves as “Other” instead
of either of the two traditional options for gender. While this
technically captures all other possible identities, it lumps a huge
diversity of genders under a single alienating term. This has big
ramifications for data use after entry, since any content or functions
that might be customized based on gender are essentially rendered
meaningless by a data collection approach that puts every non-cis
identity in a single bucket. It shows only a minimal level of care and understanding toward additional gender identities, and does nothing to advance our understanding of just how common those identities are among a system’s users.
[Figure: There are multiple common approaches to gender selection in digital systems, but they all hold flaws of some sort for LGBT+ users.]
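To make the granularity problem concrete, here is a minimal sketch in Python; the response values are entirely hypothetical, but they show what happens when a form that only offers “Woman”, “Man”, and “Other” records a more diverse set of users:

```python
from collections import Counter

# Hypothetical self-descriptions users might actually hold.
responses = ["woman", "nonbinary", "man", "genderfluid", "agender", "woman", "bigender"]

# A selector offering only "woman", "man", and "Other" collapses everything else.
OFFERED = {"woman", "man"}
recorded = [r if r in OFFERED else "other" for r in responses]

print(Counter(responses))  # the diversity users actually expressed
print(Counter(recorded))   # what the system stores: Counter({'other': 4, 'woman': 2, 'man': 1})
```

Any downstream analysis or personalization only ever sees that second distribution, which is exactly the loss of granularity described above.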
The opposite approach is one taken by a growing number of companies and organizations like Facebook, spurred in part by this LGBT+ terminology resource
published by GLAAD and Refinery29. In an attempt to be exhaustive,
Facebook currently offers dozens of gender options for users to identify
themselves. This attempt is flawed for a number of reasons, chief among
them being that what’s considered exhaustive at any given time rapidly
falls out of date. As we begin to discuss these things more openly, we find new ways to describe gender and sexuality every day, so there’s really no such thing as an exhaustive approach to identity.
Something
that better reflects the needs of our diverse LGBT+ communities is
freeform text entry of a user’s preferred gender and sexuality, rather
than a predefined set of selectors. Of course, if this information is
displayed publicly there is potential for abuse by people who like to
make discriminatory jokes about gender identity, and any such system
would need to put steps in place to prevent such abuse. Additionally,
any data scientist could tell you that working with free text can make
data analysis much more challenging than working with a clear, limited
set of options. But that extra work is worthwhile, because it produces a granularity of data that’s simply impossible to achieve if we try to put everyone in boxes they may not fit in. This is what an inclusive data practice looks like.
[Figure: Freeform text entry is an inclusive approach to collecting gender data, but comes with challenges of its own.]
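As a rough illustration of why freeform entries make analysis harder, here is a minimal sketch in Python; the entries, alias table, and normalization rules are all hypothetical and intentionally naive, since real grouping decisions should be made with community input, and the user’s original text should always be preserved as the source of truth:

```python
import re

# Hypothetical freeform entries; casing, spelling, and phrasing all vary.
entries = ["Non-binary", "nonbinary", "NB", "woman", "Woman ", "genderfluid", "trans man"]

# A deliberately small alias table used only for aggregate reporting;
# the original text stays untouched in each user's record.
ALIASES = {"nb": "nonbinary", "non-binary": "nonbinary"}

def normalize(entry: str) -> str:
    cleaned = re.sub(r"\s+", " ", entry.strip().lower())
    return ALIASES.get(cleaned, cleaned)

counts = {}
for entry in entries:
    key = normalize(entry)
    counts[key] = counts.get(key, 0) + 1

print(counts)  # {'nonbinary': 3, 'woman': 2, 'genderfluid': 1, 'trans man': 1}
```

Even this toy example shows where the work shifts: instead of forcing users into predefined boxes at entry time, the team takes on the job of interpreting responses carefully after the fact.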
But
I’m not here to write about inclusive data practices. I believe that
inclusion is often a misleading goal in the design of technical systems,
and can cause more harm than good. And that’s where we start to talk
about protection, rather than inclusion.
INCLUSIVE DATA PRACTICES AND RE-TRAUMATIZATION
Inclusion-based
frameworks fail to understand that forced participation in a flawed
data system can be harmful. It isn’t enough to include or represent
varying identities when the resulting functions are premised on bad
assumptions about those identities, or when the data itself is improperly secured.
Inclusively
collected data is almost always still subject to the same
stereotype-based decision processes and content curation processes as
non-inclusively collected data. Whether a human is in the mix of making
content decisions or not, the data we feed into a system and the ways we
act on that data frequently come to align themselves with existing social
hierarchies. This can be as simple as creating content bubbles, in which
customized experiences on a digital platform mean that LGBT+ people are
highly likely to see LGBT+-related content and non-LGBT+ people never
have to interact with LGBT+-related content. It can be as devastating as
presumptions of medical needs or limitations based on a person’s
identity, like the longstanding bans around the world on gay men
donating blood, or the misassignment of sexual health services to somebody whose genitalia differs from what a rigid system presumes.
And the rigidity of our systems themselves makes them prone to what Dr. Anna Lauren Hoffmann of the University of Washington refers to as “data violence”.
We primarily design systems that assume that something so supposedly
innate about a person as their gender is never-changing, which isn’t
borne out by reality. Immutability can be traumatizing for somebody who
has discovered something new about themselves and wants to reflect that
discovery in their social media profiles, their medical information,
their driver’s license. Data systems need to be just as adaptable as their users, but users can’t be expected to constantly update information about things like their gender and sexuality everywhere they’ve ever entered it.
When systems fall out of date with the people using them, those people
can be re-traumatized by reminders of names and identities that they no
longer hold.
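One way to build in that adaptability, sketched here with hypothetical field names, is to treat gender as an optional, user-editable attribute that can be changed or cleared at any time rather than a fixed value set at signup:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class UserProfile:
    username: str
    # Stored in the user's own words, optional, and never assumed permanent.
    gender: Optional[str] = None
    gender_updated_at: Optional[datetime] = None

    def update_gender(self, new_value: Optional[str]) -> None:
        """Let the user change or remove their gender whenever they choose."""
        self.gender = new_value
        self.gender_updated_at = datetime.now(timezone.utc)

profile = UserProfile(username="example_user")
profile.update_gender("nonbinary")  # the user shares it...
profile.update_gender(None)         # ...and can withdraw it later
```

The same principle has to apply to every downstream copy of the data: anything derived from a field like this needs a path to being updated or deleted when the user changes it.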
The
discussion of re-traumatization naturally extends to the subject of
data republication. I am willing to bet that more than a handful of the
people reading this piece work for organizations that use data not
originally sourced from them, or organizations that give or sell their
users’ data to others. There is a whole industry of data brokers out
there whose work involves scraping public records, buying user
information from companies that directly collect it, and combining
datasets as much as possible to get a full picture of a particular data
subject. Or, in clearer terms, a human being.
Problem
is, when a user at the original source of that data changes their
information or revokes permission altogether, that data modification or
revocation rarely makes its way down the line to third parties who are
now spreading out-of-date information to their clients. Data brokers in
the U.S., at least, exist in a highly unregulated space, and the information they sell about people is generally an awful combination of poor vetting, improper consent, and secretive sourcing. Many data
brokers refuse to tell potential clients where they get people’s
information from in the first place because they don’t want those
clients to collect that information for free on their own. Those
barriers to clarity of sourcing mean that whoever uses information from a
data broker cannot be sure of the correctness of its data, let alone
the ethics of the collection process. This monster of an industry creates situations in which people using an online service for the first time are confronted with autofilled forms bearing names they no longer use, are targeted
with advertisements for a gender that doesn’t align with theirs, and are
unable to escape constant reminders of traumatic past identities.
And
that’s just the start of it. If our goal was to align somebody’s own
experience with their self-conception, this conversation would be
relatively simple. But unfortunately, even if we created a world in
which every piece of data about a person matched their identity, we’d
still be placing queer users in grave danger. If we collect gender or
sexuality information, we must protect that information at all costs.
Hacked databases, confusing user privacy settings, or malicious sharing
of information meant to remain on one platform can all be used to harm
LGBT+ people. There is a long history of queer folks being put in danger
by those who try to out their sexuality or gender publicly when they
themselves aren’t out to friends, family, or coworkers. And in many
cases, our digital systems enable that kind of outing. On a larger
scale, the legal rights that queer-opposed governments assert over the
data on digital systems sometimes lead to situations in which data
about gender and sexuality is collected and then used to directly
persecute people who are considered deviants in their local
sociopolitical context.
INCLUSION AS AN ACTIVE DANGER
Which is to say, this article isn’t theoretical.
You
may have read the story of a pastor in Tennessee who called for the
execution of LGBT+ people. That pastor also happened to be a sheriff’s
officer in his county, and said unequivocally that he believed it was
part of the duty of police organizations to prosecute gay, bisexual, and
transgender people. Imagine how he, or an officer alongside him, would treat somebody pulled over whose gender marker on their driver’s license didn’t appear to match their gender expression.
A
28-year-old man in Fresno, California was arrested for threatening that he would cause a gay nightclub in the city to change its name to Pulse
because “you will share the same fate”, referring to the mass killing of
49 LGBT+ people in Orlando in 2016. Imagine what he might have done if
he had purchased a dataset from a data broker that contained the content
of dating profiles for people in Fresno. Such datasets are readily
available.
A
man in St. Louis, Missouri sent emails to the organizers of its annual
Pride celebration claiming that he would come heavily armed and “kill
every gay person I can”. Imagine if somebody like that was able to gain
access to an improperly secured database on an ecommerce website that
primarily catered to gay customers.
All of this happened in the span of just two weeks during June, Pride Month.
We
may feel that we have come a long way, especially those of us who live
in large international cities in which people can generally be open
about their identities, but let’s not kid ourselves. It can still be
physically dangerous to not be heterosexual or cisgender in any city,
and in any country. Digital industries bear responsibility for encoding
and republishing information that can be used to target marginalized
communities, and subfields like marketing analytics continue to
normalize dangerous practices. We have to have a moral reckoning with
what we consider acceptable in computing fields when gender and sexuality data isn’t essential.
TAKING A STAND ON DATA COLLECTION
Which brings me back to the freeform text entry option for gender selectors. We can do better than that.
How
about we stop collecting? Instead of putting the burden on a user to
fully understand the risks of sharing their highly personal information,
let’s put the burden on ourselves to treat that information right. If
we have no strong reason to collect it, or can’t guarantee its safety,
we shouldn’t collect it. Is the danger to your LGBT+ users worth the
ability to roughly guess whether somebody is buying a purse for
themselves or as a gift, or to assume you know what kind of movie they
want to watch?
We
have a responsibility to build safe technology. And to do so, we not
only need to confront the limitations of the inclusiveness framework,
but we need to intimately involve LGBT+ technologists and users in our
work at every step of the process. Like any other marginalized
community, queer people need to be front and center in conversations
about how technology can help or harm us. Otherwise, software teams are
destined to continue building systems that will alienate us, hurt us, or
even get us killed when we could be crafting a digital world that works
better for everybody.
I
take pride in the fact that I’m able to openly publish an article like
this, discussing topics that may have been taboo just a couple short
decades ago. I take pride in the fact that many of you reading this care
deeply about these issues, and want to make a positive difference
through your own work. But pride without concrete action is meaningless.
So I call on you: take pride in the act of creating systems that keep
my community safe. Take pride in sleeping at night with the knowledge
that you’ve done everything you can to include and to protect. And carry that pride forward so that others can learn to do the same.