Inside the groundbreaking plan to truly anonymize your 2020 Census data

There’s a parable about six men who are blind touching an elephant. Each describes the animal differently, depending on whether they felt its tusk, tail, leg, trunk, ear, or side. Take the accounts separately, and you’ll learn something about the feel of the individual parts. Put them together, and you get a sense of the elephant as a whole.

It’s the latter that worries those working at the Census Bureau. Right now, identifying individuals based on public Census data is difficult. But more information from outside sources is increasingly accessible, and the computing power needed to link globs of data from different places is also easier to attain than it was in 2010. There have been numerous studies showing that even anonymized datasets can be re-identified when they’re cross-referenced with each other. Journalists with The New York Times were able to verify they had received Donald Trump’s tax returns from 1985 to 1994 by comparing them to an anonymized database and other public documents.

So to protect your data, the Census Bureau is digging into statistical methodology, injecting variability and randomness into the data itself. But as numerous interviews with researchers and data scientists eager for all of that information show, balancing privacy with data science won’t be easy.

The Census attacks itself

In 2018, the Bureau published the results of a simulated attack on the 2010 Census data, to see if it could recreate private information from the many chunks of public data floating around. Over 308 million people were counted in the 2010 Census. Using the 2010 data, like sex, age, race, and ethnicity, it was able to reconstruct records for 46 percent of the population, exactly matching the confidential record only certain Census workers have access to.

Even with the Census records secure, the Bureau wanted to try linking the reconstructed records with commercially available data. Those reconstructed records didn’t have names, but using public databases, the Bureau found it could attach 45 percent of them to names and addresses. Those names were accurate only 38 percent of the time, however, coming out to a correct identification for 17 percent of the total population. An attacker wouldn’t necessarily know which 17 percent they had correct without some extra work. “They could have found out if they were right by doing additional field work,” John Abowd, chief scientist and associate director for research and methodology at the U.S. Census Bureau, told Digital Trends. “That means they’d have to go and find out by telephone or sending people to the homes to ask.” But the Census Bureau didn’t want to wait and see if more data would make the reidentification more likely. It started looking into using differential privacy ahead of the 2020 Census.
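
The 17 percent figure is the product of the two rates the Bureau reported. A quick check in Python, as a rough sketch: the variable names are illustrative, the population is rounded to 308 million (the article says only "over 308 million"), and the 45 percent is treated as a share of the total population, which is what makes the arithmetic land on 17 percent.

```python
# Figures quoted from the Bureau's simulated attack on the 2010 data.
population = 308_000_000   # assumption: rounded; the article says "over 308 million"
linked_share = 0.45        # records that could be attached to a name and address
correct_share = 0.38       # share of those attached names that were accurate

# Share of the population correctly re-identified.
reidentified_share = linked_share * correct_share
reidentified_people = population * reidentified_share

print(f"{reidentified_share:.1%} of the population")
print(f"about {reidentified_people / 1e6:.0f} million people")
```

Under those assumptions, the result is roughly 17 percent, or about 53 million people, though an attacker would not know which of the linked records were the correct ones.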

The impossible dream of perfect data

The more unique you are, the easier you are to spot in the data. If you’re the only 20-year-old Pacific Islander on your block, your record will stand out. That’s why, for years, the Bureau used “swapping” to mask such identifiable individuals. For example, The New York Times tracked down the sole couple who live on Liberty Island, the caretakers of the Statue of Liberty. While their Census records had their correct ages, their ethnicities had been listed as Asian, though they identify as white. That ethnicity wasn’t just randomly assigned; it had been substituted from another couple in the area. Just how frequently the Census is swapping such information is a mystery, to help keep the records more private. If attackers knew the percentage of numbers that were switched around, it would help them reconstruct the records.

The Bureau has applied different methods of privacy protection over the years. In the 1970s, it suppressed full tables, and for the 1990 census it started using measures including swapping. Plus, there would be errors and missing information on the forms people sent back, and workers would do their best to correct mistakes and fill in the blanks. Add to this fundamental problems like undercounting (missing vulnerable populations like people experiencing homelessness or those in very remote areas) and overcounting (marking a child of divorced parents twice).

In other words, there have been inaccuracies in the data forever. Differential privacy just lets the Bureau be transparent about how much it’s fiddled with the numbers. Let’s say there were 12 angry jurors in a room. In a secret ballot, they learn that 11 are for conviction and one is against. No one knows who’s who, unless they vote again while the lone holdout is in the bathroom. The idea with differential privacy is that the juror’s vote should be protected whether or not they’re actually included in the participant pool, though it’s not a guarantee of privacy.

“Differential privacy is forcing people to actually confront the fact that there’s error in the data, because differential privacy is very explicit about the introduction of error,” said Dr. Salil Vadhan, a computer science and applied mathematics professor at the Harvard John A. Paulson School of Engineering & Applied Sciences. “And we who work in differential privacy think of that as a feature not a bug.”

With differential privacy, some amount of “noise” is added to each value in a table. With the jurors example, you’d add or subtract an amount from the yay and nay votes, and the amount would have to fall within a certain range. With a very small population, like 12, you’d want to keep the range tight while still allowing for privacy. Maybe you choose plus or minus three. The algorithm would then randomly select a value within the range and apply it to the yays, then do the same for the nays. You could, then, end up with results that look like this: Ten for and negative two against. That’s obviously illogical, but the algorithm randomly selected to subtract one from the yays and subtract three from the nays. The point is, the people in the room wouldn’t know if the algorithm subtracted two from the yays and three from nays. That’s not helpful for a jury, but it does keep things a little more private.

In this example, the total number of differentially private votes (technically eight but, more logically, 10) doesn’t add up to the real number of people in the room, 12. You might look at that vote and say it’s worthless, but what if the vote didn’t have to be unanimous but merely a measure that needed to pass by majority? Even though the numbers aren’t exact, it’s clear the yays have it. Again, things become trickier if the voters are split down the middle and the algorithm assigns plus one to the nays and minus one to the yays. The problem is magnified with small populations but starts to lessen as groups get larger.
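
The juror example can be sketched in a few lines of Python. This is a toy model that follows the article's simplified description, where the noise is a whole number drawn uniformly from plus or minus three. The function names are hypothetical, and real differential privacy mechanisms typically draw noise from distributions such as the Laplace or geometric distribution rather than a hard-bounded range.

```python
import random

def add_bounded_noise(count, max_shift, rng):
    """Shift a count by a random whole number between -max_shift and +max_shift."""
    return count + rng.randint(-max_shift, max_shift)

def majority_survival_rate(yays, nays, max_shift=3, trials=10_000, seed=0):
    """Estimate how often the noisy tally names the same winner as the true tally.

    A tie counts as "no winner", so a true tie is preserved only by a noisy tie.
    """
    def winner(y, n):
        return "yays" if y > n else "nays" if n > y else "tie"

    rng = random.Random(seed)
    kept = 0
    for _ in range(trials):
        noisy_yays = add_bounded_noise(yays, max_shift, rng)
        noisy_nays = add_bounded_noise(nays, max_shift, rng)
        if winner(noisy_yays, noisy_nays) == winner(yays, nays):
            kept += 1
    return kept / trials

# The article's draw: subtract one from the yays and three from the nays.
print(11 - 1, 1 - 3)

# How often the majority survives for a lopsided, a close, and a large vote.
print(majority_survival_rate(11, 1))
print(majority_survival_rate(7, 5))
print(majority_survival_rate(600, 400))
```

With a range of plus or minus three, an 11-to-1 vote can never flip (the noisy yays are at least 8 and the noisy nays at most 4), while a 7-to-5 vote sometimes does. The same noise is negligible next to a 600-to-400 vote, which is why the problem eases as groups get larger.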

One feature of this noise is that it’s “tunable.” You can adjust it. If you have a table people are going to use for a specific metric, you can narrow the range for that column in the table while widening it for other values. If a demographer wants to know how many people of Hawaiian or Pacific Islander descent live in a city, the table with that information might have only a narrow range of noise injected into the actual number of people, while the ages are altered by a larger range. Instead of seeing the single 20-year-old, it’s suddenly a 25-year-old, and an attacker would be less certain that record belongs to a specific name and address in a commercial database.

From a demographics perspective, it might not matter too much that a 20-year-old is suddenly a 25-year-old, but for certain uses, like voting issues, that 20-year-old absolutely cannot become a 17-year-old. There are certain stats, known as invariants, that won’t have any noise injected. For example, state-level populations will remain untouched, so we’ll know exactly how many people live in Alaska, Kansas, and so on. The Bureau will also release the exact, unaltered total number of housing units at the Census block level, and it will not alter the number and type of occupied group quarters (like correctional facilities, college dorms, and shelters).

To make all the data products it releases more secure, the Bureau applied differential privacy to the voting-age population in the 2018 end-to-end census test and the 2010 Demonstration Data Product, which the Bureau released to help researchers see how the process would affect the data they use. While the Census used to provide exact numbers of people both above and below 18 (the voting age), the Census Bureau’s Data Stewardship Executive Policy Committee (DSEP) has “grave concerns about its effects on the Census Bureau’s ability to protect confidentiality, especially in block and block-group level tabulations,” according to an email from a Bureau spokesperson. DSEP hasn’t yet made final decisions on what will remain invariant.

Deductions from the privacy budget

For the 2020 Census, the form includes a number of demographic questions, including how many people live in the household; their ages, sexes, races, and ethnicities; and their relation to the head of household. As the 2010 Census data shows, however, the information adds up to more than it asks; based on its questions from a decade ago, the Bureau released about 7.8 billion statistics about Americans.

This time around, instead of releasing all that data and relying on swapping and suppression, each statistical table made public will nibble away at the privacy loss budget. This budget has to be determined first, then each table will be assigned a slice of that budget. Frequently used tables might stick closer to the original data, while less-utilized ones may get more noise.
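
One way to picture the budget is as a total amount of allowed privacy loss (usually written as epsilon) that is divided among tables, with each table's noise scaled inversely to its slice. The sketch below assumes the textbook Laplace mechanism for simple counts. The table names, shares, and total are hypothetical and are not the Bureau's actual parameters or algorithm.

```python
import random

def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def allocate_budget(total_epsilon, shares):
    """Split a total privacy loss budget across tables in proportion to `shares`."""
    total_share = sum(shares.values())
    return {name: total_epsilon * share / total_share
            for name, share in shares.items()}

def noise_scale(epsilon, sensitivity=1.0):
    """Laplace scale for a counting query: a smaller slice means more noise."""
    return sensitivity / epsilon

# Hypothetical values chosen only for illustration.
TOTAL_EPSILON = 1.0
SHARES = {"total_population": 6, "age_by_sex": 3, "detailed_race": 1}

budget = allocate_budget(TOTAL_EPSILON, SHARES)
rng = random.Random(2020)
for table, eps in budget.items():
    scale = noise_scale(eps)
    noisy = 1_000 + laplace_noise(scale, rng)  # true count of 1,000 in each table
    print(f"{table}: epsilon={eps:.2f}, noise scale={scale:.1f}, noisy count={noisy:.1f}")
```

In this toy allocation the heavily used table gets the largest slice and therefore the least noise, and the slices always sum to the total budget, so making one table more accurate means making another less so.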

The more privacy a table needs, the smaller the slice of the budget it can be given and the more noise needs to be injected. It’s a double-edged sword. Small populations need more privacy protection to deter database reconstruction, but introducing more noise in tables with small numbers affects the results more significantly. Like the invariant question, the Bureau hasn’t made final decisions about the privacy loss budget.

The question for smaller populations, like Alaska Natives, is what is an acceptable level of privacy loss, said Dr. Randall Akee at a recent Committee on National Statistics (CNSTAT) workshop on differential privacy and the Census. He’s an associate professor at the University of California, Los Angeles in the Department of Public Policy and American Indian Studies. “I think that’s something that has to be addressed directly to tribal governments themselves,” he said. Some might be fine with their populations being publicly enumerated, while others may be more reticent, he said. It’s a problem the Census Bureau is still grappling with. “We have some further prototyping and other work to do before we can show the user community what those will look like,” said Abowd.

The demands on data access

Critics of the Census Bureau’s differential privacy plan worry that it will release less information than it has in the past or that researchers will have to visit Federal Statistical Research Data Centers to do their work. There are only 29 centers throughout the U.S., and demographers and others are concerned about applying for and receiving access in a timely manner. While researchers have always needed to have their work approved to visit the centers, some think that they’ll need to do so more often with the 2020 data. “There’s always been a little bit of resentment about this kind of two-tiered access,” said Jane Bambauer, a law professor at the University of Arizona. She thinks differential privacy might exacerbate the issue, with graduate students and researchers at smaller universities losing out with less publicly available data.

At the December 2019 CNSTAT workshop, a number of researchers presented their findings after working with some differentially private data. The Bureau released some 2010 data products that it had put through its differential privacy system. Researchers then compared the new data with the original 2010 data that the Bureau released with old privacy measures, like swapping. Many participants highlighted the discrepancies they found. William Sexton of the Census Bureau said that one source of error was “post-processing,” or fiddling with the data after applying differential privacy measures. This would include adjustments like making sure a block didn’t have negative people. There are ways to improve these fixes, he said. In addition, the Bureau is taking into account the problems people are finding with the DP data and looking for solutions. “In order to know where to look for anomalies, we need a lot more eyes on the data than are available inside the house,” Abowd told Digital Trends.
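
The kind of post-processing Sexton describes, and the error it can introduce, can be illustrated with a small simulation. This is only a sketch under stated assumptions: it uses Laplace noise on a single tiny block and a plain clamp to zero, with hypothetical function names, whereas the Bureau's actual post-processing is considerably more involved.

```python
import random

def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_counts(true_count, scale, trials, rng):
    """Simulate many noisy releases of the same small block count."""
    return [round(true_count + laplace_noise(scale, rng)) for _ in range(trials)]

def clamp_non_negative(counts):
    """Post-process so that no block reports a negative number of people."""
    return [max(0, c) for c in counts]

rng = random.Random(0)
raw = noisy_counts(true_count=1, scale=2.0, trials=10_000, rng=rng)
fixed = clamp_non_negative(raw)

mean_raw = sum(raw) / len(raw)
mean_fixed = sum(fixed) / len(fixed)
print(f"mean before clamping: {mean_raw:.2f}")
print(f"mean after clamping:  {mean_fixed:.2f}")
```

Before clamping, the noisy counts average out near the true value of one. Clamping only ever raises values, so the average afterward drifts upward, a simple instance of how a sensible-looking fix can add systematic error in small blocks.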

There has been frustration from some researchers and others about just how they should prepare for the 2020 Census data. “It will take some time for the data users to learn which are the appropriate methods to use to try to analyze the data that have been protected in this way,” said Vadhan. The Bureau is still deciding on all the products it will release and how researchers will access the data.

Privacy pros and woes 

Each dent in the privacy loss budget represents a value judgment. While those judgments will ultimately be made by the Census Bureau, it is seeking feedback and input from researchers, advocates, and others. “It’s not a computer just spitting out a set of parameters that are the best ones to use,” said David Van Riper, director of spatial analysis at the Minnesota Population Center. “It’s a group of people that are going to take in information from user groups, different stakeholders, and decide on these policy decisions.”

Yet there have been communication issues between data users and the Bureau. “I went to the National Demographers Conference earlier this year, and there are a lot of social scientists that feel shut out of the sphere of influence for the key decision makers at the Census Bureau,” said Bambauer.

Some researchers still feel that the Bureau is putting a higher value on privacy than access to the data itself. “The Census Bureau has an obligation to provide data that’s useful for a broad spectrum of data users, from local planners to researchers to state and local governments,” said Van Riper. “And that usefulness and utility is, in my opinion, as important as the privacy protections.”

In 2010, the “Census moment” was set at 11:59 p.m. on April 1. The aim was to count everyone living in the U.S. at that exact time. Because of the gap between this moment and when people send back their forms, the enumeration will never be flawless. The uses of the Census data — reapportioning Congressional seats, distributing federal funds, and so on — are important enough that data users are willing to overlook the imperfections.

Recently, historians learned that census officials provided the government with information about Japanese-Americans who were then sent to internment camps. While there is no citizenship question on the 2020 Census, people are wary of how their information will be used. Some experts are concerned that mistrust could result in one of the largest undercounts of several minority groups in decades.

With differential privacy, the hope is to safeguard the information from anyone who would use the data against another person, whether they’re inside or outside the government. The Bureau hopes the promise of increased security will make people more willing to participate, especially those who have been hesitant to do so in the past.

Correction: This story was updated on March 5 to clarify the measures the Census Bureau will take to anonymize block-level data.

Jenny McGrath
Former Digital Trends Contributor
Jenny McGrath is a senior writer at Digital Trends covering the intersection of tech and the arts and the environment. Before…