Inside the groundbreaking plan to truly anonymize your 2020 Census data

By Jenny McGrath February 28, 2020

US Census Bureau Canvasser Walking Up to House 1 — U.S. Census Bureau

There’s a parable about six men who are blind touching an elephant. Each describes the animal differently, depending on whether they felt its tusk, tail, leg, trunk, ear, or side. Take the accounts separately, and you’ll learn something about the feel of the individual parts. Put them together, and you get a sense of the elephant as a whole.

It’s the latter that worries those working at the Census Bureau. Right now, identifying individuals based on public Census data is difficult. But more information from outside sources is increasingly accessible, and the computing power needed to link globs of data from different places is also easier to attain than it was in 2010. There have been numerous studies showing that even anonymized datasets can be re-identified when they’re cross-referenced with each other. Journalists with The New York Times were able to verify they had received Donald Trump’s tax returns from 1985 to 1994 by comparing them to an anonymized database and other public documents.

So to protect your data, the Census Bureau is digging into statistical methodology, injecting variability and randomness into the data itself. But as numerous interviews with researchers and data scientists eager for all of that information show, balancing privacy with data science won’t be easy.

The Census attacks itself

In 2018, the Bureau published the results of a simulated attack on the 2010 Census data, to see if it could recreate private information from the many chunks of public data floating around. Over 308 million people were counted in the 2010 Census. Using the 2010 data, like sex, age, race, and ethnicity, it was able to reconstruct records for 46 percent of the population, exactly matching the confidential record only certain Census workers have access to.

Even with the Census records secure, the Bureau wanted to try linking the reconstructed records with commercially available data. Those reconstructed records didn’t have names, but using public databases, the Bureau found it could attach 45 percent of them to names and addresses. Those names were accurate only 38 percent of the time, however, coming out to a correct identification for 17 percent of the total population. An attacker wouldn’t necessarily know which 17 percent they had correct without some extra work. “They could have found out if they were right by doing additional field work,” John Abowd, chief scientist and associate director for research and methodology at the U.S. Census Bureau, told Digital Trends. “That means they’d have to go and find out by telephone or sending people to the homes to ask.” But the Census Bureau didn’t want to wait and see if more data would make the reidentification more likely. It started looking into using differential privacy ahead of the 2020 Census.
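The 17 percent figure is simply the two rates multiplied together. A quick check in Python, using only the numbers quoted above (the variable names are illustrative, and the 45 percent is treated as a share of the whole population, which is the reading under which the 17 percent works out):

```python
# Figures quoted in the article; variable names are illustrative.
population_2010 = 308_000_000  # "over 308 million" people counted in 2010
linked_share = 0.45            # share attached to a name and address
name_accuracy = 0.38           # share of those names that were correct

# Share of the population an attacker would have identified correctly.
correct_share = linked_share * name_accuracy
print(f"correctly re-identified: {correct_share:.1%}")  # 17.1%
print(f"roughly {population_2010 * correct_share / 1e6:.0f} million people")  # 53
```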

The impossible dream of perfect data

The more unique you are, the easier you are to spot in the data. If you’re the only 20-year-old Pacific Islander on your block, your record will stand out. That’s why, for years, the Bureau used “swapping” to mask such identifiable individuals. For example, The New York Times tracked down the sole couple who live on Liberty Island, the caretakers of the Statue of Liberty. While their Census records had their correct ages, their ethnicities had been listed as Asian, though they identify as white. That ethnicity wasn’t just randomly assigned; it had been substituted from another couple in the area. Just how frequently the Census is swapping such information is a mystery, to help keep the records more private. If attackers knew the percentage of numbers that were switched around, it would help them reconstruct the records.
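As a rough illustration of the idea, and not the Bureau's actual procedure (which it keeps confidential), swapping can be sketched as exchanging one attribute between two records in the same area. The records and field names below are hypothetical:

```python
import copy

def swap_attribute(records, i, j, field):
    """Return a copy of `records` with `field` exchanged between records i and j."""
    swapped = copy.deepcopy(records)  # leave the confidential originals untouched
    swapped[i][field], swapped[j][field] = swapped[j][field], swapped[i][field]
    return swapped

# Hypothetical confidential records for two nearby households.
households = [
    {"block": "A", "age": 59, "race": "White"},
    {"block": "B", "age": 41, "race": "Asian"},
]

published = swap_attribute(households, 0, 1, "race")
print(published[0])  # {'block': 'A', 'age': 59, 'race': 'Asian'}
```

Ages stay put and the area still contains one record of each race, so totals for the wider area are unchanged even though the individual records no longer match reality.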

The Bureau has applied different methods of privacy protection over the years. In the 1970s, it suppressed full tables, and for the 1990 census it started using measures that included swapping. Plus, there would be errors and missing information on the forms people sent back, and workers would do their best to correct mistakes and fill in the blanks. Add to this fundamental problems like undercounting (missing vulnerable populations like people experiencing homelessness or those in very remote areas) and overcounting (counting a child of divorced parents twice).

In other words, there have been inaccuracies in the data forever. Differential privacy just lets the Bureau be transparent about how much it’s fiddled with the numbers. Let’s say there were 12 angry jurors in a room. In a secret ballot, they learn that 11 are for conviction and one is against. No one knows who’s who, unless they vote again while the lone holdout is in the bathroom. The idea with differential privacy is that the juror’s vote should be protected whether or not they’re actually included in the participant pool, though it’s not a guarantee of privacy.
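The bathroom scenario is what privacy researchers call a differencing attack: two exact tallies that differ by one person give that person's answer away. A minimal sketch using the jury numbers above:

```python
# Exact tallies from the two secret ballots in the jury example.
first_vote = {"for": 11, "against": 1}   # all 12 jurors present
second_vote = {"for": 11, "against": 0}  # one juror out of the room

# Subtracting one tally from the other isolates the absent juror's vote.
absent_juror_vote = {side: first_vote[side] - second_vote[side] for side in first_vote}
print(absent_juror_vote)  # {'for': 0, 'against': 1}
```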

“Differential privacy is forcing people to actually confront the fact that there’s error in the data, because differential privacy is very explicit about the introduction of error,” said Dr. Salil Vadhan, a computer science and applied mathematics professor at the Harvard John A. Paulson School of Engineering & Applied Sciences. “And we who work in differential privacy think of that as a feature not a bug.”

US Census 2020 Button on woman's jacket — Marvin Joseph/The Washington Post via Getty Images

With differential privacy, some amount of “noise” is added to each value in a table. With the jurors example, you’d add or subtract an amount from the yay and nay votes, and the amount would have to fall within a certain range. With a very small population, like 12, you’d want to keep the range tight while still allowing for privacy. Maybe you choose plus or minus three. The algorithm would then randomly select a value within the range and apply it to the yays, then do the same for the nays. You could, then, end up with results that look like this: Ten for and negative two against. That’s obviously illogical, but the algorithm randomly selected to subtract one from the yays and subtract three from the nays. The point is, the people in the room wouldn’t know if the algorithm subtracted two from the yays and three from nays. That’s not helpful for a jury, but it does keep things a little more private.
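Here is a minimal sketch of that toy procedure. It follows the article's simplified "plus or minus three" range; real differentially private systems typically draw noise from distributions such as the Laplace or geometric rather than a bounded uniform range, so treat this as an illustration only. The function name is made up for the example.

```python
import random

def noisy_tally(true_counts, max_noise=3, rng=random):
    """Add an independent random integer in [-max_noise, max_noise] to each count."""
    return {
        label: count + rng.randint(-max_noise, max_noise)
        for label, count in true_counts.items()
    }

votes = {"for": 11, "against": 1}
rng = random.Random(2020)  # seeded only so the example is repeatable
print(noisy_tally(votes, max_noise=3, rng=rng))
```

Each run of the function on a fresh random state can produce a different published tally, including impossible ones such as a negative number of votes.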

In this example, the total number of differentially private votes (technically eight but, more logically, 10) doesn’t add up to the real number of people in the room, 12. You might look at that vote and say it’s worthless, but what if the vote didn’t have to be unanimous but merely a measure that needed to pass by majority? Even though the numbers aren’t exact, it’s clear the yays have it. Again, things become more tricky if the voters are split down the middle and the algorithm assigns plus one to the nays and minus one to the yays. The problem is magnified with small populations but starts to lessen as groups get larger.
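Both points can be checked with a little arithmetic, again assuming the toy rule of at most plus or minus three per count. The helper names are hypothetical:

```python
def majority_is_safe(yes, no, max_noise=3):
    """True if the worst-case noise still cannot flip which side is ahead."""
    return yes - max_noise > no + max_noise

def worst_case_relative_error(true_count, max_noise=3):
    """Largest possible error as a fraction of the true count."""
    return max_noise / true_count

print(majority_is_safe(11, 1))  # True: at worst 8 for, 4 against
print(majority_is_safe(7, 5))   # False: could come out 4 for, 8 against

# The same +/-3 matters less and less as the group grows.
for size in (12, 1_200, 120_000):
    print(size, worst_case_relative_error(size))
```

For a room of 12, noise of three can move a count by a quarter of the whole group; for a city of 120,000, the same noise is a rounding error.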

One feature of this noise is that it’s “tunable.” You can adjust it. If you have a table people are going to use for a specific metric, you can narrow the range for that column in the table, while increasing it in other values. If a demographer wants to know how many people of Hawaiian or Pacific Islander descent live in a city, the table with that information might have the noise injected into the actual number of people narrowly changed, but the ages are altered by a larger range. Instead of seeing the single 20-year-old, it’s suddenly a 25-year-old, and an attacker would be less certain that record belongs to a specific name and address in a commercial database.

From a demographics perspective, it might not matter too much that a 20-year-old is suddenly a 25-year-old, but for certain uses, like voting issues, that 20-year-old absolutely cannot become a 17-year-old. There are certain stats, known as invariants, that won’t have any noise injected. For example, state-level populations will remain untouched, so we’ll know exactly how many people live in Alaska, Kansas, and so on. The Bureau will also release the exact, unaltered total number of housing units at the Census block level, and it will not alter the number and type of occupied group quarters (like correctional facilities, college dorms, and shelters).
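Tunable noise and invariants can be pictured as a per-column setting, where an invariant is just a column whose noise range is zero. The sketch below reuses the toy bounded-noise rule from the jury example; the field names, values, and ranges are all hypothetical and are not the Bureau's settings.

```python
import random

# Hypothetical noise ranges: zero for an invariant, tight for the headcount
# a demographer cares about, looser for age.
NOISE_RANGE = {"housing_units": 0, "residents": 1, "age": 5}

def perturb(row, noise_range, rng):
    """Add integer noise to each field, bounded by that field's own range."""
    return {
        field: value + rng.randint(-noise_range[field], noise_range[field])
        for field, value in row.items()
    }

block_row = {"housing_units": 48, "residents": 131, "age": 20}
released = perturb(block_row, NOISE_RANGE, random.Random(7))
print(released)  # housing_units is always 48; the other fields may shift
```

A voting-age constraint like the one described above would need an extra rule on top of this, since a range of five could still push a 20-year-old below 18.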

To make all the data products it releases more secure, the Bureau applied differential privacy to the voting-age population in the 2018 end-to-end census test and the 2010 Demonstration Data Product, which the Bureau released to help researchers see how the process would affect the data they use. While the Census used to provide exact numbers of people both above and below 18 (the voting age), the Census Bureau’s Data Stewardship Executive Policy Committee (DSEP) has “grave concerns about its effects on the Census Bureau’s ability to protect confidentiality, especially in block and block-group level tabulations,” according to an email from a Bureau spokesperson. DSEP hasn’t yet made final decisions on what will remain invariant.

Deductions from the privacy budget

For the 2020 Census, the form includes a number of demographic questions, including how many people live in the household; their ages, sexes, races, and ethnicities; and their relation to the head of household. As the 2010 Census data shows, however, the information adds up to more than it asks; based on its questions from a decade ago, the Bureau released about 7.8 billion statistics about Americans.

This time around, instead of releasing all that data and relying on swapping and suppression, each statistical table made public will nibble away at the privacy loss budget. This budget has to be determined first, then each table will be assigned a slice of that budget. Frequently used tables might stick closer to the original data, while less utilized ones may get more noise.

The more privacy a table needs, the smaller the slice of the budget it can spend and the more noise needs to be injected. It’s a double-edged sword. Small populations need more privacy protection to deter database reconstruction, but introducing more noise in tables with small numbers affects the results more significantly. Like the invariant question, the Bureau hasn’t made final decisions about the privacy loss budget.
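To make the budget idea concrete, here is a sketch using the textbook Laplace mechanism, where the noise scale equals a query's sensitivity divided by the epsilon spent on it, so a table given a smaller slice of the budget comes out noisier. The total budget, the table names, and the shares are invented for illustration; the article notes the Bureau had not settled on real values, and its production algorithm is not described here.

```python
import random

TOTAL_EPSILON = 1.0  # hypothetical overall privacy loss budget

# Hypothetical shares of the budget for three tables (they sum to 1).
SHARES = {"total_population": 0.6, "age_by_sex": 0.3, "detailed_race": 0.1}

def noise_scales(total_epsilon, shares, sensitivity=1.0):
    """Laplace scale per table: sensitivity / (epsilon allotted to that table)."""
    return {
        table: sensitivity / (total_epsilon * share)
        for table, share in shares.items()
    }

def laplace_noise(scale, rng):
    """The difference of two exponentials with mean `scale` is Laplace(0, scale)."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

scales = noise_scales(TOTAL_EPSILON, SHARES)
rng = random.Random(0)  # seeded only so the example is repeatable
for table, scale in scales.items():
    print(f"{table}: scale {scale:.2f}, sample noise {laplace_noise(scale, rng):+.2f}")
```

With these made-up shares, the frequently used total gets a scale of about 1.67 while the detailed race table gets 10, which is the trade-off between accuracy and privacy that the budget forces into the open.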

The question for smaller populations, like Alaska Natives, is what is an acceptable level of privacy loss, said Dr. Randall Akee at a recent Committee on National Statistics (CNSTAT) workshop on differential privacy and the Census. He’s an associate professor at the University of California, Los Angeles in the Department of Public Policy and American Indian Studies. “I think that’s something that has to be addressed directly to tribal governments themselves,” he said. Some might be fine with their populations being publicly enumerated, while others may be more reticent, he said. It’s a problem the Census Bureau is still grappling with. “We have some further prototyping and other work to do before we can show the user community what those will look like,” said Abowd.

The demands on data access

Critics of the Census Bureau’s differential privacy plan worry that it will release less information than it has in the past or that researchers will have to visit Federal Statistical Research Data Centers to do their work. There are only 29 centers throughout the U.S., and demographers and others are concerned about applying for and receiving access in a timely manner. While researchers have always needed to have their work approved to visit the centers, some think that they’ll need to do so more often with the 2020 data. “There’s always been a little bit of resentment about this kind of two-tiered access,” said Jane Bambauer, a law professor at the University of Arizona. She thinks differential privacy might exacerbate the issue, with graduate students and researchers at smaller universities losing out with less publicly available data.

At the December 2019 CNSTAT workshop, a number of researchers presented their findings after working with some differentially private data. The Bureau released some 2010 data products that it had put through its differential privacy system. Researchers then compared the new data with the original 2010 data that the Bureau released with old privacy measures, like swapping. Many participants highlighted the discrepancies they found. William Sexton of the Census Bureau said that one source of error was “post-processing,” or fiddling with the data after applying differential privacy measures. This would include adjustments like making sure a block didn’t have negative people. There are ways to improve these fixes, he said. In addition, the Bureau is taking into account the problems people are finding with the DP data and looking for solutions. “In order to know where to look for anomalies, we need a lot more eyes on the data than are available inside the house,” Abowd told Digital Trends.
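A small sketch shows why post-processing can itself introduce error. It assumes the simplest possible fix, replacing negative counts with zero, which is an illustration and not necessarily what the Bureau's system does; the block counts are invented.

```python
def clamp_non_negative(noisy_counts):
    """Replace impossible negative counts with zero."""
    return [max(0, count) for count in noisy_counts]

# Hypothetical noisy population counts for five neighboring blocks.
noisy_blocks = [10, -2, 4, 0, -1]
fixed_blocks = clamp_non_negative(noisy_blocks)

print(fixed_blocks)                          # [10, 0, 4, 0, 0]
print(sum(noisy_blocks), sum(fixed_blocks))  # 11 14
```

Every block now looks plausible, but the total has drifted upward because the negative values that balanced out the positive noise were thrown away.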

There has been frustration from some researchers and others about just how they should prepare for the 2020 Census data. “It will take some time for the data users to learn which are the appropriate methods to use to try to analyze the data that have been protected in this way,” said Vadhan. The Bureau is still deciding on all the products it will release and how researchers will access the data.

Privacy pros and woes

Each dent in the privacy loss budget represents a value judgment. While those judgments will ultimately be made by the Census Bureau, it is seeking feedback and input from researchers, advocates, and others. “It’s not a computer just spitting out a set of parameters that are the best ones to use,” said David Van Riper, director of spatial analysis at the Minnesota Population Center. “It’s a group of people that are going to take in information from user groups, different stakeholders, and decide on these policy decisions.”

Infographic showing the Census' history of privacy protections from 1700s to present day — U.S. Census Bureau

Yet there have been communication issues between data users and the Bureau. “I went to the National Demographers Conference earlier this year, and there are a lot of social scientists that feel shut out of the sphere of influence for the key decision makers at the Census Bureau,” said Bambauer.

Some researchers still feel that the Bureau is putting a higher value on privacy than access to the data itself. “The Census Bureau has an obligation to provide data that’s useful for a broad spectrum of data users, from local planners to researchers to state and local governments,” said Van Riper. “And that usefulness and utility is, in my opinion, as important as the privacy protections.”

In 2010, the “Census moment” was set at 11:59 p.m. on April 1. The aim was to count everyone living in the U.S. at that exact time. Because of the gap between this moment and when people send back their forms, the enumeration will never be flawless. The uses of the Census data — reapportioning Congressional seats, distributing federal funds, and so on — are important enough that data users are willing to overlook the imperfections.

Recently, historians learned that census officials provided the government with information about Japanese-Americans who were then sent to internment camps. While there is no citizenship question on the 2020 Census, people are wary of how their information will be used. Some experts are concerned that mistrust could result in one of the largest undercounts of several minority groups in decades.

With differential privacy, the hope is to safeguard the information from anyone who would use the data against another person, whether they’re inside or outside the government. The Bureau hopes the promise of increased security will make people more willing to participate, especially those who have been hesitant to do so in the past.

Correction: This story was updated on March 5 to clarify the measures the Census Bureau will take to anonymize block-level data.
