
January 24, 2013

BehindTheMedspeak: You can be personally identified from an anonymous DNA sample


That's the gist of Gina Kolata's explosive New York Times front page story last Friday.

You know how everyone's got their baggies in a twist about Facebook's privacy settings?

That's small beer compared to this.

Consider that when you send a DNA sample to 23andme along with your credit card number to pay for a detailed, state-of-the-art, up-to-date analysis that you believe to be private and eyes-only, there may be enough information there to pinpoint exactly who you are — and whom else you might be related to in the company's data bank.

From the article: "The genetic data posted online seemed perfectly anonymous — strings of billions of DNA letters from more than 1,000 people. But all it took was some clever sleuthing on the Web for a genetics researcher to identify five people he randomly selected from the study group. Not only that, he found their entire families, even though the relatives had no part in the study — identifying nearly 50 people."

More excerpts below.

The researcher did not reveal the names of the people he found, but the exercise, published Thursday in the journal Science, illustrates the difficulty of protecting the privacy of volunteers involved in medical research when the genetic information they provide needs to be public so scientists can use it.

Other reports have identified people whose genetic data was online, but none had done so using such limited information: the long strings of DNA letters, an age and, because the study focused on only American subjects, a state.

"I've been worried about this for a long time," said Barbara Koenig, a researcher at the University of California in San Francisco who studies issues involving genetic data. "We always should be operating on the assumption that this is possible."

The data are from an international study, the 1000 Genomes Project, that is collecting genetic information from people around the world and posting it online so researchers can use it freely. It also includes the ages of participants and the regions where they live. That information, a genealogy Web site and Google searches were sufficient to find complete family trees. While the methods for extracting relevant genetic data from the raw genetic sequence files were specialized enough to be beyond the scope of most laypeople, no one expected it to be so easy to zoom in on individuals.

"We are in what I call an awareness moment," said Eric D. Green, director of the National Human Genome Research Institute at the National Institutes of Health.

There is no easy answer about what to do to protect the privacy of study subjects. Subjects might be made more aware that they could be identified by their DNA sequences. More data could be locked behind security walls, or severe penalties could be instituted for those who invade the privacy of subjects.

... Opinions about just what should be done vary greatly among experts.

But after seeing how easy it was to find the individuals and their extended families, the N.I.H. removed people's ages from the public database, making it more difficult to identify them.

But Dr. Jeffrey R. Botkin, associate vice president for research integrity at the University of Utah, which collected the genetic information of some research participants whose identities were breached, cautioned about overreacting. Genetic data from hundreds of thousands of people have been freely available online, he said, yet there has not been a single report of someone being illicitly identified. He added that "it is hard to imagine what would motivate anyone to undertake this sort of privacy attack in the real world." But he said he had serious concerns about publishing a formula to breach subjects' privacy. By publishing, he said, the investigators "exacerbate the very risks they are concerned about."

The project was the inspiration of Yaniv Erlich, a human genetics researcher at the Whitehead Institute, which is affiliated with M.I.T. He stresses that he is a strong advocate of data sharing and that he would hate to see genomic data locked up. But when his lab developed a new technique, he realized he had the tools to probe a DNA database. And he could not resist trying.

The tool allowed him to quickly find a type of DNA pattern that looks like stutters among billions of chemical letters in human DNA. Those little stutters — short tandem repeats — are inherited. Genealogy Web sites use repeats on the Y chromosome, the one unique to men, to identify men by their surnames, an indicator of ancestry. Any man can submit the short tandem repeats on his Y chromosome and find the surname of men with the same DNA pattern. The sites enable men to find their ancestors and relatives.

So, Dr. Erlich asked, could he take a man's entire DNA sequence, pick out the short tandem repeats on his Y chromosome, search a genealogy site, discover the man's surname and then fully identify the man?

He tested it with the genome of Craig Venter, a DNA sequencing pioneer who posted his own DNA sequence on the Web. He knew Dr. Venter's age and the state where he lives. Bingo: two men popped up in the database. One was Craig Venter.

"Out of 300 million people in the United States, we got it down to two people," Dr. Erlich said.

He and his colleagues calculated they would be able to identify, from just their DNA sequences, the last names of approximately 12 percent of middle class and wealthier white men — the population that tends to submit DNA data to recreational sites like the genealogical ones. Then by combining the men's last names with their ages and the states where they lived, the researchers should be able to narrow their search to just a few likely individuals.
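The two-step narrowing described above (surname from Y-chromosome repeats, then age and state) can be sketched in a few lines of Python. This is only a toy illustration, not the researchers' actual method: the marker names are real Y-STR loci, but every repeat count, surname, person and record below is invented, and the helper names `mismatches`, `infer_surnames` and `narrow` are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical Y-STR profile: marker name -> repeat count.
# The marker names are real loci; all values are invented for illustration.
QUERY_PROFILE = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}

# Toy stand-in for a genealogy site's surname database (invented entries).
GENEALOGY_DB = [
    ("Alder", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("Birch", {"DYS19": 15, "DYS390": 23, "DYS391": 10, "DYS393": 13}),
    ("Cedar", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 12}),
]


@dataclass
class Person:
    name: str
    surname: str
    age: int
    state: str


# Toy stand-in for public records found by web searches (invented entries).
PUBLIC_RECORDS = [
    Person("A. Alder", "Alder", 66, "CA"),
    Person("B. Alder", "Alder", 66, "CA"),
    Person("C. Alder", "Alder", 31, "CA"),
    Person("D. Alder", "Alder", 66, "NY"),
    Person("E. Birch", "Birch", 66, "CA"),
]


def mismatches(query, entry):
    """Count query markers whose repeat count differs or is missing in entry."""
    return sum(1 for marker, count in query.items() if entry.get(marker) != count)


def infer_surnames(profile, db, max_mismatches=0):
    """Step 1: surnames whose stored profile is close enough to the query."""
    return sorted({s for s, p in db if mismatches(profile, p) <= max_mismatches})


def narrow(records, surnames, age, state):
    """Step 2: keep only people matching the surname, age and state."""
    return [r for r in records
            if r.surname in surnames and r.age == age and r.state == state]


surnames = infer_surnames(QUERY_PROFILE, GENEALOGY_DB)
candidates = narrow(PUBLIC_RECORDS, surnames, age=66, state="CA")
print(surnames)
print([c.name for c in candidates])
```

In this made-up data set the exact-match search yields one surname, and the age and state filter leaves two candidates, mirroring the "down to two people" outcome in the article. Loosening `max_mismatches` widens the surname list, which is the sort of trade-off a real search would have to manage.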

Now for the big test. On the Web and publicly available are DNA sequences from subjects in the 1000 Genomes Project. People's ages were included and all the Americans lived in Utah, so the researchers knew their state.

Dr. Erlich began with one man from the database. He got the Y chromosome's short tandem repeats and then went to genealogy databases and searched for men with those same repeats. He got surnames of the paternal and maternal grandfather. Then he did a Google search for those people and found an obituary. That gave him the family tree.

"Now I knew the whole family," Dr. Erlich said. And it was so simple, so fast.

"I said, 'Come on, that can’t be true.'" So he probed and searched and checked again and again.

"Oh my God, we really did this," Dr. Erlich said. "I had to digest it. We had so much information."

He and his colleagues went on to get detailed family trees for other subjects and then visited Dr. Green and his colleagues at the N.I.H. to tell them what they had done.

They were referred to Amy L. McGuire, a lawyer and ethicist at Baylor College of Medicine in Houston. She, like others, called for more public discussion of the situation.

"To have the illusion you can fully protect privacy or make data anonymous is no longer a sustainable position," Dr. McGuire said.

When the subjects in the 1000 Genomes Project agreed to participate and provide DNA, they signed a form saying that the researchers could not guarantee their privacy. But, at the time, it seemed like so much boilerplate. The risk, Dr. Green said, seemed "remote."

"I don't know that anyone anticipated that someone would go and actually figure out who some of those people were," Dr. McGuire said.


Above, Dr. Botkin "added that 'it is hard to imagine what would motivate anyone to undertake this sort of privacy attack in the real world.'"

All I can say is thank God this guy is up there in his ivory tower and not responsible for securing anyone's personal safety.

It's people like this — isolated and insulated from the real world we live in, where people do terrible things to others both online and in real life simply because they can — who endanger us all.

January 24, 2013 at 12:01 AM | Permalink




And the insurance companies that use your parents' DNA to redline you; and the schools that won't admit you because of your short life; and the professions that deny you entry because of your genetic profile; and the ever-present military recruiters who want to wring the most that they can from their cannon fodder...

Posted by: 6.02*10^23 | Jan 24, 2013 4:37:00 PM

Well, but what if your kid was kidnapped and this sort of DNA reverse-lookup was used by law enforcement to track and apprehend the kidnapper (from, say, a hair left behind at the scene)?

Or, an anonymous rapist is located and charged based on DNA searches like this -- that is, to locate a suspect whose DNA is not on file as a previous offender but can be traced... nonetheless?

These seem like possible theoretical benefits -- additionally one might consider helping displaced family members reunite, adopted children (legally) locating birth-parents, estate heirs located based on DNA-augmented record searches, and so forth.

Posted by: Anonymouse | Jan 24, 2013 2:39:50 PM

You've seen the film Gattaca? There you go...

Posted by: Rattlesnake Jake | Jan 24, 2013 7:16:39 AM

Now consider that miscreant relative whose DNA is collected and characterized by state and/or federal authorities.

DNA may well lead to "the guilty project," where we all shed hair (with follicles), blood and other tissue and bodily fluids - leaving a virtual DNA trail everywhere we go.

Not a happy thought.

Posted by: 6.02*10^23 | Jan 24, 2013 12:10:52 AM

The comments to this entry are closed.