Silicon Valley Watcher - Former FT journalist Tom Foremski reporting from the intersection of technology and media

Google Keeps Your Data Forever - Unlocking The Future Transparency Of Your Past

Posted by Tom Foremski - March 8, 2010

Wayne Rosing, when he was VP of Engineering at Google, once told me that Google saves every bit of data from people's searches, puts it onto tapes, and ships it off to a storage facility.

Why does Google collect all that data? I asked. We don't know, but we collect it all, he said.

These days Google has a better answer but it continues to save all that data.

Yes, Google will tell people that it removes data after 18 months but that is not strictly true. It removes the data that can be used to easily identify a person but the rest of the data is kept.

Google says it keeps the data to help advertisers with behavioral targeting. Or rather, it's to help Google serve up ads to users based on their behavior.

Nate Anderson, at Ars Technica, reports:

Search data is mined to "learn from the good guys," in Google's parlance, by watching how users correct their own spelling mistakes, how they write in their native language, and what sites they visit after searches. That information has been crucial to Google's famously algorithm-driven approach to problems like spell check, machine language translation, and improving its main search engine. Without the algorithms, Google Translate wouldn't be able to support less-used languages like Catalan and Welsh.

Data is also mined to watch how the "bad guys" run link farms and other Web irritants so that Google can take countermeasures.

Google eventually anonymizes the data:

The last octet of the IP address is wiped after nine months, which means there are 256 possibilities for the IP address in question. After 18 months, Google anonymizes the unique cookie data stored in these logs.

This isn't especially ambitious; Europe's data protection supervisors have called for IP anonymization after six months and competing search engines like Bing do just that (and Bing removes the entire IP address, not just the last octet). Yahoo scrubs its data after 90 days.
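To make the "256 possibilities" point concrete, here is a minimal sketch of what wiping the last octet of an IPv4 address leaves behind. It is an illustration only, not Google's actual process; the function names and the sample address (taken from a range reserved for documentation) are assumptions.

```python
import ipaddress


def wipe_last_octet(ip):
    """Zero the final octet of an IPv4 address (hypothetical helper)."""
    # Treat the address as a member of its /24 block and keep only the block.
    network = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(network.network_address)


def candidate_addresses(truncated_ip):
    """List every address that would truncate to the same value."""
    network = ipaddress.ip_network(f"{truncated_ip}/24", strict=False)
    return [str(addr) for addr in network]


# 192.0.2.77 is a made-up example address, not real log data.
truncated = wipe_last_octet("192.0.2.77")
candidates = candidate_addresses(truncated)
print(truncated, len(candidates))
```

The truncated record still narrows the searcher down to one block of 256 neighboring addresses, which is often a single ISP pool or office network.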


But this data could still be traced to individual users.

This is what happened when AOL released search data on about 650,000 users in August 2006. The data had been anonymized, but it was easy for reporters to find the actual users from clues in their searches, such as zip codes and town names.

You can search for what AOL users searched at these sites:

http://www.aolstalker.com/

http://aolpsycho.com/

The AOL searches revealed a glimpse into the unguarded thoughts of the digital haves.

In one instance, it looks as if a wife and a husband are using the same computer, each hiding their extramarital affairs from the other, then later looking for help online to deal with the pain of a failed relationship.

And there are real soap operas, tracked over a period of months... from the excitement of first meetings:

"how to get rid of nervousness of meeting a blind date 23 Apr, 12:27"

Then disaster:

"if your spouse has an affair should you contact the other person's spouse and let them know : 07 May, 09:58"

And the same user account asks:

"i had sex with my best friend and now he treats me differently :26 May, 13:58"

There are also "how to kill your wife" searches and more.

All this data was anonymized, but all the searches from a single computer were kept together, and that means they can eventually be traced. A New York Times reporter quickly managed to find one of the searchers.

Welcome to the future transparency of your past.

In the future, there will be vast databases of anonymized data from a variety of sources: search engines, credit card companies, cell phones, geo-location data, etc. It will be possible to triangulate that data, and if one piece of that data is linked to a user, it will unlock everything else.

Yes, it would take quite a bit of data mining but we have the technologies to do it today.

While each silo of information might technically be anonymous, in aggregate the silos would help identify users from their behavior. Each digital interaction throughout your day, whether through mobile, desktop, or bank, leaves a trace that can eventually be tracked and matched with an identifiable person.
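The matching described here can be sketched in a few lines. This is a toy illustration under stated assumptions: both data sets, the field names (`user_id`, `zip`, `hour`, `name`), and the people in them are invented, and a real linkage effort would involve far more data and fuzzier matching.

```python
from collections import defaultdict


def reidentify(anonymous_rows, identified_rows, keys=("zip", "hour")):
    """Link pseudonymous records to names when shared attributes match uniquely."""
    # Index the identified silo by the attributes both silos have in common.
    index = defaultdict(set)
    for row in identified_rows:
        index[tuple(row[k] for k in keys)].add(row["name"])

    matches = {}
    for row in anonymous_rows:
        names = index.get(tuple(row[k] for k in keys), set())
        if len(names) == 1:  # only an unambiguous match unlocks the pseudonym
            matches[row["user_id"]] = next(iter(names))
    return matches


# Hypothetical "anonymized" location pings from one silo.
pings = [
    {"user_id": "device-1", "zip": "94000", "hour": "2010-03-08T09"},
    {"user_id": "device-2", "zip": "94001", "hour": "2010-03-08T12"},
]
# Hypothetical named card transactions from another silo.
purchases = [
    {"name": "Alice Example", "zip": "94000", "hour": "2010-03-08T09"},
    {"name": "Bob Example", "zip": "94001", "hour": "2010-03-08T12"},
    {"name": "Carol Example", "zip": "94001", "hour": "2010-03-08T12"},
]

linked = reidentify(pings, purchases)
print(linked)
```

In this toy data, device-1 is linked to a name because only one person matches its place and time, while device-2 stays ambiguous until more data points narrow it down.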

And Google, with its dominance of your life (search, email, docs, Buzz, photos, video, etc.), is collecting huge amounts of your behavioral data, and it will be one of the main keys in unlocking your privacy.

Welcome to the future transparency of your past.











