    Tim Libert, data viz.

    Looking Up Symptoms Online? These Companies Are Tracking You

    Written by Brian Merchant

    It’s 2015—when we feel sick, fear disease, or have questions about our health, we turn first to the internet. According to the Pew Internet Project, 72 percent of US internet users look up health-related information online. But an astonishing number of the pages we visit to learn about private health concerns—confidentially, we assume—are tracking our queries, sending the sensitive data to third party corporations, even shipping the information directly to the same brokers who monitor our credit scores. It’s happening for profit, for an “improved user experience,” and because developers have flocked to “free” plugins and tools provided by data-vacuuming companies.

    In April 2014, Tim Libert, a researcher at the University of Pennsylvania, custom-built software called webXray to analyze the top 50 search results for nearly 2,000 common diseases (over 80,000 pages total). He found the results startling: a full 91 percent of the pages made what are known as third-party requests to outside companies. That means when you search for “cold sores,” for instance, and click the highly ranked “Cold Sores Topic Overview WebMD” link, the website is passing your request for information about the disease along to one or more (and often many, many more) other corporations.
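
    To make the idea of a third-party request concrete, here is a minimal Python sketch. It is not webXray and does not reproduce Libert’s method; it only scans a page’s HTML for resources a browser would fetch automatically and flags the ones hosted on a different domain. The sample page, the health-site.example address, and the two-label domain comparison are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


def registered_domain(host):
    # Naive assumption: the last two labels identify the site owner.
    # This is wrong for suffixes such as .co.uk; a real tool would
    # consult the Public Suffix List instead.
    return ".".join(host.lower().split(".")[-2:])


class ResourceCollector(HTMLParser):
    """Collects URLs a browser would fetch on its own while rendering a page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "img", "iframe") and attrs.get("src"):
            self.urls.append(urljoin(self.page_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            self.urls.append(urljoin(self.page_url, attrs["href"]))


def third_party_domains(page_url, html):
    """Return the sorted set of outside domains the page pulls resources from."""
    collector = ResourceCollector(page_url)
    collector.feed(html)
    first_party = registered_domain(urlparse(page_url).hostname)
    found = set()
    for url in collector.urls:
        host = urlparse(url).hostname
        if host and registered_domain(host) != first_party:
            found.add(registered_domain(host))
    return sorted(found)


# Hypothetical page, invented for this example.
SAMPLE_URL = "http://www.health-site.example/conditions/cold-sores"
SAMPLE_HTML = """
<html><head>
<script src="https://www.google-analytics.com/analytics.js"></script>
<link rel="stylesheet" href="/css/site.css">
</head><body>
<img src="https://static.health-site.example/logo.png">
<script src="//s7.addthis.com/js/300/addthis_widget.js"></script>
</body></html>
"""

if __name__ == "__main__":
    print(third_party_domains(SAMPLE_URL, SAMPLE_HTML))
```

    On the sample page, the stylesheet and the logo count as first party, while the analytics script and the sharing widget are flagged as third party. Static HTML scanning of this kind misses requests that scripts make after the page loads, so it would undercount compared with a tool that watches a real browser.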

    According to Libert’s research, which is published in the Communications of the ACM, about 70 percent of the time, the data transmitted “contained information exposing specific conditions, treatments, and diseases.” That, he says, is “potentially putting user privacy at risk.” And it means you’ll probably want to think twice before looking up medical information on the internet.

    Here’s what’s happening in a bit greater detail: Let’s say you make a search for “herpes.” Plugging that query into a search engine will return a list of results. Chances are, whatever site you choose to click on next will send information not just to the server of the intended site—say, the Centers for Disease Control, which maintains the top search result from Google—but to companies that own the elements installed on the page. Here’s why.

    When you click that CDC link, you’re making a so-called “first party request.” That request goes to the CDC’s servers, which return the HTML file for the page you’re looking for. In this case, it’s “Genital Herpes - CDC Factsheet,” which is perhaps the page on the internet you’d least want anyone to know you’re looking at. But because the CDC has installed Google Analytics to measure its traffic stats, and has, for some reason, included AddThis code that allows Facebook and Twitter sharing (raising the question of who shares disease pages socially), the page also triggers a third party request to each of those companies. Each of those requests carries the address of the page you’re on (http://www.cdc.gov/std/herpes/STDFact-Herpes.htm) in its HTTP referrer string, which makes explicit to those third party corporations that you were reading about herpes.
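
    The leak through the referrer string can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a record of what any company actually does: tracker.example is a made-up third-party host, the request is built but never sent, and the keyword extraction is a guess at the simplest thing a recipient could do with the header.

```python
import re
from urllib.parse import urlparse
from urllib.request import Request

# The page address comes from the article; the tracker address is hypothetical.
PAGE_URL = "http://www.cdc.gov/std/herpes/STDFact-Herpes.htm"
TRACKER_URL = "https://tracker.example/collect.js"


def build_third_party_request(resource_url, page_url):
    """Build (but do not send) the request a browser makes for an embedded resource.

    The Referer header tells the third party which page asked for it.
    """
    return Request(resource_url, headers={"Referer": page_url})


def path_keywords(referrer):
    """Split the referrer's path into lowercase words, as a recipient could."""
    path = urlparse(referrer).path
    return re.findall(r"[a-z0-9]+", path.lower())


if __name__ == "__main__":
    request = build_third_party_request(TRACKER_URL, PAGE_URL)
    print("Request goes to:", request.host)
    print("Referer header: ", request.get_header("Referer"))
    print("Words in path:  ", path_keywords(request.get_header("Referer")))
```

    The word “herpes” appears in the path without the third party having to do any analysis at all. A site can limit what is sent by setting a referrer policy, so the full address is not guaranteed to reach every recipient.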

    Thus, Libert has discovered that the vast majority of health sites, from the for-profit WebMD.com to the government-run CDC.gov, are loaded with tracking elements that are sending records of your health inquiries to web giants like Google, Facebook, and Pinterest, and to data brokers like Experian and Acxiom.

    From there, it becomes relatively easy for the companies receiving the requests, many of which are collecting other kinds of data (in cookies, say) about your browsing as well, to identify you and your illness. That URL, or URI, which very clearly contains the disease being searched for, is broadcast to Google, Twitter, and Facebook, along with your computer’s IP address and other identifying information.

    “The underlying significance of the 91 percent figure is that this is utterly endemic across all types of sites,” Libert told me. “This isn’t just commercial sites who need to turn a profit, these are organizations you trust: the government, non-profits, universities.”

    The CDC example is notable because it’s a government site, one we assume should be free of the profit motive, and entirely safe for use. “It’s basically negligence,” Libert told me.

    But for-profit health sites are often much worse. WebMD, for instance, is the 106th most-visited site in the US, according to Alexa, and figures prominently in search results for the most commonly searched diseases. It sends third party requests to a whopping 34 separate domains, including the data brokers Experian and Acxiom.

    “WebMD is basically calling up everybody in town and telling them that’s what you’re looking at,” Libert said. Seeing as how there’s a good chance that’s a sensitive disease, users would likely not be pleased.

    The same is true for About.com (which ships your requests to comScore, Experian, Google, and Microsoft, among others), Health.com (which sends your data to over a dozen different third party corporations), and many others. If you’re visiting a for-profit health website, you can all but guarantee you’re being tracked, and that your requests are ending up in the hands of not just firms that earn revenue from advertising (which is why Facebook and Google collect this kind of data) but also firms that sell data outright (as Experian and Acxiom do).

    Many of the top search results for medical queries aren’t explicitly health sites—FreeDictionary, eHow, Merriam-Webster, Answers, and LiveStrong all figure prominently—and they have the most tracking elements of any health-related sites Libert analyzed. Expect a visit to any of the above to be noted by over a dozen tracking companies.

    Even trusted, nonprofit public websites are tracking you—the Mayo Clinic and Planned Parenthood, for example, each send your data to third parties like Google and Ensighten. This isn’t because either is intending to do anything nefarious; it’s just because they’ve installed convenient free software—but it is nonetheless sending data about the health issues you’re looking at to corporations. (WebMD, the CDC, and the Mayo Clinic did not immediately respond when asked to comment.)

    And Libert’s work confirms reports made by the Associated Press and the Electronic Frontier Foundation that Healthcare.gov was exposing user data.

    “Healthcare.gov is just the tip of the iceberg,” Cooper Quintin, the EFF’s staff technologist, who exposed how that site was inadvertently making user data public through referrer strings in much the same way, told me after learning of Libert’s research. “You might think that the data that you read on the internet is private, between you and your service provider,” Quintin said. “But it’s not.”

    A random sample of 1,000 third party requests. Data visualization via Tim Libert. Google dwarfs its peers in requests.

    So why are so many sites broadcasting your confidential, potentially embarrassing, and possibly damaging health information to corporations? With nonprofit sites like the CDC and the Mayo Clinic, again, it’s not due to any insidious intent; it’s simply because developers are installing “free” tools like Google Analytics and social media “share” buttons on their sites, and most users have no idea that means information about their searches is being shared with third parties.

    “The problem is that using these 'free' third-party tools is really easy for web developers. What developers don’t consider is, why are these tools free?” Libert said. “These companies aren’t charities—they are providing these tools to make money from user data. So what you have is a government web developer passing the cost to users who have no idea their data is being traded away without their consent.”

    Google is the biggest offender here, not just on government sites, but across the board—it owns the vast majority of elements that are tracking you. Libert found that “78 percent of pages analyzed included elements which were owned by Google,” a result that he says stunned him.

    Google certainly isn’t the only company tracking on health sites. But second place is fairly distant: 38 percent of the 80,000 pages analyzed sent third party requests to comScore, another internet analytics company. Meanwhile, 31 percent of sites funneled data to Facebook, 22 percent to AppNexus, 18 percent to AddThis (a web tracking company), 18 percent to Twitter, 16 percent to Amazon, and 12 percent to Yahoo. A number sent requests to a combination of many of the separate companies listed above, and more. And it is notable that Google receives so much more data than any other company.

    “While I was expecting Google to have a big footprint I didn’t realize how big it would be exactly,” Libert told me. “Even if you use an iPhone, DuckDuckGo, and Hotmail, the second you open your browser there is a huge chance Google gets your data.” That’s because Google is absorbing your information through a variety of hosted services and domain names, from Google Analytics, which measures site traffic, to DoubleClick, an advertising service, and YouTube, its video platform. 

    If a website has Analytics installed on its backend, then a third party request is automatically sent to Google—because this occurs on what Libert calls the “invisible web,” the user is never aware this is taking place. Libert found that Analytics was spurring such requests on a full 45 percent of the 80,000 health pages he analyzed.

    “Regardless of type of services provided, in some way all of these HTTP requests funnel information back to Google,” Libert writes in the paper. “This means that a single company has the ability to record the web activity of a huge number of individuals seeking sensitive health-related information without their knowledge or consent.”

    What can happen when Google starts vacuuming up your health data? An incident that occurred in one Canadian’s inbox offers a clue.

    In January of 2014, Canada’s privacy commission ruled that Google had violated the nation’s privacy laws after a user discovered he was being targeted by ads for devices that claimed to treat sleep apnea. He had previously used the search engine to learn about the condition and to search for similar devices, but had never volunteered consent. The Office of Canada’s Privacy Commissioner was able to replicate the experience, and ruled Google had broken the law.

    “Most Canadians consider health information to be extremely sensitive,” the commissioner said at the time. “It is inappropriate for this type of information to be used in online behavioral advertising.” Google argued that the “display criteria and users lists for ads in its network are determined by individual advertisers,” not Google, according to The Register, and that it was against its policies to use sensitive information to advertise, but admitted that “certain advertisers or third party buyers can use remarketing products in error.” No fines were levied, however, as Canada was satisfied with Google’s insistence that it would adopt more stringent privacy policies. Still, it’s informative as to what Google is capable of—if it or the network of advertisers it serves wanted to display targeted ads hawking purported treatments for herpes to those who had been searching for the term, it’s clearly well within its ability.

    This risk falls under one of two categories that Libert identifies in his research: personal identification and blind discrimination. The Google case is an example of the first: if Google wants to, it has the data to discern who you are and what ails you. This is creepy, certainly, but it also has real-world ramifications beyond the fact that your search provider knows that you have IBS. Users have no control over how that data is stored or secured, for instance, and it may be vulnerable to hackers.

    A spokesperson for Google issued me the following statement: “Lots of websites use our services to measure their traffic, embed YouTube videos, or fund their content with advertising. We have strict policies prohibiting such websites from passing any personally identifiable data. We don't want and don't use that kind of sensitive data. And to be clear: we absolutely don't allow our ad systems to be used to form profiles, or to target ads, based on health or medical information.”

    Google did not immediately elaborate when asked how this blanket denial could be reconciled with Libert’s findings. Libert, for his part, called the response “typical boilerplate.”

    The most disconcerting possibilities lie not with Google or Facebook, which both vacuum up data from all over the web and store it on their servers, but with the practices of data brokers like Experian, whose products Libert found present in roughly 5 percent of the pages sampled—typically, in for-profit sites like WebMD, About.com, and MedicineNet.com. Experian is a credit bureau that has ballooned into a “global information services group.” It was the subject of a scathing Senate investigation led by Jay Rockefeller in 2013, and concerns itself with collecting as much data about individuals as possible, then packaging and selling it. And yes, that includes private health data.

    “I found Experian on thousands of sites,” Libert told me. “Here is a company that knows the intimate details of my student loans, and they may also know about my health concerns? That blew me away.”

    “It’s chilling to think that the companies keeping track of your credit are also keeping track of your health,” he added.

    There are a number of reasons that it is problematic that data brokers based around the world are storing your health data without your knowledge or consent. The first is simple: they could misuse it. In 2013, Experian was fined for selling troves of consumer data to identity thieves in Vietnam. Furthermore, the data, stored by unknown entities with unknown levels of security, may be at risk from hackers. “Merely storing personally-identifiable information on health conditions raises the potential for loss, theft, and abuse,” Libert writes.

    And as Libert points out in his paper, another company, Medbase200, was reported as using “proprietary models” to generate and sell lists with classifications such as “rape sufferers,” “domestic abuse victims,” and “HIV/AIDS patients,” which has to be among the ugliest stripe of targeted advertising yet conceived.

    “The problem is that people’s data is not just in a vacuum. Experian can take this data and add it to the other information they have,” Quintin said. “Google is the same thing, they have a cookie on you that has your real name and your data and all of that. And there’s data sharing going on behind the scenes. Experian can share this data with other companies.”

    And that kind of practice leads to what Libert calls the blind discrimination problem.

    “Experian is a data broker well known for selling credit scores, which include information on bankruptcies,” Libert said. “Academic research by Senator Elizabeth Warren has shown that over 60 percent of bankruptcies are medical-related. Given that I found Experian tracking users on thousands of health-related web pages, it is entirely possible the company not only knows which individuals went bankrupt for medical reasons, but when they first went online to learn about their illness as well. In essence Experian can follow an individual from her first sneeze to her final unpaid hospital bill.” (Experian failed to respond when asked to comment.)

    Quintin agrees this poses a real threat. “I would say that’s totally possible.” He suggests that it’s plausible that the medical data these brokers vacuum up could eventually be factored into your credit score—and even used to determine how much you pay for health care. “Look, this is all speculative, right? But if I’m a bank and you’re applying for a loan, there’s no reason I would not want that information.” And the data brokers could provide it. “There’s this advertising demographic of you, and now you’re getting healthcare data in there, too. How much are we going to charge you for healthcare, if you’ve been searching for ‘cancer’ and a bunch of illnesses? Health care services could raise your rates.”

    “Another nightmare scenario is applying for jobs,” Quintin continued. “A company might get a demographic profile from one of these data brokers and use that information to decide whether or not to hire you.”

    But the chief problem is simply that just about all of the above, under current laws, is legal.

    “The only thing dictating what will happen with the data is who can make money from it,” Libert said. “The only thing that guides the use of this data is profit. No oversight, no laws, no nothing.” That’s true, in the US, of both Google’s data-vacuuming habits, and of Experian’s vast for-profit health tracking enterprise.

    “Health data is some of the most private data you have. That data reveals a lot about you. There’s a reason that we have laws like HIPAA. Unfortunately those don’t apply here,” Quintin said. HIPAA is the Health Insurance Portability and Accountability Act of 1996, and it forces the government and doctors to keep patient medical records secure and confidential. It has no jurisdiction over search engine companies or data brokers who sap data “volunteered” by users.

    “While Experian is subject to the Fair Credit Reporting Act, they do not fall under HIPAA, meaning health information they collect on the web is virtually regulation-free,” Libert said. “Clearly, Congress needs to step in here.”

    Quintin says that there are things users can do to protect themselves from such tracking right now, such as installing ad blockers like Privacy Badger. Or you could stop visiting for-profit health websites altogether. There is one bright spot here: Wikipedia. It was one of the only sites carrying health information that sent no third party requests to corporations.

    For now, however, millions of people are exposing their personal health profiles to internet advertisers and data brokers, right at the moment they’re making the most confidential inquiries imaginable.


    “This is a huge problem on the web,” Quintin said. “It’s just the way the web works, and it needs to work differently—especially when it comes to people’s health data.”