Spam is the bane of the internet, flooding inboxes with offers of Viagra pills, get-rich-quick schemes, or the promise of love with a mail-order bride. According to newly published documents, even Government Communications Headquarters (GCHQ), the UK's signals intelligence agency, has a problem with junk emails.
“Spam emails are a large proportion of emails seen in SIGINT [signals intelligence],” reads part of a dense document from the Snowden archive, published by Boing Boing on Tuesday. “GCHQ would like to reduce the impact of spam emails on data storage, processing and analysis.”
Dated September 2011, the 96-page “HIMR Data Mining Research Problem Book” lists its authors as researchers from the Heilbronn Institute for Mathematical Research (HIMR). HIMR is a partnership between GCHQ and the University of Bristol, and supports “research across a range of areas of mathematics in the UK,” according to the university website.
In the document, researchers “set out areas of long-term data mining research,” all of which “are about improving our understanding of large datasets.” The section on the spam problem is just one small part of the document, which is marked as UK TOP SECRET STRAP1 COMINT, and only to be viewed by appropriate officials from Five Eyes member countries.
“Most external spam detectors work by analysing the content of an email however policy and processing mean this option is not always open to us,” the researchers write. This implies that the researchers needed a solution that works with metadata alone.
“We therefore lower our target and instead aim to classify email addresses by the type of emails they send,” they continue, likely meaning that the researchers are trying to weed out annoying emails by looking at addresses associated with spam.
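To make the idea concrete, here is a minimal sketch of how sender addresses could be classified from metadata alone. It is purely illustrative and is not the method described in the document: the features (message count, distinct recipients, reply rate), the thresholds, and every name in the code are assumptions chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MessageMeta:
    """Hypothetical metadata record: no subject line or body is stored."""
    sender: str
    recipient: str
    timestamp: int  # seconds since epoch


def sender_features(messages):
    """Aggregate per-sender features using only who mailed whom."""
    sent = defaultdict(list)
    pairs = set()
    for m in messages:
        sent[m.sender].append(m)
        pairs.add((m.sender, m.recipient))

    features = {}
    for sender, msgs in sent.items():
        recipients = {m.recipient for m in msgs}
        # A recipient counts as having replied if they ever mailed this sender.
        replied = sum(1 for r in recipients if (r, sender) in pairs)
        features[sender] = {
            "messages": len(msgs),
            "distinct_recipients": len(recipients),
            "reply_rate": replied / len(recipients),
        }
    return features


def looks_like_bulk_sender(f, min_recipients=3, max_reply_rate=0.1):
    """Assumed rule of thumb: many recipients, almost nobody writes back."""
    return (f["distinct_recipients"] >= min_recipients
            and f["reply_rate"] <= max_reply_rate)


messages = [
    MessageMeta("promo@example.com", "a@example.org", 1),
    MessageMeta("promo@example.com", "b@example.org", 2),
    MessageMeta("promo@example.com", "c@example.org", 3),
    MessageMeta("alice@example.org", "bob@example.org", 4),
    MessageMeta("bob@example.org", "alice@example.org", 5),
]

features = sender_features(messages)
flagged = sorted(s for s, f in features.items() if looks_like_bulk_sender(f))
print(flagged)
```

In this toy data, only the address that mails several recipients and never receives a reply is flagged. A real system would likely need far richer features and a trained classifier rather than fixed thresholds.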
GCHQ did not immediately respond to a request for comment, and neither did Ms Chrystal Cherniwchan, the manager of the Heilbronn Institute for Mathematical Research, or the University of Bristol’s press office. Motherboard was unable to independently confirm the legitimacy of the document.
The document says that a source, whose name Boing Boing redacted, provided researchers with a dataset for tackling the spam problem by classifying email addresses based on features.
However, it notes that “the collection posture of GCHQ has changed considerably” since that data was collected.
This may refer to the launch of a large data collection program called Tempora. Around 2011, according to the Guardian, GCHQ obtained access not just to email metadata but also to content, by tapping the undersea cables that carry the world's internet traffic.
That’s one way to make a spam filter, I guess.