The Budding Science of Chatroom Forensics

Anything you say, and how you say it, can be used against you in a court of law.

While the word "forensics" might conjure up images of collecting hair or fingerprints from crime scenes, the scope of forensic science extends beyond the physical into almost any realm. If you need to discover the weather patterns in the past, you need some forensic meteorology. Forensic accounting uncovers book-cookery. And down in Australia, they're working on forensic text comparison in order to catch pedophiles.

Presenting at the 44th Conference of the Australian Linguistic Society, Shunichi Ishihara, a senior lecturer at the Australian National University, outlined how the nascent field works, and why the police are going to need to start taking some statistics classes. The premise is that, much like a fingerprint, the way a person chats online is fairly distinctive.

Forensic text comparison is typically used to compare incriminating texts or chatlogs with those known to belong to "the suspect." As Ishihara outlines, "despite the wide recognition that the misuse of chat rooms by online child sex offenders is a serious problem, forensic studies specifically targeting chatlog messages are quite sparse."

It's possible that part of what holds forensic text comparison (FTC) back is that it can only give two "conditional probabilities": the probability of the evidence if the suspect is the offender, and the probability of the evidence if the suspect isn't. This might not seem like much, but it follows the lead of DNA, fingerprint, handwriting, and voice forensic sciences.

While probably not enough to rest an entire case on, these are the stats that can assist the prosecution without relying on things like metadata and ISPs that can be thwarted with privacy software or legal prohibitions. Instead, to determine the strength of evidence, FTC relies on the transcript of the speech itself.

To calculate a "likelihood ratio" (LR) expressing how strongly the evidence suggests that a chat belongs to the suspect in question, the prosecution only needs a known sample of the suspect's chatlogs, the offending chatlogs in question, and background or reference samples from other writers. As is explained in the paper, "the LR is a ratio of similarity to typicality, which quantifies how similar/different the questioned and the known samples are, and then evaluates that similarity/difference in terms of typicality/atypicality against the relevant background population (i.e. reference samples)."
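To make the "similarity to typicality" idea concrete, here is a minimal sketch in Python. It is not the model used in the paper: it assumes a single stylistic feature, assumes that feature is normally distributed, and uses made-up numbers. The names `gaussian_pdf` and `likelihood_ratio` are hypothetical helpers introduced only for illustration.

```python
import math
from statistics import mean, stdev


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def likelihood_ratio(questioned, suspect_values, background_values):
    """LR = p(evidence | same author) / p(evidence | different author).

    Assumption: one feature, modeled as Gaussian in both the suspect's
    known samples and the background (reference) samples.
    """
    # Similarity: how well the questioned value fits the suspect's samples.
    similarity = gaussian_pdf(
        questioned, mean(suspect_values), stdev(suspect_values)
    )
    # Typicality: how common that value is among writers in general.
    typicality = gaussian_pdf(
        questioned, mean(background_values), stdev(background_values)
    )
    return similarity / typicality


# Hypothetical data: share of messages that contain an ellipsis.
suspect = [0.40, 0.45, 0.50, 0.55, 0.60]     # suspect's known chatlogs
background = [0.00, 0.05, 0.10, 0.15, 0.20]  # reference population

lr_close = likelihood_ratio(0.50, suspect, background)  # looks like suspect
lr_far = likelihood_ratio(0.10, suspect, background)    # looks like anyone

print(f"LR for a suspect-like chat:  {lr_close:.3g}")
print(f"LR for a typical chat:       {lr_far:.3g}")
```

An LR above 1 means the evidence is more probable if the suspect wrote the questioned chat; an LR below 1 means it is more probable if someone else did. The ratio says nothing by itself about guilt, only about how much weight the evidence carries.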

And what gets compared? Think of it as big data being used to look at small things: punctuation, capitalization, the average number of characters per message, and how special characters get used.

The paper even has a few examples from people who are unwittingly chatting with undercover police officers, whose messages are labeled "UP":

P1: I'm from Portland…
P1: :)
UP: cool me 2
P1: I'm Pedro…
UP: im brooke
P1: I have an apartment on Forest ave…
P1: How old r u?
UP: im good
UP: is that ur pic?
P1: Yes…would u like 2 c another?
UP: ok
P1: have u got one u wish 2 share??

The telltale giveaways include the double question mark, the ellipses, and the tendency to capitalize the beginning of a message. Contrast that with "Predator 2":

P2: hu
P2: hi
UP: hi
P2: ?
P2: i said hi
P2: where in ga you from
UP: sw
UP: u
P2: sw
P2: bainbridge here
UP: near columbus
P2: cool what town if you dont mind me asking

Just to give you a sense of the differences, note the lack of punctuation and capitalization.
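For the curious, the differences can be put into numbers. The sketch below is an illustration rather than the paper's actual feature set: it takes the P1 and P2 messages quoted above and measures a handful of surface features of the kind the article mentions. The function name `style_features` and the exact choice of features are assumptions made for this example.

```python
import string

# ASCII punctuation plus the single-character ellipsis used in the excerpts.
PUNCT = set(string.punctuation) | {"…"}


def style_features(messages):
    """Return simple per-author style statistics for a list of messages."""
    n = len(messages)
    return {
        # Mean message length in characters.
        "avg_chars": sum(len(m) for m in messages) / n,
        # Share of messages that start with an uppercase letter.
        "capitalized_start": sum(m[:1].isupper() for m in messages) / n,
        # Share of messages containing any punctuation mark.
        "has_punctuation": sum(
            any(c in PUNCT for c in m) for m in messages
        ) / n,
        # Share of messages containing an ellipsis.
        "ellipsis": sum("…" in m or "..." in m for m in messages) / n,
    }


# Messages copied from the two excerpts above (undercover lines omitted).
p1 = [
    "I'm from Portland…",
    ":)",
    "I'm Pedro…",
    "I have an apartment on Forest ave…",
    "How old r u?",
    "Yes…would u like 2 c another?",
    "have u got one u wish 2 share??",
]
p2 = [
    "hu",
    "hi",
    "?",
    "i said hi",
    "where in ga you from",
    "sw",
    "bainbridge here",
    "cool what town if you dont mind me asking",
]

f1 = style_features(p1)
f2 = style_features(p2)
for name in f1:
    print(f"{name:18s} P1={f1[name]:.2f}  P2={f2[name]:.2f}")
```

A dozen messages is far too little to draw conclusions from, but even here the two writers separate cleanly on capitalization and punctuation, which is the kind of signal a real comparison would measure across much larger samples.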

Stats wonks should check out the paper, which obviously goes further in depth. A PDF is up for free here.

FTC seems ethically sound when applied to protecting children from would-be predators, but considered more broadly, it reveals how hard it is to really stay anonymous online. While the internet frees people from the scope of other forensic sciences, such as DNA, handwriting, or even voice analysis, anything you leave behind, online or elsewhere, can be used in a court of law.

When the FBI was hunting for the Silk Road-founding darknet pioneer the Dread Pirate Roberts, posts on forums and websites like Shroomery.org led them to Ross Ulbricht, and his arrest. Now, Ulbricht wasn't exactly following every privacy protocol, but even if he had, the feds have a way to link posts to an individual just based on the data itself.

This paper's a reminder of how we're leaving our fingerprints all over the place, even as we type.