Hack This: Extract Image Metadata Using Python

Not that I’m currently cruising for jobs with British intelligence or anything, but I happened upon (via Hacker News) this current coding challenge posted to the MI5 careers page. It consists of the following image file and an invitation to find the clue hidden within.

Intelligence isn’t always obvious and our engineers and analysts work hard to unlock it.

There’s a clue in the image file below, if you can find it.

The image itself, the colors and lines we perceive with our eyes, clearly pertains to chemtrails and their effect on the health of US presidential candidate Hillary Clinton, but the fact that the challenge wording specifies an image file indicates that we should probably be looking beyond the picture itself and into the data within.

Images are of course data just like anything else. And for that data to be useful it needs to carry instructions on how it’s to be represented at its final destination. There needs to be information in there that is about the image but is not part of the image. This is image metadata—data about other data.

Image metadata varies in format and content. This can depend on the image file format itself and in many cases the camera the image was captured with. The going standard, starting in the 1990s, has been the IPTC Information Interchange Model (IIM), though that’s more recently been extended via the Extensible Metadata Platform (XMP) and Exif (“exchangeable image file format”), which is where we get into GPS tagging.

Photometadata.org likens image data to a bento box:

Photos taken nowadays with a digital camera in JPG format almost always contain Exif data, and this is what most metadata extractors are interested in. There are a billion such tools populating the internet for getting at this metadata and the Python language has its own pretty handy Exif extraction tool called ExifRead.

For the sake of learning stuff, and because we’ll eventually need to find metadata beyond the Exif standard, we’re going to skip ExifRead and use the Python Imaging Library (PIL), which is a much, much more general toolset for doing stuff to images via Python code. In other words, even if you don’t particularly care about image metadata, you’re going to learn something useful.

Prerequisites: Assuming you’ve already downloaded and installed Python, you should do two things. One: spend 10 minutes doing this “Hello, World” Python for non-programmers tutorial. Two: spend another five minutes doing this tutorial on using Python modules.

0.0) Install Pillow

The active version of PIL is actually known as Pillow, so this is what we need to install. You should do this with the Python package manager pip, which is covered in the second prerequisite tutorial above. Just run pip install Pillow at the command line.

Now, create a new Python script in whatever text editor you like. I’m using Sublime Text, which is great. I called my script metaread.py.

1.0) Create an Image object

First thing we’re going to do is actually bring in the Pillow module we installed, which is the first line below. Next, we need to create an object representation of our MI5 image, puzzle.png. This exposes the image and all of the things we can do with it via the Pillow module to our Python script. To see some more of these capabilities, check out Hack This: Edit an Image in Python.
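A minimal sketch of this step, assuming Pillow is installed. The helper name describe_image is mine, purely for illustration; the Pillow calls themselves (Image.open and the format, size, and mode attributes) are standard.

```python
from PIL import Image  # Pillow still installs under the old "PIL" package name


def describe_image(path):
    """Open an image file and return its format, pixel size, and color mode."""
    # Image.open() reads the file header and hands back an Image object
    with Image.open(path) as image:
        return image.format, image.size, image.mode
```

Calling describe_image("puzzle.png") from a script sitting in the same folder as the MI5 image should come back with "PNG" as the first item, along with the image dimensions and color mode.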

2.0) Extract the Exif data

Not all image formats contain Exif data. Mostly just JPGs. Which is fine because that’s most pictures. MI5’s image is actually a PNG file, which we’ll have to handle somewhat differently. Let’s do a quick JPG though.

There’s really nothing to it. I create the image object as above then call the _getexif() method on it. In return, I get a dictionary data structure full of metadata (or None, if the image carries no Exif data).
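Roughly, that looks like the sketch below. The function name read_raw_exif is a placeholder of mine. Note that _getexif() has a leading underscore, meaning it is a private Pillow method, so the code checks that it exists before calling it rather than assuming every image type has it.

```python
from PIL import Image


def read_raw_exif(path):
    """Return the raw Exif dictionary (numeric tag ID -> value), or {} if there is none."""
    with Image.open(path) as image:
        # _getexif() is private and not defined for every image format
        getter = getattr(image, "_getexif", None)
        exif = getter() if getter else None
    # _getexif() returns None when the file has no Exif block
    return exif or {}
```

The keys at this point are just numbers, such as 271 for the camera manufacturer, which is not very readable yet.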

The dictionary consists of tag-value pairs, which we can extract and view using a for-loop, like this. Note that I had to import some extra stuff at the top:
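Here is one way that loop might look. The “extra stuff” in this sketch is the TAGS lookup table from PIL.ExifTags, which maps numeric Exif tag IDs to readable names. The function names named_exif and print_exif are mine.

```python
from PIL import Image
from PIL.ExifTags import TAGS  # lookup table: numeric Exif tag ID -> tag name


def named_exif(path):
    """Return the image's Exif data keyed by tag name instead of tag number."""
    with Image.open(path) as image:
        getter = getattr(image, "_getexif", None)  # private; not on every format
        raw = (getter() if getter else None) or {}
    named = {}
    for tag_id, value in raw.items():
        # Fall back to the bare number for tags Pillow has no name for
        named[TAGS.get(tag_id, tag_id)] = value
    return named


def print_exif(path):
    """Print one 'tag: value' line per Exif entry."""
    for name, value in named_exif(path).items():
        print(f"{name}: {value}")
```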

So, that just outputs all of the Exif data contained within a given image as a series of entries. It’s hardly guaranteed to be the same for every image. I had to search online for a sample image containing GPS metadata because I got tired of scanning through everything on my computer trying to find an example (though it wouldn’t be too hard to write a script that could comb through a folder of images and automatically pull out those that do include it). In any case, you can find the same image here.
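For what it’s worth, a folder-combing script along those lines could be sketched like this. It is not something I ran for this piece; the function name images_with_gps is made up, and it only looks at files ending in .jpg or .jpeg. The number 34853 is the standard Exif tag ID for the GPSInfo entry.

```python
import os

from PIL import Image

GPS_INFO_TAG = 34853  # Exif tag ID for GPSInfo (0x8825)
JPEG_EXTENSIONS = (".jpg", ".jpeg")


def images_with_gps(folder):
    """Return the paths of JPEGs in folder whose Exif data has a GPSInfo entry."""
    matches = []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(JPEG_EXTENSIONS):
            continue
        path = os.path.join(folder, name)
        try:
            with Image.open(path) as image:
                getter = getattr(image, "_getexif", None)
                exif = (getter() if getter else None) or {}
        except OSError:
            continue  # unreadable, or not actually an image; skip it
        if GPS_INFO_TAG in exif:
            matches.append(path)
    return matches
```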

A sampling of the output:

2.1) Extract non-Exif data

Again, PNGs typically don’t come with Exif data.

Don’t panic. Just because it’s not in Exif format doesn’t mean that puzzle.png’s metadata is all that more difficult to access.

It so happens that when an image is loaded per step 1.0, the PIL module will automatically load up a dictionary, stored in the image’s info attribute, with whatever metadata it can identify. We can barf it all out to the screen with a simple print call:
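Something like the following sketch, where dump_info is a name I picked. The explicit load() call is a precaution on my part: PNG text chunks are allowed to sit after the pixel data, so reading the whole file first gives Pillow the chance to pick them up.

```python
from PIL import Image


def dump_info(path):
    """Print and return the metadata dictionary Pillow builds for an image."""
    with Image.open(path) as image:
        image.load()  # read the whole file, not just the header
        info = dict(image.info)
    print(info)
    return info
```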

Or we can loop through it, as in 2.0:
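A sketch of the looped version, again with function names (info_lines, print_info) that are only illustrative:

```python
from PIL import Image


def info_lines(path):
    """Return one 'key: value' string per entry in the image's info dictionary."""
    with Image.open(path) as image:
        image.load()  # make sure metadata stored after the pixel data is read too
        return [f"{key}: {value}" for key, value in image.info.items()]


def print_info(path):
    """Print the info dictionary one entry per line."""
    for line in info_lines(path):
        print(line)
```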

Problem solved?

So, at this point I need to confess that the .info attribute is not actually returning all of the metadata from puzzle.png, and I don’t quite know why. In addition to regular old Photoshop and the ExifRead Python tool mentioned above, I also tried four different online metadata extraction tools and only one was able to return a complete listing: Jeffrey Friedl’s Image Metadata Viewer. Said viewer is based on a command-line tool called ExifTool, which I downloaded and ran. It too worked.

But I promised Python and Python we shall write. It’s actually pretty easy to run a command-line program from within Python, but you’ll still have to download the actual command-line program, which is available here. Now, we can run this script on our image file, and ExifTool will output the result via Python to the screen. Try it.
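One way to do that is with Python’s standard subprocess module. This is a sketch under a couple of assumptions: that ExifTool is installed and reachable as exiftool on your PATH (on Windows the executable may be named differently), and that run_exiftool is simply a name I chose.

```python
import shutil
import subprocess


def run_exiftool(path, exiftool="exiftool"):
    """Run ExifTool on a file and return its text output, or None if ExifTool is missing."""
    # shutil.which() checks whether the command can be found on the PATH
    if shutil.which(exiftool) is None:
        return None
    result = subprocess.run(
        [exiftool, path],      # same as typing: exiftool puzzle.png
        capture_output=True,   # collect what ExifTool prints
        text=True,             # decode the output to a string
        check=False,           # do not raise if ExifTool reports an error
    )
    return result.stdout
```

Printing the result of run_exiftool("puzzle.png") should give the same listing you would see running ExifTool by hand.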

See the clue?

I don’t know why it was so difficult to pull metadata from this file. It may have something to do with how metadata in PNG files is laid out. Within the file, data is kept in structures called chunks. Chunks are given weird four-letter coded names that define, among other things, whether they should be considered “critical” or not. Critical chunks include actual image data, bit depth, and color palette. Non-critical (“ancillary”) chunks offer histograms, gamma values, default background colors, and, finally, text.

There are three different types of text chunks (tEXt, zTXt, and iTXt), all with a standard dictionary-entry format. Each text entry has a name or title, and then some associated text. They can be user-defined, but there are some text field types that come predefined, such as “Comment,” which is the one to look at in our MI5 file.
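To make the chunk layout concrete, here is a rough standard-library sketch that walks a PNG’s chunks and pulls out the text entries. It follows the published PNG layout (an 8-byte signature, then chunks made of a 4-byte length, a 4-byte type, the payload, and a 4-byte checksum), but it skips checksum verification, and the function names are my own. It is not how Pillow or ExifTool do it internally.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def iter_chunks(data):
    """Yield (chunk_type, payload) for each chunk in raw PNG bytes."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        yield ctype.decode("latin-1"), data[pos + 8:pos + 8 + length]
        pos += 12 + length  # length + type + payload + checksum


def is_critical(chunk_type):
    """An uppercase first letter marks a chunk as critical."""
    return chunk_type[0].isupper()


def text_entries(data):
    """Return (keyword, text) pairs from tEXt, zTXt and iTXt chunks."""
    entries = []
    for ctype, payload in iter_chunks(data):
        keyword, _, rest = payload.partition(b"\x00")
        keyword = keyword.decode("latin-1")
        if ctype == "tEXt":
            entries.append((keyword, rest.decode("latin-1")))
        elif ctype == "zTXt":
            # rest[0] is the compression method; the text follows, zlib-compressed
            entries.append((keyword, zlib.decompress(rest[1:]).decode("latin-1")))
        elif ctype == "iTXt":
            compressed = rest[0]  # rest[1] is the compression method
            _language, _, rest = rest[2:].partition(b"\x00")
            _translated, _, text = rest.partition(b"\x00")
            if compressed:
                text = zlib.decompress(text)
            entries.append((keyword, text.decode("utf-8")))
    return entries
```

Reading the file with open("puzzle.png", "rb").read() and passing the bytes to text_entries() should list every keyword and its text, wherever in the file the chunk sits.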

Now that we’re at this point, writing metadata back to the file isn’t much more involved. If you want to join MI5, you should probably be able to figure that part out on your own. Start by reading up on ExifTool.