Image: Rob Mitchum/Data Science for Social Good

The Unplumbed Depths of Government Data

Governments are collecting more useful data than ever before. Too bad there's no one to parse it.

Feb 4 2015, 4:00pm

From her first childhood visit to Pearl Harbor at home in Hawaii with her Japan-born dad, Sonya Kuki had a sense of war and the wider world. Drawn to the scope of global affairs, she studied international relations at USC and Columbia and held a half-dozen research posts, using web searches and spreadsheets to track news, legislation, economic indicators and operational initiatives, distilling reams of information and in-person interviews into policy papers.

Then, last summer, she learned how to code. The strict scripting commands and precise outputs were unlike anything she'd studied before: Learning to scrape and process masses of data, she practiced Python and pandas on a bet that public policy's future will be shaped by these and other programming skills.

"A lot of people are resistant to science and technology; a lot of people say 'I can't do that!'" Kuki said after completing an interdisciplinary data studies program at Columbia (a program that, full disclosure, I run). "I don't want to become that," she added. "I want to be able to roll with it."

With the big data boom transforming nearly every sector of society, it should come as no surprise that government—whose scientists in fact coined the term "big data" in a 1997 NASA paper on scientific data visualization—is deep in the mix. NSA data mining aside, the state's use of advanced analytics is spreading from espionage to operations, with a growing number of everyday offices looking for new ways to map citizen needs, spot inefficiencies, tweak policy, and predict and prepare for otherwise unexpected events.

Yet as government embraces more data-smart goals, it's often plagued by a pipeline problem: It just can't find enough analysts or engineers willing to forgo private sector salaries, or public servants willing to invest in these ever-changing skills. Nonprofits like San Francisco-based Code for America and New York's DataKind have tried to plug that gap with volunteers, introducing techies to government, and civil servants to tech. Now, more traditional institutions are bridging the divide, with some of the world's top universities launching initiatives to teach coding and analytics to policy and social science students.

"If we're going to make government more efficient, techniques need to change," said former Chicago Chief Data and Chief Information Officer Brett Goldstein, now a senior fellow at the University of Chicago's Harris School of Public Policy. "We need to be able to ask hard questions, and have people go back to their desks and crunch on them and come back with ideas. They can't be scared by having a lot of data."

A color-coded risk assessment for every residential property in the City of Memphis, created from administrative data such as tax assessments and foreclosures. Image: Data Science for Social Good, used with permission

Demand: A Data-Driven State

The public sector's need for data skills runs deep. Since the first known census, government has been a key warehouse of information. Its military and intelligence communities have long led in finding ways to collect and sort more material, whether breaking Cold War codes, or predicting which Army convoy routes are most likely plagued with IEDs. Other parts of government are catching on now, too: 70 percent of federal IT executives expect big data to be "critical" in meeting goals across government by 2018, boosting efficiency and saving $500 billion a year, according to a 2013 EMC Corporation poll.

Big data tools are already behind attempts to target audits at the IRS and Post Office, to model public health or drug risks at the CDC and FDA, or to scale solar power at the Department of Energy, to name a few. And the White House is growing its own big data initiative, launched with six federal entities and $200 million in 2012.

At the State Department, the Bureau of Conflict and Stabilization Operations is exploring how algorithms can analyze millions of social media and news reports to map on-the-ground emotions in time to head off conflict, a process described by the US Institute of Peace's Sheldon Himelfarb, or to proactively tweak policy based on what people say they think and feel.

"Machines can look at every available data point and find these fascinating little pieces," Kalev Leetaru, an internet scholar and former fellow at Georgetown's Walsh School of Foreign Service, told an October panel. "It can tell us these hidden patterns. And then we as humans can layer a theory on top of that and say, 'Well, here's why I think we're seeing this,'" crafting policy on the basis of more nuanced information.

As mushrooming megacities play a growing role in world affairs, big data techniques are affecting urban policy, too. Metropolises from New York to Los Angeles already scour city datasets to spot patterns that could help to redirect resources or slow crime, and similar tools can be used to project urban migration or trade flows or to better target international aid. In the future, software could even be programmed to test a law's actual impact against its stated aims, and then automatically tweak the regulations put in place to achieve them, a controversial proposition dubbed "algorithmic regulation."

Jeff Alstott presents how his team looked for signs of contract-bidding corruption in data from the World Bank. Image: Rob Mitchum/Data Science for Social Good, used with permission

Citizens, meanwhile, already crunch numbers to shape policy, challenging official inflation estimates by scraping online supermarket prices, like the Billion Prices Project in Argentina, or scouring traffic stop data for racial bias to get police policies changed or court cases dismissed in the US.
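The arithmetic behind that kind of price check can be sketched in a few lines of Python. What follows is a minimal illustration, not the method of any project named above: the item names and prices are invented, the scraping step is left out entirely, and the choice of a geometric mean of price ratios is an assumption made for the example.

```python
import math


def price_index(earlier, later):
    """Geometric mean of price ratios for items seen in both snapshots."""
    # Only items present in both snapshots can be compared.
    shared = sorted(earlier.keys() & later.keys())
    if not shared:
        raise ValueError("no items appear in both snapshots")
    log_ratios = [math.log(later[item] / earlier[item]) for item in shared]
    return math.exp(sum(log_ratios) / len(log_ratios))


# Invented prices standing in for two scraped snapshots (hypothetical data).
january = {"milk_1l": 10.00, "bread_loaf": 20.00, "rice_1kg": 15.00}
july = {"milk_1l": 11.00, "bread_loaf": 22.00, "eggs_dozen": 30.00}

index = price_index(january, july)
measured_inflation = (index - 1) * 100
print(f"Measured price change: {measured_inflation:.1f}%")
```

A figure computed this way could then be set beside an official estimate for the same period. The comparison is only as good as the basket of items that happened to be scraped.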

This kind of work transcends the public sector's past focus on managerial performance metrics, integrating and analyzing far more information. Drawing on a branch of computer science known as machine learning, it requires huge amounts of historical data to "train" new algorithms that sort through and spot anomalies or correlations in past behavior, using them to predict future events, so that precise treatments can be tailored in advance. To do that, policy analysts need not only area expertise and statistical skills but also comfort with the kinds of powerful computational tools that allow proactive decision-making.
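For readers who have never seen it done, the "train on the past, predict the future" loop can be reduced to a toy example. The sketch below assumes a single predictor and a straight-line relationship, far simpler than the models agencies actually use, and every number in it is invented for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict(model, x):
    """Apply the fitted line to a new observation."""
    slope, intercept = model
    return slope * x + intercept


# Invented "historical" records for one neighborhood type (hypothetical data):
# (missed tax payments, blight complaints logged the following year)
history = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9)]

# "Training" here just means estimating the line from past records.
model = fit_line([x for x, _ in history], [y for _, y in history])

# Use the fitted relationship to project an unseen case.
forecast = predict(model, 5)
print(f"Projected complaints: {forecast:.1f}")
```

The design choice to hard-code a linear fit keeps the example readable. It also shows where human judgment enters: someone decided which past records to include and which variable to predict.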

Supply: The Evolving Academy

The notion of computational policymaking has come a long way since the 1970s, when Carnegie Mellon and MIT pioneered the first programs in engineering and public affairs, in part exploring how technology can assist policy design.

The digital revolution and emerging Internet of Things have unleashed an unmatched avalanche of public sector data, generated at nearly every wired touchpoint with citizens, and a growing "open data" movement has pushed to make that information freely available to researchers, activists and entrepreneurs. Gains in processing power, cloud computing and consumer-friendly analytic tools have meanwhile made it easier for computer scientists and non-engineers alike to store and examine that material—in many ways decentralizing and democratizing data analysis, while also fueling demand.

Data Science for Social Good co-organizer Matt Gee discusses a project with fellow Sarah Evans during the 2013 Summer Fellowship. Image: Rob Kozloff/University of Chicago, used with permission

McKinsey predicts a deficit of 1.5 million data-savvy managers in the US by 2018, and a 50 percent shortfall in advanced analytic skills. That appetite is driving a boom in data science studies, a relatively new blend of statistics, math and computer science, which has seen a surge in degree programs at universities worldwide.

But the data revolution has created a need for more applied skillsets, too: For experts who not only have the interdisciplinary tools of data science, but also an interdisciplinary use for them; who not only track and analyze new floods of digital information, but who bring the detailed, sector-specific knowledge to keep them in context and to interpret and act on their findings. For government, that means less outsourcing of analytics to big companies like IBM and SAS, which have built big businesses doing data science for the public sector, and more hiring of their own teams of analysts, schooled in each agency's specific issues, trained to crunch numbers in context and to give decision-makers quick advice grounded in data as well as domain.

But that kind of analyst is hard to find. If data savvy is scarce in the private sector, things are far worse in government, where 96 percent of federal managers surveyed by SAS and the GovLoop network last year said their agencies lack sufficient data skills. In part, that's because so many job-seeking techies still opt for the prestige or paycheck of the research lab or consumer web, leaving a risky talent gap in government at the very time that policymakers and regulators most need to understand the data deluge around them.

To help fill that void, some of the world's most established public policy and computer science programs are now partnering up, looking to spread analytic and coding skills from their typical homes in the academy or tech company, and deeper into government.

"It's this very strange world, where you come up with a technical idea and your first instinct is, 'Oh, I can use it to make ads better, or search better,' because that's what you're most often exposed to," Rayid Ghani, chief scientist for the data-smart Obama 2012 campaign, told me last year.

To introduce techies to other options, Ghani in 2013 launched Data Science for Social Good, a summer fellowship at the University of Chicago that has since drawn more than 80 statisticians, engineers, mathematicians and others to help nonprofits and public agencies from the World Bank to the City of Memphis address issues like corruption, homelessness, climate change, and blight—opening their minds to public sector possibilities along the way.

"I wanted them to be exposed to these kinds of problems, so when they hear a technical idea, they say, 'Oh, that could be really useful for that organization that's doing disaster-relief work' or something, so they make that link and pursue it" over commercial applications, Ghani said.

Working with the Case Foundation, John Brock, Kyla Cheung, Giorgio Cavaggion, and Ahmad Qamar created a "hairball" network of similarity between 2,000 nonprofits based upon tweets from the organizations. Image: Data Science for Social Good, used with permission

His program is just one of Chicago's efforts. The university also launched an Urban Center for Computation and Data and a Center for Data Science and Public Policy, not to mention a two-year master's in Computational Analysis and Public Policy, which includes coursework in the Java, Python, R and C++ programming languages, along with microeconomics and political feasibility analysis.

A wave of other interdisciplinary programs has spread through schools like UT Austin, UC Berkeley, University College London, Rochester Institute of Technology, and Columbia, which is home to the shorter Lede Program that Kuki tacked onto her master's in International Affairs (and where I'm director). A collaboration between Columbia's Journalism School and Department of Computer Science, that three- or six-month boot camp gives social science graduates the same investigative tools that a contemporary data journalist would use to mine and present data.

In an ever-flattening, interdisciplinary digital world, the idea goes, everyone has a voice, but everyone is drowning in noise. Leaders of all kinds should be able to sift through the data deluge and speak smartly on what they find.

"This is part of the world we live in now," Mark Hansen, the Columbia statistician and journalism professor who envisioned the program, told a room of policy students there in December. "Every single discipline on campus is finding its core artifacts digitized and, in the process, having its core practice opened to a kind of data analysis."

In that universe of emerging digital humanities, in which algorithms analyze poetry, and computational social sciences, in which software predicts economic behavior or social ties, analytic policy programs can have an especially wide impact—and seed a more nuanced sense of what all this data can really do.

Caution & Context

In the public sector, that kind of cross-disciplinary context is especially important given the newness of many data techniques, as well as their tendency to evolve faster than popular understandings of them. While big data has real potential to boost government insight, efficiency and accountability, it is no panacea: Like traditional policy tools, analytic models are still designed by people, vulnerable to bias and error and dependent on frame—and all the more dangerous when assumed to be uncontestable.

"It is crucial to begin asking questions about the analytic assumptions, methodological frameworks, and underlying biases embedded in the big data phenomenon," scholars danah boyd and Kate Crawford wrote in a 2011 paper outlining the fundamentally interpretive nature of any analytic process. "Working with big data is still subjective, and what it quantifies does not necessarily have a closer claim on objective truth."

The White House and FTC have for their part flagged the potential perils of "digital redlining" and "discrimination by algorithm" in the private sector, with the White House warning in a May 2014 report that industry's "ability to segment the population and to stratify consumer experiences so seamlessly as to be almost undetectable demands greater review."

Data Science for Social Good mentor Aron Culotta (standing) consults with fellows John Brock, Ahmad Qamar, and Kyla Cheung during the 2013 Summer Fellowship.

In the words of FTC chair Edith Ramirez that same month: "Big data analytics raises the possibility that facially neutral algorithms may be used to discriminate against low-income and economically vulnerable consumers," as proprietary models apply hidden criteria to screen job-seekers, prospective tenants or borrowers.

At the same time, billions of the world's poor are excluded from the potential benefits of big data. People have to be connected to technology to generate much of the digital information that data-driven products and policies use. But with just 40 percent of the planet's population online in 2014, according to the UN, a gaping data divide threatens further marginalization.

Big data "could restructure societies so that the only people who matter—quite literally the only ones who count—are those who regularly contribute to the right data flows," State Department legal adviser Jonas Lerman wrote for Stanford Law Review. Instead, he proposed, it may be time for a new equality doctrine to guarantee people with "light data footprints" equal access to public goods and services.

In all these cases, a key first step to guarding against discrimination, abuse and accidental injury is to have more regulators, policy makers and public servants who understand the "how" and "what next" of big data—who can build, interpret and audit new analytic models, ensuring that they're democratically sound.

"If we're going to create data-empowered individuals as change agents in government, they need to have depth," to be able to interrogate not just data, but the range of ways in which it can be used, said Goldstein, who introduced many of these methods to the City of Chicago.

"We need to break down these classical silos," he added. "The government worker, the policy person, the social scientist of the future, needs to have subject matter expertise, and at the same time have the applied data science skills to do real work on their own. And they need to understand what they're doing."

Theresa Bradley is director of The Lede Program in Data Practices at Columbia University. You can follow her on Twitter at @tbradley.

Top image: Cindy Chen, Isaac McCreery and Carl Shan explore data from the Chicago Alliance to End Homelessness. Credit: Rob Mitchum/Data Science for Social Good