Watchlist: Algorithms and big data

(Image credit: Argonne National Laboratory, CC licensed)

More and more, people talk about big data and algorithms being used to do all sorts of things. What are these concepts, and what do they tell us about the world we live in?

Big data, in essence, is the name we give to a data set so large that we need computers to make sense of it. It could be government records of people passing through borders, people liking something on Facebook, or the choices people make when shopping online. These data sets often contain millions upon millions of individual data points and would take far too long for humans to analyse, so an algorithm is needed. There are plenty of technical, mathematical definitions of algorithms, but for the purpose of this post, we can say that an algorithm is basically a set of steps that can be applied to a set of data. Algorithms sort data and produce an output that allows us to draw certain conclusions about what that data tells us. They are an essential part of modern computing.
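To make that concrete, here is a minimal sketch in Python of an ‘algorithm’ in this everyday sense: a short set of steps applied to a data set that produces an output a human can draw conclusions from. The data set and field names are invented for illustration.

```python
# A minimal sketch of 'an algorithm applied to a data set': take a pile of
# records and reduce them to a ranked summary a human can read.
# The records and field names here are hypothetical.

from collections import Counter

def top_categories(records, field, n=3):
    """Count how often each value of `field` appears and return the n most common."""
    counts = Counter(record[field] for record in records)
    return counts.most_common(n)

purchases = [
    {"item": "book"}, {"item": "phone"}, {"item": "book"},
    {"item": "laptop"}, {"item": "book"}, {"item": "phone"},
]

print(top_categories(purchases, "item"))
# [('book', 3), ('phone', 2), ('laptop', 1)]
```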

Algorithms and big data are being used for an increasingly wide range of purposes. For example, police forces around the US are using algorithms to engage in what they call ‘predictive policing.’ Academic experiments have been running for a while, and Microsoft announced last year that it would be working with police to develop technology for predictive policing purposes. A study from UCLA found that predictive policing algorithms reduced crime in the field. Using data about past crime patterns, the model predicted where crimes were likely to occur, covering burglary, theft from cars and theft of cars, as well as assaults. The algorithm flagged particular places as hot spots for crime, indicating that police should be deployed to those areas. The outcome in test runs was that the actual number of crimes fell.
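To give a flavour of the general idea (this is not the UCLA model or any real product, just a deliberately simplified sketch), a hot-spot predictor can be as basic as counting recorded incidents per map grid cell and flagging the cells with the most history:

```python
# Deliberately simplified sketch of hot-spot prediction: count past incidents
# per grid cell and flag the cells with the most history. The incident data
# below is invented for illustration.

from collections import Counter

def predict_hot_spots(past_incidents, top_n=2):
    """past_incidents: list of (grid_x, grid_y) cells where crimes were recorded."""
    counts = Counter(past_incidents)
    return [cell for cell, _ in counts.most_common(top_n)]

recorded_crimes = [(1, 4), (1, 4), (2, 2), (1, 4), (3, 0), (2, 2)]
print(predict_hot_spots(recorded_crimes))
# [(1, 4), (2, 2)] -> the cells the model tells police to patrol
```

The point to hold onto is that the input here is incidents that were recorded, which is not the same thing as crimes that occurred.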

It is important to remember that predictive policing algorithms are based primarily on crime statistics; in other words, on the past behaviour of law enforcement in tackling crime. Crime statistics do not simply reflect what crimes are occurring. A better way to think about this data set is that it provides a picture of the state’s response to crime. This creates a real risk that the biases and discriminatory social trends in everyday policing will be reproduced in the supposedly more objective and scientific methodology of computerized predictive policing. Consider that the NYPD was successfully sued for practicing racial profiling in stop-and-frisks. This is the kind of data that will go into the algorithms. Or consider that one of the biggest crimes in recent times, according to the people affected, is the rigging of Volkswagen cars to evade environmental regulation. An engineer has pleaded guilty to this crime. Yet it is not a crime that is included in the data set.
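To see how that reproduction of bias can play out, here is a toy simulation (not a model of any real police force) in which two areas have identical underlying crime, but one starts with more records because it was historically over-policed. If patrols follow the records and patrols generate new records, the skew compounds:

```python
# Toy illustration of a data feedback loop. Assumes, purely for illustration,
# that only patrolled crime gets recorded.

from collections import Counter

TRUE_CRIME_RATE = {"A": 10, "B": 10}   # both areas have the same real crime
records = Counter({"A": 8, "B": 2})    # but area A was historically over-policed

for year in range(3):
    hot_spot = records.most_common(1)[0][0]          # deploy patrols to the 'hot spot'
    records[hot_spot] += TRUE_CRIME_RATE[hot_spot]   # only patrolled crime gets recorded
    print(year, dict(records))

# Area A's recorded numbers keep growing while B's stay flat,
# even though the underlying crime is identical in both areas.
```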

It may be that we consider street crime to be a more important issue than corporate crime, and that we think it is too expensive to generate alternative data sets to crime statistics to run these algorithms on. These are value judgements, and whether they are valid or not, they affect the data that goes into the algorithm. Clarity about these biases is important because without it, the potential exists to entrench discrimination in ways that are insidious and hard to untangle. Feeding data into automated processes without careful analysis of the assumptions being made can provide misleading answers to important questions. Many technologists and data scientists prefer a more direct metaphor: garbage in, garbage out.

Another stark example of this was revealed in documents leaked by Edward Snowden. The Skynet program run by the NSA uses an algorithm applied to data to identify terrorists. The algorithm was developed by taking data about ‘known terrorists’ and comparing it with a wide range of behavioural data drawn from mobile phone use. It works by classifying some behaviours as more ‘terroristy’ than others: travelling to places where terrorist activity occurs, for example, or turning off a mobile phone or swapping SIM cards, on the assumption that this is done to evade surveillance. After crunching the numbers, the algorithm’s highest-rated target was a man named Ahmad Zaidan. But Zaidan is not a terrorist; he is Al-Jazeera’s bureau chief in Islamabad. Yet the NSA documents label Zaidan, with apparent confidence, as a ‘MEMBER OF ALQA’IDA.’ How could this happen? Well, you can read what Zaidan says himself, but he obviously has to travel to interview people who may be involved with terrorism, which is why the algorithm produced such a dodgy output.
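The leaked slides describe Skynet as a machine-learning classifier over mobile-phone metadata. The sketch below is not that system; it is a toy, points-based caricature with invented features and weights, meant only to show how someone whose job requires travelling to conflict zones and contacting flagged numbers can end up at the top of the list:

```python
# Toy, points-based caricature of behaviour-based scoring -- not the actual
# Skynet classifier. Features and weights are invented for illustration.

SUSPICIOUS_WEIGHTS = {
    "travels_to_conflict_zones": 3,
    "contacts_flagged_numbers": 3,
    "frequent_sim_swaps": 2,
    "often_powers_phone_off": 1,
}

def suspicion_score(person):
    """Add up the weights of every 'suspicious' behaviour the person exhibits."""
    return sum(weight for feature, weight in SUSPICIOUS_WEIGHTS.items()
               if person.get(feature))

journalist = {   # a bureau chief who interviews people linked to terrorism
    "travels_to_conflict_zones": True,
    "contacts_flagged_numbers": True,
    "frequent_sim_swaps": True,
}

print(suspicion_score(journalist))  # 8 -- near the top of the list,
                                    # although nothing here measures intent
```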

So not only do we need to be careful about biases in the data we feed into an algorithm, we also have to be careful about the assumptions that go into the algorithm itself. Any conclusions we draw from an algorithm’s outputs need to reflect a careful assessment of these factors. Rather than saying that the data and algorithm used by police predict crime, a better way to phrase it is that they predict certain types of crime based on certain historical data. It might sound a little less glamorous, but it is more accurate.

If algorithms are going to be used to predict human behaviour in ways that might result in people being targeted for surveillance or prosecution, it is important that those algorithms be transparent. Unless it is possible to see the various biases and assumptions that go into an algorithm, we cannot make any confident assessment of its output.

It is not that all algorithms are bad, or even that the predictive policing ones are inherently a problem. One policy alternative would be to use the output of predictive policing algorithms quite differently, without involving law enforcement and criminal justice at all. This kind of data analysis could instead inform government spending and social programs, for example. There are all sorts of ways in which data could inform social interventions and the provision of services to try to de-escalate interpersonal crime and resolve these issues without police intervention. Unfortunately, these proposals remain woefully under-explored.