Predicting Calls To Police

Brian Ross
5 min read · Nov 22, 2019

The data for this model can be found at the City of Detroit Open Data Portal.

Goal

The goal of this project is to predict the frequency of 911 calls to the police for a given geographic area on a specified date.

2.5 million 911 calls in Detroit

Processing The Data

I started by loading the data set and pruning it down to the necessary information. The original data set contains every 911 call within Detroit over a three-year period, which amounts to over 2.5 million observations. First I kept only the columns that would be needed later. Next I parsed the call descriptions and retained only the observations for calls directed to the police. For the purposes of this model, the relevant information was location and time. The call timestamp was broken down into year, week, month, day of week, day, and part of day using a helper function.
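The original function is not reproduced here. As a minimal sketch of that kind of breakdown, assuming ISO week numbers and made-up cut-offs for the parts of the day (the names `part_of_day` and `split_timestamp` are hypothetical), it could look something like this:

```python
from datetime import datetime


def part_of_day(hour):
    """Map an hour (0-23) to a coarse label. The cut-offs are assumptions."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "night"


def split_timestamp(ts):
    """Break a call timestamp into calendar features."""
    _, iso_week, _ = ts.isocalendar()
    return {
        "year": ts.year,
        "week": iso_week,             # ISO week number (assumed convention)
        "month": ts.month,
        "day_of_week": ts.weekday(),  # Monday = 0
        "day": ts.day,
        "part_of_day": part_of_day(ts.hour),
    }


features = split_timestamp(datetime(2019, 11, 22, 23, 15))
print(features)
```

Binning the hour into a handful of labels keeps the feature coarse enough that each grid space still has a reasonable number of calls per bin.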

The next step was to separate the locations into distinct geographic areas. After filtering out geographic outliers, I wrote a function that lays a grid over the Detroit area and assigns each observation to the latitude and longitude grid space in which it falls. With that done, the observations can be grouped by their location in time and space, giving a tally of the total number of calls for each grid space for a given date and time.
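To make the gridding and tallying concrete, here is a rough sketch using only the standard library. The bounding box, the cell size, and the function names are all assumptions for illustration, not the values used in the project:

```python
import math
from collections import Counter

# Assumed bounding box roughly covering Detroit; illustrative values only.
LAT_MIN, LAT_MAX = 42.25, 42.46
LON_MIN, LON_MAX = -83.29, -82.91
CELL_SIZE = 0.01  # degrees per grid space (hypothetical resolution)


def grid_cell(lat, lon):
    """Return the (row, col) grid space for a point, or None if it is an outlier."""
    if not (LAT_MIN <= lat < LAT_MAX and LON_MIN <= lon < LON_MAX):
        return None
    row = math.floor((lat - LAT_MIN) / CELL_SIZE)
    col = math.floor((lon - LON_MIN) / CELL_SIZE)
    return row, col


def tally_calls(calls):
    """Count calls per (grid space, date, part of day), skipping outliers."""
    counts = Counter()
    for lat, lon, day, part in calls:
        cell = grid_cell(lat, lon)
        if cell is not None:
            counts[(cell, day, part)] += 1
    return counts


calls = [
    (42.331, -83.046, "2019-05-15", "night"),
    (42.334, -83.048, "2019-05-15", "night"),
    (42.405, -83.105, "2019-05-15", "morning"),
    (41.000, -80.000, "2019-05-15", "night"),  # outside the box, dropped
]
counts = tally_calls(calls)
print(counts)
```

The count attached to each key is what the model is then asked to predict.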

Heatmap of the locations of 911 calls to police in Detroit

One thing I noticed while working on this model was that using geographic and date-time information alone resulted in a model that was tuned strictly to the geographic trends. In other words, areas with high numbers of 911 calls remained so no matter what the date.

My hypothesis was that adding other information would introduce more variance and thus produce a better model. This turned out to be correct: I flagged dates that were holidays or major events like the Super Bowl, and incorporated weather data for the area (temperatures, type of weather, and whether or not there was a severe weather event that day). Doing so proved fruitful, as it improved the model score by 5%. I plan to implement more features at a later date.
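A sketch of how such flags and weather columns might be attached to each row is below. The lookup tables, column names, and weather values are hypothetical placeholders; real values would come from a holiday calendar and a historical weather source:

```python
from datetime import date

# Hypothetical lookup tables with placeholder values, for illustration only.
SPECIAL_DATES = {
    date(2019, 7, 4): "holiday",
    date(2019, 2, 3): "major_event",  # Super Bowl Sunday in 2019
}
WEATHER = {
    date(2019, 7, 4): {
        "temp_high": 88,
        "temp_low": 70,
        "weather_type": "clear",
        "severe": False,
    },
}


def add_context_features(row):
    """Return a copy of the row with event and weather features attached."""
    day = row["date"]
    weather = WEATHER.get(day, {})
    enriched = dict(row)
    enriched["is_special_date"] = day in SPECIAL_DATES
    enriched["temp_high"] = weather.get("temp_high")
    enriched["temp_low"] = weather.get("temp_low")
    enriched["weather_type"] = weather.get("weather_type")
    enriched["severe_weather"] = weather.get("severe", False)
    return enriched


row = {"grid": (8, 24), "date": date(2019, 7, 4), "calls": 3}
enriched = add_context_features(row)
print(enriched)
```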

Selecting A Model

As is best practice, I began with a baseline, which produced the following:

The best performing linear model, logistic regression, was only able to equal the baseline model, but this was to be expected given the nature of the data. The model ultimately selected was a Random Forest Regressor, which proved a significant improvement over the baseline.
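The article does not say which baseline or metric was used. One common choice for a count target, shown here purely as an assumption, is to always predict the mean of the training counts and score that with mean absolute error; any candidate model then has to beat this number:

```python
from statistics import mean


def mean_baseline(train_targets):
    """Build a predictor that always returns the mean of the training targets."""
    avg = mean(train_targets)
    return lambda rows: [avg for _ in rows]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted counts."""
    return mean([abs(t - p) for t, p in zip(y_true, y_pred)])


# Toy call counts, invented for the example.
train_counts = [2, 0, 5, 1, 2]
test_counts = [3, 1, 2]

predict = mean_baseline(train_counts)
baseline_mae = mean_absolute_error(test_counts, predict(test_counts))
print(baseline_mae)
```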

Looking at the feature importances reveals that the geographic data still holds the most weight when it comes to making a prediction; however, it was good to see some of the added features playing a role as well. It is important to know that this reliance on geography is a significant limitation of the model as it exists now.

Using partial dependence plots, we can see how the model's predictions change, on average, as an individual feature varies. The feature I chose here was the part of day.

With this we can see rather clearly that 911 calls increase during the night hours and then die back down as morning approaches.
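For readers unfamiliar with the technique, partial dependence for one feature can be computed by hand: force that feature to a single value for every row, average the model's predictions, and repeat for each value of interest. The sketch below uses a toy stand-in for the fitted model with invented coefficients, so the numbers mean nothing beyond illustrating the mechanics:

```python
from statistics import mean


def partial_dependence(predict, rows, feature, values):
    """Average prediction when `feature` is forced to each value for every row."""
    curve = {}
    for value in values:
        modified = [{**row, feature: value} for row in rows]
        curve[value] = mean(predict(modified))
    return curve


def toy_predict(rows):
    """Stand-in for a fitted model; the coefficients are invented."""
    bonus = {"morning": 0.0, "afternoon": 0.5, "evening": 1.0, "night": 2.0}
    return [1.0 + 0.1 * r["grid_row"] + bonus[r["part_of_day"]] for r in rows]


rows = [
    {"grid_row": 0, "part_of_day": "morning"},
    {"grid_row": 10, "part_of_day": "night"},
]
parts = ["morning", "afternoon", "evening", "night"]
curve = partial_dependence(toy_predict, rows, "part_of_day", parts)
print(curve)
```

Because every other feature keeps its observed value, the resulting curve isolates how the chosen feature moves the average prediction.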

Trying Out The Model

The final step is to take the model for a spin, so to speak. For this I predicted for a future date, 15 May 2020, with a high of 70 degrees, a low of 65, and clear skies. The result was as follows, but we will have to wait and see how accurate the model was.

Conclusion

There are some obvious limitations to modelling this kind of data, the biggest being how tuned the model becomes to the geographic locations, which is one way of saying that high crime areas perhaps tend to stay high crime. Another is that we would want to strongly consider what biases, if any, are included in the model. Lastly, predicting things like this is incredibly difficult, and the output of any implementation of models like these should not be taken at face value. However, I do see tremendous room for improvement and fine tuning, and I think this could eventually become an important tool for law enforcement or other first responders.

Thanks for taking the time to read my article, and check back in the future for more interesting projects and ideas.

Brian Ross

Primarily interested in the intersection of advancements in data science and public good. linkedin.com/in/brianthomasross