In this block, we focus on a particular type of geometry: points. As we will see, points can represent a very particular type of spatial entity. We explore how that is the case and what its implications are, and then wrap up with a machine learning technique that allows us to identify clusters of points in space.
Collections of points referencing geographical locations are sometimes called point patterns. In this section, we talk about what’s special about point patterns and how they differ from other collections of geographical features such as polygons.
Once you have gone over the clip above, watch the one below, featuring Luc Anselin from the University of Chicago providing an overview of point patterns. This will provide a wider perspective on the particular nature of points, but also on their relevance for many disciplines, from ecology to economic geography.
If you want to delve deeper into point patterns, watch the video in the expandable section below, which features Luc Anselin delivering a longer (and slightly more advanced) lecture on point patterns.
Once we have a better sense of what makes points special, we turn to visualising point patterns. Here we cover three main strategies: one-to-one mapping, aggregation, and smoothing.
We will put all of these ideas for visualising points into practice in the Hands-on section.
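Before the Hands-on section, the difference between aggregation and smoothing can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the coordinates are made-up toy values, the function names `aggregate_to_grid` and `kernel_density` are hypothetical, and the Gaussian kernel and square grid are just one possible choice for each strategy. One-to-one mapping needs no computation, as each point is simply drawn at its own location.

```python
import math
from collections import Counter

# Hypothetical toy coordinates, not taken from any course dataset
points = [(0.2, 0.3), (0.4, 0.1), (0.6, 0.7), (2.5, 2.5), (2.7, 2.2), (4.1, 0.5)]


def aggregate_to_grid(points, cell_size):
    """Aggregation: count the points that fall in each square grid cell."""
    counts = Counter()
    for x, y in points:
        # A cell is identified by its (column, row) index in the grid
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        counts[cell] += 1
    return counts


def kernel_density(points, at, bandwidth):
    """Smoothing: Gaussian kernel density estimate at a single location."""
    x0, y0 = at
    total = 0.0
    for x, y in points:
        d2 = (x - x0) ** 2 + (y - y0) ** 2
        # Nearby points contribute more than distant ones
        total += math.exp(-d2 / (2 * bandwidth ** 2))
    # Normalise so the estimated surface integrates to one over the plane
    return total / (len(points) * 2 * math.pi * bandwidth ** 2)


grid_counts = aggregate_to_grid(points, cell_size=1.0)
dense = kernel_density(points, at=(0.4, 0.4), bandwidth=0.5)
sparse = kernel_density(points, at=(4.0, 4.0), bandwidth=0.5)

print(dict(grid_counts))
print(dense > sparse)
```

Aggregation returns one count per cell, so the result depends on the grid chosen, while smoothing returns a continuous surface whose shape depends on the bandwidth.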
As we have seen in this course, “cluster” is a hard-to-define term. In Block G we used it as the outcome of an unsupervised learning algorithm. In this context, we will use the following definition:
Concentrations/agglomerations of points over space, significantly more so than in the rest of the space considered
Spatial/Geographic clustering has a wide literature going back to spatial mathematics and statistics and, more recently, machine learning. For this section, we will cover one algorithm from the latter discipline which has become very popular in the geographic context in the last few years: Density-Based Spatial Clustering of Applications with Noise, or DBSCAN (Ester et al., 1996).
Watch the clip below to get the intuition of the algorithm first:
Let’s complement and unpack the clip above in the context of this course. The video does a very good job of explaining how the algorithm works, and what general benefits that entails. Here are two additional advantages that are not picked up in the clip:
It is not necessarily spatial. In fact, the original design was for the area of “data mining” and “knowledge discovery in databases”, which historically does not work with spatial data. Instead, think of purchase histories of consumers, or warehouse stocks: DBSCAN was designed to pick up patterns of similar behaviour in those contexts. Note also that this means you can use DBSCAN not only with two dimensions (e.g. longitude and latitude), but with many more (e.g. product variety) and its mechanics will work in the same way.
Fast and scalable. For similar reasons, DBSCAN is very fast and can be run on relatively large databases without problem. This contrasts with many traditional point pattern methods, which rely heavily on simulation and are thus trickier to scale. This is one of the reasons why DBSCAN has been widely adopted in Geographic Data Science: it is relatively straightforward to apply and will run fast, even on large datasets, meaning you can iterate over ideas quickly to learn more about your data.
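To make the mechanics concrete, here is a minimal sketch of the DBSCAN logic in plain Python. It is written for illustration only and is not the implementation of any particular library: the names `dbscan`, `region_query`, `eps` and `min_pts` are chosen here, the toy coordinates are made up, and the neighbour search is brute force, so this version is not fast (efficient implementations typically rely on spatial indexing). Because it only uses distances between points, the same function runs unchanged on two or three dimensions.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (i included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is also core: keep expanding
    return labels


# Hypothetical toy data: two tight groups plus one isolated point
points_2d = [(0, 0), (0, 1), (1, 0), (1, 1),
             (10, 10), (10, 11), (11, 10), (11, 11),
             (5, 20)]
labels_2d = dbscan(points_2d, eps=1.5, min_pts=3)

# The same function on three dimensions, with no changes
points_3d = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (0, 0, 1), (9, 9, 9)]
labels_3d = dbscan(points_3d, eps=1.5, min_pts=3)

print(labels_2d)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
print(labels_3d)  # [0, 0, 0, 0, -1]
```

In this sketch, `min_pts` counts the point itself, and the label `-1` marks points treated as noise rather than assigned to a cluster.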
DBSCAN also has a few drawbacks when compared to some of the techniques we have seen earlier in this course. Here are two prominent ones:
It is not based on a probabilistic model. Unlike the LISAs, for example, there is no underlying model that helps us characterise the pattern the algorithm returns. There is no “null hypothesis” to reject, no inferential model and thus no statistical significance. In some cases, this is an important drawback if we want to ensure that what we are observing (and the algorithm is picking up) is not a random pattern.
Agnostic about the underlying process. Because there is no inferential model and the algorithm imposes very little prior structure to identify clusters, it is also hard to learn anything about the underlying process that gave rise to the pattern picked up by the algorithm. This is by no means a unique feature of DBSCAN, but one that is always good to keep in mind as we are moving from exploratory analysis to more confirmatory approaches.
If this section was of interest to you, there is plenty more you can read and explore. A good “next step” is the Points chapter of the GDS book (in progress) by Rey, Arribas-Bel and Wolf.