Geographic Data Science - Lecture VIII

Grouping Data over Space

Dani Arribas-Bel

Today

  • The need to group data
  • Geodemographic analysis
  • Non-spatial clustering
  • Regionalization
  • Examples "in the wild"

The need to group data

Everything should be made as simple as possible, but not simpler

Albert Einstein

The need to group data

  • The world (and its problems) are complex and multidimensional
  • Univariate analysis focuses on only one way of measuring the world
  • Sometimes, world issues are best understood as multivariate:

    • Percentage of foreign-born Vs. What is a neighborhood?
    • Years of schooling Vs. Human development
    • Monthly income Vs. Deprivation

Grouping as simplifying

  • Define a given number of categories based on many characteristics (multi-dimensional)
  • Find the category where each observation fits best
  • Reduce complexity, keep all the relevant information
  • Produce easier-to-understand outputs

Geodemographic analysis

Geodemographic analysis

  • Technique developed in the 1970s, attributed to Richard Webber
  • Identify similar neighborhoods  →  Target urban deprivation funding
  • Originated in the Public Sector (policy) and spread to the Private sector (marketing and business intelligence)

Source

How do you segment/cluster observations over space?

  • Statistical clustering
  • Explicitly spatial clustering (regionalization)

Non-spatial clustering

Split a dataset into groups of observations that are similar within the group and dissimilar between groups, based on a series of attributes
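One way to make "similar within the group" concrete is the within-group sum of squares: the total squared distance from each observation to the mean of its group. The sketch below is only an illustration, with made-up data and hypothetical function names (`centroid`, `within_group_ss`); it compares a sensible grouping with an arbitrary one.

```python
from statistics import fmean


def centroid(points):
    """Mean of each attribute across a group of observations."""
    return tuple(fmean(dim) for dim in zip(*points))


def within_group_ss(points, labels):
    """Sum of squared distances from each observation to its group centroid."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    total = 0.0
    for members in groups.values():
        c = centroid(members)
        total += sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in members)
    return total


# Hypothetical toy data: two attributes measured on six observations
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
good = [0, 0, 0, 1, 1, 1]  # groups follow the two visible clumps
bad = [0, 1, 0, 1, 0, 1]   # groups mix the clumps

print("good grouping:", within_group_ss(points, good))
print("bad grouping: ", within_group_ss(points, bad))
```

The lower the value, the more homogeneous the groups are; a clustering algorithm searches for a labelling that makes a quantity of this kind small.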

Machine learning

  • The computer learns some of the properties of the dataset without the human specifying them

Unsupervised

  • There is no a-priori structure imposed on the classification  →  before the analysis, no observation is in a category

Intuition

Clustering

K-means [Source]

K-means
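The usual K-means iteration can be sketched in a few lines: pick k starting centres, assign every observation to its nearest centre, move each centre to the mean of its members, and repeat until the assignments stop changing. What follows is a minimal standard-library sketch, not the implementation used in the course; the function names, the toy data, and the random initialisation are all assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two observations."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Basic K-means: alternate assignment and centre-update steps."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # assumption: start from k random observations
    labels = None
    for _ in range(n_iter):
        # Assignment step: each observation joins its nearest centre
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centres[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centre moves to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centre
                centres[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centres


# Hypothetical toy data: two attributes measured on six observations
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels, centres = kmeans(points, k=2)
print(labels)
print(centres)
```

Note that nothing in this procedure knows where the observations are located: two areas on opposite sides of a map can end up in the same cluster if their attributes are similar.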

More clustering...

  • Hierarchical clustering
  • Agglomerative clustering
  • Spectral clustering
  • Neural networks (e.g. Self-Organizing Maps)
  • DBSCAN
  • ...

Different properties, different best use cases

See interesting comparison table

Regionalization

Spatial Machine Learning

Aggregating basic spatial units (areas) into larger units (regions)

Regionalization

Split a dataset into groups of observations that are similar within the group and dissimilar between groups, based on a series of attributes...

...with the additional constraint that observations in the same group need to be spatial neighbors
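To see what the contiguity constraint changes, here is a deliberately simple greedy sketch: every area starts as its own region, and at each step the two neighbouring regions with the most similar mean value are merged. This is not AZP, ARiSeL, or Max-P, only an illustrative stand-in; the function name `regionalize`, the single attribute, and the toy map (five areas in a row) are assumptions.

```python
def regionalize(values, neighbors, n_regions):
    """Greedy contiguity-constrained agglomeration (illustrative only)."""
    region_of = {a: a for a in values}   # each area starts as its own region
    members = {a: [a] for a in values}   # region id -> areas it contains

    def mean(region):
        return sum(values[a] for a in members[region]) / len(members[region])

    while len(members) > n_regions:
        # Only regions that share a border are candidates for merging
        candidates = {
            tuple(sorted((region_of[a], region_of[b])))
            for a in values
            for b in neighbors[a]
            if region_of[a] != region_of[b]
        }
        if not candidates:
            break  # the map is disconnected, no further merges are possible
        # Merge the most similar adjacent pair (ties broken by region id)
        r1, r2 = min(
            candidates,
            key=lambda pair: (abs(mean(pair[0]) - mean(pair[1])), pair),
        )
        for a in members[r2]:
            region_of[a] = r1
        members[r1].extend(members.pop(r2))
    return region_of


# Hypothetical map: five areas in a row, a - b - c - d - e
values = {"a": 1.0, "b": 1.5, "c": 8.0, "d": 8.5, "e": 1.2}
neighbors = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
result = regionalize(values, neighbors, n_regions=3)
print(result)
```

Area `e` has a value close to `a` and `b`, so a non-spatial method would likely group the three together. Here it cannot join them, because the only path between them crosses the dissimilar areas `c` and `d`.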

Regionalization

  • All the methods aggregate geographical areas into a predefined number of regions, while optimizing a particular aggregation criterion;
  • The areas within a region must be geographically connected (the spatial contiguity constraint);
  • The number of regions must be smaller than or equal to the number of areas;
  • Each area must be assigned to one and only one region;
  • Each region must contain at least one area.

Duque et al. (2007)
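The conditions above can be checked mechanically for any proposed assignment of areas to regions. The sketch below is one possible check, under the assumption that the map is given as a dictionary of neighbour lists and the solution as a dictionary from area to region label; the function name is hypothetical.

```python
from collections import deque


def is_valid_regionalization(region_of, neighbors):
    """Check the assignment, non-emptiness and contiguity conditions."""
    areas = set(neighbors)
    # Every area must be assigned; a dict maps each area to exactly one region
    if set(region_of) != areas:
        return False
    regions = {}
    for area, region in region_of.items():
        regions.setdefault(region, set()).add(area)
    # Regions built this way are non-empty, so there cannot be more regions
    # than areas; the explicit check is kept for readability
    if len(regions) > len(areas):
        return False
    # Contiguity: within each region, every area is reachable from any other
    # by moving only through neighbouring areas of the same region
    for region_members in regions.values():
        start = next(iter(region_members))
        seen = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nb in neighbors[current]:
                if nb in region_members and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        if seen != region_members:
            return False
    return True


# Hypothetical map: five areas in a row, a - b - c - d - e
chain = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
contiguous = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 3}
broken = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 1}  # e is cut off from a and b

print(is_valid_regionalization(contiguous, chain))
print(is_valid_regionalization(broken, chain))
```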

Algorithms

  • Automated Zoning Procedure (AZP)
  • ARiSeL
  • Max-P
  • ...

See Duque et al. (2007) for an excellent, though advanced, overview

Examples

Census geographies

Choropleth

Airbnb neighborhoods

Livehoods

Recapitulation

  • Some problems are truly high-dimensional, and univariate representations are not appropriate
  • Clustering can help reduce complexity by creating categories that retain statistical information but are easier to understand
  • Two main types of clustering in this context:
    • Geodemographic analysis
    • Regionalization

Creative Commons License
Geographic Data Science'15 - Lecture 8 by Dani Arribas-Bel is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.