External links
This is a nicely formatted version of the IPython Notebook that shows how to reproduce every result in the paper. The source .ipynb file is in the repository (link to file) and a viewable version can be accessed here too. Numbered cells show the code required and the subsequent cells present the output you obtain when you run it.
import copy
import pysal as ps
import numpy as np
import pandas as pd
import statsmodels.api as sm
from pyGDsandbox.dataIO import dbf2df, df2dbf
The data are provided in the same repository as this notebook and include a csv file with the attributes plus a shapefile with the geometry of the neighborhoods in Amsterdam. If you have downloaded the repository, you can load the data with:
db = pd.read_csv('buzz_data.csv', index_col=0)
If you don't have the csv around but you are connected to the internet, you can load it remotely like this:
url = 'https://raw.github.com/darribas/buzz_adam/master/buzz_data.csv'
db = pd.read_csv(url, index_col=0)
If the data have loaded correctly, the table should contain 23 columns with data on 96 neighborhoods:
db
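If you prefer a quick programmatic check instead of printing the whole table (this snippet is not part of the original notebook), pandas can confirm the dimensions directly:
# Expect 96 rows (neighborhoods) and 23 attribute columns
print db.shape
# Column names, types and non-null counts
db.info()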
The paper contains two descriptive elements: a choropleth of the distribution and a table that shows basic statistics of each variable. The latter is very simple to replicate (note I only show the variables displayed in the table of the paper, although the dataset contains a few more):
vars = ['4sq-pct_Arts & Entertainment', \
'4sq-pct_College & University', \
'4sq-pct_Food', \
'4sq-pct_Other', \
'4sq-pct_Outdoors & Recreation', \
'4sq-pct_Professional & Other Places', \
'4sq-pct_Travel & Transport', \
'industrie_pct_area', \
'kantoor_pct_area', \
'sport_pct_area', \
'winkel_pct_area', \
'4sq_total_venues', \
'total_units', \
'div_i', \
'checkins_all']
np.round(db[vars].describe().T[['mean', 'std', 'min', '25%', '50%', '75%', 'max']], 3)
The map in the paper was created using QGIS, and it overlays the neighborhood polygons with the actual checkin points. I cannot re-share the individual data (although you can contact the authors), but we can re-create the choropleth (you will need PySAL 1.7 to reproduce the following cell):
from pysal.contrib.viz import mapping as maps
shp_link = 'adam_buzz_buurts.shp'
_ = maps.plot_choropleth(shp_link, db['checkins_all'].values, 'quantiles', \
k=9, cmap='YlGn', figsize=(12, 12), \
title='Checkin volume')
The main part of the paper is devoted to regressions on the total volume of checkins of a neighborhood. In particular, we estimate a baseline OLS model as specified in Equation (1):
\[ B_i = \alpha + \beta_1 F_i + \beta_2 E_i + \gamma Div_i + \epsilon_i \]
Such regression, displayed in the first column of Table 2, is estimated next:
name_y = 'checkins_all'
x = db[vars[:-1]]
x['constant'] = 1
model = sm.OLS(np.log(db[name_y]), x).fit()
print model.summary()
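If you only need specific quantities rather than the full summary (a small aside, not in the original notebook), the fitted statsmodels object exposes them as attributes:
# Point estimate for the diversity index and the overall fit
print model.params['div_i']
print model.rsquared
# 95% confidence intervals for all coefficients
print model.conf_int()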
As mentioned in the paper, values for nearby neighborhoods are likely to be correlated. To statistically test the presence of spatial autocorrelation and, if so, determine its type, we run the LM-Lag and LM-Error diagnostics. For that we first require a spatial weights matrix, which we build following queen contiguity and row-standardize (we also write it out for later use):
# Build queen contiguity weights from the shapefile
wo = ps.queen_from_shapefile('adam_buzz_buurts.shp')
# Manually reconnect the island polygon (id 27) to its actual neighbors
neigh = copy.copy(wo.neighbors)
weigh = copy.copy(wo.weights)
neigh[27].extend([26, 28])
weigh[27].extend([1., 1.])
w = ps.W(neigh, weigh)
# Write the corrected weights out for later use in R
ps.open('adam_buzz_buurts_queenPS.gal', 'w').write(w)
# Row-standardize
w.transform = 'r'
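A quick sanity check (not part of the original notebook) confirms that the manual fix removed the island:
# Should print an empty list: no observation is left without neighbors
print w.islands
# Number of observations and average number of neighbors
print w.n, w.mean_neighbors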
The island mentioned in the warning is actually a coding error in the shapefile, so in this case we fix it manually. The diagnostics are easily run as follows (note that we require an OLS from PySAL, which we create as mmodel):
mmodel = ps.spreg.OLS(np.log(db[[name_y]].values),
                      db[vars[:-1]].values, w=w, spat_diag=True, nonspat_diag=False)
from pysal.spreg.summary_output import summary_spat_diag_ols
print '\n\t\tSPATIAL DIAGNOSTICS'
print '\t\t-------------------\n'
print summary_spat_diag_ols(mmodel, False)
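As a complementary check, not included in the original notebook, Moran's I can also be computed directly on the OLS residuals with PySAL:
# Global Moran's I of the residuals under the queen weights (permutation p-value)
mi = ps.Moran(mmodel.u.flatten(), w)
print mi.I, mi.p_sim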
The clear indication towards a spatial lag prompts us to run such a specification. In formal notation, this boils down to:
\[ B_i = \alpha + \rho \displaystyle\sum^N_j w_{ij} B_j + \beta_1 F_i + \beta_2 E_i + \gamma Div_i + \epsilon_i \]
Since the model is estimated using ML, we use spdep in R for the regression:
%load_ext rmagic
%R library(spdep)
%%R
db.link = 'adam_vars_richdiv.csv'
db = read.csv(db.link, sep=';')
landuses = names(db)[grep('_pct_area', names(db))]
landuses = c('industrie_pct_area', 'kantoor_pct_area', 'sport_pct_area',
'winkel_pct_area')
sq4 = names(db)[grep('X4sq.pct', names(db))][1:7]
other = c()
name.X = c(sq4, landuses,
c(
'X4sq_total_venues',
'total_units',
'div_i'
)
)
x = db[name.X]
########### Rescaling ##############
x['total_units'] <- x['total_units'] / 1000
x['div_i'] <- x['div_i'] * 10
####################################
y = 'checkins_all'
y = log(db[, y])
wnb <- read.gal('adam_buzz_buurts_queenPS.gal',
override.id=T)
w <- nb2listw(wnb, style="W")
model <- lagsarlm(y ~ ., data=x, w)
print(summary(model))
These results match those in the second column of Table 2.
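For readers who prefer to stay in Python, recent versions of PySAL (1.7+) also include a maximum-likelihood estimator for the lag model, ps.spreg.ML_Lag. The following is only a sketch using the variables defined above; it skips the rescaling of total_units and div_i applied in the R cell, so the coefficients are not directly comparable with Table 2:
# ML estimation of the spatial lag model (spreg adds the constant automatically)
ml_lag = ps.spreg.ML_Lag(np.log(db[[name_y]].values), db[vars[:-1]].values, w=w,
                         name_y=name_y, name_x=vars[:-1])
print ml_lag.summary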
The last part of the paper explores the variability of the effect of diversity over different times of the week. To do that, we run a separate regression for the volume of checkins that occur within a given time window of the week. This is equivalent to:
\[ B_{it} = \alpha + \rho \displaystyle\sum^N_j w_{ij} B_{jt} + \beta_1 F_i + \beta_2 E_i + \gamma Div_i + \epsilon_i \]
Since we are just exploring this variability, we do not report the entire models; we only keep the coefficient for \(Div_i\) and plot it over the week, together with its confidence intervals. The regressions nevertheless write the full model summaries to a file named 'lag_models_summaries.txt' in the same folder.
%%R
cross.walk <- list("X..WeekDay....Morning.." = "Morning-WeekDay",
"X..WeekDay....Afternoon.." = "Afternoon-WeekDay",
"X..WeekDay....Evening.." = "Evening-WeekDay",
"X..WeekDay....Night.." = "Night-WeekDay",
"X..WeekEnd....Morning.." = "Morning-WeekEnd",
"X..WeekEnd....Afternoon.." = "Afternoon-WeekEnd",
"X..WeekEnd....Evening.." = "Evening-WeekEnd",
"X..WeekEnd....Night.." = "Night-WeekEnd"
)
db.link = 'buzz_data.csv'
db = read.csv(db.link)
landuses = names(db)[grep('_pct_area', names(db))]
landuses = c('industrie_pct_area', 'kantoor_pct_area', 'sport_pct_area',
'winkel_pct_area')
sq4 = names(db)[grep('X4sq.pct', names(db))][1:7]
other = c()
name.X = c(sq4, landuses,
c(
'X4sq_total_venues',
'total_units',
'div_i'
)
)
########### Rescaling ##############
db['total_units'] <- db['total_units'] / 1000
db['div_i'] <- db['div_i'] * 10
####################################
wnb <- read.gal('adam_buzz_buurts_queenPS.gal',
override.id=T)
w <- nb2listw(wnb, style="W")
week.times <- names(db)[grep('Week', names(db))]
res <- data.frame()
divs <- data.frame()
vari <- 15 # Position for Div_i
print('Running regressions')
sink('lag_models_summaries.txt')
for(wtime in week.times){
print(paste('Model for week time:', wtime))
y = db[, wtime]
y[y == 0] = 0.000001
y = log(y)
x = db[name.X]
model <- lagsarlm(y ~ ., data=x, w)
coefs <- as.data.frame(model$coefficients)
colnames(coefs) <- cross.walk[wtime]
res <- c(res, coefs)
print(summary(model))
div <- data.frame(beta=model$coefficients[vari], se=model$rest.se[vari],
lower=model$coefficients[vari] - model$rest.se[vari]*1.96,
upper=model$coefficients[vari] + model$rest.se[vari]*1.96
)
rownames(div) <- cross.walk[wtime]
divs <- rbind(div, divs)
}
sink()
res <- as.data.frame(res)
time.line <- c("Morning-WeekDay", "Afternoon-WeekDay", "Evening-WeekDay", "Night-WeekDay",
"Morning-WeekEnd", "Afternoon-WeekEnd", "Evening-WeekEnd", "Night-WeekEnd")
divs <- divs[time.line, ]
print(divs)
These results can then be plotted, reproducing the structure of Figure 2 in the paper.
%%R
par(mar = c(7, 4, 4, 2) + 0.1, bty='n')
ylab <- 'Div. coefficients'
plot(seq(8), xaxt = "n", divs$beta, type='l', xlab='', ylab=ylab,
ylim=c(min(divs$lower), max(divs$upper)))
#lines(seq(8), xaxt = "n", divs$lower, col='green', lw=0.5, type='l', xlab='', ylab='Div_i Coef.')
#lines(seq(8), xaxt = "n", divs$upper, col='green', lw=0.5, type='l', xlab='', ylab='Div_i Coef.')
xx <- c(seq(8), rev(seq(8)))
yy <- c(divs$upper, rev(divs$lower))
polygon(xx, yy, col = 'darkgray', border = NA)
lines(seq(8), xaxt = "n", divs$beta, type='l', xlab='', ylab='Div. Coef.',
ylim=c(min(divs$lower), max(divs$upper)))
abline(h=0, col='red', lw=2)
abline(v=4.5, col='black', lw=0.5)
axis(1, labels = FALSE)
text(1:8, par("usr")[3] - 0.075, srt = 45, adj = 1,
labels = time.line, xpd = TRUE)
text(2.25, 1.5, labels='Week days')
text(6.75, 1.5, labels='Weekend')
title(main=paste('Effect of diversity on buzz\n(spatial lag models)'))
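As a final aside, not part of the original notebook, a similar figure can be drawn from Python with matplotlib, assuming the divs table has been brought back into the Python session as a pandas DataFrame (with columns beta, lower and upper, ordered as in time.line):
import matplotlib.pyplot as plt

# divs: pandas DataFrame indexed by the eight time windows (assumption, see above)
f, ax = plt.subplots(1, figsize=(9, 6))
xs = range(1, 9)
ax.fill_between(xs, divs['lower'], divs['upper'], color='darkgray')
ax.plot(xs, divs['beta'], color='black')
ax.axhline(0, color='red', linewidth=2)
ax.axvline(4.5, color='black', linewidth=0.5)
ax.set_xticks(xs)
ax.set_xticklabels(divs.index, rotation=45, ha='right')
ax.set_ylabel('Div. coefficients')
ax.set_title('Effect of diversity on buzz\n(spatial lag models)')
plt.show()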
"Replication of results for The socio-cultural sources of urban buzz" by Daniel Arribas-Bel is licensed under a Creative Commons Attribution 4.0 International License.