Buzz adam by darribas

This is a nicely formatted version of the IPython Notebook that shows how to reproduce every results in the paper. The source .ipynb file is the repository (link to file) and a viewable version can be accessed here too. Numbered cells show you the code required and the subsequent cells present the output you obtain when you run it.

In [2]:

import copy
import pysal as ps
import numpy as np
import pandas as pd
import statsmodels.api as sm
from pyGDsandbox.dataIO import dbf2df, df2dbf

Data

The data are provided in the same repository as this notebook and incorporate a csv file with the attributes plus a shapefile with the geometry of the neighborhoods in Amsterdam. If you have downloaded the repository, you can load the data with:

In [3]:

db = pd.read_csv('buzz_data.csv', index_col=0)

If you don't have the csv around but you are connected to the internet, you can load it remotely like this:

In [4]:

url = 'https://raw.github.com/darribas/buzz_adam/master/buzz_data.csv'
db = pd.read_csv(url, index_col=0)

If the data has loaded correctly, the info of the table should look like this, 23 columns with data on 96 neighborhoods:

In [5]:

db

Out[5]:

<class 'pandas.core.frame.DataFrame'>
Index: 96 entries, BU03630000 to BU03631491
Data columns (total 23 columns):
4sq-pct_Arts & Entertainment           96  non-null values
4sq-pct_College & University           96  non-null values
4sq-pct_Food                           96  non-null values
4sq-pct_Other                          96  non-null values
4sq-pct_Outdoors & Recreation          96  non-null values
4sq-pct_Professional & Other Places    96  non-null values
4sq-pct_Travel & Transport             96  non-null values
industrie_pct_area                     96  non-null values
kantoor_pct_area                       96  non-null values
sport_pct_area                         96  non-null values
winkel_pct_area                        96  non-null values
4sq_total_venues                       96  non-null values
total_units                            96  non-null values
div_i                                  96  non-null values
checkins_all                           96  non-null values
('WeekDay', 'Afternoon')               96  non-null values
('WeekDay', 'Evening')                 96  non-null values
('WeekDay', 'Morning')                 96  non-null values
('WeekDay', 'Night')                   96  non-null values
('WeekEnd', 'Afternoon')               96  non-null values
('WeekEnd', 'Evening')                 96  non-null values
('WeekEnd', 'Morning')                 96  non-null values
('WeekEnd', 'Night')                   96  non-null values
dtypes: float64(20), int64(3)

Descriptive figures

The paper contains two descriptives: a choropleth of the distribution and a table that shows basic statistics of each variable. The latter is very simple to replicate (note I only show the variables displayed in the table of the paper, although the dataset contains a few more:

In [6]:

vars = ['4sq-pct_Arts & Entertainment', \
        '4sq-pct_College & University', \
        '4sq-pct_Food', \
        '4sq-pct_Other', \
        '4sq-pct_Outdoors & Recreation', \
        '4sq-pct_Professional & Other Places', \
        '4sq-pct_Travel & Transport', \
        'industrie_pct_area', \
        'kantoor_pct_area', \
        'sport_pct_area', \
        'winkel_pct_area', \
        '4sq_total_venues', \
        'total_units', \
        'div_i', \
        'checkins_all']

In [7]:

np.round(db[vars].describe().T[['mean', 'std', 'min', '25%', '50%', '75%', 'max']], 3)

Out[7]:

	mean	std	min	25%	50%	75%	max
4sq-pct_Arts & Entertainment	7.993	12.322	0.000	0.000	0.000	15.885	46.154
4sq-pct_College & University	1.503	7.377	0.000	0.000	0.000	0.000	50.000
4sq-pct_Food	6.498	16.068	0.000	0.000	0.000	5.794	100.000
4sq-pct_Other	46.072	30.457	0.000	25.000	50.000	63.727	100.000
4sq-pct_Outdoors & Recreation	1.617	10.620	0.000	0.000	0.000	0.000	100.000
4sq-pct_Professional & Other Places	12.212	22.494	0.000	0.000	0.000	16.667	100.000
4sq-pct_Travel & Transport	7.041	16.066	0.000	0.000	0.000	11.111	100.000
industrie_pct_area	7.195	14.445	0.000	0.789	1.970	4.892	76.820
kantoor_pct_area	8.522	12.679	0.014	1.399	3.652	10.339	84.004
sport_pct_area	1.351	3.579	0.000	0.090	0.363	1.204	29.413
winkel_pct_area	3.409	3.666	0.000	1.308	2.382	4.416	29.075
4sq_total_venues	12.865	22.032	0.000	1.000	6.000	11.000	116.000
total_units	4822.031	3001.490	13.000	2390.250	4844.500	6722.750	14763.000
div_i	0.601	0.145	0.000	0.525	0.599	0.725	0.807
checkins_all	806.531	1070.210	5.000	147.000	422.000	1028.250	6620.000

The map in the paper was created using QGIS, and it overlays the neighborhood polygons with the actual checkin points. I cannot re-share the individual data (although you can contact the authors), but we can re-create the choropleth (you will need PySAL 1.7 to reproduce the following cell):

In [8]:

from pysal.contrib.viz import mapping as maps
shp_link = 'adam_buzz_buurts.shp'

_ = maps.plot_choropleth(shp_link, db['checkins_all'].values, 'quantiles', \
                         k=9, cmap='YlGn', figsize=(12, 12), \
                         title='Checkin volume')

Aggregated volume regressions

The main part of the paper is devoted to regressions on the total volume of checkins of a neighborhood. In particular, we estimate a baseline OLS model as specified in Equation (1):

\[ B_i = \alpha + \beta_1 F_i + \beta_2 E_i + \gamma Div_i + \epsilon_i \]

Such regression, displayed in the first column of Table 2 is estimated next:

In [9]:

name_y = 'checkins_all'
x = db[vars[:-1]]
x['constant'] = 1
model = sm.OLS(np.log(db[name_y]), x).fit()
print model.summary()

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           checkins_all   R-squared:                       0.735
Model:                            OLS   Adj. R-squared:                  0.689
Method:                 Least Squares   F-statistic:                     16.04
Date:                Wed, 18 Dec 2013   Prob (F-statistic):           7.37e-18
Time:                        11:40:45   Log-Likelihood:                -108.82
No. Observations:                  96   AIC:                             247.6
Df Residuals:                      81   BIC:                             286.1
Df Model:                          14                                         
=======================================================================================================
                                          coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------------------------
4sq-pct_Arts & Entertainment            0.0285      0.008      3.456      0.001         0.012     0.045
4sq-pct_College & University           -0.0104      0.012     -0.880      0.381        -0.034     0.013
4sq-pct_Food                            0.0208      0.006      3.633      0.000         0.009     0.032
4sq-pct_Other                           0.0121      0.003      3.590      0.001         0.005     0.019
4sq-pct_Outdoors & Recreation           0.0171      0.008      2.059      0.043         0.001     0.034
4sq-pct_Professional & Other Places     0.0129      0.005      2.513      0.014         0.003     0.023
4sq-pct_Travel & Transport              0.0138      0.006      2.423      0.018         0.002     0.025
industrie_pct_area                      0.0184      0.008      2.339      0.022         0.003     0.034
kantoor_pct_area                        0.0339      0.008      4.467      0.000         0.019     0.049
sport_pct_area                         -0.0271      0.030     -0.915      0.363        -0.086     0.032
winkel_pct_area                         0.0187      0.032      0.593      0.555        -0.044     0.081
4sq_total_venues                        0.0209      0.006      3.390      0.001         0.009     0.033
total_units                             0.0002   3.63e-05      5.097      0.000         0.000     0.000
div_i                                   1.5176      0.803      1.890      0.062        -0.080     3.116
constant                                2.1549      0.524      4.112      0.000         1.112     3.198
==============================================================================
Omnibus:                        2.335   Durbin-Watson:                   1.596
Prob(Omnibus):                  0.311   Jarque-Bera (JB):                1.711
Skew:                          -0.273   Prob(JB):                        0.425
Kurtosis:                       3.360   Cond. No.                     6.24e+04
==============================================================================

Warnings:
[1] The condition number is large, 6.24e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

As mentioned in the paper, values for nearby neighborhoods are likely to be correlated. To statistically test, the presence of spatial autocorrelation and, if so, of what type, we run the LM-Lag and LM-Error diagnostics. For that we previously require a spatial weights matrix, which we build following the queen contiguity and row-standardize (we also write it out for later use):

In [10]:

wo = ps.queen_from_shapefile('adam_buzz_buurts.shp')
neigh = copy.copy(wo.neighbors)
weigh = copy.copy(wo.weights)
neigh[27].extend([26, 28])
weigh[27].extend([1., 1.])
w = ps.W(neigh, weigh)
ps.open('adam_buzz_buurts_queenPS.gal', 'w').write(w)
w.transform = 'r'

WARNING: there is one disconnected observation (no neighbors)
Island id:  [27]

The island mentioned in the warning is actually a coding error in the shapefile, so in this case we fix it manually. The diagnostics are easily run as follows (note that we require an OLS from PySAL, which we create as mmodel):

In [11]:

mmodel = ps.spreg.OLS(np.log(db[[name_y]].values),
        db[vars[:-1]].values, w=w, spat_diag=True, nonspat_diag=False)

from pysal.spreg.summary_output import summary_spat_diag_ols
print '\n\t\tSPATIAL DIAGNOSTICS'
print '\t\t-------------------\n'
print summary_spat_diag_ols(mmodel, False)


		SPATIAL DIAGNOSTICS
		-------------------

Lagrange Multiplier (lag)         1       20.874762        0.0000049
Robust LM (lag)                   1       12.079756        0.0005097
Lagrange Multiplier (error)       1        8.942169        0.0027866
Robust LM (error)                 1        0.147163        0.7012617
Lagrange Multiplier (SARMA)       2       21.021925        0.0000272

The clear indication towards a spatial lag prompts us to run such specficiation. In formal notation, this boils down to:

\[ B_i = \alpha + \rho \displaystyle\sum^N_j w_{ij} B_j + \beta_1 F_i + \beta_2 E_i + \gamma Div_i + \epsilon_i \]

Since the model is estimated using ML, we use spdep in R for the regression:

In [12]:

%load_ext rmagic
%R library(spdep)

Loading required package: sp
Loading required package: Matrix
Loading required package: lattice

In [13]:

%%R
db.link = 'adam_vars_richdiv.csv'
db = read.csv(db.link, sep=';')
landuses = names(db)[grep('_pct_area', names(db))]
landuses = c('industrie_pct_area', 'kantoor_pct_area', 'sport_pct_area',
             'winkel_pct_area')
sq4 = names(db)[grep('X4sq.pct', names(db))][1:7]
other = c()
name.X = c(sq4, landuses, 
           c(
            'X4sq_total_venues',
            'total_units',
            'div_i'
            )
        )
x = db[name.X]
########### Rescaling ##############
x['total_units'] <- x['total_units'] / 1000
x['div_i'] <- x['div_i'] * 10
####################################
y = 'checkins_all'
y = log(db[, y])

wnb <- read.gal('adam_buzz_buurts_queenPS.gal',
                override.id=T)
w <- nb2listw(wnb, style="W")
model <- lagsarlm(y ~ ., data=x, w)
print(summary(model))


Call:lagsarlm(formula = y ~ ., data = x, listw = w)

Residuals:
       Min         1Q     Median         3Q        Max 
-1.5708760 -0.3645739 -0.0035605  0.4496111  2.1311628 

Type: lag 
Coefficients: (asymptotic standard errors) 
                                       Estimate Std. Error z value  Pr(>|z|)
(Intercept)                          -0.0510550  0.6701247 -0.0762 0.9392701
X4sq.pct_Arts...Entertainment         0.0157820  0.0068313  2.3102 0.0208746
X4sq.pct_College...University        -0.0063080  0.0096102 -0.6564 0.5115730
X4sq.pct_Food                         0.0132307  0.0048092  2.7511 0.0059391
X4sq.pct_Other                        0.0127954  0.0027478  4.6566 3.214e-06
X4sq.pct_Outdoors...Recreation        0.0093330  0.0067592  1.3808 0.1673445
X4sq.pct_Professional...Other.Places  0.0099200  0.0042057  2.3587 0.0183402
X4sq.pct_Travel...Transport           0.0134285  0.0046176  2.9081 0.0036366
industrie_pct_area                    0.0216871  0.0063970  3.3902 0.0006985
kantoor_pct_area                      0.0223510  0.0063125  3.5407 0.0003990
sport_pct_area                       -0.0213363  0.0242902 -0.8784 0.3797309
winkel_pct_area                       0.0407958  0.0256985  1.5875 0.1124046
X4sq_total_venues                     0.0137894  0.0053181  2.5929 0.0095162
total_units                           0.1736649  0.0295318  5.8806 4.088e-09
div_i                                 0.1266651  0.0652172  1.9422 0.0521123

Rho: 0.43991, LR test value: 19.808, p-value: 8.5624e-06
Asymptotic standard error: 0.09327
    z-value: 4.7165, p-value: 2.3991e-06
Wald statistic: 22.246, p-value: 2.3991e-06

Log likelihood: -98.91989 for lag model
ML residual variance (sigma squared): 0.44128, (sigma: 0.66429)
Number of observations: 96 
Number of parameters estimated: 17 
AIC: 231.84, (AIC for lm: 249.65)
LM test for residual autocorrelation
test value: 3.6243, p-value: 0.056941

These results match those in the second column of Table 2.

Regressions by time of the week

The last part of the paper explores the variability of the effect of diversity over different times of the week. To do that, we run a separate regression for the volume of checkins that occur within a given time window of the week. This is equivalent to:

\[ B_{it} = \alpha + \rho \displaystyle\sum^N_j w_{ij} B_{jt} + \beta_1 F_i + \beta_2 E_i + \gamma Div_i + \epsilon_i \]

Since we are just exploring this variability, we do not report the entire model but only keep the coefficient for \(Div_i\) to plot it over the week, together with its confidence intervals. The regressions though output the entire summary of the model to a file named 'lag_models_summaries.txt' in the same folder.

In [14]:

%%R
cross.walk <- list("X..WeekDay....Morning.." = "Morning-WeekDay",
                   "X..WeekDay....Afternoon.." = "Afternoon-WeekDay",
                   "X..WeekDay....Evening.." = "Evening-WeekDay",
                   "X..WeekDay....Night.." = "Night-WeekDay",
                   "X..WeekEnd....Morning.." = "Morning-WeekEnd",
                   "X..WeekEnd....Afternoon.." = "Afternoon-WeekEnd",
                   "X..WeekEnd....Evening.." = "Evening-WeekEnd",
                   "X..WeekEnd....Night.." = "Night-WeekEnd"
                   )
db.link = 'buzz_data.csv'
db = read.csv(db.link)
landuses = names(db)[grep('_pct_area', names(db))]
landuses = c('industrie_pct_area', 'kantoor_pct_area', 'sport_pct_area',
             'winkel_pct_area')
sq4 = names(db)[grep('X4sq.pct', names(db))][1:7]
other = c()
name.X = c(sq4, landuses, 
           c(
            'X4sq_total_venues',
            'total_units',
            'div_i'
            )
        )
########### Rescaling ##############
db['total_units'] <- db['total_units'] / 1000
db['div_i'] <- db['div_i'] * 10
####################################

wnb <- read.gal('adam_buzz_buurts_queenPS.gal',
                override.id=T)
w <- nb2listw(wnb, style="W")

week.times <- names(db)[grep('Week', names(db))]
res <- data.frame()
divs <- data.frame()
vari <- 15 # Position for Div_i
print('Running regressions')
sink('lag_models_summaries.txt')
for(wtime in week.times){
    print(paste('Model for week time:', wtime))
    y = db[, wtime]
    y[y == 0] = 0.000001
    y = log(y)
    x = db[name.X]
    model <- lagsarlm(y ~ ., data=x, w)
    coefs <- as.data.frame(model$coefficients)
    colnames(coefs) <- cross.walk[wtime]
    res <- c(res, coefs)
    print(summary(model))

    div <- data.frame(beta=model$coefficients[vari], se=model$rest.se[vari],
                      lower=model$coefficients[vari] - model$rest.se[vari]*1.96, 
                      upper=model$coefficients[vari] + model$rest.se[vari]*1.96
                      )
    rownames(div) <- cross.walk[wtime]
    divs <- rbind(div, divs)
}
sink()
res <- as.data.frame(res)
time.line <- c("Morning-WeekDay", "Afternoon-WeekDay", "Evening-WeekDay", "Night-WeekDay", 
               "Morning-WeekEnd", "Afternoon-WeekEnd", "Evening-WeekEnd", "Night-WeekEnd")
divs <- divs[time.line, ]
print(divs)

[1] "Running regressions"
                       beta        se       lower     upper
Morning-WeekDay   0.1047662 0.0789775 -0.05002975 0.2595621
Afternoon-WeekDay 0.3147000 0.1538102  0.01323208 0.6161680
Evening-WeekDay   0.4999191 0.2386536  0.03215808 0.9676800
Night-WeekDay     1.1662525 0.3208840  0.53731992 1.7951850
Morning-WeekEnd   0.7755847 0.2077812  0.36833347 1.1828359
Afternoon-WeekEnd 0.6429518 0.1903473  0.26987110 1.0160325
Evening-WeekEnd   0.8430234 0.2695521  0.31470129 1.3713454
Night-WeekEnd     0.8461017 0.2893739  0.27892897 1.4132745

These results can be then plotted, reproducing the structure in Figure 2 of the paper.

In [15]:

%%R
par(mar = c(7, 4, 4, 2) + 0.1, bty='n')
ylab <- 'Div. coefficients'
plot(seq(8), xaxt = "n", divs$beta, type='l', xlab='', ylab=ylab,
     ylim=c(min(divs$lower), max(divs$upper)))
#lines(seq(8), xaxt = "n", divs$lower, col='green', lw=0.5, type='l', xlab='', ylab='Div_i Coef.')
#lines(seq(8), xaxt = "n", divs$upper, col='green', lw=0.5, type='l', xlab='', ylab='Div_i Coef.')
xx <- c(seq(8), rev(seq(8)))
yy <- c(divs$upper, rev(divs$lower))
polygon(xx, yy, col = 'darkgray', border = NA)
lines(seq(8), xaxt = "n", divs$beta, type='l', xlab='', ylab='Div. Coef.',
     ylim=c(min(divs$lower), max(divs$upper)))
abline(h=0, col='red', lw=2)

abline(v=4.5, col='black', lw=0.5)
axis(1, labels = FALSE)
text(1:8, par("usr")[3] - 0.075, srt = 45, adj = 1,
               labels = time.line, xpd = TRUE)
text(2.25, 1.5, labels='Week days')
text(6.75, 1.5, labels='Weekend')
title(main=paste('Effect of diversity on buzz\n(spatial lag models)'))

"Replication of results for The socio-cultural sources of urban buzz" by Daniel Arribas-Bel is licensed under a Creative Commons Attribution 4.0 International License.