The following shows the steps, as well as the Python code, required to reproduce the results in the paper. It has been prepared as an IPython notebook. You can read through the page below, download the source file, or download a PDF version.
To reproduce the code below, you will need to have the following requirements installed:
The notebook behind this page is also available for download in the repository.
%matplotlib inline
import os
import pandas as pd
import employment_centers_tools as tools
Employment data for 1990 and 2000 come from the “Census 2000 Special Tabulation Product 64” (stp64), which offers tract-to-tract commuting flows for the whole country. We consider only tracts within the boundaries of the MSAs studied and sum over inflows to obtain employment counts at the tract level. For 2010, there is no stp64, so we use the Census Transportation Planning Products (CTPP) 5-year small-area data, which are based on the 2006-2010 ACS and provide employment counts at the tract level for 2010.
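To make the aggregation concrete, the sketch below illustrates the logic of summing commuting inflows by workplace tract; the file name and the column names ('res_tract', 'work_tract', 'flow') are our own placeholders, not the actual stp64 layout.
# Illustrative only: obtain tract-level employment by summing commuting inflows
# by workplace tract. File and column names are hypothetical placeholders.
flows = pd.read_csv('stp64_flows.csv')
msa_tract_ids = []  # placeholder: fill with the ids of tracts inside the MSAs studied
flows = flows[flows['work_tract'].isin(msa_tract_ids)]
emp_by_tract = flows.groupby('work_tract')['flow'].sum()  # total inflows = tract employment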
Since it is not clear to the authors whether it is legal to redistribute original Census data (particularly when the stp64 is not available as a free download), the raw dataset used for this paper is not included. However, to illustrate how the process was carried out, we show below the structure of the data and how it is fed into the code that identifies employment centers. This should allow users who have obtained the original data from the Census to replicate our results. We do provide the output of the center identification in the form of shapefiles (one per MSA per year); these can be found in the repository where this notebook and the rest of the code are hosted.
We store employment data in csv files, using a separate one for each MSA and year. The raw format of these files is as follows, where the top of the file for MSA 10180 in 1990 is shown:
!head /Users/dani/AAA/LargeData/T-CentersData/attributes/3-empDen/empDen1990/10180.csv
These files can be loaded and combined for 1990 in one DataFrame with the following code:
empF90 = '/Users/dani/AAA/LargeData/T-CentersData/attributes/3-empDen/empDen1990/'
emp90 = pd.concat([tools.load_msa_data(empF90+f, y90=True) for f in os.listdir(empF90)])
emp90.info()
emp90.head()
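For orientation, a minimal sketch of what such a loader might look like is given below. It assumes each per-MSA file carries at least a GISJOIN tract identifier, an employment count (emp) and the tract area (Shape_area), and that the MSA code is taken from the file name; these are our assumptions, not a description of the actual tools.load_msa_data helper.
def load_msa_data_sketch(path, y90=False):
    # Hypothetical stand-in for tools.load_msa_data; column names and the
    # convention of reading the MSA code from the file name are assumptions
    df = pd.read_csv(path, index_col='GISJOIN')[['emp', 'Shape_area']]
    df['msa'] = 'm' + os.path.basename(path).replace('.csv', '')
    # y90 would flag any format differences specific to the 1990 files
    return df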
Similar steps can be taken for 2000:
empF00 = '/Users/dani/AAA/LargeData/T-CentersData/attributes/3-empDen/empDen2000/'
emp00 = pd.concat([tools.load_msa_data(empF00+f) for f in os.listdir(empF00)])
Steps for 2010 are slightly different, as the data are collected from a different source. We begin with a csv that looks as follows:
!head /Users/dani/AAA/LargeData/ctpp/data2010/tractPOW_all.csv
emp10_link = '/Users/dani/AAA/LargeData/ctpp/data2010/tractPOW_all.csv'
shp_link = '/Users/dani/AAA/LargeData/nhgis/shapeFiles/nhgis0019_shapefile_tl2010_us_tract_2010/US_tract_2010.shp'
cty2msa = pd.read_csv('cty2msa.csv', names=['cty', 'msa'], index_col=0, squeeze=True)
shpW_out = '/Users/dani/AAA/LargeData/nhgis/shapeFiles/nhgis0019_shapefile_tl2010_us_tract_2010/msa_shapes_ws/'
emp10 = pd.read_csv(emp10_link)[['GISJOIN', 'emp', 'Shape_area']]
emp10['Shape_area'] = emp10['Shape_area'] * 0.000001 #Area in Km2
emp10['msa'] = emp10['GISJOIN'].apply(lambda x: \
tools.msafy('c'+x[1:3]+x[4:7], cty2msa))
emp10 = emp10.dropna() # Only tracts in MSA && w/ employment
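The helper tools.msafy is part of the companion module and is not reproduced here; conceptually, it maps a county identifier onto its MSA code and returns a missing value when the county does not belong to any of the MSAs studied, which is why the dropna call above keeps only tracts inside an MSA. A minimal sketch under that assumption:
def msafy_sketch(cty, cty2msa):
    # Hypothetical stand-in for tools.msafy: look up the MSA of a county code
    # and return a missing value when the county is outside the MSAs studied,
    # so that the dropna call above filters those tracts out
    return cty2msa.get(cty)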
Once all the data are loaded, it is very straightforward to replicate Table 1 in the paper:
table = pd.DataFrame()
for year, emp in [(1990, emp90), (2000, emp00), (2010, emp10)]:
    minTpMSA = emp.groupby('msa').count()['emp'].min()
    maxTpMSA = emp.groupby('msa').count()['emp'].max()
    meanTpMSA = emp.groupby('msa').count()['emp'].mean()
    summary = pd.Series({'N. MSAs': len(emp['msa'].unique()), \
                         'N. Tracts': len(emp), \
                         'Min. N. Tracts/MSA': minTpMSA, \
                         'Max. N. Tracts/MSA': maxTpMSA, \
                         'Average. N. Tracts/MSA': meanTpMSA})
    table[year] = summary
table.reindex(['N. MSAs', 'N. Tracts', 'Min. N. Tracts/MSA', \
'Max. N. Tracts/MSA', 'Average. N. Tracts/MSA'])
We use the data we have just loaded and combine them with geographical data to derive the spatial relationships needed to run the main identification algorithm. We use shapefiles of the tracts for each year, downloaded from NHGIS and projected to state plane so areas can be calculated (again not included in the repository, for the same redistribution reasons as the employment data).
Since the data are not shipped in the repository, the code below will only run if you provide them separately. This means you need to point cent_inYY (replacing YY with the year) to the folder where you keep a separate shapefile for each MSA. Once you run the code below, shapefiles and .gal files will be created in the cent_outYY directory.
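The heavy lifting below is delegated to tools.act_on_msa. As a point of reference only, spatial relationships for a single MSA can be derived from its tract shapefile and stored as a .gal file with PySAL along these lines; the file names are placeholders and this is not necessarily what act_on_msa does internally.
import pysal as ps
# Illustrative only: build queen contiguity weights for one MSA's tract shapefile
# and store them as a .gal file; the paths are placeholders
w = ps.queen_from_shapefile('tracts_10180.shp', idVariable='GISJOIN')
gal = ps.open('tracts_10180.gal', 'w')
gal.write(w)
gal.close()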
import numpy as np
import multiprocessing as mp
pool = mp.Pool(mp.cpu_count()) # required by the pool.map calls below
np.random.seed(1234)
For 1990, this would identify the centers (make sure to modify cent_in90 and cent_out90 to fit your setup):
cent_in90 = '/Users/dani/AAA/LargeData/T-CentersData/shapes/msaTracts1990polygonsSP/'
cent_out90 = '/Users/dani/Desktop/test90/'
emp90['dens_raw'] = (emp90['emp'] * 1.) / emp90['Shape_area']
emp90['GISJOIN'] = emp90.index
pars = [(emp90[emp90['msa']==msa], cent_in90+msa[1:]+'.shp', cent_out90) \
for msa in emp90['msa'].unique()]
out = pool.map(tools.act_on_msa, pars[:2]) # Note: pars[:2] runs only the first two MSAs; pass pars to process them all
For 2000, this would identify the centers (make sure to modify cent_in00 and cent_out00 to fit your setup):
cent_in00 = '/Users/dani/AAA/LargeData/T-CentersData/shapes/msaTracts2000polygonsSP/'
cent_out00 = '/Users/dani/Desktop/test00/'
emp00['dens_raw'] = (emp00['emp'] * 1.) / emp00['Shape_area']
emp00['GISJOIN'] = emp00.index
pars = [(emp00[emp00['msa']==msa], cent_in00+msa[1:]+'.shp', cent_out00) \
for msa in emp00['msa'].unique()]
out = pool.map(tools.act_on_msa, pars[:2]) # Note: pars[:2] runs only the first two MSAs; pass pars to process them all
For 2010, this would identify the centers (make sure to modify shp_link and cent_out10 to fit your setup). Note that in this case the shapefile is passed slightly differently, because we are using 2010 data that we sourced differently.
cent_out10 = '/Users/dani/Desktop/test10/'
shp_link = '/Users/dani/AAA/LargeData/nhgis/shapeFiles/nhgis0019_shapefile_tl2010_us_tract_2010/US_tract_2010.shp'
emp10['dens_raw'] = (emp10['emp'] * 1.) / emp10['Shape_area']
emp10['GISJOIN'] = emp10.index
pars = [(emp10[emp10['msa']==msa], shp_link, cent_out10) \
for msa in emp10['msa'].unique()]
out = pool.map(tools.act_on_msa, pars[:2]) # Note: pars[:2] runs only the first two MSAs; pass pars to process them all
Once this has run, we have created three folders that contain shapefiles with the identified centers for each MSA in each period. A compilation of these can be found in the repository in geojson format, one file per year. The output can be reloaded and used to recreate Tables 2-4 in the paper.
cent_out90 = '/Users/dani/Desktop/test90/'
cent_out00 = '/Users/dani/Desktop/test00/'
cent_out10 = '/Users/dani/Desktop/test10/'
# Collect all results
years = [1990, 2000, 2010]
links = [cent_out90, cent_out00, cent_out10]
db = []
for year, link in zip(years, links):
    temp = pd.concat(map(lambda l: pd.read_csv(l, index_col=0), \
                         [link+l for l in os.listdir(link) if l[-4:]=='.csv']))
    temp['year'] = year
    db.append(temp)
db = pd.concat(db)
tab = tools.evol_tab(db)
tab
We can re-format this output to make it more readable:
print "Table 2"
tab2 = tab.groupby(level=[0, 1]).sum().unstack().fillna(0)
tab2['Total 1990'] = tab2.sum(axis=1)
tab2.loc['Total 2000', :] = tab2.sum(axis=0)
tab2
print "Table 3"
tab3 = tab.groupby(level=[1, 2]).sum().unstack().fillna(0)
tab3['Total 2000'] = tab3.sum(axis=1)
tab3.loc['Total 2010', :] = tab3.sum(axis=0)
tab3
print "Table 4"
tab4 = tab.groupby(level=[0, 2]).sum().unstack().fillna(0)
tab4['Total 1990'] = tab4.sum(axis=1)
tab4.loc['Total 2010', :] = tab4.sum(axis=0)
tab4
For this section, we also rely on two additional layers. One is the set of MSA centroids, derived from the previous shapefiles by a simple GIS operation and projected to EPSG:2163. The second is an additional shapefile, also downloaded from NHGIS, containing the polygons of the continental (lower 48) US states. Since this layer is very common and is used only for aesthetic purposes, it is not provided either.
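As an illustration of that GIS operation, MSA centroids can be derived and reprojected with geopandas along the following lines; geopandas is not used elsewhere in this notebook and the paths are placeholders, so this is only a sketch of one possible workflow.
import geopandas as gpd
# Illustrative only: reproject MSA polygons to EPSG:2163, replace geometries
# with their centroids and write the result out; paths are placeholders
msas_poly = gpd.read_file('msa_polygons.shp').to_crs(epsg=2163)
msas_pts = msas_poly.copy()
msas_pts['geometry'] = msas_pts.geometry.centroid
msas_pts.to_file('msas_points2163.shp')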
For this illustration, we use only 99 random permutations to build the empirical distribution of both the global and local versions of Moran's I. The results presented in the paper, however, are based on 99,999 permutations and are thus more reliable (although they take much longer to run).
import pysal as ps
msa_pts_link = '/Users/dani/Desktop/msas_points2163.shp'
states_link = '/Users/dani/Desktop/states2163.shp'
years = [1990, 2000, 2010]
perms = 99
msas = db.groupby(['msa', 'year']).apply(\
lambda x: x.groupby('center_id').ngroups).unstack()
msa_order = ['m'+i for i in ps.open(msa_pts_link.replace('.shp', '.dbf')).by_col('CBSA')]
msas = msas.reindex(msa_order)
Next, we compute the Moran's I statistics used in the paper to justify further spatial analysis. The idea is to explore whether the spatial arrangement of polycentricity follows any specific pattern that can be distinguished from spatial randomness; rejecting the null hypothesis points to such a pattern.
morans = []
for k in [9]:
    w = ps.knnW_from_shapefile(msa_pts_link, k=k)
    w.transform = 'R'
    m_90 = ps.Moran(msas[1990].astype(float), w, permutations=perms)
    m_00 = ps.Moran(msas[2000].astype(float), w, permutations=perms)
    m_10 = ps.Moran(msas[2010].astype(float), w, permutations=perms)
    t90_00 = (msas[2000] - msas[1990]).astype(float)
    m_90_00 = ps.Moran(t90_00, w, permutations=perms)
    t00_10 = (msas[2010] - msas[2000]).astype(float)
    m_00_10 = ps.Moran(t00_10, w, permutations=perms)
    out = pd.DataFrame({'I': [m_90.I, m_00.I, m_10.I, m_90_00.I, m_00_10.I], \
                        'p': [m_90.p_sim, m_00.p_sim, m_10.p_sim, m_90_00.p_sim, m_00_10.p_sim], \
                        'var': ['m_90', 'm_00', 'm_10', 'm_90_00', 'm_00_10']})
    out['w'] = 'w' + str(k)
    morans.append(out)
morans = pd.concat(morans).pivot('var', 'w')
morans
# Within a loop in case the user wants to explore different values for k
for k in [9]:
    w = ps.knnW_from_shapefile(msa_pts_link, k=k)
    w.transform = 'R'
    m_90 = ps.Moran_Local(msas[1990].astype(float).values, w, permutations=perms)
    p = tools.plot_lisa(m_90, st=states_link, msa=msa_pts_link, title='Centers in 1990')
    m_00 = ps.Moran_Local(msas[2000].astype(float).values, w, permutations=perms)
    p = tools.plot_lisa(m_00, st=states_link, msa=msa_pts_link, title='Centers in 2000 | Map 1')
    m_10 = ps.Moran_Local(msas[2010].astype(float).values, w, permutations=perms)
    p = tools.plot_lisa(m_10, st=states_link, msa=msa_pts_link, title='Centers in 2010')
    t90_00 = (msas[2000] - msas[1990]).astype(float).values
    m_90_00 = ps.Moran_Local(t90_00, w, permutations=perms)
    p = tools.plot_lisa(m_90_00, st=states_link, msa=msa_pts_link, \
                        title='Change in centers 1990-2000')
    t00_10 = (msas[2010] - msas[2000]).astype(float).values
    m_00_10 = ps.Moran_Local(t00_10, w, permutations=perms)
    p = tools.plot_lisa(m_00_10, st=states_link, msa=msa_pts_link, \
                        title='Change in centers 2000-2010 | Map 2')
Below are the results of the random labeling analysis performed in Section 6 and presented in Tables 5 to 8 of the paper. This calculates average values for several variables by type of MSA (no centers, monocentric, polycentric) and tests for statistically significant differences between the averages using the random labeling technique borrowed from Rey & Sastré-Gutiérrez (2010). The variables employed are total population, employment density, income per capita, and the percentage of people below the poverty line.
Population, income and poverty are also obtained from NHGIS and are not redistributed here. Employment and density are calculated from the data shown in the sections above.
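The test itself is implemented in tools.do_rl. For reference, a conceptual sketch of the random labeling idea is shown below; it is our own simplification, with hypothetical labels 'monocentric' and 'polycentric', not the actual implementation.
def random_labeling_sketch(values, labels, perms=99):
    # Illustrative permutation test: is the observed difference in group means
    # larger than what random re-assignment of the labels would produce?
    # The group labels used here ('polycentric', 'monocentric') are assumptions.
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    obs = values[labels == 'polycentric'].mean() - values[labels == 'monocentric'].mean()
    count = 0
    for _ in range(perms):
        shuffled = np.random.permutation(labels)
        diff = values[shuffled == 'polycentric'].mean() - values[shuffled == 'monocentric'].mean()
        if abs(diff) >= abs(obs):
            count += 1
    p_sim = (count + 1.) / (perms + 1.)  # pseudo p-value from the permutation distribution
    return obs, p_sim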
pop00F='/Users/dani/AAA/LargeData/T-CentersData/attributes/5-pop/pop2000/'
pop00 = pd.concat([tools.load_soc_ec(pop00F+l) for l in os.listdir(pop00F)])
pop00 = pop00.groupby('msa').sum()
sec00F='/Users/dani/AAA/LargeData/T-CentersData/attributes/7-socEc/socEc2000/'
sec00 = pd.concat([tools.load_soc_ec(sec00F+l) for l in os.listdir(sec00F)])
sec00 = sec00.groupby('msa').sum().drop('incpc', axis=1)
msas_soec = db.groupby(['msa', 'year']).apply(\
lambda x: x.groupby('center_id').ngroups)\
.apply(tools._monopoly).unstack()
msas_soec = msas_soec.join(pop00).join(sec00)
msas_soec = msas_soec.join(db[db['year']==2000][['emp', 'Shape_area', 'msa']]\
.groupby('msa').sum())
msas_soec['emp_dens'] = msas_soec['emp'] * 1. / msas_soec['Shape_area']
msas_soec['inc_pc'] = msas_soec['inc'] * 1. / msas_soec['pop']
msas_soec['pct_below'] = msas_soec['below'] * 100. / msas_soec['pop']
msas_soec = msas_soec[[1990, 2000, 2010, \
'pop', 'emp_dens', 'inc_pc', 'pct_below']]
perms = 99
rl90_00 = tools.do_rl(msas_soec, [1990, 2000], perms)
for var in rl90_00:
    print "\n\t%s"%var
    print rl90_00[var].unstack().fillna('')
    print '-------------------------------------------------------------------'
rl_tables = []
rl00_10 = tools.do_rl(msas_soec, [2000, 2010], perms)
for var in rl00_10:
    print "\n\t%s"%var
    tab = rl00_10[var].unstack().fillna('')
    print tab
    rl_tables.append(tab)
These tables can be presented in a more pleasant way, adding the global averages, as they are shown in the paper:
print "Table 5: Size"
print "Global average: %.1f"%msas_soec['pop'].mean()
rl_tables[0]
print "Table 6: Density"
print "Global average: %.1f"%msas_soec['emp_dens'].mean()
rl_tables[1]
print "Table 7: Income per capita"
print "Global average: %.1f"%msas_soec['inc_pc'].mean()
rl_tables[2]
print "Table 8: Poverty (in %)"
print "Global average: %.1f"%msas_soec['pct_below'].mean()
rl_tables[3]