Spatial Data Processing with PySAL & Pandas

IPYNB

#by convention, we use these shorter two-letter names
import pysal as ps
import pandas as pd
import numpy as np

PySAL has two simple ways to read in data. But, first, you need to get the path from where your notebook is running on your computer to the place the data is. For example, to find where the notebook is running:

!pwd # on windows !cd

/Users/dani/code/gds_scipy16/content/part1

PySAL has a command that it uses to get the paths of its example datasets. Let's work with a commonly-used dataset first.

ps.examples.available()

['10740',
 'arcgis',
 'baltim',
 'book',
 'burkitt',
 'calemp',
 'chicago',
 'columbus',
 'desmith',
 'geodanet',
 'juvenile',
 'Line',
 'mexico',
 'nat',
 'networks',
 'newHaven',
 'Point',
 'Polygon',
 'sacramento2',
 'sids2',
 'snow_maps',
 'south',
 'stl',
 'street_net_pts',
 'taz',
 'us_income',
 'virginia',
 'wmat']

ps.examples.explain('us_income')

{'description': 'Per-capita income for the lower 47 US states 1929-2010',
 'explanation': [' * us48.shp: shapefile ',
  ' * us48.dbf: dbf for shapefile',
  ' * us48.shx: index for shapefile',
  ' * usjoin.csv: attribute data (comma delimited file)'],
 'name': 'us_income'}

csv_path = ps.examples.get_path('usjoin.csv')

f = ps.open(csv_path)
f.header[0:10]

['Name',
 'STATE_FIPS',
 '1929',
 '1930',
 '1931',
 '1932',
 '1933',
 '1934',
 '1935',
 '1936']

y2009 = f.by_col('2009')

y2009[0:10]

[32274, 32077, 31493, 40902, 40093, 52736, 40135, 36565, 33086, 30987]

Working with shapefiles

We can also work with local files outside the built-in examples.

To read in a shapefile, we will need the path to the file.

shp_path = '../data/texas.shp'
print(shp_path)

../data/texas.shp

Then, we open the file using the ps.open command:

f = ps.open(shp_path)

f is what we call a "file handle." That means that it only points to the data and provides ways to work with it. By itself, it does not read the whole dataset into memory. To see basic information about the file, we can use a few different methods.

For instance, the header of the file, which contains most of the metadata about the file:

f.header

{'BBOX Mmax': 0.0,
 'BBOX Mmin': 0.0,
 'BBOX Xmax': -93.50721740722656,
 'BBOX Xmin': -106.6495132446289,
 'BBOX Ymax': 36.49387741088867,
 'BBOX Ymin': 25.845197677612305,
 'BBOX Zmax': 0.0,
 'BBOX Zmin': 0.0,
 'File Code': 9994,
 'File Length': 49902,
 'Shape Type': 5,
 'Unused0': 0,
 'Unused1': 0,
 'Unused2': 0,
 'Unused3': 0,
 'Unused4': 0,
 'Version': 1000}

To actually read in the shapes from memory, you can use the following commands:

f.by_row(14) #gets the 14th shape from the file

<pysal.cg.shapes.Polygon at 0x10d8baa20>

all_polygons = f.read() #reads in all polygons from memory

len(all_polygons)

So, all 254 polygons have been read in from file. These are stored in PySAL shape objects, which can be used by PySAL and can be converted to other Python shape objects.

They typically have a few methods. So, since we've read in polygonal data, we can get some properties about the polygons. Let's just have a look at the first polygon:

all_polygons[0:5]

[<pysal.cg.shapes.Polygon at 0x10d8baba8>,
 <pysal.cg.shapes.Polygon at 0x10d8ba908>,
 <pysal.cg.shapes.Polygon at 0x10d8ba860>,
 <pysal.cg.shapes.Polygon at 0x10d8ba8d0>,
 <pysal.cg.shapes.Polygon at 0x10d8baa90>]

all_polygons[0].centroid #the centroid of the first polygon

(-100.27156110567945, 36.27508640938005)

all_polygons[0].area

0.23682222998468205

all_polygons[0].perimeter

1.9582821721538344

While in the Jupyter Notebook, you can examine what properties an object has by using the tab key.

polygon = all_polygons[0]

polygon. #press tab when the cursor is right after the dot

  File "<ipython-input-20-aa03438a2fa8>", line 1
    polygon. #press tab when the cursor is right after the dot
                                                              ^
SyntaxError: invalid syntax

Working with Data Tables

dbf_path = "../data/texas.dbf"
print(dbf_path)

../data/texas.dbf

When you're working with tables of data, like a csv or dbf, you can extract your data in the following way. Let's open the dbf file we got the path for above.

f = ps.open(dbf_path)

Just like with the shapefile, we can examine the header of the dbf file.

f.header

['NAME',
 'STATE_NAME',
 'STATE_FIPS',
 'CNTY_FIPS',
 'FIPS',
 'STFIPS',
 'COFIPS',
 'FIPSNO',
 'SOUTH',
 'HR60',
 'HR70',
 'HR80',
 'HR90',
 'HC60',
 'HC70',
 'HC80',
 'HC90',
 'PO60',
 'PO70',
 'PO80',
 'PO90',
 'RD60',
 'RD70',
 'RD80',
 'RD90',
 'PS60',
 'PS70',
 'PS80',
 'PS90',
 'UE60',
 'UE70',
 'UE80',
 'UE90',
 'DV60',
 'DV70',
 'DV80',
 'DV90',
 'MA60',
 'MA70',
 'MA80',
 'MA90',
 'POL60',
 'POL70',
 'POL80',
 'POL90',
 'DNL60',
 'DNL70',
 'DNL80',
 'DNL90',
 'MFIL59',
 'MFIL69',
 'MFIL79',
 'MFIL89',
 'FP59',
 'FP69',
 'FP79',
 'FP89',
 'BLK60',
 'BLK70',
 'BLK80',
 'BLK90',
 'GI59',
 'GI69',
 'GI79',
 'GI89',
 'FH60',
 'FH70',
 'FH80',
 'FH90']

So, the header is a list containing the names of all of the fields we can read. If we just wanted to grab the data of interest, HR90, we can use either by_col or by_col_array, depending on the format we want the resulting data in:

HR90 = f.by_col('HR90')
print(type(HR90).__name__, HR90[0:5])
HR90 = f.by_col_array('HR90')
print(type(HR90).__name__, HR90[0:5])

list [0.0, 0.0, 18.31166453, 0.0, 3.6517674554]
ndarray [[  0.        ]
 [  0.        ]
 [ 18.31166453]
 [  0.        ]
 [  3.65176746]]

As you can see, the by_col function returns a list of data, with no shape. It can only return one column at a time:

HRs = f.by_col('HR90', 'HR80')

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-25-1fef6a3c3a50> in <module>()
----> 1 HRs = f.by_col('HR90', 'HR80')


TypeError: __call__() takes 2 positional arguments but 3 were given

This error message is called a "traceback," as you see in the top right, and it usually provides feedback on why the previous command did not execute correctly. Here, you see that one-too-many arguments was provided to __call__, which tells us we cannot pass as many arguments as we did to by_col.

If you want to read in many columns at once and store them to an array, use by_col_array:

HRs = f.by_col_array('HR90', 'HR80')

HRs[0:10]

array([[  0.        ,   0.        ],
       [  0.        ,  10.50199538],
       [ 18.31166453,   5.10386362],
       [  0.        ,   0.        ],
       [  3.65176746,  10.4297038 ],
       [  0.        ,   0.        ],
       [  0.        ,  18.85369532],
       [  2.59514448,   6.33617194],
       [  0.        ,   0.        ],
       [  5.59753708,   6.0331825 ]])

It is best to use by_col_array on data of a single type. That is, if you read in a lot of columns, some of them numbers and some of them strings, all columns will get converted to the same datatype:

allcolumns = f.by_col_array(['NAME', 'STATE_NAME', 'HR90', 'HR80'])

allcolumns

array([['Lipscomb', 'Texas', '0.0', '0.0'],
       ['Sherman', 'Texas', '0.0', '10.501995379'],
       ['Dallam', 'Texas', '18.31166453', '5.1038636248'],
       ..., 
       ['Hidalgo', 'Texas', '7.3003167816', '8.2383277607'],
       ['Willacy', 'Texas', '5.6481219994', '7.6212251119'],
       ['Cameron', 'Texas', '12.302014455', '11.761321464']], 
      dtype='<U13')

Note that the numerical columns, HR90 & HR80 are now considered strings, since they show up with the single tickmarks around them, like '0.0'.

These methods work similarly for .csv files as well.

Using Pandas with PySAL

A new functionality added to PySAL recently allows you to work with shapefile/dbf pairs using Pandas. This optional extension is only turned on if you have Pandas installed. The extension is the ps.pdio module:

ps.pdio

<module 'pysal.contrib.pdutilities' from '/Users/dani/anaconda/envs/gds-scipy16/lib/python3.5/site-packages/pysal/contrib/pdutilities/__init__.py'>

To use it, you can read in shapefile/dbf pairs using the ps.pdio.read_files command.

shp_path = ps.examples.get_path('NAT.shp')
data_table = ps.pdio.read_files(shp_path)

This reads in the entire database table and adds a column to the end, called geometry, that stores the geometries read in from the shapefile.

Now, you can work with it like a standard pandas dataframe.

data_table.head()

	NAME	STATE_NAME	STATE_FIPS	CNTY_FIPS	FIPS	STFIPS	COFIPS	FIPSNO	HR60	...	BLK90	GI59	GI69	GI79	GI89	FH60	FH70	FH80	FH90	geometry
0	Lake of the Woods	Minnesota	27	077	27077	27	77	27077	0.000000	...	0.024534	0.285235	0.372336	0.342104	0.336455	11.279621	5.4	5.663881	9.515860	<pysal.cg.shapes.Polygon object at 0x104125ef0>
1	Ferry	Washington	53	019	53019	53	19	53019	0.000000	...	0.317712	0.256158	0.360665	0.361928	0.360640	10.053476	2.6	10.079576	11.397059	<pysal.cg.shapes.Polygon object at 0x10d8ba390>
2	Stevens	Washington	53	065	53065	53	65	53065	1.863863	...	0.210030	0.283999	0.394083	0.357566	0.369942	9.258437	5.6	6.812127	10.352015	<pysal.cg.shapes.Polygon object at 0x104142400>
3	Okanogan	Washington	53	047	53047	53	47	53047	2.612330	...	0.155922	0.258540	0.371218	0.381240	0.394519	9.039900	8.1	10.084926	12.840340	<pysal.cg.shapes.Polygon object at 0x10dadd5c0>
4	Pend Oreille	Washington	53	051	53051	53	51	53051	0.000000	...	0.134605	0.243263	0.365614	0.358706	0.387848	8.243930	4.1	7.557643	10.313002	<pysal.cg.shapes.Polygon object at 0x10dadd630>

5 rows × 70 columns

The read_files function only works on shapefile/dbf pairs. If you need to read in data using CSVs, use pandas directly:

usjoin = pd.read_csv(csv_path)
#usjoin = ps.pdio.read_files(csv_path) #will not work, not a shp/dbf pair

usjoin.head()

	Name	STATE_FIPS	1929	1930	1931	1932	1933	1934	1935	1936	...	2000	2001	2002	2003	2004	2005	2006	2007	2008	2009
0	Alabama	1	323	267	224	162	166	211	217	251	...	23471	24467	25161	26065	27665	29097	30634	31988	32819	32274
1	Arizona	4	600	520	429	321	308	362	416	462	...	25578	26232	26469	27106	28753	30671	32552	33470	33445	32077
2	Arkansas	5	310	228	215	157	157	187	207	247	...	22257	23532	23929	25074	26465	27512	29041	31070	31800	31493
3	California	6	991	887	749	580	546	603	660	771	...	32275	32750	32900	33801	35663	37463	40169	41943	42377	40902
4	Colorado	8	634	578	471	354	353	368	444	542	...	32949	34228	33963	34092	35543	37388	39662	41165	41719	40093

5 rows × 83 columns

The nice thing about working with pandas dataframes is that they have very powerful baked-in support for relational-style queries. By this, I mean that it is very easy to find things like:

The number of counties in each state:

data_table.groupby("STATE_NAME").size()

STATE_NAME
Alabama                  67
Arizona                  14
Arkansas                 75
California               58
Colorado                 63
Connecticut               8
Delaware                  3
District of Columbia      1
Florida                  67
Georgia                 159
Idaho                    44
Illinois                102
Indiana                  92
Iowa                     99
Kansas                  105
Kentucky                120
Louisiana                64
Maine                    16
Maryland                 24
Massachusetts            12
Michigan                 83
Minnesota                87
Mississippi              82
Missouri                115
Montana                  55
Nebraska                 93
Nevada                   17
New Hampshire            10
New Jersey               21
New Mexico               32
New York                 58
North Carolina          100
North Dakota             53
Ohio                     88
Oklahoma                 77
Oregon                   36
Pennsylvania             67
Rhode Island              5
South Carolina           46
South Dakota             66
Tennessee                95
Texas                   254
Utah                     29
Vermont                  14
Virginia                123
Washington               38
West Virginia            55
Wisconsin                70
Wyoming                  23
dtype: int64

Or, to get the rows of the table that are in Arizona, we can use the query function of the dataframe:

data_table.query('STATE_NAME == "Arizona"')

	NAME	STATE_NAME	STATE_FIPS	CNTY_FIPS	FIPS	STFIPS	COFIPS	FIPSNO	HR60	...	BLK90	GI59	GI69	GI79	GI89	FH60	FH70	FH80	FH90	geometry
1707	Navajo	Arizona	04	017	04017	4	17	4017	5.263989	...	0.905251	0.366863	0.414135	0.401999	0.445299	13.146998	12.1	13.762783	18.033782	<pysal.cg.shapes.Polygon object at 0x10e280080>
1708	Coconino	Arizona	04	005	04005	4	5	4005	3.185449	...	1.469081	0.301222	0.377785	0.381655	0.403188	9.475171	8.5	11.181563	15.267643	<pysal.cg.shapes.Polygon object at 0x10e2800f0>
1722	Mohave	Arizona	04	015	04015	4	15	4015	0.000000	...	0.324075	0.279339	0.347150	0.375790	0.374383	11.508554	4.8	7.018268	9.214294	<pysal.cg.shapes.Polygon object at 0x10e280710>
1726	Apache	Arizona	04	001	04001	4	1	4001	10.951223	...	0.162361	0.395913	0.450552	0.431013	0.489132	15.014738	14.6	18.727548	22.933635	<pysal.cg.shapes.Polygon object at 0x10e2808d0>
2002	Yavapai	Arizona	04	025	04025	4	25	4025	3.458771	...	0.298011	0.289509	0.378195	0.376313	0.384089	9.930032	8.6	7.516372	9.483521	<pysal.cg.shapes.Polygon object at 0x10e394518>
2182	Gila	Arizona	04	007	04007	4	7	4007	6.473749	...	0.246171	0.265294	0.337519	0.353848	0.386976	10.470261	8.1	9.934237	11.706102	<pysal.cg.shapes.Polygon object at 0x10e4500b8>
2262	Maricopa	Arizona	04	013	04013	4	13	4013	6.179259	...	3.499221	0.277828	0.352374	0.366015	0.372756	10.642382	9.8	11.857260	14.404902	<pysal.cg.shapes.Polygon object at 0x10e49d438>
2311	Greenlee	Arizona	04	011	04011	4	11	4011	2.896284	...	0.349650	0.177691	0.257158	0.283518	0.337256	9.806115	6.7	5.295110	10.453284	<pysal.cg.shapes.Polygon object at 0x10e4c1a58>
2326	Graham	Arizona	04	009	04009	4	9	4009	4.746648	...	1.890487	0.310256	0.362926	0.383554	0.408379	11.979335	10.1	11.961367	16.129032	<pysal.cg.shapes.Polygon object at 0x10e4ea128>
2353	Pinal	Arizona	04	021	04021	4	21	4021	13.828390	...	3.134586	0.304294	0.369974	0.361193	0.400130	10.822965	8.8	10.341699	15.304144	<pysal.cg.shapes.Polygon object at 0x10e4ead68>
2499	Pima	Arizona	04	019	04019	4	19	4019	5.520841	...	3.118252	0.268266	0.367218	0.375039	0.392144	11.381626	10.2	12.689768	16.163178	<pysal.cg.shapes.Polygon object at 0x10e575e80>
2514	Cochise	Arizona	04	003	04003	4	3	4003	4.845049	...	5.201590	0.261208	0.359500	0.359701	0.399208	10.197573	8.7	9.912732	13.733872	<pysal.cg.shapes.Polygon object at 0x10e59b550>
2615	Santa Cruz	Arizona	04	023	04023	4	23	4023	9.252406	...	0.326863	0.327130	0.396807	0.393240	0.413795	19.007213	14.7	15.690913	18.272244	<pysal.cg.shapes.Polygon object at 0x10e60a2e8>
3080	La Paz	Arizona	04	012	04012	4	12	4012	5.046682	...	2.628811	0.271556	0.364110	0.372662	0.405743	9.216414	8.0	9.296093	12.379134	<pysal.cg.shapes.Polygon object at 0x10e79bb70>

14 rows × 70 columns

Behind the scenes, this uses a fast vectorized library, numexpr, to essentially do the following.

First, compare each row's STATE_NAME column to 'Arizona' and return True if the row matches:

data_table.STATE_NAME == 'Arizona'

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
3055    False
3056    False
3057    False
3058    False
3059    False
3060    False
3061    False
3062    False
3063    False
3064    False
3065    False
3066    False
3067    False
3068    False
3069    False
3070    False
3071    False
3072    False
3073    False
3074    False
3075    False
3076    False
3077    False
3078    False
3079    False
3080     True
3081    False
3082    False
3083    False
3084    False
Name: STATE_NAME, dtype: bool

Then, use that to filter out rows where the condition is true:

data_table[data_table.STATE_NAME == 'Arizona']

	NAME	STATE_NAME	STATE_FIPS	CNTY_FIPS	FIPS	STFIPS	COFIPS	FIPSNO	HR60	...	BLK90	GI59	GI69	GI79	GI89	FH60	FH70	FH80	FH90	geometry
1707	Navajo	Arizona	04	017	04017	4	17	4017	5.263989	...	0.905251	0.366863	0.414135	0.401999	0.445299	13.146998	12.1	13.762783	18.033782	<pysal.cg.shapes.Polygon object at 0x10e280080>
1708	Coconino	Arizona	04	005	04005	4	5	4005	3.185449	...	1.469081	0.301222	0.377785	0.381655	0.403188	9.475171	8.5	11.181563	15.267643	<pysal.cg.shapes.Polygon object at 0x10e2800f0>
1722	Mohave	Arizona	04	015	04015	4	15	4015	0.000000	...	0.324075	0.279339	0.347150	0.375790	0.374383	11.508554	4.8	7.018268	9.214294	<pysal.cg.shapes.Polygon object at 0x10e280710>
1726	Apache	Arizona	04	001	04001	4	1	4001	10.951223	...	0.162361	0.395913	0.450552	0.431013	0.489132	15.014738	14.6	18.727548	22.933635	<pysal.cg.shapes.Polygon object at 0x10e2808d0>
2002	Yavapai	Arizona	04	025	04025	4	25	4025	3.458771	...	0.298011	0.289509	0.378195	0.376313	0.384089	9.930032	8.6	7.516372	9.483521	<pysal.cg.shapes.Polygon object at 0x10e394518>
2182	Gila	Arizona	04	007	04007	4	7	4007	6.473749	...	0.246171	0.265294	0.337519	0.353848	0.386976	10.470261	8.1	9.934237	11.706102	<pysal.cg.shapes.Polygon object at 0x10e4500b8>
2262	Maricopa	Arizona	04	013	04013	4	13	4013	6.179259	...	3.499221	0.277828	0.352374	0.366015	0.372756	10.642382	9.8	11.857260	14.404902	<pysal.cg.shapes.Polygon object at 0x10e49d438>
2311	Greenlee	Arizona	04	011	04011	4	11	4011	2.896284	...	0.349650	0.177691	0.257158	0.283518	0.337256	9.806115	6.7	5.295110	10.453284	<pysal.cg.shapes.Polygon object at 0x10e4c1a58>
2326	Graham	Arizona	04	009	04009	4	9	4009	4.746648	...	1.890487	0.310256	0.362926	0.383554	0.408379	11.979335	10.1	11.961367	16.129032	<pysal.cg.shapes.Polygon object at 0x10e4ea128>
2353	Pinal	Arizona	04	021	04021	4	21	4021	13.828390	...	3.134586	0.304294	0.369974	0.361193	0.400130	10.822965	8.8	10.341699	15.304144	<pysal.cg.shapes.Polygon object at 0x10e4ead68>
2499	Pima	Arizona	04	019	04019	4	19	4019	5.520841	...	3.118252	0.268266	0.367218	0.375039	0.392144	11.381626	10.2	12.689768	16.163178	<pysal.cg.shapes.Polygon object at 0x10e575e80>
2514	Cochise	Arizona	04	003	04003	4	3	4003	4.845049	...	5.201590	0.261208	0.359500	0.359701	0.399208	10.197573	8.7	9.912732	13.733872	<pysal.cg.shapes.Polygon object at 0x10e59b550>
2615	Santa Cruz	Arizona	04	023	04023	4	23	4023	9.252406	...	0.326863	0.327130	0.396807	0.393240	0.413795	19.007213	14.7	15.690913	18.272244	<pysal.cg.shapes.Polygon object at 0x10e60a2e8>
3080	La Paz	Arizona	04	012	04012	4	12	4012	5.046682	...	2.628811	0.271556	0.364110	0.372662	0.405743	9.216414	8.0	9.296093	12.379134	<pysal.cg.shapes.Polygon object at 0x10e79bb70>

14 rows × 70 columns

We might need this behind the scenes knowledge when we want to chain together conditions, or when we need to do spatial queries.

This is because spatial queries are somewhat more complex. Let's say, for example, we want all of the counties in the US to the West of -121 longitude. We need a way to express that question. Ideally, we want something like:

SELECT
        *
FROM
        data_table
WHERE
        x_centroid < -121

So, let's refer to an arbitrary polygon in the the dataframe's geometry column as poly. The centroid of a PySAL polygon is stored as an (X,Y) pair, so the longitude is the first element of the pair, poly.centroid[0].

Then, applying this condition to each geometry, we get the same kind of filter we used above to grab only counties in Arizona:

data_table.geometry.apply(lambda x: x.centroid[0] < -121)\
                   .head()

0    False
1    False
2    False
3    False
4    False
Name: geometry, dtype: bool

If we use this as a filter on the table, we can get only the rows that match that condition, just like we did for the STATE_NAME query:

data_table[data_table.geometry.apply(lambda x: x.centroid[0] < -119)].head()

	NAME	STATE_NAME	STATE_FIPS	CNTY_FIPS	FIPS	STFIPS	COFIPS	FIPSNO	HR60	...	BLK90	GI59	GI69	GI79	GI89	FH60	FH70	FH80	FH90	geometry
3	Okanogan	Washington	53	047	53047	53	47	53047	2.612330	...	0.155922	0.258540	0.371218	0.381240	0.394519	9.039900	8.1	10.084926	12.840340	<pysal.cg.shapes.Polygon object at 0x10dadd5c0>
27	Whatcom	Washington	53	073	53073	53	73	53073	1.422131	...	0.508687	0.247630	0.346935	0.369436	0.358418	9.174415	7.1	9.718054	11.135022	<pysal.cg.shapes.Polygon object at 0x10df05080>
31	Skagit	Washington	53	057	53057	53	57	53057	2.596560	...	0.351958	0.239346	0.344830	0.364623	0.362265	8.611518	7.9	10.480031	11.382484	<pysal.cg.shapes.Polygon object at 0x10df05240>
42	Chelan	Washington	53	007	53007	53	7	53007	4.908698	...	0.153110	0.246292	0.367681	0.374505	0.383486	8.787907	8.1	9.968454	12.236493	<pysal.cg.shapes.Polygon object at 0x10df05710>
44	Clallam	Washington	53	009	53009	53	9	53009	3.330891	...	0.568504	0.240573	0.349320	0.361619	0.366854	8.788882	6.5	9.660900	12.281690	<pysal.cg.shapes.Polygon object at 0x10df057f0>

5 rows × 70 columns

len(data_table[data_table.geometry.apply(lambda x: x.centroid[0] < -119)]) #how many west of -119?

Other types of spatial queries

Everybody knows the following statements are true:

If you head directly west from Reno, Nevada, you will shortly enter California.
San Diego is in California.

But what does this tell us about the location of San Diego relative to Reno?

Or for that matter, how many counties in California are to the east of Reno?

geom = data_table.query('(NAME == "Washoe") & (STATE_NAME == "Nevada")').geometry

lon,lat = geom.values[0].centroid

lon

-119.6555030699793

cal_counties = data_table.query('(STATE_NAME=="California")')

cal_counties[cal_counties.geometry.apply(lambda x: x.centroid[0] > lon)]

	NAME	STATE_NAME	STATE_FIPS	CNTY_FIPS	FIPS	STFIPS	COFIPS	FIPSNO	HR60	...	BLK90	GI59	GI69	GI79	GI89	FH60	FH70	FH80	FH90	geometry
1312	Mono	California	06	051	06051	6	51	6051	15.062509	...	0.431900	0.229888	0.327520	0.388414	0.366316	6.743421	6.6	8.187135	10.083699	<pysal.cg.shapes.Polygon object at 0x10dfd5e48>
1591	Fresno	California	06	019	06019	6	19	6019	5.192037	...	5.007266	0.286651	0.379884	0.394981	0.412947	11.788963	11.8	13.998747	18.523541	<pysal.cg.shapes.Polygon object at 0x10e0ecc50>
1620	Inyo	California	06	027	06027	6	27	6027	8.558713	...	0.432143	0.242293	0.334361	0.353784	0.378516	9.676365	8.1	8.480065	12.067279	<pysal.cg.shapes.Polygon object at 0x10e220940>
1765	Tulare	California	06	107	06107	6	107	6107	3.364944	...	1.480503	0.303303	0.380229	0.395059	0.410811	10.398757	9.5	11.957928	16.303423	<pysal.cg.shapes.Polygon object at 0x10e2a4a58>
1956	Kern	California	06	029	06029	6	29	6029	6.393044	...	5.544117	0.273083	0.364603	0.380617	0.395393	10.225236	10.4	12.037755	16.118827	<pysal.cg.shapes.Polygon object at 0x10e348fd0>
1957	San Bernardino	California	06	071	06071	6	71	6071	3.243373	...	8.103188	0.259038	0.350927	0.362389	0.372659	9.857519	10.0	12.999831	15.403925	<pysal.cg.shapes.Polygon object at 0x10e36f080>
2117	Ventura	California	06	111	06111	6	111	6111	3.180374	...	2.336118	0.259643	0.325908	0.350430	0.329175	10.304761	8.9	11.047165	12.044930	<pysal.cg.shapes.Polygon object at 0x10e40c390>
2255	Riverside	California	06	065	06065	6	65	6065	4.898903	...	5.433210	0.284000	0.376737	0.383226	0.368177	9.769959	9.2	11.145612	12.678340	<pysal.cg.shapes.Polygon object at 0x10e49d128>
2279	Orange	California	06	059	06059	6	59	6059	2.083555	...	1.770587	0.255864	0.316731	0.355117	0.330447	8.802545	9.1	12.405423	12.974648	<pysal.cg.shapes.Polygon object at 0x10e49dc18>
2344	San Diego	California	06	073	06073	6	73	6073	2.387842	...	6.377301	0.269147	0.352428	0.382815	0.368301	11.425041	11.5	14.384523	15.489702	<pysal.cg.shapes.Polygon object at 0x10e4ea908>
2351	Los Angeles	California	06	037	06037	6	37	6037	4.564947	...	11.203381	0.267207	0.353837	0.402031	0.392669	13.071850	13.4	17.767317	18.808224	<pysal.cg.shapes.Polygon object at 0x10e4eac18>
2358	Imperial	California	06	025	06025	6	25	6025	4.160599	...	2.398836	0.286568	0.374913	0.397430	0.426348	11.466417	9.7	12.590249	17.243741	<pysal.cg.shapes.Polygon object at 0x10e4eaf98>

12 rows × 70 columns

len(cal_counties)

This works on any type of spatial query.

For instance, if we wanted to find all of the counties that are within a threshold distance from an observation's centroid, we can do it in the following way.

But first, we need to handle distance calculations on the earth's surface.

from math import radians, sin, cos, sqrt, asin

def gcd(loc1, loc2, R=3961):
    """Great circle distance via Haversine formula

    Parameters
    ----------

    loc1: tuple (long, lat in decimal degrees)

    loc2: tuple (long, lat in decimal degrees)

    R: Radius of the earth (3961 miles, 6367 km)

    Returns
    -------
    great circle distance between loc1 and loc2 in units of R


    Notes
    ------
    Does not take into account non-spheroidal shape of the Earth



    >>> san_diego = -117.1611, 32.7157
    >>> austin = -97.7431, 30.2672
    >>> gcd(san_diego, austin)
    1155.474644164695


    """
    lon1, lat1 = loc1
    lon2, lat2 = loc2
    dLat = radians(lat2 - lat1)
    dLon = radians(lon2 - lon1)
    lat1 = radians(lat1)
    lat2 = radians(lat2)

    a = sin(dLat/2)**2 + cos(lat1)*cos(lat2)*sin(dLon/2)**2
    c = 2*asin(sqrt(a))

    return R * c

def gcdm(loc1, loc2):
    return gcd(loc1, loc2)

def gcdk(loc1, loc2):
    return gcd(loc1, loc2, 6367 )

san_diego = -117.1611, 32.7157
austin = -97.7431, 30.2672
gcd(san_diego, austin)

1155.474644164695

gcdk(san_diego, austin)

1857.3357887898544

loc1 = (-117.1611, 0.0)
loc2 = (-118.1611, 0.0)
gcd(loc1, loc2)

69.13249167149539

loc1 = (-117.1611, 45.0)
loc2 = (-118.1611, 45.0)
gcd(loc1, loc2)

48.88374342930467

loc1 = (-117.1611, 89.0)
loc2 = (-118.1611, 89.0)
gcd(loc1, loc2)

1.2065130336642724

lats = range(0, 91)
onedeglon = [ gcd((-117.1611,lat),(-118.1611,lat)) for lat in lats]

import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(lats, onedeglon)
plt.ylabel('miles')
plt.xlabel('degree of latitude')
plt.title('Length of a degree of longitude')

<matplotlib.text.Text at 0x114174470>

san_diego = -117.1611, 32.7157
austin = -97.7431, 30.2672
gcd(san_diego, austin)

1155.474644164695

Now we can use our distance function to pose distance-related queries on our data table.

# Find all the counties with centroids within 50 miles of Austin
def near_target_point(polygon, target=austin, threshold=50):
    return gcd(polygon.centroid, target) < threshold 

data_table[data_table.geometry.apply(near_target_point)]

	NAME	STATE_NAME	STATE_FIPS	CNTY_FIPS	FIPS	STFIPS	COFIPS	FIPSNO	SOUTH	HR60	...	BLK90	GI59	GI69	GI79	GI89	FH60	FH70	FH80	FH90	geometry
2698	Burnet	Texas	48	053	48053	48	53	48053	1	0.000000	...	1.186224	0.327508	0.449285	0.385079	0.405890	10.774142	6.5	7.115629	10.568742	<pysal.cg.shapes.Polygon object at 0x10e6567f0>
2716	Williamson	Texas	48	491	48491	48	491	48491	1	9.511852	...	4.916482	0.363603	0.379902	0.341976	0.345201	13.532731	9.0	7.582572	12.032589	<pysal.cg.shapes.Polygon object at 0x10e656fd0>
2742	Travis	Texas	48	453	48453	48	453	48453	1	4.242561	...	10.959791	0.299292	0.372293	0.378953	0.388149	12.976379	10.9	14.459691	17.307113	<pysal.cg.shapes.Polygon object at 0x10e67eba8>
2751	Lee	Texas	48	287	48287	48	287	48287	1	7.449622	...	13.847829	0.376002	0.433132	0.394000	0.394959	12.305699	10.1	8.875542	10.530896	<pysal.cg.shapes.Polygon object at 0x10e67eef0>
2754	Blanco	Texas	48	031	48031	48	31	48031	1	0.000000	...	0.937709	0.369814	0.436449	0.394609	0.394414	9.365854	6.0	8.074074	9.080119	<pysal.cg.shapes.Polygon object at 0x10e6aa160>
2762	Bastrop	Texas	48	021	48021	48	21	48021	1	3.938946	...	11.792071	0.370264	0.419933	0.390927	0.380907	14.747191	12.5	10.559006	12.281387	<pysal.cg.shapes.Polygon object at 0x10e6aa518>
2769	Hays	Texas	48	209	48209	48	209	48209	1	5.016555	...	3.383424	0.385061	0.424973	0.378992	0.372266	13.064187	9.4	10.003691	10.337188	<pysal.cg.shapes.Polygon object at 0x10e6aa908>
2795	Caldwell	Texas	48	055	48055	48	55	48055	1	9.677544	...	10.704001	0.380080	0.412614	0.410802	0.413940	15.229616	10.5	12.894034	17.191502	<pysal.cg.shapes.Polygon object at 0x10e6d04a8>
2798	Comal	Texas	48	091	48091	48	91	48091	1	3.359538	...	0.854684	0.274182	0.359174	0.375810	0.380032	11.315107	7.6	8.693149	9.104427	<pysal.cg.shapes.Polygon object at 0x10e6d05f8>
2808	Guadalupe	Texas	48	187	48187	48	187	48187	1	5.743759	...	5.649500	0.340374	0.388821	0.354559	0.378073	11.758921	9.6	10.418149	12.918448	<pysal.cg.shapes.Polygon object at 0x10e6d0a58>

10 rows × 70 columns

Moving in and out of the dataframe

Most things in PySAL will be explicit about what type their input should be. Most of the time, PySAL functions require either lists or arrays. This is why the file-handler methods are the default IO method in PySAL: the rest of the computational tools are built around their datatypes.

However, it is very easy to get the correct datatype from Pandas using the values and tolist commands.

tolist() will convert its entries to a list. But, it can only be called on individual columns (called Series in pandas documentation).

So, to turn the NAME column into a list:

data_table.NAME.tolist()[0:10]

['Lake of the Woods',
 'Ferry',
 'Stevens',
 'Okanogan',
 'Pend Oreille',
 'Boundary',
 'Lincoln',
 'Flathead',
 'Glacier',
 'Toole']

To extract many columns, you must select the columns you want and call their .values attribute.

If we were interested in grabbing all of the HR variables in the dataframe, we could first select those column names:

HRs = [col for col in data_table.columns if col.startswith('HR')]
HRs

['HR60', 'HR70', 'HR80', 'HR90']

We can use this to focus only on the columns we want:

data_table[HRs].head()

	HR60	HR70	HR80	HR90
0	0.000000	0.000000	8.855827	0.000000
1	0.000000	0.000000	17.208742	15.885624
2	1.863863	1.915158	3.450775	6.462453
3	2.612330	1.288643	3.263814	6.996502
4	0.000000	0.000000	7.770008	7.478033

With this, calling .values gives an array containing all of the entries in this subset of the table:

data_table[['HR90', 'HR80']].values

array([[  0.        ,   8.85582713],
       [ 15.88562351,  17.20874204],
       [  6.46245315,   3.4507747 ],
       ..., 
       [  4.36732988,   5.2803488 ],
       [  3.72771194,   3.00003   ],
       [  2.04885495,   1.19474313]])

Using the PySAL pdio tools means that if you're comfortable with working in Pandas, you can continue to do so.

If you're more comfortable using Numpy or raw Python to do your data processing, PySAL's IO tools naturally support this.

Exercises

Find the county with the western most centroid that is within 1000 miles of Austin.
Find the distance between Austin and that centroid.

Spatial data processing with PySAL