Reading & Manipulating Tabular Data¶
import pandas
import numpy as np
from sqlalchemy import create_engine
To note:
matplotlib
magicLibrary import and alias
Data¶
Read in
.csv
Assuming you have the file downloaded on the path ../data/
:
db = pandas.read_csv("../data/paris_abb.csv.zip")
If you’re online, you can do:
db = pandas.read_csv("https://github.com/darribas/data_science_studio/raw/master/content/data/paris_abb.csv.zip")
Connect to
SQLite
and query
engine = create_engine("sqlite:///../data/paris_abb_mini.db")
qry = """
SELECT *
FROM db
LIMIT 5;
"""
dbs = pandas.read_sql(qry, engine)
dbs
id | neighbourhood_cleansed | property_type | room_type | accommodates | bathrooms | bedrooms | beds | bed_type | Price | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 3109 | Observatoire | Apartment | Entire home/apt | 2 | 1.0 | 0.0 | 1.0 | Real Bed | 60.0 |
1 | 5396 | Hôtel-de-Ville | Apartment | Entire home/apt | 2 | 1.0 | 0.0 | 1.0 | Pull-out Sofa | 115.0 |
2 | 7397 | Hôtel-de-Ville | Apartment | Entire home/apt | 4 | 1.0 | 2.0 | 2.0 | Real Bed | 119.0 |
3 | 7964 | Opéra | Apartment | Entire home/apt | 2 | 1.0 | 1.0 | 1.0 | Real Bed | 130.0 |
4 | 9952 | Popincourt | Apartment | Entire home/apt | 2 | 1.0 | 1.0 | 1.0 | Real Bed | 75.0 |
qry = """
SELECT id, Price
FROM db
WHERE neighbourhood_cleansed = 'Popincourt';
"""
dbs = pandas.read_sql(qry, engine)
dbs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5137 entries, 0 to 5136
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 5137 non-null int64
1 Price 5137 non-null float64
dtypes: float64(1), int64(1)
memory usage: 80.4 KB
Explore
db.head()
id | neighbourhood_cleansed | property_type | room_type | accommodates | bathrooms | bedrooms | beds | bed_type | Price | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 3109 | Observatoire | Apartment | Entire home/apt | 2 | 1.0 | 0.0 | 1.0 | Real Bed | 60.0 |
1 | 5396 | Hôtel-de-Ville | Apartment | Entire home/apt | 2 | 1.0 | 0.0 | 1.0 | Pull-out Sofa | 115.0 |
2 | 7397 | Hôtel-de-Ville | Apartment | Entire home/apt | 4 | 1.0 | 2.0 | 2.0 | Real Bed | 119.0 |
3 | 7964 | Opéra | Apartment | Entire home/apt | 2 | 1.0 | 1.0 | 1.0 | Real Bed | 130.0 |
4 | 9952 | Popincourt | Apartment | Entire home/apt | 2 | 1.0 | 1.0 | 1.0 | Real Bed | 75.0 |
db.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50280 entries, 0 to 50279
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 50280 non-null int64
1 neighbourhood_cleansed 50280 non-null object
2 property_type 50280 non-null object
3 room_type 50280 non-null object
4 accommodates 50280 non-null int64
5 bathrooms 50280 non-null float64
6 bedrooms 50280 non-null float64
7 beds 50280 non-null float64
8 bed_type 50280 non-null object
9 Price 50280 non-null float64
dtypes: float64(4), int64(2), object(4)
memory usage: 3.8+ MB
db.describe()
id | accommodates | bathrooms | bedrooms | beds | Price | |
---|---|---|---|---|---|---|
count | 5.028000e+04 | 50280.000000 | 50280.000000 | 50280.000000 | 50280.000000 | 50280.000000 |
mean | 1.968671e+07 | 3.063425 | 1.117045 | 1.077804 | 1.664698 | 110.630549 |
std | 1.140675e+07 | 1.556731 | 0.687370 | 1.001000 | 1.164917 | 157.848189 |
min | 3.109000e+03 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 9.340209e+06 | 2.000000 | 1.000000 | 1.000000 | 1.000000 | 60.000000 |
50% | 1.999153e+07 | 2.000000 | 1.000000 | 1.000000 | 1.000000 | 80.000000 |
75% | 2.987492e+07 | 4.000000 | 1.000000 | 1.000000 | 2.000000 | 120.000000 |
max | 3.858893e+07 | 17.000000 | 50.000000 | 50.000000 | 50.000000 | 10000.000000 |
DataFrame
objects can hold different types of data
“Whole” numbers (
int
)Decimals (
float
)Categorical (
pandas.Category
)Dates (
pandas.Timestamp
)Geo (
geopandas.GeoDataFrame
)…
Indices¶
db.head(2)
id | neighbourhood_cleansed | property_type | room_type | accommodates | bathrooms | bedrooms | beds | bed_type | Price | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 3109 | Observatoire | Apartment | Entire home/apt | 2 | 1.0 | 0.0 | 1.0 | Real Bed | 60.0 |
1 | 5396 | Hôtel-de-Ville | Apartment | Entire home/apt | 2 | 1.0 | 0.0 | 1.0 | Pull-out Sofa | 115.0 |
db.set_index("id").head(2)
neighbourhood_cleansed | property_type | room_type | accommodates | bathrooms | bedrooms | beds | bed_type | Price | |
---|---|---|---|---|---|---|---|---|---|
id | |||||||||
3109 | Observatoire | Apartment | Entire home/apt | 2 | 1.0 | 0.0 | 1.0 | Real Bed | 60.0 |
5396 | Hôtel-de-Ville | Apartment | Entire home/apt | 2 | 1.0 | 0.0 | 1.0 | Pull-out Sofa | 115.0 |
dbi = db.set_index("id")
dbi.index
Int64Index([ 3109, 5396, 7397, 7964, 9952, 10710,
11170, 11213, 11265, 11798,
...
38485563, 38489275, 38513797, 38516223, 38516341, 38517692,
38520175, 38537044, 38546594, 38588929],
dtype='int64', name='id', length=50280)
dbi.columns
Index(['neighbourhood_cleansed', 'property_type', 'room_type', 'accommodates',
'bathrooms', 'bedrooms', 'beds', 'bed_type', 'Price'],
dtype='object')
Slicing and Dicing Data¶
Index-based queries¶
Columns
db["Price"].head()
0 60.0
1 115.0
2 119.0
3 130.0
4 75.0
Name: Price, dtype: float64
Generic point-querying
db.loc[0, "Price"]
60.0
Full slice of one dimension
db.loc[0, :]
id 3109
neighbourhood_cleansed Observatoire
property_type Apartment
room_type Entire home/apt
accommodates 2
bathrooms 1
bedrooms 0
beds 1
bed_type Real Bed
Price 60
Name: 0, dtype: object
db.loc[:, "Price"]
0 60.0
1 115.0
2 119.0
3 130.0
4 75.0
...
50275 250.0
50276 40.0
50277 60.0
50278 65.0
50279 69.0
Name: Price, Length: 50280, dtype: float64
Range queries
db.loc[0:5, "neighbourhood_cleansed"]
0 Observatoire
1 Hôtel-de-Ville
2 Hôtel-de-Ville
3 Opéra
4 Popincourt
5 Élysée
Name: neighbourhood_cleansed, dtype: object
db.loc[5, "property_type":"bed_type"]
property_type Apartment
room_type Entire home/apt
accommodates 4
bathrooms 1
bedrooms 1
beds 2
bed_type Real Bed
Name: 5, dtype: object
List-based queries
db.loc[[0, 49, 19, 29, 39, 9], ["Price", "id"]]
Price | id | |
---|---|---|
0 | 60.0 | 3109 |
49 | 128.0 | 32082 |
19 | 65.0 | 17919 |
29 | 80.0 | 21264 |
39 | 49.0 | 26567 |
9 | 120.0 | 11798 |
Order-based queries¶
dbi.iloc[0:5, :]
neighbourhood_cleansed | property_type | room_type | accommodates | bathrooms | bedrooms | beds | bed_type | Price | |
---|---|---|---|---|---|---|---|---|---|
id | |||||||||
3109 | Observatoire | Apartment | Entire home/apt | 2 | 1.0 | 0.0 | 1.0 | Real Bed | 60.0 |
5396 | Hôtel-de-Ville | Apartment | Entire home/apt | 2 | 1.0 | 0.0 | 1.0 | Pull-out Sofa | 115.0 |
7397 | Hôtel-de-Ville | Apartment | Entire home/apt | 4 | 1.0 | 2.0 | 2.0 | Real Bed | 119.0 |
7964 | Opéra | Apartment | Entire home/apt | 2 | 1.0 | 1.0 | 1.0 | Real Bed | 130.0 |
9952 | Popincourt | Apartment | Entire home/apt | 2 | 1.0 | 1.0 | 1.0 | Real Bed | 75.0 |
EXERCISE:
Slice the dataset to keep only properties with the following IDs, in that order:
38520175
,619716
, and37847454
Extract the section of the dataset that includes the 50th to the 100th rows, and the
room_type
andbedrooms
columns
Conditional queries¶
Using
loc
db.loc[db["neighbourhood_cleansed"] == "Observatoire",
["neighbourhood_cleansed", "Price"]]\
.head()
neighbourhood_cleansed | Price | |
---|---|---|
0 | Observatoire | 60.0 |
47 | Observatoire | 90.0 |
52 | Observatoire | 150.0 |
104 | Observatoire | 84.0 |
144 | Observatoire | 140.0 |
db.loc[db["Price"] < 100, ["id", "neighbourhood_cleansed"]].head()
id | neighbourhood_cleansed | |
---|---|---|
0 | 3109 | Observatoire |
4 | 9952 | Popincourt |
5 | 10710 | Élysée |
6 | 11170 | Panthéon |
10 | 11848 | Popincourt |
db.loc[(db["Price"] < 100) & \
(db["bathrooms"] >= 8),
:]
id | neighbourhood_cleansed | property_type | room_type | accommodates | bathrooms | bedrooms | beds | bed_type | Price | |
---|---|---|---|---|---|---|---|---|---|---|
12066 | 8876983 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 16.0 | Real Bed | 85.0 |
22964 | 18766792 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 17.0 | Real Bed | 85.0 |
24788 | 19819352 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 20.0 | Real Bed | 85.0 |
25798 | 20433587 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 50.0 | Real Bed | 85.0 |
26048 | 20691340 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 50.0 | Real Bed | 85.0 |
26108 | 20747725 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 50.0 | Real Bed | 85.0 |
26121 | 20768013 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 50.0 | Real Bed | 85.0 |
Conditional filters
fltr = db["bathrooms"] > 8
fltr.head()
0 False
1 False
2 False
3 False
4 False
Name: bathrooms, dtype: bool
db[fltr]
id | neighbourhood_cleansed | property_type | room_type | accommodates | bathrooms | bedrooms | beds | bed_type | Price | |
---|---|---|---|---|---|---|---|---|---|---|
12066 | 8876983 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 16.0 | Real Bed | 85.0 |
22964 | 18766792 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 17.0 | Real Bed | 85.0 |
24788 | 19819352 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 20.0 | Real Bed | 85.0 |
25798 | 20433587 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 50.0 | Real Bed | 85.0 |
26048 | 20691340 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 50.0 | Real Bed | 85.0 |
26108 | 20747725 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 50.0 | Real Bed | 85.0 |
26121 | 20768013 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 50.0 | Real Bed | 85.0 |
Concatenated queries
db.loc[(db["Price"] < 100) & \
(db["bathrooms"] >= 8),
:]
id | neighbourhood_cleansed | property_type | room_type | accommodates | bathrooms | bedrooms | beds | bed_type | Price | |
---|---|---|---|---|---|---|---|---|---|---|
12066 | 8876983 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 16.0 | Real Bed | 85.0 |
22964 | 18766792 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 17.0 | Real Bed | 85.0 |
24788 | 19819352 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 20.0 | Real Bed | 85.0 |
25798 | 20433587 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 50.0 | Real Bed | 85.0 |
26048 | 20691340 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 50.0 | Real Bed | 85.0 |
26108 | 20747725 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 50.0 | Real Bed | 85.0 |
26121 | 20768013 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 50.0 | Real Bed | 85.0 |
db.loc[(db["Price"] < 5) | \
(db["Price"] > 5000),
:]
id | neighbourhood_cleansed | property_type | room_type | accommodates | bathrooms | bedrooms | beds | bed_type | Price | |
---|---|---|---|---|---|---|---|---|---|---|
8149 | 6088687 | Temple | Apartment | Entire home/apt | 2 | 1.0 | 1.0 | 1.0 | Real Bed | 8500.0 |
10113 | 7225849 | Buttes-Montmartre | Apartment | Entire home/apt | 8 | 2.0 | 2.0 | 5.0 | Real Bed | 9379.0 |
11286 | 8093890 | Passy | Apartment | Entire home/apt | 3 | 1.0 | 2.0 | 2.0 | Real Bed | 8721.0 |
25106 | 19974916 | Buttes-Montmartre | Condominium | Entire home/apt | 4 | 1.0 | 1.0 | 2.0 | Real Bed | 0.0 |
25558 | 20219162 | Buttes-Chaumont | Apartment | Entire home/apt | 1 | 1.0 | 0.0 | 1.0 | Real Bed | 0.0 |
25676 | 20291987 | Passy | Apartment | Entire home/apt | 3 | 1.0 | 0.0 | 2.0 | Real Bed | 0.0 |
25697 | 20313940 | Temple | Apartment | Entire home/apt | 2 | 1.0 | 0.0 | 1.0 | Real Bed | 0.0 |
26899 | 21422028 | Popincourt | Apartment | Entire home/apt | 3 | 2.0 | 2.0 | 3.0 | Real Bed | 0.0 |
32190 | 25448670 | Vaugirard | Apartment | Entire home/apt | 1 | 1.0 | 0.0 | 1.0 | Real Bed | 8576.0 |
35291 | 27608896 | Observatoire | Apartment | Private room | 1 | 1.0 | 0.0 | 1.0 | Real Bed | 9059.0 |
43961 | 34380017 | Bourse | Serviced apartment | Hotel room | 8 | 2.5 | 3.0 | 4.0 | Real Bed | 5150.0 |
46767 | 36019554 | Louvre | Serviced apartment | Hotel room | 8 | 2.5 | 2.0 | 4.0 | Real Bed | 8000.0 |
47574 | 36402651 | Batignolles-Monceau | House | Entire home/apt | 12 | 3.0 | 4.0 | 5.0 | Real Bed | 10000.0 |
Using
query
db.query("neighbourhood_cleansed == 'Observatoire'")\
[["neighbourhood_cleansed", "Price"]]\
.head()
neighbourhood_cleansed | Price | |
---|---|---|
0 | Observatoire | 60.0 |
47 | Observatoire | 90.0 |
52 | Observatoire | 150.0 |
104 | Observatoire | 84.0 |
144 | Observatoire | 140.0 |
db.query("Price < 100")\
[["id", "neighbourhood_cleansed"]]\
.head()
id | neighbourhood_cleansed | |
---|---|---|
0 | 3109 | Observatoire |
4 | 9952 | Popincourt |
5 | 10710 | Élysée |
6 | 11170 | Panthéon |
10 | 11848 | Popincourt |
db.query("(Price < 100) & (bathrooms >= 8)")
id | neighbourhood_cleansed | property_type | room_type | accommodates | bathrooms | bedrooms | beds | bed_type | Price | |
---|---|---|---|---|---|---|---|---|---|---|
12066 | 8876983 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 16.0 | Real Bed | 85.0 |
22964 | 18766792 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 17.0 | Real Bed | 85.0 |
24788 | 19819352 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 20.0 | Real Bed | 85.0 |
25798 | 20433587 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 50.0 | Real Bed | 85.0 |
26048 | 20691340 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 50.0 | Real Bed | 85.0 |
26108 | 20747725 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 50.0 | Real Bed | 85.0 |
26121 | 20768013 | Luxembourg | Boutique hotel | Private room | 16 | 50.0 | 50.0 | 50.0 | Real Bed | 85.0 |
Editing tables¶
Modifying single values
db.loc[1, "bed_type"]
'Pull-out Sofa'
db.loc[1, "bed_type"] = "Pullout Sofa"
db.loc[1, "bed_type"]
'Pullout Sofa'
Modifying blocks of values
db.neighbourhood_cleansed.unique()
array(['Observatoire', 'Hôtel-de-Ville', 'Opéra', 'Popincourt', 'Élysée',
'Panthéon', 'Entrepôt', 'Buttes-Montmartre', 'Gobelins',
'Buttes-Chaumont', 'Luxembourg', 'Louvre', 'Palais-Bourbon',
'Reuilly', 'Bourse', 'Ménilmontant', 'Vaugirard',
'Batignolles-Monceau', 'Temple', 'Passy'], dtype=object)
db.loc[db["neighbourhood_cleansed"] == "Hôtel-de-Ville",
"neighbourhood_cleansed"] = "City Council"
db.neighbourhood_cleansed.unique()
array(['Observatoire', 'City Council', 'Opéra', 'Popincourt', 'Élysée',
'Panthéon', 'Entrepôt', 'Buttes-Montmartre', 'Gobelins',
'Buttes-Chaumont', 'Luxembourg', 'Louvre', 'Palais-Bourbon',
'Reuilly', 'Bourse', 'Ménilmontant', 'Vaugirard',
'Batignolles-Monceau', 'Temple', 'Passy'], dtype=object)
Creating new columns and rows
db["more_beds_than_accomodates"] = db["beds"] > db["accommodates"]
EXERCISE
Find how many properties have more than ten bathrooms
Can you rent an AirBnb in Paris with only one bed but three bedrooms?
In which neighbourhoods can you rent an “Airbed”?
Writing Data¶
[i for i in dir(db) if i[:3]=="to_"]
['to_clipboard',
'to_csv',
'to_dict',
'to_excel',
'to_feather',
'to_gbq',
'to_hdf',
'to_html',
'to_json',
'to_latex',
'to_markdown',
'to_numpy',
'to_parquet',
'to_period',
'to_pickle',
'to_records',
'to_sql',
'to_stata',
'to_string',
'to_timestamp',
'to_xarray']