Advanced Tabular Manipulation¶
%matplotlib inline
import pandas
Assuming you have the file downloaded on the path ../data/
:
db = pandas.read_csv("../data/paris_abb.csv.zip")
If you’re online, you can do:
db = pandas.read_csv("https://github.com/darribas/data_science_studio/raw/master/content/data/paris_abb.csv.zip")
Sorting¶
By values
# Top-5 cheapes properties
db.sort_values("Price")\
.head(5)
id | neighbourhood_cleansed | property_type | room_type | accommodates | bathrooms | bedrooms | beds | bed_type | Price | |
---|---|---|---|---|---|---|---|---|---|---|
25676 | 20291987 | Passy | Apartment | Entire home/apt | 3 | 1.0 | 0.0 | 2.0 | Real Bed | 0.0 |
25697 | 20313940 | Temple | Apartment | Entire home/apt | 2 | 1.0 | 0.0 | 1.0 | Real Bed | 0.0 |
26899 | 21422028 | Popincourt | Apartment | Entire home/apt | 3 | 2.0 | 2.0 | 3.0 | Real Bed | 0.0 |
25558 | 20219162 | Buttes-Chaumont | Apartment | Entire home/apt | 1 | 1.0 | 0.0 | 1.0 | Real Bed | 0.0 |
25106 | 19974916 | Buttes-Montmartre | Condominium | Entire home/apt | 4 | 1.0 | 1.0 | 2.0 | Real Bed | 0.0 |
# Top-5 most expensive properties
db.sort_values("Price", ascending=False)\
.head(5)
id | neighbourhood_cleansed | property_type | room_type | accommodates | bathrooms | bedrooms | beds | bed_type | Price | |
---|---|---|---|---|---|---|---|---|---|---|
47574 | 36402651 | Batignolles-Monceau | House | Entire home/apt | 12 | 3.0 | 4.0 | 5.0 | Real Bed | 10000.0 |
10113 | 7225849 | Buttes-Montmartre | Apartment | Entire home/apt | 8 | 2.0 | 2.0 | 5.0 | Real Bed | 9379.0 |
35291 | 27608896 | Observatoire | Apartment | Private room | 1 | 1.0 | 0.0 | 1.0 | Real Bed | 9059.0 |
11286 | 8093890 | Passy | Apartment | Entire home/apt | 3 | 1.0 | 2.0 | 2.0 | Real Bed | 8721.0 |
32190 | 25448670 | Vaugirard | Apartment | Entire home/apt | 1 | 1.0 | 0.0 | 1.0 | Real Bed | 8576.0 |
By index
tmp = db.set_index("property_type")\
.sort_index()
tmp.head()
id | neighbourhood_cleansed | room_type | accommodates | bathrooms | bedrooms | beds | bed_type | Price | |
---|---|---|---|---|---|---|---|---|---|
property_type | |||||||||
Aparthotel | 31728955 | Vaugirard | Hotel room | 2 | 1.0 | 1.0 | 1.0 | Real Bed | 141.0 |
Aparthotel | 13003443 | Vaugirard | Entire home/apt | 4 | 1.0 | 0.0 | 3.0 | Real Bed | 75.0 |
Aparthotel | 31733851 | Temple | Hotel room | 4 | 1.0 | 1.0 | 3.0 | Real Bed | 141.0 |
Aparthotel | 17630233 | Bourse | Hotel room | 4 | 1.0 | 2.0 | 2.0 | Real Bed | 330.0 |
Aparthotel | 24387710 | Observatoire | Private room | 12 | 5.0 | 5.0 | 11.0 | Real Bed | 698.0 |
(Useful for quick subsetting:)
tmp.loc["Tiny house",
["id", "neighbourhood_cleansed", "Price"]
]
id | neighbourhood_cleansed | Price | |
---|---|---|---|
property_type | |||
Tiny house | 20976623 | Panthéon | 110.0 |
Tiny house | 6838781 | Entrepôt | 60.0 |
Tiny house | 29322690 | Passy | 30.0 |
Tiny house | 18919023 | Louvre | 100.0 |
Tiny house | 4191080 | Louvre | 80.0 |
Tiny house | 9582468 | Popincourt | 80.0 |
Tiny house | 37312602 | Ménilmontant | 40.0 |
Tiny house | 2555221 | Buttes-Montmartre | 114.0 |
Tiny house | 34572791 | Passy | 35.0 |
Tiny house | 37174177 | Vaugirard | 50.0 |
Tiny house | 26292760 | Ménilmontant | 75.0 |
Tiny house | 36864294 | Passy | 38.0 |
Tiny house | 20273867 | Panthéon | 100.0 |
Tiny house | 34816113 | Observatoire | 40.0 |
Tiny house | 768986 | Temple | 79.0 |
Tiny house | 13220094 | Buttes-Chaumont | 160.0 |
Tiny house | 5699102 | Buttes-Montmartre | 71.0 |
Tiny house | 36048024 | Buttes-Montmartre | 70.0 |
Tiny house | 20688679 | Buttes-Montmartre | 73.0 |
Tiny house | 15157047 | Temple | 42.0 |
Tiny house | 12588228 | Passy | 75.0 |
Joinning¶
Additional data (linked through ID!)
reviews = pandas.read_csv("../data/paris_abb_review.csv.zip")
reviews.head()
id | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | |
---|---|---|---|---|---|---|---|---|
0 | 3109 | 100.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |
1 | 5396 | 90.0 | 9.0 | 8.0 | 9.0 | 9.0 | 10.0 | 8.0 |
2 | 7397 | 94.0 | 10.0 | 9.0 | 10.0 | 10.0 | 10.0 | 10.0 |
3 | 7964 | 96.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |
4 | 9952 | 98.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |
Join to original table:
dbj1 = db.join(reviews.set_index("id"),
on = "id"
)
dbj1.head()
id | neighbourhood_cleansed | property_type | room_type | accommodates | bathrooms | bedrooms | beds | bed_type | Price | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3109 | Observatoire | Apartment | Entire home/apt | 2 | 1.0 | 0.0 | 1.0 | Real Bed | 60.0 | 100.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |
1 | 5396 | Hôtel-de-Ville | Apartment | Entire home/apt | 2 | 1.0 | 0.0 | 1.0 | Pull-out Sofa | 115.0 | 90.0 | 9.0 | 8.0 | 9.0 | 9.0 | 10.0 | 8.0 |
2 | 7397 | Hôtel-de-Ville | Apartment | Entire home/apt | 4 | 1.0 | 2.0 | 2.0 | Real Bed | 119.0 | 94.0 | 10.0 | 9.0 | 10.0 | 10.0 | 10.0 | 10.0 |
3 | 7964 | Opéra | Apartment | Entire home/apt | 2 | 1.0 | 1.0 | 1.0 | Real Bed | 130.0 | 96.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |
4 | 9952 | Popincourt | Apartment | Entire home/apt | 2 | 1.0 | 1.0 | 1.0 | Real Bed | 75.0 | 98.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |
dbj2 = db.set_index("id")\
.join(reviews.set_index("id"))
dbj2.head()
neighbourhood_cleansed | property_type | room_type | accommodates | bathrooms | bedrooms | beds | bed_type | Price | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||
3109 | Observatoire | Apartment | Entire home/apt | 2 | 1.0 | 0.0 | 1.0 | Real Bed | 60.0 | 100.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |
5396 | Hôtel-de-Ville | Apartment | Entire home/apt | 2 | 1.0 | 0.0 | 1.0 | Pull-out Sofa | 115.0 | 90.0 | 9.0 | 8.0 | 9.0 | 9.0 | 10.0 | 8.0 |
7397 | Hôtel-de-Ville | Apartment | Entire home/apt | 4 | 1.0 | 2.0 | 2.0 | Real Bed | 119.0 | 94.0 | 10.0 | 9.0 | 10.0 | 10.0 | 10.0 | 10.0 |
7964 | Opéra | Apartment | Entire home/apt | 2 | 1.0 | 1.0 | 1.0 | Real Bed | 130.0 | 96.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |
9952 | Popincourt | Apartment | Entire home/apt | 2 | 1.0 | 1.0 | 1.0 | Real Bed | 75.0 | 98.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |
Note:
Left can choose index/column, right needs to be index (results “almost” the same)
For more flexibility, check out
merge
:
Grouping¶
Get the mean price for “Louvre”:
db.query("neighbourhood_cleansed == 'Louvre'")\
["Price"]\
.mean()
175.2634194831014
Get the mean price for “Luxembourg”:
db.query("neighbourhood_cleansed == 'Luxembourg'")\
["Price"]\
.mean()
160.02262142381903
Get the mean price for “Palais-Bourbon”:
db.query("neighbourhood_cleansed == 'Palais-Bourbon'")\
["Price"]\
.mean()
169.8526690391459
For every neighbourhood???
db.groupby("neighbourhood_cleansed")\
["Price"]\
.mean()
neighbourhood_cleansed
Batignolles-Monceau 103.355239
Bourse 138.898017
Buttes-Chaumont 76.971745
Buttes-Montmartre 84.867211
Entrepôt 99.130493
Gobelins 80.000000
Hôtel-de-Ville 152.547723
Louvre 175.263419
Luxembourg 160.022621
Ménilmontant 71.074504
Observatoire 98.907340
Opéra 122.495921
Palais-Bourbon 169.852669
Panthéon 124.683662
Passy 149.237946
Popincourt 90.832782
Reuilly 84.225256
Temple 147.904592
Vaugirard 106.269580
Élysée 194.158416
Name: Price, dtype: float64
MultiIndex
Tables¶
Grouping can be based on more than one variable only…
nr = dbj1.groupby(["neighbourhood_cleansed", "room_type"])\
[["Price", "review_scores_rating"]]\
.mean()
This generates a MultiIndex
:
nr.head()
Price | review_scores_rating | ||
---|---|---|---|
neighbourhood_cleansed | room_type | ||
Batignolles-Monceau | Entire home/apt | 103.501922 | 92.484096 |
Hotel room | 343.660714 | 93.607143 | |
Private room | 58.698305 | 95.064407 | |
Shared room | 53.142857 | 90.214286 | |
Bourse | Entire home/apt | 139.618020 | 91.608503 |
nr.index
MultiIndex([('Batignolles-Monceau', 'Entire home/apt'),
('Batignolles-Monceau', 'Hotel room'),
('Batignolles-Monceau', 'Private room'),
('Batignolles-Monceau', 'Shared room'),
( 'Bourse', 'Entire home/apt'),
( 'Bourse', 'Hotel room'),
( 'Bourse', 'Private room'),
( 'Bourse', 'Shared room'),
( 'Buttes-Chaumont', 'Entire home/apt'),
( 'Buttes-Chaumont', 'Hotel room'),
( 'Buttes-Chaumont', 'Private room'),
( 'Buttes-Chaumont', 'Shared room'),
( 'Buttes-Montmartre', 'Entire home/apt'),
( 'Buttes-Montmartre', 'Hotel room'),
( 'Buttes-Montmartre', 'Private room'),
( 'Buttes-Montmartre', 'Shared room'),
( 'Entrepôt', 'Entire home/apt'),
( 'Entrepôt', 'Hotel room'),
( 'Entrepôt', 'Private room'),
( 'Entrepôt', 'Shared room'),
( 'Gobelins', 'Entire home/apt'),
( 'Gobelins', 'Hotel room'),
( 'Gobelins', 'Private room'),
( 'Gobelins', 'Shared room'),
( 'Hôtel-de-Ville', 'Entire home/apt'),
( 'Hôtel-de-Ville', 'Hotel room'),
( 'Hôtel-de-Ville', 'Private room'),
( 'Hôtel-de-Ville', 'Shared room'),
( 'Louvre', 'Entire home/apt'),
( 'Louvre', 'Hotel room'),
( 'Louvre', 'Private room'),
( 'Louvre', 'Shared room'),
( 'Luxembourg', 'Entire home/apt'),
( 'Luxembourg', 'Hotel room'),
( 'Luxembourg', 'Private room'),
( 'Luxembourg', 'Shared room'),
( 'Ménilmontant', 'Entire home/apt'),
( 'Ménilmontant', 'Hotel room'),
( 'Ménilmontant', 'Private room'),
( 'Ménilmontant', 'Shared room'),
( 'Observatoire', 'Entire home/apt'),
( 'Observatoire', 'Hotel room'),
( 'Observatoire', 'Private room'),
( 'Observatoire', 'Shared room'),
( 'Opéra', 'Entire home/apt'),
( 'Opéra', 'Hotel room'),
( 'Opéra', 'Private room'),
( 'Opéra', 'Shared room'),
( 'Palais-Bourbon', 'Entire home/apt'),
( 'Palais-Bourbon', 'Hotel room'),
( 'Palais-Bourbon', 'Private room'),
( 'Palais-Bourbon', 'Shared room'),
( 'Panthéon', 'Entire home/apt'),
( 'Panthéon', 'Hotel room'),
( 'Panthéon', 'Private room'),
( 'Panthéon', 'Shared room'),
( 'Passy', 'Entire home/apt'),
( 'Passy', 'Hotel room'),
( 'Passy', 'Private room'),
( 'Passy', 'Shared room'),
( 'Popincourt', 'Entire home/apt'),
( 'Popincourt', 'Hotel room'),
( 'Popincourt', 'Private room'),
( 'Popincourt', 'Shared room'),
( 'Reuilly', 'Entire home/apt'),
( 'Reuilly', 'Hotel room'),
( 'Reuilly', 'Private room'),
( 'Reuilly', 'Shared room'),
( 'Temple', 'Entire home/apt'),
( 'Temple', 'Hotel room'),
( 'Temple', 'Private room'),
( 'Temple', 'Shared room'),
( 'Vaugirard', 'Entire home/apt'),
( 'Vaugirard', 'Hotel room'),
( 'Vaugirard', 'Private room'),
( 'Vaugirard', 'Shared room'),
( 'Élysée', 'Entire home/apt'),
( 'Élysée', 'Hotel room'),
( 'Élysée', 'Private room'),
( 'Élysée', 'Shared room')],
names=['neighbourhood_cleansed', 'room_type'])
These indices allow us to do several more things than single index objects. For example:
One-level queries:
nr.xs("Bourse", level="neighbourhood_cleansed")
Price | review_scores_rating | |
---|---|---|
room_type | ||
Entire home/apt | 139.618020 | 91.608503 |
Hotel room | 321.244898 | 93.367347 |
Private room | 71.088710 | 93.225806 |
Shared room | 35.062500 | 90.875000 |
nr.xs("Shared room", level="room_type")
Price | review_scores_rating | |
---|---|---|
neighbourhood_cleansed | ||
Batignolles-Monceau | 53.142857 | 90.214286 |
Bourse | 35.062500 | 90.875000 |
Buttes-Chaumont | 25.228571 | 90.400000 |
Buttes-Montmartre | 42.555556 | 89.333333 |
Entrepôt | 30.458333 | 90.458333 |
Gobelins | 35.600000 | 92.533333 |
Hôtel-de-Ville | 64.000000 | 95.375000 |
Louvre | 72.000000 | 93.400000 |
Luxembourg | 55.000000 | 88.666667 |
Ménilmontant | 34.923077 | 90.730769 |
Observatoire | 41.000000 | 97.142857 |
Opéra | 44.125000 | 94.500000 |
Palais-Bourbon | 35.400000 | 95.000000 |
Panthéon | 73.800000 | 95.200000 |
Passy | 39.785714 | 93.785714 |
Popincourt | 31.880952 | 89.261905 |
Reuilly | 30.300000 | 93.200000 |
Temple | 30.833333 | 92.833333 |
Vaugirard | 51.333333 | 94.750000 |
Élysée | 70.000000 | 96.000000 |
nr.loc[("Bourse", "Shared room"), :]
Price 35.0625
review_scores_rating 90.8750
Name: (Bourse, Shared room), dtype: float64
But also “unstack” it so we can cross-tab:
unstacked = nr.unstack()
unstacked
Price | review_scores_rating | |||||||
---|---|---|---|---|---|---|---|---|
room_type | Entire home/apt | Hotel room | Private room | Shared room | Entire home/apt | Hotel room | Private room | Shared room |
neighbourhood_cleansed | ||||||||
Batignolles-Monceau | 103.501922 | 343.660714 | 58.698305 | 53.142857 | 92.484096 | 93.607143 | 95.064407 | 90.214286 |
Bourse | 139.618020 | 321.244898 | 71.088710 | 35.062500 | 91.608503 | 93.367347 | 93.225806 | 90.875000 |
Buttes-Chaumont | 82.568452 | 117.947368 | 45.866667 | 25.228571 | 93.077806 | 91.368421 | 93.528205 | 90.400000 |
Buttes-Montmartre | 87.478750 | 118.342105 | 55.217899 | 42.555556 | 92.940811 | 90.250000 | 93.729572 | 89.333333 |
Entrepôt | 101.621981 | 311.736842 | 55.144893 | 30.458333 | 92.995046 | 87.508772 | 93.320665 | 90.458333 |
Gobelins | 86.284188 | 106.961538 | 53.238671 | 35.600000 | 91.858974 | 91.346154 | 93.305136 | 92.533333 |
Hôtel-de-Ville | 154.687543 | 412.291667 | 79.372881 | 64.000000 | 93.123193 | 90.000000 | 93.745763 | 95.375000 |
Louvre | 165.077367 | 426.830508 | 102.828947 | 72.000000 | 91.763279 | 91.457627 | 94.500000 | 93.400000 |
Luxembourg | 164.863531 | 204.746667 | 87.226562 | 55.000000 | 92.269854 | 92.186667 | 92.515625 | 88.666667 |
Ménilmontant | 75.727787 | 128.200000 | 45.138425 | 34.923077 | 93.332795 | 92.800000 | 92.076372 | 90.730769 |
Observatoire | 95.354978 | 306.525000 | 90.521401 | 41.000000 | 92.864564 | 93.150000 | 93.501946 | 97.142857 |
Opéra | 115.688708 | 287.089655 | 75.561905 | 44.125000 | 93.126144 | 89.172414 | 93.704762 | 94.500000 |
Palais-Bourbon | 170.620910 | 450.722222 | 76.144144 | 35.400000 | 92.757382 | 93.527778 | 94.180180 | 95.000000 |
Panthéon | 129.989522 | 141.925000 | 69.246753 | 73.800000 | 92.206942 | 90.575000 | 94.006494 | 95.200000 |
Passy | 155.033317 | 331.766667 | 77.497674 | 39.785714 | 91.810702 | 92.600000 | 93.660465 | 93.785714 |
Popincourt | 95.073171 | 138.333333 | 53.348606 | 31.880952 | 93.042408 | 94.238095 | 93.541833 | 89.261905 |
Reuilly | 89.299656 | 104.727273 | 53.468165 | 30.300000 | 93.082090 | 91.636364 | 94.441948 | 93.200000 |
Temple | 150.560496 | 241.977273 | 77.474453 | 30.833333 | 92.378794 | 94.295455 | 93.043796 | 92.833333 |
Vaugirard | 108.483296 | 288.891892 | 58.145729 | 51.333333 | 92.632517 | 93.067568 | 93.778894 | 94.750000 |
Élysée | 194.715942 | 300.590361 | 94.835165 | 70.000000 | 91.547826 | 89.867470 | 92.780220 | 96.000000 |
This in turn creates a MultiIndex
on the columns instead, which works similarly:
unstacked["Price"]
room_type | Entire home/apt | Hotel room | Private room | Shared room |
---|---|---|---|---|
neighbourhood_cleansed | ||||
Batignolles-Monceau | 103.501922 | 343.660714 | 58.698305 | 53.142857 |
Bourse | 139.618020 | 321.244898 | 71.088710 | 35.062500 |
Buttes-Chaumont | 82.568452 | 117.947368 | 45.866667 | 25.228571 |
Buttes-Montmartre | 87.478750 | 118.342105 | 55.217899 | 42.555556 |
Entrepôt | 101.621981 | 311.736842 | 55.144893 | 30.458333 |
Gobelins | 86.284188 | 106.961538 | 53.238671 | 35.600000 |
Hôtel-de-Ville | 154.687543 | 412.291667 | 79.372881 | 64.000000 |
Louvre | 165.077367 | 426.830508 | 102.828947 | 72.000000 |
Luxembourg | 164.863531 | 204.746667 | 87.226562 | 55.000000 |
Ménilmontant | 75.727787 | 128.200000 | 45.138425 | 34.923077 |
Observatoire | 95.354978 | 306.525000 | 90.521401 | 41.000000 |
Opéra | 115.688708 | 287.089655 | 75.561905 | 44.125000 |
Palais-Bourbon | 170.620910 | 450.722222 | 76.144144 | 35.400000 |
Panthéon | 129.989522 | 141.925000 | 69.246753 | 73.800000 |
Passy | 155.033317 | 331.766667 | 77.497674 | 39.785714 |
Popincourt | 95.073171 | 138.333333 | 53.348606 | 31.880952 |
Reuilly | 89.299656 | 104.727273 | 53.468165 | 30.300000 |
Temple | 150.560496 | 241.977273 | 77.474453 | 30.833333 |
Vaugirard | 108.483296 | 288.891892 | 58.145729 | 51.333333 |
Élysée | 194.715942 | 300.590361 | 94.835165 | 70.000000 |
EXERCISE
Create a table that shows the average price for properties by room and property type
More at: