Visualising Tabular Data¶

👨‍🏫 [Sync]

%matplotlib inline

import pandas
import seaborn as sns
import matplotlib.pyplot as plt

Local files

Assuming you have the file downloaded on the path ../data/:

db = pandas.read_csv("../data/paris_abb.csv.zip")

Online read

If you’re online, you can do:

db = pandas.read_csv("https://github.com/darribas/data_science_studio/raw/master/content/data/paris_abb.csv.zip")

Simple & Quick (`pandas`)¶

db["Price"].plot.hist()

<AxesSubplot:ylabel='Frequency'>

../../_images/04-Tabular_data_viz_5_1.png

db["Price"].plot.hist(bins=100)

<AxesSubplot:ylabel='Frequency'>

../../_images/04-Tabular_data_viz_6_1.png

db.query("Price < 500")["Price"].plot.kde()

<AxesSubplot:ylabel='Density'>

../../_images/04-Tabular_data_viz_7_1.png

db.plot.scatter("accommodates", "Price")

<AxesSubplot:xlabel='accommodates', ylabel='Price'>

../../_images/04-Tabular_data_viz_8_1.png

EXERCISE

Create a histogram of the distribution of number of people a property accommodates (accommodates)
Create a scatter plot of the number of beds and number of bedrooms

More

https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html

We can also combine this with groupings from the previous notebook:

db.groupby("neighbourhood_cleansed")\
  ["Price"]\
  .mean()\
  .plot.bar()

<AxesSubplot:xlabel='neighbourhood_cleansed'>

../../_images/04-Tabular_data_viz_11_1.png

EXERCISE

Create a bar plot of the average review_scores_rating by neighbourhood, sorted by average price.

For a “pro” touch (optional), substract 90 from the reviews score before plotting.

`seaborn`¶

Univariate

sns.distplot(db.query("Price < 500")["Price"], 
             kde=True, 
             rug=True)

/opt/conda/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
/opt/conda/lib/python3.8/site-packages/seaborn/distributions.py:2055: FutureWarning: The `axis` variable is no longer used and will be removed. Instead, assign variables directly to `x` or `y`.
  warnings.warn(msg, FutureWarning)

<AxesSubplot:xlabel='Price', ylabel='Density'>

../../_images/04-Tabular_data_viz_15_2.png

Bivariate

sns.jointplot(x = "accommodates",
              y = "Price",
              data=db,
              kind="hex"
             );

../../_images/04-Tabular_data_viz_17_0.png

sns.regplot(x = "accommodates",
            y = "Price",
            data=db.query("Price < 500")
           );

../../_images/04-Tabular_data_viz_18_0.png

sns.jointplot(x = "accommodates",
              y = "Price",
              data=db.query("Price < 500")\
                     .sample(1000),
              kind="kde"
             );

../../_images/04-Tabular_data_viz_19_0.png

EXERCISE

Explore the documentation of jointplot and create a figure similar to the one above where you replace the hexagonal binning for a KDE. Since this will probably take too long, subset your data before plotting to only properties cheaper than $500, and then randomly sample 1,000 observations (tip: check out the sample method).

More

http://seaborn.pydata.org/tutorial.html

Categorical

sns.stripplot(x = "Price",
              y = "neighbourhood_cleansed",
              data=db
             );

../../_images/04-Tabular_data_viz_22_0.png

sns.swarmplot(x = "room_type",
              y = "Price",
              data=db.query("neighbourhood_cleansed == 'Louvre'")\
                     .sample(250),
              alpha=0.25
             );

/opt/conda/lib/python3.8/site-packages/seaborn/categorical.py:1296: UserWarning: 58.3% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)

../../_images/04-Tabular_data_viz_23_1.png

To note:

With larger datasets, it’s hard to see any pattern
This is true even if you jitter the points around to avoid overlap and/or you play with transparency (alpha)
Algorithms to separate out dots exist but they’re computationally intensive and can only do so much

sns.stripplot(x = "Price",
              y = "property_type",
              hue = "room_type",
              data=db.query("neighbourhood_cleansed == 'Louvre'")\
                     .sample(250),
              alpha=0.25
             );

../../_images/04-Tabular_data_viz_25_0.png

sns.boxplot(x = "room_type",
            y = "Price",
            data = db
           );

../../_images/04-Tabular_data_viz_26_0.png

nei_list = ["Hôtel-de-Ville", "Louvre", "Buttes-Montmartre"]
sub = db["neighbourhood_cleansed"].isin(nei_list)

sns.violinplot(x = "Price",
               y = "neighbourhood_cleansed",
               data = db[sub].query("Price < 500")
              );

../../_images/04-Tabular_data_viz_27_0.png

EXERCISE

Explore the distribution of price by property type

Full control (`matplotlib`)¶

One¶

f, ax = plt.subplots(1)

ax.hist(db['Price'], bins=100)

plt.show()

../../_images/04-Tabular_data_viz_31_0.png

sub = db.query("Price < 500")\
        .sample(1000)

f, ax = plt.subplots(1)

sns.kdeplot(sub['Price'], sub['accommodates'], 
            shade=True, ax=ax)
ax.scatter(sub['Price'], sub['accommodates'],
           alpha=0.1, s=0.75, color='orange')

plt.show()

/opt/conda/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

../../_images/04-Tabular_data_viz_32_1.png

f, ax = plt.subplots(1, figsize=(4, 4))

sns.kdeplot(sub['Price'],
            sub['accommodates'], 
            shade=True, ax=ax, cmap='Purples')

plt.show()

/opt/conda/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

../../_images/04-Tabular_data_viz_33_1.png

CHALLENGE - Create a visualisation that includes a KDE and a scatter plot and that explores the relationship between number of beds and number of people it accommodates, for a random sample of 500 properties.

Two or more¶

f, axs = plt.subplots(1, 2, figsize=(12, 6))

sns.distplot(sub['Price'], 
             kde=False, 
             rug=True, 
             ax=axs[0]
            )

sns.distplot(sub['accommodates'], 
             hist=False, 
             kde=True, 
             rug=True, 
             ax=axs[1]
            )

plt.show()

/opt/conda/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
/opt/conda/lib/python3.8/site-packages/seaborn/distributions.py:2055: FutureWarning: The `axis` variable is no longer used and will be removed. Instead, assign variables directly to `x` or `y`.
  warnings.warn(msg, FutureWarning)
/opt/conda/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `kdeplot` (an axes-level function for kernel density plots).
  warnings.warn(msg, FutureWarning)
/opt/conda/lib/python3.8/site-packages/seaborn/distributions.py:2055: FutureWarning: The `axis` variable is no longer used and will be removed. Instead, assign variables directly to `x` or `y`.
  warnings.warn(msg, FutureWarning)

../../_images/04-Tabular_data_viz_36_1.png

CHALLENGE - Create a visualisation for all of the properties with three subplots:

Histogram of price
Scatter plot of price Vs number of people it accommodates
Histogram of number of people the property accommodates

Data Science Studio

Visualising Tabular Data¶

Simple & Quick (pandas)¶

seaborn¶

Full control (matplotlib)¶

One¶

Two or more¶

Simple & Quick (`pandas`)¶

`seaborn`¶

Full control (`matplotlib`)¶