Pair Plot
First, we set up the imports, the date range, the income levels, and the World Bank indicators we want to pull.
import wbdata
import pandas as pd
import datetime
dates = (datetime.datetime(2016, 1, 1), datetime.datetime(2017, 1, 1))
levels = ["HIC", "MIC", "LIC"]
indicators = {"SP.POP.TOTL": "Population", "NY.GDP.PCAP.CD": "GDP per Capita", "SH.DYN.NMRT": "Neonatal Mortality", "SL.UEM.TOTL.ZS": "Unemployment"}
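Now we fetch the data for each income level, tag each result with its level, and combine everything into one DataFrame. Rows with missing values are dropped.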
dataArray = []
for level in levels:
    countries = [i["id"] for i in wbdata.get_country(incomelevel=level, display=False)]
    data = wbdata.get_dataframe(indicators, country=countries, data_date=dates)
    data["Income Level"] = level
    dataArray.append(data)
df = pd.concat(dataArray)
df.dropna(inplace=True)
df
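The loop above follows a common pattern: build one frame per group, add a label column, then concatenate. Here is a minimal offline sketch of that pattern. The `fake_results` dictionary and its numbers are hypothetical stand-ins for the `wbdata` responses, not real World Bank figures.

```python
import pandas as pd

# Hypothetical stand-in for the per-level frames wbdata would return.
fake_results = {
    "HIC": pd.DataFrame({"Population": [1.0e7, 5.0e6], "GDP per Capita": [40000.0, None]}),
    "LIC": pd.DataFrame({"Population": [2.0e7], "GDP per Capita": [800.0]}),
}

frames = []
for level, frame in fake_results.items():
    frame = frame.copy()             # avoid modifying the stand-in data
    frame["Income Level"] = level    # label every row with its group
    frames.append(frame)

# Stack the labelled frames and drop rows with any missing value.
combined = pd.concat(frames).dropna()
print(combined)
```

Using `pd.concat(frames).dropna()` returns a new frame rather than modifying one in place, which can be easier to follow when rerunning notebook cells.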
Now, to get the pair plot we use pairplot in seaborn. All it needs is the data and the name of the variable to use for the hue.
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(df, hue="Income Level")
plt.show()
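We get a scatter plot for each pair of variables, and the diagonal shows the distribution of each variable on its own (histograms here; recent seaborn versions draw density curves by default when a hue is set).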
Something we may want to do is apply a log transformation to some of the data. Look at the population and GDP histograms: most countries are bunched at the low end with a few very large values, which suggests the values span several orders of magnitude. Let's try log-transforming these columns and see if it helps.
import numpy as np
df["GDP per Capita"] = df["GDP per Capita"].apply(np.log)
df["Population"] = df["Population"].apply(np.log)
This applies NumPy's natural log function to these two columns. We can then draw the pair plot again.
sns.pairplot(df, hue="Income Level")
plt.show()
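That looks much better: the log-scaled variables are spread more evenly, so the relationships between them are easier to see.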
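One way to put a number on the improvement is to compare skewness before and after the transformation. The sketch below does this on synthetic log-normal data (made-up values standing in for a column like population), and writes the result to a new column instead of overwriting the original. That second choice is an assumption about workflow, not part of the code above: overwriting in place means rerunning the cell would take the log twice.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (made-up, standing in for a column like population).
rng = np.random.default_rng(42)
toy = pd.DataFrame({"Population": rng.lognormal(mean=10, sigma=1.0, size=1000)})

# Store the transformed values in a new column so the raw data stays intact.
toy["Log Population"] = np.log(toy["Population"])

raw_skew = toy["Population"].skew()
log_skew = toy["Log Population"].skew()

# Skewness near 0 means a roughly symmetric distribution.
print(f"raw skew: {raw_skew:.2f}, log skew: {log_skew:.2f}")
```

Note that `np.log` is only defined for positive values, so columns containing zeros or negatives need different handling (for example `np.log1p` for non-negative data).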