Notes on making scatterplots in matplotlib and seaborn

Many of my programming tips, like my notes for making Leaflet maps in R or margins plots in Stata, I’ve just accumulated doing projects over the years. My current workplace is a python shop though, so I am figuring it out all over for some of these things in python. I made some ugly scatterplots for a presentation the other day, and figured it would be time to spend alittle time making some notes on making them a bit nicer.

For prior python graphing post examples, I have:

For this post, I am going to use the same data I illustrated with SPSS previously, a set of crime rates in Appalachian counties. Here you can download the dataset and the python script to follow along.

Making scatterplots using matplotlib

So first for the upfront junk, I load my libraries, change my directory, update my plot theme, and then load my data into a dataframe crime_dat. I technically do not use numpy in this script, but soon as I take it out I’m guaranteed to need to use np. for something!

################################################################
import pandas as pd
import numpy as np
import os
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

my_dir = r'C:\Users\andre\OneDrive\Desktop\big_scatter'
os.chdir(my_dir)

andy_theme = {'axes.grid': True,
              'grid.linestyle': '--',
              'legend.framealpha': 1,
              'legend.facecolor': 'white',
              'legend.shadow': True,
              'legend.fontsize': 14,
              'legend.title_fontsize': 16,
              'xtick.labelsize': 14,
              'ytick.labelsize': 14,
              'axes.labelsize': 16,
              'axes.titlesize': 20,
              'figure.dpi': 100}

matplotlib.rcParams.update(andy_theme)
crime_dat = pd.read_csv('Rural_appcrime_long.csv')
################################################################

First, lets start from the base scatterplot. After defining my figure and axis objects, I add on the ax.scatter by pointing the x and y’s to my pandas dataframe columns, here Burglary and Robbery rates per 100k. You could also instead of starting from the matplotlib objects start from the pandas dataframe methods (as I did in my prior histogram post). I don’t have a good reason for using one or the other.

Then I set the axis grid lines to be below my points (is there a way to set this as a default?), and then I set my X and Y axis labels to be nicer than the default names.

################################################################
#Default scatterplot
fig, ax = plt.subplots(figsize=(6,4))
ax.scatter(crime_dat['burg_rate'], crime_dat['rob_rate'])
ax.set_axisbelow(True)
ax.set_xlabel('Burglary Rate per 100,000')
ax.set_ylabel('Robbery Rate per 100,000')
plt.savefig('Scatter01.png', dpi=500, bbox_inches='tight')
plt.show()
################################################################

You can see here the default point markers, just solid blue filled circles with no outline, when you get a very dense scatterplot just looks like a solid blob. I think a better default for scatterplots is to plot points with an outline. Here I also make the interior fill slightly transparent. All of this action is going on in the ax.scatter call, all of the other lines are the same.

################################################################
#Making points have an outline and interior fill
fig, ax = plt.subplots(figsize=(6,4))
ax.scatter(crime_dat['burg_rate'], crime_dat['rob_rate'], 
           c='grey', edgecolor='k', alpha=0.5)
ax.set_axisbelow(True)
ax.set_xlabel('Burglary Rate per 100,000')
ax.set_ylabel('Robbery Rate per 100,000')
plt.savefig('Scatter02.png', dpi=500, bbox_inches='tight')
plt.show()
################################################################

So that is better, but we still have quite a bit of overplotting going on. Another quick trick is to make the points smaller and up the transparency by setting alpha to a lower value. This allows you to further visualize the density, but then makes it a bit harder to see individual points – if you started from here you might miss that outlier in the upper right.

Note I don’t set the edgecolor here, but if you want to make the edges semitransparent as well you could do edgecolor=(0.0, 0.0, 0.0, 0.5), where the last number of is the alpha transparency tuner.

################################################################
#Making the points small and semi-transparent
fig, ax = plt.subplots(figsize=(6,4))
ax.scatter(crime_dat['burg_rate'], crime_dat['rob_rate'], c='k', 
            alpha=0.1, s=4)
ax.set_axisbelow(True)
ax.set_xlabel('Burglary Rate per 100,000')
ax.set_ylabel('Robbery Rate per 100,000')
plt.savefig('Scatter03.png', dpi=500, bbox_inches='tight')
plt.show()
################################################################

This dataset has around 7.5k rows in it. For most datasets of anymore than a hundred points, you often have severe overplotting like you do here. One way to solve that problem is to bin observations, and then make a graph showing the counts within the bins. Matplotlib has a very nice hexbin method for doing this, which is easier to show than explain.

################################################################
#Making a hexbin plot
fig, ax = plt.subplots(figsize=(6,4))
hb = ax.hexbin(crime_dat['burg_rate'], crime_dat['rob_rate'], 
               gridsize=20, edgecolors='grey', 
               cmap='inferno', mincnt=1)
ax.set_axisbelow(True)
ax.set_xlabel('Burglary Rate per 100,000')
ax.set_ylabel('Robbery Rate per 100,000')
cb = fig.colorbar(hb, ax=ax)
plt.savefig('Scatter04.png', dpi=500, bbox_inches='tight')
plt.show()
################################################################

So for the hexbins I like using the mincnt=1 option, as it clearly shows areas with no points, but then you can still spot the outliers fairly easy. (Using white for the edge colors looks nice as well.)

You may be asking, what is up with that outlier in the top right? It ends up being Letcher county in Kentucky in 1983, which had a UCR population estimate of only 1522, but had a total of 136 burglaries and 7 robberies. This could technically be correct (only some local one cop town reported, and that town does not cover the whole county), but I’m wondering if this is a UCR reporting snafu.

It is also a good use case for funnel charts. I debated on making some notes here about putting in text labels, but will hold off for now. You can add in text by using ax.annotate fairly easy by hand, but it is hard to automate text label positions. It is maybe easier to make interactive graphs and have a tooltip, but that will need to be another blog post as well.

Making scatterplots using seaborn

The further examples I show are using the seaborn library, imported earlier as sns. I like using seaborn to make small multiple plots, but it also has a very nice 2d kernel density contour plot method I am showing off.

Note this does something fundamentally different than the prior hexbin chart, it creates a density estimate. Here it looks pretty but creates a density estimate in areas that are not possible, negative crime rates. (There are ways to prevent this, such as estimating the KDE on a transformed scale and retransforming back, or reflecting the density back inside the plot would probably make more sense here, ala edge weighting in spatial statistics.)

Here the only other things to note are used filled contours instead of just the lines, I also drop the lowest shaded area (I wish I could just drop areas of zero density, note dropping the lowest area drops my outlier in the top right). Also I have a tough go of using the default bandwidth estimators, so I input my own.

################################################################
#Making a contour plot using seaborn
g = sns.kdeplot(crime_dat['burg_rate'], crime_dat['rob_rate'], 
                shade=True, cbar=True, gridsize=100, bw=(500,50),
                cmap='plasma', shade_lowest=False, alpha=0.8)
g.set_axisbelow(True)
g.set_xlabel('Burglary Rate per 100,000')
g.set_ylabel('Robbery Rate per 100,000')
plt.savefig('Scatter05.png', dpi=500, bbox_inches='tight')
plt.show()
################################################################ 

So far I have not talked about the actual marker types. It is very difficult to visualize different markers in a scatterplot unless they are clearly separated. So although it works out OK for the Iris dataset because it is small N and the species are clearly separated, in real life datasets it tends to be much messier.

So I very rarely use multiple point types to symbolize different groups in a scatterplot, but prefer to use small multiple graphs. Here is an example of turning my original scatterplot, but differentiating between different county areas in the dataset. It is a pretty straightforward update using sns.FacetGrid to define the group, and then using g.map. (There is probably a smarter way to set the grid lines below the points for each subplot than the loop.)

################################################################
#Making a small multiple scatterplot using seaborn
g = sns.FacetGrid(data=crime_dat, col='subrgn', 
                   col_wrap=2, despine=False, height=4)
g.map(plt.scatter, 'burg_rate', 'rob_rate', color='grey', 
       s=12, edgecolor='k', alpha=0.5)
g.set_titles("{col_name}")
for a in g.axes:
    a.set_axisbelow(True)
g.set_xlabels('Burglary Rate per 100,000')
g.set_ylabels('Robbery Rate per 100,000')
plt.savefig('Scatter06.png', dpi=500, bbox_inches='tight')
plt.show()
################################################################

And then finally I show an example of making a small multiple hexbin plot. It is alittle tricky, but this is an example in the seaborn docs of writing your own sub-plot function and passing that.

To make this work, you need to pass an extent for each subplot (so the hexagons are not expanded/shrunk in any particular subplot). You also need to pass a vmin/vmax argument, so the color scales are consistent for each subplot. Then finally to add in the color bar I just fiddled with adding an axes. (Again there is probably a smarter way to scoop up the plot coordinates for the last plot, but here I just experimented till it looked about right.)

################################################################
#Making a small multiple hexbin plot using seaborn

#https://github.com/mwaskom/seaborn/issues/1860
#https://stackoverflow.com/a/31385996/604456
def loc_hexbin(x, y, **kwargs):
    kwargs.pop("color", None)
    plt.hexbin(x, y, gridsize=20, edgecolor='grey',
               cmap='inferno', mincnt=1, 
               vmin=1, vmax=700, **kwargs)

g = sns.FacetGrid(data=crime_dat, col='subrgn', 
                  col_wrap=2, despine=False, height=4)
g.map(loc_hexbin, 'burg_rate', 'rob_rate', 
      edgecolors='grey', extent=[0, 9000, 0, 500])
g.set_titles("{col_name}")
for a in g.axes:
    a.set_axisbelow(True)
#This goes x,y,width,height
cax = g.fig.add_axes([0.55, 0.09, 0.03, .384])
plt.colorbar(cax=cax, ax=g.axes[0])
g.set_xlabels('Burglary Rate per 100,000')
g.set_ylabels('Robbery Rate per 100,000')
plt.savefig('Scatter07.png', dpi=500, bbox_inches='tight')
plt.show()
################################################################

Another common task with scatterplots is to visualize a smoother, e.g. E[Y|X] the expected mean of Y conditional on X, or you could do any other quantile, etc. That will have to be another post though, but for examples I have written about previously I have jittering 0/1 data, and visually weighted regression.

Notes on matplotlib and seaborn charts (python)

My current workplace is a python shop. I actually didn’t use pandas/numpy for most of my prior academic projects, but I really like pandas for data manipulation now that I know it better. I’m using python objects (lists, dictionaries, sets) inside of data frames quite a bit to do some tricky data manipulations.

I do however really miss using ggplot to make graphs. So here are my notes on using python tools to make plots, specifically the matplotlib and seaborn libraries. Here is the data/code to follow along on your own.

some set up

First I am going to redo the data analysis for predictive recidivism I did in a prior blog post. One change is that I noticed the default random forest implementation in sci-kit was prone to overfitting the data – so one simple regularization was to either limit depth of trees, or number of samples needed to split, or the total number of samples in a final leaf. (I noticed this when I developed a simulated example xgboost did well with the defaults, but random forests did not. It happened to be becauase xgboost defaults had a way smaller number of potential splits, when using similar defaults they were pretty much the same.)

Here I just up the minimum samples per leaf to 100.

#########################################################
#set up for libraries and data I need
import pandas as pd
import os
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

my_dir = r'C:\Users\andre\Dropbox\Documents\BLOG\matplotlib_seaborn'
os.chdir(my_dir)

#Modelling recidivism using random forests, see below for background 
#https://andrewpwheeler.com/2020/01/05/balancing-false-positives/

recid = pd.read_csv('PreppedCompas.csv')
#Preparing the variables I want
recid_prep = recid[['Recid30','CompScore.1','CompScore.2','CompScore.3',
                    'juv_fel_count','YearsScreening']]
recid_prep['Male'] = 1*(recid['sex'] == "Male")
recid_prep['Fel'] = 1*(recid['c_charge_degree'] == "F")
recid_prep['Mis'] = 1*(recid['c_charge_degree'] == "M")
recid_prep['race'] = recid['race']

#Now generating train and test set
recid_prep['Train'] = np.random.binomial(1,0.75,len(recid_prep))
recid_train = recid_prep[recid_prep['Train'] == 1]
recid_test = recid_prep[recid_prep['Train'] == 0]

#Now estimating the model
ind_vars = ['CompScore.1','CompScore.2','CompScore.3',
            'juv_fel_count','YearsScreening','Male','Fel','Mis'] #no race in model
dep_var = 'Recid30'
rf_mod = RandomForestClassifier(n_estimators=500, random_state=10, min_samples_leaf=100)
rf_mod.fit(X = recid_train[ind_vars], y = recid_train[dep_var])

#Now applying out of sample
pred_prob = rf_mod.predict_proba(recid_test[ind_vars] )
recid_test['prob'] = pred_prob[:,1]
#########################################################

matplotlib themes

One thing you can do is easily update the base template for matplotlib. Here are example settings I typically use, in particular making the default font sizes much larger. I also like a using a drop shadow for legends – although many consider drop shadows for data chart-junky, they actually help distinguish the legend from the background plot (a trick I learned from cartographic maps).

#########################################################
#Settings for matplotlib base

andy_theme = {'axes.grid': True,
              'grid.linestyle': '--',
              'legend.framealpha': 1,
              'legend.facecolor': 'white',
              'legend.shadow': True,
              'legend.fontsize': 14,
              'legend.title_fontsize': 16,
              'xtick.labelsize': 14,
              'ytick.labelsize': 14,
              'axes.labelsize': 16,
              'axes.titlesize': 20,
              'figure.dpi': 100}
 
print( matplotlib.rcParams )
#matplotlib.rcParams.update(andy_theme)

#print(plt.style.available)
#plt.style.use('classic')
#########################################################

I have it commented out here, but once you define your dictionary of particular style changes, then you can just run matplotlib.rcParams.update(your_dictionary) to update the base plots. You can also see the ton of options by printing matplotlib.rcParams, and there are a few different styles already available to view as well.

creating a lift-calibration line plot

Now I am going to create a plot that I have seen several names used for – I am going to call it a calibration lift-plot. Calibration is basically “if my model predicts something will happen 5% of the time, does it actually happen 5% of the time”. I used to always do calibration charts where I binned the data, and put the predicted on the X axis, and observed on the Y (see this example). But data-robot has an alternative plot, where you superimpose those two lines that has been growing on me.

#########################################################
#Creating a calibration lift-plot for entire test set

bin_n = 30
recid_test['Bin'] = pd.qcut(recid_test['prob'], bin_n, range(bin_n) ).astype(int) + 1
recid_test['Count'] = 1

agg_bins = recid_test.groupby('Bin', as_index=False)['Recid30','prob','Count'].sum()
agg_bins['Predicted'] = agg_bins['prob']/agg_bins['Count']
agg_bins['Actual'] = agg_bins['Recid30']/agg_bins['Count']

#Now can make a nice matplotlib plot
fig, ax = plt.subplots(figsize=(6,4))
ax.plot(agg_bins['Bin'], agg_bins['Predicted'], marker='+', label='Predicted')
ax.plot(agg_bins['Bin'], agg_bins['Actual'], marker='o', markeredgecolor='w', label='Actual')
ax.set_ylabel('Probability')
ax.legend(loc='upper left')
plt.savefig('Default_mpl.png', dpi=500, bbox_inches='tight')
plt.show()
#########################################################

You can see that the model is fairly well calibrated in the test set, and that the predictions range from around 10% to 75%. It is noisy and snakes high and low, but that is expected as we don’t have a real giant test sample here (around a total of 100 observations per bin).

So this is the default matplotlib style. Here is the slight update using my above specific theme.

matplotlib.rcParams.update(andy_theme)
fig, ax = plt.subplots(figsize=(6,4))
ax.plot(agg_bins['Bin'], agg_bins['Predicted'], marker='+', label='Predicted')
ax.plot(agg_bins['Bin'], agg_bins['Actual'], marker='o', markeredgecolor='w', label='Actual')
ax.set_ylabel('Probability')
ax.legend(loc='upper left')
plt.savefig('Mytheme_mpl.png', dpi=500, bbox_inches='tight')
plt.show()

Not too different from the default, but I only have to call matplotlib.rcParams.update(andy_theme) one time and it will apply it to all my charts. So I don’t have to continually set the legend shadow, grid lines, etc.

making a lineplot in seaborn

matplotlib is basically like base graphics in R, where if you want to superimpose a bunch of stuff you make the base plot and then add in lines() or points() etc. on top of the base. This is ok for only a few items, but if you have your data in long format, where a certain category distinguishes groups in the data, it is not very convenient.

The seaborn library provides some functions to get closer to the ggplot idea of mapping aesthetics using long data, so here is the same lineplot example. seaborn builds stuff on top of matplotlib, so it inherits the style I defined earlier. In this code snippet, first I melt the agg_bins data to long format. Then it is a similarish plot call to draw the graph.

#########################################################
#Now making the same chart in seaborn
#Easier to melt to wide data

agg_long = pd.melt(agg_bins, id_vars=['Bin'], value_vars=['Predicted','Actual'], var_name='Type', value_name='Probability')

plt.figure(figsize=(6,4))
sns.lineplot(x='Bin', y='Probability', hue='Type', style='Type', data=agg_long, dashes=False,
             markers=True, markeredgecolor='w')
plt.xlabel(None)
plt.savefig('sns_lift.png', dpi=500, bbox_inches='tight')   
#########################################################

By default seaborn adds in a legend title – although it is not stuffed into the actual legend title slot. (This is because they will handle multiple sub-aesthetics more gracefully I think, e.g. map color to one attribute and dash types to another.) But here I just want to get rid of it. (Similar to maps, no need to give a legend the title “legend” – should be obvious.) Also the legend did not inherit the white edge colors, so I set that as well.

#Now lets edit the legend
plt.figure(figsize=(6,4))
ax = sns.lineplot(x='Bin', y='Probability', hue='Type', style='Type', data=agg_long, dashes=False,
             markers=True, markeredgecolor='w')
plt.xlabel(None)
handles, labels = ax.get_legend_handles_labels()
for i in handles:
    i.set_markeredgecolor('w')
legend = ax.legend(handles=handles[1:], labels=labels[1:])
plt.savefig('sns_lift_edited_leg.png', dpi=500, bbox_inches='tight')

making a small multiple plot

Another nicety of seaborn is that it can make small multiple plots for you. So here I conduct analysis of calibration among subsets of data for different racial categories. First I collapse the different racial subsets into an other category, then I do the same qcut, but within the different groupings. To figure that out, I do what all good programmers do, google it and adapt from a stackoverflow example.

#########################################################
#replace everyone not black/white as other
print( recid_test['race'].value_counts() )
other_group = ['Hispanic','Other','Asian','Native American']
recid_test['RaceComb'] = recid_test['race'].replace(other_group, 'Other')
print(recid_test['RaceComb'].value_counts() )

#qcut by group
bin_sub = 20
recid_test['BinRace'] = (recid_test.groupby('RaceComb',as_index=False)['prob']
                        ).transform( lambda x: pd.qcut(x, bin_sub, labels=range(bin_sub))
                        ).astype(int) + 1

#Now aggregate two categories, and then melt
race_bins = recid_test.groupby(['BinRace','RaceComb'], as_index=False)['Recid30','prob','Count'].sum()
race_bins['Predicted'] = race_bins['prob']/race_bins['Count']
race_bins['Actual'] = race_bins['Recid30']/race_bins['Count']
race_long = pd.melt(race_bins, id_vars=['BinRace','RaceComb'], value_vars=['Predicted','Actual'], var_name='Type', value_name='Probability')

#Now making the small multiple plot
d = {'marker': ['o','X']}
ax = sns.FacetGrid(data=race_long, col='RaceComb', hue='Type', hue_kws=d,
                   col_wrap=2, despine=False, height=4)
ax.map(plt.plot, 'BinRace', 'Probability', markeredgecolor="w")
ax.set_titles("{col_name}")
ax.set_xlabels("")
#plt.legend(loc="upper left")
plt.legend(bbox_to_anchor=(1.9,0.8))
plt.savefig('sns_smallmult_niceleg.png', dpi=500, bbox_inches='tight')
#########################################################

And you can see that the model is fairly well calibrated for each racial subset of the data. The other category is more volatile, but it has a smaller number of observations as well. But overall does not look too bad. (If you take out my end leaf needs 100 samples though, all of these calibration plots look really bad!)

I am having a hellishly hard time doing the map of sns.lineplot to the sub-charts, but you can just do normal matplotlib plots. When you set the legend, it defaults to the last figure that was drawn, so one way to set it where you want is to use bbox_to_anchor and just test to see where it is a good spot (can use negative arguments in this function to go to the left more). Seaborn has not nice functions to map the grid text names using formatted string substitution. And the post is long enough, you can play around yourself to see how the other options change the look of the plot.

For a few notes on various gotchas I’ve encountered so far:

  • For sns.FacetGrid, you need to set the size of the plots in that call, not by instantiating plt.figure(figsize=(6,4)). (This is because it draws multiple figures.)
  • When drawing elements in the chart, even for named arguments the order often matters. You often need to do things like color first before other arguments for aethetics. (I believe my problem mapping sns.lineplot to my small multiple is some inheritance problems where plt.plot does not have named arguments for x/y, but sns.lineplot does.)
  • To edit the legend in a FacetGrid generated set of charts, ax returns a grid, not just one element. Since each grid inherits the same legend though, you can do handles, labels = ax[0].get_legend_handles_labels() to get the legend handles to edit if you want.