Chapter 10: Plotting with Seaborn#

There are a number of plotting libraries available for Python including Bokeh, Plotly, and MayaVi; but the most prevalent library is still probably matplotlib. It is often the first plotting library a Python user will learn, and for good reason. It is stable, well supported, and there are few plots that matplotlib cannot generate. Despite its popularity, there are some drawbacks… namely, it can be quite verbose. That is, you may be able to generate nearly any plot, but it will take at least a few lines of code if not dozens to create and customize your figure.

One attractive alternative is the seaborn plotting library. While seaborn cannot generate the same variety of plots as matplotlib, it is good at generating a few common plots that people use regularly, and here is the key detail… it often does what would take matplotlib 10+ lines of code in only one or two lines. To make things even better, seaborn is built on top of matplotlib. This means that if you are not completely happy with what seaborn creates, you can fine tune it with the same matplotlib commands you already know! In addition, seaborn is designed to work closely with the pandas library. For example, think of all the lines of code you have typed to simply add labels to your x- and y-axes. Instead, seaborn often pulls the labels from the DataFrame column headers. Again, if you do not like this default behavior, you can still override it with plt.xlabel() and other commands that you already know.

By convention, seaborn is imported with the sns alias, but being that this is a relatively young library, it is unclear how strong this convention is. The official seaborn website uses it, so we will as well. All code in this chapter assumes the following import.

import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

10.1 Seaborn Plot Types#

A map of the seaborn plotting library is mainly a series of the different types of plots that it can generate. Below is table of the main categories. The rest of this chapter is a more in-depth survey of select plotting functions, and it is certainly not a complete list.

Table 1 Seaborn Plotting Type Categories Covered Herein

Category

Description

Regression

Draws a regression line through the data

Categorical

Plots frequency versus a category

Distribution

Plots frequency versus a continuous value

Matrix

Displays the data as a colored grid

Relational

Visualizes the relationship between two continuous variables

One distinction between some of the plotting categories above is whether they display continuous versus discrete/categorial information. When data are continuous, they can be nearly any value in a range like the density of a metal. This in contrast to discrete or categorial data that places data in a limited number of groups or bins such as the element(s) present in a metal sample.

10.2 Regression Plots#

Generating a regression line through data is a common task in science, and seaborn includes multiple plotting types that perform this task. All of the plots discussed below use a least square best fit and include a confidence interval for the regression line as a shaded region. Remember that there is uncertainty in both the slope and y-intercept for a regression line. If we were to plot all the possible variations of the regression line within the slope and intercept uncertainties, we get the regression confidence interval. By default, seaborn displays the 95% confidence interval, but this can be changed.

10.2.1 regplot#

The regplot generates a single scatter plot of data with a linear regression through the data points complete with a 95% confidence interval. The sns.regplot() function can take x and y positional arguments just like plt.plot(), but it also can take the x and y column names from a pandas DataFrame. Both approaches are demonstrated below.

rng = np.random.default_rng()
x = np.arange(10)
y = 2 * x + rng.normal(size=10)
sns.regplot(x=x, y=y);
../../_images/d935a561918263710fc8840502df2afe0e0447c271144bec1d22ebc7d81de8ff.svg

If the data is in a DataFrame, the x and y values can be provide as the column names, and seaborn will automatically add the column names as x and y labels. Below is a series of boiling point and molecular weights for various organic compounds.

bp = pd.read_csv('data/org_bp.csv')
bp
bp MW type
0 65 32.04 alcohol
1 78 46.07 alcohol
2 98 60.10 alcohol
3 118 74.12 alcohol
4 139 88.15 alcohol
5 157 102.18 alcohol
6 176 116.20 alcohol
7 195 130.23 alcohol
8 212 144.25 alcohol
9 232 158.28 alcohol
10 36 72.15 alkane
11 69 86.18 alkane
12 98 100.21 alkane
13 126 114.23 alkane
14 151 128.26 alkane
15 174 142.29 alkane
16 196 156.31 alkane
17 216 170.34 alkane
18 63 86.18 alkane
19 117 114.23 alkane
20 28 72.15 alkane
21 80 100.21 alkane
22 108 74.12 alcohol
23 83 74.12 alcohol
24 131 88.15 alcohol
25 135 102.18 alcohol
26 140 116.20 alcohol
27 182 94.11 alcohol
28 202 108.14 alcohol
29 220 136.19 alcohol

If you choose to provide column names from a pandas DataFrame, you must also provide the name of the DataFrame using the data keyword argument.

sns.regplot(x='MW', y='bp', data=bp);
../../_images/013e6d80f4eecd84ed5ee6b8562295761a1a5fb82ae2173ec86d1d46e9773407.svg

While the DataFrame column names provide accurate axis labels, the units are missing. We can use matplotlib commands from chapter 3 to modify the axis labels.

sns.regplot(x='MW', y='bp', data=bp)
plt.xlabel('MW, g/mol')
plt.ylabel('bp, $^o$C');
../../_images/35d6ffbe9bb364ab6dc35740c87fc1a71dfb86e04526afe2546558a4d756d630.svg

10.2.2 lmplot#

An lmplot() is very similar to the regplot() function except that an lmplot() also allows for multiple regressions based on additional pieces of information about each data point. For example, the org_bp.csv file above contains the boiling points of various alcohols and alkanes along with their molecular weights. Chemical intuition might bring one to expect two independent boiling point trends between the alcohol and alkanes, so we need two independent regression lines for the two classes of organic molecule. The lmplot() function can do exactly this.

The lmplot() function takes the x and y variables and the DataFrame name as either positional or keyword arguments, so the function call could also be as shown below where the first three arguments are positional arguments providing the x-values, y-values, and the DataFrame name in this order.

sns.lmplot('MW', 'bp', bp, hue='type')

The hue= argument is the column name that dictates the color of the markers, so in this example, it will be the type of organic molecule.

sns.lmplot(x='MW', y='bp', data=bp, hue='type')
plt.xlabel('MW, g/mol')
plt.ylabel('bp, $^o$C');
../../_images/03de29f7a90439ad17f7f8bc8113e24e7a1b80f7939f6c01a0323fc26f8d2287.svg

The lmplot() function also provides arguments for modifying the appearance of the plot. Below is a demonstration of a few extra adjustments to the plot. The facet_kws argument takes extra parameters in the form of a dictionary with key-value pairs. In this case, the legend_out key controls whether the legend is outside the plot’s boundaries. The aspect argument sets the ratio of the x-axis versus the y-axis, and the marker shapes can also be modified using the markers argument with matplotlib conventions from section 3.1.2.

sns.lmplot(x='MW', y='bp', hue='type', data=bp, markers=['o', '^'], 
           aspect=1.5, facet_kws={'legend_out':False})
plt.xlabel('MW, g/mol')
plt.ylabel('bp, $^o$C');
../../_images/ed4d0a521b8a627279360fc61828e1b2122b3a81a8bf4155068300aea064b8a0.svg

10.3 Categorical Plots#

Categorical plots contain one axis of continuous values and one axis of discrete or categorical values. For example, if the density of three metals were measured repeatedly in lab, we would want to plot measured density (continuous) with respect to metal identity (categorial). Below are a few fictitious laboratory measurements for the densities of copper, iron, and zinc.

Table 2 Density (g/mL) Measurements for Different Metals

Cu

Fe

Zn

8.51

7.95

6.79

9.49

7.53

7.06

8.48

8.09

7.96

9.40

7.44

7.06

8.83

8.38

6.69

9.45

7.83

7.21

8.73

6.88

7.35

9.00

7.90

6.65

8.84

8.51

7.41

9.32

7.89

7.89

If we want to compare these values, the density can be plotted on the y-axis and metal on the x-axis. First, we need to load the values into a DataFrame.

labels = ['Cu', 'Fe', 'Zn']
densities = [[8.51, 7.95, 6.79],
             [9.49, 7.53, 7.06],
             [8.48, 8.09, 7.96],
             [9.40, 7.44, 7.06],
             [8.83, 8.38, 6.69],
             [9.45, 7.83, 7.21],
             [8.73, 6.88, 7.35],
             [9.00, 7.90, 6.65],
             [8.84, 8.51, 7.41],
             [9.32, 7.89, 7.89]]
df = pd.DataFrame(densities, columns=labels)
df.head()
Cu Fe Zn
0 8.51 7.95 6.79
1 9.49 7.53 7.06
2 8.48 8.09 7.96
3 9.40 7.44 7.06
4 8.83 8.38 6.69

10.3.1 Strip Plot#

The simplest categorical plot function is stripplot() which generates a scatter plot with the x-axis as the categorical dimension and the y-axis as the continuous value dimension. By providing the function with the DataFrame, it will assume the columns are the categories.

sns.stripplot(data=df);
../../_images/d2f06942bd3d4762a84daf8ed5dc5a02cdd7a4e90bb716014b1ae170eb1d6377.svg

By default, the x-axis contains the column labels from the DataFrame, but the y-axis is without any label. Again, one of the conveniences of the seaborn library is that it is built on top of matplotlib, so any plot created by seaborn can be further modified by matplotlib commands as shown below.

sns.stripplot(data=df)
plt.ylabel('Density, g/mL')
plt.xlabel('Metals');
../../_images/6508ceefbe8438dfe8668442812a66f4c830f33ba390dd705e12eeb63c1671f4.svg

10.3.2 Swarm Plot#

While the plots above are elegantly simple, they can make it difficult to accurately interpret the data when multiple data points are overlapping as can happen with larger numbers of data points. This obscures the quantity of points in various regions. One plot that alleviates this issue is the swarm plot which is almost identical to the strip plot except that points are no permitted to overlap to make the quantity more apparent.

sns.swarmplot(data=df)
plt.ylabel('Density, g/mL')
plt.xlabel('Metals');
../../_images/0b43641d3ef106f72f8c06b9d21b94b9eb89fe8dbf5d2195d8fc886f2c029999.svg

10.3.3 Violin Plot#

An additional option for understanding the density of points is the violin plot. By default, this plot renders a blob with the width representing the density of points at various regions. Inside the blob are miniature box plots (discussed in the next section) that provide more information about the distribution of data points.

sns.violinplot(data=df)
plt.ylabel('Density, g/mL')
plt.xlabel('Metals');
../../_images/4db94cfd7cbe5396e61225dfc3efec366e3ec005ef90adde5c320a6676e509f6.svg

10.3.4 Box Plot#

The box plot is a classic plot in statistics for representing the distribution of data and can be easily generated in seaborn using the boxplot() function which works much the same way as the above categorical plots. There are three main components to a box plot. The center box contains lines marking the 25, 50, and 75 percentile regions. For example, the 75 percentile line is where 75% of the data points are below. The 50 percentile is also known as the median. The length of the box (i.e., from 25 percentile to 75 percentile) is known as the inner quartile range (IQR). Beyond the box are the bars known as whiskers which mark the range of the rest of the data points up to 1.5x the IQR. If a data point is beyond 1.5x the IQR, it is an outlier and is explicitly represented with a spot (Figure 1)

Figure 1 A box plot is composed of a box with lines at the 25\(^{th}\), 50\(^{th}\), and 75\(^{th}\) percentiles and whiskers that extend out to the rest of the non-outlier data points. If a data point is greater than 1.5 \(\times\) the inner quartile range from the 25\(^{th}\) or 75\(^{th}\) percentiles, it is an outlier represented by a dot.

sns.boxplot(data=df)
plt.ylabel('Density, g/mL')
plt.xlabel('Metals');
../../_images/c1bcd1a7ebc29dc190a53edbce5f9247fcac2cab3f33d1277e25b63086b06a99.svg

10.3.5 Count Plot#

The count plot represents the frequency of values for different categories. This is similar to a histogram plot except that a histogram’s x-axis is a continuous set of values while a count plot’s x-axis is made up of discrete categories. The countplot() function accepts a raw collection of responses, tallies them up, and plots them as a labeled bar plot. For example, if we have a data set of all the chemical elements up to rutherfordium (Rf) and their physical state under standard conditions, the function accepts the list of their physical states, counts them, and generates the plot.

elem = pd.read_csv('data/elements_data.csv')
elem.head()
symbol AN row block state
0 H 1 1 s gas
1 He 2 1 s gas
2 Li 3 2 s solid
3 Be 4 2 s solid
4 B 5 2 p solid
sns.countplot(x='state', data=elem);
../../_images/64bf370f27dad6ddddfd3e8c72c5fb85be35103b0b255f741e2b21b46a1a9313.svg

Like many plotting types in seaborn, the count plot can be further customized through keyword arguments and using other available data. One shortcoming of the above plot is that the states are listed in the order they first appear in the data set instead of based on disorder. We can assert a different order by providing the order argument as a list of how the states should appear.

sns.countplot(x='state', data=elem, order=['gas', 'liquid', 'solid']);
../../_images/47de04a297cceafea00201cb5c96d65535e2668c4e054f9626717a52acc67f5e.svg

We can also set the color of each bar based on the valence orbital block by providing the hue argument with the name of the column.

sns.countplot(x='state', hue='block', data=elem);
../../_images/fb49ed6012b69bf9a0709681950ae26c68fc5bdc053c8b2ffbf7565ed601f98c.svg

10.4 Distribution Plots#

Seaborn provides a set of plotting types that represent the distribution of data. These are essentially extensions of the histogram plot but with extra features like additional dimensions, kernel density estimates, and generating grids of histogram plots.

10.4.1 histplot#

The histplot() function is one of the most basic distribution plotting functions in seaborn. This function is similar to the matplotlib plt.hist() function except that seaborn brings a few extra options like setting the color (hue=) based on a particular column of data.

To demonstrate this, we will use the results of a one-dimensional stochastic diffusion simulation. During the individual steps of this simulation, each of a thousand simulated molecules are either moved to the right one unit, to the left one unit, or not moved at all. A random number generator dictates this movement as demonstrated below.

loc = np.zeros(1000)   # locations of molecules
for step in range(1000):
    loc += rng.integers(-1, high=2, size=1000)
sns.histplot(loc)
plt.xlabel('Location')
plt.ylabel('Fraction of Molecules');
../../_images/191330b3170e8c60a0e9f4a1560eb6ad6beababb0414f4029b13a24b85d9c5d5.svg

10.4.2 kde Plot#

The kdeplot() function is very similar to the histplot() except that it fits the histogram with a kernel density estimates (kde) curve. This curve is basically just a smoothed curve over the data to help visualize the overall trend.

sns.kdeplot(loc)
plt.xlabel('Location')
plt.ylabel('Fraction of Molecules')
plt.tight_layout()
../../_images/473f74632b6213938c0f40cac2c696d9577851f833b208c3508a9873cffaeafc.svg

10.4.3 jointplot (diffusion simulation)#

A joint plot can be described as a scatter plot with histograms on the sides providing additional information or clarification on the density of the data points. To demonstrate this, below is a two dimensional stochastic diffusion simulation and the results. The principles are the same as above except applied to two dimensions.

x = np.sum(rng.integers(-1, high=2, size=(5000,7000)), axis=0)
y = np.sum(rng.integers(-1, high=2, size=(5000,7000)), axis=0)
plt.plot(x, y, '.')
plt.axis('equal');
../../_images/b68159374aaa467914317e6fecf74bc4771736a97f0a2fe3e696e07cbdd585c9.svg

One of the issues with this plot is that there are so many data points in the plot that it is difficult to determine the distribution/density inside the blanket of solid dots. The seaborn joint plot adds histograms to the side to help the viewer recognize where most of the data points reside.

The joint plot function, sns.jointplot(), takes two required arguments of the x and y variables. While this function does not require the use of pandas or a DataFrame, it is convenient because the axis labels are pulled directly from the column headers.

df = pd.DataFrame(data={'X Distance, au': x, 'Y Distance, au': y})

sns.jointplot(x=df['X Distance, au'], y=df['Y Distance, au'], 
              height=7, color='C0', joint_kws={'s':10})
<seaborn.axisgrid.JointGrid at 0x130e35e10>
../../_images/db00be705e6d546b047533db3bcc05042f895043b46a98325e8ab0475a9bdbd4.svg

There are numerous arguments to fine tune the joint plot. For example, the joint plot does not need to be a scatter plot with histograms. The density of the data points can be represented with hexagonal patches or kernel density estimates (kde). The latter represents the density of points through contours and is a reoccurring option in other plotting functions in the seaborn library. It is worth noting that the kde plotting types take a little time to calculate, so expect a brief delay in generating these plots.

sns.jointplot(x=df['X Distance, au'], y=df['Y Distance, au'], kind='hex');
../../_images/06d14e51e4e34bec3a7674dac072ae9fa4788c10bf0a37cf71a692f460ebf073.svg
sns.jointplot(x=df['X Distance, au'], y=df['Y Distance, au'], kind='kde');
../../_images/38b6d19d9294e4f126379db731b51692e0cdfa974ab791e76827224beda50ba9.svg

10.5 Pair Plot#

The pair plot belongs to the category of distribution plots, but it is different enough to be worth addressing separately. A pair plot is designed to show the relationship among multiple variables by generating a grid of plots in a single figure. Each plot in the grid is a scatter plot showing the relationship between two of the variables on either axis with the exception of the plots in the diagonals. Because the diagonal plots are the intersection between a variable and itself, these are histograms showing the distributions of values for that variable. Pair plots are particularly useful for looking at new data to see if there are any trends worth investigating because this entire grid can be easily generated with a single sns.pairplot() function.

To demonstrate a pair plot, the file periodic_trends.csv contains physical data on non-noble gas elements in the first three rows of the periodic table. To quickly see how each of the columns of data relate to each other, we will generate a pair plot.

per = pd.read_csv('data/periodic_trends.csv')
per.head()
symbol AN EN row IE_kJ radius_pm
0 H 1 2.1 1 1310 38
1 Li 3 1.0 2 520 134
2 Be 4 1.5 2 900 90
3 B 5 2.0 2 800 82
4 C 6 2.5 2 1090 77
per.drop(['AN', 'symbol'], axis=1, inplace=True)
per.head()
EN row IE_kJ radius_pm
0 2.1 1 1310 38
1 1.0 2 520 134
2 1.5 2 900 90
3 2.0 2 800 82
4 2.5 2 1090 77
sns.pairplot(per);
../../_images/c227e0d25affbd880a05b6827fdcff906ce64383a2d9fa53ed5b5a08c8c4e935.svg

The color can also be set based on any piece of information. Below, the row is used to dictate the color of each data point.

sns.pairplot(per, hue='row', palette='tab10');
../../_images/32023ea14ffae2d8d7c8bf52cba9697ef05d89b5f5b18782739dca4fabd11aca.svg

10.6 Heat Map#

Heat maps are color representations of 2D grids of numerical data and are ideal for making large tables of values easily interpretable. As an example, we can import a table of bond dissociation energies (in kJ/mol) and visualize these data as a heat map. In the following pandas function call, the index_col=0 tells pandas to apply the first column as column headers as well.

bde = pd.read_csv('data/bond_enthalpy_kJmol.csv', index_col=0)
bde
H C N O F
H 436 415 390 464 569
C 415 345 290 350 439
N 390 290 160 200 270
O 464 350 200 140 160
F 569 439 270 160 160

This grid of numerical values is difficult to quickly interpret, and if it were a larger table of data, it could become almost impossible to interpret in this form. We can plot the heat map using the heatmapt() function and feeding it the DataFrame. The function also accepts NumPy arrays, but without the index and column labels of a DataFrame, the axes will not be automatically labeled.

sns.heatmap(bde);
../../_images/c3438cfd178ffe49ef72712640b5584b47d9bf63877525fa2e0fb8301c8f2f85.svg

Now we have a color grid where the colors represent numerical values defined in a colorbar automatically displayed on the righthand side. This default color map can easily be customized through various arguments in the heatmap() function. One nice addition is to display the numerical values on the heat map by setting annot=True. If you choose to annotate the rectangles, you may need to use the fmt= parameter to dictate the format of the annotation labels. Some common formats are d for decimal, f for floating point, and .2f gives two places after the decimal place in a floating point number. If you want a different color map, this can be set using the cmap argument and any matplotlib colormap you want. Below, the annotation is turned on with the perceptually uniform viridis colormap. To further customize the colorbar, use the cbar_kws= argument that takes a dictionary of parameters found on the matplotlib website. For example, to add a label, use the label key and the label text is the dictionary value as shown below.

sns.heatmap(bde, annot=True, fmt='d', cmap='viridis',
           cbar_kws={'label':'Bond Enthalpy, kJ/mol'});
../../_images/d10c58eb5f2dfe3103dbcecd7faaf78b391833b33ead23f7e5117a76c70bfaac.svg

10.7 Relational Plots#

Relational plots are a new addition to the seaborn library as of version 0.9 and include seaborn’s functions for scatter and line plots. Of course, matplotlib does a nice job making scatter and line plots reasonably easy, but seaborn offers a few extra ease-of-use improvements upon matplotlib that may be worth something to you depending upon your needs.

10.7.1 Scatter Plots#

One difference between seaborn and matplotlib in generating scatter and line plots is that seaborn allows the user to change the color, size, and marker styles of individual markers based on numerical values or text data. Matplotlib can also change the color and size of the markers but only based on numerical values, and to change the marker style, the plt.scatter() function needs to be called a second time. Seaborn allows this whole process in a single function call.

Below, we are using the periodic trends data (per) imported in section 10.5. We can start with plotting the electronegativity (EN) versus the atomic radius (radius_pm) using the sns.scatterplot() function which takes many of the same basic arguments as plots we have seen so far with seaborn.

sns.scatterplot(x='radius_pm', y='EN', data=per);
../../_images/6530029bf45242b14ec5ca65547cfbae9df3b2a9465d6db7506a649f2a9ac801.svg

To modify the color, size, and marker style of the data points, use the hue, size, and marker arguments. This allows additional information to be infused into a single plot. Note that the legend automatically appears on the plot. In addition, the colormap for the plot can be modified using the palette keyword argument and the name of any matplotlib colormap.

sns.scatterplot(x='radius_pm', y='EN', data=per,  hue='IE_kJ', 
                size='IE_kJ', style='row', palette='winter');
../../_images/e8e1913437b2f17caa87a1d166c447dd12af2ba0ee1b2240fe329380f789f591.svg

10.7.2 Line Plots#

The lineplot() function in seaborn is somewhat similar to the plt.plot() function in matplotlib except it also includes a number of extra features similar to those seen in other seaborn plotting functions. This includes the ability to change the plotting color and style based on additional information, easy visualization of confidence intervals, automatic generation of a legend, and others. To demonstrate the lineplot() function, we will import simulated kinetic data for a first-order chemical reaction run seven times (i.e., runs 0 \(\rightarrow\) 6).

kinetics = pd.read_csv('data/kinetic_runs.csv')
kinetics.head()
time [A] run [P]
0 0.000000 0.956279 0.0 0.043721
1 10.526316 0.636978 0.0 0.363022
2 21.052632 0.355690 0.0 0.644310
3 31.578947 0.161173 0.0 0.838827
4 42.105263 0.157420 0.0 0.842580
sns.lineplot(x='time', y='[A]', data=kinetics, hue='run', palette='viridis')
plt.xlabel('Time, s')
plt.ylabel('[A], M');
../../_images/8b954ef430a72923a8b63ea88f1f1172d2eaa7d4e07f6be0b1c93e05d5569ead.svg

The [A] was plotted versus Time, and the hue of each line was set to the Run number. The result is that each kinetic run is shown in a separate color. If the user is not concerned so much with seeing the individual runs but instead wants to see an average of each all the runs with some indication of the variation, the lineplot() function provides a default 95% confidence interval as is shown below.

sns.lineplot(x='time', y='[A]', data=kinetics)
plt.xlabel('Time, s')
plt.ylabel('[A], M');
../../_images/71b1c007d074623eb0f413663a0bccadf7fc260191bce931d88ecf240f1c7764.svg

A confidence interval is only shown if there are multiple data points for each time. The confidence intervals can also be represented with error bars by setting err_style = 'bars'.

sns.lineplot(x='time', y='[A]', data=kinetics, err_style='bars')
plt.xlabel('Time, s')
plt.ylabel('[A], M');
../../_images/a617fe8ec5d358a7e3e3bf1a31ed1d2d5fe119c2d68c13d39e6e856743a45cc2.svg

10.8 Internal Data Sets#

Similar to a number of other Python libraries, seaborn brings with it data sets for users to experiment with. These are callable using the sns.load_dataset() function with the name of the data set as the argument. Below is a table describing a few of the available Seaborn data sets. This list may change, so you can use the sns.get_dataset_names() to see the most current list.

Table 3 A Few Data Sets Available in Seaborn

Name

Description

anscombe

Anscombe’s quartet data with four artificial data sets that exhibit the same mean, standard deviation, and linear regression among other statistical descriptors

car_crashes

Data on car crashes including mph above the speed limit among other information

exercise

Diet and exercise data

flights

Aircraft flight information including year, month, and number of passengers

iris

Ronald Fisher’s famous iris data set used frequently in machine learning classification examples

planets

Information on discovered planets

tips

Restaurant information including bill total, tip, and information about the client

titanic

Titanic survivor data set

Further Reading#

  1. Seaborn Website. https://seaborn.pydata.org/ (free resource)

Exercises#

Complete the following exercises in a Jupyter notebook and seaborn library. Any data file(s) refered to in the problems can be found in the data folder in the same directory as this chapter’s Jupyter notebook. Alternatively, you can download a zip file of the data for this chapter from here by selecting the appropriate chapter file and then clicking the Download button.

  1. Import the file linear_data.csv and visualize it using a regression plot.

  2. Import the file titled ir_carbonyl.csv and visualize the carbonyl stretching frequencies using a seaborn categorical plot. Represent the different molecules with different colors.

  3. Import the file titled ir_carbonyl.csv containing carbonyl stretches of ketones and aldehydes.

    a) Separate the ketones and aldehydes values into individual Series.

    b) Visualize the distribution of both ketone and aldehyde carbonyl stretches using a kde plot.

  4. Import the elements_data.csv file and generate a count plot showing the number of elements in each block of the periodic table (i.e., s, p, d, f).

  5. The following equation is Plank’s law which describes the relationship between the radiation intensity(\(M\)) with respect to wavelength (\(\lambda\)) and temperature (\(T\)).

    \[ M = \frac{2 \pi hc^2}{\lambda ^5 (e^{hc/\lambda kT} - 1)} \]

    Import the data called blackbody.csv containing intensities at various temperatures and wavelengths based on Plank’s law. Generate a plot of intensity versus wavelength using the lineplot() function, and display the different temperatures as different colors.

  6. Import the file ionization_energies.csv showing the first four ionization energies for a number of elements. Plot this grid of data as a heat map. Include labels in each cell using the annot= argument.

  7. Import the file ROH_data_small.csv and plot visualize how boiling point (bp), molecular weight (MW), degree, and whether a compound is aliphatic are correlated using a pairplot.

  8. The following code generates the radial probability plot for hydrogen atomic orbitals for n = 1-4 (see section 3.1) and determines the radius of maximum probability (see section 6.1.1). These values are combined into a pandas DataFrame called max_prob where the rows are the principle quantum numbers and columns are the angular quantum numbers. Display the DataFrame using a heatmap. Your heatmap should include numerical labels on each colored block on the heatmap, and you should select a non-default, perceptually uniform colormap for your colormap.

import numpy as np
import pandas as pd
import sympy
from sympy.physics.hydrogen import R_nl

R = sympy.symbols('R')
r = np.arange(0,60,0.01)

max_radii = []

for n in range(1,5):
    shell_max_radii = []
    for l in range(0, n):
        psi = R_nl(n, l, R)
        f = sympy.lambdify(R, psi, 'numpy')
        max = np.argmax(f(r)**2 * r**2)
        shell_max_radii.append(max/100)
    max_radii.append(shell_max_radii)    
    
columns, index = (0,1,2,3), (4,3,2,1)
max_prob = pd.DataFrame(reversed(max_radii), columns=columns, index=index)
max_prob