Chapter 5: Pandas#
While NumPy is the foundation of much of the SciPy ecosystem and provides very capable ndarray objects, it has a few missing features. The first is that NumPy arrays cannot hold different types of objects in a single array. For example, if we attempt to convert the following list containing integers, floats, and strings into an array, NumPy converts all elements into strings as a way of making the object types uniform.
nums = [1, 2, 3, 'four', 5, 'six', 7.0]
import numpy as np
np.array(nums)
array(['1', '2', '3', 'four', '5', 'six', '7.0'], dtype='<U32')
The second shortcoming is that NumPy arrays do not have strong support for labels in the data. That is, you might want to label rows and columns to describe what they represent, as you might see in a well-constructed spreadsheet. While there is some support for this in NumPy, it is not as strong as pandas’ support. Finally, while NumPy contains a wealth of basic tools for working with data, there are still many operations that it does not support, such as grouping data based on the values of a particular column or merging two data sets with automatic alignment of related data.
To fill in these missing features, the pandas library provides a wealth of additional tools on top of NumPy for working with data and, possibly its most endearing feature, the ability to access data based on labels. That is, data columns and rows can contain human-readable labels that are used to access the data. Pandas still supports accessing data by numerical index if the user wishes to go that route, but the user can now access data without knowing which column it is in as long as the user knows the column label.
By popular convention, the pandas library is imported with the pd alias, which is used here. This chapter assumes the following imports.
import pandas as pd
import matplotlib.pyplot as plt
5.1 Basic Pandas Objects#
To support this wealth of features, pandas uses its own objects to hold data, called the Series and the DataFrame, which are built on NumPy arrays. Because they are built on NumPy, many NumPy functions (e.g., np.mean()) work on pandas objects. The key difference between a Series and a DataFrame is that a Series is one-dimensional while a DataFrame is two-dimensional. Unlike a NumPy array, pandas objects have fixed dimensionality. There was a three-dimensional object called a Panel, but it will not be covered here as it was not often used and has been removed from recent versions of pandas.
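Because pandas objects are built on NumPy arrays, NumPy functions apply to them directly. A minimal sketch with a hypothetical Series:

```python
import numpy as np
import pandas as pd

# A hypothetical Series; NumPy functions operate on it directly
s = pd.Series([1.0, 2.0, 3.0, 4.0])
avg = np.mean(s)     # 2.5
roots = np.sqrt(s)   # element-wise square roots, returned as a new Series
```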
5.1.1 Series#
While the pandas Series is restricted to a single dimension, it can be as long as necessary to hold the data. A Series containing the atomic masses of the first five elements on the periodic table is generated below using the pd.Series() function. Note that Series is always capitalized.
mass = pd.Series([1.01,4.00,6.94,9.01,10.81])
mass
0 1.01
1 4.00
2 6.94
3 9.01
4 10.81
dtype: float64
The right column is the actual data in the Series while the values on the left are the assigned indices for each value in the Series. The index column is not part of the dimensionality of the Series; it is metadata (i.e., data about the data). Think of the numbers as the row labels you would see in a traditional spreadsheet software application.
Consistent with lists, tuples, and ndarrays, values in a Series can be accessed using indexing with square brackets as demonstrated below.
mass[2]
np.float64(6.94)
Unlike most other multi-element objects seen so far, data in a Series can be accessed using indices different from the default (i.e., 0, 1, 2, etc.) values. That is, custom row labels can be assigned using the index= argument shown below.
index=('H', 'He', 'Li', 'Be', 'B')
mass2 = pd.Series([1.01,4.00,6.94,9.01,10.81], index=index)
mass2
H 1.01
He 4.00
Li 6.94
Be 9.01
B 10.81
dtype: float64
The custom indices can now be used to access an element in a Series. This makes a Series behave something like a dictionary (section 2.2).
mass2['He']
np.float64(4.0)
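Because a Series maps labels to values much like a dictionary, it can also be constructed directly from one. A short sketch with a hypothetical mass3 Series (the dictionary keys become the index):

```python
import pandas as pd

# Hypothetical example: the dictionary keys become the index labels
mass3 = pd.Series({'H': 1.01, 'He': 4.00, 'Li': 6.94})
mass3['Li']   # 6.94
```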
The indices can be accessed using mass2.index. Series indices can also be modified after a Series has been created by using .index and assignment, as demonstrated below.
mass2.index
Index(['H', 'He', 'Li', 'Be', 'B'], dtype='object')
mass.index = ['H', 'He', 'Li', 'Be', 'B']
mass
H 1.01
He 4.00
Li 6.94
Be 9.01
B 10.81
dtype: float64
Even if we create or modify a Series to have custom indices, we can still access the elements by their default numerical positions using the iloc[] indexer. This allows the user to access elements the same way as in a NumPy array regardless of custom index values.
mass2.iloc[2]
np.float64(6.94)
5.1.2 DataFrame#
Most data you will find yourself working with will be best placed in a two-dimensional pandas object called a DataFrame, which is always written with two capital letters. A DataFrame is similar to a Series except that there are now also named columns. The columns can be accessed by column names and rows can be accessed by indices. You might think of a DataFrame as a collection of Series objects. Below, a DataFrame is constructed to hold the names, atomic numbers, masses, and ionization energies of the first five elements.
name = ['hydrogen', 'helium', 'lithium', 'beryllium','boron']
AN = [1,2,3,4,5]
mass = [1.01,4.00,6.94,9.01,10.81]
IE = [13.6, 24.6, 5.4, 9.3, 8.3]
columns = ['H', 'He', 'Li', 'Be','B']
index = ['name', 'AN', 'mass', 'IE']
elements = pd.DataFrame([name, AN, mass, IE],
columns=columns, index=index)
elements
H | He | Li | Be | B | |
---|---|---|---|---|---|
name | hydrogen | helium | lithium | beryllium | boron |
AN | 1 | 2 | 3 | 4 | 5 |
mass | 1.01 | 4.0 | 6.94 | 9.01 | 10.81 |
IE | 13.6 | 24.6 | 5.4 | 9.3 | 8.3 |
To access data in a DataFrame, place the column name in square brackets.
elements['Li']
name lithium
AN 3
mass 6.94
IE 5.4
Name: Li, dtype: object
Essentially, what we get out of a column is a Series with the indices shown on the left-hand side. To access a row instead, use the loc[] indexer. We again get a Series, with indices derived from the column names of the source DataFrame. This Series can be placed in a variable and indexed just like in section 5.1.1.
elements.loc['IE']
H 13.6
He 24.6
Li 5.4
Be 9.3
B 8.3
Name: IE, dtype: object
atomic_number = elements.loc['AN']
atomic_number['B']
5
Alternatively, we can use the DataFrame directly and index it with loc[] as [row, column].
elements.loc['IE', 'Li']
5.4
Numerical index values can also be used with the iloc[] indexer. This reduces indexing to how NumPy arrays are indexed.
elements.iloc[2:, 2]
mass 6.94
IE 5.4
Name: Li, dtype: object
A summary of the methods of indexing pandas Series and DataFrames is presented below in Table 1.
Table 1 Summary of Pandas Indexing
Index Method | Description
---|---
ser['label'] | Index Series with assigned index values
ser.iloc[num] | Index Series with default numerical index values
df['col'] | Index DataFrame with column name
df.loc['row'] | Index DataFrame with row name
df.loc['row', 'col'] | Index DataFrame with row and column names
df.iloc[row, col] | Index DataFrame with row and column default numerical index values
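The indexing forms summarized in Table 1 can be exercised together. A minimal sketch with small hypothetical ser and df objects mirroring the examples in this section:

```python
import pandas as pd

# Small hypothetical objects to exercise each form of indexing
ser = pd.Series([1.01, 4.00, 6.94], index=['H', 'He', 'Li'])
df = pd.DataFrame({'H': ['hydrogen', 1], 'He': ['helium', 2]},
                  index=['name', 'AN'])

ser['He']            # assigned index value -> 4.0
ser.iloc[1]          # default numerical index -> 4.0
df['He']             # column name -> a Series
df.loc['AN']         # row name -> a Series
df.loc['AN', 'He']   # row and column names -> 2
df.iloc[1, 1]        # numerical row and column -> 2
```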
5.2 Reading/Writing Data#
Similar to NumPy, pandas contains multiple, convenient functions for reading/writing data directly to and from its own object types, and each function is suited to a specific file format. This includes CSV, HTML, JSON, SQL, Excel, and HDF5 files among others.
Table 2 Import/Export Functions in Pandas
Function | Description
---|---
pd.read_csv() / df.to_csv() | Imports/Exports data from/to a CSV file
pd.read_table() | General-purpose importer/exporter
pd.read_hdf() / df.to_hdf() | Imports/Exports data from/to an HDF5 file
pd.read_clipboard() / df.to_clipboard() | Transfers data to/from the clipboard to a Series or DataFrame
pd.read_excel() / df.to_excel() | Reads/writes an Excel file
5.2.1 General-Purpose Delimited File Reader#
Before we move to more well-defined file formats, pandas provides a general-purpose file reader, pd.read_table(). This function imports text files where lines represent rows and the data in each row are separated by characters or spaces. The user can designate what character(s) separate the data by using the delimiter or sep arguments (they do the same thing). To treat any run of whitespace as a delimiter, use sep='\s+'. The function also includes a series of other arguments listed below in Table 3.
Table 3 More pd.read_table()
Arguments
Argument | Description
---|---
sep | Data separator; default is tab
delimiter | Data separator; default is tab (alias of sep)
skiprows | Number of rows at the top of the file to skip before reading data
skipfooter | Number of rows at the bottom of the file to skip
header | Row number to use for a data header; also accepts None if no header is provided in the file
As an example, we can use this function to read a calculated PDB file of benzene and extract the \(xyz\) coordinates for each atom. This particular file type, shown below, is strictly formatted based on position in a line, but because all the data columns here have spaces between them, we can use whitespace delimitation by setting sep='\s+'. Because the data do not start until the third line and we do not need the last thirteen lines of the file, we exclude these rows with skiprows and skipfooter. We set header=None because we do not want the function to treat the first line of data as a header or data label.
HEADER
REMARK
HETATM 1 H UNK 0001 0.000 0.000 -0.020
HETATM 2 C UNK 0001 0.000 0.000 1.067
HETATM 3 C UNK 0001 0.000 0.000 3.857
HETATM 4 C UNK 0001 0.000 -1.208 1.764
HETATM 5 C UNK 0001 0.000 1.208 1.764
HETATM 6 C UNK 0001 0.000 1.208 3.159
HETATM 7 C UNK 0001 0.000 -1.208 3.159
HETATM 8 H UNK 0001 0.000 -2.149 1.221
HETATM 9 H UNK 0001 0.000 2.149 1.221
HETATM 10 H UNK 0001 0.000 2.149 3.703
HETATM 11 H UNK 0001 0.000 -2.149 3.703
HETATM 12 H UNK 0001 0.000 0.000 4.943
CONECT 1 2
CONECT 2 1 5 4
CONECT 3 6 7 12
CONECT 4 7 2 8
CONECT 5 2 6 9
CONECT 6 5 3 10
CONECT 7 3 4 11
CONECT 8 4
CONECT 9 5
CONECT 10 6
CONECT 11 7
CONECT 12 3
END
benz = pd.read_table('data/benzene.pdb', sep='\s+',
skiprows=2, skipfooter=13, header=None,
engine='python')
benz
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
---|---|---|---|---|---|---|---|---|
0 | HETATM | 1 | H | UNK | 1 | 0.0 | 0.000 | -0.020 |
1 | HETATM | 2 | C | UNK | 1 | 0.0 | 0.000 | 1.067 |
2 | HETATM | 3 | C | UNK | 1 | 0.0 | 0.000 | 3.857 |
3 | HETATM | 4 | C | UNK | 1 | 0.0 | -1.208 | 1.764 |
4 | HETATM | 5 | C | UNK | 1 | 0.0 | 1.208 | 1.764 |
5 | HETATM | 6 | C | UNK | 1 | 0.0 | 1.208 | 3.159 |
6 | HETATM | 7 | C | UNK | 1 | 0.0 | -1.208 | 3.159 |
7 | HETATM | 8 | H | UNK | 1 | 0.0 | -2.149 | 1.221 |
8 | HETATM | 9 | H | UNK | 1 | 0.0 | 2.149 | 1.221 |
9 | HETATM | 10 | H | UNK | 1 | 0.0 | 2.149 | 3.703 |
10 | HETATM | 11 | H | UNK | 1 | 0.0 | -2.149 | 3.703 |
11 | HETATM | 12 | H | UNK | 1 | 0.0 | 0.000 | 4.943 |
The \(x\), \(y\), and \(z\) data are in columns 5, 6, and 7, respectively, and can be extracted by indexing as discussed in section 5.1.2.
5.2.2 Comma Separated Values Files#
Pandas provides a collection of more format-specific functions for reading/writing files. Possibly the most popular is the CSV file because it is simple and many scientific instruments support exporting data in this format. To import a CSV file, we will use the pd.read_csv() function. This function is very similar to read_table() except that the default separator/delimiter is a comma. To create a CSV file, use the to_csv() method, which at a minimum requires the file name.
We can write the chemical element data assembled in section 5.1 as shown below. Because we are starting from a pandas object and are using a pandas method, the df.to_csv() format is used, where df is a DataFrame.
elements.to_csv('elements.csv')
If we check the directory containing the Jupyter notebook, the data folder contains a file titled elements.csv that looks like the following. Each row in the DataFrame is a different line in the file, and every column is separated by a comma.
,H,He,Li,Be,B
name,hydrogen,helium,lithium,beryllium,boron
AN,1,2,3,4,5
mass,1.01,4.0,6.94,9.01,10.81
IE,13.6,24.6,5.4,9.3,8.3
To read the data back in from the file, use pd.read_csv(). Because we are not starting with a pandas object, the function is called using the pd.function() format.
pd.read_csv('data/elements.csv')
Unnamed: 0 | H | He | Li | Be | B | |
---|---|---|---|---|---|---|
0 | name | hydrogen | helium | lithium | beryllium | boron |
1 | AN | 1 | 2 | 3 | 4 | 5 |
2 | mass | 1.01 | 4.0 | 6.94 | 9.01 | 10.81 |
3 | IE | 13.6 | 24.6 | 5.4 | 9.3 | 8.3 |
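Notice that the row labels came back as an unnamed data column. The index_col argument of read_csv() tells pandas to use the first column as the index instead. A small round-trip sketch with a cut-down, hypothetical DataFrame:

```python
import pandas as pd

# Round trip: write a DataFrame to CSV, then read it back with
# index_col=0 so the first CSV column becomes the row labels
# instead of an "Unnamed: 0" data column
elements = pd.DataFrame({'H': [1.01], 'Li': [6.94]}, index=['mass'])
elements.to_csv('elements_demo.csv')
restored = pd.read_csv('elements_demo.csv', index_col=0)
restored.loc['mass', 'Li']   # 6.94
```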
5.2.3 Excel Notebook Files#
Pandas provides another useful function that imports Excel notebook files (i.e., .xls or .xlsx). Excel files are a specialized file type that requires the support of additional libraries, known as dependencies, that pandas does not install by default. A list of these dependencies is provided on the pandas website. You can either install each dependency yourself, or use the shortcut pandas provides (for pandas version 2.0.0 and later) of pip install "pandas[excel]", which is run in the Terminal window (see section 0.2 for Terminal instructions). However, please check the pandas website for the full and most current instructions as things may have changed. Because Excel files can contain multiple sheets, this function is a little more complicated to use. The simplest way to import an Excel file is to use pd.read_excel() and provide it with the Excel file name.
pd.read_excel('data/test.xls')
x | y | |
---|---|---|
0 | 1 | 1 |
1 | 2 | 4 |
2 | 3 | 9 |
3 | 4 | 16 |
4 | 5 | 25 |
5 | 6 | 36 |
6 | 7 | 49 |
In the above example, pandas loads the first sheet in the file, which is the default behavior. If you want to access a different sheet in the file, you can specify this by using the sheet_name keyword argument. If you do not know the sheet name, the sheet_name argument also accepts integer index values (i.e., 0 for the first sheet and so on).
data = pd.read_excel('data/test.xls', sheet_name='Sheet2')
data
a | b | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | b.1 | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 0.841471 | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2 | 0.909297 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 3 | 0.141120 | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 4 | -0.756802 | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 5 | -0.958924 | NaN | NaN | NaN | NaN | NaN | NaN |
5 | 6 | -0.279415 | NaN | NaN | NaN | NaN | NaN | NaN |
6 | 7 | 0.656987 | NaN | NaN | NaN | NaN | NaN | NaN |
7 | 8 | 0.989358 | NaN | NaN | NaN | NaN | NaN | NaN |
8 | 9 | 0.412118 | NaN | NaN | NaN | NaN | NaN | NaN |
Alternatively, if you want to extract the sheet names, you can use the sheet_names attribute of the ExcelFile class as demonstrated below.
xl = pd.ExcelFile('data/test.xls')
xl.sheet_names
['Sheet1', 'Sheet2']
Writing to an Excel file can be done in two ways. A single sheet can be written directly with the to_excel() method, while writing multiple sheets requires two steps: generate an ExcelWriter engine and then write each sheet. The Excel writer offers more power in generating Excel files, including embedding charts, conditional formatting, coloring cells, and other tasks; but we will stick to the basics here.
data.to_excel('new_file.xlsx', sheet_name='First Sheet')
with pd.ExcelWriter('new_file.xlsx') as writer:
data.to_excel(writer, sheet_name='First Sheet')
data.to_excel(writer, sheet_name='Copy of First Sheet')
5.2.4 Computer Clipboard#
Pandas will also accept data from the computer’s copy-and-paste clipboard. Start by highlighting some data from a webpage or a spreadsheet and selecting Copy, which is typically located under the Edit menu of most software applications. Alternatively, you can type Command + C on macOS or Control + C on Windows and Linux. Finally, use the pd.read_clipboard() function to convert it to a pandas DataFrame.
pd.read_clipboard()
Loading data from the clipboard is not a robust or efficient way to do automated data analysis, but it is a very convenient method to experiment with data or to quickly grab some data off a website.
5.3 Examining Data with Pandas#
Once you load data into pandas, you will likely want to get an idea of what the data look like before you proceed to calculations and in-depth analyses. This section covers a few methods provided in pandas to gain a preliminary understanding of your data.
5.3.1 Descriptive Functions#
Pandas provides a few simple functions to view and describe new data. The first two are head() and tail(), which allow you to see the top and bottom of the DataFrame, respectively. These are particularly useful when dealing with very large DataFrames. Below, a DataFrame containing random values drawn from an even (uniform), normal, and Poisson (\(\lambda\) = 3.0) distribution demonstrates these functions.
rng = np.random.default_rng()
random = pd.DataFrame({'even': rng.random(1000),
'normal': rng.normal(size=1000),
'poisson': rng.poisson(lam=3.0, size=1000)})
random.head()
even | normal | poisson | |
---|---|---|---|
0 | 0.399306 | 0.366845 | 3 |
1 | 0.717415 | 0.240555 | 3 |
2 | 0.280823 | -1.014316 | 4 |
3 | 0.082725 | -0.249080 | 3 |
4 | 0.969771 | 0.414425 | 3 |
random.tail()
even | normal | poisson | |
---|---|---|---|
995 | 0.068204 | -0.628008 | 3 |
996 | 0.618927 | 0.232807 | 1 |
997 | 0.103016 | 0.182341 | 5 |
998 | 0.483815 | 1.177764 | 4 |
999 | 0.750867 | -0.837742 | 2 |
Pandas also provides a describe() method that returns a variety of statistics on each column. For example, the means are provided, which are approximately 0.5, 0.0, and 3.0 for the even, normal, and Poisson distributions, respectively. This is not surprising since the even distribution is centered around 0.5, the normal around 0.0, and the Poisson distribution is generated with an average of 3.0. The user is also provided with the minimum, maximum, standard deviation, and quartile boundaries.
random.describe()
even | normal | poisson | |
---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 |
mean | 0.495366 | 0.009895 | 2.965000 |
std | 0.293186 | 1.000833 | 1.688972 |
min | 0.000812 | -3.324801 | 0.000000 |
25% | 0.230925 | -0.657265 | 2.000000 |
50% | 0.505785 | 0.003473 | 3.000000 |
75% | 0.742341 | 0.651874 | 4.000000 |
max | 0.999107 | 3.143804 | 10.000000 |
Another useful method is value_counts(), which returns the number of occurrences of each unique value in a Series (or DataFrame column or row). Below, it is demonstrated on the poisson column because the other two columns have a relatively large number of unique values.
counts = random['poisson'].value_counts()
counts
poisson
3 235
2 224
4 188
1 155
5 74
6 52
0 43
7 18
8 5
10 3
9 3
Name: count, dtype: int64
Data in DataFrames can be plotted by calling the desired columns of data and feeding them into plotting functions like plt.scatter(). The data can also be visualized by using the df.plot(kind=) format, where df is the DataFrame and kind is the plot type (e.g., 'bar', 'hist', 'scatter', 'line', 'pie', etc.). However, this is just matplotlib doing the plotting and is largely redundant with other methods already covered. Below is a quick example with the counts data generated above.
counts.plot(kind='bar');
5.3.2 Broadcasted Mathematical Operations#
Because pandas is built upon NumPy arrays, mathematical operations are broadcast through Series and DataFrames. The user is able to use NumPy functions on pandas objects, and there are a number of other mathematical methods to choose from, such as those listed below.
Table 4 Broadcasted Pandas Methods
Function | Description
---|---
abs() | Absolute value
count() | Counts items
cumsum() | Cumulative sum
cumprod() | Cumulative product
mad() | Mean absolute deviation
max() | Maximum
min() | Minimum
mean() | Mean
median() | Median
mode() | Mode
std() | Standard deviation
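A short sketch demonstrating a few of these broadcasted methods on a hypothetical Series:

```python
import pandas as pd

# A hypothetical Series to demonstrate a few broadcasted methods
s = pd.Series([1.0, -2.0, 3.0, -4.0])
s.abs()      # 1.0, 2.0, 3.0, 4.0
s.cumsum()   # running total: 1.0, -1.0, 2.0, -2.0
s.mean()     # -0.5
s.max()      # 3.0
```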
5.4 Modifying DataFrames#
Now that you are able to generate DataFrames, it is useful to be able to modify them as you clean your data or perform calculations. This can be done through methods such as assignment, dropping rows and columns, and combining DataFrames or Series.
5.4.1 Insert Columns via Assignment#
Possibly the easiest method of adding a new column is through assignment. If a nonexistent column is called and assigned values, instead of returning an error, pandas creates a new column with the given name and populates it with the data. For example, the elements DataFrame below does not contain a carbon column, so the column is added when it is assigned the new data.
elements
H | He | Li | Be | B | |
---|---|---|---|---|---|
name | hydrogen | helium | lithium | beryllium | boron |
AN | 1 | 2 | 3 | 4 | 5 |
mass | 1.01 | 4.0 | 6.94 | 9.01 | 10.81 |
IE | 13.6 | 24.6 | 5.4 | 9.3 | 8.3 |
elements['C'] = ['carbon', 6, 12.01, 11.3]
elements
H | He | Li | Be | B | C | |
---|---|---|---|---|---|---|
name | hydrogen | helium | lithium | beryllium | boron | carbon |
AN | 1 | 2 | 3 | 4 | 5 | 6 |
mass | 1.01 | 4.0 | 6.94 | 9.01 | 10.81 | 12.01 |
IE | 13.6 | 24.6 | 5.4 | 9.3 | 8.3 | 11.3 |
5.4.2 Automatic Alignment#
Another important feature of pandas is the ability to automatically align data based on labels. In the above example, carbon is added to the DataFrame with the name, atomic number, atomic mass, and ionization energy in the same order as in the DataFrame. What happens if the new data are not in the correct order? If we were using NumPy, this would require the user to reorder the data. However, if each value is labeled, pandas will see to it that each is placed in the correct location.
nitrogen = pd.Series([7, 14.01, 'nitrogen', 14.5],
index=['AN', 'mass', 'name', 'IE'])
nitrogen
AN 7
mass 14.01
name nitrogen
IE 14.5
dtype: object
Data for nitrogen are placed in a Series above. Notice that the values are out of order with respect to the data in elements. However, the index labels (i.e., row labels) tell pandas what each piece of data is, and pandas will use them to determine where to place the new information.
elements['N'] = nitrogen
elements
H | He | Li | Be | B | C | N | |
---|---|---|---|---|---|---|---|
name | hydrogen | helium | lithium | beryllium | boron | carbon | nitrogen |
AN | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
mass | 1.01 | 4.0 | 6.94 | 9.01 | 10.81 | 12.01 | 14.01 |
IE | 13.6 | 24.6 | 5.4 | 9.3 | 8.3 | 11.3 | 14.5 |
The new column of nitrogen data has been added to elements
with all pieces of data residing in the correct row.
5.4.3 Dropping Columns#
When cleaning up data, you may wish to drop a column or row. Pandas provides the drop() method for this purpose. It requires the name of the column or row to be dropped, and by default it assumes a row (axis=0) is to be dropped. If you want to drop a column, change the axis with the axis=1 argument. Below, the hydrogen column is dropped from the elements DataFrame.
elements.drop('H', axis=1)
He | Li | Be | B | C | N | |
---|---|---|---|---|---|---|
name | helium | lithium | beryllium | boron | carbon | nitrogen |
AN | 2 | 3 | 4 | 5 | 6 | 7 |
mass | 4.0 | 6.94 | 9.01 | 10.81 | 12.01 | 14.01 |
IE | 24.6 | 5.4 | 9.3 | 8.3 | 11.3 | 14.5 |
elements.drop('IE', axis=0)
H | He | Li | Be | B | C | N | |
---|---|---|---|---|---|---|---|
name | hydrogen | helium | lithium | beryllium | boron | carbon | nitrogen |
AN | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
mass | 1.01 | 4.0 | 6.94 | 9.01 | 10.81 | 12.01 | 14.01 |
In the second example above, the hydrogen column is back despite being previously dropped. This is because the drop() method does not modify the original DataFrame by default. To make the change permanent, either assign the new DataFrame to a variable or add the inplace=True keyword argument to the drop() call.
There is a similar method, dropna(), that drops columns or rows containing nan values from a DataFrame. This is commonly used to remove incomplete data from a data set. The dropna() method behaves very similarly to drop(), including support for the inplace= and axis= arguments.
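A minimal sketch of dropna(), using a small hypothetical DataFrame with one missing density value:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with one missing density value
roh = pd.DataFrame({'name': ['methanol', 'ethanol', 'propanol'],
                    'density': [0.792, np.nan, 0.803]})
cleaned = roh.dropna()    # returns a copy without the incomplete row
roh.dropna(inplace=True)  # or modify roh itself
len(roh)                  # 2
```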
5.4.4 Merge#
To merge multiple DataFrames, pandas provides a merge() method. Similar to above, merge() will properly align data, but because DataFrames have multiple columns and index values to choose from, merge() can align data based on any of these values. The default behavior of merge() is to check for common columns between the two DataFrames and align the data based on those columns. As an example, below are two DataFrames containing data on various chemical compounds.
chemdata1 = [['MW', 58.08, 32.04], ['dipole', 2.91, 1.69],
['formula', 'C3H6O', 'CH3OH']]
columns=['property','acetone', 'methanol']
chmdf1 = pd.DataFrame(chemdata1, columns=columns)
chmdf1
property | acetone | methanol | |
---|---|---|---|
0 | MW | 58.08 | 32.04 |
1 | dipole | 2.91 | 1.69 |
2 | formula | C3H6O | CH3OH |
chmdata2 = [['formula', 'C6H6', 'H2O'], ['dipole', 0.00, 1.85],
['MW', 78.11, 18.02]]
chmdf2 = pd.DataFrame(chmdata2 , columns=['property', 'benzene', 'water'])
chmdf2
property | benzene | water | |
---|---|---|---|
0 | formula | C6H6 | H2O |
1 | dipole | 0.0 | 1.85 |
2 | MW | 78.11 | 18.02 |
Both DataFrames above have a property column, so the merge() function uses this common column to align all the data into a new DataFrame.
chmdf1.merge(chmdf2)
property | acetone | methanol | benzene | water | |
---|---|---|---|---|---|
0 | MW | 58.08 | 32.04 | 78.11 | 18.02 |
1 | dipole | 2.91 | 1.69 | 0.0 | 1.85 |
2 | formula | C3H6O | CH3OH | C6H6 | H2O |
If there are multiple columns with the same name, the user can specify which to use with the on keyword argument (e.g., on='property'). Alternatively, if the two DataFrames contain columns with different names that the user wants used for alignment, the user can specify which columns to use with the left_on and right_on keyword arguments.
comps1 = pd.DataFrame({'element':['Co', 'Fe', 'Cr','Ni'],
'protons': [27, 26, 24, 28]})
comps2 = pd.DataFrame({'metal':['Fe', 'Co', 'Cr', 'Ni'],
'IE': [7.90, 7.88, 6.79, 7.64]})
In the two DataFrames generated above, each contains data on cobalt, iron, chromium, and nickel; but the first DataFrame labels the metals as element while the second labels them as metal. The following merges the two DataFrames based on the values in these two columns.
comps1.merge(comps2, left_on='element',right_on='metal')
element | protons | metal | IE | |
---|---|---|---|---|
0 | Co | 27 | Co | 7.88 |
1 | Fe | 26 | Fe | 7.90 |
2 | Cr | 24 | Cr | 6.79 |
3 | Ni | 28 | Ni | 7.64 |
Notice that the values in the element and metal columns were aligned in the resulting DataFrame. To get rid of one of the redundant columns, just use the drop() method described in section 5.4.3.
comps3 = comps1.merge(comps2, left_on='element',
right_on='metal')
comps3.drop('metal', axis=1, inplace=True)
comps3
element | protons | IE | |
---|---|---|---|
0 | Co | 27 | 7.88 |
1 | Fe | 26 | 7.90 |
2 | Cr | 24 | 6.79 |
3 | Ni | 28 | 7.64 |
5.4.5 Concatenation#
Concatenation is the process of appending one DataFrame to another along a given axis. This is different from the merge() method above in that merge() aligns common data between the two DataFrames while pd.concat() blindly appends one DataFrame to another. As an example, imagine two lab groups measure the densities of magnesium, aluminum, titanium, and iron and load their results into the DataFrames below.
group1 = pd.DataFrame({'metal':['Mg', 'Al', 'Ti', 'Fe'],
'density': [1.77, 2.73, 4.55, 7.88]})
group2 = pd.DataFrame({'metal':['Al', 'Mg', 'Ti', 'Fe'],
'density': [2.90, 1.54, 4.12, 8.10]})
group1
metal | density | |
---|---|---|
0 | Mg | 1.77 |
1 | Al | 2.73 |
2 | Ti | 4.55 |
3 | Fe | 7.88 |
See what happens when these two DataFrames are concatenated.
pd.concat((group1, group2))
metal | density | |
---|---|---|
0 | Mg | 1.77 |
1 | Al | 2.73 |
2 | Ti | 4.55 |
3 | Fe | 7.88 |
0 | Al | 2.90 |
1 | Mg | 1.54 |
2 | Ti | 4.12 |
3 | Fe | 8.10 |
Notice how the two DataFrames are appended with no consideration for common values in the metal column. The default behavior is to concatenate along the first axis (axis=0), but this can be modified with the axis= keyword argument. Again, the metals are not all aligned below because they were not in the same order in the original DataFrames.
pd.concat((group1, group2), axis=1)
metal | density | metal | density | |
---|---|---|---|---|
0 | Mg | 1.77 | Al | 2.90 |
1 | Al | 2.73 | Mg | 1.54 |
2 | Ti | 4.55 | Ti | 4.12 |
3 | Fe | 7.88 | Fe | 8.10 |
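If the repeated row labels produced by concatenation are unwanted, the ignore_index argument of pd.concat() renumbers the rows. A sketch with shortened, hypothetical versions of the group DataFrames:

```python
import pandas as pd

# Shortened versions of the group DataFrames (hypothetical values)
group1 = pd.DataFrame({'metal': ['Mg', 'Al'], 'density': [1.77, 2.73]})
group2 = pd.DataFrame({'metal': ['Al', 'Mg'], 'density': [2.90, 1.54]})

# ignore_index=True discards the original row labels and renumbers 0..n-1
combined = pd.concat((group1, group2), ignore_index=True)
list(combined.index)   # [0, 1, 2, 3]
```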
For comparison, if the two DataFrames are merged instead of concatenated, pandas will align the data based on the metal column as demonstrated below. Because density appears twice as a column header, pandas deals with this by adding a suffix to differentiate the two data sets.
pd.merge(group1, group2, on='metal')
metal | density_x | density_y | |
---|---|---|---|
0 | Mg | 1.77 | 1.54 |
1 | Al | 2.73 | 2.90 |
2 | Ti | 4.55 | 4.12 |
3 | Fe | 7.88 | 8.10 |
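The default _x/_y suffixes can be replaced with more descriptive ones via merge's suffixes argument. A sketch with shortened, hypothetical versions of the group DataFrames:

```python
import pandas as pd

# Shortened versions of the group DataFrames (hypothetical values)
group1 = pd.DataFrame({'metal': ['Mg', 'Al'], 'density': [1.77, 2.73]})
group2 = pd.DataFrame({'metal': ['Al', 'Mg'], 'density': [2.90, 1.54]})

# suffixes= replaces the default _x/_y column labels
merged = pd.merge(group1, group2, on='metal', suffixes=('_grp1', '_grp2'))
list(merged.columns)   # ['metal', 'density_grp1', 'density_grp2']
```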
Further Reading#
For further resources on the pandas library, see the following. The value of the pandas website cannot be emphasized enough as it contains a large quantity of high-quality documentation and illustrative examples on using pandas for data analysis and processing.
Pandas Website. http://pandas.pydata.org/ (free resource)
VanderPlas, J. Python Data Science Handbook: Essential Tools for Working with Data, 1st ed.; O’Reilly: Sebastopol, CA, 2017, chapter 3. Freely available from the author at https://jakevdp.github.io/PythonDataScienceHandbook/ (free resource)
McKinney, W. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd ed.; O’Reilly: Sebastopol, CA, 2018.
Exercises#
Complete the following exercises in a Jupyter notebook using the pandas library. Avoid using for loops unless absolutely necessary. Any data file(s) referred to in the problems can be found in the data folder in the same directory as this chapter’s Jupyter notebook. Alternatively, you can download a zip file of the data for this chapter from here by selecting the appropriate chapter file and then clicking the Download button.
Below is a table containing the melting points and boiling points of multiple common chemical solvents.
Solvent
bp
mp
benzene
80
6
acetone
56
-95
toluene
111
-95
pentane
36
-130
ether
35
-116
ethanol
78
-114
methanol
65
-98
a) Create a Series containing the boiling points of the above solvents with the solvent names as the indices. Call the Series to look up the boiling point of ethanol.
b) Create a DataFrame that contains both the boiling points and melting points with the solvent names as the indices. Call the DataFrame to look up the melting point of benzene.
c) Access the boiling point of pentane in the DataFrame from part b using numerical indices.
Import the attached file blue1.csv containing the absorption spectrum of Blue 1 food dye using pandas.
a) Set the wavelengths as the index values.
b) Plot the absorption versus wavelength.
c) Determine the absorbance of Blue 1 at 620 nm.
Chemical Kinetics: Import the file kinetics.csv containing time series data for the conversion of A \(\rightarrow\) Product using pandas IO tools. Generate new columns for \(\ln[A]\), \([A]^{-1}\), and \([A]^{0.5}\) and determine the order of the reaction.
Import the ROH_data.csv file containing data on various simple alcohols to a DataFrame. Notice that these data are missing densities for some of the compounds.
a) Use pandas to remove any rows with incomplete information in the density column using the dropna() method. Check the DataFrame to see if it has changed.
b) Again using the dropna() method, drop the incomplete rows with the parameter inplace=True. Check to see if the DataFrame has changed.
Import the following four files containing UV-vis spectra of four food dyes, with the first column listing the wavelengths (nm) and the second column containing the absorbances. Each file contains data from 400-850 nm in 1 nm increments.
\[ red40.csv \quad green3.csv \quad blue1.csv \quad yellow6.csv \]
a) Concatenate the files into a single DataFrame with the first column as the wavelength (nm) and the other four columns as the absorbances for each dye.
b) Replace the column headers with meaningful labels.
Import the two files alcohols.csv and alkanes.csv containing the boiling points of the two classes of organic compounds with respect to the number of carbons in each compound.
a) Drop the columns containing the names of the compounds.
b) Merge the two DataFrames allowing pandas to align the two DataFrames based on carbon number.