Chapter 6: Signal & Noise#
When collecting data from a scientific instrument, a measurement is returned as a value or series of values, and these values are composed of both signal and noise. The signal is the component of interest while the noise is random instrument response resulting from a variety of sources that can include the instrument itself, the sample holder, and even the tiny vibrations of the building. For the most interpretable data, you want the largest signal-to-noise ratio possible in order to reliably identify the features in the data.
This chapter introduces the processing of signal data including detecting features, removing noise from the data, and fitting the data to mathematical models. We will be using the NumPy library in this chapter and also start to use modules from the SciPy library. SciPy, short for “scientific python,” is one of the core libraries in the scientific Python ecosystem. This library includes a variety of modules for dealing with signal data, performing Fourier transforms, and integrating sampled data among other common tasks in scientific data analysis. Table 1 summarizes some of the key modules in the SciPy library.
Table 1 Common SciPy Modules

| Module | Description | Examples |
|---|---|---|
| constants | Compilation of scientific constants | |
| fft | Fourier transform functions | |
| integrate | Integration for both functions and sampled data | |
| interpolate | Data interpolation | Section 6.4.4 |
| io | File importers and exporters | |
| linalg | Linear algebra functions | |
| optimize | Optimization algorithms | |
| signal | Signal processing functions | |
Unlike NumPy, the functions in SciPy are organized into submodules, so each SciPy module needs to be imported individually or referenced when calling a function. It is common to see specific SciPy modules imported as shown below.
from scipy import module
Alternatively, you can import a single function from a module.
from scipy.module import function
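For example, the signal processing functions used later in this chapter could be imported in either style; the two lines below are only an illustration of the two conventions.
from scipy import signal
from scipy.signal import find_peaks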
Because NumPy and plotting are used heavily in signal processing, the examples in this chapter assume the following NumPy and matplotlib imports.
import numpy as np
import matplotlib.pyplot as plt
6.1 Feature Detection#
When analyzing experimental data, there are typically key features in the signal that you are most interested in. Often, they are peaks or a series of peaks, but they can also be negative peaks (i.e., low points), slopes, or inflection points. This section covers extracting feature information from signal data.
6.1.1 Global Maxima & Minima#
The simplest and probably most commonly sought-after features in signal data are peaks and negative peaks. These are known as maxima and minima, respectively, or collectively as extrema. In the simplest data, there may be only one peak or negative peak, so finding it is a matter of finding the maximum or minimum value in the data. For this, we can use NumPy's np.max()
and np.min()
functions, which are also available under the names np.amax()
and np.amin()
, respectively.
To demonstrate peak finding, we will use both a \(^{13}\)C{\(^1\)H} Nuclear Magnetic Resonance (NMR) spectrum and an infrared (IR) spectrum. These data are imported below using NumPy.
nmr = np.genfromtxt('data/13C_ethanol.csv', delimiter=',',
skip_footer=1, skip_header=1)
plt.plot(nmr[:,0], nmr[:,1], lw=0.5)
plt.xlabel('Chemical Shift, ppm')
plt.xlim(70, 0);
ir = np.genfromtxt('data/IR_acetone.csv', delimiter=',')
plt.plot(ir[:,0], ir[:,1])
plt.xlabel('Wavenumbers, cm$^{-1}$')
plt.ylabel('Transmittance, %')
plt.xlim(4000, 600);
NMR resonances are positive peaks while IR stretches are represented here as negative peaks, so we can find the largest features in both spectra by finding the maximum value in the NMR spectrum and the smallest value in the IR spectrum.
np.max(nmr[:,1])
np.float64(11.7279863357544)
np.min(ir[:,1])
np.float64(66.80017)
These functions output the max and min values of the dependent variable (\(y\)-axis). If we want to know the location on the \(x\)-axis, we need to use the NumPy functions np.argmax()
and np.argmin()
which return the indices of the max or min values instead of the actual values (“arg” is short for argument).
imax = np.argmax(nmr[:,1])
imax
np.int64(5395)
imin = np.argmin(ir[:,1])
imin
np.int64(2302)
With the indices, we can extract the corresponding \(x\)-axis values by indexing. Below, the largest peak in the NMR spectrum is at 18.3 ppm while the smallest transmittance (i.e., largest absorbance) is at 1710 cm\(^{-1}\) in the IR spectrum.
nmr[imax, 0]
np.float64(18.312606267778)
ir[imin, 0]
np.float64(1710.068)
Below, these values are plotted on the spectra as orange dots to validate that they are indeed the largest features in the spectra.
plt.plot(nmr[:,0], nmr[:,1], lw=0.5)
plt.plot(nmr[imax,0], nmr[imax,1], 'o')
plt.xlabel('Chemical Shift, ppm')
plt.xlim(70,0);
plt.plot(ir[:,0], ir[:,1])
plt.plot(ir[imin, 0], ir[imin, 1], 'o')
plt.xlim(4000, 600)
plt.xlabel('Wavenumbers, cm$^{-1}$')
plt.ylabel('Transmittance, %');
Both of these functions find global extrema. If all you need is the largest feature in a spectrum, this works just fine. To find multiple features, we will need to find local extrema, which are addressed in the following section.
6.1.2 Local Maxima & Minima#
A considerable amount of scientific data contains numerous peaks and negative peaks, which are called local extrema. To locate multiple max and min values, we will use SciPy's relative max/min functions argrelmax()
and argrelmin()
. These functions determine whether a point is a max/min by checking a range of data points on both sides to see if the point is the largest/smallest. The range of data points examined is known as the window, and it can be modified using the order
argument. Instead of the actual max/min values, these functions return the indices, as the “arg” part of the name suggests.
from scipy.signal import argrelmax, argrelmin
imax = argrelmax(nmr[:,1], order=2000)
imax
(array([1219, 5395]),)
The argrelmax()
function returned two indices as an array wrapped in a tuple. If we plot the maxima marked with dots, we see that the function correctly identified both peaks.
plt.plot(nmr[:,0], nmr[:,1], lw=0.5)
plt.plot(nmr[imax, 0], nmr[imax, 1], 'C1o')
plt.xlabel('Chemical Shift, ppm')
plt.xlim(70,0);
The argrelmax()
function may at times identify an edge or a point in a flat region as a local maximum because there is nothing larger near it. There are multiple ways to mitigate these erroneous peaks. First, we can increase the window over which the function checks whether a point is the largest value in its neighborhood. Unfortunately, making the window too large can also prevent the identification of multiple extrema near each other. The second mitigation is to change the function's mode from the default 'clip'
to 'wrap'
. This makes the function treat the data as wrapped around on itself instead of stopping at the edge. That is, both edges of the data are treated as being connected, which makes it more likely that a larger value is present in the neighborhood of an edge point. Finally, the user can filter the returned indices for values that correspond to peaks above a certain height. Below is an example of filtering values based on a height. The window below is intentionally narrowed so that the argrelmax()
function returns too many values for demonstration purposes.
imax = argrelmax(nmr[:,1], order=1000)[0]
imax
array([1219, 2860, 3943, 5395, 6613])
Next, we will create a boolean mask (see section 4.3.4) which is a series of True
and False
values indicating if the data point is above a height value or not. In this example, we are using 1
as a height, but another height value may be more appropriate for different data. This is accomplished below by using the boolean >
operator. The nmr[imax, 1]
indexes the identified peaks from above and only returns the height values as a result of the 1
. If the 1
was not included, we would get a collection of [ppm, height] pairs.
mask = nmr[imax, 1] > 1
mask
array([ True, False, False, True, False])
Finally, we treat the mask of True
/False
values as if they are indices to get only the values for legitimate peaks.
imax[mask]
array([1219, 5395])
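The 'wrap' mode mentioned above can be applied in the same way. Below is a minimal sketch; the most appropriate order value still depends on your data.
imax_wrap = argrelmax(nmr[:,1], order=2000, mode='wrap')[0]  # edges of the data treated as connected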
6.1.3 SciPy find_peaks() Function#
The scipy.signal
module includes a convenient find_peaks()
function that facilitates finding multiple peaks in a spectrum based on parameters such as peak height or prominence. This function requires a one-dimensional array as a positional argument and accepts a number of optional keyword arguments (Table 2).
Table 2 Select Keyword Arguments for the scipy.signal.find_peaks() Function

| Parameter | Description |
|---|---|
| height | Vertical height of the peak apex. |
| threshold | Vertical distance between a data point and the adjacent data points. |
| distance | Horizontal distance between a peak and its nearest neighbor. If two peaks are near each other, the smaller one is discarded. |
| prominence | Distance between a peak apex and the base of the peak. |
| width | Peak width measured in number of data points. |
Each of these parameters can take a single number treated as a minimum value. Alternatively, most of these parameters can also take two numbers in an array, list, or tuple, in which case the first value is a minimum while the second value is a maximum.
find_peaks(data, height=min)
find_peaks(data, height=(min, max))
This function only identifies positive peaks (i.e., peaks pointing upward). Our IR spectrum is currently represented as percent transmittance, so we can convert it to absorbance using the following equation.
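In equation form, where A is the absorbance and %T is the percent transmittance:

\[ A = 2 - \log_{10}(\%T) \]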
absorb = 2 - np.log10(ir[:,1])
Now that the peaks point upward, we can feed the data into the find_peaks()
function and decide how to best identify the peaks we are interested in. This will depend upon the type of spectrum and other conditions. One straightforward method is the height=
parameter, where any peaks above this level are identified at their apex.
from scipy.signal import find_peaks
find_peaks(absorb)
(array([ 2, 16, 31, 46, 62, 84, 102, 116, 130, 140, 161,
176, 199, 215, 220, 235, 249, 262, 270, 279, 291, 307,
382, 442, 455, 471, 486, 497, 524, 625, 749, 791, 812,
837, 864, 1020, 1114, 1285, 1573, 1698, 1731, 1858, 1879, 1908,
1919, 1934, 1948, 1988, 2009, 2020, 2064, 2091, 2108, 2148, 2172,
2184, 2302, 2381, 2468, 2539, 2565, 2579, 2607, 2628, 2661, 2673,
2693, 2716, 2733, 2745, 2756, 2772, 2789, 2811, 2829, 2846, 2853,
2865, 2886, 2895, 2909, 2927, 2950, 2958, 2971, 2985, 2996, 3008,
3049, 3058, 3069, 3083, 3092, 3109, 3123, 3142, 3168, 3196, 3213,
3227, 3240, 3257, 3277, 3284, 3303, 3315, 3334, 3348, 3363, 3380,
3397, 3409, 3433, 3446, 3469, 3497, 3527, 3557, 3568, 3586, 3608,
3651, 3662, 3711, 3731, 3758, 3774, 3794, 3805, 3826, 3838, 3852,
3869, 3881, 3897, 3911, 3925, 3939, 3979, 3994, 4003, 4019, 4032,
4038, 4048, 4095, 4111, 4153, 4181, 4191, 4206, 4222, 4241, 4251,
4264, 4285, 4298, 4313, 4329, 4348, 4367, 4379, 4386, 4405, 4418,
4437, 4454, 4472, 4488, 4523, 4534, 4549, 4568, 4590, 4603, 4618,
4645, 4670, 4688, 4701, 4716, 4729, 4818, 4827, 4849, 4914, 4984,
5095, 5127, 5148, 5178, 5193, 5212, 5233, 5247, 5272, 5294, 5302,
5318, 5338, 5361, 5374, 5400, 5415, 5432, 5440, 5451, 5476, 5506,
5526, 5540, 5554, 5567, 5578, 5598, 5618, 5636, 5656, 5666, 5673,
5683, 5696, 5708, 5725, 5742, 5765, 5830, 5880, 5899, 5915, 5927,
5934, 5946, 5965, 5991, 6003, 6021, 6043, 6051, 6060, 6072, 6093,
6131, 6154, 6165, 6192, 6211, 6222, 6235, 6249, 6261, 6280, 6318,
6368, 6375, 6404, 6432, 6451, 6462, 6478, 6502, 6519, 6532, 6547,
6557, 6571, 6594, 6610, 6625, 6639, 6662, 6681, 6696, 6708, 6723,
6746, 6763, 6780, 6794, 6812, 6827, 6850, 6879, 6903, 6919, 6934,
6956, 6976, 6991, 7004, 7013, 7026]),
{})
The function returns a tuple containing an array and a dictionary, in that order. The array contains the indices of identified peaks while the dictionary may either be empty or include information about the identified peaks depending upon which keyword arguments are used in the function.
Below, the results of the function are plotted with the horizontal dotted line representing the chosen height and the orange dots representing the identified peaks.
height = 0.01
i_peaks = find_peaks(absorb, height=height)[0]
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1,1,1)
ax.hlines(height, 4000, 600, 'r', linestyles='dotted', label='Height')
ax.plot(ir[:,0], absorb)
ax.plot(ir[i_peaks,0], absorb[i_peaks], 'o', label='Identified peaks')
ax.set_xlim(4000, 600)
ax.set_xlabel('Wavenumbers, cm$^{-1}$')
ax.set_ylabel('Absorbance')
ax.legend()
<matplotlib.legend.Legend at 0x10c17fd10>
When keyword arguments are used, the find_peaks()
function also returns, in the dictionary, the peak properties associated with those arguments. For example, because we used the height=
argument, the heights of the identified peaks are returned under the 'peak_heights' key.
find_peaks(absorb, height=height)
(array([ 2, 16, 31, 46, 62, 84, 102, 116, 130, 140, 161,
176, 199, 215, 220, 235, 249, 262, 270, 279, 291, 307,
382, 442, 455, 471, 486, 497, 524, 625, 749, 791, 812,
837, 864, 1020, 1114, 1285, 1573, 1698, 1731, 1858, 1879, 1908,
1919, 1934, 1948, 1988, 2009, 2020, 2108, 2148, 2172, 2184, 2302,
2381, 4984]),
{'peak_heights': array([0.01713256, 0.01951099, 0.01586994, 0.0166401 , 0.01443504,
0.01303453, 0.01354936, 0.01276529, 0.01307347, 0.01335544,
0.01305901, 0.01252041, 0.0123151 , 0.01175591, 0.01175484,
0.01211075, 0.0120464 , 0.01183543, 0.0118249 , 0.01181035,
0.01161127, 0.01167738, 0.01561745, 0.01103975, 0.01103948,
0.0109757 , 0.01097877, 0.01117627, 0.01139271, 0.0229449 ,
0.01129687, 0.01178175, 0.01157742, 0.0115503 , 0.01188149,
0.03266487, 0.01204908, 0.14069589, 0.123156 , 0.04351303,
0.0392759 , 0.01102015, 0.01090951, 0.01069086, 0.01036943,
0.01053283, 0.01075127, 0.01098612, 0.0104168 , 0.01008441,
0.01000628, 0.01187185, 0.01309263, 0.01389863, 0.17522243,
0.03075821, 0.01286983])})
This approach struggles to identify short peaks without mislabeling non-peaks, so we need another condition to limit what is marked as a peak. The peak prominence (prominence=
) is how far the apex of a peak rises above the base of the peak; the base of the peak may or may not be the baseline of the spectrum itself. By adding this condition, only peaks that satisfy both the height and prominence conditions will be identified.
height = 0.01
i_peaks = find_peaks(absorb, height=height, prominence=0.002)[0]
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1,1,1)
ax.hlines(height, 4000, 600, 'r', linestyles='dotted', label='Height')
ax.plot(ir[:,0], absorb)
ax.plot(ir[i_peaks,0], absorb[i_peaks], 'o', label='Identified peaks')
ax.set_xlim(4000, 600)
ax.set_xlabel('Wavenumbers, cm$^{-1}$')
ax.set_ylabel('Absorbance')
ax.legend()
<matplotlib.legend.Legend at 0x10d69ec90>
6.1.4 Slopes & Inflection Points#
The slope is a useful feature as it can be used to identify inflection points and edges and to make subtle features in a curve more obvious. Unfortunately, noisy data can make it challenging to examine the slope because the noise causes the slope to fluctuate so much that it sometimes dwarfs the overall signal. It is sometimes recommended that the noise first be removed by signal smoothing, covered in section 6.2, before trying to identify signal features. To demonstrate the challenges of noisy data, we will generate both noise-free and noisy synthetic data below and calculate the slopes of both.
rng = np.random.default_rng()
x = np.linspace(0, 2*np.pi, 1000)
y_smooth = np.sin(x)
y_noisy = np.sin(x) + 0.07 * rng.random(len(x))
plt.plot(x, y_smooth);
We will use NumPy's np.diff()
function to calculate the slope; it computes differences of a user-defined order (n
). Because the slope is the dy/dx between every pair of adjacent points, the resulting slope data is one data point shorter than the original data. This is important when plotting because the lengths of the x and y values must be the same.
When examining the slope, it is important to use smooth data. In the example below, the slope from the noise in the noisy data dwarfs that of the main signal. Therefore, we will use the slope of the smooth data to find the inflection point below.
dx = 2*np.pi/(1000 - 1)
dy_smooth = np.diff(y_smooth, n=1)
dy_noisy = np.diff(y_noisy, n=1)
x2 = (x[:-1] + x[1:]) / 2 # x values one shorter
plt.plot(x2, dy_noisy/dx, label='Noisy Data')
plt.plot(x2, dy_smooth/dx, label='Smooth Data')
plt.ylabel('Slope, dy/dx')
plt.legend();
Because the inflection point in the center of the data has a negative slope, we will need to find the minimum slope. This may not always be the case with other data.
i = np.argmin(dy_smooth) # finds min slope index
plt.plot(x, y_smooth)
plt.plot(x[i], y_smooth[i], 'o');
6.2 Smoothing Data#
It is not uncommon to collect signal data that has a considerable amount of noise in it. Smoothing the data can help in the processing and analysis of the data such as making it easier to identify peaks or preventing the noise from hiding the extremes in the derivative of the data. Smoothing alters the actual data, so it is important to be transparent to others that the data were smoothed and how they were smoothed.
There are a variety of ways to smooth data including moving averages, band filters, and the Savitzky-Golay filter. We will focus on moving averages and Savitzky-Golay here. For this section, we will work with a noisy cyclic voltammogram (CV) stored in the file CV_noisy.csv.
CV = np.genfromtxt('data/CV_noisy.csv', delimiter=',')
potent = CV[:,0]
curr = CV[:,1]
plt.plot(potent, curr)
plt.xlabel('Potential, V')
plt.ylabel('Current, A');
6.2.1 Unweighted Average#
The first and simplest way to smooth data is to take the moving average of each data point with its immediate neighbors. This is an unweighted sliding-average smooth, also called a rectangular or boxcar smooth. From noisy data point \(D_j\), we get smoothed data point \(S_j\) by the following equation, where \(D_{j-1}\) and \(D_{j+1}\) are the points immediately preceding and following data point \(D_j\), respectively.

\[ S_j = \frac{D_{j-1} + D_j + D_{j+1}}{3} \]

One thing to note about this smoothing method is that it is only valid for points other than the first and last because those points do not have neighbors on both sides to average with. As a result, the smoothed data is two data points shorter. There are approximations that can be used to maintain the length of the data, but for simplicity, we will allow the data to shorten.
sum = curr[:-2] + curr[1:-1] + curr[2:]
rect_smooth = sum / 3
plt.plot(potent[1:-1], rect_smooth);
The data are smoothed relative to the original data, but there is still a considerable amount of noise present.
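The same rectangular smooth can also be written as a convolution with np.convolve(), which scales more easily to wider windows. This sketch is equivalent to the slicing above; mode='valid' drops the edge points just as before, while mode='same' would instead keep the original length using edge approximations.
window = np.ones(3) / 3                                 # equal weights that sum to one
rect_smooth2 = np.convolve(curr, window, mode='valid')  # same result as the slicing approach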
6.2.2 Weighted Averages#
The above method treats each point equally and only averages the immediately adjacent data points. The triangular smooth averages extra data points, with the points closer to the original point weighted more heavily than those farther away. For example, if we take the average using five data points, this is described by the following equation.

\[ S_j = \frac{D_{j-2} + 2D_{j-1} + 3D_j + 2D_{j+1} + D_{j+2}}{9} \]

The resulting data is shortened by four points because the end points have insufficient neighbors to be averaged.
sum = curr[:-4] + 2*curr[1:-3] + 3*curr[2:-2] + 2*curr[3:-1] + curr[4:]
tri_smooth = sum / 9
plt.plot(potent[2:-2], tri_smooth);
The triangular smooth results in a smoother data set than the rectangular smooth. This is not surprising as applying the triangular smooth above is mathematically equivalent to applying the rectangular smooth twice.
6.2.3 Median Smoothing#
While the above filters take some form of the mean of the surrounding data points, a median filter takes the median. This filter is sometimes applied to images because it reduces noise while maintaining sharp edges.
array2d = np.vstack((curr[2:], curr[1:-1], curr[:-2]))
median_smooth = np.median(array2d, axis=0)
plt.plot(potent[1:-1], median_smooth);
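SciPy also provides a ready-made median filter, scipy.signal.medfilt(), which keeps the output the same length as the input; a minimal sketch is below.
from scipy.signal import medfilt
median_smooth2 = medfilt(curr, kernel_size=3)   # three-point median filter
plt.plot(potent, median_smooth2);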
6.2.4 Savitzky–Golay#
Another approach is the Savitzky–Golay filter, which incrementally moves along the noisy data and fits sections (i.e., windows) of data points to a polynomial using least-squares minimization. While this approach had been previously described in the mathematical literature, Abraham Savitzky and M. J. E. Golay are known for applying it to spectroscopy (https://doi.org/10.1021/ac60214a047). Conveniently, SciPy contains a built-in function for this called savgol_filter()
from the scipy.signal
module, shown below.
scipy.signal.savgol_filter(data, window, polyorder)
This function requires three arguments: the original data as a NumPy array, window
, which is the width of the moving window that the Savitzky–Golay algorithm fits to a polynomial, and polyorder
, which is the order of the polynomial used for the moving fit. You are encouraged to experiment with the window
and polyorder
arguments to see what works best for your application. However, polyorder
must be less than the window
size, and the window
must be an odd integer.
from scipy.signal import savgol_filter
sg_smooth = savgol_filter(curr, 101, 1)
plt.plot(potent, sg_smooth);
The Savitzky–Golay filter appears to have done a decent job removing the noise. Despite there being some remaining noise and other artifacts in the CV, the denoised CV makes it significantly easier to locate the maxima and minima in this example.
6.3 Fourier Transforms#
Another approach to reducing noise is to filter based on frequency. Many times, random noise in data occurs at different frequencies than the data itself, and the noise can be reduced by filtering out noise frequency ranges while retaining signal frequencies. If the noise is at higher frequencies than the signal, it can be removed with what is known as a low-pass filter. Alternatively, filtering out low-frequency noise is known as a high-pass filter, and filtering out noise both above and below the signal frequency is known as a band-pass filter. Frequency filtering is somewhat involved because it requires window functions, which are covered in the Think DSP book by Allen Downey listed at the end of this chapter. Instead, we will just look at the distribution of signal and noise frequencies in synthetic data. This is useful for analyzing the noise in data and is also used routinely in nuclear magnetic resonance (NMR) spectroscopy and Fourier transform infrared (FTIR) spectroscopy.
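Although frequency filtering is not pursued in this chapter, the sketch below shows roughly what a simple low-pass filter looks like in SciPy: butter() designs a Butterworth filter and filtfilt() applies it without phase shift. The filter order and cutoff are arbitrary values chosen only for illustration, and rng is the random number generator created earlier in the chapter.
from scipy.signal import butter, filtfilt

t_demo = np.linspace(0, 1, 1000)                                  # one second of synthetic data
noisy = np.sin(2*np.pi*5*t_demo) + 0.3*rng.standard_normal(1000)  # 5 Hz signal plus noise
b, a = butter(4, 0.05)                # 4th-order low-pass; cutoff given as a fraction of the Nyquist frequency
low_passed = filtfilt(b, a, noisy)    # zero-phase application of the filter
plt.plot(t_demo, noisy, alpha=0.5)
plt.plot(t_demo, low_passed);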
To convert the data from the time domain to the frequency domain, we will use the fast Fourier transform (FFT) algorithm, which treats the data as periodic. Below, synthetic data oscillating at 62.0 Hz is generated with some random noise added to make it more interesting.
t = np.linspace(0,1,1000)
freq = 62.0 # Hz
signal = np.sin(freq*2*np.pi*t)
noise = rng.random(1000)
data = signal + 0.5 * noise
plt.plot(t, data)
plt.xlabel('Time, s');
SciPy contains an entire module called fft
dedicated to Fourier transforms and inverse Fourier transforms. We will use the basic fft()
function on our synthetic data; it returns an array of complex values. For plotting, we will simply look at the real component of the result using .real
.
from scipy.fft import fft
fdata = fft(data)
plt.plot(fdata.real)
plt.xlim(0,500/2)
plt.xlabel('Frequency, Hz');
Only the first half of the Fourier transform output is plotted above because the second half is a mirror image of the first. A single peak at 62.0 Hz is present from our signal. The rest of the baseline of the plot is not smooth because noise is present at a variety of frequencies. It is important to note that the erratic variations in the baseline of the frequency plot are not the noise itself but more like a histogram of all the frequencies present in the original data.
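If an explicit frequency axis is preferred over relying on the bin index, scipy.fft.fftfreq() can generate one from the number of samples and the sample spacing. This sketch plots the magnitude of the transform of the synthetic data above.
from scipy.fft import fftfreq

freqs = fftfreq(len(data), d=t[1] - t[0])   # frequency of each FFT bin in Hz
half = len(data) // 2                       # keep only the non-mirrored half
plt.plot(freqs[:half], np.abs(fdata)[:half])
plt.xlabel('Frequency, Hz');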
6.4 Fitting & Interpolation#
Signal data, or information taken from signal data, often conforms to linear, polynomial, or other mathematical trends, and fitting data is important because it allows scientists to determine the equation describing the physical or chemical behavior underlying the data. In data fitting, the user provides the data and the general class of equation expected, and the software returns the coefficients for the equation. Interpolation is the prediction of values in regions between known data points. Values where no data were collected can be calculated either by using the coefficients derived from a curve fit or by using a dedicated interpolation function that generates a callable function to calculate the new data points. Both approaches are demonstrated below.
Before we can do our fitting, we need some new, noisy data to examine. A linear set of data with added noise is generated below along with a second-order curve containing the same noise.
x = np.linspace(0,10,100)
noise = rng.random(100)
y_noisy = 2.2 * x + 3 * noise
y2_noisy = 3.4 * x**2 + 4 * x + 7 + 3 * noise
plt.scatter(x, y_noisy);
plt.scatter(x, y2_noisy);
6.4.1 Linear Regression#
Now we can fit the noisy linear data with a line using the NumPy np.polyfit(x, y, degree)
function. The function takes the x
and y
data along with the degree
of the polynomial.
A line is a first-degree polynomial, and the function returns an array containing the coefficients for the fit with the highest order coefficients first. This is effectively a linear regression.
a, b = np.polyfit(x, y_noisy, 1)
print((a, b))
(np.float64(2.240683698467965), np.float64(1.2799581449656965))
For a linear equation of the form \(y = ax + b\), we get an array of the form array([a, b])
, so the fitted equation above is \(y = 2.24x + 1.28\). The positive shift of the \(y\)-intercept above zero is not surprising because we added random noise that is not centered around zero; the average of our rng.random()
noise is about 0.5, not zero. This could be remedied either by subtracting 0.5 from the noise or by drawing the noise from a distribution centered on zero, such as a normal distribution (e.g., rng.normal()
).
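Both remedies can be written in a line or two; this is a minimal sketch reusing the x and noise arrays from above.
y_centered = 2.2 * x + 3 * (noise - 0.5)            # shift the uniform noise so it centers on zero
y_normal = 2.2 * x + 3 * rng.normal(size=len(x))    # or draw normally distributed noise instead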
We can view our linear regression by plotting a line on top of our data points using the coefficients found above.
y_reg = a*x + b
plt.scatter(x, y_noisy, label='linear data')
plt.plot(x, y_reg, 'C1-', label='linear regression')
plt.legend();
We can also obtain the statistics for our fit using the linregress()
function from the SciPy stats
module. Note that this does not return the \(r^2\) value but instead the \(r\)-value which can be squared to generate the \(r^2\) value.
from scipy import stats
stats.linregress(x, y_noisy)
LinregressResult(slope=np.float64(2.2406836984679646), intercept=np.float64(1.2799581449656952), rvalue=np.float64(0.9929311301807056), pvalue=np.float64(1.5904112918209302e-92), stderr=np.float64(0.027056369871974507), intercept_stderr=np.float64(0.15660399723300497))
Note
We are starting to see examples of functions that return multiple values, which can be assigned to multiple variables using tuple unpacking as shown below.
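For example, the result of linregress() from above can be unpacked into separate variables (the variable names here are arbitrary).
slope, intercept, rvalue, pvalue, stderr = stats.linregress(x, y_noisy)
print(rvalue**2)    # square the r-value to obtain r squared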
There may be times when you don’t need all of the returned values from a function. In these instances, it is common to use __
(double underscore) as a junk variable, which is broadly understood to store information that will never be used in the code. You may also see a _
(single underscore) used for this purpose, but this is discouraged because a single underscore is also used by the interactive Python interpreter to store the last output.
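For example, if only the slope and intercept are needed, the remaining values can be sent to junk variables.
slope, intercept, __, __, __ = stats.linregress(x, y_noisy)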
6.4.2 Polynomial Fitting#
Fitting to a polynomial of a higher order works the same way except that the order is above one. You will need to already know the order of the polynomial, or you can make a guess and see how well a particular order fits the data. Below, the np.polyfit()
function determines that the second-order data can be fit by the equation \(y = 3.39x^2 + 4.09x + 8.20\). We can again plot this fit equation over our data points to see how well the data agree with our equation.
a, b, c = np.polyfit(x, y2_noisy, 2)
print((a, b, c))
(np.float64(3.394946026145204), np.float64(4.091223437015926), np.float64(8.19657608473501))
y_fit = a*x**2 + b*x + c
plt.scatter(x, y2_noisy, label='curve data')
plt.plot(x, y_fit, 'C1-', label='curve fit')
plt.legend();
6.4.3 Multivariable Linear Regression#
Multivariable linear regression (aka multiple linear regression) is similar to the linear regression seen in section 6.4.1 except that there are multiple independent variables. This takes the form below, where \(y\) is the dependent variable, the \(x\) values are the independent variables with coefficients \(a\), and \(b\) is the bias term. There are \(k\) independent variables below.

\[ y = a_1 x_1 + a_2 x_2 + \cdots + a_k x_k + b \]
The goal is to solve for the \(a\) coefficients and the value of \(b\) given a series of \(x\)-values with their corresponding \(y\)-values. Essentially, this is taking regular linear regression to three dimensions or higher. There are multiple methods available in Python to solve this type of problem including, but not limited to, the following.

1. Scikit-learn's LinearRegression() demonstrated in section 12.1.3
2. The Moore–Penrose pseudoinverse and some matrix math demonstrated in section 8.3.2
3. NumPy's np.linalg.lstsq() function
4. Optimization algorithms that fit the equation, similar to what is demonstrated in section 14.2

The first option will often involve the fewest lines of code but requires knowledge of the scikit-learn library. Because options 1, 2, and 4 either require other libraries or specialized knowledge not yet addressed, we will solve a multivariable linear regression problem using the np.linalg.lstsq()
function.
In the example below, we have an array y
containing our dependent variable values and an array X
containing our independent variable values. For this approach, we want these two arrays to be related by the following equation, where a
is an array containing our coefficients and the bias term.

\[ \mathbf{y} = \mathbf{X}\,\mathbf{a} \]

The issue is that our array X
currently has one column too few for this product, so there is nothing that multiplies by \(b\).
X = np.array([[ 2, 11, 10, 7],
[ 7, 13, 2, 10],
[ 3, 2, 8, 14],
[11, 11, 11, 12],
[ 8, 2, 12, 7],
[ 8, 6, 3, 13]])
y = np.array([192.36 , 254.1, 175.1, 284.4, 145.2, 221.3])
We need to add a column of ones to array X
, as is done below. Now, when performing the above multiplication, \(b\) is always multiplied by 1 and is therefore simply returned as \(b\).
X = np.c_[X, np.ones(6)]
X
array([[ 2., 11., 10., 7., 1.],
[ 7., 13., 2., 10., 1.],
[ 3., 2., 8., 14., 1.],
[11., 11., 11., 12., 1.],
[ 8., 2., 12., 7., 1.],
[ 8., 6., 3., 13., 1.]])
To solve for array a
, we will use the np.linalg.lstsq()
function, which takes the arrays containing the independent and dependent variables, in that order.
a = np.linalg.lstsq(X, y)
a
(array([5.22016673, 9.07380853, 1.22219907, 8.66136646, 9.77747831]),
array([2.02564125]),
np.int32(5),
array([40.69036598, 11.56475425, 9.36236285, 6.61646926, 0.34506865]))
The output includes the coefficients plus other information about the fit such as the sum of the squared residuals from the fit. If you just want the coefficients, use indexing like below.
a[0]
array([5.22016673, 9.07380853, 1.22219907, 8.66136646, 9.77747831])
This result is interpreted as the equation \(y = 5.22x_0 + 9.07x_1 + 1.22x_2 + 8.66x_3 + 9.78\).
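As a quick sanity check, multiplying X by the fitted coefficients should approximately reproduce the original y values; a short sketch is below.
y_pred = X @ a[0]    # predicted y values from the fitted coefficients and bias term
print(y_pred)
print(y)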
6.4.4 Interpolation#
The practical difference between the np.polyfit()
function and the interpolation functions in SciPy is that the former returns coefficients for the equation while the latter return a Python function that can be used to calculate values. There are times when one is more desirable than the other depending upon your application. Below, we will use an interpolation function to interpolate a one-dimensional data set.
Below is a damped sine wave that we will interpolate from ten data points.
x = np.linspace(1,20, 10)
y = np.sin(x)/x
plt.plot(x,y, 'o');
To interpolate these one-dimensional data, we will use the interp1d()
function from SciPy. Along with the x
and y
values, interp1d()
accepts a mode of interpolation through the kind keyword argument, which can be any of the items listed in Table 3.
Table 3 Modes for interp1d()

| Kind | Description |
|---|---|
| 'linear' | Linear interpolation between data points |
| 'previous' | Constant value until the next data point |
| 'nearest' | Predicts values equal to the closest data point |
| 'quadratic' | Interpolates with a second-order spline |
| 'cubic' | Interpolates with a third-order spline |
Below is a demonstration of both linear and cubic interpolation. The two functions f()
and f2()
are generated and can be used like any other Python function to calculate values.
from scipy import interpolate
f = interpolate.interp1d(x, y, kind='linear')
f2 = interpolate.interp1d(x, y, kind='cubic')
xnew = np.linspace(1,20,100)
plt.plot(xnew, f(xnew), 'C1-', label='Linear Interpolation')
plt.plot(xnew, f2(xnew), 'C2--', label='Cubic Interpolation')
plt.plot(x,y, 'o', label='Sampled Data')
plt.legend();
6.5 Baseline Correction#
The baseline of chemical spectra may sometimes slope or undulate, requiring baseline correction. This is a two-step process: first the baseline is identified, and then the baseline is subtracted from the original spectrum. Predicting a baseline for a spectrum is not a trivial task. Fortunately, the pybaselines library provides Python implementations of various algorithms for determining the baseline. Because pybaselines is not included by default with Anaconda or Colab, it needs to be installed using either pip or conda.
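For example, from the command line (the exact command may vary with your setup; the conda command assumes the conda-forge channel carries the package):
pip install pybaselines
conda install -c conda-forge pybaselines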
For our example below, we will correct the baseline of an IR spectrum of 2-pentanone. We can see that the baseline curves upward at frequencies below 2000 cm\(^{-1}\).
IR_spec = np.genfromtxt('data/IR_2pentanone.csv', delimiter=',')
wavenums = IR_spec[:,0]
absorb = IR_spec[:,1]
plt.plot(wavenums, absorb)
plt.xlabel('Wavenumbers (cm$^{-1}$)')
plt.ylabel('Absorbance')
plt.gca().invert_xaxis();
To identify a baseline, we will use the Baseline
class from pybaselines, imported below. The first step is to create a baseline fitter object, which accepts the x-axis data from your spectrum.
from pybaselines import Baseline
fitter = Baseline(x_data=wavenums)
Next, the fitter object is used to predict the baseline using one of various baseline algorithms. There are numerous algorithms available, and many have multiple parameters that fine-tune how the baseline is identified. Finding the ideal algorithm and parameters can come down to trial and error, so feel free to try a few and see what works best. We will try the modified polynomial, asymmetric least squares, and morphology-based algorithms below, but there are many others to choose from. The output of each prediction is a background and the parameters from the baseline fit.
bg1, params1 = fitter.modpoly(absorb, poly_order=3)
bg2, params2 = fitter.asls(absorb, lam=1e7, p=0.0002)
bg3, params3 = fitter.mor(absorb, half_window=200)
plt.plot(wavenums, absorb, label='IR Spectrum')
plt.plot(wavenums, bg1, alpha=0.5, label='Modified Polynomial')
plt.plot(wavenums, bg2, alpha=0.5, label='Asymmetric Least Squares')
plt.plot(wavenums, bg3, alpha=0.5, label='Morphological')
plt.xlabel('Wavenumbers (cm$^{-1}$)')
plt.ylabel('Absorbance')
plt.legend()
plt.gca().invert_xaxis()
Finally, we will subtract the baseline from the original data. The good news is that the baseline fitter generated the baseline as a NumPy array with the same size as the original data, so subtraction is a matter of subtracting one array from another.
absorb_corrected = absorb - bg2
plt.plot(wavenums, absorb_corrected)
plt.xlabel('Wavenumbers (cm$^{-1}$)')
plt.ylabel('Absorbance')
plt.gca().invert_xaxis()
Further Reading#
The ultimate authority on NumPy and SciPy is the NumPy & SciPy documentation page listed below. As changes and improvements occur in these libraries, this is one of the best places to find information. For information on digital signal processing (DSP), there are numerous sources, such as Allen Downey's Think DSP book or articles like those listed below.
Numpy and Scipy Documentation. https://docs.scipy.org/doc/ (free resource)
Downey, Allen B. Think DSP, Green Tea Press, 2016. http://greenteapress.com/wp/think-dsp/ (free resource)
O’Haver, T. C. An Introduction to Signal Processing in Chemical Measurement. J. Chem. Educ. 1991, 68 (6), A147-A150. https://doi.org/10.1021/ed068pA147
Savitzky, A.; Golay, M.J.E. Smoothing and Differentiation of Data by Simplified Least Squares Procedures. Anal. Chem. 1964, 36 (8) 1627–1639. https://doi.org/10.1021/ac60214a047
Exercises#
Complete the following exercises in a Jupyter notebook. Any data file(s) referred to in the problems can be found in the data folder in the same directory as this chapter's Jupyter notebook. Alternatively, you can download a zip file of the data for this chapter from here by selecting the appropriate chapter file and then clicking the Download button.
Import the file CV_K3Fe(CN)6.csv, which contains a cyclic voltammogram for potassium ferricyanide. Plot the data with green dots on the highest point(s) and red triangles on the lowest point(s).
Import the file titled CV_K3Fe(CN)6.csv and determine the inflection points. Plot the data with a marker on both inflection points. Hint: There are two inflection points in these data, with one running in the reverse direction, giving it a negative slope.
Generate noisy synthetic data from the following code.
from scipy.signal import sawtooth
import numpy as np

rng = np.random.default_rng()
t = np.linspace(0, 4, 1000)
sig = sawtooth(2 * np.pi * t) + rng.random(1000)
a) Smooth the data using moving averages and plot the smoothed signal. Feel free to use the moving averages code from this chapter.
b) Smooth the same data using a Savitzky–Golay filter. Plot the smoothed signal.
Import the \(^{31}\)P NMR file titled fid_31P.csv and determine how many major frequencies are in this wave. Keep in mind that there will be a second echo for each peak.
The wavelength of emitted light (\(\lambda\)) from hydrogen is related to the electron energy level transitions by the following equation, where R\(_\infty\) is the Rydberg constant, n\(_i\) is the initial principal quantum number of the electron, and n\(_f\) is the final principal quantum number of the electron.

\[ \frac{1}{\lambda} = R_\infty \left(\frac{1}{n_f^2} - \frac{1}{n_i^2} \right) \]

The following are experimental wavelengths for five different transitions from the Balmer series (i.e., n\(_f\) = 2).
| Transition (\(n_i \rightarrow n_f\)) | Wavelength (nm) |
|---|---|
| 3 \(\rightarrow\) 2 | 656.1 |
| 4 \(\rightarrow\) 2 | 485.2 |
| 5 \(\rightarrow\) 2 | 433.2 |
| 6 \(\rightarrow\) 2 | 409.1 |
| 7 \(\rightarrow\) 2 | 396.4 |
n_i = [3, 4, 5, 6, 7]
wl = [656.1, 485.2, 433.2, 409.1, 396.4]
Calculate a value for the Rydberg constant (R\(_\infty\)) using a linear fit of the above data. The data will need to be first linearized.
The following data is for the initial rate of a chemical reaction for different concentrations of starting material (A). Calculate a rate constant (k) for this reaction using a nonlinear fit.
| Conc A (M) | Rate (M/s) |
|---|---|
| 0.10 | 0.0034 |
| 0.16 | 0.0087 |
| 0.20 | 0.014 |
| 0.25 | 0.021 |
| 0.41 | 0.057 |
| 0.55 | 0.10 |
conc = [0.10, 0.16, 0.20, 0.25, 0.41, 0.55]
rate = [0.0034, 0.0087, 0.0136, 0.0213, 0.0572, 0.103]
A colorimeter exhibits the following absorbances for known concentrations of Red 40 food dye. Generate a calibration curve using the data below and then calculate the concentration of Red 40 dye in a pink soft drink with an absorbance of 0.481.
| Absorb. (@ 504 nm) | Red 40 (10\(^{-5}\) M) |
|---|---|
| 0.125 | 0.150 |
| 0.940 | 1.13 |
| 2.36 | 2.84 |
| 2.63 | 3.16 |
| 3.31 | 3.98 |
| 3.77 | 4.53 |
ab = [0.125, 0.940, 2.36, 2.63, 3.31, 3.77]
conc = [0.150, 1.13, 2.84, 3.16, 3.98, 4.53]
The following are points on the 2s radial wave function (\(\Psi\)) for a hydrogen atom with respect to the radial distance from the nucleus in Bohrs (\(a_0\)). Visualize the radial wave function as a smooth curve by interpolating the following data points.
| Radius (\(a_0\)) | \(\Psi\) |
|---|---|
| 1.0 | 0.21 |
| 5.0 | -0.087 |
| 9.0 | -0.027 |
| 13.0 | -0.0058 |
| 17.0 | -0.00108 |
radius = [1.0, 5.0, 9.0, 13.0, 17.0]
psi = [0.21, -0.087, -0.027, -0.0058, -0.00108]
The file Cp2Fe_Mossbauer.txt contains Mössbauer data for a ferrocene complex where the left data column is velocity in millimeters per second and the right column is relative transmission. Using Python, determine the velocities of the six negative peaks. Plot the spectrum with dots on the lowest point of each negative peak, and be sure to label your axes complete with units.
Load the file XRF_Cu.csv, which contains X-ray fluorescence (XRF) data for elemental copper, and use Python to determine the energy in eV of the two peaks. Notice that the x-axis is not in eV yet (see row 17 of the data). You are advised to load the data using a pandas function, and setting a threshold will likely be necessary.