Chapter 4: NumPy#
NumPy is a popular library in the Python ecosystem and a critical component of the SciPy stack. So much so that NumPy is even included in Apple’s default installation of Python and in other Python-powered applications such as Blender. While it may be tempting to work with NumPy’s objects as lists or to circumnavigate the NumPy library altogether, the time it takes to learn NumPy’s powerful features is well worth it! It will often allow you to solve problems with less effort and time and with shorter and faster-executing code. This is due to:
NumPy automatically propagating operations to all values in an array instead of requiring
for
loopsA massive collection of functions for working with numerical data
Many of NumPy’s functions are Python-wrapped C code making them run faster
The NumPy package can be imported by import numpy
, but the scientific Python community has developed an unofficial, but strong, convention of importing NumPy using the np
alias. It is a matter of personal preference whether to use the alias or not, but it is strongly encouraged for consistency with the rest of the community. Instead of numpy.function()
, the function is then called by the shorter np.function()
. All of the NumPy code in this and subsequent chapters assumes the following import.
import numpy as np
4.1 NumPy Arrays#
One of the main contributions of NumPy is the ndarray (i.e., “n-dimensional array”), NumPy array, or just array for short. This is an object similar to a list or nested list of lists except that mathematical operations and NumPy functions automatically propagate to each element instead of requiring a for
loop to iterate over it. Because of their power and convenience, arrays are the default object type for any operation performed with NumPy and many scientific libraries that are built on NumPy (e.g., SciPy, pandas, scikit-learn, etc…).
4.1.1 Basic Arrays#
The NumPy array looks like a Python list wrapped in array()
. It is an iterable object, so you could iterate over it using a for
loop if you really want to. However, because NumPy automatically propagates operations through the array, for
loops are typically unnecessary. For example, let us say you want to multiply a list of numbers by 2. Doing this with a list would likely look like the following.
nums = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
for value in nums:
print(2 * value)
0
2
4
6
8
10
12
14
16
18
In contrast, performing this same operation using a NumPy array only requires multiplying the array by 2.
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
print(2 * arr)
[ 0 2 4 6 8 10 12 14 16 18]
4.1.2 Type Conversion to Arrays#
There are three common ways to generate a NumPy array that we will cover in the beginning of this chapter. The first is simply to convert a list or tuple to an array using the np.array()
function.
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] # list
arr = np.array(a)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
The fact that the object is an ndarray is denoted by the array()
.
4.1.3 Array from Sequence#
We can also create an array using NumPy sequence-generating functions. There are two common functions in NumPy for this task: np.arange()
and np.linspace()
. The np.arange()
function behaves similarly to the native Python range()
function with the key difference that it outputs an array. Another minor difference is that while range()
generates a range object, np.arange()
generates a sequences of values immediately. The arguments for np.arange()
are similar to that of Python’s range()
function where start is inclusive and stop is exclusive, but unlike range()
, the step size for np.arange()
does not need to be an integer value.
np.arange(start, stop, step)
The np.linspace()
function is related to np.arange()
except that instead of defining the sequence based on step size, it generates a sequence based on how many evenly distributed points to generate in the given span of numbers. Additionally, np.arange()
excludes the stop values while np.linspace()
includes it. The difference between these two functions is somewhat subtle, and the use of one over the other often comes down to user preference or convenience.
np.linspace(start, stop, number of points)
arr = np.arange(0, 10, 0.5)
arr
array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. ,
6.5, 7. , 7.5, 8. , 8.5, 9. , 9.5])
arr = np.linspace(0,10, 20)
arr
array([ 0. , 0.52631579, 1.05263158, 1.57894737, 2.10526316,
2.63157895, 3.15789474, 3.68421053, 4.21052632, 4.73684211,
5.26315789, 5.78947368, 6.31578947, 6.84210526, 7.36842105,
7.89473684, 8.42105263, 8.94736842, 9.47368421, 10. ])
Two other useful functions for generating arrays are np.zeros()
and np.ones()
which generate arrays populated with exclusively zeros and ones, respectively. The functions accept the shape argument as a tuple of the array dimensions in the form (rows, columns)
.
np.zeros((2,4))
array([[0., 0., 0., 0.],
[0., 0., 0., 0.]])
np.ones(10)
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
You should commit to remembering np.arange()
and np.linspace()
, as these are used often. The np.zeros()
and np.ones()
functions are not as common but are useful in particular applications. They can also be used to generate arrays filled with other values. For example, to generate an array of threes, an array of zeros can be generated and then incremented by 3.
arr = np.zeros((2,4))
arr += 3
print(arr)
[[3. 3. 3. 3.]
[3. 3. 3. 3.]]
4.1.4 Arrays from Functions#
A third approach is to generate an array from a function using np.fromfunction()
which generates an array of values using the array indices as inputs. This function requires a function as an argument.
np.fromfunction(function, shape)
Let us make an array of the dimensions (3,3) where each element is the product of the row and column indices.
def prod(x, y):
return x * y
np.fromfunction(prod, (3,3))
array([[0., 0., 0.],
[0., 1., 2.],
[0., 2., 4.]])
4.2 Reshaping & Merging Arrays#
Modifying the dimensions of one or more arrays is a common task in NumPy. This may involve changing the number of columns and rows or merging multiple arrays into a larger array. The size and shape or an array are the number of elements and dimensions, respectively. These can be determined using the size
and shape
NumPy methods.
counting = np.array([[1, 2, 3], [4, 5, 6]])
counting.size
6
counting.shape
(2, 3)
The NumPy convension is to provide the dimensions of a two-dimensional array as (row, columns).
4.2.1 Reshaping Arrays#
The dimensions of arrays can be modified using the np.reshape()
method. This method maintains the number of elements and order of elements in the array but repacks them into a different number of rows and columns. Because the number of elements is maintained, the new array size needs to be able to contain the same number of elements as the original.
np.reshape(array, dimensions)
In this function, array
is the NumPy array being reshaped and dimensions
is a tuple containing the desired number of rows and columns in that order. The original array must fit exactly into the new dimensions or else NumPy will refuse to change it. This method does not change the original array in place but rather returns a modified copy. This is a good time to note that because this and other NumPy functions are methods for NumPy arrays, they can also be called by listing the array up front like list and string methods presented in chapter 1. For example, the reshape()
function can be called with array.reshape(dimensions)
.
array_1D = np.linspace(0, 9.5, 20)
array_1D
array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. ,
6.5, 7. , 7.5, 8. , 8.5, 9. , 9.5])
The following code reshapes the array into a 4 \(\times\) 5 array.
array_2D = np.reshape(array_1D, (4, 5))
array_2D
array([[0. , 0.5, 1. , 1.5, 2. ],
[2.5, 3. , 3.5, 4. , 4.5],
[5. , 5.5, 6. , 6.5, 7. ],
[7.5, 8. , 8.5, 9. , 9.5]])
As an alternative and preferred way to reshape an array, the reshape()
function can be used as an array method. Start with the origional array and follow it with .reshape((rows, columns))
like below. This format is often preferred and will be used often herein.
array_1D.reshape((4,5))
array([[0. , 0.5, 1. , 1.5, 2. ],
[2.5, 3. , 3.5, 4. , 4.5],
[5. , 5.5, 6. , 6.5, 7. ],
[7.5, 8. , 8.5, 9. , 9.5]])
If you need to reshape an array with only one new dimension known, place a -1
in the other. This signals to NumPy that it should choose the second dimension to make the data fit.
4.2.2 Flatten Arrays#
Flattening an array takes a higher-dimensional array and squishes it into a one-dimensional array. To flatten out an array, the np.flatten()
method is often the most convenient way.
array_2D.flatten()
array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. ,
6.5, 7. , 7.5, 8. , 8.5, 9. , 9.5])
The format of the output makes it look like it is still a 2D array, but notice that there is a comma instead of a square bracket at the end of the first row. The dimensions of this array are 1 \(\times\) 20.
4.2.3 Transpose Arrays#
Transposing an array rotates the array around the diagonal (Figure 1).
Figure 1 The np.transpose()
or array.T
method transposes the NumPy array effectively flipping the rows and columns.
The np.transpose()
method flips the rows and columns. NumPy also provides an alias/shortcut of array.T
to accomplish the same outcome. The latter is far more common, so it is the method used here.
array_2D
array([[0. , 0.5, 1. , 1.5, 2. ],
[2.5, 3. , 3.5, 4. , 4.5],
[5. , 5.5, 6. , 6.5, 7. ],
[7.5, 8. , 8.5, 9. , 9.5]])
array_2D.T
array([[0. , 2.5, 5. , 7.5],
[0.5, 3. , 5.5, 8. ],
[1. , 3.5, 6. , 8.5],
[1.5, 4. , 6.5, 9. ],
[2. , 4.5, 7. , 9.5]])
4.2.4 Merge Arrays#
Merging arrays can be done in multiple ways. NumPy provides convenient methods for merging arrays using np.vstack
, np.hstack
, and np.dstack
which merge arrays along the vertically, horizontally, and depth-wise axes, respectively (Figure 2).
Figure 2 NumPy arrays can be stacked vertically (top left), as columns (top center), depth-wise (top right), or horizontally (bottom) using the np.vstack()
, np.column_stack()
, np.dstack()
, and np.hstack()
functions, respectively.
a = np.arange(0, 5)
b = np.arange(5, 10)
np.vstack((a, b))
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
np.hstack((a, b))
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.dstack((a, b))
array([[[0, 5],
[1, 6],
[2, 7],
[3, 8],
[4, 9]]])
A related function is the np.column_stack()
function that stacks the corresponding elements in column lists in a column arrangement.
np.column_stack((a,b))
array([[0, 5],
[1, 6],
[2, 7],
[3, 8],
[4, 9]])
The outcome of the np.column_stack()
function can also be accomplished by transposing the output of the np.vstack()
function.
np.vstack((a,b)).T
array([[0, 5],
[1, 6],
[2, 7],
[3, 8],
[4, 9]])
4.3 Indexing Arrays#
Similar to lists, it is often useful to be able to index and slice ndarrays. Because arrays are often higher dimensional, there are some differences in indexing that provide extra convenience.
4.3.1 One-Dimensional Arrays#
Indexing one-dimensional arrays is done in an identical fashion to lists. Simply include the index value(s) or range in square brackets behind the array name.
array_1D
array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. ,
6.5, 7. , 7.5, 8. , 8.5, 9. , 9.5])
array_1D[5]
np.float64(2.5)
4.3.2 Two-Dimensional Arrays#
Two-dimensional arrays can be also indexed in a similar fashion to nested lists, but because arrays are often multidimensional, there is also a shortcut below to make work with arrays more convenient. To access the entire second row of an array, provide the row index in square brackets behind the array name just like indexing in lists.
array_2D
array([[0. , 0.5, 1. , 1.5, 2. ],
[2.5, 3. , 3.5, 4. , 4.5],
[5. , 5.5, 6. , 6.5, 7. ],
[7.5, 8. , 8.5, 9. , 9.5]])
array_2D[1]
array([2.5, 3. , 3.5, 4. , 4.5])
To access the first element in the second row, it is perfectly valid to use two adjacent square brackets just as one would use in a nested list of lists. However, to make work more convenient, these square brackets are often combined with the row and column indices separated by commas.
array_name[rows, columns]
array_2D[1][0]
np.float64(2.5)
array_2D[1, 0]
np.float64(2.5)
Ranges of values can also be accessed in arrays by using slicing. The following array input generates a slice of the second row of the array.
array_2D[1, 1:]
array([3. , 3.5, 4. , 4.5])
As seen above, if you want to access an entire row, it is not necessary to indicate the columns. It is implicitly understood that all columns are requested. However, if you want to access the first column, something needs to be placed before the column. The easiest solutions is the use a colon to explicitly indicate all rows.
array_2D[0] # implicitly understood all columns
array([0. , 0.5, 1. , 1.5, 2. ])
array_2D[0, :] # explicit indicating all columns
array([0. , 0.5, 1. , 1.5, 2. ])
array_2D[:, 0] # all rows
array([0. , 2.5, 5. , 7.5])
4.3.3 Advanced Indexing#
In the event you have a multidimensional array, you can access elements in the array using multiple collections of values (i.e., ndarrays, lists, or tuples) where each collection indicates the location along a different dimension. This is an instance of fancy indexing. For example, if we want to select the following bolded, orange elements from array_2D
, we can create two lists - the first list contains the row indicies for each element and the second list likewise contains the column indicies.
row = [2,2,0]
col = [0,1,3]
array_2D[row, col]
array([5. , 5.5, 1.5])
Another feature of indexing ndarrays is that the returned array will have the same dimensions as the array containing the indicies. In the following example, we have two index arrays where i_flat
is a 1 \(\times\) 4 array while i_square
is a 2 \(\times\) 2 array resulting in 1 \(\times\) 4 and 2 \(\times\) 2 arrays, respectively
threes = np.arange(3, 30, 3)
i_flat = np.array([0, 3,1, 5])
i_square = np.array([[0, 3],
[1, 5]])
threes[i_flat]
array([ 3, 12, 6, 18])
threes[i_square]
array([[ 3, 12],
[ 6, 18]])
The latter result can also be accomplished by indexing using a flat (i.e., one-dimensional) array followed by reshaping it to the desired dimensions as demonstred below.
i = np.array([0, 3,1, 5])
threes[i].reshape((2,2))
array([[ 3, 12],
[ 6, 18]])
4.3.4 Masking#
Elements in a NumPy array can also be selected using a boolean array through a process known as masking. The masking array is a boolean array filled with either 1
and 0
or True
and False
and has the same dimensions as the origional array. Any element in the origional array that has a 1
or True
in the corresponding position of the masking array is returned.
For example,
orig_array = np.array([[5, 7, 1],
[3, 4, 2],
[0, 9, 8]])
mask = np.array([[0, 1, 0],
[1, 1, 1],
[1, 0, 1]], dtype=bool)
orig_array[mask]
array([7, 3, 4, 2, 0, 8])
It’s important to note that if you use 1
and 0
in the masking array, it is required that you include dtype=bool
or else NumPy will treat the 1
and 0
as indicies instead of booleans and attempt indexing.
mask = np.array([[0, 1, 0],
[1, 1, 1],
[1, 0, 1]])
orig_array[mask]
array([[[5, 7, 1],
[3, 4, 2],
[5, 7, 1]],
[[3, 4, 2],
[3, 4, 2],
[3, 4, 2]],
[[3, 4, 2],
[5, 7, 1],
[3, 4, 2]]])
The true power of masking is when the masking array is generated through boolean logic such as >
, <=
, or ==
. This enables the user to select elements of an array through conditions as demonstrated below where we select all elements of the orig_array
that are greater than 5.
cond = orig_array > 5
cond
array([[False, True, False],
[False, False, False],
[False, True, True]])
orig_array[cond]
array([7, 9, 8])
We can also include the condition directly in the square brackets to save a step as shown below.
orig_array[orig_array > 5]
array([7, 9, 8])
4.4 Vectorization & Broadcasting#
One of the major advantages of NumPy arrays over lists is that operations automatically vectorize across the arrays. That is, mathematical operations propagate through the array(s) instead of requiring for
loops. This both speeds up the calculations and makes the code easier to read and write.
4.4.1 NumPy Functions#
Let’s take the square root of numbers using NumPy’s np.sqrt()
function. The square root is taken of each elements automatically.
squares = np.array([1, 4, 9, 16, 25])
np.sqrt(squares)
array([1., 2., 3., 4., 5.])
Performing this operation requires NumPy’s sqrt()
function. If this is attempted with the math
module’s sqrt()
function, an error is returned because this function cannot take the square root of a multi-element object without loops.
import math
math.sqrt(squares)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[51], line 2
1 import math
----> 2 math.sqrt(squares)
TypeError: only length-1 arrays can be converted to Python scalars
4.4.2 Scalars & Arrays#
When performing mathematical operations between a scalar and an array, the same operation is performed across each elements of the array returning an array of the same dimension as the starting array. Below, an array is multiplied by the scaler 3 which results in every element in the array being multiplied by this value.
3 * np.array([[5, 6], [7, 8]])
array([[15, 18],
[21, 24]])
The same outcome arises when performing a similar operation between a 1 \(\times\)1 array and a larger array.
np.array([2]) + np.array([10, 20])
array([12, 22])
4.4.3 Arrays of the Same Dimensions#
If a mathematical operation is performed between two arrays of the same dimensions, then the mathematical operation is performed between corresponding elements in the two arrays. For example, if a pair of 2 \(\times\) 2 arrays are added to one another, then the corresponding elements are added to one another. This means that the top left elements are added together and so on.
a = np.array([[1,2], [3,4]])
b = np.array([[5,6], [7,8]])
a + b
array([[ 6, 8],
[10, 12]])
4.4.4 Arrays of Different Dimensions#
Broadcasting is another form of vectorization that is a series of rules for dealing with mathematical operations between two arrays of different dimensions. In broadcasting, the one of the dimensions of the two arrays must be either identical or one-dimensional, otherwise nothing happens except an error message. To deal with the different dimensions, NumPy clones the array with fewer dimensions out so that it has the same dimensions as the other array. It should be noted that NumPy does not really clone out the array in the background; its behavior acts as if it does. It is a convenient way of thinking about the behavior and results. For example, below is the addition between a 2\(\times\)2 and a 1\(\times\)2 array.
To make the two arrays the same size, the smaller array is cloned along the smaller dimension until the two arrays are the same size as shown below. We are then left with simple corresponding element by corresponding element mathematical operations described in section 4.4.3.
a = np.array([[1,2], [3,4]])
b = np.array([2,5])
a + b
array([[3, 7],
[5, 9]])
What happens if a mathematical operation is performed between an array of higher dimensions with a scalar or a 1\(\times\)1 array as shown below? You already probably know the answer from section 4.4.2, but here is how to rationalize the behavior. In this case, no dimensions are the same, but being that one of the arrays has dimensions of one where the two arrays differ, the arrays still broadcast.
Again, the smaller array is cloned until the two arrays are the same size.
a = np.array([[1,2], [3,4]])
b = np.array([2])
a * b
array([[2, 4],
[6, 8]])
Finally, if we attempt to perform a mathematical operation between two arrays with different dimensions and none of the arrays have a dimension of one where the two arrays are different, an error is raised, and no operation is performed.
a = np.array([[1,2], [3,4]])
b = np.array([[1,1,1], [2,2,2], [3,3,3]])
a + b
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[57], line 3
1 a = np.array([[1,2], [3,4]])
2 b = np.array([[1,1,1], [2,2,2], [3,3,3]])
----> 3 a + b
ValueError: operands could not be broadcast together with shapes (2,2) (3,3)
4.4.5 Vectorizing Python Functions#
Standard Python functions are often designed to perform a calculation a single time and output Python objects and not NumPy arrays. As an example, the following function calculates the rate of a first-order reaction given the rate constant (k
) and concentration of reactant (conc
).
def rate(k, conc):
return k * conc
rate(1.2, 0.80)
0.96
What happens if we attempt the above calculation using a list of concentration values?
concs = [0.1, 0.5, 1.0, 1.5, 2.0]
rate(1.2, concs)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[61], line 1
----> 1 rate(1.2, concs)
Cell In[58], line 2, in rate(k, conc)
1 def rate(k, conc):
----> 2 return k * conc
TypeError: can't multiply sequence by non-int of type 'float'
We get an error because Python cannot multiply a list by a value the way NumPy can. However, the above function can be converted to a NumPy function using np.vectorize()
which will allows the function to perform the calculation on a series of values and returns a NumPy array.
vrate = np.vectorize(rate)
vrate(1.2, concs)
array([0.12, 0.6 , 1.2 , 1.8 , 2.4 ])
4.5 Array Methods#
Technically, NumPy array methods have already been employed in this chapter. The functions above are NumPy methods specifically for working with NumPy arrays. If an array is fed to many non-NumPy functions, an error will result because they cannot handle multi-element objects or arrays specifically. Interestingly, if a float or integer is fed into a NumPy method, it will still work. As an example, the integer 4 can be fed into the np.sqrt()
function as well as an array of values.
np.sqrt(4)
np.float64(2.0)
np.sqrt(np.array([1, 4, 9]))
array([1., 2., 3.])
NumPy contains an extensive listing of methods for working with arrays… so much so that it would be impractical to list them all here. However, below are tables of some common and useful methods. It is worth browsing and being aware of them; many are worth committing to memory. If you ever find yourself needing to manipulate an array in some complex fashion, it is worth doing a quick internet search and including “NumPy” in your search. You will likely either find additional NumPy methods that will help or advice on how others solved a similar problem.
Table 1 Common Methods for Generating Arrays
Method |
Description |
---|---|
|
Generates an array from another object |
|
Creates an array from [start, stop) with a given step size |
|
Creates an array from [start, stop] with given number of steps |
|
Creates an “empty” array (actually filled with garbage) |
|
Generates an array of a given dimensions filled with zeros |
|
Generates an array of a given dimensions filled with ones |
|
Generates an array using a Python function |
|
Loads text file data into an array |
|
Load text file data into an array (cannot handle missing data) |
Table 2 Array Attribute Methods
Method |
Description |
---|---|
|
Returns the dimensions of an array |
|
Returns the number of dimensions (e.g., a 2D array is 2) |
|
Returns the number of elements in an array |
Table 3 Array Modification Methods
Method |
Description |
---|---|
|
Flattens an array in place |
|
Returns a flattened view of the array without changing the array |
|
Reshapes an array in place |
|
Returns a resized view of an array without modifying the original |
|
Returns a view of transposed array |
|
Vertically stacks an arrays into a new array |
|
Horizontally stacks an arrays into a new array |
|
Depth-wise stacks an arrays into a new array |
|
Splits an array vertically |
|
Splits an array horizontally |
|
Splits an array depth-wise |
|
Creates a meshgrid (see chapter 3 for an example) |
|
Sorts elements in array; defaults along last axis |
|
Returns index values of sorted array |
|
Sets all values in an array to |
|
Rolls the array along the given axis; elements that fall off one end of the array appear at the other |
|
Returns the floor (i.e., rounds down) of all elements in an array |
|
Rounds every number in an array |
Table 4 Array Measurement and Statistics Methods
Method |
Description |
---|---|
|
Returns the minimum value in the array |
|
Returns the maximum value in the array |
|
Returns argument (i.e., index) of min |
|
Returns argument (i.e., index) of max |
|
Returns argument (i.e., index) of the local max |
|
Returns the element-by-element min between two arrays of the same size |
|
Returns the element-by-element max between two arrays of the same size |
|
Returns the specified percentile |
|
Returns the mean (commonly known as the average) |
|
Returns the median |
|
Returns the standard deviation; be sure to include |
|
Returns counts and bins for a histogram |
|
Returns the cumulative product |
|
Returns the cumulative sum |
|
Returns the sum of all elements |
|
Returns the product of all elements |
|
Returns the peak-to-peak separation of max and min values |
|
Returns an array of unique elements in an array, set |
|
Returns an array of unique elements in an array and a second array with the frequency of these elements |
4.6 Missing Data#
Real data sets frequently contain gaps or missing values, so it is important to be able to deal with missing data. When importing data into NumPy, there are two commonly employed functions, np.genfromtxt()
and np.loadtxt()
. Though these are largly analogous functions in terms of capabilities, there is a key difference in that np.genfromtxt()
can handle missing data while np.loadtxt()
cannot. This means if your data set may contain gaps, you should use np.genfromtxt()
.
In the event the data file contains a gap, the np.genfromtxt()
function will place a nan
in that loacation by default. The nan
stands for “not a number” and is simply a placeholder. For example, the file dHf_ROH.csv contains the number of carbons in linear alcohols and the gas-phase heat of formation in kJ/mol of each alcohol. The value for 1-undecanol (eleven carbons) is missing, so np.genfromtxt()
places a nan
in its place.
np.genfromtxt('data/dHf_ROH.csv', delimiter=',')
array([[ 1., -205.],
[ 2., -234.],
[ 3., -256.],
[ 4., -277.],
[ 5., -298.],
[ 6., -316.],
[ 7., -340.],
[ 8., -356.],
[ 9., -377.],
[ 10., -395.],
[ 11., nan],
[ 12., -437.]])
Some data files use placeholder values instead of no value at all. These placeholders are often -1
, 0
, 999
, or some physically meaningless or improbable value. If you have an alternative values you want in the missing data location, you can specify this using the filling_values=
argument. As an example below, the missing value is replaced with a 999
.
np.genfromtxt('data/dHf_ROH.csv', delimiter=',', filling_values=999)
array([[ 1., -205.],
[ 2., -234.],
[ 3., -256.],
[ 4., -277.],
[ 5., -298.],
[ 6., -316.],
[ 7., -340.],
[ 8., -356.],
[ 9., -377.],
[ 10., -395.],
[ 11., 999.],
[ 12., -437.]])
In the event you have data with missing values, the nan
placeholders can pose an issue when running statistics on the data. Below, we use the np.mean()
method to try to calculate the mean enthalpy of formation but get a nan
instead because the np.mean()
function cannot handle the placeholder.
dHf = np.genfromtxt('data/dHf_ROH.csv', delimiter=',')
np.mean(dHf[:,1])
np.float64(nan)
Alternatively, NumPy has a number of versions of functions (Table 5) that are specifically designed to handle data with missing values.
Table 5 Statistics Methods Dealing with NaNs
Function |
Description |
---|---|
|
Standard deveation |
|
Mean |
|
Variance |
|
Median |
|
Qth percentile |
|
Qth quantile |
np.nanmean(dHf[:,1])
np.float64(-317.3636363636364)
np.nanquantile(dHf[:,1], 0.6)
np.float64(-298.0)
4.7 Random Number Generation#
Stochastic simulations, addressed in chapter 9, are a common tool in the sciences and rely on a series of random numbers, so it is worth addressing their generation using NumPy. Depending upon the requirements of the simulation, random numbers may be a series of floats or integers, and they may be generated from various ranges of values. The numbers may also be generated as a uniform distribution where all values are equally likely or biased a distribution where some values are more probable than others. Below are random number functions from the NumPy random
module useful in generating random number distributions to suit the needs of your simulations.
4.7.1 Uniform Distribution#
The simplest distribution is the uniform distribution of random numbers where every number in the range has an equal probability of occurring. The distribution may not always appear as even with small sample sizes due to the random nature of the number generation, but as a larger population of samples is generated, the relative distribution will appear more even. The histograms below (Figure 3) are of a hundred (left) and a hundred thousand (right) randomly generated floats from the [0,1) range in an even distribution. While the plot on the right appears more even, this is mostly an effect of the different scales.
Figure 3 Histograms of a hundred (left) and a hundred thousand (right) randomly generated floats from the [0,1) range in an even distribution using the random()
method.
Starting in verion 1.18, NumPy’s preferred method for producing random values is through a Generator called using the rng = np.random.default_rng()
function. Once a generator has been crated, it can be used to generate the neccesary random values. NumPy has multiple methods available for generating evenly-distributed random numbers including the following two functions where n
is the number of random values to be generated. The rng.random(n)
function generates n
random floats from the range [0,1). The rng.integers(low, high=, size=n)
function generates random integers in the range [low, high) and can generate multiple values using the size argument.
rng = np.random.default_rng()
rng.random(n)
rng.integers(low, high=x, size=n)
Warning
Prior to version 1.18 of NumPy, random numbers were generated using function calls that look like np.random.randint()
. While these should still work, they are considered legacy, so it is uncertain how long they will continue to be supported.
rng = np.random.default_rng()
rng.random(5)
array([0.77435051, 0.71700695, 0.60990918, 0.52450878, 0.29885536])
rng.integers(0, high=10, size=10)
array([4, 0, 5, 6, 8, 4, 0, 7, 4, 1])
4.7.2 Binomial Distribution#
A binomial distribution results when values are generated from two possible outcomes. This is useful for applications such as deciding if a simulated molecule reacts or whether a polymer chain terminates or propagates. The two outcomes are represented by a 0
or 1
with the probability, p
, of a 1
being generated. Binomial distributions are generated by the NumPy random module using the rng.binomial()
function call.
rng = np.random.default_rng()
rng.binomial(t, p, size=n)
The t
argument is the number of trials while the size=
argument is the number of generated values. For example, if t = 2
, two binomial values are generated and the sum is returned, which may be 0
, 1
, or 2
. Basic probability predicts that these sums will occur in a 1:2:1 ratio, respectively. If t
is increased to 10
, a shape more closely representing a bell curve is obtained. A Bernoulli distribution is the specific instance of a binomial distribution where t = 1. The histograms below (Figure 4) are of a hundred randomly generated numbers in a binomial distribution with p = 0.5
and where t = 1
(left), t = 2
(center), and t = 10
(right).
Figure 4 Histograms of a hundred randomly generated numbers in a binomial distribution with p = 0.5
and t = 1
(left), t = 2
(center), and t = 10
(right).
4.7.3 Poisson Distribution#
A Poisson distribution is a probability distribution of how likely it is for independent events to occur in a given interval (time or space) with a known average frequency (\(\lambda\)). Each sample in a poisson distribution is a count of how many events have occurred in the time interval, so they are always integers. NumPy can generate integers in a Poisson distribution using the rng.poisson()
function, which accepts two arguments.
rng.poisson(lam=1.0, size=n)
The first argument, \(\lambda\) (lam
), is the statistical mean for the values generated, and the second argument, size
, is the requested number of values. For example, a Geiger counter can be simulated detecting background radiation in a location that is known to have an average of 3.6 radiation counts per second with the following function call.
rng.poisson(lam=3.6, size=30)
array([2, 2, 4, 2, 3, 2, 4, 3, 2, 6, 3, 3, 4, 3, 4, 4, 3, 0, 5, 5, 4, 4,
8, 2, 2, 3, 7, 4, 4, 5])
The returned array of values are the total radiation detections for each second for thirty seconds, and the mean value is 3.8 counts. While not precisely the target of 3.6 counts, it is close and larger sample sizes are statistically more likely to generate results closer to the target value. A histogram of these values is shown in below (Figure 5, left). When this simulation was repeated with thirty thousand samples (Figure 5, right), a mean of 3.61 counts is obtained. In addition, the larger number of values results in a classic Poisson distribution curve which appears something like a bell curve with more tapering on the high end.
Figure 5 Histograms of thirty (left) and thirty thousand (right) randomly generated integers in a Poisson distribution with a target mean (\(\lambda\)) of 3.6 (dashed red line).
Alternative distributions of random number can be generated by manipulating the output of the above functions. For example, random numbers in a [-1, 1) distribution, which is useful in a 2D diffusion simulation, can be generated by subtracting 0.5 from values in the range [0, 1) and multiplying by two.
rand_float = 2 * (rng.random(10) - 0.5)
rand_float
array([-0.40626378, -0.95185875, -0.13479041, 0.9508908 , 0.21275482,
0.50651172, 0.13768064, 0.98436196, -0.47759103, -0.2360542 ])
4.7.4 Other Functions#
The random module in NumPy also includes a large variety of other random number and sequence generators. This includes rng.normal()
which generates values centered around zero in a normal distribution. The rng.choice()
function selects a random value from a provided array of values, while the rng.shuffle()
function randomizes the order of values for a given array. Other random distribution functions can be found on the SciPy website (see Further Reading). A summary of common NumPy random functions are in Table 6.
Table 6 Summary of Common NumPy np.random
Functions
Function |
Description |
---|---|
|
Generates random floats in the range [0,1) in an even distribution |
|
Generates random integers from a given range in an even distributionb |
|
Generates random floats in a normal distribution centered around zero |
|
Generates random integers in a binomial distribution; takes a probability , |
|
Generates random floats in a Poisson distribution; takes a target mean argument ( |
|
Selects random values taken from a 1-D array or range |
|
Randomizes the order of an array |
rng.random(1)
array([0.28404729])
rng.integers(0, high=100)
np.int64(29)
rng.normal(loc=0.0, scale=1.0, size=3)
array([ 1.23226045, 0.6006648 , -1.05247125])
rng.binomial(2, p=0.5, size=3)
array([1, 1, 2])
rng.poisson(lam=2.0, size=5)
array([2, 2, 1, 5, 5])
rng.choice(20, size=3)
array([ 5, 19, 2])
arr = np.array([0, 1, 2, 3, 4])
rng.shuffle(arr)
arr
array([0, 1, 2, 3, 4])
Further Reading#
The NumPy documentation is well written and a good resource. Because NumPy is the foundation of the SciPy ecosystem, if you find a Python book on scientific computing, odds are that it will discuss or use NumPy at some level.
NumPy Website. http://www.numpy.org/ (free resource)
NumPy User Guide. https://numpy.org/doc/stable/user/index.html#user (free resource)
VanderPlas, J. Python data Science Handbook: Essential Tools for Working with Data, 1st ed.; O’Reilly: Sebastopol, CA, 2017, chapter 2. Freely available from the author at https://jakevdp.github.io/PythonDataScienceHandbook/ (free resource)
Exercises#
Complete the following exercises in a Jupyter notebook using NumPy and NumPy arrays. Avoid using for
loops whenever possible. Any data file(s) refered to in the problems can be found in the data folder in the same directory as this chapter’s Jupyter notebook. Alternatively, you can download a zip file of the data for this chapter from here by selecting the appropriate chapter file and then clicking the Download button.
Generate an array containing the atomic numbers for the first 26 elements.
The following equation defines the relationship between energy (J) of a photo and its wavelength (m) where h is Plank’s constant (6.626 \(\times \, 10^{-34} J\cdot s\)) and c is the speed of light in a vacuum (2.998 \(\times \, 10^8 m/s\)).
\[ E = \frac{hc}{\lambda} \]a. Generate an array containing the wavelengths of visible light (4.00 \(\times\) 10\(^{-7}\) m \(\rightarrow\) 8.00 \(\times\) 10\(^{-7}\) m) in 5 \(\times\) 10\(^{-8}\) m increments.
b. Generate a second array containing the energy of each wavelength of light from part a.
Generate an array containing 101.325 a hundred times.
The following array contains temperatures in Fahrenheit. Convert these values to \(^\circ\)C without using a for loop.
F = array([0, 32, 100, 212, 451])
Generate two arrays containing the following sine functionsfrom x = 0 \(\rightarrow 10 \pi\)
\[ y = sin(x) \]\[ y = sin(1.1x + 0.5) \]a. Plot these two sine waves on the same plot.
b. Add the two sine functions together and plot the result.
c. Explain why the signal in part b is smaller in one area and larger in another. Hint: look at your plot for part a to see how the two origonal sine waves related to each other.
The numerical relationship between \(\Delta G^o\) and K (equilibrium constant) is shown below. Plot \(\Delta G^o\) versus K at standard temperature and pressure for K values of 0.001 \(\rightarrow\) 1000. Use NumPy arrays and do not use any for loops.
\[ \Delta G = -RTln(k) \]The numerical relationship between k (rate constant) and E\(_a\) is shown below. Plot k versus Ea at standard temperature and pressure for activation energies of 1 \(\rightarrow\) 20 kJ/mol. Use NumPy arrays, do not use any for loops, and use A = 1. Watch your energy units carefully.
\[ k = Ae^{-E_a / RT} \]Generate an array containing integers 0 \(\rightarrow\) 14 (inclusive).
a. Reshape the array to be a 3 \(\times\) 5 array.
b. Transpose the array, so now it should be a 5 \(\times\) 3 array.
c. Make the array one-dimensional again without using the
reshape()
method.Generating an Combining arrays – Bohr hydrogen atom.
a. Create an array containing the principle quantum numbers (n) for the first eight orbits of a hydrogen atom (e.i., 1 \(\rightarrow\) 8).
b. Generate a second array containing the energy (J) of each orbit in part A for a Bohr model of a hydrogen atom using the equation below.
\[ E = -2.18 \times 10^{-18}J \frac{1}{n^2} \]c. Combine the two arrays from parts A and B into a new 8 \(\times\) 2 array with the first column containing the principle quantum numbers and the second containing the energies.
Generate a one-dimensional array with the following code and index the 5th element of the array.
arr = np.random.randint(0, high=10, size=10)
Generate a two-dimensional array with the following code.
arr2 = np.random.randint(0, high=10, size=15).reshape(5, 3)
a. Index the second element of the third column.
b. Slice the array to get the entire third row.
c. Slice the array to access the entire first column.
d. Slice the array to get the last two elements of the first row.
Predict the outcome of the following operation between two NumPy arrays. Test your your prediction.
\[\begin{split} \left[ \begin{array}{cc} 1 & 1 \\ 2 & 2 \end{array} \right] + \left[1 \right] = \,\, ?\end{split}\]Predict the outcome of the following operation between two NumPy arrays. Test your your prediction.
\[\begin{split} \left[ \begin{array}{ccc} 1 & 8 & 9 \\ 8 & 1 & 9 \\ 1 & 8 & 1 \end{array} \right] + \left[ \begin{array}{cc} 1 & 1 \\ 1 & 1 \end{array} \right] = \,\, ? \end{split}\]Predict the outcome of the following operation between two NumPy arrays. Test your your prediction.
\[\begin{split} \left[ \begin{array}{cc} 1 & 8 \\ 3 & 2 \end{array} \right] + \left[ \begin{array}{cc} 1 & 1 \\ 1 & 1 \end{array} \right] = \,\, ?\end{split}\]For the following randomly-generated array:
arr = np.random.rand(20)
a. Find the index of the largest values in the following array.
b. Calculate the mean value of the array.
c. Calculate the cumulative sum of the array.
d. Sort the array.
Generate a random array of values from -1 \(\rightarrow\) 1 (exclusive) and calculate its median value. Hint: start with an array of values 0 \(\rightarrow\) 1 (exclusive) and manipulate it.
Generate a random array of integers from 0 \(\rightarrow\) 35 (inclusive) and then sort it.
Hydrogen nuclei can have a spin of +1/2 and -1/2 and occur in approximately a 1:1 ratio. Simulate the number of +1/2 hydrogen nuclei in a molecule of six hydrogen atoms and plot the distribution. Hint: being that there are two possible outcomes, this can be simulated using a binomial distribution. See section 4.7.2.
Using NumPy’s
random
module, generate a random DNA sequence (i.e., series of ‘A’, ‘T’, ‘C’, and ‘G’ bases) 40 bases long stored in an array.