Appendix 4: Regular Expressions#
There is a saying that synthetic chemists spend 10% of their time running reactions and 90% of their time purifying compounds. Similarly, one could say that working with chemical data is 10% performing the intended calculations or analyses and 90% cleaning and organizing the data. While both are hyperbole, they underline the large amount of effort that cleanup requires. This chapter is dedicated to a powerful method known as regular expressions, or regex for short, for cleaning and filtering text data, especially in situations requiring complex pattern matching. Python string methods and indexing offer basic search and filtering functionality, but they tend to only allow for identifying simple and consistent patterns. For example, if you want a file name without the file extension (e.g., titration instead of titration.png), this can be solved using indexing and the string split() method because the file extension always follows the last period in the full file name. Likewise, data from a PDB file can be parsed with only a string search and slicing because PDB files follow very strict formatting rules based on labels and positions in rows. These two examples are not terribly complex to parse because they are consistent and were designed to be machine readable. However, not all data follow well-defined formatting rules, and there may be more variation that needs to be accounted for. Regular expressions are not strictly a Python feature but rather a syntax supported by Python through the re module imported below. This is a built-in module, so it comes with every installation of Python.
import re
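As a brief illustration of the simple, consistent case described above, the following sketch strips a file extension using only string methods. The file name is a made-up example, and rsplit() is used here as one convenient way to split at the last period.

```python
# Hypothetical file name used only for illustration
filename = 'titration.run2.png'

# rsplit() with maxsplit=1 splits only at the last period
stem, extension = filename.rsplit('.', 1)

print(stem)
print(extension)
```

No regular expression is needed here because the rule (split at the last period) never varies.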
Below we will first cover some key functions from the re module, followed by generating more complex patterns, and finally end with a couple of examples drawn from chemical databases and the literature.
Regular Expression Basics#
re
Functions#
The re
module provides a series of functions including those listed in Table 1 that allow the user to search for, split on, or substitute for patterns within a string. Additional functions can be found on the Python regular expressions page.
Table 1 Select re Functions

| Regex Function | Description |
|---|---|
| re.findall() | Returns a list of strings that match the pattern |
| re.finditer() | Returns iterable of Match objects |
| re.search() | Returns the first pattern match as Match object |
| re.split() | Splits string at pattern matches |
| re.sub() | Replaces all occurrences with new string |
These functions take a pattern to search for, which in the most basic scenarios can be a simple string, along with the string to be searched. In the example below, we search a string of amine names for aniline derivatives by using 'aniline' as the pattern.
amines = '2-methylcyclohexylamine N-methylaniline 3-methylbutylamine N-methyl-3-pentanamine o-methylaniline'
pattern = 'aniline'
re.findall(pattern, amines)
['aniline', 'aniline']
This is not terribly informative because all it tells us is that 'aniline' appears twice. The re.finditer() function can be used instead to return an iterator of Match objects, which can be accessed using either a for loop or the list() function. We can see below that there are two matches, along with the indices of those matches and the string that matches the pattern.
for x in re.finditer('aniline', amines):
    print(x)
<re.Match object; span=(32, 39), match='aniline'>
<re.Match object; span=(90, 97), match='aniline'>
list(re.finditer('aniline', amines))
[<re.Match object; span=(32, 39), match='aniline'>,
<re.Match object; span=(90, 97), match='aniline'>]
To access the matched strings, use the group() method on the Match objects like below.
for x in re.finditer('aniline', amines):
    print(x.group())
aniline
aniline
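Besides group(), Match objects also expose the match location through the start(), end(), and span() methods. The sketch below reuses the amine string from above and shows one way these indices could be used to slice the original string.

```python
import re

amines = ('2-methylcyclohexylamine N-methylaniline 3-methylbutylamine '
          'N-methyl-3-pentanamine o-methylaniline')

for match in re.finditer('aniline', amines):
    # start() and end() are the same indices shown in the span
    print(match.start(), match.end(), match.group())

# Collect the spans, then use the first one to slice the original string
spans = [match.span() for match in re.finditer('aniline', amines)]
start, end = spans[0]
print(amines[start:end])
```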
The re
module can also be used to find and replace patterns such as replacing 'aniline'
with 'anilinium'
like below.
re.sub('aniline', 'anilinium', amines)
'2-methylcyclohexylamine N-methylanilinium 3-methylbutylamine N-methyl-3-pentanamine o-methylanilinium'
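The re.search() and re.split() functions from Table 1 follow the same calling pattern. The short sketch below is only an illustration using the same amine string; 'pyridine' is an arbitrary name chosen because it does not appear in the string.

```python
import re

amines = ('2-methylcyclohexylamine N-methylaniline 3-methylbutylamine '
          'N-methyl-3-pentanamine o-methylaniline')

# re.search() stops at the first match and returns a single Match object
first = re.search('aniline', amines)
print(first.span())

# re.search() returns None when nothing matches, so check before using it
missing = re.search('pyridine', amines)
print(missing is None)

# re.split() breaks the string apart wherever the pattern matches
names = re.split(' ', amines)
print(names[1])
```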
We could probably still have done the above tasks with string methods and indexing. The real power of regular expressions is the ability to build more complex and flexible patterns, which is what we address below.
Symbols & Characters#
Let’s try something a little more complicated by searching for any instance of a methyl not located on a nitrogen. This means that the name should have a 'methyl' string with a hyphenated number before it. The re module provides syntax, listed in Table 2, for indicating specific types of characters and delimiters. For example, \d indicates a digit. Many of these character designators also have a negated version using the capital letter, so \D, for example, signifies any character except a digit.
Table 2 Regex Character Designators

| Character Type | Present | Not Present | Description/Examples |
|---|---|---|---|
| Any character | . | | Any character except new line (i.e., \n) |
| Digits | \d | \D | Digits from 0-9 |
| Letters/Word characters | \w | \W | Letters, digits, and underscore (e.g., abcABC) |
| Space | \s | \S | White space, tabs, and end-of-lines |
| Boundary between words | \b | \B | Space, start of line, or non-alphanumeric characters |
| Character at start of string | ^ | | |
| Character at end of string | $ | | |
Because we need a digit before the methyl, the pattern is \d-methyl. Now that we have patterns that use a backslash, you may see a warning about an invalid escape sequence (a SyntaxWarning as of Python 3.12) because the backslash is also the Python escape character. To avoid this, either precede the backslash with another backslash, \\d-methyl, or make your string a raw string by preceding it with an r as is done below.
for x in re.finditer(r'\d-methyl', amines):
    print(x)
<re.Match object; span=(0, 8), match='2-methyl'>
<re.Match object; span=(40, 48), match='3-methyl'>
The \D designator could be used to locate methyls that are not on an aliphatic carbon chain because they do not have numbers before them (at least in this example), as is done below. Now that our patterns are broader, the listing of matches is more informative because we can see that both N-methyl and o-methyl fit our pattern.
for x in re.finditer(r'\D-methyl', amines):
    print(x)
<re.Match object; span=(24, 32), match='N-methyl'>
<re.Match object; span=(59, 67), match='N-methyl'>
<re.Match object; span=(82, 90), match='o-methyl'>
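The word boundary designator \b from Table 2 is not used in the examples above, so here is a small sketch of what it does. The two-name string is hypothetical and was chosen because 'ethanol' also occurs inside 'methanol'.

```python
import re

# Hypothetical string: 'ethanol' is also hidden inside 'methanol'
alcohols = 'methanol ethanol'

# Without a boundary, the pattern matches inside 'methanol' as well
anywhere = [m.span() for m in re.finditer(r'ethanol', alcohols)]
print(anywhere)

# \b requires the match to begin at the start of a word
whole_word = [m.span() for m in re.finditer(r'\bethanol', alcohols)]
print(whole_word)
```

The boundary itself has zero width, so it restricts where a match may occur without adding any characters to the match.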
As another example, below is a string that lists chemical identifiers including chemical names, CAS numbers, and a PubChem CID. The first thing we might want to do is split this up into a list where each item represents a different chemical.
chemicals = ('2-methylphenol   methanol N,N-diethylamine pentanol 281-23-2 '
             'ethyl benzoate glycerol 93-89-0 5793 ethanoic acid acetic anhydride')
Splitting on individual whitespace characters, as demonstrated below, will not work well because some chemicals (ethyl benzoate, ethanoic acid, and acetic anhydride) have a space in the name. There is also a complication where there are multiple spaces after '2-methylphenol'. These problems will be solved below using additional tools from regular expressions.
re.split(r'\s', chemicals)
['2-methylphenol',
'',
'',
'methanol',
'N,N-diethylamine',
'pentanol',
'281-23-2',
'ethyl',
'benzoate',
'glycerol',
'93-89-0',
'5793',
'ethanoic',
'acid',
'acetic',
'anhydride']
Quantifiers#
Let’s first deal with the multiple spaces using the quantifiers in Table 3. These quantifiers allow the user to specify how many of something will be in the pattern. For example, a+ will search for one or more a’s while \s{1,3} looks for 1-3 spaces.
Table 3 Regex Quantifiers

| Quantifier | Description | Example |
|---|---|---|
| * | Search for 0 or more | |
| ? | Search for 0 or 1 | |
| + | Search for 1 or more | |
| {} | Number of the preceding character to search for | |
Below, we use \s+
to split our string of chemicals based on one or more spaces.
re.split(r'\s+', chemicals)
['2-methylphenol',
'methanol',
'N,N-diethylamine',
'pentanol',
'281-23-2',
'ethyl',
'benzoate',
'glycerol',
'93-89-0',
'5793',
'ethanoic',
'acid',
'acetic',
'anhydride']
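To make the differences between the quantifiers more concrete, the following sketch applies each one to the same short test string. The string and the variable names are invented for illustration only.

```python
import re

# Hypothetical test string with zero to three a's before each b
text = 'b ab aab aaab'

zero_or_more = re.findall(r'a*b', text)     # any number of a's, then b
zero_or_one = re.findall(r'a?b', text)      # at most one a, then b
one_or_more = re.findall(r'a+b', text)      # at least one a, then b
exactly_two = re.findall(r'a{2}b', text)    # exactly two a's, then b
two_to_three = re.findall(r'a{2,3}b', text) # two or three a's, then b

for result in (zero_or_more, zero_or_one, one_or_more,
               exactly_two, two_to_three):
    print(result)
```

Note that a pattern such as a{2}b can still match inside a longer run of a's because nothing in the pattern forbids extra characters before the match.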
Lookahead and Lookbehind#
Now let’s address the issue of spaces inside the name. IUPAC nomenclature for esters follows a pattern where the first word always ends in -yl, and carboxylic acids and anhydrides have -ic at the end of the first word (i.e., the carboxyl part). These trends can be used to identify spaces where the string should not be split, and we will carry this out using something known as a lookahead or lookbehind, shown in Table 4. These look for the presence or absence of something before or after our main pattern. We specifically want spaces that do not have a yl or ic preceding them. We will add these one at a time. Below, (?<!yl) is added in front of \s+ to avoid splitting on spaces that follow yl.
Table 4 Lookahead and Lookbehind Syntax

| | Lookahead (\(\rightarrow\)) | Lookbehind (\(\leftarrow\)) |
|---|---|---|
| Present | (?=) | (?<=) |
| Absent | (?!) | (?<!) |
pattern = r'(?<!yl)\s+'
re.split(pattern, chemicals)
['2-methylphenol',
'methanol',
'N,N-diethylamine',
'pentanol',
'281-23-2',
'ethyl benzoate',
'glycerol',
'93-89-0',
'5793',
'ethanoic',
'acid',
'acetic',
'anhydride']
A lookbehind for the ic can also be added like below.
pattern = r'(?<!yl)(?<!ic)\s+'
re.split(pattern, chemicals)
['2-methylphenol',
'methanol',
'N,N-diethylamine',
'pentanol',
'281-23-2',
'ethyl benzoate',
'glycerol',
'93-89-0',
'5793',
'ethanoic acid',
'acetic anhydride']
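The examples above only use the negative lookbehind, so the sketch below illustrates the other entries in Table 4. It reuses the amine string from earlier along with an NMR header string; the variable names are chosen here just for illustration.

```python
import re

header = '1H NMR (CDCl3, 400 MHz):'
amines = ('2-methylcyclohexylamine N-methylaniline 3-methylbutylamine '
          'N-methyl-3-pentanamine o-methylaniline')

# Positive lookahead: digits only when followed by ' MHz'
freq = re.findall(r'\d+(?= MHz)', header)
print(freq)

# Negative lookahead: 'methyl' only when NOT followed by 'aniline'
not_aniline = re.findall(r'methyl(?!aniline)', amines)
print(len(not_aniline))

# Positive lookbehind: 'methyl' only when preceded by 'N-'
on_nitrogen = re.findall(r'(?<=N-)methyl', amines)
print(len(on_nitrogen))
```

In every case, the text inside the lookahead or lookbehind is checked but is not included in the returned match.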
Character Sets#
What happens if there are multiple symbols that need to be matched? When the symbols or characters to be matched are placed in square brackets, [], any one of the characters in the brackets counts as a match. For example, it is not uncommon to see numbers separated by either a period or dash (e.g., phone numbers), so [-.] can be used to indicate that either symbol is a fit. Regular expressions also allow for ranges of letters and numbers such as [a-e] for any of the first five lowercase letters. It is a good idea to place a literal dash first to ensure that it does not get interpreted as part of a range.
Below there is a string of toluene derivatives. If we want to filter for only para-substituted toluene derivatives, the name (at least in this example) should start with either p- or 4-. Both characters can be enclosed in square brackets like [4p]. The next challenge is figuring out how to deal with the rest of the name. We could try .+ to indicate one or more of any character, but this includes white space and returns the rest of the string.
toluene = '3-chlorotoluene 4-methyltoluene p-bromotoluene o-methoxytoluene'
re.findall(r'[4p]-.+', toluene)
['4-methyltoluene p-bromotoluene o-methoxytoluene']
To solve this, we can again use character sets to include any letter, number, or dash like below. By including the +
behind the square brackets, this means one or more of these symbols.
re.findall(r'[4p][-\d\w]+', toluene)
['4-methyltoluene', 'p-bromotoluene']
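Two small variations on this pattern may be worth noting. Because \w already includes digits, [-\w] covers the same characters as [-\d\w]. In addition, the pattern above does not actually require a hyphen after the 4 or p, so a stricter alternative is sketched below using \b and \S+ (one or more non-space characters). The name 'phenol' is a hypothetical test case that is not in the original string.

```python
import re

toluene = '3-chlorotoluene 4-methyltoluene p-bromotoluene o-methoxytoluene'

# \w already covers digits, so this set is equivalent to [-\d\w]
shorter = re.findall(r'[4p][-\w]+', toluene)
print(shorter)

# Stricter: start of a word, then 4 or p, then a required hyphen
stricter = re.findall(r'\b[4p]-\S+', toluene)
print(stricter)

# Hypothetical name that starts with p but is not para-substituted
false_hit = re.findall(r'[4p][-\w]+', 'phenol')
no_hit = re.findall(r'\b[4p]-\S+', 'phenol')
print(false_hit, no_hit)
```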
Groups#
Regular expressions in Python also support the extraction of information from specific segments in a string. In section 1.3.4, string formatting is introduced where the user can create a template string and insert different strings in various locations. Below are examples where the compound and molecular weight can be swapped out using either the format() method or f-string formatting.
compound = 'ammonia'
MW = 17.03
'The molar mass of {} is {} g/mol.'.format(compound, MW)
'The molar mass of ammonia is 17.03 g/mol.'
compound = 'urea'
MW = 60.06
f'The molar mass of {compound} is {MW} g/mol.'
'The molar mass of urea is 60.06 g/mol.'
Groups in regular expressions are essentially the opposite of the above in that information is instead extracted from the string. Groups are helpful for extracting data from a larger pattern. Below are the beginnings of a couple of NMR data listings as they would appear in the chemical literature. If we are interested in the carrier frequency, we simply write out the regular expression as normal but then wrap the part we want to extract in parentheses.
1H NMR (CDCl3, 400 MHz):
13C NMR (C6D6, 100 MHz):
H_NMR = '1H NMR (CDCl3, 400 MHz):'
C_NMR = '13C NMR (C6D6, 100 MHz):'
carrier = r'1\d?[HC] NMR \([\d\w]+, (\d+) MHz\):'
re.findall(carrier, H_NMR)
['400']
re.findall(carrier, C_NMR)
['100']
Multiple groupings can be extracted by wrapping multiple sections in parentheses. Below extracts both the solvent and the carrier frequency.
carrier = r'1\d?[HC] NMR \(([\d\w]+), (\d+) MHz\):'
re.findall(carrier, H_NMR)
[('CDCl3', '400')]
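When several groups are extracted, keeping track of their positions in a tuple can become error prone. As one possible alternative, the sketch below uses named groups, (?P<name>...), together with re.search() so each piece can be retrieved by name. The group names nucleus, solvent, and freq are chosen here for illustration.

```python
import re

C_NMR = '13C NMR (C6D6, 100 MHz):'

# Each (?P<name>...) group can later be retrieved by its name
pattern = (r'(?P<nucleus>1\d?[HC]) NMR '
           r'\((?P<solvent>\w+), (?P<freq>\d+) MHz\):')

match = re.search(pattern, C_NMR)
nucleus = match.group('nucleus')
solvent = match.group('solvent')
freq = int(match.group('freq'))   # matched text is a string, so convert

print(nucleus, solvent, freq)
print(match.groupdict())
```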
Finding CAS Numbers#
Let’s now work through some additional examples. When downloading data files from PubChem, the CAS number is mixed in with other names and numerical identifiers. There are two challenges here. The first is that CAS numbers vary in length. They are always three segments of digits separated by hyphens, such as 58-08-2 or 2501-94-2, where the second segment is always two digits and the third is always a single digit. However, the first segment varies from 2 to 7 digits. The second major issue is that the CAS numbers are mixed in with other chemical identifiers such as CID numbers, common names, and IUPAC names. These other identifiers can include hyphens and numbers, so indexing and string searches cannot easily filter for CAS numbers without a long series of boolean conditions.
This is a relatively simple task for regular expressions. We indicate digits with the \d
and use curly brackets to indicate the number of digits as demonstrated below.
re.findall(r'\d{2,7}-\d{2}-\d', chemicals)
['281-23-2', '93-89-0']
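One caveat is that this pattern can also match part of a longer run of digits. A possible refinement, sketched below, is to add the word boundary \b from Table 2 to both ends of the pattern. The identifier string is invented for this test; its first and last entries are deliberately too long to be CAS numbers.

```python
import re

# Hypothetical identifiers: the first and last are too long to be CAS numbers
identifiers = '12345678-12-3 58-08-2 2501-94-2 58-08-22'

loose = r'\d{2,7}-\d{2}-\d'
strict = r'\b\d{2,7}-\d{2}-\d\b'

# The loose pattern picks up pieces of the two invalid entries
loose_hits = re.findall(loose, identifiers)
print(loose_hits)

# The boundaries require the digits to start and end a word
strict_hits = re.findall(strict, identifiers)
print(strict_hits)
```

This does not verify that a match is a registered CAS number, only that it has the expected shape.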
As a demonstration, PubChem allows for the free download of datasets that include a Synonyms column. This column includes identifiers such as common and IUPAC names, CAS numbers, and PubChem CID numbers. The following code extracts the CAS numbers from one of these files. Two additional challenges arise: multiple CAS numbers may be listed for a given compound, or no CAS number may be listed at all. When there are multiple CAS numbers, the most common one is stored, and if no CAS number is present, a NaN is stored in its place.
# get CAS number from Synonyms column
import pandas as pd
import numpy as np

solv = pd.read_csv('data/solvents.csv')
names = solv['Synonyms']
cas_pattern = r'\d{2,7}-\d{2}-\d'

cas = []
for row in names:
    cas_in_row = re.findall(cas_pattern, row)
    try:
        # get most common CAS number
        most_common_cas = max(set(cas_in_row), key=cas_in_row.count)
        cas.append(most_common_cas)
    except ValueError:
        # append NaN if no CAS number found
        cas.append(np.nan)
cas[:10]
['107-06-2',
'120-82-1',
'67-64-1',
'71-43-2',
'71-36-3',
'111-65-9',
'67-68-5',
'64-17-5',
'75-12-7',
'67-56-1']
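The same logic can also be wrapped in a small helper function, which makes it easier to test on individual strings. The version below is only a sketch using collections.Counter from the standard library; the function name, the sample rows, and the use of None instead of NaN for missing values are all choices made here for illustration, and the sample strings do not come from the actual data file.

```python
import re
from collections import Counter

CAS_PATTERN = re.compile(r'\d{2,7}-\d{2}-\d')

def most_common_cas(synonyms):
    """Return the most frequent CAS-like string in synonyms, or None."""
    if not isinstance(synonyms, str):
        # e.g., a missing value that was read in as NaN
        return None
    found = CAS_PATTERN.findall(synonyms)
    if not found:
        return None
    # most_common(1) returns a list holding one (value, count) tuple
    return Counter(found).most_common(1)[0][0]

# Hypothetical rows standing in for entries of the Synonyms column
rows = ['acetone|67-64-1|2-propanone|67-64-1|180',
        'unknown solvent',
        float('nan')]
results = [most_common_cas(row) for row in rows]
print(results)
```

Checking for non-string input up front avoids the TypeError that re.findall() raises when it is given a missing value instead of text.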
Parse NMR Data#
When data on an NMR spectrum is reported in the literature, it follows relatively strict formatting rules, but these rules are designed to be read by humans, not machines. To make things more complicated, there are numerous commas and spaces in the data, making it difficult to use these as delimiters, so regular expressions are ideal for parsing this kind of data. Below is the \(^1\)H NMR data for butanamide in DMSO-\(d_6\) at 22 \(^\circ\)C following American Chemical Society guidelines.
\(^1\)H NMR ((CD\(_3\))\(_2\)SO, 400 MHz): \(\delta\) 7.23 (br, 1H), 6.70 (br, 1H), 2.00 (t, 2H, J = 7.3 Hz), 1.48 (tq, 2H, J = 7.3, 7.3 Hz), 0.84 (t, 3H, J = 7.3 Hz).
As an example, we will extract the entries for each signal in the NMR spectrum. Each entry looks like 7.23 (br, 1H)
or 0.84 (t, 3H, J = 7.3 Hz)
where the decimal is the chemical shift and additional information on the signal is provided in the parentheses behind the chemical shift.
proton = ('1H NMR ((CD3)2SO, 400 MHz): δ 7.23 (br, 1H), 6.70 (br, 1H), '
          '2.00 (t, 2H, J = 7.3 Hz), 1.48 (tq, 2H, J = 7.3, 7.3 Hz), '
          '0.84 (t, 3H, J = 7.3 Hz).')
Each signal starts with a number to two decimal places, but there may be one or two digits before the decimal place. Even though our example always has one digit before the decimal, we want our code to be robust and versatile. The regular expression for this number is '\d{1,2}\.\d{2}', where the period is preceded by a backslash because an unescaped period matches any character.
nmr_pattern = r'\d{1,2}\.\d{2}'
re.findall(nmr_pattern, proton)
['7.23', '6.70', '2.00', '1.48', '0.84']
Next, the information about the signal is stored in parentheses separated from the chemical shift by a space. We will use '\s+' just in case someone accidentally used multiple spaces. Because parentheses have a special meaning in regular expressions, we need to precede the parenthesis with a backslash to indicate a literal parenthesis character.
nmr_pattern = r'\d+\.\d{2}\s+\('
re.findall(nmr_pattern, proton)
['7.23 (', '6.70 (', '2.00 (', '1.48 (', '0.84 (']
Inside the parentheses are the following:
- Splitting pattern as one or more letters, '\w+'
- Integration as an integer followed by an H, so '\d+H'
- Coupling information starting with J = followed by a decimal number and Hz, so 'J\s+=\s+\d+\.\d+\s+Hz'
nmr_pattern = r'\d+\.\d{2}\s+\(\w+,\s+\d+H,\s+J\s+=\s+\d+\.\d\s+Hz\)'
re.findall(nmr_pattern, proton)
['2.00 (t, 2H, J = 7.3 Hz)', '0.84 (t, 3H, J = 7.3 Hz)']
The current pattern misses the signals that do not include the coupling information or that have multiple coupling constants. This is where quantifiers are helpful. Below, the characters that make up , J = 7.3 are placed in square brackets followed by an asterisk. Note that square brackets define a character set rather than a group, so this matches zero or more of any of the listed characters in any order. That is looser than matching the exact sequence, but it is sufficient for this example.
[,\s+J\s?=\s?\d+.\d]*
nmr_pattern = r'\d+\.\d{2}\s\(\w+,\s+\d+H[,\s+J\s?=\s?\d+.\d]*\sHz\)'
re.findall(nmr_pattern, proton)
['2.00 (t, 2H, J = 7.3 Hz)',
'1.48 (tq, 2H, J = 7.3, 7.3 Hz)',
'0.84 (t, 3H, J = 7.3 Hz)']
Now the regular expression finds all signals that have coupling constants but is still missing the two without coupling constants. This is because the pattern still requires a ' Hz'. Because the ' Hz' may or may not be present, these characters are also placed in a character set followed by an * like below, which matches zero or more spaces, H, or z characters.
[\sHz]*
nmr_pattern = r'\d+\.\d{2}\s\(\w+,\s+\d+H[,\s+J\s?=\s?\d+.\d]*[\sHz]*\)'
re.findall(nmr_pattern, proton)
['7.23 (br, 1H)',
'6.70 (br, 1H)',
'2.00 (t, 2H, J = 7.3 Hz)',
'1.48 (tq, 2H, J = 7.3, 7.3 Hz)',
'0.84 (t, 3H, J = 7.3 Hz)']
It looks like the code finds all the signals. One more addition that would make the code more robust is to allow for a negative chemical shift. While proton chemical shifts are typically positive, negative values do show up in situations such as silanes with Si-H bonds or metal hydrides. To allow for this possibility, a -? is placed at the front, indicating that the negative sign may or may not be there. To test this, an extra negative resonance was added to the data.
proton = ('1H NMR ((CD3)2SO, 400 MHz): δ 7.23 (br, 1H), 6.70 (br, 1H), '
          '2.00 (t, 2H, J = 7.3 Hz), 1.48 (tq, 2H, J = 7.3, 7.3 Hz), '
          '0.84 (t, 3H, J = 7.3 Hz), -0.54 (s, 1H).')
nmr_pattern = r'-?\d+\.\d{2}\s\(\w+,\s+\d+H[,\s+J\s?=\s?\d+.\d]*[\sHz]*\)'
re.findall(nmr_pattern, proton)
['7.23 (br, 1H)',
'6.70 (br, 1H)',
'2.00 (t, 2H, J = 7.3 Hz)',
'1.48 (tq, 2H, J = 7.3, 7.3 Hz)',
'0.84 (t, 3H, J = 7.3 Hz)',
'-0.54 (s, 1H)']
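Because the square brackets in the pattern above are character sets, the pattern will also accept some malformed entries. A stricter alternative, offered here only as a sketch, replaces the character sets with non-capturing groups, (?:...), so that the coupling information must appear in the expected order. The name strict_pattern and the malformed test entry are invented for this comparison.

```python
import re

proton = ('1H NMR ((CD3)2SO, 400 MHz): δ 7.23 (br, 1H), 6.70 (br, 1H), '
          '2.00 (t, 2H, J = 7.3 Hz), 1.48 (tq, 2H, J = 7.3, 7.3 Hz), '
          '0.84 (t, 3H, J = 7.3 Hz), -0.54 (s, 1H).')

# Pattern using character sets, as developed above
loose_pattern = r'-?\d+.\d{2}\s\(\w+,\s+\d+H[,\s+J\s?=\s?\d+.\d]*[\sHz]*\)'

# (?:...) groups a sequence without capturing it, so findall()
# still returns the whole match
strict_pattern = (r'-?\d+\.\d{2}\s+\(\w+,\s+\d+H'   # shift, splitting, integration
                  r'(?:,\s+J\s+=\s+\d+\.\d+'        # optional first coupling constant
                  r'(?:,\s+\d+\.\d+)*\s+Hz)?\)')    # further constants, then Hz

signals = re.findall(strict_pattern, proton)
print(signals)

# Hypothetical malformed entry to compare the two patterns
malformed = '2.00 (t, 2H,,==JJ)'
print(re.findall(loose_pattern, malformed))
print(re.findall(strict_pattern, malformed))
```

The trade-off is that the stricter pattern is longer and will reject entries whose formatting deviates even slightly, which may or may not be desirable depending on the data source.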
If someone wanted to extract values from the NMR signals, additional regular expressions could be written to iterate through the list and extract the desired information.
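As a rough sketch of what that extraction step might look like, the code below uses named groups to pull the chemical shift, splitting pattern, integration, and coupling constants out of a few signal strings. The pattern, the group names, and the dictionary layout are assumptions made for this example rather than an established convention.

```python
import re

# A few signal strings in the same format as the list above
signals = ['7.23 (br, 1H)',
           '2.00 (t, 2H, J = 7.3 Hz)',
           '1.48 (tq, 2H, J = 7.3, 7.3 Hz)']

signal_pattern = re.compile(
    r'(?P<shift>-?\d+\.\d{2})\s+\((?P<splitting>\w+),\s+(?P<protons>\d+)H'
    r'(?:,\s+J\s+=\s+(?P<couplings>[\d., ]+?)\s+Hz)?\)')

parsed = []
for signal in signals:
    m = signal_pattern.search(signal)
    # The couplings group is None when no J values are reported
    couplings = m.group('couplings')
    j_values = ([float(j) for j in re.findall(r'\d+\.\d+', couplings)]
                if couplings else [])
    parsed.append({'shift': float(m.group('shift')),
                   'splitting': m.group('splitting'),
                   'protons': int(m.group('protons')),
                   'J': j_values})

for entry in parsed:
    print(entry)
```

Converting the matched text to float and int at this stage means the values are ready for numerical work, such as sorting signals by chemical shift.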
Further Reading#
1. Documentation for re package. https://docs.python.org/3/library/re.html (free resource)
   Python re module documentation. Provides a good list of flags.
2. Regular Expressions HOWTO. https://docs.python.org/3/howto/regex.html (free resource)
   An official Python documentation page that provides an additional tutorial on regular expressions in Python.
3. Datacamp Regular Expressions Cheat Sheet. https://www.datacamp.com/cheat-sheet/regular-expresso (free resource)
   A one-page summary of key regular expression pattern characters good for hanging above a desk.