Appendix 4: Regular Expressions

There is a saying that synthetic chemists spend 10% of their time running reactions and 90% of their time purifying compounds. A similar saying could be that working with chemical data is 10% performing the intended calculations or analyses and 90% cleaning and organizing the data. While both are hyperbole, they underline the large amount of effort required for cleanup. This appendix is dedicated to a powerful method known as regular expressions, or regex for short, for cleaning and filtering text data, especially in situations requiring complex pattern matching.

Python string methods and indexing offer basic search and filtering functionality, but they tend to only allow for identifying simple and consistent patterns. For example, if you want a file name without the file extension (e.g., titration instead of titration.png), this can be solved using indexing and the string split() method because the file extension always follows the last period in the full file name. Likewise, data from a PDB file can be parsed with only string searches and slicing because PDB files follow very strict formatting rules based on labels and positions in rows. These two examples are not terribly complex to parse because the formats are consistent and were designed to be machine readable. However, not all data follow well-defined formatting rules, and there can be more variation that needs to be accounted for.

Regular expressions are not strictly a Python feature but rather a general pattern syntax that Python supports through the re module imported below. This module is part of the Python standard library, so it comes with every installation of Python.

import re
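For contrast, the file-name case mentioned above needs no regular expressions at all. The sketch below is only an illustration: the file name is the example from the text, rsplit() is used as a variant of split() that splits once from the right, and pathlib is shown as an alternative from the standard library.

```python
from pathlib import Path

filename = 'titration.png'

# split once at the last period and keep the part in front of it
stem_split = filename.rsplit('.', 1)[0]

# pathlib provides the same result through the stem attribute
stem_path = Path(filename).stem

print(stem_split, stem_path)
```

Both approaches work because the position of the extension is completely predictable, which is exactly the situation where regular expressions are unnecessary.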

Below, we first cover some key functions from the re module, then build more complex patterns, and finally work through a couple of examples drawn from chemical databases and the chemical literature.

Regular Expression Basics#

`re` Functions#

The re module provides a series of functions including those listed in Table 1 that allow the user to search for, split on, or substitute for patterns within a string. Additional functions can be found on the Python regular expressions page.

Table 1 Select re Functions

Regex Function	Description
`re.findall(pattern, str)`	Returns a list of strings that match the pattern
`re.finditer(pattern, str)`	Returns iterable of Match objects
`re.search(pattern, str)`	Returns the first pattern match as Match object
`re.split(pattern, str)`	Splits string at pattern matches
`re.sub(pattern, replacement, str)`	Replaces all occurrences with new string

The way these functions work is that the user provides a pattern to search for, which in the most basic scenarios can be a simple string, along with a string in which the function will search for the pattern. In the example below, we search a string of amine names for an aniline derivative by using 'aniline' as a pattern.

amines = '2-methylcyclohexylamine N-methylaniline 3-methylbutylamine N-methyl-3-pentanamine o-methylaniline'
pattern = 'aniline'

re.findall(pattern, amines)

['aniline', 'aniline']

This is not terribly informative because all it tells us is that 'aniline' appears twice. The re.finditer() function can be used instead to return an iterator of Match objects, which provide the location of each match and can be accessed using either a for loop or the list() function. We can see below that there are two matches along with the indices of those matches and the string that matches the pattern.

for x in re.finditer('aniline', amines):
    print(x)

<re.Match object; span=(32, 39), match='aniline'>
<re.Match object; span=(90, 97), match='aniline'>

list(re.finditer('aniline', amines))

[<re.Match object; span=(32, 39), match='aniline'>,
 <re.Match object; span=(90, 97), match='aniline'>]

To access the matched strings, use the group() method on the Match objects like below.

for x in re.finditer('aniline', amines):
    print(x.group())

aniline
aniline
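The re.search() function from Table 1 is not demonstrated above, so here is a minimal sketch of it using the same amines string. It returns only the first match as a Match object, or None when the pattern is absent, so it is worth checking for None before calling methods on the result. The 'pyridine' pattern is just an illustrative string that does not occur in the data.

```python
import re

amines = ('2-methylcyclohexylamine N-methylaniline 3-methylbutylamine '
          'N-methyl-3-pentanamine o-methylaniline')

# first occurrence only, returned as a Match object
first = re.search('aniline', amines)
first_text = first.group()   # the matched string
first_span = first.span()    # (start, end) indices of the match

# a pattern that is not present returns None instead of a Match object
missing = re.search('pyridine', amines)

print(first_text, first_span, missing)
```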

The re module can also be used to find and replace patterns such as replacing 'aniline' with 'anilinium' like below.

re.sub('aniline', 'anilinium', amines)

'2-methylcyclohexylamine N-methylanilinium 3-methylbutylamine N-methyl-3-pentanamine o-methylanilinium'

We could still probably have done the above tasks with string methods and indexing. The real power of regular expressions is their ability to describe more complex and flexible patterns, which is what we address below.

Symbols & Characters#

Let’s try something a little more complicated by searching for any instance of a methyl not located on a nitrogen. This means that the name should have a 'methyl' string with a hyphenated number before it. Regular expression syntax includes designators, listed in Table 2, for specific types of characters and positions. For example, \d indicates a digit. Many of these character designators also have a negated version using the capital letter, so \D, for example, signifies any character except a digit.

Table 2 Regex Character Designators

Character Type	Present	Not Present	Description/Examples
Any character	`.`		Any character except new line (i.e., \n)
Digits	\d	\D	Digits from 0-9
Letters/Word characters	\w	\W	Letters, digits, and underscore (e.g., abcABC123_)
Space	\s	\S	White space, tabs, and end-of-lines
Boundary between words	\b	\B	Space, start of line, or non-alphanumeric characters
Character at start of string	`^`		`^2` finds a 2 at the start of a string
Character at end of string	`$`		`1$` finds a 1 at the end of a string

Because we need any digit before the methyl, the pattern is \d-methyl. Now that we have patterns that use a backslash, you may see a warning about an invalid escape sequence (a SyntaxWarning in recent versions of Python) because the backslash is also the Python escape character in ordinary strings. To avoid this, either precede the backslash with another backslash, \\d-methyl, or make your string a raw string by preceding it with an r as is done below.

for x in re.finditer(r'\d-methyl', amines):
    print(x)

<re.Match object; span=(0, 8), match='2-methyl'>
<re.Match object; span=(40, 48), match='3-methyl'>
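As a quick check of the two options for handling the backslash, the sketch below compares the raw string with the double-backslash version. Both spell the same nine-character pattern, so they return the same matches.

```python
import re

amines = ('2-methylcyclohexylamine N-methylaniline 3-methylbutylamine '
          'N-methyl-3-pentanamine o-methylaniline')

raw = r'\d-methyl'       # raw string: the backslash is kept as typed
escaped = '\\d-methyl'   # ordinary string: the backslash is doubled

same_pattern = (raw == escaped)
raw_hits = re.findall(raw, amines)
escaped_hits = re.findall(escaped, amines)

print(same_pattern, raw_hits, escaped_hits)
```

Raw strings are generally easier to read, which is why they are used for the remaining patterns here.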

The \D designator could be used as a means of locating methyls that are not on an aliphatic carbon chain because they do not have numbers before them (at least in this example), as is done below. Now that our patterns are broader, the listing of matches is more informative because we can see that both N-methyl and o-methyl fit our pattern.

for x in re.finditer(r'\D-methyl', amines):
    print(x)

<re.Match object; span=(24, 32), match='N-methyl'>
<re.Match object; span=(59, 67), match='N-methyl'>
<re.Match object; span=(82, 90), match='o-methyl'>
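The position designators in Table 2 (^, $, and \b) do not match characters themselves but rather locations in the string. The short sketch below illustrates each one; the chemical names are made up for this illustration and are not from a dataset.

```python
import re

names = ['2-methylphenol', 'pentan-2-ol', '1-chlorobutane']

# ^ anchors the pattern to the start of the string
starts_with_2 = [n for n in names if re.search(r'^2', n)]

# $ anchors the pattern to the end of the string
ends_with_ol = [n for n in names if re.search(r'ol$', n)]

# \b requires a word boundary, so 'meth' inside 'dimethylamine' is skipped
meth_words = re.findall(r'\bmeth\w*', 'methanol dimethylamine methylamine')

print(starts_with_2, ends_with_ol, meth_words)
```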

As another example, below is a string that lists chemical identifiers including chemical names, CAS numbers, and a PubChem CID. The first thing we might want to do is split this up into a list where each item represents a different chemical.

chemicals = ('2-methylphenol   methanol N,N-diethylamine pentanol 281-23-2 ' 
              'ethyl benzoate glycerol 93-89-0 5793 ethanoic acid acetic anhydride')

Splitting on individual whitespace characters, as demonstrated below, does not work well because some chemicals (ethyl benzoate, ethanoic acid, and acetic anhydride) have a space in their name. There is also a complication where there are multiple spaces after '2-methylphenol', which produces empty strings in the output. These problems are solved below using additional tools from regular expressions.

re.split(r'\s', chemicals)

['2-methylphenol',
 '',
 '',
 'methanol',
 'N,N-diethylamine',
 'pentanol',
 '281-23-2',
 'ethyl',
 'benzoate',
 'glycerol',
 '93-89-0',
 '5793',
 'ethanoic',
 'acid',
 'acetic',
 'anhydride']

Quantifiers#

Let’s first deal with the multiple spaces using the quantifiers in Table 3. These quantifiers allow the user to specify how many times the preceding item appears in the pattern. For example, a+ searches for one or more a’s while \s{1,3} looks for one to three whitespace characters.

Table 3 Regex Quantifiers

Quantifier	Description	Example
*	0 or more	`\w*` for 0 or more word characters
?	0 or 1	`\s?` for a space that may or may not be present
+	1 or more	`\d+` for one or more digits
{}	Specific number, or range, of the preceding item	`\d{3}` for three digits, `\d{3,7}` for 3-7 digits (no space after the comma)

Below, we use \s+ to split our string of chemicals based on one or more spaces.

re.split(r'\s+', chemicals)

['2-methylphenol',
 'methanol',
 'N,N-diethylamine',
 'pentanol',
 '281-23-2',
 'ethyl',
 'benzoate',
 'glycerol',
 '93-89-0',
 '5793',
 'ethanoic',
 'acid',
 'acetic',
 'anhydride']
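The other quantifiers in Table 3 behave similarly. Below is a small sketch using made-up strings: a list of numerical identifiers and two spellings of the same element name.

```python
import re

ids = 'CID 5793 962 24 7'

# + : one or more digits, so every number is found
all_numbers = re.findall(r'\d+', ids)

# {3,4} : only numbers that are three or four digits long
# (\b keeps the pattern from matching part of a longer number)
mid_numbers = re.findall(r'\b\d{3,4}\b', ids)

# ? : the second i may or may not be present
spellings = re.findall(r'alumini?um', 'aluminum aluminium')

print(all_numbers, mid_numbers, spellings)
```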

Lookahead and Lookbehind#

Now let’s address the issue of spaces inside the names. IUPAC nomenclature for esters follows a pattern where the first word always ends in -yl, and carboxylic acids and anhydrides have -ic at the end of the first word (i.e., the carboxyl part). These trends can be used to identify spaces where the string should not be split, and we will carry this out using something known as a lookahead or lookbehind, shown in Table 4. These check for the presence or absence of something after (lookahead) or before (lookbehind) our main pattern without including it in the match. We specifically want spaces that do not have a yl or ic preceding them, so we need negative lookbehinds. We will add these one at a time. Below, (?<!yl) is added in front of \s+ to avoid splitting after yl.

Table 4 Lookahead and Lookbehind Syntax

	Lookahead ($\rightarrow$)	Lookbehind ($\leftarrow$)
Present	`pattern1(?=pattern2)`	`(?<=pattern2)pattern1`
Absent	`pattern1(?!pattern2)`	`(?<!pattern2)pattern1`

pattern = r'(?<!yl)\s+'
re.split(pattern, chemicals)

['2-methylphenol',
 'methanol',
 'N,N-diethylamine',
 'pentanol',
 '281-23-2',
 'ethyl benzoate',
 'glycerol',
 '93-89-0',
 '5793',
 'ethanoic',
 'acid',
 'acetic',
 'anhydride']

A lookbehind for the ic can also be added like below.

pattern = r'(?<!yl)(?<!ic)\s+'
re.split(pattern, chemicals)

['2-methylphenol',
 'methanol',
 'N,N-diethylamine',
 'pentanol',
 '281-23-2',
 'ethyl benzoate',
 'glycerol',
 '93-89-0',
 '5793',
 'ethanoic acid',
 'acetic anhydride']
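The examples above only use the negative lookbehind from Table 4. As a rough sketch of the positive versions, the code below pulls pieces out of an NMR header string without including the surrounding text in the match. The header string is an assumed example in the style of literature NMR listings.

```python
import re

nmr = '1H NMR (CDCl3, 400 MHz):'

# positive lookahead: digits that are followed by ' MHz'
freq = re.findall(r'\d+(?= MHz)', nmr)

# positive lookbehind: word characters that are preceded by '('
solvent = re.findall(r'(?<=\()\w+', nmr)

print(freq, solvent)
```

In both cases, the text inside the lookahead or lookbehind must be present but is left out of the returned match.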

Character Sets#

What happens if there are multiple symbols that need to be matched? By placing the symbols or characters in square brackets, [], any single character listed in the brackets counts as a match. For example, it is not uncommon to see numbers separated by either a period or dash (e.g., phone numbers), so [-.] can be used to indicate that either symbol is a fit. Character sets also allow for ranges of letters and numbers such as [a-e] for any of the first five lowercase letters. Because the dash denotes a range, it is a good idea to place a literal dash first in the set to ensure that it does not get interpreted as part of a range.

Below there is a string of toluene derivatives. If we want to filter for only para-substituted toluene derivatives, the name (at least in this example) should start with either p- or 4-. Both characters can be enclosed in square brackets like [4p]. The next challenge is figuring out how to deal with the rest of the name. We could try .+ to indicate one or more of any character, but this includes white space and therefore returns the rest of the string.

toluene = '3-chlorotoluene 4-methyltoluene p-bromotoluene o-methoxytoluene' 

re.findall(r'[4p]-.+', toluene)

['4-methyltoluene p-bromotoluene o-methoxytoluene']

To solve this, we can again use a character set, this time one that includes any letter, number, or dash like below. Placing the + after the square brackets means one or more characters from the set.

re.findall(r'[4p][-\d\w]+', toluene)

['4-methyltoluene', 'p-bromotoluene']
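The pattern above works for this particular string, but it accepts a 4 or p anywhere in a name. The sketch below shows one possible tightening, not the only one: a negative lookbehind (?<!\S) requires that the 4 or p is not preceded by a non-space character, and the dash is required right after it. The test string is made up to include names that trip up the looser pattern.

```python
import re

names = '3-propyltoluene 4-methyltoluene p-bromotoluene 2,4-dinitrotoluene'

# looser pattern from the text: also picks up the p in propyl and the 4 in 2,4-
loose = re.findall(r'[4p][-\d\w]+', names)

# stricter sketch: 4 or p must start a name and be followed by a dash
strict = re.findall(r'(?<!\S)[4p]-[-\w]+', names)

print(loose)
print(strict)
```

Because \w already includes digits, the \d inside the character set is optional and is left out of the stricter version.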

Groups#

Regular expressions in Python also support the extraction of information from specific segments of a string. Back in section 1.3.4, string formatting is introduced where the user can create a template string and insert different strings or values in the template string. As a refresher, below are examples where the compound and molecular weight are added using either the format() method or f-string formatting.

compound = 'ammonia'
MW = 17.03

'The molar mass of {} is {} g/mol.'.format(compound, MW)

'The molar mass of ammonia is 17.03 g/mol.'

compound = 'urea'
MW = 60.06

f'The molar mass of {compound} is {MW} g/mol.'

'The molar mass of urea is 60.06 g/mol.'

Groups in regular expressions are essentially the opposite of the above: instead of inserting values into a string, information is extracted from it. Groups are helpful for pulling specific pieces of data out of a larger pattern. Below are the beginnings of a couple of NMR data listings as they would appear in the chemical literature. If we are interested in the carrier frequency, we simply write out the regular expression as normal but then wrap the part we want to extract in parentheses, ().

1H NMR (CDCl3, 400 MHz):
13C NMR (C6D6, 100 MHz):

H_NMR = '1H NMR (CDCl3, 400 MHz):'
C_NMR = '13C NMR (C6D6, 100 MHz):'

carrier = r'1\d?[HC] NMR \([\d\w]+, (\d+) MHz\):'

re.findall(carrier, H_NMR)

['400']

re.findall(carrier, C_NMR)

['100']

Multiple groupings can be extracted by wrapping multiple sections in parentheses. The code below extracts both the solvent and the carrier frequency.

carrier = r'1\d?[HC] NMR \(([\d\w]+), (\d+) MHz\):'
re.findall(carrier, H_NMR)

[('CDCl3', '400')]
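Groups can also be read from a Match object, which is convenient when only one match is expected. The sketch below applies the same two-group pattern with re.search() and accesses the groups by number, where group 0 is the entire match.

```python
import re

C_NMR = '13C NMR (C6D6, 100 MHz):'
carrier = r'1\d?[HC] NMR \(([\d\w]+), (\d+) MHz\):'

match = re.search(carrier, C_NMR)

whole = match.group(0)            # the entire matched text
solvent = match.group(1)          # first set of parentheses
frequency = int(match.group(2))   # second set, converted to an integer
both = match.groups()             # all groups as a tuple of strings

print(solvent, frequency, both)
```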

Finding CAS Numbers#

Let’s now do some extra examples. When downloading data files from PubChem, the CAS number is mixed in with other names and numerical identifiers. There are two challenges here. The first is that CAS numbers vary in length. They are always three segments of numbers separated by hyphens, such as 58-08-2 or 2501-94-2, where the second segment is always two digits and the third is always a single digit. However, the first segment varies from 2-7 digits. The second major issue is that the CAS numbers are mixed in with other chemical identifiers such as CID numbers, common names, and IUPAC names. These other identifiers can include hyphens and numbers, so indexing and string searches cannot easily filter for CAS numbers without a long series of boolean conditions.

This is a relatively simple task for regular expressions. We indicate digits with the \d and use curly brackets to indicate the number of digits as demonstrated below.

re.findall(r'\d{2,7}-\d{2}-\d', chemicals)

['281-23-2', '93-89-0']
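One caveat is that this pattern can also match part of a longer run of digits. A possible refinement, sketched below, adds a negative lookbehind and a negative lookahead so that the match cannot be bordered by additional digits. The test string is invented for illustration and mixes two CAS numbers with two CAS-like fragments.

```python
import re

text = '58-08-2 123456789-12-3 2501-94-2 64-17-55'

# original pattern: also matches inside the two malformed entries
loose = re.findall(r'\d{2,7}-\d{2}-\d', text)

# sketch with lookarounds: no digit directly before or after the match
strict = re.findall(r'(?<!\d)\d{2,7}-\d{2}-\d(?!\d)', text)

print(loose)
print(strict)
```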

As a demonstration, PubChem allows for the free download of datasets which include a Synonym column. This column includes identifiers such as common and IUPAC names, CAS numbers, and PubChem CID numbers. The following code extracts the CAS numbers from one of these files. Two additional challenges arise from multiple CAS numbers being listed for a given compound or no CAS number being listed at all. When there are multiple CAS numbers, the most common one is stored, and if no CAS number is present, a NaN is stored in its place.

# get CAS number from Synonyms column
import pandas as pd
import numpy as np

solv = pd.read_csv('data/solvents.csv')
names = solv['Synonyms']

cas_pattern = r'\d{2,7}-\d{2}-\d'
cas = []
for row in names:
    cas_in_row = re.findall(cas_pattern, row)
    try:
        # get most common CAS number
        most_common_cas = max(set(cas_in_row), key=cas_in_row.count)
        cas.append(most_common_cas)
    except ValueError:
        # append NaN if no CAS number found
        cas.append(np.nan)

cas[:10]

['107-06-2',
 '120-82-1',
 '67-64-1',
 '71-43-2',
 '71-36-3',
 '111-65-9',
 '67-68-5',
 '64-17-5',
 '75-12-7',
 '67-56-1']
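The same idea can be tried without a data file. The sketch below is a variation, not the code above: it wraps the logic in a hypothetical helper function, most_common_cas(), uses collections.Counter to find the most frequent entry, and returns None rather than NaN when nothing is found. The rows are invented for illustration, and the second identifier in the first row is a placeholder rather than a real CAS number.

```python
import re
from collections import Counter

CAS_PATTERN = r'\d{2,7}-\d{2}-\d'

def most_common_cas(synonyms):
    """Return the most frequent CAS-like string in synonyms, or None."""
    hits = re.findall(CAS_PATTERN, synonyms)
    if not hits:
        return None
    # most_common(1) gives a list with one (value, count) pair
    return Counter(hits).most_common(1)[0][0]

# made-up rows standing in for the Synonyms column
rows = [
    '64-17-5; ethanol; 12345-67-8; ethyl alcohol; 64-17-5',
    'caffeine; 58-08-2; 1,3,7-trimethylxanthine',
    'unknown solvent; no identifiers',
]

cas = [most_common_cas(row) for row in rows]
print(cas)
```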

Parse NMR Data#

When data on an NMR spectrum is reported in the literature, it follows relatively strict formatting rules, but these rules are designed to be read by humans, not machines. To make things more complicated, there are numerous commas and spaces in the data making it difficult to use these as delimiters, so regular expressions are ideal for parsing this kind of data. Below is the $^1$H NMR data for butanamide in DMSO-$d_6$ at 22 $^\circ$C following American Chemical Society guidelines.

$^1$H NMR ((CD)$_3$SO, 400 MHz): $\delta$ 7.23 (br, 1H), 6.70 (br, 1H), 2.00 (t, 2H, J = 7.3 Hz), 1.48 (tq, 2H, J = 7.3, 7.3 Hz), 0.84 (t, 3H, J = 7.3 Hz).

As an example, we will extract the entries for each signal in the NMR spectrum. Each entry looks like 7.23 (br, 1H) or 0.84 (t, 3H, J = 7.3 Hz) where the decimal is the chemical shift and additional information on the signal is provided in the parentheses behind the chemical shift.

proton = ('1H NMR ((CD)3SO, 400 MHz): δ 7.23 (br, 1H), 6.70 (br, 1H), '
          '2.00 (t, 2H, J = 7.3 Hz), 1.48 (tq, 2H, J = 7.3, 7.3 Hz), '
          '0.84 (t, 3H, J = 7.3 Hz).')

Each signal starts with a number to two decimal places, but there may be one or two digits before the decimal point. Even though our example always has one digit before the decimal point, we want our code to be robust and versatile. The regular expression for this number is '\d{1,2}\.\d{2}', where the period is escaped with a backslash because an unescaped period matches any character.

nmr_pattern = r'\d{1,2}\.\d{2}'
re.findall(nmr_pattern, proton)

['7.23', '6.70', '2.00', '1.48', '0.84']
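A period in a regular expression matches any character unless it is escaped as \. with a backslash. The difference does not show up with the NMR string above, but it can with other text. The sketch below uses a made-up string containing a five-digit number to show what the unescaped version picks up.

```python
import re

text = 'T = 298 K, 12345, shift 7.23'

# unescaped period: '12345' fits because the 3 counts as "any character"
loose = re.findall(r'\d{1,2}.\d{2}', text)

# escaped period: only a literal decimal point is accepted
strict = re.findall(r'\d{1,2}\.\d{2}', text)

print(loose)
print(strict)
```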

Next, the information about the signal is stored in parentheses separated from the chemical shift by a space. We will use '\s+' just in case someone accidentally used multiple spaces. Because parentheses are special characters in regular expressions, we need to precede the parenthesis with a backslash to indicate that we mean a literal parenthesis character.

nmr_pattern = r'\d+\.\d{2}\s+\('
re.findall(nmr_pattern, proton)

['7.23 (', '6.70 (', '2.00 (', '1.48 (', '0.84 (']

Inside the parentheses are the following:

Splitting pattern as one or more letters, '\w+'
Integration as an integer followed by an H, so '\d+H'
Coupling information starting with J = followed by a number to one decimal place, so 'J\s+=\s+\d+\.\d\s+Hz'

nmr_pattern = r'\d+\.\d{2}\s+\(\w+,\s+\d+H,\s+J\s+=\s+\d+\.\d\s+Hz\)'

re.findall(nmr_pattern, proton)

['2.00 (t, 2H, J = 7.3 Hz)', '0.84 (t, 3H, J = 7.3 Hz)']

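The current pattern misses the signals that do not include coupling information or that have multiple coupling constants. This is where character sets and quantifiers are helpful. Square brackets define a set of allowed characters, so placing the characters that make up , J = 7.3 (a comma, white space, J, =, digits, and a period) in square brackets followed by an asterisk, like below, matches any run of zero or more of these characters. Note that inside square brackets, symbols such as + and ? are treated as literal characters rather than quantifiers, so this set is more permissive than the earlier pattern.

[,\s+J\s?=\s?\d+.\d]*

nmr_pattern = r'\d+\.\d{2}\s\(\w+,\s+\d+H[,\s+J\s?=\s?\d+.\d]*\sHz\)'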

re.findall(nmr_pattern, proton)

['2.00 (t, 2H, J = 7.3 Hz)',
 '1.48 (tq, 2H, J = 7.3, 7.3 Hz)',
 '0.84 (t, 3H, J = 7.3 Hz)']

Now the regular expression finds all signals that have coupling constants but is still missing the two without coupling constants. This is because the pattern still requires a ' Hz'. To make the unit optional, the characters that make up ' Hz' are also placed in a character set followed by an *, like below, which allows zero or more of these characters before the closing parenthesis.

[\sHz]*

nmr_pattern = r'\d+\.\d{2}\s\(\w+,\s+\d+H[,\s+J\s?=\s?\d+.\d]*[\sHz]*\)'

re.findall(nmr_pattern, proton)

['7.23 (br, 1H)',
 '6.70 (br, 1H)',
 '2.00 (t, 2H, J = 7.3 Hz)',
 '1.48 (tq, 2H, J = 7.3, 7.3 Hz)',
 '0.84 (t, 3H, J = 7.3 Hz)']

It looks like the code finds all the signals. One more addition that would make the code more robust is to allow for the possibility of a negative chemical shift. While proton chemical shifts are typically positive, negative values do show up in situations such as silanes with Si-H bonds or metal hydrides. To allow for this possibility, a -? is placed at the front, indicating that the negative sign may or may not be there. To test this, an extra negative resonance was added to the data below.

proton = ('1H NMR ((CD)3SO, 400 MHz): δ 7.23 (br, 1H), 6.70 (br, 1H), '
          '2.00 (t, 2H, J = 7.3 Hz), 1.48 (tq, 2H, J = 7.3, 7.3 Hz), '
          '0.84 (t, 3H, J = 7.3 Hz), -0.54 (s, 1H).')

nmr_pattern = r'-?\d+\.\d{2}\s\(\w+,\s+\d+H[,\s+J\s?=\s?\d+.\d]*[\sHz]*\)'
re.findall(nmr_pattern, proton)

['7.23 (br, 1H)',
 '6.70 (br, 1H)',
 '2.00 (t, 2H, J = 7.3 Hz)',
 '1.48 (tq, 2H, J = 7.3, 7.3 Hz)',
 '0.84 (t, 3H, J = 7.3 Hz)',
 '-0.54 (s, 1H)']
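The character sets in the pattern above are permissive in that they accept any run of the listed characters in any order. If a stricter match is wanted, one option is sketched below. It relies on a non-capturing group, (?:...), which is regular expression syntax not introduced earlier in this appendix: it groups part of a pattern so that a quantifier applies to the whole piece, but unlike plain parentheses it does not change what re.findall() returns. The pattern is an assumption about how the entries are laid out, not a general NMR parser.

```python
import re

proton = ('1H NMR ((CD)3SO, 400 MHz): δ 7.23 (br, 1H), 6.70 (br, 1H), '
          '2.00 (t, 2H, J = 7.3 Hz), 1.48 (tq, 2H, J = 7.3, 7.3 Hz), '
          '0.84 (t, 3H, J = 7.3 Hz), -0.54 (s, 1H).')

strict_pattern = (
    r'-?\d+\.\d{2}\s+'     # chemical shift, optionally negative
    r'\(\w+,\s+\d+H'       # splitting pattern and integration
    # optional coupling block: J = value, more values, then Hz
    r'(?:,\s+J\s*=\s*\d+\.\d+(?:,\s*\d+\.\d+)*\s+Hz)?'
    r'\)'                  # closing parenthesis
)

signals = re.findall(strict_pattern, proton)
for signal in signals:
    print(signal)
```

The ? after the outer group makes the entire coupling block optional, and the inner group followed by * allows any number of additional coupling constants.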

If someone wanted to extract values from the NMR signals, additional regular expressions could be written to iterate through the list and extract the desired information.
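As a minimal sketch of that idea, the code below iterates through a list of signals and applies a few small regular expressions to each one. The parse_signal() function and the dictionary keys are hypothetical names chosen for this illustration, and the list of signals is typed in directly so the example stands on its own.

```python
import re

signals = ['7.23 (br, 1H)',
           '6.70 (br, 1H)',
           '2.00 (t, 2H, J = 7.3 Hz)',
           '1.48 (tq, 2H, J = 7.3, 7.3 Hz)',
           '0.84 (t, 3H, J = 7.3 Hz)',
           '-0.54 (s, 1H)']

def parse_signal(signal):
    """Extract shift, multiplicity, proton count, and J values from one entry."""
    shift = float(re.match(r'-?\d+\.\d+', signal).group())
    multiplicity = re.search(r'\((\w+),', signal).group(1)
    protons = int(re.search(r'(\d+)H', signal).group(1))
    # coupling constants are optional, so check for None first
    j_text = re.search(r'J\s*=\s*(.+?)\s*Hz', signal)
    j_values = [float(j) for j in re.findall(r'\d+\.\d+', j_text.group(1))] if j_text else []
    return {'shift': shift, 'multiplicity': multiplicity,
            'protons': protons, 'J': j_values}

parsed = [parse_signal(s) for s in signals]
for entry in parsed:
    print(entry)
```

A list of dictionaries like this can be passed directly to pandas.DataFrame() if a table is preferred.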
