Chapter 15: Cheminformatics with RDKit

Chapter 15: Cheminformatics with RDKit#

Cheminformatics can be thought of as the intersection of data science, computer science, and chemistry as a means of better understanding and solving chemical problems. This chapter introduces a popular and versatile Python cheminformatics library known as RDKit which is useful for tasks such as:

Visualizing molecules
Reading SMILES or InChI molecular representations
Quantifying structural features in molecules such as number of rings or hydrogen bond donors
Generating all possible stereoisomers of a molecular structure
Filtering molecules based on structural features

This is a popular library for those in chemical computing research with examples of its use being relatively easy to find in the chemical literature. As of this writing, RDKit can be installed with either conda or pip (see section 0.2.1 and link below). If you are using Google Colab, you will need to install RDKit at the top of your notebook (see section 0.2.2) as it is not installed by default in Colab.

Installing RDKit

This chapter assumes the following imports from RDKit.

from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, PandasTools
from rdkit.Chem.Draw import SimilarityMaps
from rdkit.Chem import rdFingerprintGenerator

from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.ipython_useSVG = True

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

RDKit is composed of a number of modules including, but not limited to, the following.

Table 1 Key Modules and Submodules in the RDKit Library

Module/Submodule	Description
`Chem`	General purpose tools for chemistry. The RDKit website describes it as “A module for molecules and stuff”.
`Chem.AllChem`	Submodule containing more specialized or less often used features; needs to be imported separately from `Chem`
`Chem.Descriptors`	Submodule for quantifying molecular features
`Chem.Draw`	Submodule for visualizing molecules
`ML`	Machine learning tools

The Chem and ML modules are the major modules in RDKit, but for this chapter, we will only be focusing on the Chem module which has already been imported above.

15.1 Loading Molecular Representations into RDKit#

There are many ways to depict molecular structures on paper such as Lewis structures, line-angle structural formulas, and condensed notation. When representing molecules for a computer, machine-readable methods such as Simplified Molecular-Input Line-Entry System (SMILES), the International Chemical Identifier (InChI), or mol files are preferred. For example, the SMILES and InChI representations for benzene are listed below.

SMILES: c1ccccc1

InChI: 1S/C6H6/c1-2-4-6-5-3-1/h1-6H

These are not the most human-readable formats, but computer software such as RDKit is quite good at dealing with them. We will not get into the structure and rules for interpreting these representations here because it is not really necessary; reading and writing them is RDKit’s job. You can obtain these representations of a molecular structure from a variety of sources such as generating them from chemical drawing software (e.g., ChemDraw or ChemDoodle), searching NIST Chemical Webbook or NIH PubChem, and many other sources. In this chapter, we will mainly focus on SMILES representations, but working with the InChI and MOL file formats is analogous and may be used from time-to-time herein.

The functions below can read and write molecular structures from a variety of formats including SMILES, InChI, and MOL files. When reading these molecular structures, a Molecule object (RDKit-specific class of object) is generated.

Table 2 Functions for Loading Molecular Structures

Function	Description
`Chem.MolFromSmiles()`	Generates a Molecule object from SMILES representation
`Chem.MolToSmiles()`	Generates SMILES representation from a Molecule object
`Chem.inchi.MolFromInchi()`	Generates a Molecule object from InChI representation
`Chem.inchi.MolToInchi()`	Generates InChI representation from a Molecule object
`MolFromMolFile()`	Generates a Molecule object from a MOL file

As an example, we will load the structure of aspirin (acetylsalicylic acid) using the Chem.MolFromSmiles() function from the Chem module.

aspirin = Chem.MolFromSmiles('O=C(C)Oc1ccccc1C(=O)O')
aspirin

If we check the object type, we find that it is a Molecule (rdchem.Mol) RDKit object.

type(aspirin)

rdkit.Chem.rdchem.Mol

RDKit can generate other molecular representations such as InChI from the Molecule object as demonstrated below.

Chem.inchi.MolToInchi(aspirin)

'InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)'

15.2 Visualizing Chemical Structures#

In the above examples, RDKit provided an image of the molecule simply by Jupyter running the Molecule object. By default, this generates a rather small and low resolution image. To generate a sharper image, like above, running the following code at the top of a notebook changes the settings to produce SVG (Scalable Vector Graphic) images which are a vector graphic format.

from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.ipython_useSVG = True

15.2.1 Single Chemical Structures#

However, simply running the Molecule object does not provide easy control over the image. In this section, we will generate images that can be saved along with visualizing grids of molecules and other visual representations.

To view the molecule, we can use the Chem.Draw.MolToImage(Mol) function which takes one required positional arguments of the Molecule object (Mol). Optional keyword arguments can be used to set other parameters such as the image size (size=) in pixels.

Chem.Draw.MolToImage(aspirin, size=(400,400))

../../_images/0117c1512fcf1eb086b9845cd56c4cd0d8681d30a10084641b9494e21baf3e6d.png

If we want to save the image to a file, this is accomplished using the Chem.Draw.MolToFile() function which requires two pieces of information - the Molecule object and the name of the new file as a string.

Chem.Draw.MolToFile(mol_object, 'file_name.png', size=(width, height), imageType='png')

Other optional parameters include the size= which is a tuple that takes the width and height, respectively, in pixels, and the imageType= accepts a string to designate the file format (‘png’ or ‘svg’).

Chem.Draw.MolToFile(aspirin, 'aspirin.svg',
                    size=(500,500),
                    imageType='svg')

Molecules can also be displayed in plots created by matplotlib. Below is an example of the trans-cinnamic acid structure being displayed on top of the IR spectrum of the compound.

cinn_acid = Chem.MolFromSmiles(r'O=C(O)\C=C\c1ccccc1')
image = Chem.Draw.MolToImage(cinn_acid)

IR = np.genfromtxt('data/cinnamic_acid.CSV', delimiter=',')

plt.figure(figsize=(8,4))
plt.plot(IR[:-1,0], IR[:-1,1])
plt.gca().invert_xaxis()
plt.xlabel('Wavenumbers, cm$^{-1}$')
plt.ylabel('%, Transmittance')

ax = plt.axes([0.15, 0.2, 0.48, 0.35], frameon=False)
ax.axis('off')
ax.imshow(image);

../../_images/71fcd784b6a838465781e4b75c992a6e38b78db21b8fe8422e28ef8db4d4e884.svg

15.2.2 Grids of Chemical Structures#

Whenever we are dealing with collections of molecules, it may be helpful to generate an image that includes multiple molecular structures known as a grid. As an example, we will load the SMILES strings of the twenty common amino acids from a text file using pandas and then load the Molecule objects for each structure into a single list called AminoAcids.

df = pd.read_csv('data/amino_acid_SMILES.txt', skiprows=2)
df

	name	SMILES
0	alanine	C[C@@H](C(=O)[O-])[NH3+]
1	arginine	[NH3+][C@@H](CCCNC(=[NH2+])N)C(=O)[O-]
2	asparagine	O=C(N)C[C@H]([NH3+])C(=O)[O-]
3	aspartate	C([C@@H](C(=O)[O-])[NH3+])C(=O)[O-]
4	cysteine	C([C@@H](C(=O)[O-])[NH3+])S
5	glutamine	[NH3+][C@@H](CCC(=O)N)C([O-])=O
6	glutamate	C(CC(=O)[O-])[C@@H](C(=O)[O-])[NH3+]
7	glycine	C(C(=O)[O-])[NH3+]
8	histidine	O=C([C@H](CC1=CNC=N1)[NH3+])[O-]
9	isoleucine	CC[C@H](C)[C@@H](C(=O)[O-])[NH3+]
10	leucine	CC(C)C[C@@H](C(=O)[O-])[NH3+]
11	lysine	C(CC[NH3+])C[C@@H](C(=O)[O-])[NH3+]
12	methionine	CSCC[C@H]([NH3+])C(=O)[O-]
13	phenylalanine	[NH3+][C@@H](CC1=CC=CC=C1)C([O-])=O
14	proline	[O-]C(=O)[C@H](CCC2)[NH2+]2
15	serine	C([C@@H](C(=O)[O-])[NH3+])O
16	threonine	C[C@H]([C@@H](C(=O)[O-])[NH3+])O
17	tryptophan	c1[nH]c2ccccc2c1C[C@H]([NH3+])C(=O)[O-]
18	tyrosine	[NH3+][C@@H](Cc1ccc(O)cc1)C([O-])=O
19	valine	CC(C)[C@@H](C(=O)[O-])[NH3+]

AminoAcids = [Chem.MolFromSmiles(SMILES) for SMILES in df['SMILES']]
AminoAcids

[<rdkit.Chem.rdchem.Mol at 0x112922730>,
 <rdkit.Chem.rdchem.Mol at 0x112920ac0>,
 <rdkit.Chem.rdchem.Mol at 0x1129225e0>,
 <rdkit.Chem.rdchem.Mol at 0x112922570>,
 <rdkit.Chem.rdchem.Mol at 0x112922500>,
 <rdkit.Chem.rdchem.Mol at 0x112921cb0>,
 <rdkit.Chem.rdchem.Mol at 0x112922490>,
 <rdkit.Chem.rdchem.Mol at 0x1129223b0>,
 <rdkit.Chem.rdchem.Mol at 0x1129215b0>,
 <rdkit.Chem.rdchem.Mol at 0x112921690>,
 <rdkit.Chem.rdchem.Mol at 0x112921700>,
 <rdkit.Chem.rdchem.Mol at 0x1129213f0>,
 <rdkit.Chem.rdchem.Mol at 0x112921310>,
 <rdkit.Chem.rdchem.Mol at 0x112921230>,
 <rdkit.Chem.rdchem.Mol at 0x112921070>,
 <rdkit.Chem.rdchem.Mol at 0x112921150>,
 <rdkit.Chem.rdchem.Mol at 0x112920f20>,
 <rdkit.Chem.rdchem.Mol at 0x112920f90>,
 <rdkit.Chem.rdchem.Mol at 0x112920d60>,
 <rdkit.Chem.rdchem.Mol at 0x112920e40>]

To generate the grid, we will use the MolsToGridImage() function from the Chem.Draw submodule. This function requires one positional argument of an array-like object (e.g., list, tuple, ndarray, etc.) containing the Molecule objects. Other optional keyword arguments include the number of molecules per row (molsPerRow=), the pixel dimensions of each molecule (subImgSize=), labels below each molecule (legends=), and the ability to make images SVG format (usesSVG=). The image dimensions only matter if using a raster image format and requires a tuple with the width and height in that order. The legends= argument requires an array-like object with the labels in the same order as the object containing the Molecule objects.

Chem.Draw.MolsToGridImage(AminoAcids, 
                          molsPerRow=4,
                          subImgSize=(200,200),
                          legends=list(df['name']),
                          useSVG=True)    

../../_images/60cac339e086d89f821e29c339d35e23ed3e1491e00d2b52cb4fd1376269a06e.svg

15.2.3 Molecules in Pandas DataFrames#

RDKit also supports visualizing molecules inside pandas DataFrames using the AddMoleculeColumnToFrame() function from the PandasTools submodule (rdkit.Chem.PandasTools). This function accepts a DataFrame with a column of SMILES (smilesCol=) and adds a new column of Molecule objects. The molCol= parameter will be the header for the new column.

ligands = pd.read_csv('data/ligands.csv')
ligands

	ligand	smiles
0	dppe	c1ccc(P(CCP(c2ccccc2)c2ccccc2)c2ccccc2)cc1
1	acac	CC(=O)CC(C)=O
2	acetonitrile	CC#N
3	dcpe	C1CCC(P(CCP(C2CCCCC2)C2CCCCC2)C2CCCCC2)CC1
4	HMDS	C[Si](C)(C)N[Si](C)(C)C
5	PPh3	c1ccc(P(c2ccccc2)c2ccccc2)cc1

PandasTools.AddMoleculeColumnToFrame(ligands, 
                                     smilesCol='smiles',
                                     molCol='molecules')
ligands

	ligand	smiles
0	dppe	c1ccc(P(CCP(c2ccccc2)c2ccccc2)c2ccccc2)cc1
1	acac	CC(=O)CC(C)=O
2	acetonitrile	CC#N
3	dcpe	C1CCC(P(CCP(C2CCCCC2)C2CCCCC2)C2CCCCC2)CC1
4	HMDS	C[Si](C)(C)N[Si](C)(C)C
5	PPh3	c1ccc(P(c2ccccc2)c2ccccc2)cc1

The DataFrame can also be exported as an Excel spreadsheet complete with images using the SaveXlsxFromFrame(). This function accepts accepts the DataFrame name, output file name, name of the Molecule object column, and the image size as arguments.

PandasTools.SaveXlsxFromFrame(ligands, 'ligands.xlsx', molCol='molecules', size=(300, 300))

15.3 Stereochemistry#

RDKit can assign the stereochemistry of stereocenters, including chiral centers (R vs. S) and alkene stereocenters (E vs. Z), determine the number of isomers possible, and even generate all possible isomers. Whether or not any stereochemistry is designated in the SMILES representation or Molecule object is an important detail in carrying out the above tasks. Even though a molecule may contain a chiral center or an alkene carbon, the stereochemistry around that atom may be ambiguous.

The SMILES representation shows stereochemistry around a tetrahedral carbon with either @ or @@ and around an alkene with \ and / symbols. If the SMILES representation does not include these symbols, the stereochemistry is not indicated.

Table 3 SMILES Stereochemical Designations

Designation	Alkene	Chiral sp\(^3\) Atom
No isomer designation	C(=CC)C	CCC(C)O
First isomer	C/C=C\C	CC[C@H](C)O
Second isomer	C/C=C/C	CC[C@@H](C)O

15.3.1 Assigning Stereochemistry#

The first task is to assign the absolute stereochemistry of a molecule. As an example, below we have a single isomer of pent-3-en-2-ol which has a single chiral center and an alkene that could potentially be either E or Z. Let’s have RDKit tell us the absolute configuration (i.e., R or S) of the tetrahedral chiral center and if the alkene is E or Z. First, we will load the SMILES representation of this compound, O[C@@H](C)/C=C/C, which contains both @ and / symbols, so we know the stereochemistry is assigned in this representation. When we visualize it below, we can see a wedge for the methyl on the chiral center instead of a regular line, for example.

pentenol = Chem.MolFromSmiles('O[C@@H](C)/C=C/C')
pentenol

To obtain the absolute configuration (i.e.,R or S), we can use the Chem.FindMolChiralCenters() function which returns the absolute configuration and an index indication which atom has that configuration.

Chem.FindMolChiralCenters(pentenol)

[(1, 'S')]

Our pent-3-en-2-ol isomer above has an S stereocenter. Being that pent-3-en-2-ol has only one chiral center, it is not difficult to determine which atom has the stereochemistry, but if there are multiple chiral centers, it can get confusing. To see the atom indices and stereochemistry labels on the molecule, this can be enabled (or disabled using False) by the following code.

IPythonConsole.drawOptions.addAtomIndices = True
IPythonConsole.drawOptions.addStereoAnnotation = True

pentenol

To obtain the stereochemistry of double bonds, we can iteration through the bonds and obtain the stereochemistry using the GetStereo() bond method as shown below. There are three possible outputs list below.

Table 4 Bond Stereochemical Designations in RDKit

Ouput	Description
STEREONONE	No stereochemistry (often not a double bond)
STEREOE	E stereochemistry
STEREOZ	Z stereochemistry

for bond in pentenol.GetBonds():
    print(bond.GetStereo())

STEREONONE
STEREONONE
STEREONONE
STEREOE
STEREONONE

In the above example, there are four bonds with no stereochemistry due to being single bonds, and there is one E bond corresponding to the alkene. If there are multiple double bonds, it can be difficult to determine which bond has which stereochemistry. In this case, either use the image like shown above or use additional bond methods (see section 15.6) to obtain more information about the bonds.

As another example, below we will look at the bonds in 9-cis-retinoic acid where we can see examples of all three possible bond stereochemical assignments.

retinoic = Chem.MolFromSmiles(r'O=C(O)\C=C(\C=C\C=C(/C=C/C1=C(/CCCC1(C)C)C)C)C')

for bond in retinoic.GetBonds():
    print(bond.GetStereo())

STEREONONE
STEREONONE
STEREONONE
STEREOE
STEREONONE
STEREOE
STEREONONE
STEREOZ
STEREONONE
STEREOE
STEREONONE
STEREONONE
STEREONONE
STEREONONE
STEREONONE
STEREONONE
STEREONONE
STEREONONE
STEREONONE
STEREONONE
STEREONONE
STEREONONE

retinoic

15.3.2 Counting and Generating Isomers#

Another interesting feature of RDKit is the ability to determine the number of stereoisomers possible for a given structure and to generate the different isomers. In both these applications, RDKit treats any explicitly assigned stereocenter as fixed and will not allow it to be changed. For example, below we will again look at (2S, 3E)-pent-3-en-2-ol. Because the structure already designates this as the (2S, 3E) isomer, the stereochemistry of the chiral center and alkene cannot be changed. As a result, when using the GetStereoisomerCount() method from the EnumerateStereoisomers module, it returns a 1 indicating that there is only one stereoisomer possible with these constraints.

Chem.EnumerateStereoisomers.GetStereoisomerCount(pentenol)

In contrast, if we provide the GetStereoisomerCount() function hexan-2-ol without any stereochemistry designated (see above), it returns 2 as the number of stereoisomers. This is because (S)-hexan-2-ol and (R)-hexan-2-ol are both possible isomers.

hexanol = Chem.MolFromSmiles('OC(C)CCCC')
Chem.EnumerateStereoisomers.GetStereoisomerCount(hexanol)

The EnumerateStereoisomers module can also generate the different possible isomers, and again, it will only generate isomers by changing stereochemical features that do not already have assigned configurations. If we again look at hexan-2-ol, it generates two Molecule objects which are the two isomers.

isomers = list(Chem.EnumerateStereoisomers.EnumerateStereoisomers(hexanol))
isomers

[<rdkit.Chem.rdchem.Mol at 0x112b6c7c0>,
 <rdkit.Chem.rdchem.Mol at 0x112b6c9f0>]

IPythonConsole.drawOptions.addAtomIndices = False
IPythonConsole.drawOptions.addStereoAnnotation = True

Chem.Draw.MolsToGridImage(isomers)

../../_images/53b6dd508d331ccbfaa898f26e1f679ca6913d4bed8c8c6c654cd13e704bad9d.svg

As a more challenging example, arabinos has three chiral centers allowing for up to eight possible stereoisomers. Because there is a lack of symmetry between the top and bottom (i.e., -CHO and -CH\(_2\)OH are different), no meso compound can exist, so it will have the full eight stereoisomers. The real challenge lies in drawing out all eight… unless we make RDKit do the work for us like below.

arabinos = Chem.MolFromSmiles('O=CC(O)C(O)C(O)CO')
isomers = list(Chem.EnumerateStereoisomers.EnumerateStereoisomers(arabinos))
Chem.Draw.MolsToGridImage(isomers, useSVG=True)

../../_images/27267a536df9d17e2b5a9d5d6f4776d533cb398dc3caf49df98c8fa653e00539.svg

While the examples above mainly focus on stereoisomers from tetrahedral chiral centers, this also works with E/Z stereoisomers. One limitation with RDKit is that it currently struggles to recognize non-alkene cis/trans stereoisomers when there are stereocenters that are not chiral centers involved such as rings (see GitHub issue 5597). For example, with 1,2,3-trimethylcyclopropane, it only believes there are eight stereoisomers when in fact there are two.

TriCProp = Chem.MolFromSmiles('CC1C(C1C)C ')
Chem.EnumerateStereoisomers.GetStereoisomerCount(TriCProp)

In contrast, it has no difficulty identifying the three isomers for 1,2-dimethylcyclopropane because both methylated carbons are chiral centers.

DiCProp = Chem.MolFromSmiles('CC1CC1C')
CPropisomers = list(Chem.EnumerateStereoisomers.EnumerateStereoisomers(DiCProp))
Chem.Draw.MolsToGridImage(CPropisomers)

../../_images/7162b4df8e670677221b6369062885825757c11dfdb72e1f0eb0420ef268b074.svg

15.4 `Chem.Descriptor` Module#

RDKit can be use to determine a number of key physical properties of molecules known as descriptors using the Chem.Descriptor module. These can be useful for generating features for a large number of molecules for machine learning or understanding structural trends in a body of chemical compounds.

15.4.1 Molecular Features#

There are numerous descriptor functions available which are callable using Chem.Descriptors.method() where method() is the name of a descriptor function that accepts an RDKit Molecule object and returns a numerical value. Below are a few examples of descriptor functions, with a more complete list is available on the RDKit website.

Table 5 Examples of Molecular Descriptors

Function	Description
`MolWt`	Molecular weight, assumes natural isotopic distribution
`HeavyAtomCount()`	Number of non-hydrogen atoms
`NOCount()`	Number of N and O atoms
`NumAliphaticRings()`	Number of aliphatic rings
`NumAromaticRings()`	Number of aromatic rings
`NumSaturatedRings()`	Number of saturated rings
`NumHAcceptors()`	Number of hydrogen bond acceptors
`NumHDonors()`	Number of hydrogen bond donors
`NumRadicalElectrons()`	Number of radical electrons
`NumValenceElectrons()`	Number of valence electrons
`NumRotatableBonds()`	Number of rotatable bonds
`RingCount()`	Number of rings

Below we will look at a few of these descriptor functions demonstrated on the compound paclitaxel. Specifically, we will generate the molecular weight, number of rings, number of aromatic rings, number of valence electrons, and number of rotatable bonds.

ptx = Chem.MolFromSmiles('CC1=C2[C@@]([C@]([C@H]([C@@H]3[C@]4([C@H](OC4)C[C@@H]'\
                         '([C@]3(C(=O)[C@@H]2OC(=O)C)C)O)OC(=O)C)OC(=O)c5ccccc5)'\
                         '(C[C@@H]1OC(=O)[C@H](O)[C@@H](NC(=O)c6ccccc6)c7ccccc7)O)(C)C')

AllChem.Compute2DCoords(ptx) # makes molecule display more clearly
Chem.Draw.MolToImage(ptx, size=(500,500))

../../_images/febac7bbf23fc4d473dc45a5bbaa08a16a59af8e91d8b084e0312f181c8d395b.png

# molecular weight
Chem.Descriptors.MolWt(ptx)

853.9180000000003

# number of rings
Chem.Descriptors.RingCount(ptx)

# number of aromatic rings
Chem.Descriptors.NumAromaticRings(ptx)

# number of valence electrons
Chem.Descriptors.NumValenceElectrons(ptx)

# number of rotable bonds
Chem.Descriptors.NumRotatableBonds(ptx)

15.4.2 Quantifying Functional Groups#

Among the descriptor methods is a long list of functions that look like fr_group() where group is the name or abbreviation of a chemical functional group. These functions return an integer quantification of that functional group present in the molecule. A table with a few examples is provided below, but there are over 80 of these functions available in RDKit.

Table 6 Examples of Methods to Quantify Functional Groups

Function	Functional Group
`fr_Al_OH()`	Aliphatic alcohols
`fr_aldehyde()`	Aldehydes
`fr_amide()`	Amide
`fr_C_C()`	Carbonyl oxygens
`fr_guanido()`	Guanidine
`fr_NH0()`	Amines with 0 H’s (i.e., tertiary)
`fr_phenol()`	Phenol
`fr_phos_ester()`	Phosphoric ester
`fr_SH()`	Thiol

Tip

To see a complete list of functional groups, type Chem.Descriptors.fr_ into a code cell, press Tab for autocomplete, and see the long list of options. If the functional group is not obvious from the name, place the computer cursor inside the function’s parentheses and press Shift + Tab to see the Docstring description of what functional group it quantifies.

We will again look at paclitaxel to see how many benzene rings, aliphatic alcohols, aromatic carboxyls, and esters are present in the structure.

# number of benzene rings
Chem.Descriptors.fr_benzene(ptx) 

# number of alphatic alcohols
Chem.Descriptors.fr_Al_OH(ptx)

# number of aromatic carboxyls
Chem.Descriptors.fr_Ar_COO(ptx)

# number of esters
Chem.Descriptors.fr_ester(ptx)

15.5 Searching Molecules for Structural Patterns#

Molecules can be searched for key structural features using the HasSubstructMatch() method which returns True or False depending if a structural pattern exists in a molecule or not. This function requires two RDKit Molecule objects - one Molecule object (molecule) is checked for the presence of the other Molecule object structure (substructure) as shown below. There are optional keyword parameters such as useChirality= which allows for chirality to be factored into whether there is a match or not. The default setting is useChirality=False.

molecule.HasSubstructMatch(substructure, useChirality=False)

As an example, we will look for the presence of a carbonyl (i.e., C=O bond) in acetone and pent-3-en-2-ol below, so the substructure that we will search for is a C=O.

acetone = Chem.MolFromSmiles('CC(=O)C')
acetone

substructure = Chem.MolFromSmiles('C=O')
acetone.HasSubstructMatch(substructure)

True

pentenol.HasSubstructMatch(substructure)

False

Not very surprisingly, the HasSubstructMatch() function returns True for acetone and False for the alcohol because the latter has a single CO bond, not a double. If we change our substructure to CO, we are now searching for a carbon-oxygen single bond (see Table 7), so acetone returns False while pent-3-en-2-ol returns True.

Table 7 SMILES Bond Order Notation

SMILES Bond	Bond Type
- (or nothing)	Single
=	Double
#	Triple
:	Aromatic

substructure = Chem.MolFromSmiles('CO')
substructure

acetone.HasSubstructMatch(substructure)

False

pentenol.HasSubstructMatch(substructure)

True

For a more interesting set of examples, we can search our collection of 20 common amino acids (see section 15.2.2) for key substructures. We will start by using glycine, the simplest of the common amino acids, as the substructure which should return all 20 amino acids. As an extra step below, we will also orient all the amino acids in the same way with respect to the substructure. That is, the substructural element that we are searching for in each amino acid will be oriented the same way for all 20 amino acids.

# seraches for substruture
substructure= Chem.MolFromSmiles('C(C(=O)[O-])[NH3+]')
matching_amino_acids = [AA for AA in AminoAcids if AA.HasSubstructMatch(substructure)]

# orients common substructures the same way
AllChem.Compute2DCoords(substructure)
for amino_acid in matching_amino_acids:
    _ = AllChem.GenerateDepictionMatching2DStructure(amino_acid, substructure)

# generates grid of matching molecules
Chem.Draw.MolsToGridImage(matching_amino_acids, 
                          molsPerRow=4,
                          subImgSize=(200,200),
                          legends=list(df['name'])) 

../../_images/416e80d172bb2f14bd2e92288fc47254e1f59a4adb8eba8681c300134dc7a8ea.svg

Indeed, it did return all 20 amino acids, and notice how the core structure of all amino acids are oriented the same direction. Now let us try something a little more interesting by search for all amino acids with a benzene ring in them. The substructural bonding pattern in this case is benzene itself, and the three aromatic amino acids are returned.

substructure = Chem.MolFromSmiles('c1ccccc1')
AA_with_pattern = [AA for AA in AminoAcids if AA.HasSubstructMatch(substructure)]

Chem.Draw.MolsToGridImage(AA_with_pattern)

../../_images/6ce53a1e464ba85228a70e4cbee6f14ef67a553d00efcaba386188e5baf4d0cd.svg

It might be nice to still have the name labels for our three matches, so the above search is repeated but instead on a zip object comprised of the names of the amino acids and the Molecule objects.

AA_zipped = list(zip(df['name'], AminoAcids))

substructure = Chem.MolFromSmiles('c1ccccc1')
with_pattern = [AA for AA in AA_zipped if AA[1].HasSubstructMatch(substructure)]

name = [AA[0] for AA in with_pattern]
mol_obj =  [AA[1] for AA in with_pattern]

Chem.Draw.MolsToGridImage(mol_obj, legends=name)

../../_images/a1503e48c6df595a2848d593fbc8df835d53876a318cc8f9e1de505f19979d84.svg

15.6 Atoms and Bonds#

RDKit allows access to information on specific atoms and bonds through the GetAtoms() and GetBonds() methods, respectively. These functions return a sequence type of object that can be iterated through using a for loop to access individual atoms or bonds. Using the following methods, the user can access or even modify various pieces of information about the atoms or bonds. Below Table 9 and Table 10 contain some key functions for working with atoms and bonds.

Table 9 Select Atom Methods

Function	Description
`GetDegree()`	Returns number of atoms bonded directly to it, includes hydrogens only if they are explicitly defined
`GetAtomicNum()`	Returns atomic number
`GetChiralTag()`	Determines if the atom is a chiral center and CW or CCW designation
`GetFormalCharge()`	Returns formal charage of atom
`GetHybridization()`	Returns hybridization of atom
`GetIsAromatic()`	Returns bool as to whether atom is aromatic
`GetIsotope()`	Returns isotope number if designated, otherwise returns `0`
`GetNeighbors()`	Returns tuple of directly bonded atoms
`GetSymbol()`	Returns atomic symbols as a string
`GetTotalNumHs()`	Returns number of hydrogens bonded to the atom
`IsInRing()`	Returns bool designating if the atom is in a ring
`SetAtomicNum()`	Sets the atomic number to user defined value
`SetFormalCharge()`	Sets formal charge to user defined value
`SetIsotope()`	Sets isotope to user defined integer value

As an example, let’s look at the atoms in aspirin.

aspirin

If we generate a list populated with the degrees of atoms (i.e., number of other atoms bonded directly to it), you may notice that there are no 4 values even though the methyl (i.e., -CH\(_3\)) carbon should have four atoms attached to it. This is because the hydrogen atoms are not explicitly designated in the structure (i.e., the are implicit), so they are not counted.

[atom.GetDegree() for atom in aspirin.GetAtoms()]

[1, 3, 1, 2, 3, 2, 2, 2, 2, 3, 3, 1, 1]

We can count the number of implicit hydrogens using the GetNumImplicitHs() method, and the third value is a 3 making it the methyl carbon.

[atom.GetNumImplicitHs() for atom in aspirin.GetAtoms()]

[0, 0, 3, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1]

We can also use these atom methods to change values and attributes of various atoms. For example, we can set the isotopes of the carbonyl carbons (i.e., C=O) to \(^{13}\)C. This is accomplished with the following code that iterates through all the atoms and finds the carbonyl carbons by testing for atoms that have an atomic number of 6, are not aromatic, and have no hydrogens and then setting the isotope value to 13. The molecular weight is calculated before and after the isotopes are changed for comparison.

Chem.Descriptors.MolWt(aspirin)

180.15899999999996

for atom in aspirin.GetAtoms():
    if atom.GetAtomicNum() == 6 and \
        not atom.GetIsAromatic() and \
        atom.GetTotalNumHs() == 0:
    
        atom.SetIsotope(13)

print(Chem.Descriptors.MolWt(aspirin))
aspirin

182.14370968

The molar mass has increased due to two of the carbon atoms being isotopically labeled, and we can see in the image which of the two carbons were isotopically labeled. It is worth nothing that the molecular weight before isotopically labeling assumes a natural distribution of isotopes which for carbon is 98.9% \(^{12}\)C and 1.1% \(^{13}\)C. In the isotopically labeled structure, the two carbonyl carbons are 100% \(^{13}\)C.

Using bond methods, we can perform analogous types of operations except that bonds have different attributes than atoms. A table of selected bond methods is provided below.

Table 10 Select Bond Methods

Function	Description
`GetBeginAtom()`	Returns first atom in bond
`GetEndAtom()`	Returns second atom in bond
`GetBondType()`	Returns type of bond (e.g., SINGLE, DOUBLE, AROMATIC)
`GetIsAromatic()`	Returns bool as to whether bond is aromatic
`GetIsConjugated()`	Returns bool as to wether bond is conjugated
`IsInRing()`	Returns bool as to wether bond is in ring
`SetBondType()`	Sets bond type
`SetIsAromatic()`	Sets bool designating if a bond is aromatic

As a demonstration, we will examine the bonds in the structure of acetone and change the carbonyl double bond to a single bond. This is done by searching for a double bond, setting it to a single bond, and then changing the formal charges of the atoms attached to that bond.

acetone

for bond in acetone.GetBonds():
    if bond.GetBondType() == Chem.BondType.DOUBLE:
        bond.SetBondType(Chem.BondType.SINGLE)
        end = bond.GetEndAtom().SetFormalCharge(-1)
        begin = bond.GetBeginAtom().SetFormalCharge(+1)
acetone

Exercises#

Complete the following exercises in a Jupyter notebook using RDKit. You are encouraged to also to use data libraries such as NumPy or pandas to support your solutions. Any data file(s) referred to in the problems can be found in the data folder in the same directory as this chapter’s Jupyter notebook. Alternatively, you can download a zip file of the data for this chapter from here by selecting the appropriate chapter file and then clicking the Download button.

Load the structure for morphine into RDKit using either a SMILES or InChI representation. You will need to either generate one of these representations using chemical drawing software or find one online from a free resource.

a) Visualize the structure of morphine and save it as an SVG image file.

b) Use RDKit to determine the number of chiral centers in the structure. Your code should output an integer value, not just a list of chiral centers.

c) Use RDKit to determine the number of hydrogen bond acceptors in the structure.

d) Use RDKit to determine the number of rings in the structure.
Load the amino_acid_SMILES.txt file and use RDKit for the following.

a) Determine the absolute configuration (i.e., R vs. S) of the \(\alpha\)-carbon for all the chiral amino acids. Most are the same, but one is an exception. Which is it?

b) How many amino acids have two chiral centers?
Load the organic_molecules.txt dataset containing SMILES representations of a range of organic molecules.

a) Using descriptors, generate a list containing the SMILES representations of only primary and secondary alphatic alcohols.

b) Using pattern matching, generate an image containing only primary alcohols from the file. To help you along, here is the SMARTS representation of a primary alcohol for the pattern matching. Chem.MolFromSmarts('[CH2][OH]')

c) Calculate the percentage of heavy-element (i.e., not with hydrogens) bonds that are C-O bonds.

d) Calculate the percentage of carbon atoms in a ring.
Use RDKit to generate an image showing all isomers of 1,2-dimethylcyclohexane. You will need to look up the SMILES or other representation first.

Chapter 15: Cheminformatics with RDKit

Contents

Chapter 15: Cheminformatics with RDKit#

15.1 Loading Molecular Representations into RDKit#

15.2 Visualizing Chemical Structures#

15.2.1 Single Chemical Structures#

15.2.2 Grids of Chemical Structures#

15.2.3 Molecules in Pandas DataFrames#

15.3 Stereochemistry#

15.3.1 Assigning Stereochemistry#

15.3.2 Counting and Generating Isomers#

15.4 `Chem.Descriptor` Module#

15.4.1 Molecular Features#

15.4.2 Quantifying Functional Groups#

15.5 Searching Molecules for Structural Patterns#

15.6 Atoms and Bonds#

Further Reading#

Exercises#

Chapter 15: Cheminformatics with RDKit

Contents

Chapter 15: Cheminformatics with RDKit#

15.1 Loading Molecular Representations into RDKit#

15.2 Visualizing Chemical Structures#

15.2.1 Single Chemical Structures#

15.2.2 Grids of Chemical Structures#

15.2.3 Molecules in Pandas DataFrames#

15.3 Stereochemistry#

15.3.1 Assigning Stereochemistry#

15.3.2 Counting and Generating Isomers#

15.4 Chem.Descriptor Module#

15.4.1 Molecular Features#

15.4.2 Quantifying Functional Groups#

15.5 Searching Molecules for Structural Patterns#

15.6 Atoms and Bonds#

Further Reading#

Exercises#

15.4 `Chem.Descriptor` Module#