Appendix 1: Remote Requests

There are a number of freely available online chemical databases that can be used to build datasets, such as the Chemical Abstracts Service (CAS), ChEMBL, ChemSpider, the RCSB Protein Data Bank, PubChem, and PubMed, among others. While some databases, such as the Spectral Database for Organic Compounds (SDBS), principally support access through a web browser, many databases support programmatic access to their data, which enables the user to automate downloading or searching the database.

This requires the database to have what is known as an Application Programming Interface (API), which allows Python to communicate with the database software. APIs often have idiosyncratic formatting rules that must be followed carefully to avoid errors. It is also important to follow the database's usage rules, such as how much data may be downloaded, what the data may be used for, and whether users are required to register; registration is typically free for academic or nonprofit use. In this example, you will learn to access the PubChem database and build a small dataset of organic chemicals along with chemical features to describe them. PubChem does not require any registration, but there is a rate limit on accessing the data, which is addressed below.

To access the database, we will use the Python requests library, which allows the user to fetch data from remote web servers with Python. This package is installed by default with Anaconda and can also be installed using pip. Because it makes remote requests, it is prudent to keep this library updated, just as you would a web browser.
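For reference, installing or updating the library with pip looks like the following (the exact command may vary with your Python environment):

```shell
# install requests, or upgrade it if it is already present
python -m pip install --upgrade requests
```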

PubChem requests use a URL, just like your web browser, built from the following five components:

  • prolog_URL - https://pubchem.ncbi.nlm.nih.gov/rest/pug

  • data_input - compound/smiles

  • identifier - OC(C=1C=CN=C2C=CC(OC)=CC21)C3N4CCC(C3)C(C=C)C4

  • operation - property/Volume3D

  • output - txt

The prolog is the base URL that allows requests to locate the remote database server; the data_input indicates what type of information will be provided to look up a chemical compound; the identifier is the chemical identifier itself; the operation specifies what information you want back; and the output is the format of the returned information. The latter will be text in our case, but you can have PubChem return other formats, such as PNG or CSV, if desired. The five pieces above are concatenated with / separating them using the join() method and provided as a complete URL to the requests library. You could also concatenate the strings using the + operator, as long as you ensure there is a / separating each component.

full_url = '/'.join([prolog_URL, data_input, identifier, operation, output])
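Both construction methods produce identical results. A quick sketch demonstrating the equivalence, using a short SMILES (ethanol, CCO) as a hypothetical identifier for readability:

```python
# the five URL components, as described above
prolog_URL = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug'
data_input = 'compound/smiles'
identifier = 'CCO'   # ethanol, a short hypothetical example
operation = 'property/Volume3D'
output = 'txt'

# join() inserts '/' between each component automatically
joined = '/'.join([prolog_URL, data_input, identifier, operation, output])

# with +, the '/' separators must be added by hand
plussed = prolog_URL + '/' + data_input + '/' + identifier + '/' + operation + '/' + output

print(joined == plussed)  # True
```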

Once the result is concatenated, it will look something like below.

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/OC(C=1C=CN=C2C=CC(OC)=CC21)C3N4CCC(C3)C(C=C)C4/property/Volume3D/txt
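One caveat worth noting: because the SMILES string is placed directly into the URL path, characters with special meaning in URLs, such as # (a triple bond in SMILES) or /, can truncate or alter the request. The standard library's urllib.parse.quote() function can percent-encode such characters before the URL is assembled; a minimal sketch, using acetonitrile (CC#N) as a hypothetical example:

```python
from urllib.parse import quote

smiles = 'CC#N'   # acetonitrile; an unescaped '#' starts a URL fragment
# safe='' also escapes '/', which would otherwise split the URL path
encoded = quote(smiles, safe='')
print(encoded)   # CC%23N
```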

This URL is then fed into the requests.get() function, as shown below, which makes the actual request to the remote server to fetch the information.

requests.get(full_url)

Putting the pieces together, the complete example below builds the URL and makes the request.

import requests
prolog_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
data_input = "compound/smiles"
identifier = 'OC(C=1C=CN=C2C=CC(OC)=CC21)C3N4CCC(C3)C(C=C)C4'
operation = "property/Volume3D"
output = "txt"

full_url = '/'.join([prolog_URL, data_input, identifier, operation, output])

res = requests.get(full_url)
res
<Response [200]>

Once you have the result, use the .text attribute to get the response as plain text; you will also need to remove the trailing newline character.

res.text
'252.2\n'
res.text[:-1]
'252.2'
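Slicing off the last character works here, but str.strip(), which removes surrounding whitespace (including the newline), does not assume a fixed number of trailing characters; converting with float() then yields a number you can compute with. A minimal sketch using the string returned above:

```python
raw = '252.2\n'             # text as returned by PubChem above
value = float(raw.strip())  # strip() removes the trailing newline; float() converts the string
print(value)                # 252.2
```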

If you want to access a larger number of molecules through this approach, you will need to use a for loop with a list of molecular identifiers that are swapped into each request. It is important to note that PubChem limits users to no more than 5 requests per second, so you will need to limit your request rate. This is relatively easy to accomplish using the time.sleep(n) function from the native Python time module, where n is the number of seconds to pause your code. For example, every time time.sleep(1) is run, the function waits 1 second before the next line of code is executed. Placing this call in the for loop ensures the maximum request rate is not exceeded.

As an example, below we request the volume of four alcohols from PubChem and store them in a list.

import time

ROH_smiles = ['CC(O)C', 'C1CCCCC1O', 'CC(C)(C)O', 'O[C@H]1[C@H](C(C)C)CC[C@@H](C)C1']

volumes = []
for ROH in ROH_smiles:
    full_url = '/'.join([prolog_URL, data_input, ROH, operation, output])
    res = requests.get(full_url)
    volumes.append(res.text[:-1])
    time.sleep(1) # pauses for 1 second

volumes
['54.3', '84.6', '66.7', '134.3']
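Note that the volumes come back as strings; for any numerical analysis you will likely want them as floats, perhaps keyed by their SMILES. A small sketch using the results gathered above:

```python
ROH_smiles = ['CC(O)C', 'C1CCCCC1O', 'CC(C)(C)O', 'O[C@H]1[C@H](C(C)C)CC[C@@H](C)C1']
volumes = ['54.3', '84.6', '66.7', '134.3']   # strings collected from PubChem above

# pair each SMILES with its volume, converting the strings to floats
volume_by_smiles = {smi: float(vol) for smi, vol in zip(ROH_smiles, volumes)}
print(volume_by_smiles['CC(C)(C)O'])   # 66.7
```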