Project SMILES for Researchers: SAR in Scientific Literature

Why do we need manually curated SMILES SAR data?

The idea is to convert the structure-activity relationship (SAR) data for molecules in patents and articles into easy-to-handle formats such as SMILES.
I agree there are automatic ways of extracting information from scientific literature, and people are using them, but the truth is that they are not 100% reliable. To make matters worse, they produce false results too.
Such automatic applications surely have a robust extraction procedure, but the real problem is picking what to extract. This is where they fail consistently.
SAR data is usually provided in pictorial form, i.e. as chemical drawings pasted into the text as objects. Hence, software simply cannot make sense of it.
One possible solution is for authors to start reporting their SAR in the form of SMILES. The only problem is that we are trained to admire structures, not to read strings of C, H, N, O and so on. A few pharma firms did attempt this in their patents, but only in a handful of cases. Even these patents failed at automatic curation, as “things kept missing” during automatic conversion. I personally saw many chemically unacceptable bonds and valencies.
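
To give a feel for how “things kept missing” shows up in a SMILES string, here is a minimal sketch of a purely syntactic check. It is only an illustration: the function name smiles_syntax_problems is my own, it uses nothing beyond the Python standard library, and it only catches truncation-type damage (unclosed branches, unclosed ring bonds, unclosed bracket atoms). It does not check valencies; for that, a proper cheminformatics toolkit would be needed.

```python
def smiles_syntax_problems(smiles: str) -> list[str]:
    """Return a list of truncation-type problems found in a SMILES string.

    Hypothetical helper for illustration only: it checks bracket atoms,
    branch parentheses and ring-closure labels, NOT chemistry or valence.
    """
    problems = []
    depth = 0            # current nesting depth of '(' branches
    open_rings = set()   # ring-closure labels seen an odd number of times
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Digits inside [...] are isotopes, H counts or charges, so skip them.
            end = smiles.find("]", i)
            if end == -1:
                problems.append("unclosed bracket atom")
                break
            i = end + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append("unmatched ')'")
                depth = 0
        elif ch == "%":
            # Two-digit ring-closure label, e.g. %12
            label = smiles[i + 1:i + 3]
            if len(label) == 2 and label.isdigit():
                open_rings ^= {str(int(label))}
                i += 3
                continue
            problems.append("malformed %nn ring label")
        elif ch in "0123456789":
            # Single-digit ring-closure label: first sight opens, second closes.
            open_rings ^= {ch}
        i += 1
    if depth > 0:
        problems.append("unclosed '('")
    if open_rings:
        problems.append("unclosed ring bond(s): " + ", ".join(sorted(open_rings)))
    return problems


if __name__ == "__main__":
    for s in ["c1ccccc1", "CC(=O)O", "c1ccccc", "CC(=O"]:
        print(s, smiles_syntax_problems(s))
```

A string that passes this check can still be chemically wrong, which is exactly why a human curator has to compare the result against the drawn structure.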


To understand what is meant here, the best example that comes to mind is Brigatinib.




Brigatinib represents the most clinically advanced phosphine oxide-containing drug candidate to date, and is currently being evaluated in a global phase 2 registration trial for non-small-cell lung cancer (NSCLC). Brigatinib displayed low nM IC50s against native ALK and all tested clinically relevant ALK mutants in both enzyme-based biochemical and cell-based viability assays, and demonstrated efficacy in multiple ALK+ xenografts in mice, including Karpas-299 (anaplastic large-cell lymphomas [ALCL]) and H3122 (NSCLC).
Until the publication of US20150225436A1, the structure of Brigatinib (also known as AP26113) was reported incorrectly by authors, blog writers and, guess who, the chemical manufacturers. Wikipedia (accessed on 7th May 2016) still carries the originally reported wrong structure. A few manufacturers have switched to the correct structure and now list the other one as a Brigatinib analog, but some still show the older structure.

The patent application was published in the first quarter of 2015, yet it has been pretty successful at hiding from search engines. Automatic curation of the PDF of the same patent did not report AP26113 either. The reason is simple: the data is provided as an inserted picture. It is mentioned as example 122 (compound 5) in the text.


Therefore, it makes sense to support patents and articles with their SMILES SAR data, and manual curation is the best way to handle it. Patents are written, edited and approved by independent sets of eyes; none of that is automated. Similarly, articles undergo various levels of reading and screening, all involving humans. Hence, the best and most accurate way to do this is manual curation. Parts such as citation data or assay data can be extracted automatically, but structure and activity data have to be curated manually.
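
To make this a bit more concrete, here is a rough sketch of what one manually curated SAR record could look like and how a set of such records could be written out as CSV. Everything in it is an assumption for illustration: the SarRecord class, its field names and the to_csv helper are my own, and the example row is a placeholder (ethanol, "CCO", with a made-up activity value), not data from any real patent or article.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class SarRecord:
    """One curated structure-activity data point (hypothetical layout)."""
    source: str          # patent or article identifier
    compound_label: str  # label used in the source, e.g. an example number
    smiles: str          # structure, typed in and checked by the curator
    target: str          # assay target
    activity_type: str   # e.g. IC50
    value: float
    unit: str
    curated_by: str      # who entered and verified the record


def to_csv(records: list[SarRecord]) -> str:
    """Serialise records to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=[f.name for f in fields(SarRecord)],
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buf.getvalue()


# Placeholder record: NOT real data, only shows the shape of an entry.
records = [
    SarRecord(
        source="EXAMPLE-PATENT-0001",
        compound_label="example 1 (compound 1)",
        smiles="CCO",
        target="hypothetical kinase",
        activity_type="IC50",
        value=12.0,
        unit="nM",
        curated_by="manual",
    )
]

if __name__ == "__main__":
    print(to_csv(records))
```

Keeping the label from the source next to the SMILES is a deliberate choice here: it lets anyone trace a record back to the exact example or compound number in the patent or article and check it against the drawing.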

Now, the final choice is between (a) authors submitting it along with their publication and (b) independent curators adding it to each work after publication.