Why do we need manually curated SMILES SAR data?
The idea is to convert patents/articles containing structure-activity relationship (SAR) data for molecules into easy-to-handle formats such as SMILES.
I agree that there are automatic ways of extracting information from scientific
literature, and people are using them, but the truth is that they are not 100%
reliable. To make matters worse, they produce false results too.
Such automatic applications surely have a robust extraction procedure, but the
true problem is picking what to extract. This is where they fail consistently.
SAR data is usually provided in pictorial form, i.e. as chemical drawings pasted
into the text as objects. Hence, software simply cannot make any sense of it.
One possible solution is for authors to start reporting their SAR in the form of
SMILES. The only problem is that we are trained to admire structures, not to read
strings of C, H, N, O and so on. A few pharma firms did attempt this in their
patents, but only in a handful of cases. These patents also failed at automatic
curation, as "things kept missing" during automatic conversions. I personally saw
many chemically unacceptable bonds and valencies.
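To give a feel for how "things kept missing" can be caught, here is a minimal sketch in Python of a purely syntactic sanity check on a SMILES string. It only looks for unbalanced branches, unclosed bracket atoms and unpaired ring-closure labels, which are typical symptoms of a truncated conversion. The function name smiles_sanity_problems is my own invention for illustration, and the check does not validate valencies at all; for that, a proper cheminformatics toolkit and a human curator are still needed.

```python
def smiles_sanity_problems(smiles: str) -> list[str]:
    """Return a list of syntactic problems found in a SMILES string.

    Hypothetical helper for illustration only: it checks branch
    parentheses, bracket atoms and ring-closure labels, NOT valence.
    """
    problems = []
    depth = 0            # current nesting depth of '(' branches
    open_rings = set()   # ring-closure labels seen an odd number of times
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip the whole bracket atom; digits inside are not ring labels.
            end = smiles.find("]", i)
            if end == -1:
                problems.append("unclosed '[' bracket atom")
                break
            i = end + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append("')' without matching '('")
                depth = 0
        elif ch == "%":
            # Two-digit ring-closure label such as %12.
            label = smiles[i + 1:i + 3]
            if len(label) == 2 and label.isdigit():
                open_rings ^= {int(label)}
                i += 3
                continue
            problems.append("malformed '%' ring label")
        elif ch in "0123456789":
            # Single-digit ring-closure label; toggling pairs them up.
            open_rings ^= {int(ch)}
        i += 1
    if depth > 0:
        problems.append("unclosed '(' branch")
    if open_rings:
        labels = ", ".join(str(n) for n in sorted(open_rings))
        problems.append("unpaired ring-closure label(s): " + labels)
    return problems


if __name__ == "__main__":
    for s in ["c1ccccc1", "c1ccccc", "CC(C"]:
        print(s, smiles_sanity_problems(s))
```

A string that passes this check can still be chemically wrong, so it is best seen as a first filter before manual review, not a replacement for it.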
To understand what is meant here, the best example that comes to mind is that
of Brigatinib.
Brigatinib
represents the most clinically advanced phosphine oxide-containing drug
candidate to date, and is currently being evaluated in a global phase 2
registration trial for non-small-cell lung cancer (NSCLC). Brigatinib displayed
low nM IC50s against native ALK and all tested clinically relevant
ALK mutants in both enzyme-based biochemical and cell-based viability assays,
and demonstrated efficacy in multiple ALK+ xenografts in mice, including
Karpas-299 (anaplastic large-cell lymphomas [ALCL]) and H3122 (NSCLC).
Until the publication of US20150225436A1, the structure of Brigatinib (also
known as AP26113) was reported wrongly by authors, blog writers and, guess who,
the chemical manufacturers. Wikipedia (accessed on 7th May 2016) still carries
the originally reported wrong structure. A few manufacturers have changed to the
correct structure, keeping the other as a Brigatinib analog, but some still list
the older structure. The patent application was made public in the first quarter
of 2015, but it has still been pretty successful at hiding from search engines.
Automatic curation of the PDF of the same patent did not report AP26113 either.
The reason is simple: the data is provided as an inserted picture. In the text it
is mentioned as example 122 (compound 5).
Therefore, it makes sense to support patents/articles with their SMILES SAR
data, and manual curation is the best way to handle it. Patents are written,
edited and approved by independent sets of eyes; they are not automated.
Similarly, articles undergo various levels of reading and screening, all
involving humans. Hence, the best and most correct way to do this is manual
curation. Parts such as citation data or assay data can be extracted
automatically, but structure and activity data have to be curated manually.
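As a rough idea of what a manually curated SMILES SAR entry could look like, here is a small Python sketch that writes one record per compound to CSV. Everything in it is an assumption made for illustration: the SarRecord class, its field names and the to_csv helper are hypothetical, and the example row uses ethanol (CCO) with a placeholder activity value, not data from any real patent or article.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class SarRecord:
    """One curated structure-activity data point (hypothetical layout)."""
    source: str         # patent or article identifier
    example_id: str     # label used in the source, e.g. an example number
    smiles: str         # structure redrawn and checked by the curator
    assay: str          # assay name as given in the source
    activity_type: str  # e.g. IC50
    value: float
    unit: str
    curated_by: str     # who entered and checked the record


def to_csv(records: list[SarRecord]) -> str:
    """Serialize curated records to CSV text, one row per compound."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=[f.name for f in fields(SarRecord)],
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buf.getvalue()


# Placeholder row: NOT real SAR data, only a demonstration of the format.
records = [
    SarRecord(
        source="hypothetical-patent",
        example_id="example 1",
        smiles="CCO",
        assay="placeholder assay",
        activity_type="IC50",
        value=1.0,
        unit="nM",
        curated_by="curator-A",
    )
]

if __name__ == "__main__":
    print(to_csv(records), end="")
```

Keeping the source label (such as the example number) next to the SMILES lets a reader trace every row back to the drawing it was curated from.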
Now, the final choice is whether (a) authors submit it along with their work when publishing, or (b) it is added to each work after publication by independent curators.