. @dan2097 (or anyone) Is there an index of #opsin IUPAC failure codes with possible fixes? I can correct obvious typos/whitespace issues but things like "locant 7 unsaturated" or "Too many radicals for esters" not obvious to fix
@iChemLabs Incidentally (Optical) Chemical Structure Recognition, abbreviated OCSR or CSR, is another common way of describing this without calling the process "OCR"
@houndcl In a real-world setting you could potentially also get improvements by processing all images from a given document and preferring the interpretation that gave more groups in common with the other images, as often a series of related compounds is synthesized
@houndcl Many compounds will be in PubChem, so there are real-world benefits, but I still wonder whether a fragment-based approach could be more general e.g. for documents with novel compounds; likely no advantage for the Kaggle comp, but probably within their new restrictions.
@jwmay@uspto@ReedTechLifeSci They appear to now have switched their naming scheme of _oldest.tar _old.tar, no suffix to no suffix, _r1.tar and _r2.tar. In the case I looked at the re-released files had a supplementary information zip that was missing from the original
Hi @uspto, updates to your bulk data look to have left some stray files "*_old.tar" laying around. bulkdata.uspto.gov/data/patent/ap…. @ReedTechLifeSci bulk data file sizes match the "*_old.tar" rather than the new ones. Any info on what changed, old files should probably be removed?
@cdsouthan@markussitzmann@vfscalfani@theNCI The issue is the forward slash. Many web servers don't allow them as part of the URL, even when encoded, ostensibly for security reasons. As well as OSRA, there's also Imago from Epam, and MOLVEC from NCATS: molvec.ncats.io
@ChemConnector@markussitzmann@baoilleach I'm sure @baoilleach will correct me if I'm wrong, but I think LeadMine's current code for NMR extraction for the most part comes from that project. It's unfortunate that the next step with MestreLab, to do the NMR prediction to verify the spectra, didn't come to fruition.
@markussitzmann@baoilleach Your hypothesis may be proven correct, or incorrect, don't know until the work is done. ANd that's exactly the work I wanted to do but we never made progress. However, I STILL want to do it if there is interest.
@egonwillighagen@baoilleach Are you eluding to the Experimental Data Checker which tried to check whether a textual spectrum was plausible by comparison with the structure?
@cdsouthan The issue is actually the forward slash, quite a few web servers (including this one) don't correct handle encoded slashes, ostensibly for security reasons. I should probably send requests in a different way to workaround this issue.
@baoilleach You can also encounter mentions of Tyr(O-Me), which possibly makes the reason for the O clearer, I think it's to indicate which atom on the tyrosine is the substitution point, but given the normal meaning of an O in a formula this just ends up being even more unclear
Peptide scientists, please decide: is it Phe(4-OMe), Tyr(Me) or my favourite, Tyr(OMe)?
Fortunately, Tyr(Me) appears to be much preferred in PubMed Abstracts. This follows the rule of minimising the substitutent size.
@baoilleach@marwinsegler@ChemProfCramer@ACSCOMP@phisch124@nmsoftware It is technically in the CML version of the dataset. It can be found by searching for reactions with a reactionAction where the action is "Irradiate". This action however doesn't distinguish between sonication and EM radiation (but the text can be checked)
@marwinsegler@ChemProfCramer@ACSCOMP@phisch124@nmsoftware Just spoke to John. I think we already have this information - though it's not in the dataset. Though the weird catalysts are already pretty distinctive I think for these reactions.
@i_vishalll@cdsouthan Apologies for the issue, OPSIN was moved to a new server earlier this week that was not configured quite right for JNI-InChI. This is now resolved.
@eawRDM InChI is the answer to the question...but if I wanted to use the chemical structure I would also want a format that more precisely captured the structure. As @baoilleach commented, the InChI/InChIKey can be generated precisely from the SMILES, but not vice versa.