The contribution of Alfa Aesar melting point data to our open collection has made it possible to validate a significant portion of the dataset. However, curation is a never-ending process. A good example is the discovery of an error in one of the sources for the melting point of warfarin: following David Weinberger's post about our melting point explorer, his brother Andy noticed a problem, which enabled us to fix it.

In a way, creating an open environment to make it easy to find and report errors - as well as add new data - complicates scientific evaluation. In order to report a reproducible process and outcome, it is necessary to take a snapshot of the dataset. Choosing the exact composition of a dataset for a particular application is somewhat arbitrary. Aside from selecting a threshold for excluding measurements that deviate too much, compounds may be excluded based on their type.

For the sake of clarity, we archived the various datasets we created from multiple sources, with brief descriptions of the filtering and merging at each step. From the perspective of an organic chemist, ONSMP013 is probably the most useful at this time. It contains averaged measurements for 12634 organic compounds and excludes salts, inorganics and organometallics. The original file provided by Alfa Aesar, which contained several of these excluded compounds, can be obtained from ONSMP000. It might be interesting at some point to create a collection of melting points for inorganics or salts. We would welcome contributions of collections of melting points with different filters.
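
As a rough illustration of the kind of filtering and averaging described above, here is a minimal Python sketch. It is not the procedure actually used to build ONSMP013: the function name `curate`, the `max_dev` threshold, the choice of the median as the reference point, and the sample values are all assumptions made for the example.

```python
from statistics import mean, median


def curate(measurements, max_dev=5.0):
    """Average melting points per compound after dropping outliers.

    measurements: dict mapping a compound identifier to a list of
        reported melting points in degrees Celsius.
    max_dev: hypothetical threshold; a value further than this from the
        compound's median is excluded before averaging.
    Returns (averaged, excluded), both dicts keyed by compound.
    """
    averaged = {}
    excluded = {}
    for compound, values in measurements.items():
        center = median(values)
        kept = [v for v in values if abs(v - center) <= max_dev]
        dropped = [v for v in values if abs(v - center) > max_dev]
        if kept:
            averaged[compound] = mean(kept)
        if dropped:
            excluded[compound] = dropped
    return averaged, excluded


# Made-up values for illustration only, not taken from any ONS dataset.
snapshot = {
    "compound_A": [120.0, 122.0, 121.0, 160.0],
    "compound_B": [45.0, 47.0],
}
averaged, excluded = curate(snapshot, max_dev=5.0)
```

Returning the excluded values alongside the averages keeps the arbitrary part of the process (the threshold) visible, so a different snapshot can be produced simply by changing `max_dev`.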

One of the advantages of ONSMP013 is that CDK descriptors can be generated for each entry (and these are included in the spreadsheet). Because the descriptors do not depend on commercial software, the modeling is fully transparent and can be extended by anyone.

With this in mind, Andrew Lang has used ONSMP013 to generate a random forest melting point model (MPM002). The most important descriptors turned out to be the number of hydrogen bond donors and the Topological Polar Surface Area (TPSA). The scatter plot below shows the correlation (R² = 0.79) between the predicted and experimental values; color represents TPSA and point size relates to the number of H-bond donors.
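
For readers who want to compute a comparable statistic on their own predictions, the sketch below calculates R² as the squared Pearson correlation between experimental and predicted values. Whether the value quoted above was obtained this way or as a coefficient of determination is not stated here, so treat this definition as an assumption; the function names are illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / sqrt(sxx * syy)


def r_squared(experimental, predicted):
    """Squared Pearson correlation (assumed definition of R² here)."""
    return pearson_r(experimental, predicted) ** 2
```

A call such as `r_squared(experimental_mps, predicted_mps)` takes two lists of melting points in the same compound order and returns a value between 0 and 1.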


Andrew Lang has described in much more detail the rationale for selecting the random forest approach over a linear model in MPM001. He has also compared the performance of CDK descriptors versus those from a commercial program for a small set of drug melting points in MPM003.

The random forest model (MPM002) is now also available as a web service, accessed by entering the ChemSpiderID (CSID) of a compound in a URL. See this example for benzoic acid. If experimental results exist, they will appear at the top, with a link to obtain the predicted melting point underneath.

Note that the current web service for predicting melting points can be slow - it may take a minute to process.

Additional web services for melting point data will be listed on the ONS web services wiki.
