ICM User's Guide: A Little Theory on Learning

Aug 4 2022 Feedback.

Contents


Help Videos
Reference Guide
Getting Started
Protein Structure
Molecular Graphics
Slides & ActiveICM
Cheminformatics
Learn and Predict
Learn
Predict
Fingerprint Methods
3D QSAR
Theory
MolScreen
3D Ligand Editor
Tables and Plots
Local Databases
KNIME
Tutorials

Index

ICM User's Guide
8.5 A Little Theory on Learning

For a more detailed explanation of the theory behind Partial Least Squares (PLS) we suggest you read Geladi et al Analytica Chimica Acta (1986) 1-17.

PLS (Partial Least Squares) Regression PLS regression algorithm builds linear prediction model: in format y=(w,x)+b, where b is the bias - a real number, and w is the weights vector, which is scalarly multiplied by the data vector x. PLS uses the given learning y values very actively which allows it to produce fairly good models with respect to constraint of being linear. Although linear regression models have an advantage of weights for each descriptor which gives a useful information and allows feature selection in many cases.

The linear model simply is not able to predict higher order dependencies.

There are different ways to deal with it. By adding the second order columns into the descriptor set you can let PLS predict them. Actually if you have a lot of columns derived from basic data, the linear model built will be able to make a high-quality linear approximations of the actual functions. ICM has a powerful tool for automatical generation of such descriptors based on compound data -- molecule fingerprints generation algorithm. It generates hundreds of columns based on initial data. The withdraw is that analysing the weights given by PLS to generated descriptors is almost senseless. You will need a mol column in your table to use this feature.

ICM has built-in models for prediction of several significant molecule properties, like logP, logS, PSA based on fingerprints+PLS symbiosis, which have proven their quality.

PC (Principal Component) Regression

PCR also builds linear model in its simplest form, as PLS does, though it sets other weights to descriptors, and built models are usually worse in sense of predicting, because PCR uses value information of the training data only in secondary way. We recommend you to use PCR, when you want to build an ordinary regression (MLR - Multiple Linear Regression) model by using only some number of first principal components of X data matrix (ordered by decreasing eigenvalues) or even builing the full MLR model (by setting the number of PCs to value higher than the number of rows in matrix).

Random Forest regression and classification

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.

Prev
3D QSAR

Home
Up

Next
Learn

Copyright© 1989-2020, Molsoft,LLC - All Rights Reserved.

This document contains proprietary and confidential information of Molsoft, LLC.
The content of this document may not be disclosed to third parties, copied or duplicated in any form,
in whole or in part, without the prior written permission from Molsoft, LLC.