Aug 8 2008
ICM User's Guide
12.5 Learn and Predict


Support Vector Machines (SVM) and Partial Least Squares (PLS) are commonly used methods that are implemented in ICM to predict compound properties or any other variable. Many tutorials on these methods are freely available on the web. For instance, http://www.kernel-machines.org/ is a comprehensive resource on SVM. For the details of the ICM implementation and an explanation of our terminology, see the theory section below.

To perform 'learn and predict' in ICM, the data must be stored in a table, a molecular table, or a csv file. See the tables chapter for more information on ICM tables. Both chemical compounds and numeric data can serve as the source for building prediction models.

All molecular property predictors are calculated using fragment-based contributions. We developed an original method that splits a molecule into a set of linear or non-linear fragments of different lengths and representation levels; each chemical pattern found is then converted into a descriptor.

12.5.1 Learn


First load in a table of data on which you wish to perform the learn and predict functions. See the tables chapter for more information on ICM tables.

  • Select Tools/Table/Learn and a window as shown below will be displayed. Or use the Chemistry/Build Prediction Model option.

  • Enter the name of the table with which you want to perform the predictions. You can also select the table from the drop-down menu.
  • Select the column from which you wish to learn. Use the drop down arrow to select.
    NOTE If the table does not contain any numeric (integer or real) columns, there is nothing to predict, so the "Learn" button will be disabled.
  • Enter a name for the learn model.
  • Select which regression method you wish to use from the drop down menu. See the theory section to determine which method and parameters to use.
  • Select which columns (descriptors) of your table you wish to use to 'learn'.
  • If you are using chemical descriptors to produce your model select the maximal chain length.
  • Select the number of cross-validation groups you wish to use; alternatively, selected rows can be used for cross-validation. The number of iterations affects the speed of the calculation. The default is 5 groups; 2 is the least rigorous choice and 'Leave-1-out' is the most rigorous.
  • Click on the learn button and a table summarizing your model will be displayed as shown below.

  • Click OK and this table will be removed.
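
The cross-validation option in the steps above can be illustrated with a short sketch. It is not ICM code: the function names are hypothetical, and the round-robin assignment of rows to groups is an assumption made only for illustration, since this guide does not state how ICM assigns rows to groups.

```python
def cross_validation_groups(n_rows, n_groups):
    """Split row indices 0..n_rows-1 into n_groups test groups (round-robin).

    The round-robin assignment is an illustrative assumption.
    """
    if not 2 <= n_groups <= n_rows:
        raise ValueError("n_groups must be between 2 and n_rows")
    return [list(range(g, n_rows, n_groups)) for g in range(n_groups)]


def train_test_splits(n_rows, n_groups):
    """Yield (train_rows, test_rows) pairs, one per cross-validation group."""
    for test_rows in cross_validation_groups(n_rows, n_groups):
        held_out = set(test_rows)
        train_rows = [i for i in range(n_rows) if i not in held_out]
        yield train_rows, test_rows


# 10 rows, 5 groups: each model is trained on 8 rows and tested on 2.
splits = list(train_test_splits(10, 5))

# Leave-1-out is the special case where the number of groups equals
# the number of rows, so one model is built per row.
loo_splits = list(train_test_splits(10, 10))
```

With more groups, more models are built and each is trained on a larger share of the rows, which is why Leave-1-out is both the most rigorous and the slowest setting.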

All models are then stored in the ICM workspace as shown below. A number of options are displayed in the right click menu.

12.5.2 Predict


To make a prediction using a previously created model:

Read into ICM the table of data for which you wish to make predictions. Make sure the table contains the same columns that were used to build the model.

  • Select Tools/Table/Predict or Chemistry/Predict.
  • Select which table you wish to make the prediction on.
  • Select which model you wish to use.
  • Check that the required columns are in the table. If any are absent, a red mark will appear against each missing column.
  • Click Predict.

12.5.3 A little theory on learning


Regression Estimation.

In the most general view, the problem is to "guess" a function f: X -> Y from only a set of argument-value pairs. To formalize this: in the regression case Y = R (the set of real numbers), and we also assume that X is a real vector space of dimension n, X = R^n (we say we have n features, or columns). We are given a set of m vector-value pairs (x_1, y_1), ..., (x_m, y_m), where x_i \in R^n, y_i \in R and y_i = f(x_i) (we say we have a training data set consisting of m training argument-value pairs). The terminology divides the process into two steps:

  • Learning: "guessing" f from the given training data so that it estimates the value of y as well as possible for any given x taken from the same source.
  • Prediction: evaluating y = f(x) for any given x.

The main question is what "guessing" means in the paragraph above and how to measure its quality. Usually we assume that there is a practically unlimited source of data vectors and their values, of which we can use only m to determine what the connection actually is; we can then estimate the values for new data vectors. To develop a mathematical approach, we assume that the x values are independent and identically distributed.
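
As a minimal sketch of the two steps, the example below uses ordinary least squares with a single feature (n = 1) and measures quality as the root-mean-square error on pairs that were not used for learning. The method, the function names, and the data are assumptions chosen for illustration; they are not the algorithms ICM uses.

```python
import math


def learn(pairs):
    """Learning step: fit y = w*x + b to (x, y) pairs by least squares."""
    m = len(pairs)
    mean_x = sum(x for x, _ in pairs) / m
    mean_y = sum(y for _, y in pairs) / m
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


def predict(model, x):
    """Prediction step: evaluate the learned function at x."""
    w, b = model
    return w * x + b


def rmse(model, pairs):
    """Root-mean-square error of the model on the given (x, y) pairs."""
    squared = sum((predict(model, x) - y) ** 2 for x, y in pairs)
    return math.sqrt(squared / len(pairs))


# Made-up training pairs drawn from y = 2x + 1.
training = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
# Pairs from the same source that were not used for learning.
held_out = [(4.0, 9.0), (5.0, 11.0)]

model = learn(training)
error = rmse(model, held_out)
```

Measuring the error on held-out pairs rather than on the training pairs is what connects "guessing" to data the model has not seen.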

The ICM Prediction module uses an ICM table as its main data input and output set. In the common case, for learning you will have a table containing all the argument (descriptor) columns and the value column, so all the training data will be contained in the table rows. For learning, select the columns you want to learn on, then right-click on the header of the value column and select the Learn... command from the pop-up context menu. The Learn Options dialog lets you change various learning parameters, though the defaults are suitable in most cases, so you will not need to change most of them.

PLS (Partial Least Squares) Regression

The PLS regression algorithm builds a linear prediction model of the form y = (w,x) + b, where b is the bias (a real number) and w is the weight vector, whose scalar product with the data vector x is taken. PLS makes heavy use of the training y values, which allows it to produce fairly good models given the constraint of being linear. Linear regression models also have the advantage of providing a weight for each descriptor, which gives useful information and allows feature selection in many cases.

However, a linear model simply is not able to predict higher-order dependencies.

There are different ways to deal with this limitation. By adding second-order columns to the descriptor set, you can let PLS predict such dependencies. In fact, if you have many columns derived from the basic data, the resulting linear model can produce a high-quality approximation of the actual function. ICM has a powerful tool for automatic generation of such descriptors from compound data: the molecular fingerprint generation algorithm. It generates hundreds of columns from the initial data. The drawback is that analysing the weights PLS assigns to the generated descriptors is almost meaningless. You will need a mol column in your table to use this feature.
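
The idea of adding second-order columns can be sketched as follows. The helper names are hypothetical and the weights are chosen by hand rather than fitted by PLS; the example only shows that a model which is linear in the expanded columns can represent a product of two original descriptors.

```python
from itertools import combinations_with_replacement


def add_second_order(row):
    """Return the row followed by all products x_i * x_j with i <= j."""
    products = [a * b for a, b in combinations_with_replacement(row, 2)]
    return list(row) + products


def predict_linear(weights, bias, x):
    """Evaluate the linear model y = (w, x) + b."""
    return sum(w_j * x_j for w_j, x_j in zip(weights, x)) + bias


row = [2.0, 3.0]                  # original descriptors x1, x2
expanded = add_second_order(row)  # [x1, x2, x1*x1, x1*x2, x2*x2]

# y = x1*x2 + 1 is not linear in (x1, x2), but it is linear in the
# expanded columns: only the x1*x2 column gets a non-zero weight.
# These weights are hand-picked for illustration, not a PLS result.
weights = [0.0, 0.0, 0.0, 1.0, 0.0]
y = predict_linear(weights, 1.0, expanded)
```

The number of second-order columns grows quadratically with the number of original descriptors, which is one reason the weights of generated descriptors become hard to interpret.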

ICM has built-in models for predicting several important molecular properties, such as logP, logS and PSA, based on the combination of fingerprints and PLS; these models have proven their quality.

SVM (Support Vector Machine) Regression

The other way to estimate complex functions is to build non-linear models. A good example is SVM, which builds a model of the form

y(x) = Sum_{i=1}^m( a_i*K(x_i,x) ) + b

K is the kernel function, K: X x X -> R, with certain properties (see an advanced description of SVM for the details). In general, it is a function describing the mutual similarity of its argument vectors; the simplest case is the inner product (the dot kernel). ICM lets you choose the kernel from the following candidates (some of them have additional parameters):

  • dot K(x,y) = (x,y) (scalar product)
  • polynomial K(x,y) = ( (x,y)+c )^d; d:int,>0, c:real
  • radial K(x,y) = exp( -gamma ||x-y||^2 ); gamma:real
  • sigmoid K(x,y) = tanh(k (x,y) + c); k:real, c:real
  • tanimoto K(x,y) = (x,y) / ( (x,x) + (y,y) - (x,y) )
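
The kernels listed above and the model form y(x) = Sum_i( a_i*K(x_i,x) ) + b can be written out directly. This is a plain transcription of the formulas, not ICM's implementation; the function names and default parameter values are assumptions, and the coefficients a_i and b in the example are made up rather than obtained by training.

```python
import math


def dot(x, y):
    """Dot kernel: the scalar product (x, y)."""
    return sum(a * b for a, b in zip(x, y))


def polynomial(x, y, c=1.0, d=2):
    """Polynomial kernel: ((x, y) + c)^d, with integer d > 0."""
    return (dot(x, y) + c) ** d


def radial(x, y, gamma=1.0):
    """Radial kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sigmoid(x, y, k=1.0, c=0.0):
    """Sigmoid kernel: tanh(k * (x, y) + c)."""
    return math.tanh(k * dot(x, y) + c)


def tanimoto(x, y):
    """Tanimoto kernel: (x, y) / ((x, x) + (y, y) - (x, y)).

    Undefined when both vectors are all zeros (division by zero).
    """
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)


def svm_predict(kernel, train_x, coeffs, bias, x):
    """Evaluate y(x) = sum_i a_i * K(x_i, x) + b."""
    return sum(a_i * kernel(x_i, x) for a_i, x_i in zip(coeffs, train_x)) + bias


# Made-up training vectors and coefficients, for illustration only.
train_x = [[1.0, 0.0], [0.0, 1.0]]
coeffs = [2.0, -1.0]
y = svm_predict(dot, train_x, coeffs, 0.5, [3.0, 4.0])
```

Because the prediction is a sum over training vectors, the model stores those vectors (or at least the ones with non-zero a_i) rather than one weight per descriptor.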

The dot kernel builds models that are, in some sense, linear (first order). However, the form of the function y is more complex than in standard linear regression, since it implicitly includes some of the training vectors. For this reason there is no straightforward way, as there is in PLSR, to analyse the influence of each descriptor on the built model.

SVMR can also be used with molecular fingerprints.

PC (Principal Component) Regression

PCR also builds a linear model in its simplest form, as PLS does, but it assigns different weights to the descriptors, and the resulting models usually predict worse because PCR uses the value information of the training data only in a secondary way. We recommend PCR when you want to build an ordinary regression (MLR, Multiple Linear Regression) model using only the first few principal components of the X data matrix (ordered by decreasing eigenvalue), or even to build the full MLR model (by setting the number of PCs to a value higher than the number of rows in the matrix).

12.5.4 Data Clustering


ICM allows you to create hierarchical clusters for chemical and other objects. Cluster trees can be used for:

  • Navigation through large data sets.
  • Selecting group representatives (taxons).
  • Filtering tables to exclude redundancy.
  • Finding similar elements, and more.
  • Creating hierarchical views of data sets in many different styles, with subsequent image export/printing ability.
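
To make the notion of a hierarchical cluster tree concrete, here is a small sketch of agglomerative clustering. Single linkage and the absolute-difference distance are assumptions chosen for brevity; this section does not say which linkage or distance ICM uses, and the function name is hypothetical.

```python
def single_linkage(points, distance):
    """Agglomerative clustering with single linkage.

    Returns the merges in order, each as (cluster_a, cluster_b, distance),
    where clusters are lists of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(distance(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Made-up one-dimensional data, for illustration only.
points = [0.0, 1.0, 5.0, 6.5]
merges = single_linkage(points, lambda p, q: abs(p - q))
```

Cutting the resulting tree at a chosen distance gives groups from which representatives can be picked, which corresponds to the redundancy filtering use listed above.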



Copyright© 1989-2004, Molsoft,LLC - All Rights Reserved.
This document contains proprietary and confidential information of Molsoft, LLC.
The content of this document may not be disclosed to third parties, copied or duplicated in any form,
in whole or in part, without the prior written permission from Molsoft, LLC.