ICM GUI Manual

11.1 Learn

How to Learn from a data set and make a model:

11.1.1 Read and Split Data


First load in a table of data on which you wish to perform the learn and predict functions. See the tables chapter for more information on ICM tables.

Split data into training and test sets

There are two ways to split your data into a training and test set:

1. Random split

Rows are randomly assigned to the training and test sets. Navigate to Tools -> Table -> Train Split, select the Random tab, and choose the training set percentage; the remaining rows become the test set.
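The random split can be sketched outside the GUI as well. A minimal Python illustration (the function name and the 80/20 default are assumptions for this sketch, not ICM API):

```python
import random

def random_split(rows, train_fraction=0.8, seed=42):
    """Randomly partition rows into a training and a test set."""
    indices = list(range(len(rows)))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(indices)
    n_train = int(len(indices) * train_fraction)
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test

# Example: an 80/20 split of ten rows gives 8 training and 2 test rows.
train, test = random_split(list(range(10)), train_fraction=0.8)
print(len(train), len(test))  # 8 2
```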

2. Kennard-Stone split

The Kennard-Stone method builds a representative training set by selecting diverse points from the dataset: it starts from the most distant pair of points and then repeatedly adds the point farthest from those already selected. In Tools -> Table -> Train Split, choose the Kennard-Stone tab and pick a selection strategy.
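The selection procedure can be sketched as follows; this is a generic max-min implementation of the Kennard-Stone algorithm on descriptor vectors, not ICM's internal code:

```python
import math

def kennard_stone(points, n_train):
    """Select a diverse training subset with the Kennard-Stone algorithm:
    seed with the two most distant points, then repeatedly add the point
    whose nearest already-selected neighbour is farthest away."""
    n = len(points)
    # Seed with the pair of points that are farthest apart.
    best = max(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda ij: math.dist(points[ij[0]], points[ij[1]]))
    selected = list(best)
    remaining = [k for k in range(n) if k not in selected]

    while len(selected) < n_train and remaining:
        # Max-min criterion: take the remaining point with the largest
        # distance to its closest selected point.
        k = max(remaining,
                key=lambda r: min(math.dist(points[r], points[s])
                                  for s in selected))
        selected.append(k)
        remaining.remove(k)
    return selected  # training-set indices; the rest form the test set

pts = [(0, 0), (1, 0), (0, 1), (5, 5), (2, 2)]
sel = kennard_stone(pts, 3)
print(sel)  # [0, 3, 4]
```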

11.1.2 Build 2D QSAR Model


Descriptors

Chemical Fingerprints

Test

All models are stored in the ICM workspace. A number of options are available from the right-click menu.

11.1.3 About the Model


To see more details about your model (e.g. R2, RMSE):

Predictive power is the primary measure of a model's performance and reflects its ability to generalize to unseen chemical space. External validation metrics quantify how well the model reproduces experimental trends and magnitudes for molecules excluded from training.

testR2 is the key indicator of external performance, representing the proportion of activity variance in the test set explained by the model. Values above 0.5 typically indicate robust generalization. testMAE measures the mean absolute deviation between predicted and experimental values for external compounds, providing an estimate of the model's average prediction error. testSpearman assesses the rank correlation of predicted versus experimental activities and is particularly informative when the model is intended for compound prioritization rather than precise potency prediction.

Model evaluation metrics

nofLatVec: Number of latent vectors used in the model (for example, PLS components). Each latent vector captures variance from descriptors relevant to activity. Too many can overfit; too few can underfit.

selfMAE: Mean absolute error on the training set. The average absolute difference between predicted and experimental values for molecules used to train the model. Measures how well the model fits known data.

selfR2: Coefficient of determination on the training set. Fraction of activity variance explained within the training set. Higher means a better fit to known data.

selfRMSE: Root mean squared error on the training set. The square root of the average squared prediction error on training data. More sensitive to outliers than MAE.

selfSpearman: Spearman rank correlation on the training set. Measures how well the model preserves the ranking order of activities. Useful when prioritizing compounds where order matters more than exact values.

testMAE: Mean absolute error on the external test set. The average error on molecules not used during training. Measures generalization to unseen data.

testR2: Coefficient of determination on the test set. Fraction of activity variance explained on external data. A primary indicator of external predictive power.

testRMSE: Root mean squared error on the external test set. Similar to MAE but penalizes large deviations more. Useful for detecting large outlier prediction errors.

testSpearman: Spearman rank correlation on the external test set. Measures ranking ability on unseen molecules. High values indicate the model can prioritize compounds effectively.
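The self/test metrics above can all be reproduced from a vector of predicted and experimental activities. A minimal sketch in Python (standard library only, with a tie-free Spearman for brevity):

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, R2, RMSE, and Spearman rank correlation."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    # R2 = 1 - (residual sum of squares) / (total sum of squares)
    mean_t = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    # Spearman = 1 - 6*sum(d^2) / (n*(n^2-1)), valid when there are no ties.
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rt, rp = ranks(y_true), ranks(y_pred)
    d2 = sum((a - b) ** 2 for a, b in zip(rt, rp))
    spearman = 1 - 6 * d2 / (n * (n ** 2 - 1))
    return {"MAE": mae, "R2": r2, "RMSE": rmse, "Spearman": spearman}

m = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

For this toy example the model explains 98% of the variance (R2 = 0.98) and preserves the ranking perfectly (Spearman = 1.0).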

11.1.4 Model Weights


The Weights dialog can be helpful for seeing what drives the correlation in your model and for troubleshooting. The key column is "w", which represents the weight of the fragment in the regression. If a particular fragment carries a high weight, you may want to add similar fragments to your training set to see whether the model improves. Removing fragments from the training set based on their weight is not recommended, because each fragment is one component of a multivariate regression and its individual importance is difficult to determine precisely.

BITS = the index number of the chain or fragment.

name = SMILES string of the fragment. You can copy and paste this into the molecular editor to view it in 2D.

w = the actual weight of the fragment in the regression (used directly to compute the predicted value).

mean = the mean occurrence of the fragment in the model. For example, a value of 6 means the fragment is used 6 times in the regression.

rmsd = a high value indicates a high occurrence of the fragment and influences the relative weight parameter.

wRel = the relative 'importance' of the fragment, based on the RMSD, mean, and CorrY values.
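In a linear fingerprint regression of this kind, the predicted value is the weighted sum of fragment occurrence counts plus an offset. A hypothetical sketch (the fragment SMILES, weights, counts, and intercept below are all invented for illustration, not values from any real model):

```python
# Hypothetical fragment weights (the "w" column) and per-molecule
# occurrence counts; the intercept is also a made-up illustration.
weights = {"c1ccccc1": 0.42, "C(=O)O": -0.75, "CN": 0.10}
counts = {"c1ccccc1": 2, "C(=O)O": 1, "CN": 3}
intercept = 5.0

# Prediction = intercept + sum over fragments of (weight * occurrences).
prediction = intercept + sum(w * counts.get(frag, 0)
                             for frag, w in weights.items())
print(round(prediction, 2))  # 5.39
```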

When you run the prediction you will see an option to color by atom contribution. This colors the atoms in the prediction results from negative (red) to positive (blue) based on their contribution weight (the w column described above) to the model.
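The red-to-blue coloring described above can be sketched as a simple linear gradient over the normalized contribution. This is an illustrative mapping only; ICM's actual palette and normalization may differ:

```python
def contribution_color(w, w_max):
    """Map a signed atom contribution to an RGB color:
    strongly negative -> red, zero -> white, strongly positive -> blue."""
    t = max(-1.0, min(1.0, w / w_max))  # normalize to [-1, 1]
    if t < 0:
        # Blend white -> red as t goes to -1.
        return (255, int(255 * (1 + t)), int(255 * (1 + t)))
    # Blend white -> blue as t goes to +1.
    return (int(255 * (1 - t)), int(255 * (1 - t)), 255)

print(contribution_color(-1.0, 1.0))  # (255, 0, 0)   strongly negative
print(contribution_color(0.0, 1.0))   # (255, 255, 255) neutral
print(contribution_color(1.0, 1.0))   # (0, 0, 255)   strongly positive
```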

11.1.5 Save and Share Model


You can save and share a model by:

