15.3.1 Kernel Chemical Classification/Regression (kcc) models

Hybrid Fingerprints

Dataset

IC50, EC50, Ki, and Kd data were downloaded from ChEMBL18, combined, and converted to pKd values. Compounds with pKd > 5 (i.e. a potency better than 10 micromolar) were classified as positives. 90% of the ChEMBL18 compounds were assigned to the training set, and the remaining 10% to the external test set. A subset of compounds with pKd > 7 against any target in ChEMBL18 was added to the training set as decoys. The approved drugs were added to the external test set as decoys.
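
The dataset preparation can be illustrated with a minimal Python sketch. It is not the procedure used to build the models: the function names are hypothetical, activity values are assumed to be given in nanomolar, all four activity types are treated identically, and the split is assumed to be random (the text above does not say how compounds were assigned).

```python
import math
import random


def to_pkd(value_nm: float) -> float:
    """Convert an activity value in nanomolar (assumed unit) to -log10 of molar."""
    return -math.log10(value_nm * 1e-9)


def is_positive(pkd: float, threshold: float = 5.0) -> bool:
    """A compound is a positive when its pKd is above the threshold (5 in the text)."""
    return pkd > threshold


def split_train_test(compounds, train_fraction=0.9, seed=0):
    """Shuffle and split compounds; a random split is an assumption of this sketch."""
    shuffled = list(compounds)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]
```

With these definitions, an activity of 10000 nM (10 micromolar) maps to a pKd of 5 and is therefore not a positive, while 1000 nM (1 micromolar) maps to a pKd of 6 and is.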

Training

A Kernel Chemical Classification/Regression (kcc) model was trained on the training set compounds. The performance of the model was evaluated on the external test set.

Predicted Value

The kcc model returns two scores:

The kcc score is the classification score. Positives have a median kcc score of 1; decoys have a median kcc score of 0. The kcc score is renamed to MolClass in the pairwise table.

The kca score is the regression score, i.e. a predicted pKd value. A kca score of 6 indicates micromolar activity. The kca score is renamed to MolpKd in the pairwise table.

In the pairwise table, a pPvalue is calculated for both the kcc and the kca score, and the larger of the two is taken as the final pPvalue for the model. The pPvalue is the -log10 of the probability that the compound belongs to the random decoys. A pPvalue of 1 indicates that the compound is comparable to the 90th-percentile decoy (top 10%). A pPvalue of 2 indicates that it is comparable to the 99th-percentile decoy (top 1%).
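
As an illustration of how such a pPvalue behaves, the sketch below estimates the probability empirically, as the fraction of decoys that score at least as high as the compound. The empirical estimate, the function names, and the floor of one decoy are assumptions of this sketch; the actual calculation used for the pairwise table is not described here.

```python
import math


def p_pvalue(score, decoy_scores):
    """-log10 of the fraction of decoys scoring at least as high as `score`."""
    n_at_or_above = sum(1 for d in decoy_scores if d >= score)
    # Count at least one decoy (an assumption) so the result stays finite
    # when the compound outscores every decoy.
    p = max(n_at_or_above, 1) / len(decoy_scores)
    return -math.log10(p)


def final_p_pvalue(class_score, class_decoys, regr_score, regr_decoys):
    """The larger of the classification and regression pPvalues."""
    return max(p_pvalue(class_score, class_decoys),
               p_pvalue(regr_score, regr_decoys))
```

With 100 decoys, a compound that is matched or beaten by 10 of them gets a pPvalue of 1, and a compound matched only by the single best decoy gets a pPvalue of 2.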

15.3.2 Docking to Ligand Field Z-Score (dfz) models

APF

Dataset

If data are available from ChEMBL18: IC50, EC50, Ki, and Kd data were downloaded from ChEMBL18, combined, and converted to pKd values. Compounds with pKd > 5 (i.e. a potency better than 10 micromolar) were classified as positives. 80% of the ChEMBL18 compounds were assigned to the training set, and the remaining 20% to the external test set. A subset of compounds with pKd > 7 against any target in ChEMBL18 was added to the training set as decoys. The approved drugs were added to the external test set as decoys.

If ChEMBL data are not available, approved drugs were used as decoys to calculate the Z-Score.

Training

The dfz model was trained in the following way:

1. For any target, all of its associated mammalian pocketome entries were used.
2. If ChEMBL data were not available, the pocketome entry with the highest number of co-crystallized ligands was used.
3. If a ChEMBL training set was available, it was docked to each pocketome entry using the APF method.
4. The best combination of template clusters was selected to maximize differentiation between actives and decoys.
5. The Z-Score was calculated using the mean and standard deviation of the APF scores of the approved drugs.

Predicted Value

The dfz model returns one score:

The dfz score is the Z-Score. A score of 1 means the compound scores 1 standard deviation above the mean score of the approved-drug decoys. The dfz score is renamed to MolZScore in the pairwise table.
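
The Z-Score definition can be written out as a short Python sketch. The function name is hypothetical, the use of the sample standard deviation is an assumption, and the scores are assumed to be oriented so that higher is better, as the description above implies.

```python
import statistics


def z_score(apf_score, approved_drug_scores):
    """Number of standard deviations `apf_score` lies above the approved-drug mean."""
    mean = statistics.mean(approved_drug_scores)
    # Sample standard deviation; whether the sample or population form is
    # used by the dfz model is not stated, so this is an assumption.
    sd = statistics.stdev(approved_drug_scores)
    return (apf_score - mean) / sd
```

A compound whose APF score equals the approved-drug mean gets a Z-Score of 0, and one that lies exactly one standard deviation above the mean gets a Z-Score of 1.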

In the pairwise table, a pPvalue is calculated for the Z-Score. The pPvalue is the -log10 of the probability that the compound belongs to the random decoys. A pPvalue of 1 indicates that the compound is comparable to the 90th-percentile decoy (top 10%). A pPvalue of 2 indicates that it is comparable to the 99th-percentile decoy (top 1%).

15.3.3 Docking to Ligand Field Classification/Regression (dfa) models

Hybrid 4D/2D

Dataset

Training

The dfa model is trained in the following steps:

1. Either:
   a. The training set was docked to the 4D maps of all the pocketome entries associated with the target; or
   b. If no pocketome entry was available, the training set compounds were aligned in 3D.
2. The training set compounds were combined with the pocketome co-crystallized ligands (if available) and clustered in APF.
3. A subset of ligands was selected from each cluster as the APF template.
4. All training set compounds were docked to all clusters using the APF method.
5. The best combination of clusters was selected to maximize recognition of actives from decoys.
6. For each selected cluster, a pKd regression model was trained using the 3D poses of the ligands above a certain APF score cutoff.

Predicted Value

Any compound is predicted using the dfa models in the following way:

1. The compound is docked using the APF method to each of the template clusters.
2. The compound is assigned to the cluster that gives the highest normalized APF score.
3. The 3D regression model of that cluster is then used to predict the pKd value of the compound, if the APF score is within the score cutoff.
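
The control flow of this prediction can be sketched in Python as follows. Everything here is hypothetical scaffolding: `TemplateCluster`, `dock`, and `predict_pkd` stand in for the real docking and regression steps, and "within the score cutoff" is assumed to mean a normalized APF score at or above the cluster's cutoff, with higher scores being better.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class TemplateCluster:
    name: str
    score_cutoff: float                  # assumed: minimum normalized APF score
    dock: Callable[[str], float]         # hypothetical: compound -> normalized APF score
    predict_pkd: Callable[[str], float]  # hypothetical: compound -> predicted pKd


def predict_dfa(compound: str,
                clusters: Sequence[TemplateCluster]) -> Tuple[str, Optional[float]]:
    """Assign the compound to its best-scoring cluster and predict pKd if allowed."""
    scored = [(cluster.dock(compound), cluster) for cluster in clusters]
    best_score, best_cluster = max(scored, key=lambda pair: pair[0])
    if best_score >= best_cluster.score_cutoff:
        return best_cluster.name, best_cluster.predict_pkd(compound)
    # Outside the cutoff: no pKd is predicted for this compound.
    return best_cluster.name, None
```

Returning `None` when the score falls outside the cutoff is a design choice of the sketch; how the actual model reports such compounds is not specified in this section.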

The dfa model returns two scores:

The dfc score is the classification score. Positives have a median dfc score of 1; decoys have a median dfc score of 0. The dfc score is renamed to MolClass in the pairwise table.

The dfa score is the regression score, i.e. a predicted pKd value. A dfa score of 6 indicates micromolar activity. The dfa score is renamed to MolpKd in the pairwise table.

In the pairwise table, a pPvalue is calculated for both the dfc and the dfa score, and the larger of the two is taken as the final pPvalue for the model. The pPvalue is the -log10 of the probability that the compound belongs to the random decoys. A pPvalue of 1 indicates that the compound is comparable to the 90th-percentile decoy (top 10%). A pPvalue of 2 indicates that it is comparable to the 99th-percentile decoy (top 1%).

15.3.4 Docking to Protein Pocket Classification/Regression (dpc) models

Hybrid Docking

Dataset

Training

The dpc model is trained in the following steps:

1. The pocketome entry with the most co-crystallized ligands associated with the target was selected.
2. The residues around the pocket were clustered, and selected representative PDBs were retained.
3. All training set compounds were docked to the 4D maps of the pocket in the presence of the co-crystallized ligands in the form of an APF template.
4. A score cutoff was selected to maximize sensitivity and accuracy.
5. All compounds within that score cutoff were used to train a pKd prediction model using their 3D poses.
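
The cutoff selection step can be illustrated with a small Python sketch. How sensitivity and accuracy are combined is not stated above, so the sketch simply maximizes their sum; that objective, the function name, and the convention that higher docking scores are better are all assumptions.

```python
def choose_cutoff(scores, labels):
    """Pick the score cutoff maximizing sensitivity + accuracy (assumed objective).

    `scores` are docking scores (assumed: higher is better) and `labels` are
    1 for actives and 0 for decoys; at least one active is required.
    """
    n = len(scores)
    n_pos = sum(labels)
    best_cutoff, best_value = None, -1.0
    for cutoff in sorted(set(scores)):
        predicted = [s >= cutoff for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y)
        tn = sum(1 for p, y in zip(predicted, labels) if not p and not y)
        value = tp / n_pos + (tp + tn) / n  # sensitivity + accuracy
        if value > best_value:
            best_cutoff, best_value = cutoff, value
    return best_cutoff
```

For scores `[0.1, 0.2, 0.8, 0.9]` with labels `[0, 0, 1, 1]`, the cutoff 0.8 separates actives from decoys perfectly and is the one returned.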

Predicted Value

Any compound is predicted using the dpc models in the following way:

1. The compound is docked to the 4D maps of the pocket in the presence of the APF template.
2. The 3D regression model of the pocket is then used to predict the pKd value of the compound, if the docking score is within the score cutoff.

The dpc model returns two scores:

The dpc score is the classification score. Positives have a median dpc score of 1; decoys have a median dpc score of 0. The dpc score is renamed to MolClass in the pairwise table.

The dpa score is the regression score, i.e. a predicted pKd value. A dpa score of 6 indicates micromolar activity. The dpa score is renamed to MolpKd in the pairwise table.

In the pairwise table, a pPvalue is calculated for both the dpc and the dpa score, and the larger of the two is taken as the final pPvalue for the model. The pPvalue is the -log10 of the probability that the compound belongs to the random decoys. A pPvalue of 1 indicates that the compound is comparable to the 90th-percentile decoy (top 10%). A pPvalue of 2 indicates that it is comparable to the 99th-percentile decoy (top 1%).

15.3.5 Neural Network Chemical Classification (ncc) models

Dataset

IC50, EC50, Ki, and Kd data were downloaded from ChEMBL24, combined, and converted to pKd values. Compounds with pKd > 5 (i.e. a potency better than 10 micromolar) were classified as positives. 75% of the ChEMBL24 compounds were assigned to the training set, and the remaining 25% to the external test set. The total number of GPCR compounds was 104911 in the training set and 44234 in the external test set. A subset of 35008 compounds with pKd > 7 against any target in ChEMBL24 was added to the training set as decoys. 2286 approved drugs were added to the external test set as decoys.

Training

All the training compounds were combined, and each was assigned a vector of 1/0 values against all targets: 1 if active, 0 if inactive or unknown. A Neural Network Chemical Classification (ncc) model was trained on the training set compounds. The performance of the model was evaluated on the external test set: for each target, only compounds with a known pKd against it, or approved-drug decoys, were used for evaluation.
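
The multi-target labelling and the per-target evaluation set can be sketched in Python as below. The data layout (a dictionary of measured pKd values per compound) and the function names are hypothetical; the activity threshold of pKd > 5 is taken from the Dataset description.

```python
def build_label_matrix(activities, targets, threshold=5.0):
    """Map each compound to a 1/0 vector over `targets`.

    `activities` is {compound: {target: pKd}}. A missing target (unknown
    activity) is labelled 0, the same as a measured inactive.
    """
    return {
        compound: [1 if measured.get(t, float("-inf")) > threshold else 0
                   for t in targets]
        for compound, measured in activities.items()
    }


def evaluation_set(activities, target, decoys):
    """Compounds with a known pKd against `target`, plus approved-drug decoys."""
    known = [c for c, measured in activities.items() if target in measured]
    return known + [d for d in decoys if d not in known]
```

Treating unknowns as 0 during training but excluding them during evaluation means a compound never counts as a false positive on a target where it was simply never measured.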

Predicted Value

The ncc model returns one of two scores, depending on the l_probabilityscore setting:

Default: l_probabilityscore = yes
The ncc score is a probability score. An ncc score of 0.5 indicates a 50% chance that the compound is a binder. A random compound is assumed to have a 0.5% chance of hitting a target. The ncc score is renamed to MolClass in the pairwise table.

If l_probabilityscore = no
The ncc score is the percentile score relative to random compounds:
Score of 1: the compound's score is comparable to the 90th percentile of random compounds (top 10%)
Score of 2: 99th percentile (top 1%)
Score of 3: 99.9th percentile (top 0.1%)
This score is renamed to pPvalue in the pairwise table.