AutoQSAR Best Practices

In these guidelines, "data" refers to the endpoint used in training models.

  • Be consistent in ligand preparation across training and prediction molecules, e.g. if you use neutral ligands in training, you must neutralize ligands in prediction too.

    • Consistent preparation normalizes chemotypes and the treatment of tautomeric forms.

  • Ideally, models should be created from a set of at least 50 compounds, and validation of models on a separate set of compounds is strongly recommended.

  • For data sets with more than 5000 molecules, Deep AutoQSAR is recommended.

  • Duplicate compounds should be removed so that each compound contributes only one data point.

  • Potency data should be provided on a scale that is proportional to the associated binding free energy change.

    • Convert to a logarithmic scale if necessary, e.g. IC50 to pIC50. Recent AutoQSAR releases recognize this case and perform the conversion automatically.

    • In the case of pIC50 or pKi, it is desirable for the learning set to span at least 3 log units.

  • Data should ideally be in molar units rather than weight units (e.g. µM rather than µg/mL).

  • Avoid heterogeneous data if possible, e.g. data combined from different species or protocols.

  • For continuous model fitting, the data is expected to be roughly normally distributed. If the distribution is bifurcated (two distinct populations), it may be more appropriate to build two continuous models plus a categorical model that decides which continuous model to apply.

  • For categorical model fitting, data sets should be balanced in terms of bin populations. Avoid building models on data such as HTS results, where 99.9% of compounds are 'dead' and 0.1% are 'active'. Such models tend to show very high accuracy simply by predicting all compounds to be 'dead'. Instead, randomly select data so that the bin populations are within an order of magnitude of one another.

  • When building a categorical model based on cutoffs (e.g., active if pKi > 6), it is recommended to discard compounds that fall in some gray area (e.g., active if pKi > 6, inactive if pKi < 5, discarded if 5 ≤ pKi ≤ 6).

  • Avoid including chiral compounds. This is particularly important if differences in chirality lead to large differences in activity (e.g., more than a factor of 10 difference in IC50 or Ki between R and S forms). If it's not feasible to exclude all chiral compounds, then here are some considerations:

    • Radial fingerprints (and only radial fingerprints) are sensitive to chirality, although only by way of turning off/on a relatively small number of arbitrary bits in the R and S forms, so there's really no true encoding of spatial information. The differences in the fingerprints essentially serve the purpose of not yielding a similarity of 1.0 for the R and S forms.

    • It's possible that radial fingerprints may provide some ability to model the differences in activity due to chirality, and this would typically be evidenced by a disproportionate number of radial fingerprint models in the top 10. External validation is even more important in such cases to ensure that the models aren't being trained against the peculiarities in the way radial fingerprints are calculated.

    • If activities are fairly insensitive to differences in chirality, then it may not be detrimental to the model to include both enantiomers, or to simply include one enantiomer from each pair.

    • If the activities are from a racemic mixture, be very cautious. While you could include both the R and S forms and use the racemic activity for both, you generally have no idea how potent one form is relative to the other. It could be a factor of 2, it could be a factor of 10, or it could be a factor of 100. So you're really just including activities for which the experimental uncertainty is completely unbounded.
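
The guidelines above on potency scale, duplicates, and activity range can be checked before any model is built. The following is a minimal Python sketch, not part of AutoQSAR itself. The function names, the unit table, and the choice of the median to merge replicate measurements are all assumptions made for illustration. Compound identifiers are assumed to already be consistent (for example, derived from identically prepared structures).

```python
import math
from statistics import median

# Hypothetical unit table: factor that converts a concentration to mol/L.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}


def to_pic50(ic50: float, unit: str = "nM") -> float:
    """Convert an IC50 in a molar unit to pIC50 = -log10(IC50 in mol/L)."""
    if ic50 <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50 * UNIT_TO_MOLAR[unit])


def merge_duplicates(records):
    """Collapse (compound_id, pIC50) pairs to one value per compound.

    Using the median of replicates is an assumption of this sketch; the
    guideline only asks for a single data point per compound.
    """
    grouped = {}
    for compound_id, value in records:
        grouped.setdefault(compound_id, []).append(value)
    return {cid: median(values) for cid, values in grouped.items()}


def spans_enough(values, min_span: float = 3.0) -> bool:
    """Return True if the activities cover at least `min_span` log units."""
    values = list(values)
    return max(values) - min(values) >= min_span
```

A typical use would be to convert every raw IC50 with `to_pic50`, pass the resulting pairs through `merge_duplicates`, and then call `spans_enough` on the merged values to see whether the learning set covers the suggested 3 log units.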
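
For categorical models, the gray-area and bin-balance recommendations above can be sketched as follows. This is an illustrative Python example rather than AutoQSAR functionality. The function names, the default cutoffs (taken from the pKi example in the text), and the interpretation of "within an order of magnitude" as a maximum 10:1 ratio relative to the smallest bin are assumptions.

```python
import random
from collections import defaultdict


def label_with_gray_zone(pki, active_above=6.0, inactive_below=5.0):
    """Return 'active', 'inactive', or None for compounds in the gray area."""
    if pki > active_above:
        return "active"
    if pki < inactive_below:
        return "inactive"
    return None  # inactive_below <= pKi <= active_above: discard


def balance_bins(labeled, max_ratio=10.0, seed=0):
    """Randomly downsample over-populated bins.

    `labeled` is a list of (compound_id, label) pairs. After sampling, no
    bin holds more than `max_ratio` times the size of the smallest bin.
    """
    bins = defaultdict(list)
    for compound_id, label in labeled:
        bins[label].append(compound_id)
    if not bins:
        return []
    cap = int(max_ratio * min(len(members) for members in bins.values()))
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    balanced = []
    for label, members in bins.items():
        kept = members if len(members) <= cap else rng.sample(members, cap)
        balanced.extend((cid, label) for cid in kept)
    return balanced
```

Compounds for which `label_with_gray_zone` returns `None` would be dropped before calling `balance_bins`. Random downsampling discards information from the larger bin, so the held-out validation set should still be evaluated with the class imbalance in mind.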