Applying Empirical Corrections for pKa Calculations
By default, raw pKa values are converted into micro-pKas with a machine learning approach that uses atomic graph convolutional neural networks (GCNN).
You can instead use a linear empirical correction, which we refer to as the Jaccard similarity method, by setting the use_ml keyword.
The Jaccard similarity method
An empirical correction is applied to the raw pKa calculated for each tautomer to obtain a micro-pKa with the form:
$$ \mathrm{p}K_\mathrm{a} = a \, \mathrm{p}K_\mathrm{a}^{\mathrm{raw}} + b \qquad (1) $$
We generated a set of reference data during the development of this empirical correction (see Theory of Empirical Corrections for pKa Calculations). Every data point in the set has a raw pKa value and an experimental pKa value.
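As a minimal sketch of how a fixed linear correction could be derived from such a reference set, the snippet below fits experimental pKa as a linear function of raw pKa with ordinary least squares and applies the result to a new raw value. The arrays are synthetic placeholders constructed to lie on a line, not real reference data, and the names `raw_pka`, `exp_pka`, and `corrected_pka` are hypothetical. The fit direction (experimental as a function of raw) is an assumption chosen so that the fitted slope and intercept play the roles of a and b in the correction.

```python
import numpy as np

# Synthetic stand-in for the reference set. These are NOT real data:
# the "experimental" values are constructed to lie exactly on a line.
raw_pka = np.array([2.0, 6.0, 10.0, 14.0])
exp_pka = 0.5 * raw_pka + 2.0

# Ordinary least-squares fit of exp_pka = a * raw_pka + b.
# np.polyfit returns coefficients from highest degree to lowest.
a, b = np.polyfit(raw_pka, exp_pka, deg=1)


def corrected_pka(raw: float) -> float:
    """Apply the linear empirical correction to a raw pKa value."""
    return a * raw + b


print(f"a = {a:.3f}, b = {b:.3f}")
print(f"raw 8.0 -> corrected {corrected_pka(8.0):.3f}")
```

Because a and b here come from the whole reference set, the same two numbers would be applied to every molecule.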
Instead of applying a fixed correction, we can employ a dynamic fitting procedure (pKa-on-the-fly), which fits only to reference data that are chemically similar to the molecule in question. The procedure can be described in the following steps:
1. Identify "fragments" in the molecule centered on the protonation/deprotonation site, using a molecular graph that extends up to six bonds away from the site. Fragments are generated for all paths in the graph of length 1 through 6, creating a pool of fragments that describes the local and semi-local chemical environment around the site.
2. Compare the pool of fragments generated for the input molecule with the pools generated for the molecules in the reference data, using Jaccard similarity. This identifies reference molecules whose chemical environment around the protonation/deprotonation site matches that of the input molecule. A modified Jaccard similarity is used, in which two fragments are equivalent if:
   - The fragments have the same length (i.e. are at the same depth in the molecular graph)
   - Every pair of atomic elements (taken in order) is identical in both fragments
   - Every pair of bond lengths (taken in order) is equal to within a tolerance T (set to 0.05 Å)
3. Sort the reference data set by similarity score and take the top 5 matches.
4. Weight the subset in favor of chemically important fragments. The weights are used in a weighted linear least-squares fit to generate the fitting parameters a and b.
5. If the fit is good according to predefined criteria based on both fragment match and fit quality, add the next match down the list to the subset of matches and repeat step 4. This cycle continues until the RMSD of the fit no longer decreases when matched points are added. If the fit is not good, a fixed (generic) fitting correction is applied instead, where a and b are the fitting parameters from a linear least-squares fit of the raw pKa values against the experimental pKa values using all the data in the set. These parameters are fixed, and are applied to every molecule that uses generic fitting as the method for empirical correction.
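The fragment comparison, ranking, and weighted fit described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the actual implementation: the `Fragment` representation, all function names, the greedy one-to-one pairing used to count shared fragments, and the use of caller-supplied weights are hypothetical choices. How fragments are weighted by chemical importance and how fit quality is judged are not specified here, so the sketch stops at the weighted fit.

```python
from typing import List, Sequence, Tuple

import numpy as np

# Hypothetical fragment representation: a path starting at the site,
# stored as (elements along the path, bond lengths along the path in Å).
Fragment = Tuple[Tuple[str, ...], Tuple[float, ...]]

TOL = 0.05  # bond-length tolerance T in Å, as stated in the text


def fragments_match(f: Fragment, g: Fragment, tol: float = TOL) -> bool:
    """Apply the three equivalence rules for a pair of fragments."""
    elems_f, bonds_f = f
    elems_g, bonds_g = g
    if len(bonds_f) != len(bonds_g):  # same path length
        return False
    if elems_f != elems_g:  # same elements, in order
        return False
    # same bond lengths, in order, to within the tolerance
    return all(abs(x - y) <= tol for x, y in zip(bonds_f, bonds_g))


def modified_jaccard(pool_a: Sequence[Fragment],
                     pool_b: Sequence[Fragment],
                     tol: float = TOL) -> float:
    """Shared / union fragment count, with tolerance-based equality.

    Assumption: shared fragments are counted by greedy one-to-one pairing.
    """
    unmatched_b = list(pool_b)
    shared = 0
    for f in pool_a:
        for i, g in enumerate(unmatched_b):
            if fragments_match(f, g, tol):
                shared += 1
                del unmatched_b[i]
                break
    union = len(pool_a) + len(pool_b) - shared
    return shared / union if union else 0.0


def top_matches(query_pool: Sequence[Fragment],
                reference: List[Tuple[Sequence[Fragment], float, float]],
                n: int = 5) -> List[Tuple[float, float, float]]:
    """Rank reference entries (pool, raw_pka, exp_pka) by similarity.

    Returns the best n as (score, raw_pka, exp_pka) tuples.
    """
    scored = [(modified_jaccard(query_pool, pool), raw, exp)
              for pool, raw, exp in reference]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:n]


def weighted_linear_fit(raw: Sequence[float],
                        exp: Sequence[float],
                        weights: Sequence[float]) -> Tuple[float, float]:
    """Weighted least squares for exp = a * raw + b.

    np.polyfit applies its weights to the unsquared residuals, so the
    square root is passed to minimize sum(weights * residual**2).
    """
    a, b = np.polyfit(raw, exp, deg=1, w=np.sqrt(weights))
    return float(a), float(b)
```

A caller would build the fragment pool for the query site, call `top_matches` to select the five most similar reference entries, and pass their raw and experimental values with a chosen weight vector to `weighted_linear_fit` to obtain a and b.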