Epik Classic Tautomer Database Format
The default tautomer database is not accessible to users. However, you can provide your own file to either completely override or add patterns to the default tautomer collection.
At the top level of the tautomer database file the following four items can be present: name, clear_standard, group_def, and tautomer_set. These items are described in the following sections. Lines beginning with a # are comment lines and are ignored when interpreting the contents of the tautomer database file. Blank lines are also ignored.
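As an illustration only, the comment and blank-line rule can be mimicked in a few lines of Python when pre-checking a custom file. The function name significant_lines is hypothetical and is not part of Epik Classic. The sketch also assumes that leading whitespace before # is tolerated, which this section does not state.

```python
def significant_lines(text: str) -> list[str]:
    """Return the lines that carry data: no blank lines, no '#' comment lines."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()  # assumption: indentation is not significant
        if not stripped or stripped.startswith("#"):
            continue
        kept.append(stripped)
    return kept
```

Running the function over a file before any other check keeps later parsing steps free of comment handling.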
name Data Item
name specifies the name of the solvent. For example:
name: water
water and DMSO are standard names for which the tautomerizer already has information. Currently, the DMSO tautomer information is just a duplicate of that for water.
clear_standard Directive
By default, information in a custom tautomer database file is added to any existing information already available for the solvent specified. Including clear_standard: in a tautomer database file clears any values for this solvent accumulated before the current file was read.
group_def Data Structure
The tautomerization facility does not support recursive SMARTS. However, a mechanism that supports some of the functionality of recursive SMARTS is provided by the group_def data structure. This data structure permits you to define variables that correspond to SMARTS patterns. The variables may be reused in groups and tautomer_sets that appear later in the tautomer database file.
Each group_def contains two items:
- name: An arbitrary name for the group, used to reference it.
- pattern: The SMARTS pattern for the group. This pattern may refer to previously defined groups using $groupname.
Below are some examples of group_def data structures:
group_def{
name: Halogens
pattern: [F,Cl,Br,I]
}
group_def{
name: Amides
pattern: [CX3](=[OX1])-[NX3]
}
group_def{
name: Carbonyls
pattern: [CX3](=[OX1])
}
group_def{
name: Carbonyls_only
pattern: [$Carbonyls;!$Amides]
}
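Because a pattern may only refer to groups defined earlier in the file, the order of group_def blocks matters. The following Python sketch, which is not part of Epik Classic, scans the group_def blocks of a file in order and reports references to groups that have not been defined yet. The function name, the regular expressions, and the assumption that group names consist of letters, digits, and underscores are illustrative choices. Comment lines inside a block are not handled.

```python
import re

# One group_def block with a name line followed by a single-line pattern.
GROUP_DEF = re.compile(r"group_def\s*\{\s*name:\s*(\S+)\s*pattern:\s*(\S+)\s*\}")
# A $groupname reference; assumes names are letters, digits, and underscores.
GROUP_REF = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def undefined_group_refs(text: str) -> list[tuple[str, str]]:
    """Return (group, reference) pairs where the reference is not yet defined."""
    defined: set[str] = set()
    problems: list[tuple[str, str]] = []
    for name, pattern in GROUP_DEF.findall(text):
        for ref in GROUP_REF.findall(pattern):
            if ref not in defined:
                problems.append((name, ref))
        defined.add(name)  # a group becomes usable only after its own block
    return problems


# Carbonyls_only is placed too early here, so both of its references are flagged.
example = """
group_def{
name: Carbonyls_only
pattern: [$Carbonyls;!$Amides]
}
group_def{
name: Carbonyls
pattern: [CX3](=[OX1])
}
"""
print(undefined_group_refs(example))
```

With the four group_def blocks in the order shown in the examples above, the function returns an empty list.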
tautomer_set Data Structure
tautomer_set data structures define sets of interconvertible tautomers. There are more than 150 tautomer sets available by default for water.
Some examples of tautomer_set data structures are given below, and the syntax for the data structures is described following these examples.
Note: The entry for pattern: values must be a single line. In the examples below, some of the pattern: text wraps to the next line due to formatting constraints within this manual. When creating tautomer data structure files in a text editor, ensure that text-wrapping is turned off, or that margins are set wide enough to accommodate single-line entry for this value.
tautomer_set{
name: single-sided_ket-enol
# From: Handbook of organic chemistry
tautomer{
name: enol
pattern: [CX3](-[#1,$Sub_aC])(-[#1,$Sub_aC])=[CX3](-[#1,$Sub_carbonyl_C])-[OX2]-[#1]
probability: 0.00005
}
tautomer{
name: ket
pattern: [CX4](-[#1])(-[#1,$Sub_aC])(-[#1,$Sub_aC])-[CX3](-[#1,$Sub_carbonyl_C])=[OX1]
probability: 0.99995
}
}
tautomer_set{
name: double-sided_ket-enol
# From: Handbook of organic chemistry
tautomer{
name: 1enol
pattern: [CX3](-[#1,$Sub_aC])(-[#1,$Sub_aC])=[CX3](-[CX4](-[#1])(-[#1,$Sub_aC])(-[#1,$Sub_aC]))-[OX2]-[#1]
probability: 0.00000001
}
tautomer{
name: ket
pattern: [CX4](-[#1,$Sub_aC])(-[#1,$Sub_aC])(-[#1])-[CX3](-[CX4](-[#1])(-[#1,$Sub_aC])(-[#1,$Sub_aC]))=[OX1]
probability: 0.99999998
}
tautomer{
name: 2enol
pattern: [CX4](-[#1,$Sub_aC])(-[#1,$Sub_aC])(-[#1])-[CX3](=[CX3](-[#1,$Sub_aC])(-[#1,$Sub_aC]))-[OX2]-[#1]
probability: 0.00000001
}
}
tautomer_set{
name: imidazole
tautomer{
name: form1
pattern: c1(~[#1,$Sub_c])n(-[#1,$Sub_n])-c(-[#1,$Sub_c])=[nX2]c1(~[#1,$Sub_c])
probability: 0.50
}
tautomer{
name: form2
pattern: c1(~[#1,$Sub_c])[nX2]=c(-[#1,$Sub_c])-n(-[#1,$Sub_n])c1(~[#1,$Sub_c])
probability: 0.50
}
}
Each tautomer set contains a name: designator and a number of tautomer structures. The name: designator is followed by a space and a contiguous non-blank label that identifies the class of tautomers described by the set. The label does not affect processing. The examples above contain three tautomer sets: single-sided_ket-enol, double-sided_ket-enol, and imidazole.
The tautomer structure describes the properties of one tautomeric form. There are three designators that may be used within a tautomer structure: name:, probability:, and pattern:.
The name: designator provides a label for the tautomer but does not otherwise affect processing.
The probability: designator is used to assign a probability or fractional population of this tautomer within this tautomeric set. In many cases, reliable information on the probability of various tautomeric forms is not available and the values entered in the database are simply educated guesses.
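The examples in this section all use probabilities that add up to 1 within a tautomer set. Assuming that this is the intended convention (the section does not state it as a requirement), a custom file can be sanity-checked with a short Python sketch such as the following. The function names and the tolerance are hypothetical, and the parser only understands the one-item-per-line layout shown in the examples.

```python
import math


def probability_sums(text: str) -> dict[str, float]:
    """Map each tautomer_set name to the sum of its tautomers' probabilities."""
    sums: dict[str, float] = {}
    current_set = None
    depth = 0  # 0 = top level, 1 = inside tautomer_set, 2 = inside tautomer
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no data
        if line.startswith("tautomer_set{"):
            depth, current_set = 1, None
        elif line.startswith("tautomer{"):
            depth = 2
        elif line == "}":
            depth = max(depth - 1, 0)
        elif line.startswith("name:") and depth == 1:
            current_set = line.split(":", 1)[1].strip()
            sums[current_set] = 0.0
        elif line.startswith("probability:") and depth == 2 and current_set is not None:
            sums[current_set] += float(line.split(":", 1)[1])
    return sums


def unnormalized_sets(text: str, tol: float = 1e-6) -> dict[str, float]:
    """Return the sets whose probabilities do not sum to 1 within tol."""
    return {
        name: total
        for name, total in probability_sums(text).items()
        if not math.isclose(total, 1.0, abs_tol=tol)
    }
```

An empty result from unnormalized_sets means every set in the file sums to 1 within the chosen tolerance. It says nothing about whether the individual values are chemically reasonable.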
The pattern: designator is followed by a contiguous SMARTS-like pattern. One difference from a normal SMARTS pattern is that explicit single “-” and double “=” bond designators are used to make the corresponding Lewis structures clear. In addition, these patterns may include references to previously defined groups via the $groupname mechanism. Information on SMARTS patterns is provided on the web page: http://www.daylight.com/learn. The SMARTS-like pattern is used to detect the corresponding groups in the input structures and to let the tautomerization facility determine how the bonding patterns (Lewis structures) differ between tautomers so that they may be interconverted. For heavy atoms that are expected to carry a formal charge, it is advisable to include the charge in the SMARTS pattern. To ensure that the SMARTS patterns are properly interpreted by Epik Classic, the following restrictions apply:
- The SMARTS patterns for all tautomers within a tautomer set include the same list of non-hydrogen atoms in the same order.
- All SMARTS patterns must explicitly designate the hydrogens that shift positions in any tautomer within a tautomer set with a -[#1] pattern.
- All SMARTS patterns within a tautomer set must contain the same number of explicitly designated mobile hydrogen atoms.
- In both non-aromatic and aromatic portions of the SMARTS pattern, bond orders that change between single and double in any tautomer must be explicitly specified in the SMARTS patterns for all tautomers in a tautomer set.
- In portions of molecules that must be represented by aromatic atom types (e.g., c and n), only changes in the bond orders of bonds involving n atoms in the corresponding Lewis structures are supported. If such a bond changes order in any tautomer in a tautomer set, it must be represented as ‘:’ in all the tautomers. See the guanosine tautomer set in the example above.
- Recursive SMARTS patterns are not supported.
- SMARTS patterns within the same tautomer set must all specify the same overall formal charge.
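Two of these restrictions lend themselves to a rough textual pre-check: patterns must be contiguous, and all patterns in a set must contain the same number of explicit -[#1] hydrogens. The Python sketch below is a hypothetical helper, not Epik Classic functionality. It counts literal -[#1] substrings, so alternatives such as [#1,$Sub_aC] are deliberately not counted, and it does not parse SMARTS or verify the heavy-atom order.

```python
def mobile_hydrogen_count(pattern: str) -> int:
    """Count explicit '-[#1]' hydrogens; '-[#1,$Group]' alternatives do not match."""
    return pattern.count("-[#1]")


def is_contiguous(pattern: str) -> bool:
    """A pattern must be a single run of non-blank characters on one line."""
    return bool(pattern) and not any(ch.isspace() for ch in pattern)


def same_mobile_hydrogen_count(patterns: list[str]) -> bool:
    """True if every pattern has the same number of explicit '-[#1]' hydrogens."""
    return len({mobile_hydrogen_count(p) for p in patterns}) <= 1


# Patterns copied from the single-sided_ket-enol example above.
enol = "[CX3](-[#1,$Sub_aC])(-[#1,$Sub_aC])=[CX3](-[#1,$Sub_carbonyl_C])-[OX2]-[#1]"
ket = "[CX4](-[#1])(-[#1,$Sub_aC])(-[#1,$Sub_aC])-[CX3](-[#1,$Sub_carbonyl_C])=[OX1]"

print(mobile_hydrogen_count(enol), mobile_hydrogen_count(ket))
print(same_mobile_hydrogen_count([enol, ket]))
```

A mismatch reported by same_mobile_hydrogen_count points to a pattern worth inspecting by hand. A match does not guarantee that the set satisfies the remaining restrictions.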
The database provided with this release contains templates for keto-enol tautomers and their sulfur analogues, imine-enamine tautomers, histidine-like tautomers, tautomers of DNA and RNA bases, and a large number of common heteroaromatic rings containing C, S, O, and N.