combinatorial_diversity Command Help
Command: $SCHRODINGER/utilities/combinatorial_diversity
usage: combinatorial_diversity [-h] [-min_pop <m>] [-ndim <n>] [-rand <seed>]
                               [-nocopy] [-nofail] [-maxopen <n> | -nosplit]
                               [-products <p>] [-inflate <factor>]
                               [-fptype {dendritic,linear,molprint2D,radial}]
                               [-savefp] [-onlyfp] [-out <outfile>] [-no3d]
                               [-v3000] [-verbose] [-filter <file>]
                               [-list_props] [-hba <file>] [-hbd <file>]
                               [-NJOBS NJOBS] [-HOST <hostname>]
                               [-TMPDIR TMPDIR] [-JOBNAME JOBNAME]
                               <infile> <ndiverse>
Performs diverse structure selection with optional biasing of properties to
lie in specified ranges. Runs in combinatorial mode, where diverse structures
are selected after enumerating a minimum number of diverse products, or in
conventional mode, where diverse structures are selected directly from a file.
Copyright Schrodinger LLC, All Rights Reserved.
positional arguments:
<infile> Source of input structures. May be a combinatorial
synthetic route file (.json), a 32-bit Canvas
fingerprint file with SMILES and properties (.fp), a
CSV file with SMILES, titles and properties (.csv) or
a SMILES file (.smi).
<ndiverse> The number of diverse structures to select. Linear
scaling and distributed processing are achieved by
splitting chemical space into 2**N distinct regions
(where N is determined by -min_pop) and selecting the
appropriate number of diverse structures from each
region. To ensure speedy selection and high diversity,
it is strongly recommended that <ndiverse> be no
larger than 5% of the total pool from which selections
are to be made.
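The 5% recommendation above is simple arithmetic, and can be checked before a job is submitted. The following is a minimal sketch, not part of the utility; the function names are hypothetical and the pool size is assumed to be known in advance.

```python
# Sketch of the 5% guideline stated for <ndiverse>.
# The helper names are illustrative and are not provided by the utility.

RECOMMENDED_MAX_PERCENT = 5  # "no larger than 5% of the total pool"


def max_recommended_ndiverse(pool_size):
    """Largest <ndiverse> that stays within 5% of the pool (integer math)."""
    return pool_size * RECOMMENDED_MAX_PERCENT // 100


def exceeds_five_percent_rule(ndiverse, pool_size):
    """True if <ndiverse> is more than 5% of the pool."""
    return ndiverse * 100 > pool_size * RECOMMENDED_MAX_PERCENT


if __name__ == "__main__":
    pool = 100_000
    print(max_recommended_ndiverse(pool))
    print(exceeds_five_percent_rule(10_000, pool))
```

A request that exceeds the guideline still runs; the help text only warns that selection may be slower and less diverse.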
options:
-h, --help Show this message and exit.
-min_pop <m> The minimum population of each distinct region of
chemical space. This option is normally used to speed
up a job that exceeds the 5% rule described under
<ndiverse>. For example, if selecting 10,000 diverse
structures from a pool of 100,000, reducing the
minimum population from 10,000 to 5,000 would
typically double the number of regions and halve total
selection time (default: 10,000).
-ndim <n> The number of dimensions in the chemical space from
which the distinct regions are defined. A maximum of
2**(n-1) regions are possible, so if n=10, up to 512
regions can be defined. This parameter would normally
be adjusted only when the pool of structures is so
large that the population of each region significantly
exceeds 10,000, even when splitting over the maximum
number of regions. A good rule of thumb is to use the
default value of 10 for a pool of up to 5 million, and
increase by 1 for each doubling of the pool size,
e.g., 10 million -> -ndim 11, 20 million -> -ndim 12,
etc.
-rand <seed> Integer random seed for initializing the diversity
algorithm. Results are always the same for a given
random seed (default: 1).
-nocopy Utilize <infile> at its specified location and do not
copy to the job directory. This option is most useful
for very large input fingerprint files, as it allows a
given diversity subjob to directly access the
fingerprint rows assigned to it, without the cost of
copying or physically splitting the fingerprint file.
The file name must be specified using an absolute
path, and that path must be accessible to all compute
nodes on the host where the job is to run.
-nofail Exit with an error if a fingerprint generation subjob
or diversity selection subjob fails to complete. The
default behavior is to issue a warning
to the log file but proceed with the partial results
from successfully completed subjobs.
-maxopen <n> When physically splitting an input fingerprint file or
an intermediate fingerprint file generated from the
input structures, allow no more than <n> output
fingerprint files to be open at any time. A larger
value of <n> results in faster splitting but greater
memory use (default: 256). Use -nosplit to disable
physical splitting.
-nosplit Do not physically split an input fingerprint file or
an intermediate fingerprint file generated from the
input structures. Similar to -nocopy, in that it
avoids the expense of splitting the fingerprint file,
and it allows each diversity subjob to directly access
its fingerprint rows. Differs from -nocopy, in that it
does not require an absolute path, but it does result
in the entire fingerprint file being copied to the job
directory of each diversity subjob. Mutually exclusive
with -maxopen.
-products <p> The minimum number of products that must be
successfully enumerated before selecting diverse
structures. Applies only to .json input. The default
is 20 times the number of diverse structures. This
option MUST be specified if the number of diverse
structures is greater than 50,000.
-inflate <factor> Product inflation factor. This value is multiplied by
the minimum number of products and supplied to
combinatorial_synthesis to ensure that an excess of
products is made. Applies only to .json input
(default: 1.25).
-fptype {dendritic,linear,molprint2D,radial}
The type of Canvas fingerprints to generate for .json,
.csv and .smi inputs (default: molprint2D).
-savefp Save generated fingerprints to <jobname>_<fptype>.fp.
A default set of physicochemical properties is saved
with the fingerprints for .json and .smi inputs if a
property filter is supplied (see -filter).
-onlyfp Save generated fingerprints and exit without selecting
diverse structures. This option is provided to allow
large fingerprint files to be moved to a cross-mounted
location and supplied in a subsequent job with the
-nocopy option.
-out <outfile> Output Maestro, SD, CSV or SMILES file for diverse
structures (default: <jobname>_diverse.csv).
-no3d Skip 3D coordinate generation for diverse structures.
-v3000 Write SD file structures in V3000 format.
-verbose Output details of diversity selection/property
biasing.
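The numeric relationships described under -min_pop, -ndim, -products and -inflate can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the utility's implementation: the help text does not say exactly how N is derived from -min_pop, so the doubling loop below is an assumption that merely reproduces the 10,000 vs. 5,000 example; rounding up between pool-size doublings and rounding the inflated product count up are also assumptions. All function names are hypothetical.

```python
import math


def estimate_regions(pool_size, min_pop=10_000, ndim=10):
    """ASSUMPTION: keep doubling the region count while each region would
    still hold at least min_pop structures, capped at 2**(ndim-1)."""
    max_regions = 2 ** (ndim - 1)  # upper bound stated under -ndim
    regions = 1
    while regions * 2 <= max_regions and pool_size // (regions * 2) >= min_pop:
        regions *= 2
    return regions


def suggest_ndim(pool_size, base_ndim=10, base_pool=5_000_000):
    """Rule of thumb from -ndim: 10 up to 5 million, +1 per doubling.
    ASSUMPTION: pool sizes between doublings round up to the next value."""
    ndim, limit = base_ndim, base_pool
    while pool_size > limit:
        limit *= 2
        ndim += 1
    return ndim


def enumeration_target(ndiverse, products=None, inflate=1.25):
    """Return (minimum products, inflated count) for .json input.
    ASSUMPTION: the inflated count is rounded up to an integer."""
    if products is None:
        if ndiverse > 50_000:
            # The help text says -products MUST be given in this case.
            raise ValueError("-products is required when <ndiverse> > 50,000")
        products = 20 * ndiverse  # documented default
    return products, math.ceil(products * inflate)


if __name__ == "__main__":
    print(estimate_regions(100_000, min_pop=10_000))
    print(estimate_regions(100_000, min_pop=5_000))
    print(suggest_ndim(20_000_000))
    print(enumeration_target(1_000))
```

Under these assumptions, halving -min_pop for a pool of 100,000 doubles the estimated region count, which matches the example given for -min_pop.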
Property Biasing Options:
-filter <file> CSV file containing one or more property filters, with
one filter per line. Each filter consists of the name
of a property, followed by the preferred minimum and
maximum values of that property, e.g., AlogP,2.0,5.0.
In the case of .json or .smi input, use of this option
triggers the creation of a set of default
physicochemical properties to which filters may be
applied. In the case of .fp or .csv input, filters may
be applied only to the numeric properties present in
those files. Use -list_props to see available
properties. Note that diverse structures are selected
with a bias toward satisfying as many filters as
possible, but not necessarily all filters. Note also
that a given property may appear in more than one
filter, so that multiple desired ranges are possible.
-list_props List the properties available for biasing. These are
the automatically calculated properties for .json and
.smi inputs, and the properties present in the file
for .fp and .csv inputs.
-hba <file> Use supplied rules to assign hydrogen bond acceptor
counts for .json and .smi input. Default rules are in
the file HbondAcceptor.typ in the Schrodinger software
installation.
-hbd <file> Use supplied rules to assign hydrogen bond donor
counts for .json and .smi input. Default rules are in
the file HbondDonor.typ in the Schrodinger software
installation.
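A filter file for -filter, as described above, is a plain CSV with one `name,min,max` filter per line. The sketch below writes such a file with Python's standard csv module. It assumes no header row (the help text mentions none) and reuses only the AlogP property from the example above; the function name and output file name are hypothetical. Run -list_props to confirm which property names are actually available for a given input.

```python
import csv


def write_filter_file(path, filters):
    """Write (property, minimum, maximum) tuples, one filter per line."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        for name, minimum, maximum in filters:
            if minimum > maximum:
                raise ValueError(f"{name}: minimum exceeds maximum")
            writer.writerow([name, minimum, maximum])


if __name__ == "__main__":
    # The same property may appear in several filters, giving
    # multiple preferred ranges.
    write_filter_file(
        "filters.csv",  # hypothetical file name
        [("AlogP", 2.0, 3.0), ("AlogP", 4.0, 5.0)],
    )
```

The resulting file would then be passed as `-filter filters.csv`.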
Standard Options:
-NJOBS NJOBS Divide the overall job into NJOBS subjobs.
Job Control Options:
-HOST <hostname> Run job remotely on the indicated host entry.
-TMPDIR TMPDIR The name of the directory used to store files
temporarily during a job.
-JOBNAME JOBNAME Provide an explicit name for the job.
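The two-step workflow suggested under -onlyfp and -nocopy (generate fingerprints once, move them to a cross-mounted location, then select from them in place) can be sketched by assembling the two command lines. This is an untested illustration that uses only flags listed in this help text. The input file, job names and shared directory are hypothetical, and it assumes that -onlyfp writes the fingerprint file using the `<jobname>_<fptype>.fp` pattern documented for -savefp. Moving the file between the two steps is left to the user.

```python
import posixpath

# Command path as given at the top of this help text.
EXE = "$SCHRODINGER/utilities/combinatorial_diversity"


def build_two_step_commands(infile, ndiverse, jobname, shared_dir,
                            fptype="molprint2D", njobs=1):
    """Return (fingerprint command, selection command) as argument lists."""
    if not posixpath.isabs(shared_dir):
        # -nocopy requires an absolute path visible to all compute nodes.
        raise ValueError("shared_dir must be an absolute path")

    # Step 1: generate fingerprints only, without selecting structures.
    step1 = [EXE, "-onlyfp", "-fptype", fptype, "-JOBNAME", jobname,
             infile, str(ndiverse)]

    # ASSUMPTION: the output follows the <jobname>_<fptype>.fp pattern.
    shared_fp = posixpath.join(shared_dir, f"{jobname}_{fptype}.fp")

    # Step 2: select from the fingerprint file in place.
    step2 = [EXE, "-nocopy", "-NJOBS", str(njobs),
             "-JOBNAME", f"{jobname}_select", shared_fp, str(ndiverse)]
    return step1, step2


if __name__ == "__main__":
    for cmd in build_two_step_commands("library.smi", 1000, "div",
                                       "/shared/fp", njobs=4):
        print(" ".join(cmd))
```

The printed lines are meant to be pasted into a shell, where `$SCHRODINGER` is expanded; add -HOST as needed for remote execution.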