Crystal Structure Prediction Workflow

Please note that Job Server is required for submission of crystal structure prediction jobs. To learn more about Job Server, see Job Submission to a Cluster.

This section highlights how the crystal structure prediction (CSP) workflow may be set up and run. For an overview of CSP, see the Crystal Structure Prediction Panel.

Crystal Structure Prediction Workflow Table of Contents

Workflow
Generating the config.yaml File
Description and Examples of Required and Basic Settings in the config.yaml file
Launching
Retrospective Validation

Workflow

The full CSP workflow is comprised of 5 sub-workflows:

ConfGen (confgen.py): Generates conformers.
PackingSearch (packing_search.py): Creates crystal packings of the various conformers.
MDRelaxation (md_relaxation.py): Relaxes the crystals using molecular mechanics and removes those with high energy.
QRNNRelaxation (qrnn_relaxation.py): Relaxes the crystals using QRNN (charge recursive neural network) and removes those with high energy.
DFTRelaxation (dft_relaxation.py): Relaxes the crystals using periodic DFT (first with PBE-D3 functional , then with R2SCAN-D3 functional) and removes those with high energy.

Running the full CSP workflow (csp.py) runs each of these 5 sub-workflows in order. In addition, each of these sub-workflows can be run individually on their own.

We also support a CSPPackingToRanking workflow (csp_packing_to_ranking.py), which runs the sub-workflows after ConfGen. This allows the user to run ConfGen, examine the results, and then run the rest of the workflow.

Generating the config.yaml File

Prior to launching a workflow, one must generate a config.yaml file, which specifies the job settings.

A default config.yaml file gets generated by running any of the following commands which are versions of the command for the main workflow and for each of the sub-workflows:

$SCHRODINGER/run -FROM scisol csp/csp.py -write-default-config config.yaml
$SCHRODINGER/run -FROM scisol csp/csp_packing_to_ranking.py -write-default-config config.yaml
$SCHRODINGER/run -FROM scisol csp/confgen.py -write-default-config config.yaml
$SCHRODINGER/run -FROM scisol csp/packing_search.py -write-default-config config.yaml
$SCHRODINGER/run -FROM scisol csp/md_relaxation.py -write-default-config config.yaml
$SCHRODINGER/run -FROM scisol csp/qrnn_relaxation.py -write-default-config config.yaml
$SCHRODINGER/run -FROM scisol csp/dft_relaxation.py -write-default-config config.yaml

This will generate a config.yaml file that is similar to (the actual file will depend on which workflow is being used):

###############################################################################
# Required settings
###############################################################################

# MAE file of input ligand.
input_file: 

###############################################################################
# Basic settings
###############################################################################

# Path to custom OPLS directory. If a custom OPLS directory has already been generated for the structure, it should be provided here. Otherwise, this can be set either to a null value to use the default OPLS forcefield (not recommended, as results may be less accurate) or to "auto" to run ffbuilder automatically as part of the workflow.
oplsdir: auto

# Turn on DFT relaxation in the CSP workflow.
enable_dft_relaxation: False

###############################################################################
# Advanced settings
###############################################################################

# Dielectric constant. This value is currently used in:
# - Ring conformer generation
# - Full-conformer single-point energy calculation and geometry relaxation
# DIFFERENT values of the dielectric constant (specified in parentheses) are applied in:
# - Profile torsion conformers (1.0)
# - Relaxation of intermediate fragments (2.0)
# - Restrained relaxation of conformers (1.0)
dielectric: 10.0

# JSON file specifying rules about what torsion angles will be sampled for specific chemical groups. Set to a null value to use default rules.
torsion_sample_angle_rules_file: null

# Constant spacing between torsion angles that will be sampled (degrees).
torsion_sample_angle_spacing: 15

# Filter out torsion angle samples with energies higher than this value, relative to the lowest-energy torsion angle within the local energy basin.
torsion_sample_local_filter_energy: 3

# Filter out torsion angle samples with energies higher than this value, relative to the lowest-energy torsion angle along the entire torsion energy profile (kcal/mol). Set to a null value to not apply this filter.
torsion_sample_global_filter_energy: null

# Target number of torsions in each intermediate fragment. The actual number may be less if the target is not possible.
target_number_torsions_for_intermediate_fragment: 3

# Filter out intermediate fragments with energies higher than this value, relative to the lowest-energy intermediate fragment (kcal/mol). Set to a null value to use a default determination of the optimal value.
intermediate_fragment_filter_energy: null

# List of nitrogen improper torsion angles to apply to each conformer if an invertible nitrogen is present (degrees). Set to a null value to not adjust nitrogen improper torsion angles.
nitrogen_improper_angle_lists: [-150, 150, 180]

# Apply ring restraints during the constrained conformer relaxation.
coord_ring_restraints: True

# Maximum number of relaxation steps to perform during the constrained conformer relaxation.
max_relaxation_steps: 1000
			
# Filter out conformers with energies higher than this value, relative to the lowest-energy conformer (kcal/mol).
conformer_filter_energy: 8.0

# RMSD tolerance for similarity check during the final deduplication.
final_dedup_rmsd_tol: 0.5

# Torsion angle tolerance for similarity check during the final deduplication. Set to a null value to use a default determination of the optimal value.
final_dedup_tor_tol: null

# Turn on implicit solvation filtering for conformers
solvation_filtering: True

# Filter out conformers with total energy plus implicit solvation energy higher than this value, relative to the lowest-energy conformer (kcal/mol). This value is used only if solvation filtering is turned on.
conformer_filter_solvation_energy: 7.0

# Whether to generate all spacegroups regardless of the input molecule's chirality.
force_all_spgc: False

# Restrict packing search to this space-delimited list of spacegroups. Only used for testing/debugging. Set to a null value to search all spacegroups consistent with ASU chirality (default).
restrict_search_spgs: null

# Multiples of the x, y, and z unit cell edge lengths. For each structure, this list will be iterated through until the multiples produce a supercell large enough to inscribe a sphere of diameter d_inscribe, so it is recommended that it be given in an increasing order, as it is cheaper to run MD with a smaller supercell.
extents: [[1, 1, 1], [1, 1, 2], [1, 2, 1], [2, 1, 1], [1, 1, 3], [1, 3, 1], [3, 1, 1], [2, 2, 1], [1, 2, 2], [2, 1, 2], [1, 1, 6], [6, 1, 1], [1, 6, 1], [2, 2, 3], [2, 3, 2], [3, 2, 2], [1, 11, 1], [11, 1, 1], [1, 1, 11], [1, 3, 7], [1, 7, 3], [3, 1, 7], [3, 7, 1], [7, 1, 3], [7, 3, 1], [1, 2, 12], [2, 1, 12], [2, 12, 1], [1, 12, 2], [12, 1, 2], [12, 2, 1], [3, 3, 4], [4, 3, 3], [3, 4, 3], [2, 4, 8], [2, 8, 4], [4, 2, 8], [4, 8, 2], [8, 2, 4], [8, 4, 2], [4, 4, 4], [4, 5, 5], [5, 4, 5], [5, 5, 4], [12, 4, 4], [4, 12, 4], [4, 4, 12], [6, 6, 6], [7, 7, 5], [5, 7, 7], [7, 5, 7], [11, 6, 4], [6, 11, 4], [4, 11, 6], [6, 4, 11], [4, 6, 11], [11, 4, 6]]

# Twice the value of the cutoff radius that will be used in MD (Angstroms).
d_inscribe: 18

# Energy cutoff above the lowest-energy structure before MD relaxation (kcal/mol/molecule).
cutoff_energy_vrun: 12

# Energy cutoff above the lowest-energy structure after MD relaxation (kcal/mol/molecule).
cutoff_energy_md: 10

# Width of energy bins for deduplication (kcal/mol/molecule).
energy_bin_size_md: 5

# Energy cutoff above the lowest-energy structure after QRNN relaxation (kcal/mol/molecule).
cutoff_energy_qrnn: 6

# Width of energy bins for deduplication (kcal/mol/molecule).
energy_bin_size_qrnn: 1

# Number of MPI cores to use for the Quantum Espresso runner jobs. Do not set to a number larger than the number of cores available on a node.mpicores: 16

This shows that there are three classes of settings:

Required settings: These are the input files for the run which must be provided by the user.
Basic settings: These should be set by the user. Default settings should not be relied upon.
Advanced settings: The default values for these are generally fine to use. Advanced users may wish to adjust them.

You can now modify the config.yaml file directly. Lines that start with a # symbol are treated as comments and do not impact the simulations.

Description and Examples of Required and Basic Settings in the config.yaml file

CSP (full)

input_file: This is the structure file (.mae or .maegz) of the input monomer.
oplsdir: This should be set to one of auto, null, or the path to a custom OPLS directory created using the force field builder. It should be set to auto the first time the workflow is run for a given monomer to have the force field builder run. For future runs with the same monomer, the force field builder stage can be skipped by specifying the path to the already-generated directory.

ConfGen

input_file: Same as in CSP (full).
oplsdir: Same as in CSP (full).

PackingSearch

conformer_pool: This is the output file generated by the ConfGen workflow. Its default value is therefore set to output_confgen.maegz.

MDRelaxation

packing_search_results: This is the output file generated by the PackingSearch workflow. Its default value is therefore set to output_packing_search.maegz.
oplsdir: Same as in CSP (full). The value should be consistent with what's used in the ConfGen workflow: If oplsdir: null was used to get the conformer pool, set null for this field; if oplsdir: auto was set before, a custom OPLS directory was generated in the ConfGen output directory, and that OPLS directory path should be entered here.

QRNNRelaxation

md_relaxation_result: This is the output file generated by the MDRelaxation workflow. Its default value is therefore set to output_md_relaxation.maegz.

DFTRelaxation

dft_input_file: This is the output file generated by the QRNNRelaxation workflow. Its default value is therefore set to output_qrnn_relaxation.maegz.
mpicores: This is the number of MPI cores used for Quantum Espresso runner jobs (each runner job performs a periodic DFT calculation on one unit cell structure). This value should not be set to a number larger than the number of cores available on a compute node.

CSPPackingToRanking (PackingSearch+3 energy-ranking workflows)

conformer_pool: This is the output file generated by the ConfGen workflow. Its default value is therefore set to output_confgen.maegz.
oplsdir: Same as in CSP (full). The value should be consistent with what's used in the ConfGen workflow: If oplsdir: null was used to get the conformer pool, set null for this field; if oplsdir: auto was set before, a custom OPLS directory was generated in the ConfGen output directory, and that OPLS directory path should be entered here.

Launching

All of the workflows are launched via a similar methodology, using the following command (again, showing different versions of the command for each workflow):

$SCHRODINGER/run -FROM scisol csp/csp.py -config config.yaml -HOST [driver] -SUBHOST [compute] -GPUHOST [gpuhost] -JOBNAME [jobname]
$SCHRODINGER/run -FROM scisol csp/csp_packing_to_ranking.py -config config.yaml -HOST [driver] -SUBHOST [compute] -GPUHOST [gpuhost] -JOBNAME [jobname]
$SCHRODINGER/run -FROM scisol csp/confgen.py -config config.yaml -HOST [driver] -SUBHOST [compute] -JOBNAME [jobname]
$SCHRODINGER/run -FROM scisol csp/packing_search.py -config config.yaml -HOST [driver] -SUBHOST [compute] -JOBNAME [jobname]
$SCHRODINGER/run -FROM scisol csp/md_relaxation.py -config config.yaml -HOST [driver] -SUBHOST [compute] -GPUHOST [gpuhost] -JOBNAME [jobname]
$SCHRODINGER/run -FROM scisol csp/qrnn_relaxation.py -config config.yaml -HOST [driver] -SUBHOST [compute] -JOBNAME [jobname]
$SCHRODINGER/run -FROM scisol csp/dft_relaxation.py -config config.yaml -HOST [driver] -SUBHOST [compute] -JOBNAME [jobname]

These commands launch a JobServer-managed job which can be monitored using the standard $SCHRODINGER/jsc commands.

Note that the -GPUHOST option is needed for the MDRelaxation workflow, and thus also the full CSP and CSPPackingToRanking workflows that contain MDRelaxation. This option is used to set the GPU host for Multisim jobs that run molecular dynamics calculations.

Jobs should be launched to a host that will serve as the main driver node.

For the driver node, it is recommended to use a highmem and exclusive node to ensure access to 64 GB of memory.

If the number of subhost processes is not specified, it is assumed to be 32. This will parallelize each stage over 32 workers, potentially making each subhost process need to run many bundles before it finishes. To take advantage of additional nodes for increased parallelization, specify the number of processes via -SUBHOST [compute:nproc] (e.g., -SUBHOST compute-16core-64gb-preemptible:1000). It is recommended to specify a very large number of processes (e.g., 10,000) such that each need run very few bundles before finishing, since if a process fails, all bundles on that process (even bundles that had finished) will need to be rerun on restart. However it cannot be too high since the driver node consumes memory to manage each worker; we have found that we can run large systems on a driver node with 64 GB of memory when the number of subhost processes is set to 50,000 but not when it is set to 150,000, so we do not recommend going above 50,000 when running with such a driver.

DFTRelaxation

The DFTRelaxation workflow requires Quantum Espresso (QE) to have been installed and configured for use. See Installing and Configuring Quantum ESPRESSO.

Retrospective Validation

To perform retrospective validation, comparing the output of the CSP workflow with a known experimental crystal structure, run the following script:

$SCHRODINGER/run align_spherical_cluster.py --reference [experimental-reference.maegz] --test output_[one of packing_search, md_relaxation, or qrnn_relaxation].maegz

This will align the output crystals with the reference crystal, write a output_*_aligned.maegz file that can be viewed in Maestro, and calculate RMSD_N values as a measure of successful match. If one of the output crystals gives N>20 and RMSD_N<0.8, it is considered a match. Note that the script runs in serial, so it will be slow when run with the many crystals present in output_packing_search.maegz and output_md_relaxation.maegz.

To perform retrospective validation, comparing the output of just the ConfGen sub-workflow with a conformer from a known experimental crystal structure, run:

$SCHRODINGER/run compare_ASUs_RMSD.py [experimental-reference.maegz] output_confgen.maegz --allow-improper-rotation --reorder-atoms --save-aligned-structure

This will align the output conformers with the reference conformer, write a rmsd.maegz file that can be viewed in Maestro, and calculate RMSD values as a measure of successful match, with the smallest RMSD being printed at the end. If a conformer has an RMSD<0.5, it is considered a match.