Library Analysis Workflow

Profiling Step

This step takes a CSV file with a “SMILES” column in its header row.
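Before submitting a large job, it can be worth confirming that the input file really has the required column. The snippet below is a minimal sketch using only the Python standard library. The helper name has_smiles_column is hypothetical, and the check assumes the column name must match “SMILES” exactly (case-sensitive), which the script itself may or may not require.

```python
import csv
import io


def has_smiles_column(csv_text: str, column: str = "SMILES") -> bool:
    """Return True if the first row of the CSV contains the given column name.

    Assumption: the name must match exactly (case-sensitive).
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    return column in (name.strip() for name in header)


# Small in-memory example; for a real file, read the text from disk first.
sample = "SMILES,ID\nCCO,mol1\nc1ccccc1,mol2\n"
print(has_smiles_column(sample))
```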

The library_analysis.py script is used:

$SCHRODINGER/run library_analysis.py <database>.csv <database>_profiled.csv -simple2D -batch_size 100000 -HOST bolt_cpu

This takes about 4 CPU hours per 1M compounds. It relies mostly on RDKit to compute the following 2D properties:

MW, HBA, RB, Total Rings, Aliph. Rings, Stereo Centers, Frac. CSP3, N+O, HAC, AlogP, CRB.Max, CRB.Mean, PSA, Arom. Rings, HBD, Unspecified Stereo Centers, HTL.MPO.geom, HTL.MPO.arim, HTL.MPO.sum, Class HTL ('hit_to_lead' property-based risk function).

Alternatively, you can run it in “3D” mode (omit the -simple2D flag), which runs LigPrep to obtain the pKa.

$SCHRODINGER/run library_analysis.py <database>.csv <database>_profiled.csv -batch_size 10000 -HOST bolt_cpu

The full workflow with LigPrep is much slower (~800 CPU hours per 1M compounds). The default settings for LigPrep are “-s 1 -nd -bvac -epik -W e,-best_neutral,-ph,7.4”, and the LigPrep output is NOT saved.

The output from this workflow includes the following properties in addition to those from the 2D workflow:

InChI=1S, InChIKey, Murko Scaffold SMILES, Murko InChI=1S, Murko InChIKey, Eccentricity, Best Neutral State Penalty, AlogD@7.4, Ion Class, Max. pKa, Min. pKa, Class, CNS.MPO (CNS MPO based on Wager et al.)

You can also use a YAML file to control more settings, such as a different batch_size and host for each step, detailed LigPrep settings, etc.
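The throughput figures quoted above (about 4 CPU hours per 1M compounds in 2D mode, about 800 in 3D mode) can be turned into a rough cost estimate before choosing a mode. The sketch below assumes the cost scales linearly with library size and that -batch_size is the number of compounds per batch. Both are assumptions rather than documented behavior, and the function names are hypothetical.

```python
# Approximate CPU hours per 1M compounds, taken from the figures in this document.
CPU_HOURS_PER_MILLION = {"2D": 4.0, "3D": 800.0}


def estimate_cpu_hours(n_compounds: int, mode: str) -> float:
    """Rough CPU-hour estimate, assuming linear scaling with library size."""
    return CPU_HOURS_PER_MILLION[mode] * n_compounds / 1_000_000


def n_batches(n_compounds: int, batch_size: int) -> int:
    """Number of batches, assuming batch_size means compounds per batch."""
    return -(-n_compounds // batch_size)  # ceiling division


library_size = 2_500_000
print(estimate_cpu_hours(library_size, "2D"), n_batches(library_size, 100_000))
print(estimate_cpu_hours(library_size, "3D"), n_batches(library_size, 10_000))
```

For a 2.5M-compound library this gives roughly 10 CPU hours in 2D mode versus roughly 2000 in 3D mode, so it is usually worth running 3D mode only on a subset that has already been filtered.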

Filtering Step

This step requires a YAML file that specifies the filtering criteria. It takes about 10 CPU minutes per 1M compounds.

$SCHRODINGER/run library_analysis.py -filter <database>_profiled.csv <database>_filtered.csv -config filter.yaml -HOST bolt_cpu

Below is an example of the filter.yaml file:

PropertyFilterProfiledMol:
  property_ranges:
    MW: [300, 350]
    Class: ["Druglike", "Leadlike"]

Check the CSV header for the properties available for filtering. The current implementation applies a range filter to numeric properties and an exact match to non-numeric properties.
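The filtering rule described above can be illustrated with a small standalone sketch. This is not the code in library_analysis.py. It assumes that numeric ranges are inclusive at both ends and that a list of strings means "exactly equal to any one of these values". The function name passes_filter is hypothetical.

```python
import csv
import io
from typing import Any, Mapping, Sequence


def passes_filter(row: Mapping[str, str],
                  property_ranges: Mapping[str, Sequence[Any]]) -> bool:
    """Apply range filters to numeric properties and exact match to the rest."""
    for name, allowed in property_ranges.items():
        value = row.get(name)
        if value is None:
            return False  # property missing from this row
        is_range = len(allowed) == 2 and all(
            isinstance(bound, (int, float)) for bound in allowed)
        if is_range:
            try:
                number = float(value)
            except ValueError:
                return False  # non-numeric value in a numeric column
            low, high = allowed
            if not (low <= number <= high):  # assumption: inclusive bounds
                return False
        elif value not in allowed:  # assumption: match any listed value
            return False
    return True


# Same criteria as the filter.yaml example above.
criteria = {"MW": [300, 350], "Class": ["Druglike", "Leadlike"]}
profiled = "SMILES,MW,Class\nA,325.4,Druglike\nB,410.2,Druglike\nC,310.0,Other\n"
kept = [r["SMILES"] for r in csv.DictReader(io.StringIO(profiled))
        if passes_filter(r, criteria)]
print(kept)
```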

PDF Generation Step

This step uses database_report.py and takes a few minutes for a library of ~10M compounds.

$SCHRODINGER/run database_report.py <database>_profiled.csv <database>_report.pdf

An example of the PDF can be seen here.