Managing Databases with phase_database
Databases are created and managed with the program phase_database, which performs all the necessary database management tasks. The syntax is as follows:
$SCHRODINGER/phase_database database task [jobName] [options]
where database is the full path to the database, which must have the extension .phdb. The database is a directory, dbName.phdb, which contains the database files (much like a Maestro project or a Canvas project). Information on the database structure is given in Phase Database Structure. The default for the optional job name is database_task. Error messages are written to jobName_errors.out.
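If you drive phase_database from a script, a small helper can assemble the command line described above. The sketch below is illustrative only: the helper name and option handling are not part of the product, and only the argument order, the .phdb requirement, and the database_task default job name come from the documentation.

```python
from pathlib import Path

def build_phase_database_cmd(schrodinger, database, task, job_name=None, options=()):
    """Assemble a phase_database command line (illustrative helper only).

    Per the documentation, the database path must end in .phdb and the
    default job name is <database>_<task>.
    """
    db = Path(database)
    if db.suffix != ".phdb":
        raise ValueError("database must have the .phdb extension")
    if job_name is None:
        job_name = f"{db.stem}_{task}"
    return [str(Path(schrodinger) / "phase_database"), str(db), task, job_name, *options]

# Hypothetical example: an import job reading ligands.sdf.
cmd = build_phase_database_cmd("/opt/schrodinger", "ligands.phdb", "import",
                               options=["-isd", "ligands.sdf"])
print(cmd)
```

The resulting list can be passed to subprocess.run; nothing here validates the options themselves, which belong to the task-specific usage messages.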
The allowed tasks are summarized in Table 1. Write permission on the database is required for the import, revise, extract, delete, and convert tasks. To see the options for a particular task, use -help_task rather than --help, which gives only the summary help message.
Table 1. Tasks performed by phase_database.

| Task    | Action |
|---------|--------|
| import  | Import structures into a new or existing database. |
| splice  | Efficiently import large structure sets into a new or existing database using multiple processors. |
| revise  | Add sites, conformers, or Canvas properties. |
| index   | Manage alternate 2D/3D database indices, which speed up screens with hypotheses that contain incompatible feature definitions. |
| extract | Extract all properties into a single SQLite table. |
| query   | Perform a property or substructure query. |
| delete  | Delete records. |
| convert | Convert or merge one database into another. |
| export  | Export database records to structure files. |
| subset  | Create or operate on database subsets. |
| prefer  | Manage preferences that are used by other software that accesses the database. |
The options for each task are described in the usage message, which you can display by running the phase_database command with the -h option. Notes on the tasks are given below. The standard Job Control options and the -LOCAL and -NOJOBID options, as described in Running Jobs From the Command Line, are accepted.
Restarting of failed import, revise, and convert tasks is supported with the -RESTART option. Complete instructions on restarting a job are stored in the database, at database/database_restart/README. The revise and convert task subjobs are automatically restarted if they fail, up to 3 times, after which a failure is recorded. You can set the number of retries with the SCHRODINGER_PHASE_MAX_RESTART environment variable. Restarting is also tried on each slave process run by a subjob. If the slave process itself fails, it is retried up to 5 times (settable with the SCHRODINGER_PHASE_MAX_RETRY environment variable). At a finer level, failed structures are retried once, and then skipped.
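The layered retry scheme described above can be modeled as a simple retry loop. This is a sketch of the general pattern, not the product's internal job-control code; only the retry counts (3 for subjobs, 5 for slave processes, 1 for individual structures) come from the documentation.

```python
def run_with_retries(action, max_retries):
    """Call action() until it succeeds or max_retries retries are exhausted.

    Returns True on success, False if the initial try and every retry
    failed. With max_retries=3 this mirrors the documented subjob
    behavior; slave processes would use 5, and failed structures 1.
    """
    for _attempt in range(1 + max_retries):  # initial try + retries
        try:
            action()
            return True
        except Exception:
            continue
    return False

# Example: an action that fails twice, then succeeds on the third try.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient failure")

ok = run_with_retries(flaky, max_retries=3)
print(ok, attempts["n"])
```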
Import Task
New record numbers are written to the file jobName_new_phase.inp.
If the -unique option is used, a summary of each duplicate structure that is rejected is written to the file jobName_reject.out. The summary contains the SMILES string and title of the input structure and the molID and title of the matching database structure, as shown in this example:
Rejected: [O-]C(=O)[C@H](C)[C@H](C([O-])=O)Cc1cc(C)cc(C)c1 "385089_3" Duplicate of: block_1/mol_21 "385089_3"
All duplicates encountered in the database are listed. Duplicates in the input file are also listed.
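If you need to post-process the reject file, lines of this form can be parsed with a regular expression. The pattern below is inferred from the single example above and may need adjusting if the file contains other line shapes.

```python
import re

# Parse a "Rejected:" line from jobName_reject.out. The field layout is
# inferred from the documented example, not from a formal specification.
REJECT_RE = re.compile(
    r'Rejected: (?P<smiles>\S+) "(?P<in_title>[^"]*)" '
    r'Duplicate of: (?P<mol_id>\S+) "(?P<db_title>[^"]*)"'
)

def parse_reject_line(line):
    """Return a dict of fields for a reject line, or None if it does not match."""
    m = REJECT_RE.search(line)
    return m.groupdict() if m else None

rec = parse_reject_line(
    'Rejected: [O-]C(=O)[C@H](C)[C@H](C([O-])=O)Cc1cc(C)cc(C)c1 "385089_3" '
    'Duplicate of: block_1/mol_21 "385089_3"'
)
print(rec)
```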
Splice Task
This task has two limitations: conformer sets cannot be imported, and no checking is done for redundant structures. The task creates a number of mini-databases in parallel, which are then "spliced" together into the final database. The structure source can be one of the following:
- Maestro file (.mae, .mae.gz, .maegz)
- SD file (.sdf, .sd, .sdfgz, .sdf.gz, .sd.gz)
- List file (.list). This is a text file that contains the names of one or more Maestro or SD files, with one name per line. The files must all be of the same type (all Maestro or all SD) and have the same compression state.

There is no attempt to perceive conformers, so each structure is stored in a separate database record.
Revise Task
The revise job can be distributed across multiple CPUs. Any combination of -sites, -confs, and -props is allowed, but at least one must be used.
Conformer generation is not exhaustive, and therefore depends to some extent on the input structures. However, the conformers generated should represent a reasonable sample, and the results of a search should not depend much on the input structures. If you want to be sure that you have a complete set of conformers, you should run a conformational search beforehand (with MacroModel, for example) and import the conformer sets.
The feature definitions used for the sites are taken from the definition file that was copied into the database when it was created. If you want to use custom feature definitions, copy the feature definition file into the database by running an import with the -fd option. Do this when you create the database, or before you run a revise task for the first time. If you change the feature definitions after creating sites, the database can become inconsistent, and you should run a revise task with -sites to remove any inconsistencies.
Extract Task
All properties that have been imported or computed are extracted and written to a single table in the SQLite database database/database.sqlite. A copy of the same data is written to database/database_props.csv.
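The CSV copy can be read with standard tools. The sketch below loads such a file into a list of dicts; the column names in the sample data are placeholders, since the actual header depends on the properties stored in your database.

```python
import csv
import io

def load_props_csv(fileobj):
    """Read an extracted properties CSV into a list of row dicts.

    In practice you would pass open("database/database_props.csv");
    a StringIO with made-up columns is used here for demonstration.
    """
    return list(csv.DictReader(fileobj))

sample = io.StringIO(
    "mol_id,Title,MW\n"
    "block_1/mol_1,lig1,310.4\n"
    "block_1/mol_2,lig2,287.3\n"
)
rows = load_props_csv(sample)
print(rows[0]["Title"], rows[1]["MW"])
```

The SQLite copy at database/database.sqlite can be queried the same way with the sqlite3 module once you know the table and column names in your database.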
Query Task
Matching record numbers are written to jobName_matches_phase.inp, and all properties for those records are written to jobName_matches.csv.
Convert Task
The destination database is the first argument in the phase_database command (i.e., database), and it must have been created (or be created) using phase_database. The job can be distributed across multiple processors.
New sites are created in the destination database if there is an upgrade in the storage format, the feature definitions differ, or if there are no sites for a given record. If none of these conditions is met, the default is to copy the source database sites to the destination database.
The -nosites option is intended for merging a source database with no conformers or sites into a destination database with no conformers or sites, without adding sites in the destination database. This option ensures that sites are not created in the destination database for source records that are missing sites. If there is no upgrade in the database format, existing sites in the source database are copied (regardless of changes in feature definitions).
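The site-handling rules in the two paragraphs above can be summarized as a small decision function. This is a sketch that encodes only the conditions stated here; the product's actual logic may differ in details, and the behavior with -nosites during a format upgrade is an assumption.

```python
def sites_action(format_upgrade, fds_differ, record_has_sites, nosites=False):
    """Return what the convert task does with sites for one record.

    Encodes the documented rules: sites are created on a format upgrade,
    on differing feature definitions, or when the record has no sites;
    otherwise existing sites are copied. With -nosites, sites are never
    created, and existing sites are copied only if there is no format
    upgrade (handling of sites during an upgrade is assumed here).
    """
    if nosites:
        if record_has_sites and not format_upgrade:
            return "copy"
        return "none"
    if format_upgrade or fds_differ or not record_has_sites:
        return "create"
    return "copy"

print(sites_action(False, False, True))   # default case: sites are copied
print(sites_action(True, False, True))    # format upgrade: sites are created
```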
Subset Task
The options -hits, -has, -titles, and -logic are mutually exclusive. An existing database is not required with -hits and -logic, so the database argument is parsed but not used.
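Mutual exclusivity of this kind is straightforward to check before launching a job. The validator below is illustrative only, not the product's own option parser.

```python
def check_subset_options(opts):
    """Return the single exclusive subset option present, or None.

    Raises ValueError if more than one of the mutually exclusive
    options (-hits, -has, -titles, -logic) appears in opts.
    """
    exclusive = {"-hits", "-has", "-titles", "-logic"}
    chosen = exclusive.intersection(opts)
    if len(chosen) > 1:
        raise ValueError(f"mutually exclusive options: {sorted(chosen)}")
    return chosen.pop() if chosen else None

print(check_subset_options(["-hits", "run1_hits"]))
```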