Running Distributed Schrödinger Jobs
A number of Schrödinger products can distribute work over multiple processors, using one of several algorithms. Some jobs divide the input structures into batches, and each batch is submitted to a processor for execution as a subjob; in this case the number of processors and the number of batches (subjobs) can differ. Other jobs divide the work among the available processors, so that the number of subjobs equals the number of processors. In addition to single jobs, a number of workflows submit many single jobs (e.g., Desmond or Jaguar jobs) as subjobs of the overall process.
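The batching strategy described above can be illustrated with a minimal sketch. It is not the algorithm used by any Schrödinger product; the function name `split_into_batches` and the choice of contiguous, near-equal batches are assumptions made purely for illustration.

```python
def split_into_batches(structures, n_batches):
    """Split a list into contiguous batches of near-equal size.

    Hypothetical helper: real products may batch differently.
    """
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    # Never create more batches than there are structures (minimum of one).
    n_batches = max(1, min(n_batches, len(structures)))
    size, extra = divmod(len(structures), n_batches)
    batches = []
    start = 0
    for i in range(n_batches):
        # The first `extra` batches receive one additional structure.
        end = start + size + (1 if i < extra else 0)
        batches.append(structures[start:end])
        start = end
    return batches


# Ten input structures divided into four batches (subjobs).
batches = split_into_batches(list(range(10)), 4)
print([len(b) for b in batches])
```

Because the number of batches is chosen independently of the number of processors, four batches could be run on two processors, with each processor picking up a new batch as it finishes the previous one.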
Jaguar, QSite, and Quantum ESPRESSO can use multiple processors with OpenMP threading as well as distributed processing.
For each distributed job there is a driver that is responsible for dividing up the work, submitting the subjobs, and collating the results. Thus, if N processors are requested, the job creates N+1 processes. Depending on the program, the driver can run on the local host or on the remote host. If the driver runs on the remote host, there are two separate cases to consider:
-
The remote host (or collection of remote hosts) does not involve a queuing system. Although the driver itself uses little CPU time, the overall job does not necessarily run inefficiently, because the operating system can swap the driver process out when it is idle and use the time for other processes.
-
The remote host is running a queuing system. The driver is run as a separate queued job, and the subjobs are then submitted to the queue by the driver. The driver job can then occupy a processor that will be idle most of the time. The exceptions are Prime loop refinement, Phase, Jaguar, and Quantum ESPRESSO, for which the driver runs (or can run) a subjob locally.
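The role of the driver (dividing the work, submitting subjobs, and collating results) can be sketched as follows. This is only a conceptual model: threads in a single Python process stand in for subjobs, and the names `run_subjob` and `driver` are hypothetical rather than part of any Schrödinger API.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subjob(batch):
    """Hypothetical stand-in for a subjob: square each number in the batch."""
    return [x * x for x in batch]


def driver(work_items, n_subjobs, n_processors):
    """Divide the work, run the subjobs, and collate the results."""
    # Divide: one batch per subjob (simple striding, chosen for brevity).
    batches = [work_items[i::n_subjobs] for i in range(n_subjobs)]
    # Submit: at most n_processors subjobs run at the same time.
    with ThreadPoolExecutor(max_workers=n_processors) as pool:
        results = list(pool.map(run_subjob, batches))
    # Collate: merge the per-subjob results into a single sorted list.
    return sorted(x for result in results for x in result)


print(driver(list(range(6)), n_subjobs=3, n_processors=2))
```

In this model the driver spends most of its time waiting for the pool to finish, which mirrors why a queued driver job can occupy a processor that is idle most of the time.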
The hosts on which the driver and the subjobs run are set with one or more of the following options:
-
-HOST: General list of hosts for the job. If only one of the other options is used, the hosts specified here are generally used for the role that the omitted option would set. If neither of the other options is used, the driver runs either on localhost or on the first host, and the subjobs can run on the remaining hosts or on all the hosts, including the first.
-
-DRIVERHOST: Host on which to run the driver (must be a single host name). If omitted, the driver may be run on the first host specified by -HOST, or on localhost if -HOST is not used or the application default is to run the driver locally.
-
-SUBHOST: Hosts on which to run the subjobs. If omitted, all hosts specified by -HOST are used for the subjobs. The number of CPUs used is determined by the number of hosts or host/CPU combinations specified.
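The way these three options interact can be modeled with a short sketch. It encodes one possible reading of the rules above and is not the actual job control logic: the "name:cpus" host syntax, the whitespace-separated host list, the fallback of subjobs to localhost when no hosts are given, and the names `parse_hosts`, `resolve_hosts`, and `driver_local_by_default` are all assumptions for illustration. The real behavior varies by program.

```python
def parse_hosts(spec):
    """Parse a host list such as "nodeA:4 nodeB" into (name, cpus) pairs.

    Assumed syntax: whitespace-separated entries, optional ":cpus" suffix.
    """
    hosts = []
    for entry in spec.split():
        name, _, cpus = entry.partition(":")
        hosts.append((name, int(cpus) if cpus else 1))
    return hosts


def resolve_hosts(host=None, driverhost=None, subhost=None,
                  driver_local_by_default=False):
    """Return (driver host, subjob hosts, total CPUs) under the assumed rules."""
    general = parse_hosts(host) if host else []

    # Driver: -DRIVERHOST wins; otherwise the first -HOST entry, or localhost
    # when -HOST is absent or the application runs its driver locally.
    if driverhost:
        driver = driverhost
    elif general and not driver_local_by_default:
        driver = general[0][0]
    else:
        driver = "localhost"

    # Subjobs: -SUBHOST wins; otherwise all hosts given by -HOST.
    if subhost:
        sub = parse_hosts(subhost)
    elif general:
        sub = general
    else:
        sub = [("localhost", 1)]  # assumed fallback, not stated in the text

    total_cpus = sum(cpus for _, cpus in sub)
    return driver, sub, total_cpus


print(resolve_hosts(host="nodeA:4 nodeB"))
print(resolve_hosts(host="nodeA:4", driverhost="login1"))
```

The host names used here (`nodeA`, `nodeB`, `login1`) are placeholders. The sketch lets the subjobs use every host in the general list, including the first; some programs may instead reserve the first host for the driver.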
The syntax of these options is given in Running Jobs From the Command Line. Information on the number of subjobs, number of CPUs, and the driver job location is given in Table 1 for running distributed jobs from the command line, and in Table 2 for running distributed jobs from Maestro. The notation “Standard” means that the options listed above are used to determine the driver location and the number of CPUs.
The launch directory must be available from the master job host for Induced Fit Docking.