Troubleshooting FEP+ Technical Failures

FEP+ jobs can fail for a variety of technical reasons. The sections below describe the most commonly occurring failures and suggest how to debug each one.

To get technical support, create a postmortem and file a support request at https://www.schrodinger.com/support. See Knowledge Base Article 1473 for more information.

Memory allocation error

The error message for a memory allocation failure is:

>DIAGNOSIS: The GPU ran out of memory. It may be that your system is too large to run on this GPU model or that more than one simulation is running on the GPU due to an incorrect configuration.
>Cleaning up files…

By default, the 12 lambda windows are distributed across 4 GPU cards, so 3 replicas run on the same card during the production stage. If the system is too large to fit in the GPU memory, a traceback with an error report like the following occurs:

allocate_mol: malloc failed for 22258152 bytes

If this happens, run the job on more GPUs (6 or 12, so that fewer replicas share each card), or reduce the size of your system.
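As an illustration of how the GPU count is controlled, a schrodinger.hosts entry for a GPU queue might look like the sketch below; the entry name, queue type, and GPU model are hypothetical placeholders for your site's actual configuration, so check with your administrator before copying them. Requesting 12 subjobs against such an entry (e.g. with -HOST gpu-queue:12 on the launch command) gives each lambda window its own card instead of 3 replicas per card.

```
# Hypothetical schrodinger.hosts entry (names and queue settings are
# site-specific; adjust to match your cluster's real configuration)
name:      gpu-queue
host:      cluster-head
queue:     SLURM2.1
processors: 12
gpgpu:     0, Tesla V100
```

With more GPUs assigned, each card holds fewer replicas, so the per-card memory footprint drops without changing the chemistry of the calculation.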

No GPU available error

>DIAGNOSIS: Desmond cannot find a GPU to run on. Your hardware, queuing system and/or schrodinger.hosts file may be incorrectly configured.
>Cleaning up files...

This error should not happen if the jobs are submitted correctly and the queuing system is configured properly. When it does occur, it usually indicates a problem with a compute node (a GPU error or a hung gdesmond process).

Sometimes Job Control loses track of certain subjobs. If this happens, the subjob keeps running on the compute node, while Job Control and the queuing system regard that node as available. When new subjobs are then submitted to that particular compute node, the following error occurs:

no GPU is available

If this happens, try to determine which compute node is producing the error, and ask your system administrator to investigate it.

Note that if the required NVIDIA driver is not installed, you will also get a "no GPU is available" error.
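One way to spot a hung gdesmond process that is still holding a GPU is to query the compute applications on the suspect node with nvidia-smi. The snippet below runs the check against a captured sample of that query's output (the PID and memory figures are hypothetical), so the same pipeline can be pointed at the live command on the node:

```shell
# Sample output from:
#   nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv
# (values are hypothetical; on the suspect node, pipe the live command instead)
sample='pid, process_name, used_gpu_memory [MiB]
41712, gdesmond, 10240 MiB'

# Flag any gdesmond process still holding GPU memory; such a leftover
# process makes new subjobs on that node fail with "no GPU is available"
printf '%s\n' "$sample" |
  awk -F', ' '/gdesmond/ {print "stuck gdesmond PID " $1 " holding " $3}'
# -> stuck gdesmond PID 41712 holding 10240 MiB
```

On a healthy, idle node the query should return no gdesmond rows; if a row persists after the job has exited, ask your system administrator to kill the process or take the node offline.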

Kernel synchronization failed error

Kernel synchronization errors occur when the backend crashes. The "Kernel synchronization failed" message is not indicative of any single problem; it is simply the first line of most tracebacks, including those in the two scenarios above.

The simulation may have reached a very high-energy, nonphysical geometry and blown up; the particular compute node may have a hardware problem; or the failure may be caused by a backend bug.

If this happens, please restart the job to see whether the error can be reproduced.

If the error is reproducible, examine the trajectory of the failed job to check whether the ligand or protein adopted a nonphysical geometry.

If no problem is found, file a support case and attach the input .mae file and .msj file for the failed edge.