dwww Home | Manual pages | Find package

OPAL_CRS(7)                        Open MPI                        OPAL_CRS(7)

NAME
       OPAL_CRS  -  Open PAL MCA Checkpoint/Restart Service (CRS): Overview of
       Open PAL's CRS framework, and selected modules.  Open MPI 4.1.2.

DESCRIPTION
       Open PAL can involuntarily checkpoint and restart sequential  programs.
       Doing  so  requires  that Open PAL was compiled with thread support and
       that the back-end checkpointing systems are available at run-time.

   Phases of Checkpoint / Restart
       Open PAL defines three phases for checkpoint /  restart  support  in  a
       procress:

       Checkpoint
           When  the  checkpoint  request arrives, the procress is notified of
           the request before the checkpoint is taken.

       Continue
           After a checkpoint has successfully completed, the same process  as
           the checkpoint is notified of its successful continuation of execu-
           tion.

       Restart
           After a checkpoint has successfully completed, a  new  /  restarted
           process is notified of its successful restart.

       The Continue and Restart phases are identical except for the process in
       which they are invoked. The Continue  phase  is  invoked  in  the  same
       process  as the Checkpoint phase was invoked. The Restart phase is only
       invoked in newly restarted processes.

GENERAL PROCESS REQUIREMENTS
       In order for a process to use the Open PAL CRS components it  must  ad-
       hear to a few programmatic requirements.

       First,  the  program  must  call OPAL_INIT early in its execution. This
       should only be called once, and it is not possible  to  checkpoint  the
       process without it first having called this function.

       The  program  must  call  OPAL_FINALIZE before termination. This does a
       significant amount of cleanup. If it is not called,  then  it  is  very
       likely that remnants are left in the filesystem.

       To  checkpoint and restart a process you must use the Open PAL tools to
       do so. Using the backend checkpointer's checkpoint  and  restart  tools
       will   lead  to  undefined  behavior.   To  checkpoint  a  process  use
       opal_checkpoint  (opal_checkpoint(1)).   To  restart  a   process   use
       opal_restart (opal_restart(1)).

AVAILABLE COMPONENTS
       Open PAL ships with two CRS components: self and blcr.

       The following MCA parameters apply to all components:

       crs_base_verbose
           Set the verbosity level for all components. Default is 0, or silent
           except on error.

   self CRS Component
       The self component invokes user-defined functions to save  and  restore
       checkpoints.  It is simply a mechanism for user-defined functions to be
       invoked at Open PAL's Checkpoint, Continue, and Restart phases.  Hence,
       the only data that is saved during the checkpoint is what is written in
       the user's checkpoint function. No libary state is saved at all.

       As such, the model for the self component is slightly differnt than for
       other  components. Specifically, the Restart function is not invoked in
       the same process image  of  the  process  that  was  checkpointed.  The
       Restart  phase  is  invoked during OPAL_INIT of the new instance of the
       applicaiton (i.e., it starts over from main()).

       The self component has the following MCA parameters:

       crs_self_prefix
           Speficy a string prefix for the name of the  checkpoint,  continue,
           and  restart functions that Open PAL will invoke during the respec-
           tive stages. That is,  by  specifying  "-mca  crs_self_prefix  foo"
           means that Open PAL expects to find three functions at run-time:

              int foo_checkpoint()

              int foo_continue()

              int foo_restart()

           By default, the prefix is set to "opal_crs_self_user".

       crs_self_priority
           Set the self components default priority

       crs_self_verbose
           Set the verbosity level. Default is 0, or silent except on error.

       crs_self_do_restart
           This is mostly internally used. A general user should never need to
           set this value. This is set to non-0 when a the new process  should
           invoke  the  restart callback in OPAL_INIT. Default is 0, or normal
           execution.

   blcr CRS Component
       The Berkeley Lab Checkpoint/Restart (BLCR) single-process checkpoint is
       a  software  system developed at Lawrence Berkeley National Laboratory.
       See the project website for more details:

           http://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml

       The blcr component has the following MCA parameters:

       crs_blcr_priority
           Set the blcr components default priority.

       crs_blcr_verbose
           Set the verbosity level. Default is 0, or silent except on error.

   none CRS Component
       The none component simply selects no CRS  component.  All  of  the  CRS
       function calls return immediately with OPAL_SUCCESS.

       This  component  is  the last component to be selected by default. This
       means that if another component is available, and  the  none  component
       was  not  explicity requested then OPAL will attempt to activate all of
       the available components before falling back to this component.

SEE ALSO
         opal_checkpoint(1), opal_restart(1)

4.1.2                            Nov 24, 2021                      OPAL_CRS(7)

Generated by dwww version 1.14 on Fri Jan 24 06:05:35 CET 2025.