DragonFly kernel List (threaded) for 2003-09

Re: checkpoint/restart milestone 1

From:	David Leimbach <leimy2k@xxxxxxx>
Date:	Mon, 15 Sep 2003 00:04:56 -0500

MPI programs that run on clusters can get fault tolerance through checkpoint/restart. The technique has been around for a while, but I don't think I have seen it at a general application level before :).

I guess I am just pointing out that this sort of thing can be done
in a distributed manner.

    Kip, could you brief us on what this `checkpoint/restart' stuff
    exactly means?  I did not go to the Con, so my guess is that
    this is like a software suspend kind of a thing, where the state
    of the system is saved, and it can be later resumed in the exact
    condition -- although I may be wrong. :-)
You're right, except it's only at the application level, not the whole
system.
I'll give you an example: You have a compute-bound program that can run for weeks. The program, like many scientific applications, was not well structured, i.e. it has a lot of implicit state sitting around in various globals and statics. Hence, having the application programmer save its state is out of the question. However, you don't trust your hardware, so you'd like to be able to checkpoint the program in an application-independent way. This provides the functionality for doing that. So if your program or computer crashes, the program can just restart from its last checkpoint.
This could also be useful for debugging. If you have an
application that starts doing something weird, you can just checkpoint
it and send the checkpoint off to the developer.
-Kip
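
For contrast, here is a minimal sketch of the alternative that the example above rules out: checkpointing written by hand inside the application. It is not the kernel mechanism discussed in this thread, and every name in it (save_checkpoint, load_checkpoint, run, the crash_at test hook) is hypothetical. It only works because all of the program's state lives in one small dictionary, which is exactly what a program with globals and statics scattered around cannot offer.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically so a crash never leaves a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old checkpoint; readers see either the old or new file.
    os.replace(tmp, path)


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every, crash_at=None):
    """Sum 0..total_steps-1, checkpointing every checkpoint_every steps.

    crash_at is a hypothetical test hook that simulates a failure
    just before the given step is processed.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling run() again with the same path after a crash resumes from the last saved step instead of from zero. An application-independent checkpoint gives the same restart behaviour without the program having to gather its state into one place first.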

References:
- checkpoint/restart milestone 1
  - From: Kip Macy
- Re: checkpoint/restart milestone 1
  - From: Hiten Pandya
- Re: checkpoint/restart milestone 1
  - From: Kip Macy
