DragonFly BSD
DragonFly kernel List (threaded) for 2004-08

Re: VFS ROADMAP (and vfs01.patch stage 1 available for testing)


From: "Martin P. Hellwig" <mhellwig@xxxxxxxxx>
Date: Tue, 17 Aug 2004 15:33:55 +0200

Janet Sullivan wrote:

How would such a system gracefully deal with hardware failures? What happens when one of the nodes in the cluster dies?

In my opinion, hardware failure is always problematic.
If you want high availability, you inevitably have to deal with failover mechanisms such as duplicated hardware (RAID, dual power supplies, and an identical standby system for failover).
The problem with this is that when nothing fails, you have a lot of CPU power doing nothing, which is a huge waste of resources.
So this is quite an interesting problem: on one hand you want to use all available performance, and on the other hand you want failover.


The problem is also organizational: you cannot have more than one leader, and if you have more, you have to agree on a leader among those leaders.
That leader is a single point of failure just waiting to happen. So there is no easy way to prevent performance loss there; the best arrangement I can imagine would be:


One DVM master, which only controls resource spreading and dynamically appoints DVM backups.
Two DVM backups, which are kept in live sync with the master and with each other; they also perform normal distributed tasks (see the sketch right below).
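
To make that structure a bit more concrete, here is a minimal sketch of the role model in Python. All the names (Role, Node, Cluster) are just my own illustration, not an existing interface:

# Hypothetical sketch of the DVM role model described above.
from dataclasses import dataclass, field
from enum import Enum, auto


class Role(Enum):
    MASTER = auto()   # controls resource spreading, appoints backups
    BACKUP = auto()   # live-synced with the master, also does normal work
    LIST = auto()     # ordinary node on the managing-hardware priority list


@dataclass
class Node:
    name: str
    role: Role = Role.LIST
    uptime: float = 0.0        # seconds since boot
    reachable: bool = True


@dataclass
class Cluster:
    master: Node
    backups: list            # always two backups in normal operation
    priority_list: list = field(default_factory=list)
    generation: int = 0      # bumped when a backup promotes itself (scenario 2)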


When failure occurs:

Scenario 1, a DVM backup is lost:

View from Master & Backup:
The master detects that a backup DVM is no longer accessible. It asks the other backup DVM whether the lost node is reachable through it; if not, the node is considered lost, another node is taken from the list and appointed as a DVM backup, and normal operation resumes.


If the lost backup is still reachable via the remaining backup node, the master waits one timeout period (5 minutes or so) and then checks whether it has a direct connection again. If it does, normal operation resumes; if not, the backup node is proclaimed lost and another one takes its place.

On a regular basis, say every 5 minutes, both DVM backups try to connect to the lost DVM backup. If a connection is made (possibly via another list node proxying the message from the inaccessible node), the lost DVM backup is told to drop its status as a DVM management node and, if it is fully accessible and functional again, is put back on the list.
If it is not fully functional, the machine is put on hold until it is, and a message is sent to the administrator stating the problem.


<side note> The DVM managing-hardware priority list is a pool of _servers_ which are meant to be on permanently and therefore have the capability to be a cluster manager. Priority is given to hardware with a good combination of available resources (network and CPU being more important than disk space) and long uptime with little downtime. </side note>

If after 24 hours the node is still unavailable, it will no longer be contacted and a mail is sent to the administrator to find out what the problem is. If the node reappears within that time and contact is made, it is told to drop its status as a DVM management node and is put back on the list (if fully functional), of course quite low on it because it has a bad record of long inaccessibility; if it is not fully functional, the same procedure is followed as described previously.
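
A rough sketch of this handling (the master and backup view of scenario 1), assuming hypothetical can_reach() and send_admin_mail() helpers for whatever transport and alerting the real system would use:

# Hedged sketch of scenario 1 as seen from the master: detect the lost backup,
# appoint a replacement, keep retrying for 24 hours, then alert the admin.
# can_reach() and send_admin_mail() are placeholders, not real APIs; roles are
# stored as plain strings to keep the sketch self-contained.
import time

RETRY_INTERVAL = 5 * 60        # retry every 5 minutes, as suggested above
GIVE_UP_AFTER = 24 * 60 * 60   # stop contacting the node after 24 hours


def handle_lost_backup(master, other_backup, lost, priority_list,
                       can_reach, send_admin_mail):
    if can_reach(master, lost) or can_reach(other_backup, lost):
        return                           # reachable after all, nothing to do

    # Appoint a replacement backup from the managing-hardware priority list.
    replacement = priority_list.pop(0)
    replacement.role = "BACKUP"

    # Both management nodes keep trying to contact the lost node regularly.
    deadline = time.time() + GIVE_UP_AFTER
    while time.time() < deadline:
        if can_reach(master, lost) or can_reach(other_backup, lost):
            lost.role = "LIST"           # it loses its management status
            priority_list.append(lost)   # re-enlisted, but low on the list
            return
        time.sleep(RETRY_INTERVAL)

    # Still unreachable after 24 hours: give up and tell the administrator.
    send_admin_mail("node %s unreachable for 24h, please investigate" % lost.name)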

View from lost Backup:
If the node is still up and finds that it cannot reach the master and the other backup, it asks the list nodes to contact a DVM management node on its behalf. If those list nodes can reach another master DVM management node, the lost backup accepts the message that it drops its management status and stays on hold until it is fully accessible and functional again.
When that requirement is met, it applies to be put back on the list.
If it cannot reach any other list nodes, it holds its state until there is a contactable list node to confirm its status. It may be advisable that after a longer period of isolation (say 3 days) the lost backup DVM node writes its changes locally and shuts down.
If the backup has contact with other list nodes that cannot confirm its status, and those list nodes cannot connect to the original master or the other backup either, the remaining backup follows scenario 2.


Scenario 2, the DVM master and one DVM backup are lost:
The remaining backup detects that the other DVM management nodes are unavailable.
It asks the list nodes (again) to contact another DVM management node on its behalf. If the list nodes can reach another DVM management node besides itself, the backup follows scenario 1; otherwise it proclaims itself semi-master and waits until there are enough nodes on the DVM managing-hardware priority list (3 nodes: one spare, two for backup) to enlist two backups. It then promotes itself from semi-master to master and resumes operation, setting a flag that this is the x+1 generation of DVM management.
It then follows the scenario 1 procedure for recovering the lost backup nodes.
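
A sketch of that promotion step, again with made-up names; the generation flag it returns is what scenario 5 later compares:

# Hypothetical sketch of scenario 2: the surviving backup becomes semi-master,
# waits until the priority list holds at least 3 nodes (two backups + a spare),
# then promotes itself to master of generation x+1.

def promote_surviving_backup(backup, priority_list, generation):
    backup.role = "SEMI_MASTER"            # hold here until the list refills
    if len(priority_list) < 3:
        return None                        # keep waiting

    new_backups = priority_list[:2]        # enlist two new backups
    del priority_list[:2]
    for node in new_backups:
        node.role = "BACKUP"

    backup.role = "MASTER"
    return backup, new_backups, generation + 1   # flag the new generation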


Scenario 3, both backups are lost:
The master detects that both backups are lost and tries the usual reconnect procedure; if it still has no access, it demotes itself to backup and follows scenario 2.


Scenario 4, the master is lost:
View from the backups:
Both backup nodes try to contact the master directly and via list nodes. If the master is not accessible (with a timeout of 1 minute or so), a new master is selected from the list and operation resumes. The lost master is now considered a lost backup DVM management node, and the scenario 1 procedure can be followed for it.
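
Roughly, in code (placeholder names again; the one-minute timeout is assumed to have already expired before this is called):

# Rough sketch of scenario 4: if neither backup can reach the old master,
# a new master is taken from the priority list and the old master is treated
# as a lost backup from here on (scenario 1 procedure).

def replace_lost_master(backups, old_master, priority_list, can_reach):
    if any(can_reach(b, old_master) for b in backups):
        return old_master                 # false alarm, keep the current master

    new_master = priority_list.pop(0)
    new_master.role = "MASTER"
    old_master.role = "LOST_BACKUP"       # handled by the scenario 1 procedure
    return new_master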


Scenario 5, two or more masters reconnect after a major network problem:
All DVM management nodes are supposed to merge with the DVM management node that has the lowest generation flag. If two equal generation flags collide, the one with the shorter uptime merges into the one with the longest uptime.
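
The merge rule itself is just a comparison; a tiny sketch, assuming each master carries a generation counter and an uptime:

# Sketch of the scenario 5 merge rule: lowest generation flag wins; on a tie,
# the master with the longest uptime wins and the others merge into it.

def choose_surviving_master(masters):
    return min(masters, key=lambda m: (m.generation, -m.uptime))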


---
So that is the graceful part; now for how you can do this without interrupting services.


In my view, every node within the cluster donates a certain part of its resources. These resources are distributed and controlled by the DVM management nodes; the DVM management is a service on the native system, and only the master actively distributes the resources, while the backups are just failover as described above.
The DVM can abstract the resources into one (or more) virtual machines, which are installed with DragonFly (or another BSD) with an adapted kernel that is aware of its "virtual" state.
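
To illustrate what I mean by donating part of the resources, a toy sketch of how the master could add the donations up into one virtual machine's budget (Donation and its fields are invented for the example):

# Toy sketch: each node donates some resources, and the DVM master aggregates
# them into the budget of one virtual DragonFly instance.
from dataclasses import dataclass


@dataclass
class Donation:
    node: str
    cpu_shares: int   # e.g. percentage of one CPU donated
    ram_mb: int
    disk_mb: int


def virtual_machine_budget(donations):
    return {
        "cpu_shares": sum(d.cpu_shares for d in donations),
        "ram_mb": sum(d.ram_mb for d in donations),
        "disk_mb": sum(d.disk_mb for d in donations),
    }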


If you need full failover, you configure two virtual machines on the cluster. Although you lose some performance, you still have the advantage that when more performance is needed you just add more hardware to the network (for example a PXE boot image with a DragonFly install pre-configured to be part of the cluster).

I like to compare this "future" technology with a (dragonfly) compound eye: although there are many facets, there are only two eyes, and they cover the field of view almost completely :-)

A nice thought is scalability: you could combine a few DragonFly installs on the virtual hardware into a cluster again, and so on and so on.

I have the schematics quite clear in my mind, and I hope you guys can follow me (if I haven't bored you away anyway :-) ), because I know my explanation of this idea is quite bad and I am not sure which logical errors I've made. So I hope you can point them out to me.

mph


