Our goal is to construct a virtual MPI cluster. For brevity we will target a 2-node cluster, but the steps described may be repeated (or scripted) to add more nodes.
Please mind that it is perfectly possible to run and test MPI programs on a single node, as shown in the following example output of an SPMD program. In this sense, our experiment is more an attempt to explore and blend different technologies than to set up a production environment for doing MPI computations.
~% mpirun --np 5 ./a.out
[02/05 opensolaris]: I am a servant
[04/05 opensolaris]: I am a servant
[00/05 opensolaris]: I am the master
[01/05 opensolaris]: I am a servant
[03/05 opensolaris]: I am a servant
~%
The host operating system is OpenSolaris snv_129 / x86:
~% uname -a
SunOS opensolaris 5.11 snv_129 i86pc i386 i86pc Solaris
The machine is an Intel Core2 Quad CPU Q9550 @ 2.83GHz, 4-core, with 8GB RAM:
~% prtdiag
[snip]
--------------------------------  --------------------------
Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz   LGA 775
[snip]
The physical network interface is an on-board low-end Realtek:
~% pfexec scanpci | grep -i realtek
 Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
We will be using the following technologies:
Solaris Zones for process isolation and resource control
ZFS for clone / CoW (Copy on Write) capabilities and
Crossbow for network virtualisation
Our strategy will focus on creating a leading node that we will then replicate to form our computational cluster. This will save us both administration time and storage space: the invariant configuration steps are performed only once, and ZFS CoW (copy-on-write) keeps the replicas cheap.
For a start we need a filesystem to host our node.
# zfs create rpool/node1
# chmod -R 700 /rpool/node1
Every node will be equipped with two virtual network interfaces.
The first will be used by a node to communicate with other nodes in the same subnet via a virtual switch. This is the logical channel where messages and data will flow during MPI execution.
The second one will allow a node to access the Internet. We need it for installing the OpenMPI tools, at least on the leading node. Besides that, we may later decide to install some other tool that we had not accounted for when doing the initial cluster population via replication.
The naming convention we adhere to goes like this: every nodeX, with X = {1, 2, …}, will have vnicX (10.0.10.X) and vnicXX (10.0.0.10X). We use different subnets to insulate network traffic and to allow for fine-grained routing. The exact network topology is shown in the following ASCII diagram:
                     Internet
                         |
 +-----------------------+-----------------------+
 | 10.0.0.x              |                       |
 |                 modem/router                  |
 |                  10.0.0.138                   |
 |                        |                      |
 |                        | rge0                 |
 |      +-------------10.0.0.1------------+      |
 |      |                                 |      |
 |      | 10.0.0.101           10.0.0.102 |      |
 |    vnic11                         vnic22      |
 |      |                                 |      |
 +------+---------------------------------+------+
        |                                 |
   +----+----+                       +----+----+
   |  node1  |                       |  node2  |
   +----+----+                       +----+----+
        |                                 |
 +------+---------------------------------+------+
 |    vnic1            _______          vnic2    |
 | 10.0.10.1          /virtual\      10.0.10.2   |
 |      +-------------| switch |----------+      |
 |                    \_______/                  |
 | 10.0.10.x                                     |
 +-----------------------------------------------+
Therefore, we create the virtual network equipment:
# dladm create-etherstub etherstub0
# dladm create-vnic -l etherstub0 vnic1
# dladm create-vnic -l etherstub0 vnic2
# dladm create-vnic -l rge0 vnic11
# dladm create-vnic -l rge0 vnic22
I had problems with nwamd(1M), the network auto-magic daemon, so I disabled it (svcadm disable svc:/network/physical:nwam && svcadm enable svc:/network/physical:default). I also manually set up my host's network to use static IPs via /etc/hostname.<interface>, a default route with route(1M), nameservers in /etc/resolv.conf, and so on.
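For reference, here is a minimal sketch of that manual host-side setup, assuming the rge0 interface and the 10.0.0.1 / 10.0.0.138 addresses from the diagram above; your interface name, addresses, and nameserver will differ:

# svcadm disable svc:/network/physical:nwam
# svcadm enable svc:/network/physical:default
# echo "10.0.0.1 netmask 255.255.255.0 up" > /etc/hostname.rge0
# route -p add default 10.0.0.138
# echo "nameserver 10.0.0.138" >> /etc/resolv.conf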
And we verify the links we just created:
# dladm show-link
LINK        CLASS      MTU   STATE    BRIDGE  OVER
rge0        phys       1500  up       --      --
etherstub0  etherstub  9000  unknown  --      --
vnic1       vnic       9000  up       --      etherstub0
vnic2       vnic       9000  up       --      etherstub0
vnic11      vnic       1500  up       --      rge0
vnic22      vnic       1500  up       --      rge0
#
Zones support many configuration options, but we stick to the bare minimum for now. We can later review our setup.
# zonecfg -z node1
node1: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:node1> create
zonecfg:node1> set zonepath=/rpool/node1
zonecfg:node1> set ip-type=exclusive
zonecfg:node1> add net
zonecfg:node1:net> set physical=vnic1
zonecfg:node1:net> end
zonecfg:node1> add net
zonecfg:node1:net> set physical=vnic11
zonecfg:node1:net> end
zonecfg:node1> verify
zonecfg:node1> commit
zonecfg:node1> ^D
#
And verify:
# zoneadm list -vc
  ID NAME     STATUS      PATH           BRAND  IP
   0 global   running     /              ipkg   shared
   - node1    configured  /rpool/node1   ipkg   excl
# zoneadm -z node1 install
A ZFS file system has been created for this zone.
   Publisher: Using opensolaris.org (http://pkg.opensolaris.org/dev/ ).
       Image: Preparing at /rpool/node1/root.
       Cache: Using /var/pkg/download.
Sanity Check: Looking for 'entire' incorporation.
  Installing: Core System (output follows)
DOWNLOAD                   PKGS        FILES      XFER (MB)
Completed                 58/58  13856/13856    117.5/117.5

PHASE                      ACTIONS
Install Phase          19859/19859
No updates necessary for this image.
  Installing: Additional Packages (output follows)
DOWNLOAD                   PKGS        FILES      XFER (MB)
Completed                 35/35    3253/3253      19.6/19.6

PHASE                      ACTIONS
Install Phase            4303/4303

        Note: Man pages can be obtained by installing SUNWman
 Postinstall: Copying SMF seed repository ... done.
 Postinstall: Applying workarounds.
        Done: Installation completed in 921.697 seconds.

  Next Steps: Boot the zone, then log into the zone console (zlogin -C)
              to complete the configuration process.
#
And verify:
# zoneadm list -vc
  ID NAME     STATUS      PATH           BRAND  IP
   0 global   running     /              ipkg   shared
   - node1    installed   /rpool/node1   ipkg   excl
# zoneadm -z node1 boot
# zlogin -C node1
[Connected to zone 'node1' console]
87/87
Reading ZFS config: done.
Mounting ZFS filesystems: (4/4)
[snip]
What type of terminal are you using?
 1) ANSI Standard CRT
 2) DEC VT100
 3) PC Console
 4) Sun Command Tool
 5) Sun Workstation
 6) X Terminal Emulator (xterms)
 7) Other
Type the number of your choice and press Return:
[snip]
Finally our setup should look like this:
Primary network interface: vnic1
Secondary network interfaces: vnic11
Host name: node1
IP address: 10.0.10.1
System part of a subnet: Yes
Netmask: 255.255.0.0
Enable IPv6: No
Default Route: 10.0.0.138
And like this for vnic11:
Use DHCP: No
Host name: node1-1
IP address: 10.0.0.101
System part of a subnet: Yes
Netmask: 255.255.255.0
Enable IPv6: No
Default Route: None
We verify:
# zoneadm list -vc
  ID NAME     STATUS    PATH           BRAND  IP
   0 global   running   /              ipkg   shared
   3 node1    running   /rpool/node1   ipkg   excl
The operating system uses a number of databases of information about hosts, ipnodes, users, and so on. Data for these can originate from a variety of sources: hostnames and host addresses, for example, can be found in /etc/hosts, NIS/NIS+, DNS, LDAP, and so on.
Here, we will be using DNS:
root@node1:~# cp /etc/nsswitch.dns /etc/nsswitch.conf
Alternatively, you can edit /etc/nsswitch.conf and append dns to the hosts and ipnodes entries.
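As a sketch, the relevant lines in /etc/nsswitch.conf would then look something like the following (the exact set of sources may differ on your system):

hosts:      files dns
ipnodes:    files dns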
We edit the /etc/resolv.conf file and add our DNS servers.
root@node1:~# cat /etc/resolv.conf
domain lan
nameserver 194.177.210.210
nameserver 197.177.210.211
nameserver 208.67.222.222
nameserver 208.67.220.220
nameserver 10.0.0.138
root@node1:~#
At this point we should be able to access the Internet:
root@node1:~# traceroute www.gooogle.com
traceroute: Warning: www.gooogle.com has multiple addresses; using 209.85.229.106
traceroute: Warning: Multiple interfaces found; using 10.0.0.101 @ vnic11
traceroute to www.gooogle.com (209.85.229.106), 30 hops max, 40 byte packets
 1  10.0.0.138 (10.0.0.138)  0.421 ms  0.368 ms  0.325 ms
 2  r.edudsl.gr (83.212.27.202)  19.534 ms  19.321 ms  19.421 ms
 3  grnetRouter.edudsl.athens3.access-link.grnet.gr (194.177.209.193)  19.504 ms  19.054 ms  19.384 ms
[snip]
root@node1:~#
I mistyped www.gooogle.com, but it seems that Google owns that domain as well.
The hosts file is used to map hostnames to IP addresses. We will assign an alias to every node, so that we can reference them easily, in /etc/hosts:
10.0.10.1 node1
10.0.10.2 node2
At this point we should be able to access other clients in our virtual LAN:
root@node1:~# ping node1
node1 is alive
root@node1:~# ping node2
node2 is alive
root@node1:~#
And, from node2:
root@node2:~# ping node1
node1 is alive
root@node2:~# ping node2
node2 is alive
root@node2:~#
Every node that takes part in the computational cluster needs to have the OpenMPI tools installed. As of this writing, the most recent version is clustertools_8.1.
~% pfexec zlogin node1
[Connected to zone 'node1' pts/1]
Last login: Mon Dec 28 14:42:06 on pts/2
Sun Microsystems Inc.   SunOS 5.11      snv_129 November 2008
root@node1:~# pkg install clustertools_8.1
DOWNLOAD     PKGS      FILES    XFER (MB)
Completed     2/2  1474/1474    13.7/13.7

PHASE          ACTIONS
Install Phase  1696/1696
root@node1:~#
The compilers that come with the clustertools package are only wrappers around the "true" compilers, like gcc or Sun's, which need to be present on the system.
According to the Sun HPC ClusterTools 8.1 Software User's Guide, the only supported compilers for Solaris systems are Sun's. We, though, have managed to make OpenMPI work with gcc-3. Therefore, you need to do:
root@node1:~# pkg install clustertools_8.1
[snip]
Append the following line to your .profile file:
export PATH=$PATH:/opt/SUNWhpc/HPC8.1/sun/bin
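To check that the wrapper is picked up from the new PATH and to see which underlying compiler it would invoke, OpenMPI's wrapper compilers accept a -showme flag; a quick sanity check (the output will vary with your installation):

root@node1:~# which mpicc
/opt/SUNWhpc/HPC8.1/sun/bin/mpicc
root@node1:~# mpicc -showme
[prints the full command line the wrapper would pass to the underlying compiler]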
We halt the zone:
# zoneadm -z node1 halt
We extract the leading zone's configuration to use it as a template:
# zonecfg -z node1 export -f ./node1.cfg
And verify:
# cat ./node1.cfg
create -b
set zonepath=/rpool/node1
set brand=ipkg
set autoboot=false
set ip-type=exclusive
add net
set physical=vnic1
end
add net
set physical=vnic11
end
#
We edit node1.cfg and adapt it to our needs: we change the zonepath to /rpool/node2 and the vnics to vnic2 and vnic22, respectively, as sketched below.
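A sketch of what the edited node1.cfg might look like before we feed it to zonecfg (only the zonepath and vnic lines change):

create -b
set zonepath=/rpool/node2
set brand=ipkg
set autoboot=false
set ip-type=exclusive
add net
set physical=vnic2
end
add net
set physical=vnic22
end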
# zonecfg -z node2 -f ./node1.cfg
# zoneadm -z node2 clone node1
# zoneadm list -vc
  ID NAME     STATUS     PATH           BRAND  IP
   0 global   running    /              ipkg   shared
   - node1    installed  /rpool/node1   ipkg   excl
   - node2    installed  /rpool/node2   ipkg   excl
#
The new node2 takes up only a few hundred kilobytes of storage space (the USED column in the zfs list output):
# zfs list | head -n1
NAME                   USED  AVAIL  REFER  MOUNTPOINT
# zfs list | grep node
rpool/node1            562M  39.7G    22K  /rpool/node1
rpool/node1/ROOT       562M  39.7G    19K  legacy
rpool/node1/ROOT/zbe   562M  39.7G   562M  legacy
rpool/node2            263K  39.7G    23K  /rpool/node2
rpool/node2/ROOT       240K  39.7G    19K  legacy
rpool/node2/ROOT/zbe   221K  39.7G   562M  legacy
#
What really happened when we cloned the zone is that a snapshot of the rpool/node1 filesystem was taken, and a new filesystem, rpool/node2, was created on top of that.
Due to the CoW nature of ZFS, the second filesystem occupies space only for the content that differs from the ancestor filesystem. The more the two filesystems diverge over time, the more space will be occupied.
# zfs list -t snapshot
NAME                              USED  AVAIL  REFER  MOUNTPOINT
[snip]
rpool/node1/ROOT/zbe@node2_snap      0      -   562M  -
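For the curious, the clone step boils down to ZFS operations roughly like the following; this is only a sketch of what zoneadm does on our behalf, not something to run by hand on a managed zone:

# zfs snapshot rpool/node1/ROOT/zbe@node2_snap
# zfs clone rpool/node1/ROOT/zbe@node2_snap rpool/node2/ROOT/zbe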
Normally, the non-global zones will lag behind as the global zone is regularly updated. It is therefore desirable to keep all zones (global and non-global) in sync with each other, as follows:
# zoneadm -z node1 halt
# zoneadm -z node2 halt
#
# zoneadm -z node1 detach
# zoneadm -z node2 detach
#
# ... update global zone ...
#
# zoneadm -z node1 attach -u
# zoneadm -z node2 attach -u
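The global-zone update step elided above would typically be an image update with the stock IPS tooling; a hedged example (your publisher and build will differ):

# pkg image-update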
At last, we need an actual MPI program to test the infrastructure:
root@node1:~# cat test.c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
        char procname[MPI_MAX_PROCESSOR_NAME];
        int len, nprocs, rank;

        /* Initialize MPI environment. */
        MPI_Init(&argc, &argv);

        /* Get size and rank. */
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Get processor name -- no shit, Sherlock! */
        MPI_Get_processor_name(procname, &len);

        if (rank == 0) {
                printf("[%02d/%02d %s]: I am the master\n",
                    rank, nprocs, procname);
        } else {
                printf("[%02d/%02d %s]: I am a servant\n",
                    rank, nprocs, procname);
        }

        /* We are done -- cleanup. */
        MPI_Finalize();

        return (EXIT_SUCCESS);
}
root@node1:~#
root@node1:~# mpicc test.c
root@node1:~#
Starting jobs from node1:
root@node1:~# mpirun -np 5 --host node1,node2 ./a.out
Password:
[02/05 node1]: I am a servant
[00/05 node1]: I am the master
[04/05 node1]: I am a servant
[03/05 node2]: I am a servant
[01/05 node2]: I am a servant
root@node1:~# mpirun -np 5 --host node2,node1 ./a.out
Password:
[01/05 node1]: I am a servant
[03/05 node1]: I am a servant
[00/05 node2]: I am the master
[02/05 node2]: I am a servant
[04/05 node2]: I am a servant
root@node1:~#
Starting jobs from node2:
root@node2:~# mpirun -np 5 --host node2,node1 ./a.out
Password:
[02/05 node2]: I am a servant
[04/05 node2]: I am a servant
[00/05 node2]: I am the master
[03/05 node1]: I am a servant
[01/05 node1]: I am a servant
root@node2:~# mpirun -np 5 --host node1,node2 ./a.out
Password:
[03/05 node2]: I am a servant
[02/05 node1]: I am a servant
[01/05 node2]: I am a servant
[00/05 node1]: I am the master
[04/05 node1]: I am a servant
root@node2:~#
Finally, if you ever need to redo a zone's initial system configuration, boot and log into the zone:
# zoneadm -z mynode boot
# zlogin mynode
[Connected to zone 'mynode' pts/1]
Last login: Thu Dec 24 14:38:51 on pts/1
Sun Microsystems Inc.   SunOS 5.11      snv_129 November 2008
root@mynode:~#
And run:
root@mynode:~# sys-unconfig
WARNING

This program will unconfigure your system.  It will cause it
to revert to a "blank" system - it will not have a name or know
about other systems or networks.

This program will also halt the system.

Do you want to continue (y/n) ? y
sys-unconfig started Sat Dec 26 04:08:17 2009
rm: //etc/vfstab.sys-u: No such file or directory
sys-unconfig completed Sat Dec 26 04:08:17 2009
Halting system...
svc.startd: The system is coming down.  Please wait.
svc.startd: 54 system services are now being stopped.
svc.startd: Killing user processes.
Dec 26 04:08:25 The system is down.  Shutdown took 6 seconds.
[NOTICE: Zone halted]
The next time you zlogin -C to the zone, you will go through the setup process again.