1. Overview

1.1. Goal

Our goal is to construct a virtual MPI cluster. For reasons of brevity we will target a 2-node cluster, but the described steps may be repeated (or scripted) to allow for the addition of more nodes.
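
As a taste of what such scripting could look like, here is a rough sketch that strings together the per-node commands detailed in sections 2 and 3; the naming scheme, the node1.cfg template and the existence of an installed node1 zone are all explained there:

#!/bin/sh
# Sketch: add nodeN to the cluster (run as root in the global zone,
# with node1 installed, halted and exported to ./node1.cfg).
N=$1

# Virtual NICs for the new node, following the vnicN / vnicNN convention.
dladm create-vnic -l etherstub0 vnic$N
dladm create-vnic -l rge0 vnic$N$N

# Derive the zone configuration from the node1 template.
sed -e "s/node1/node$N/" \
    -e "s/vnic11/vnic$N$N/" \
    -e "s/vnic1\$/vnic$N/" ./node1.cfg > ./node$N.cfg

# Create the zone and clone it from node1 (cheap, thanks to ZFS).
zonecfg -z node$N -f ./node$N.cfg
zoneadm -z node$N clone node1

# Boot and complete the system configuration interactively:
#   zoneadm -z node$N boot && zlogin -C node$N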

Caution: Note that it is perfectly possible to run and test MPI programs on a single node, as shown in the following example output of a simple SPMD program. In this sense, our experiment is more an attempt to explore and blend different technologies than to set up a production environment for MPI computations.
~% mpirun --np 5 ./a.out
[02/05 opensolaris]: I am a servant
[04/05 opensolaris]: I am a servant
[00/05 opensolaris]: I am the master
[01/05 opensolaris]: I am a servant
[03/05 opensolaris]: I am a servant
~%

1.2. Platform specifications

Host operating system is OpenSolaris snv build 129 / x86:

~% uname -a
SunOS opensolaris 5.11 snv_129 i86pc i386 i86pc Solaris

The machine is an Intel Core2 Quad CPU Q9550 @ 2.83GHz, 4-core, with 8GB RAM:

~% prtdiag
[snip]
-------------------------------- --------------------------
Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz LGA 775
[snip]

The physical network interface is an on-board low-end Realtek:

~% pfexec scanpci | grep -i realtek
 Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller

1.3. Technologies

We will be using the following technologies:

- OpenSolaris zones (ipkg brand), to host the cluster nodes
- Crossbow network virtualization (VNICs and etherstubs via dladm), to wire the nodes together
- ZFS, whose snapshots and clones make node replication cheap
- OpenMPI, as shipped in Sun HPC ClusterTools 8.1

2. Initial node creation

Our strategy is to create a leading node that we will then replicate to form our computational cluster. This saves us both administration time and storage space: the invariant configuration steps are performed only once, and ZFS CoW (copy-on-write) capabilities keep the replicas cheap.

2.1. Filesystem

For a start we need a filesystem to host our node.

# zfs create rpool/node1
# chmod -R 700 /rpool/node1

2.2. Virtual network interfaces

Every node will be equipped with two virtual network interfaces.

The first will be used by a node to communicate with other nodes in the same subnet via a virtual switch. This is the logical channel where messages and data will flow during MPI execution.

The second one will allow a node to access the Internet. We need it for installing the OpenMPI tools, at least on the leading node. Besides that, we may later decide to install some other tool that we had not accounted for when initially populating the cluster via replication.

The naming convention we adhere to is as follows: every nodeX, with X = {1, 2, …}, will have vnicX (10.0.10.X) and vnicXX (10.0.0.10X). We use different subnets to insulate network traffic and to allow for fine-grained routing. The exact network topology is shown in the following ASCII diagram:

                    Internet
                       |
+----------------------+-----------------------+
|10.0.0.x              |                       |
|                 modem/router                 |
|                  10.0.0.138                  |
|                      |                       |
|                      |                       |
|                      |                       |
|                     rge0                     |
|    +---------------10.0.0.1-------------+    |
|    |                                    |    |
|    |                                    |    |
|    |                                    |    |
|10.0.0.101                          10.0.0.102|
|  vnic11                               vnic22 |
|    |                                    |    |
+----+------------------------------------+----+
     |                                    |
     |                                    |
     |                                    |
+----+----+                          +----+----+
|         |                          |         |
|  node1  |                          |  node2  |
|         |                          |         |
+----+----+                          +----+----+
     |                                    |
     |                                    |
     |                                    |
+----+------------------------------------+----+
|  vnic1             /     \            vnic2  |
|10.0.10.1          /virtual\         10.0.10.2|
|                   |switch |                  |
|                   \       /                  |
|10.0.10.x           \     /                   |
+----------------------------------------------+

Therefore, we create the virtual network equipment:

# dladm create-etherstub etherstub0
# dladm create-vnic -l etherstub0 vnic1
# dladm create-vnic -l etherstub0 vnic2
# dladm create-vnic -l rge0 vnic11
# dladm create-vnic -l rge0 vnic22
Caution: I had problems with nwamd(1M), the network auto-magic daemon, so I disabled it (svcadm disable svc:/network/physical:nwam && svcadm enable svc:/network/physical:default). I also manually set up my host's network to use static IPs via /etc/hostname.<interface>, a default route with route(1M), nameservers in /etc/resolv.conf, and so on.
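
For reference, a minimal sketch of such a manual setup on the global zone follows. The addresses are assumptions taken from the diagram above (rge0 at 10.0.0.1 with a /24 netmask, default gateway 10.0.0.138); adjust them to your own network:

# echo "10.0.0.1 netmask 255.255.255.0 up" > /etc/hostname.rge0
# route -p add default 10.0.0.138
# echo "nameserver 10.0.0.138" >> /etc/resolv.conf
# svcadm restart svc:/network/physical:default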

And verify:

# dladm show-link
LINK        CLASS     MTU    STATE    BRIDGE     OVER
rge0        phys      1500   up       --         --
etherstub0  etherstub 9000   unknown  --         --
vnic1       vnic      9000   up       --         etherstub0
vnic2       vnic      9000   up       --         etherstub0
vnic11      vnic      1500   up       --         rge0
vnic22      vnic      1500   up       --         rge0
#

2.3. Zone setup

2.3.1. Configure

Zones support many configuration options, but we stick to the bare minimum for now. We can always review our setup later, as shown right after the session below.

# zonecfg -z node1
node1: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:node1> create
zonecfg:node1> set zonepath=/rpool/node1
zonecfg:node1> set ip-type=exclusive
zonecfg:node1> add net
zonecfg:node1:net> set physical=vnic1
zonecfg:node1:net> end
zonecfg:node1> add net
zonecfg:node1:net> set physical=vnic11
zonecfg:node1:net> end
zonecfg:node1> verify
zonecfg:node1> commit
zonecfg:node1> ^D
#
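
As noted above, the configuration can be reviewed at any time with zonecfg's info subcommand:

# zonecfg -z node1 info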

And verify:

# zoneadm list -vc
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              ipkg     shared
   - node1            configured /rpool/node1                   ipkg     excl

2.3.2. Install

# zoneadm -z node1 install
A ZFS file system has been created for this zone.
   Publisher: Using opensolaris.org (http://pkg.opensolaris.org/dev/ ).
       Image: Preparing at /rpool/node1/root.
       Cache: Using /var/pkg/download.
Sanity Check: Looking for 'entire' incorporation.
  Installing: Core System (output follows)
DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                                58/58 13856/13856  117.5/117.5

PHASE                                        ACTIONS
Install Phase                            19859/19859
No updates necessary for this image.
  Installing: Additional Packages (output follows)
DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                                35/35   3253/3253    19.6/19.6

PHASE                                        ACTIONS
Install Phase                              4303/4303

        Note: Man pages can be obtained by installing SUNWman
 Postinstall: Copying SMF seed repository ... done.
 Postinstall: Applying workarounds.
        Done: Installation completed in 921.697 seconds.

  Next Steps: Boot the zone, then log into the zone console (zlogin -C)
              to complete the configuration process.
#

And verify:

# zoneadm list -vc
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              ipkg     shared
   - node1            installed  /rpool/node1                   ipkg     excl

2.3.3. Boot & Login

# zoneadm -z node1 boot
# zlogin -C node1
[Connected to zone 'node1' console]
87/87
Reading ZFS config: done.
Mounting ZFS filesystems: (4/4)

[snip]

What type of terminal are you using?
 1) ANSI Standard CRT
 2) DEC VT100
 3) PC Console
 4) Sun Command Tool
 5) Sun Workstation
 6) X Terminal Emulator (xterms)
 7) Other
Type the number of your choice and press Return:

[snip]

Finally our setup should look like this:

      Primary network interface: vnic1
    Secondary network interfaces: vnic11
                       Host name: node1
                      IP address: 10.0.10.1
         System part of a subnet: Yes
                         Netmask: 255.255.0.0
                     Enable IPv6: No
                   Default Route: 10.0.0.138

And like this for vnic11:

                   Use DHCP: No
                  Host name: node1-1
                 IP address: 10.0.0.101
    System part of a subnet: Yes
                    Netmask: 255.255.255.0
                Enable IPv6: No
              Default Route: None

We verify:

# zoneadm list -vc
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              ipkg     shared
   3 node1            running    /rpool/node1                   ipkg     excl

2.4. Internet and LAN access

2.4.1. /etc/nsswitch.conf

The operating system uses a number of databases of information about hosts, ipnodes, users, etc. Data for these can originate from a variety of sources: hostnames and host addresses, for example, can be found in /etc/hosts, NIS/NIS+, DNS, LDAP, and so on.

Here, we will be using DNS:

root@node1:~# cp /etc/nsswitch.dns /etc/nsswitch.conf
Note: Alternatively, you can edit /etc/nsswitch.conf and append dns to the hosts and ipnodes entries, as sketched below.
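
A sketch of the relevant lines after such an edit (assuming the stock file lists only files for these entries):

hosts:      files dns
ipnodes:    files dns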

2.4.2. DNS servers

We edit /etc/resolv.conf file, and add our DNS servers.

root@node1:~# cat /etc/resolv.conf
domain lan
nameserver 194.177.210.210
nameserver 197.177.210.211
nameserver 208.67.222.222
nameserver 208.67.220.220
nameserver 10.0.0.138
root@node1:~#

At this point we should be able to access the Internet:

root@node1:~# traceroute www.gooogle.com
traceroute: Warning: www.gooogle.com has multiple addresses; using 209.85.229.106
traceroute: Warning: Multiple interfaces found; using 10.0.0.101 @ vnic11
traceroute to www.gooogle.com (209.85.229.106), 30 hops max, 40 byte packets
 1  10.0.0.138 (10.0.0.138)  0.421 ms  0.368 ms  0.325 ms
 2  r.edudsl.gr (83.212.27.202)  19.534 ms  19.321 ms  19.421 ms
 3  grnetRouter.edudsl.athens3.access-link.grnet.gr (194.177.209.193)  19.504 ms  19.054 ms  19.384 ms
[snip]
root@node1:~#
Note: I mistyped www.gooogle.com, but it seems that Google owns that domain as well.

2.4.3. /etc/hosts

The hosts file is used to map hostnames to IP addresses. We will assign a name to every node, so that we can reference them easily, in /etc/hosts:

10.0.10.1       node1
10.0.10.2       node2

At this point we should be able to access other clients in our virtual LAN:

root@node1:~# ping node1
node1 is alive
root@node1:~# ping node2
node2 is alive
root@node1:~#

And, from node2:

root@node2:~# ping node1
node1 is alive
root@node2:~# ping node2
node2 is alive
root@node2:~#

2.5. OpenMPI tools installation

Every node that takes part in the computational cluster needs to have the OpenMPI tools installed. As of this writing, the most recent version is clustertools_8.1.

~% pfexec zlogin node1
[Connected to zone 'node1' pts/1]
Last login: Mon Dec 28 14:42:06 on pts/2
Sun Microsystems Inc.   SunOS 5.11      snv_129 November 2008
root@node1:~# pkg install clustertools_8.1
DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                                  2/2   1474/1474    13.7/13.7

PHASE                                        ACTIONS
Install Phase                              1696/1696
root@node1:~#

2.6. Compiler installation

The compilers that come with the clustertools package are only wrappers around the "true" compilers, such as gcc or Sun Studio, which need to be present on the system.

According to the Sun HPC ClusterTools 8.1 Software User's Guide, the only supported compilers on Solaris systems are Sun's. We have, however, managed to make OpenMPI work with gcc-3. Therefore, we install it:

root@node1:~# pkg install gcc-3
[snip]

2.7. Setup PATH

Append the following line to your .profile file:

export PATH=$PATH:/opt/SUNWhpc/HPC8.1/sun/bin
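
To verify that the wrappers are now picked up, something like the following should work (mpicc --showme is an OpenMPI wrapper option that prints the underlying compiler command line; its output is elided here):

root@node1:~# . ~/.profile
root@node1:~# which mpicc
/opt/SUNWhpc/HPC8.1/sun/bin/mpicc
root@node1:~# mpicc --showme
[underlying compiler command line]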

3. Node replication

3.1. Export node configuration

We halt the zone:

# zoneadm -z node1 halt

We extract the leading zone's configuration to use it as a template:

# zonecfg -z node1 export -f ./node1.cfg

And verify:

# cat ./node1.cfg
create -b
set zonepath=/rpool/node1
set brand=ipkg
set autoboot=false
set ip-type=exclusive
add net
set physical=vnic1
end
add net
set physical=vnic11
end
#

We edit node1.cfg and adapt it to the new node: we change the zonepath to /rpool/node2 and the vnics to vnic2 and vnic22, respectively, as sketched below.
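
The resulting node2.cfg should look roughly like this (derived from the export above; only the zonepath and the two physical links change):

create -b
set zonepath=/rpool/node2
set brand=ipkg
set autoboot=false
set ip-type=exclusive
add net
set physical=vnic2
end
add net
set physical=vnic22
end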

3.2. Zone cloning

# zonecfg -z node2 -f ./node1.cfg
# zoneadm -z node2 clone node1
# zoneadm list -vc
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              ipkg     shared
   - node1            installed  /rpool/node1                   ipkg     excl
   - node2            installed  /rpool/node2                   ipkg     excl
#

The new node2 takes up only a few hundred kilobytes of storage space (the USED column in the zfs list output):

# zfs list | head -n1
NAME                        USED  AVAIL  REFER  MOUNTPOINT
# zfs list | grep node
rpool/node1                 562M  39.7G    22K  /rpool/node1
rpool/node1/ROOT            562M  39.7G    19K  legacy
rpool/node1/ROOT/zbe        562M  39.7G   562M  legacy
rpool/node2                 263K  39.7G    23K  /rpool/node2
rpool/node2/ROOT            240K  39.7G    19K  legacy
rpool/node2/ROOT/zbe        221K  39.7G   562M  legacy
#

What really happened when we cloned the zone is that a snapshot of node1's root filesystem (rpool/node1/ROOT/zbe@node2_snap) was taken, and node2's filesystem was created as a ZFS clone on top of it.

Due to the CoW nature of ZFS, the cloned filesystem occupies space only for its differences in content with respect to the ancestor filesystem. The more the two filesystems diverge over time, the more space will be occupied.

# zfs list -t snapshot
NAME                                           USED  AVAIL  REFER  MOUNTPOINT
[snip]
rpool/node1/ROOT/zbe@node2_snap                   0      -   562M  -
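
The parent/child relationship can also be queried directly through the clone's origin property, which should point back to the snapshot listed above (rpool/node1/ROOT/zbe@node2_snap):

# zfs get origin rpool/node2/ROOT/zbe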

4. Node updating

Normally, the non-global zones will lag behind as the global zone is regularly updated. It is therefore desirable to keep all nodes (global and non-global) in sync with each other, as follows:

# zoneadm -z node1 halt
# zoneadm -z node2 halt
#
# zoneadm -z node1 detach
# zoneadm -z node2 detach
#
# ... update global zone ...
#
# zoneadm -z node1 attach -u
# zoneadm -z node2 attach -u
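
The elided update step depends on how you keep the global zone current; on OpenSolaris it typically boils down to an IPS image update against your configured publisher:

# pkg image-update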

5. Actual test

At last, we need an actual MPI program to test the infrastructure:

root@node1:~# cat test.c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
        char procname[MPI_MAX_PROCESSOR_NAME];
        int len, nprocs, rank;

        /* Initialize MPI environment. */
        MPI_Init(&argc, &argv);

        /* Get size and rank. */
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Get processor name -- no shit, Sherlock! */
        MPI_Get_processor_name(procname, &len);

        if (rank == 0) {
                printf( "[%02d/%02d %s]: I am the master\n",
                        rank, nprocs, procname);
        } else {
                printf( "[%02d/%02d %s]: I am a servant\n",
                        rank, nprocs, procname);
        }

        /* We are done -- cleanup */
        MPI_Finalize();

        return (EXIT_SUCCESS);
}
root@node1:~#
root@node1:~# mpicc test.c
root@node1:~#

Starting jobs from node1:

root@node1:~# mpirun -np 5 --host node1,node2 ./a.out
Password:
[02/05 node1]: I am a servant
[00/05 node1]: I am the master
[04/05 node1]: I am a servant
[03/05 node2]: I am a servant
[01/05 node2]: I am a servant
root@node1:~# mpirun -np 5 --host node2,node1 ./a.out
Password:
[01/05 node1]: I am a servant
[03/05 node1]: I am a servant
[00/05 node2]: I am the master
[02/05 node2]: I am a servant
[04/05 node2]: I am a servant
root@node1:~#

Starting jobs from node2:

root@node2:~# mpirun -np 5 --host node2,node1 ./a.out
Password:
[02/05 node2]: I am a servant
[04/05 node2]: I am a servant
[00/05 node2]: I am the master
[03/05 node1]: I am a servant
[01/05 node1]: I am a servant
root@node2:~# mpirun -np 5 --host node1,node2 ./a.out
Password:
[03/05 node2]: I am a servant
[02/05 node1]: I am a servant
[01/05 node2]: I am a servant
[00/05 node1]: I am the master
[04/05 node1]: I am a servant
root@node2:~#
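
The Password: prompts above come from the remote process launcher asking to authenticate on the other node. If you want job startup to be non-interactive, you can set up key-based ssh authentication between the nodes. A sketch, assuming mpirun launches its remote processes over ssh and that ~/.ssh already exists with the right permissions on the target node:

root@node1:~# ssh-keygen -t rsa
root@node1:~# cat ~/.ssh/id_rsa.pub | ssh node2 'cat >> ~/.ssh/authorized_keys'

Repeat in the opposite direction if you also start jobs from node2.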

6. Troubleshooting

6.1. Reset a zone to its unconfigured state

Boot and log into the zone:

# zoneadm -z mynode boot
# zlogin mynode
[Connected to zone 'mynode' pts/1]
Last login: Thu Dec 24 14:38:51 on pts/1
Sun Microsystems Inc.   SunOS 5.11      snv_129 November 2008
root@mynode:~#

And run:

root@mynode:~# sys-unconfig
                        WARNING

This program will unconfigure your system.  It will cause it
to revert to a "blank" system - it will not have a name or know
about other systems or networks.

This program will also halt the system.

Do you want to continue (y/n) ? y
sys-unconfig started Sat Dec 26 04:08:17 2009
rm: //etc/vfstab.sys-u: No such file or directory
sys-unconfig completed Sat Dec 26 04:08:17 2009
Halting system...
svc.startd: The system is coming down.  Please wait.
svc.startd: 54 system services are now being stopped.
svc.startd: Killing user processes.
Dec 26 04:08:25 The system is down.  Shutdown took 6 seconds.

[NOTICE: Zone halted]

The next time you zlogin -C to the zone, you will go through the setup process again.