


A PC cluster with high speed network interconnects

R.Innocente $^{1,2}$ , M.Corbatto $^{1,2}$, S.Cozzini $^{1,3}$
$^{1}$ Abdus Salam ICTP, 34014 Trieste (Italy)
$^{2}$ SISSA, Via Beirut 2/4, 34014 Trieste (Italy)
$^{3}$ INFM, SISSA Trieste Unit

August 10, 2000

Abstract:

Following a previous project [1], a lightweight cluster of 20 single-processor nodes interconnected by Fast Ethernet on which the feasibility of large scientific computations on PC clusters was explored, we report on the ongoing realization and evaluation of a PC cluster with multiple high-speed interconnects.1

Cluster configuration

The cluster is composed of 8 dual-processor computing nodes interconnected by 3 different network technologies: Fast Ethernet, Gigabit Ethernet and Myrinet. A service node is connected only to the Fast Ethernet network and acts as DHCP/boot/file server and batch scheduler.

Figure 1: Network interconnections
\begin{figure}
\centerline {\epsfig{file=net.ps,height=5cm,angle=0}}
\end{figure}

Each computing node has 2 Pentium III (Katmai) processors running at 550 MHz. These processors have separate on-chip 16 KB instruction and 16 KB data level-1 caches, and an off-chip (but in-package) 512 KB level-2 cache on a separate bus (the Back Side Bus) running at half the core frequency.

Figure 2: Computing node 2xPIII-550 + 440BX
\begin{figure}
\centerline {\epsfig{file=intel.ps,height=7cm}}
\end{figure}

The processor bus (Front Side Bus) runs at 100 MHz. On the Front Side Bus, the Intel 440BX chipset manages the memory and the peripheral buses (a 32-bit, 33 MHz PCI bus). The memory is commodity 100 MHz SDRAM; each computing node has 256 MB.

Figure 3: P3-550 Cache/Memory performance
\begin{figure}
\centerline {\epsfig{file=p3-550.ps,height=8cm,angle=-90}}
\end{figure}
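The shape of the curve in figure 3 can be reproduced, at least qualitatively, with a simple strided-access loop that walks buffers of increasing size, so that the working set falls first into the 16 KB L1 data cache, then into the 512 KB L2 cache, and finally into SDRAM. The following C program is only an illustrative sketch (it is not the benchmark actually used for the figure; buffer sizes, stride and iteration count are arbitrary choices):

/* Illustrative cache/memory probe: strided reads over buffers of
 * increasing size.  Not the benchmark used for figure 3; sizes,
 * stride and iteration count are arbitrary. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const size_t stride = 64;               /* bytes between accesses      */
    const long iters = 20000000L;           /* reads per buffer size       */
    size_t size;

    for (size = 4 * 1024; size <= 8 * 1024 * 1024; size *= 2) {
        volatile char *buf = malloc(size);
        size_t pos;
        long n;
        clock_t t0;
        double ns;

        for (pos = 0; pos < size; pos += stride)   /* warm the buffer */
            buf[pos] = 0;

        t0 = clock();
        for (n = 0, pos = 0; n < iters; n++) {     /* timed strided reads */
            (void)buf[pos];
            pos += stride;
            if (pos >= size)
                pos = 0;
        }
        ns = 1e9 * (double)(clock() - t0) / CLOCKS_PER_SEC / iters;

        printf("%8lu bytes: %6.2f ns per access\n",
               (unsigned long)size, ns);
        free((void *)buf);
    }
    return 0;
}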

Software setup

Because we need to switch kernels, drivers and software setups frequently, from the beginning we planned to reduce the system management overhead to a minimum. For this reason we installed netboot EPROMs on the Fast Ethernet cards, and all the computing nodes access their root and usr partitions on the service node via NFS (a rough sketch of this configuration is given below). The nodes currently run a Linux 2.2.14 SMP-enabled kernel. Each computing node has a local disk that is used mainly as a temporary scratch area and swap space. We chose PBS (Portable Batch System) as our batch system, and we mainly use the Fortran compilers from the Portland Group (PGI).
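As a rough illustration of this diskless setup (the addresses, paths and boot parameters below are hypothetical placeholders, not a transcript of our configuration), the service node exports the root and usr trees over NFS and the netbooted 2.2 kernel mounts its root from there:

# /etc/exports on the service node (hypothetical paths and addresses)
/export/nodes/root   192.168.1.0/255.255.255.0(rw,no_root_squash)
/usr                 192.168.1.0/255.255.255.0(ro)

# kernel parameters for the netbooted nodes (NFS root over the service network)
root=/dev/nfs nfsroot=192.168.1.254:/export/nodes/root ip=bootp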

Network hardware

The service network, a Fast Ethernet, supports remote boot of the computing nodes, a common file system via NFS, remote logins, etc. For it we used 3Com 3C905B cards and an Allied Telesyn switch; the strictest requirement for this network was the availability of netboot EPROMs for the cards. We also used its performance as a frame of reference in some of the comparisons. The aim of the cluster is to characterize the performance of combinations of high-speed interconnects and communication software; as high-performance interconnects we installed Myrinet and Gigabit Ethernet.

Protocols

The quality of the communication software weighs heavily on the overall performance of a cluster. The protocol stacks available on our interconnects are quite different: TCP/IP (available over Fast Ethernet, Gigabit Ethernet and also over Myrinet), the Myrinet-specific message-passing layers GM and BIP, and MPI (MPICH) layered on top of either TCP or GM; their base performance is compared below.

Base Performance

The tests were run mainly on four dual-processor nodes, using BIP 0.99, GM 1.2 and MPICH 1.2.0, the latter both over TCP and over GM. During the tests a considerable difference in TCP latency was observed when enabling or disabling the SMP capability of the Linux kernel: it is due to the overhead of locking and unlocking the internal structures to assure mutually exclusive access when there are multiple CPUs. The reported latencies over Myrinet include the switch delay, while for Fast Ethernet both the direct-link and the switched values were measured.

LATENCY (us)                 SMP kernel   UP kernel
TCP loop-back                        53          29
TCP 100 Mb/s direct link             98          53
TCP 100 Mb/s switched               105          61
TCP 1 Gb/s direct link              108          70
TCP over Myrinet                    103          63
GM over Myrinet                      15          15
BIP over Myrinet                      5        n.a.
MPICH/TCP over 100 Mb/s             172        n.a.
MPICH/TCP over 1 Gb/s               175        n.a.
MPICH/GM over Myrinet                15        n.a.
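These latencies were obtained with simple ping-pong tests, in which a small message bounces between two nodes and half the average round-trip time is taken as the one-way latency. The following MPI program is a minimal sketch of such a test (our own illustration, not the exact benchmark used for the table above):

/* Minimal MPI ping-pong sketch: rank 0 and rank 1 exchange a zero-byte
 * message and half the average round-trip time is reported as latency.
 * Illustration only, not the benchmark used for the table above. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int nreps = 1000;
    int rank, i;
    char buf[1];
    double t0, t1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < nreps; i++) {
        if (rank == 0) {
            MPI_Send(buf, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(buf, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.1f us\n",
               (t1 - t0) / (2.0 * nreps) * 1e6);

    MPI_Finalize();
    return 0;
}

Run with two processes placed on different nodes, it reports a number comparable to the MPICH rows of the table; equivalent tests written directly against the socket or GM interfaces give the lower-level figures.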

The bandwidths obtained using TCP and GM directly are summarized in figure 7:

Figure 7: Best Bandwidths on Beowulf : BIP/GM/TCP
\begin{figure}
\centerline {\epsfig{file=bestres2.ps,height=8cm,angle=-90}}
\end{figure}

At the MPI level (figure 8) the difference in performance when communication takes place between two processes on the same node is evident.

Figure 8: MPI bandwidth
\begin{figure}
\centerline {\epsfig{file=all_bandwidth.ps,height=8cm,angle=-90}}
\end{figure}

Another frequently cited performance index is the ScaLAPACK performance; this library is the parallel counterpart of LAPACK. We compiled our version with the BLAS library produced by the ATLAS package and in this way obtained 2.063 Gflop/s on 4 nodes (8 processors) for the LU decomposition of figure 9.
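The single-node floating-point rate behind this figure comes essentially from the ATLAS-tuned BLAS. As a rough illustration only (this is not the ScaLAPACK driver itself; the matrix size and the CBLAS calling convention below are assumptions about a typical ATLAS installation), one can time the double-precision matrix multiply as follows:

/* Time one ATLAS-tuned DGEMM call and report Mflop/s.
 * Illustration only; link e.g. with -lcblas -latlas. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

int main(void)
{
    const int n = 500;                      /* matrix order (arbitrary)    */
    double *a = malloc(n * n * sizeof(double));
    double *b = malloc(n * n * sizeof(double));
    double *c = malloc(n * n * sizeof(double));
    double flops = 2.0 * n * n * n;         /* ~2 n^3 flops per DGEMM      */
    double secs;
    clock_t t0;
    int i;

    for (i = 0; i < n * n; i++) {
        a[i] = 1.0; b[i] = 2.0; c[i] = 0.0;
    }

    t0 = clock();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("DGEMM %dx%d: %.0f Mflop/s\n", n, n, flops / secs / 1e6);

    free(a); free(b); free(c);
    return 0;
}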

Figure 9: LU decomposition performance
\begin{figure}
\centerline {\epsfig{file=lu.ps,height=8cm,angle=-90}}
\end{figure}

Compared Performance on scientific applications

In this final section we present some benchmarking results obtained with parallel scientific applications routinely used at SISSA. The performance of the parallel machines that are currently our production platforms (a Cray T3E and an SGI Origin 2000) is compared with that of the PC cluster. There are three main classes of codes in use at SISSA: ab-initio codes, classical molecular dynamics and quantum Monte Carlo codes. We can compare these three families by looking at specific aspects of high performance computing (HPC). Table 1 presents such a comparison: the three columns represent the three classes of codes, each row indicates a particular HPC aspect, and an adjective describes the behaviour of each class on that aspect. The table shows clearly that the computational requirements of the three classes are quite different.

Table 1: Comparison between codes on different aspects of HPC.
aspect                          Ab-initio   Classical MD        Quantum MC
Communications                  high        high                negligible
Memory requirements             high        moderate            moderate
Parallelization strategy        MPI         MPI/shared memory   MPI
Scalability                     high        not scalable        linear
Use of linear algebra kernels   high        almost null         moderate
Cache access                    good        bad                 moderate


Table 2 gives a short overview of the performance obtained by all the codes on the different machines. For each code we define two benchmark values: the first, named Scalar Benchmark (SB), is obtained by summing the times of all the serial runs; the second, named Parallel Benchmark (PB), is obtained in the same way but summing over the 8-processor runs.
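In compact form (our notation), for a given code with run times $t_r$:

\begin{displaymath}
SB = \sum_{r\,\in\,\mathrm{serial\ runs}} t_r ,
\qquad
PB = \sum_{r\,\in\,\mathrm{8\mbox{-}processor\ runs}} t_r .
\end{displaymath}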

Table 2: Scalar (SB) and parallel (PB) benchmark values for all the codes. See text for definitions.
code                  SCALAR (SB)               PARALLEL (PB)
                      T3E     O2K     Beowulf   T3E     O2K     Beowulf
PWSCF (ab-initio)     1094    775     1485      663     -       1325
FPMD (ab-initio)      56      42.1    119       6.5     5.4     17.4
DLPROTEIN_2.0 (MD)    4249    1781    2831      801     412     693
AMBER-5.0 (MD)        -       168     309       -       61      99
QMC                   67      -       71        537     -       569


These benchmark values help to draw some observations about the overall behaviour of the different classes of codes on the parallel platforms we tested.

We consider first the ab-initio codes; for this group the T3E is the current production machine, and we compare the Beowulf performance against it. The scalar time ratio between the Beowulf cluster and the T3E ( $SB_{beowulf}/SB_{t3e} = 1.35 - 2.12$) is more favourable than the parallel one ( $PB_{beowulf}/PB_{t3e} = 1.99 - 2.67$). We cannot at the moment test parallel performance on larger configurations (16-32 nodes), which would allow us to run real-size systems, but it appears that the Beowulf network will be unable to cope with a large number of nodes; it could therefore be rather difficult to run real simulations with the network technology currently at our disposal. It should be noted, however, that increasing the scalar performance by means of more recent Intel CPUs could reduce the number of processors needed, allowing us to run at least medium-size systems in a range where the network behaviour is still acceptable. This makes the Beowulf cluster a valid support platform on which to run small and medium-size applications.

Concerning the classical MD codes, the current target machine is the Origin 2000: the time ratio between the Beowulf cluster and the Origin 2000 is 1.65-1.83 and remains almost the same in the parallel case. This is a good result that makes the Beowulf machine an interesting computational resource for this kind of code, with an excellent performance/price ratio.

Scalar and parallel performance obtained with the QMC code are the same because the code is embarrassingly parallel. The overall performance confirms that this family of codes is perfectly suited to a Beowulf cluster.

Bibliography

1
R. Innocente, "MPI Performance of a PC Cluster at the ICTP", Proceedings of the Workshop PC-NETS 1999, INFN-LNGS, 1999 (http://www.ictp.trieste.it/parallel/performance/).

2
S. Cozzini, R. Innocente, M. Corbatto, "Comparing scientific codes on different parallel platforms", 6th European SGI/Cray MPP Workshop.

List of Figures

  1. Network interconnections
  2. Computing node 2xPIII-550 + 440BX
  3. P3-550 Cache/Memory performance
  4. 3com 3c985 Gigabit Ethernet NIC
  5. Myrinet M2M NIC
  6. Link layer latency (usec)
  7. Best Bandwidths on Beowulf : BIP/GM/TCP
  8. MPI bandwidth
  9. LU decomposition performance

Footnotes

... interconnects.1
This work has been partially supported by a grant from Regione Friuli Venezia-Giulia.
