
Compile PMEMD on Triton SDSC

posted Aug 5, 2011, 9:40 PM by Dong Xu
Background at http://tritonresource.sdsc.edu/#


1) tar up the pmemd directory from the current AMBER11 CVS tree:

       cd amber11/src/pmemd
       make clean
       cd ..
       tar cfvj pmemd.tar.bz2 ./pmemd

2) Copy the tarball over to Triton:
       scp pmemd.tar.bz2 user@triton-login.sdsc.edu:/home/user


3) Log in to Triton and set the environment variables needed for the compile:
       module load pgi/8.0
       module load openmpi_mx/1.3.3
       export MPI_HOME=/opt/openmpi_pgimx
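
A quick sanity check before going further can save a failed build. The sketch below assumes only that the MPI Fortran wrapper lives at $MPI_HOME/bin/mpif90, which is the path the later config.h edit relies on. check_mpi_wrapper is a made-up helper name, not part of AMBER or Open MPI.

```shell
# check_mpi_wrapper (hypothetical helper): verify that the directory passed
# as $1 contains an executable bin/mpif90 before running configure/make.
check_mpi_wrapper() {
    mpi_home="$1"
    if [ -z "$mpi_home" ]; then
        echo "MPI_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$mpi_home/bin/mpif90" ]; then
        echo "no executable mpif90 under $mpi_home/bin" >&2
        return 1
    fi
    echo "using $mpi_home/bin/mpif90"
}
```

Typical use after the module loads would be: check_mpi_wrapper "$MPI_HOME"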


4) Create the following file in pmemd's config_data directory:

[mjw006@login-4-0 config_data]$ cat linux_em64t.pgf90
DIRFRC_DEFINES = -DDIRFRC_EFS -DDIRFRC_NOVEC
CPP = /lib/cpp
CPPFLAGS = -traditional -P
F90_DEFINES = -DFFTLOADBAL_2PROC

F90 = pgf90
MODULE_SUFFIX = mod
F90FLAGS = -c
F90_OPT_DBG = -g
F90_OPT_LO =  -fastsse -O1
F90_OPT_MED = -fastsse -O2
F90_OPT_HI =  -fastsse -O3
F90_OPT_DFLT =  $(F90_OPT_HI)

CC = pgcc
CFLAGS = -fastsse -O3

LOAD = pgf90
LOADFLAGS =
LOADLIBS =



5) Generate a config.h file:
       ./configure linux_em64t pgf90 lam

6) In the generated config.h, change the following lines to read:
       F90 = $(MPI_HOME)/bin/mpif90
       MPI_LIBS = -L$(MPI_LIBDIR) -lmpi_f77 -lmpi -ldl -lpthread
       MPI_DEFINES = -DMPI
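
The three edits can also be scripted instead of made by hand. This is a minimal sketch, assuming the generated config.h already contains lines starting with "F90 =", "MPI_LIBS =" and "MPI_DEFINES ="; if one of them is missing, sed will not add it. patch_config_h is a hypothetical helper name.

```shell
# patch_config_h (hypothetical helper): rewrite three assignments in the
# config.h given as $1 and print the result to stdout. The input file is
# left untouched. '|' is used as the sed delimiter because the replacement
# text contains '/'. The pattern '^F90 *=' does not match F90FLAGS or
# F90_DEFINES, so those lines pass through unchanged.
patch_config_h() {
    sed \
        -e 's|^F90 *=.*|F90 = $(MPI_HOME)/bin/mpif90|' \
        -e 's|^MPI_LIBS *=.*|MPI_LIBS = -L$(MPI_LIBDIR) -lmpi_f77 -lmpi -ldl -lpthread|' \
        -e 's|^MPI_DEFINES *=.*|MPI_DEFINES = -DMPI|' \
        "$1"
}
```

For example: patch_config_h config.h > config.h.new && mv config.h.new config.h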

7) Compile it:
       make

8) The compiled binary is now at ./src/pmemd



Initial Benchmarks
==================





Compile type 0
===============

2e5823824df60ee1d4d32e345e880403  pmemd

MATH_DEFINES =
MATH_LIBS =
FFT_DEFINES = -DPUBFFT
FFT_INCLUDE =
FFT_LIBS =
NETCDF_HOME =
NETCDF_DEFINES =
NETCDF_MOD =
NETCDF_LIBS =
MPI_HOME = /opt/openmpi_pgimx
MPI_DEFINES = -DMPI -DSLOW_NONBLOCKING_MPI
MPI_INCLUDE = -I$(MPI_HOME)/include
MPI_LIBDIR = $(MPI_HOME)/lib
MPI_LIBS = -L$(MPI_LIBDIR) -lmpi_f77 -lmpi -ldl -lpthread
DIRFRC_DEFINES = -DDIRFRC_EFS -DDIRFRC_NOVEC
CPP = /lib/cpp
CPPFLAGS = -traditional -P
F90_DEFINES = -DFFTLOADBAL_2PROC

F90 = $(MPI_HOME)/bin/mpif90
MODULE_SUFFIX = mod
F90FLAGS = -c
F90_OPT_DBG = -g
F90_OPT_LO =  -fastsse -O1
F90_OPT_MED = -fastsse -O2
F90_OPT_HI =  -fastsse -O3
F90_OPT_DFLT =  $(F90_OPT_HI)

CC = pgcc
CFLAGS = -fastsse -O3

LOAD = pgf90
LOADFLAGS =
LOADLIBS =


n       8ppn    4ppn

2       1172    1169
4        637     639
8        383     362
16       228     224
32       197     134
64       421      95
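
To make the scaling easier to read, speedup and parallel efficiency can be derived from one column of the table. This is a sketch under two assumptions that the page does not state outright: the numbers are run times (lower is better), and n is the total number of MPI processes (the crash logs report rank 63 for the n=64 runs, which fits). The first numeric row is taken as the baseline. scaling_table is a hypothetical helper.

```shell
# scaling_table (hypothetical helper): read "n time" pairs on stdin and
# print "n time speedup efficiency", with the first numeric row as baseline.
#   speedup    = baseline_time / time
#   efficiency = speedup * baseline_n / n
# Non-numeric rows (headers, "crash" entries) are skipped.
scaling_table() {
    awk '
        $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && $2 > 0 {
            if (base_n == 0) { base_n = $1; base_t = $2 }
            speedup = base_t / $2
            printf "%d %d %.2f %.2f\n", $1, $2, speedup, speedup * base_n / $1
        }
    '
}
```

Feeding it the first and second columns of the table gives the 8ppn figures; the first and third columns give the 4ppn figures.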

Compile type 1 (same as type 0, but without -DSLOW_NONBLOCKING_MPI)
===============

5eb2e04b00dfb3f3122590a6eba40ea7  pmemd

MATH_DEFINES =
MATH_LIBS =
FFT_DEFINES = -DPUBFFT
FFT_INCLUDE =
FFT_LIBS =
NETCDF_HOME =
NETCDF_DEFINES =
NETCDF_MOD =
NETCDF_LIBS =
MPI_HOME = /opt/openmpi_pgimx
MPI_DEFINES = -DMPI
MPI_INCLUDE = -I$(MPI_HOME)/include
MPI_LIBDIR = $(MPI_HOME)/lib
MPI_LIBS = -L$(MPI_LIBDIR) -lmpi_f77 -lmpi -ldl -lpthread
DIRFRC_DEFINES = -DDIRFRC_EFS -DDIRFRC_NOVEC
CPP = /lib/cpp
CPPFLAGS = -traditional -P
F90_DEFINES = -DFFTLOADBAL_2PROC

F90 = $(MPI_HOME)/bin/mpif90
MODULE_SUFFIX = mod
F90FLAGS = -c
F90_OPT_DBG = -g
F90_OPT_LO =  -fastsse -O1
F90_OPT_MED = -fastsse -O2
F90_OPT_HI =  -fastsse -O3
F90_OPT_DFLT =  $(F90_OPT_HI)

CC = pgcc
CFLAGS = -fastsse -O3

LOAD = pgf90
LOADFLAGS =
LOADLIBS =




NEW (WIP)
n       8ppn    4ppn

2       1166    1168
4       635     633
8       375     364
16      250     222
32      393     133
64      768     crash (see log below)

warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
[tcc-5-33:08134] *** Process received signal ***
[tcc-5-33:08134] Signal: Segmentation fault (11)
[tcc-5-33:08134] Signal code: Address not mapped (1)
[tcc-5-33:08134] Failing at address: 0xfffffffe011c2934
[tcc-5-33:08134] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 63 with PID 8134 on node tcc-5-33 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------



Conclusions
       i) SLOW_NONBLOCKING_MPI has no measurable effect on the 4ppn jobs
       ii) SLOW_NONBLOCKING_MPI improves the 8ppn results at larger
       values of n (197 vs 393 at n=32, 421 vs 768 at n=64)
       iii) Sweet spot seems to be n=32 at 4ppn
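
The flag comparison behind these conclusions can be reproduced by dividing the numbers from the build without the flag by the numbers from the build with it. A minimal sketch, assuming each column has been saved as a two-column "n time" text file (the file names in the usage line are made up) and that the values are run times. compare_runs is a hypothetical helper.

```shell
# compare_runs (hypothetical helper): $1 = results with the flag,
# $2 = results without it, both as "n time" rows.
# Prints "n time_with time_without ratio"; a ratio above 1 means the
# build without the flag was slower. Rows with a non-numeric time
# (for example "crash") are skipped.
compare_runs() {
    awk '
        NR == FNR { if ($2 ~ /^[0-9]+$/) t0[$1] = $2; next }
        ($1 in t0) && $2 ~ /^[0-9]+$/ && t0[$1] > 0 {
            printf "%d %d %d %.2f\n", $1, t0[$1], $2, $2 / t0[$1]
        }
    ' "$1" "$2"
}
```

For example: compare_runs type0_8ppn.txt type1_8ppn.txt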




Compile type 2 (remove -DFFTLOADBAL_2PROC and -DDIRFRC_NOVEC)
===============

14224e7d2db4968d2fa026826c140970  ./pmemd

MATH_DEFINES =
MATH_LIBS =
FFT_DEFINES = -DPUBFFT
FFT_INCLUDE =
FFT_LIBS =
NETCDF_HOME =
NETCDF_DEFINES =
NETCDF_MOD =
NETCDF_LIBS =
MPI_HOME = /opt/openmpi_pgimx
MPI_DEFINES = -DMPI
MPI_INCLUDE = -I$(MPI_HOME)/include
MPI_LIBDIR = $(MPI_HOME)/lib
MPI_LIBS = -L$(MPI_LIBDIR) -lmpi_f77 -lmpi -ldl -lpthread
DIRFRC_DEFINES = -DDIRFRC_EFS
CPP = /lib/cpp
CPPFLAGS = -traditional -P
F90_DEFINES =

F90 = $(MPI_HOME)/bin/mpif90
MODULE_SUFFIX = mod
F90FLAGS = -c
F90_OPT_DBG = -g
F90_OPT_LO =  -fastsse -O1
F90_OPT_MED = -fastsse -O2
F90_OPT_HI =  -fastsse -O3
F90_OPT_DFLT =  $(F90_OPT_HI)

CC = pgcc
CFLAGS = -fastsse -O3

LOAD = pgf90
LOADFLAGS =
LOADLIBS =


Procedure: repeat the steps above, but remove -DFFTLOADBAL_2PROC and
-DDIRFRC_NOVEC from the final config.h.
(-DSLOW_NONBLOCKING_MPI is also absent from this build.)
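
Removing the two defines can be done with sed rather than by hand. A sketch only, assuming each flag appears in config.h exactly as spelled here; strip_defines is a hypothetical helper name.

```shell
# strip_defines (hypothetical helper): print the config.h given as $1 with
# -DFFTLOADBAL_2PROC and -DDIRFRC_NOVEC removed, together with the spaces
# in front of each flag. The input file is left untouched.
strip_defines() {
    sed \
        -e 's/ *-DFFTLOADBAL_2PROC//' \
        -e 's/ *-DDIRFRC_NOVEC//' \
        "$1"
}
```

For example: strip_defines config.h > config.h.new && mv config.h.new config.h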


Results:
========

n       8ppn            4ppn

2       1194            1198
4       656             668
8       381             370
16      252             228
32      322             134  (crashed on the first attempt, see i)
64      crash ii)       95



i)
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
[tcc-3-22:28986] *** Process received signal ***
[tcc-3-22:28986] Signal: Segmentation fault (11)
[tcc-3-22:28986] Signal code: Address not mapped (1)
[tcc-3-22:28986] Failing at address: 0xfffffffe15cd4634
[tcc-3-22:28986] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 31 with PID 28986 on node tcc-3-22 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Could it be this?
       http://www.open-mpi.org/community/lists/users/2008/01/4896.php




ii-a)
[tcc-3-29.local:13720] Error in mx_open_endpoint (error Busy)
[tcc-3-29.local:13720] mca_btl_mx_init: mx_open_endpoint() failed with status
20 (Busy)
warning:regcache incompatible with malloc
[tcc-3-29.local:13717] mca_btl_mx_init: mx_open_endpoint() failed with status
20 (Busy)
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 23 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 42 with PID 1681 on
node tcc-3-6 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[tcc-2-56.local:22636] 2 more processes have sent help message
help-mpi-api.txt / mpi-abort
[tcc-2-56.local:22636] Set MCA parameter "orte_base_help_aggregate" to 0 to
see all help / error messages

ii-b)
[tcc-3-22.local:30836] mca_btl_mx_init: mx_open_endpoint() failed with status
20 (Busy)
[tcc-3-22:30838] *** Process received signal ***
[tcc-3-22:30838] Signal: Segmentation fault (11)
[tcc-3-22:30838] Signal code: Address not mapped (1)
[tcc-3-22:30838] Failing at address: 0xfffffffe0dc5a474
[tcc-3-22:30838] *** End of error message ***
[tcc-2-56.local][[52651,1],5][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-56.local][[52651,1],4][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-76.local][[52651,1],31][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-3-6.local][[52651,1],52][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-73.local][[52651,1],47][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-3-6.local][[52651,1],55][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-3-6.local][[52651,1],53][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-67.local][[52651,1],23][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-76.local][[52651,1],28][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-73.local][[52651,1],44][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-3-6.local][[52651,1],50][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-67.local][[52651,1],22][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-73.local][[52651,1],46][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
[tcc-2-67.local][[52651,1],19][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-70.local][[52651,1],32][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-3-6.local][[52651,1],49][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-52.local][[52651,1],11][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-70.local][[52651,1],38][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
[tcc-2-52.local][[52651,1],13][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-67.local][[52651,1],20][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-76.local][[52651,1],29][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
[tcc-2-70.local][[52651,1],35][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

[tcc-2-76.local][[52651,1],25][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
[tcc-2-70.local][[52651,1],37][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-67.local][[52651,1],17][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-70.local][[52651,1],34][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-73.local][[52651,1],40][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
[tcc-2-67.local][[52651,1],16][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
[tcc-2-76.local][[52651,1],26][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun noticed that process rank 63 with PID 30838 on node tcc-3-22 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-73.local][[52651,1],43][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-73.local][[52651,1],41][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
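
Logs like the ones above are easier to compare between runs once they are boiled down to a few counts. The sketch below assumes the mpirun output was saved to a file; it counts lines per failure signature and lists the ranks that logged a TCP error. Both helper names are made up, and the search strings are taken from the logs on this page.

```shell
# summarize_mpi_log (hypothetical helper): for the log file given as $1,
# print how many lines contain each failure signature.
summarize_mpi_log() {
    for pat in 'Segmentation fault' 'mx_open_endpoint() failed' \
               'readv failed' 'regcache incompatible'; do
        printf '%s: %s\n' "$pat" "$(grep -cF -- "$pat" "$1")"
    done
}

# tcp_error_ranks (hypothetical helper): list the distinct ranks named in
# btl_tcp_frag.c lines, i.e. the N in "[[jobid,1],N]".
tcp_error_ranks() {
    grep -F 'btl_tcp_frag.c' "$1" \
        | sed -n 's/.*\[\[[0-9]*,[0-9]*],\([0-9]*\)].*/\1/p' \
        | sort -n -u
}
```

For example: summarize_mpi_log run.log; tcp_error_ranks run.log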



Compile type 3
===============

Same as compile type 2, but -DDIRFRC_NOVEC is added back into the final config.h



5eb2e04b00dfb3f3122590a6eba40ea7  pmemd

MATH_DEFINES =
MATH_LIBS =
FFT_DEFINES = -DPUBFFT
FFT_INCLUDE =
FFT_LIBS =
NETCDF_HOME =
NETCDF_DEFINES =
NETCDF_MOD =
NETCDF_LIBS =
MPI_HOME = /opt/openmpi_pgimx
MPI_DEFINES = -DMPI
MPI_INCLUDE = -I$(MPI_HOME)/include
MPI_LIBDIR = $(MPI_HOME)/lib
MPI_LIBS = -L$(MPI_LIBDIR) -lmpi_f77 -lmpi -ldl -lpthread
DIRFRC_DEFINES = -DDIRFRC_EFS -DDIRFRC_NOVEC
CPP = /lib/cpp
CPPFLAGS = -traditional -P
F90_DEFINES = -DFFTLOADBAL_2PROC

F90 = $(MPI_HOME)/bin/mpif90
MODULE_SUFFIX = mod
F90FLAGS = -c
F90_OPT_DBG = -g
F90_OPT_LO =  -fastsse -O1
F90_OPT_MED = -fastsse -O2
F90_OPT_HI =  -fastsse -O3
F90_OPT_DFLT =  $(F90_OPT_HI)

CC = pgcc
CFLAGS = -fastsse -O3

LOAD = pgf90
LOADFLAGS =
LOADLIBS =
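
The checksum listed for this build is identical to the one listed under compile type 1, so it is worth confirming that a rebuild really produced a different binary before comparing timings. A minimal sketch, assuming GNU md5sum is available and that the previous binary was kept under another name (the file names in the usage line are made up). same_binary is a hypothetical helper.

```shell
# same_binary (hypothetical helper): succeed (exit 0) if the two files
# given as $1 and $2 have the same MD5 checksum, fail otherwise.
same_binary() {
    sum_a=$(md5sum "$1" | cut -d' ' -f1)
    sum_b=$(md5sum "$2" | cut -d' ' -f1)
    [ -n "$sum_a" ] && [ "$sum_a" = "$sum_b" ]
}
```

For example: same_binary pmemd.type1 pmemd && echo "binary unchanged, check config.h and rebuild"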


Running as of 17:01 15th Sep 2009

n       8ppn            4ppn

2       1169            1176
4       638             642
8       393             365
16      263             221
32      399
64
