Background at http://tritonresource.sdsc.edu/

1) Tar up the pmemd directory from the current AMBER11 CVS tree:

     cd amber11/src/pmemd
     make clean
     cd ..
     tar cfvj pmemd.tar.bz2 ./pmemd

2) Copy this over to Triton:

     scp pmemd.tar.bz2 user@triton-login.sdsc.edu:/home/user

3) Log in to Triton and set some environment variables for the compile:

     module load pgi/8.0
     module load openmpi_mx/1.3.3
     export MPI_HOME=/opt/openmpi_pgimx

4) Create the following file in the config_data directory of pmemd:

     [mjw006@login-4-0 config_data]$ cat linux_em64t.pgf90
     DIRFRC_DEFINES = -DDIRFRC_EFS -DDIRFRC_NOVEC
     CPP = /lib/cpp
     CPPFLAGS = -traditional -P
     F90_DEFINES = -DFFTLOADBAL_2PROC
     F90 = pgf90
     MODULE_SUFFIX = mod
     F90FLAGS = -c
     F90_OPT_DBG = -g
     F90_OPT_LO = -fastsse -O1
     F90_OPT_MED = -fastsse -O2
     F90_OPT_HI = -fastsse -O3
     F90_OPT_DFLT = $(F90_OPT_HI)
     CC = pgcc
     CFLAGS = -fastsse -O3
     LOAD = pgf90
     LOADFLAGS =
     LOADLIBS =

5) Generate a config.h file:

     ./configure linux_em64t pgf90 lam

6) In the generated config.h, change the following lines to:

     F90 = $(MPI_HOME)/bin/mpif90
     MPI_LIBS = -L$(MPI_LIBDIR) -lmpi_f77 -lmpi -ldl -lpthread
     MPI_DEFINES = -DMPI

7) Compile it:

     make

8) The compiled binary is now at ./src/pmemd

Initial Benchmarks
==================

Compile type 0
===============
2e5823824df60ee1d4d32e345e880403  pmemd

MATH_DEFINES =
MATH_LIBS =
FFT_DEFINES = -DPUBFFT
FFT_INCLUDE =
FFT_LIBS =
NETCDF_HOME =
NETCDF_DEFINES =
NETCDF_MOD =
NETCDF_LIBS =
MPI_HOME = /opt/openmpi_pgimx
MPI_DEFINES = -DMPI -DSLOW_NONBLOCKING_MPI
MPI_INCLUDE = -I$(MPI_HOME)/include
MPI_LIBDIR = $(MPI_HOME)/lib
MPI_LIBS = -L$(MPI_LIBDIR) -lmpi_f77 -lmpi -ldl -lpthread
DIRFRC_DEFINES = -DDIRFRC_EFS -DDIRFRC_NOVEC
CPP = /lib/cpp
CPPFLAGS = -traditional -P
F90_DEFINES = -DFFTLOADBAL_2PROC
F90 = $(MPI_HOME)/bin/mpif90
MODULE_SUFFIX = mod
F90FLAGS = -c
F90_OPT_DBG = -g
F90_OPT_LO = -fastsse -O1
F90_OPT_MED = -fastsse -O2
F90_OPT_HI = -fastsse -O3
F90_OPT_DFLT = $(F90_OPT_HI)
CC = pgcc
CFLAGS = -fastsse -O3
LOAD = pgf90
LOADFLAGS =
LOADLIBS =
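Step 6 above amounts to rewriting three assignments in the generated config.h. A minimal sketch of that edit in Python (sed or a hand edit works just as well; the replacement values are taken from step 6, while the helper name and the regex are my own):

```python
import re

# Replacement assignments from step 6; keys are the config.h variables to rewrite.
PATCHES = {
    "F90": "$(MPI_HOME)/bin/mpif90",
    "MPI_LIBS": "-L$(MPI_LIBDIR) -lmpi_f77 -lmpi -ldl -lpthread",
    "MPI_DEFINES": "-DMPI",
}

def patch_config(text):
    """Rewrite 'VAR = value' lines of a config.h-style makefile fragment."""
    out = []
    for line in text.splitlines():
        m = re.match(r"\s*(\w+)\s*=", line)
        if m and m.group(1) in PATCHES:
            line = "%s = %s" % (m.group(1), PATCHES[m.group(1)])
        out.append(line)
    return "\n".join(out) + "\n"

# To apply to the real file:
#   text = open("config.h").read()
#   open("config.h", "w").write(patch_config(text))
```

Matching on the variable name rather than the whole line means the patch still applies if configure emits slightly different right-hand sides.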
 n    8ppn   4ppn
 2    1172   1169
 4     637    639
 8     383    362
16     228    224
32     197    134
64     421     95

Compile type 1 (same as type 0, but without -DSLOW_NONBLOCKING_MPI)
===============
5eb2e04b00dfb3f3122590a6eba40ea7  pmemd

MATH_DEFINES =
MATH_LIBS =
FFT_DEFINES = -DPUBFFT
FFT_INCLUDE =
FFT_LIBS =
NETCDF_HOME =
NETCDF_DEFINES =
NETCDF_MOD =
NETCDF_LIBS =
MPI_HOME = /opt/openmpi_pgimx
MPI_DEFINES = -DMPI
MPI_INCLUDE = -I$(MPI_HOME)/include
MPI_LIBDIR = $(MPI_HOME)/lib
MPI_LIBS = -L$(MPI_LIBDIR) -lmpi_f77 -lmpi -ldl -lpthread
DIRFRC_DEFINES = -DDIRFRC_EFS -DDIRFRC_NOVEC
CPP = /lib/cpp
CPPFLAGS = -traditional -P
F90_DEFINES = -DFFTLOADBAL_2PROC
F90 = $(MPI_HOME)/bin/mpif90
MODULE_SUFFIX = mod
F90FLAGS = -c
F90_OPT_DBG = -g
F90_OPT_LO = -fastsse -O1
F90_OPT_MED = -fastsse -O2
F90_OPT_HI = -fastsse -O3
F90_OPT_DFLT = $(F90_OPT_HI)
CC = pgcc
CFLAGS = -fastsse -O3
LOAD = pgf90
LOADFLAGS =
LOADLIBS =

NEW (WIP)

 n    8ppn   4ppn
 2    1166   1168
 4     635    633
 8     375    364
16     250    222
32     393    133
64     768    crash i)

i)
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
[tcc-5-33:08134] *** Process received signal ***
[tcc-5-33:08134] Signal: Segmentation fault (11)
[tcc-5-33:08134] Signal code: Address not mapped (1)
[tcc-5-33:08134] Failing at address: 0xfffffffe011c2934
[tcc-5-33:08134] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 63 with PID 8134 on node tcc-5-33 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Conclusions
i)   SLOW_NONBLOCKING_MPI has no effect on the 4ppn jobs
ii)  SLOW_NONBLOCKING_MPI shows an improvement for 8ppn at larger values of n
iii) The sweet spot seems to be n=32 at 4ppn

Compile type 2 (remove -DFFTLOADBAL_2PROC and -DDIRFRC_NOVEC)
===============
14224e7d2db4968d2fa026826c140970  ./pmemd

MATH_DEFINES =
MATH_LIBS =
FFT_DEFINES = -DPUBFFT
FFT_INCLUDE =
FFT_LIBS =
NETCDF_HOME =
NETCDF_DEFINES =
NETCDF_MOD =
NETCDF_LIBS =
MPI_HOME = /opt/openmpi_pgimx
MPI_DEFINES = -DMPI
MPI_INCLUDE = -I$(MPI_HOME)/include
MPI_LIBDIR = $(MPI_HOME)/lib
MPI_LIBS = -L$(MPI_LIBDIR) -lmpi_f77 -lmpi -ldl -lpthread
DIRFRC_DEFINES = -DDIRFRC_EFS
CPP = /lib/cpp
CPPFLAGS = -traditional -P
F90_DEFINES =
F90 = $(MPI_HOME)/bin/mpif90
MODULE_SUFFIX = mod
F90FLAGS = -c
F90_OPT_DBG = -g
F90_OPT_LO = -fastsse -O1
F90_OPT_MED = -fastsse -O2
F90_OPT_HI = -fastsse -O3
F90_OPT_DFLT = $(F90_OPT_HI)
CC = pgcc
CFLAGS = -fastsse -O3
LOAD = pgf90
LOADFLAGS =
LOADLIBS =

1) Repeat the above, but remove -DFFTLOADBAL_2PROC and -DDIRFRC_NOVEC from the final config.h (there is no SLOW_NONBLOCKING_MPI here)

Results:
========
 n    8ppn        4ppn
 2    1194        1198
 4     656         668
 8     381         370
16     252         228
32     322         134   (first time: crash i))
64    crash ii)     95

i)
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
[tcc-3-22:28986] *** Process received signal ***
[tcc-3-22:28986] Signal: Segmentation fault (11)
[tcc-3-22:28986] Signal code: Address not mapped (1)
[tcc-3-22:28986] Failing at address: 0xfffffffe15cd4634
[tcc-3-22:28986] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 31 with PID 28986 on node tcc-3-22 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Could it be this: http://www.open-mpi.org/community/lists/users/2008/01/4896.php

ii-a)
[tcc-3-29.local:13720] Error in mx_open_endpoint (error Busy)
[tcc-3-29.local:13720] mca_btl_mx_init: mx_open_endpoint() failed with status 20 (Busy)
warning:regcache incompatible with malloc
[tcc-3-29.local:13717] mca_btl_mx_init: mx_open_endpoint() failed with status 20 (Busy)
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 23 in communicator MPI_COMM_WORLD with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 42 with PID 1681 on node tcc-3-6 exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[tcc-2-56.local:22636] 2 more processes have sent help message help-mpi-api.txt / mpi-abort
[tcc-2-56.local:22636] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

ii-b)
[tcc-3-22.local:30836] mca_btl_mx_init: mx_open_endpoint() failed with status 20 (Busy)
[tcc-3-22:30838] *** Process received signal ***
[tcc-3-22:30838] Signal: Segmentation fault (11)
[tcc-3-22:30838] Signal code: Address not mapped (1)
[tcc-3-22:30838] Failing at address: 0xfffffffe0dc5a474
[tcc-3-22:30838] *** End of error message ***
[tcc-2-56.local][[52651,1],5][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-56.local][[52651,1],4][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-76.local][[52651,1],31][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-3-6.local][[52651,1],52][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-73.local][[52651,1],47][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-3-6.local][[52651,1],55][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-3-6.local][[52651,1],53][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-67.local][[52651,1],23][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-76.local][[52651,1],28][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-73.local][[52651,1],44][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed:
Connection reset by peer (104)
[tcc-3-6.local][[52651,1],50][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-67.local][[52651,1],22][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-73.local][[52651,1],46][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
[tcc-2-67.local][[52651,1],19][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-70.local][[52651,1],32][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-3-6.local][[52651,1],49][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-52.local][[52651,1],11][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-70.local][[52651,1],38][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
[tcc-2-52.local][[52651,1],13][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-67.local][[52651,1],20][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-76.local][[52651,1],29][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
[tcc-2-70.local][[52651,1],35][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-76.local][[52651,1],25][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
[tcc-2-70.local][[52651,1],37][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-67.local][[52651,1],17][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-70.local][[52651,1],34][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-73.local][[52651,1],40][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
[tcc-2-67.local][[52651,1],16][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
[tcc-2-76.local][[52651,1],26][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun noticed that process rank 63 with PID 30838 on node tcc-3-22 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-73.local][[52651,1],43][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[tcc-2-73.local][[52651,1],41][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

Compile type 3
===============
Same as compile type 2, but -DDIRFRC_NOVEC is added back into the final config.h.
5eb2e04b00dfb3f3122590a6eba40ea7  pmemd

MATH_DEFINES =
MATH_LIBS =
FFT_DEFINES = -DPUBFFT
FFT_INCLUDE =
FFT_LIBS =
NETCDF_HOME =
NETCDF_DEFINES =
NETCDF_MOD =
NETCDF_LIBS =
MPI_HOME = /opt/openmpi_pgimx
MPI_DEFINES = -DMPI
MPI_INCLUDE = -I$(MPI_HOME)/include
MPI_LIBDIR = $(MPI_HOME)/lib
MPI_LIBS = -L$(MPI_LIBDIR) -lmpi_f77 -lmpi -ldl -lpthread
DIRFRC_DEFINES = -DDIRFRC_EFS -DDIRFRC_NOVEC
CPP = /lib/cpp
CPPFLAGS = -traditional -P
F90_DEFINES = -DFFTLOADBAL_2PROC
F90 = $(MPI_HOME)/bin/mpif90
MODULE_SUFFIX = mod
F90FLAGS = -c
F90_OPT_DBG = -g
F90_OPT_LO = -fastsse -O1
F90_OPT_MED = -fastsse -O2
F90_OPT_HI = -fastsse -O3
F90_OPT_DFLT = $(F90_OPT_HI)
CC = pgcc
CFLAGS = -fastsse -O3
LOAD = pgf90
LOADFLAGS =
LOADLIBS =

Running as of 17:01, 15th Sep 2009

 n    8ppn   4ppn
 2    1169   1176
 4     638    642
 8     393    365
16     263    221
32     399
64
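Conclusion iii) above can be quantified as speedup and parallel efficiency relative to the smallest run. A minimal sketch, using the 4ppn timings from the compile type 0 table (the numbers are assumed to be wall-clock times per benchmark run; the helper name is my own):

```python
# 4ppn timings for compile type 0, from the benchmark table above: {n: time}.
times_4ppn = {2: 1169, 4: 639, 8: 362, 16: 224, 32: 134, 64: 95}

def scaling(times):
    """Speedup and parallel efficiency relative to the smallest node count."""
    base_n = min(times)
    base_t = times[base_n]
    rows = []
    for n in sorted(times):
        speedup = base_t / float(times[n])
        efficiency = speedup / (n / float(base_n))   # fraction of ideal linear scaling
        rows.append((n, speedup, efficiency))
    return rows

for n, s, e in scaling(times_4ppn):
    print("n=%2d  speedup=%5.2f  efficiency=%4.2f" % (n, s, e))
```

On these numbers, efficiency falls from roughly 0.55 at n=32 to under 0.4 at n=64, which is consistent with calling n=32 at 4ppn the sweet spot.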