Julia parallel computing across multiple nodes in a cluster

I am running some jobs on a shared cluster and am trying to use more than one node at a time. Although julia -p #processors works for cores on one node, it doesn't find other nodes. The cluster uses SGE and I have tried many different ways to get the other nodes to work, but only one node ever works. Is there an easy way, built into Julia, to start Julia using julia -mpi 32 or something similar?
Using

using ClusterManagers
println(nworkers(),nprocs(),Sys.CPU_CORES)
ClusterManagers.addprocs_sge(16)
ClusterManagers.addprocs_sge(15)
println(nworkers(),nprocs(),Sys.CPU_CORES)

      

does not work (I submitted the job reserving 2 nodes with 16 cores each in SGE). The job output file is empty, and instead I get 16 different output files, julia-70755.o8252776.* (* = 1...16), with the following text:

julia_worker:9009#192.168.17.206
Master process (id 1) could not connect within 60.0 seconds.
exiting.

      

Launching Julia with julia --machinefile $PE_HOSTFILE also failed:

Warning: Permanently added the RSA host key for IP address '192.168.18.10' to the list of known hosts.
ERROR: connect: invalid argument (EINVAL)
 in uv_error at ./libuv.jl:68 [inlined]
 in connect!(::TCPSocket, ::IPv4, ::UInt16) at ./socket.jl:652
 in connect!(::TCPSocket, ::SubString{String}, ::UInt16) at ./socket.jl:688
 in connect at ./stream.jl:959 [inlined]
 in connect_to_worker(::SubString{String}, ::Int16) at ./managers.jl:483
 in connect(::Base.SSHManager, ::Int64, ::WorkerConfig) at ./managers.jl:425
 in create_worker(::Base.SSHManager, ::WorkerConfig) at ./multi.jl:1786
 in setup_launched_worker(::Base.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at ./multi.jl:1733
 in (::Base.##669#673{Base.SSHManager,Array{Int64,1}})() at ./task.jl:360
 in sync_end() at ./task.jl:311
 in macro expansion at ./task.jl:327 [inlined]
 in #addprocs_locked#665(::Array{Any,1}, ::Function, ::Base.SSHManager) at ./multi.jl:1688
 in (::Base.#kw##addprocs_locked)(::Array{Any,1}, ::Base.#addprocs_locked, ::Base.SSHManager) at ./<missing>:0
 in #addprocs#664(::Array{Any,1}, ::Function, ::Base.SSHManager) at ./multi.jl:1658
 in (::Base.#kw##addprocs)(::Array{Any,1}, ::Base.#addprocs, ::Base.SSHManager) at ./<missing>:0
 in #addprocs#764(::Bool, ::Cmd, ::Int64, ::Array{Any,1}, ::Function, ::Array{Any,1}) at ./managers.jl:112
 in process_options(::Base.JLOptions) at ./client.jl:227
 in _start() at ./client.jl:321
UndefRefError()

      

It was suggested that I use the MPI.jl package, but it doesn't look like it supports the Julia parallel syntax the way I use it, which is just writing @sync @parallel in front of the for loop I want to run in parallel (e.g. Metropolis Monte Carlo).



2 answers


The IT team came back to me and said that SGE doesn't allow ssh without a password, so addprocs_sge() won't work. However, they have now added a machines file for each job that I can pass to Julia, and asked me to run the job like this:

qlogin -pe mpi_28_tasks_per_node 56
module load julia/0.5.1
julia --machinefile $TMPDIR/machines

      



The machines file looks like this:

::::::::::::::
/scratch/8548498.1.u/machines
::::::::::::::
{hostname1}
{hostname1}
...
{hostname2}
{hostname2}

      



You might want to read the Julia docs on parallel computing, which have a section on ClusterManagers. Also, take a look at ClusterManagers.jl, where SGE is supported:



julia> using ClusterManagers
julia> ClusterManagers.addprocs_sge(5)

      
