2.2.2. Cross-Node Computation with MPI and Rust
You can find the code mentioned in this chapter in this book's repository!
What is OpenMPI, and How Does it Differ from OpenMP?
I'll start by loosely contrasting the functionality of the two:
- OpenMP is for shared memory across multiple processors/threads
- OpenMPI is for distributed memory across multiple systems
 
OpenMP is semi-comparable to our previous approach to multithreading. It's a lot beefier, but the author of this documentation generally discourages its use - recommending instead either the language's own parallelism features (for simplicity) or OpenMPI (for scaling across multiple nodes, and with it drastically improved performance).
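To make that distinction concrete, here's a minimal shared-memory sketch in plain Rust (illustrative only, not from the repository): several threads inside one process all read the same data directly. With MPI, by contrast, each process has its own memory, and any data it needs from another process has to be explicitly sent and received.
use std::sync::Arc;
use std::thread;

fn main() {
    // One process, one address space: every thread can read `data` directly.
    let data = Arc::new(vec![1.0, 2.0, 3.0, 4.0]);

    let handles: Vec<_> = (0..4)
        .map(|i| {
            let data = Arc::clone(&data);
            thread::spawn(move || println!("Thread {i} sees {data:?}"))
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}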
How OpenMPI Works
MPI stands for (M)essage-(P)assing (I)nterface - a standard for how processes communicate by exchanging messages, and OpenMPI is one implementation of that standard.
There are four concepts core to MPI:
- Universe - The collection of all nodes. On Metis, this is abstracted away by PBS Professional, so don't think too hard about it.
- World - The collection of all MPI processes and their intercommunication layers. You can think of this as the "meeting room", where each process has a headset and microphone to talk to one another.
- Size - The number of MPI processes in the World.
- Rank - The index representing this MPI process in the World.
 
When you launch an MPI program, every process runs the same binary, each with its own separate memory. In fact, if you didn't use the rank indicator anywhere in the program, the processes would behave identically in most cases, simply duplicating the output size times.
Here's an example in Rust that does just that:
use mpi::traits::*;
use anyhow::{Context, Result};

fn main() -> Result<()> {
    let universe = mpi::initialize()
        .context("Failed to initialize MPI.")?;
    let world = universe.world();

    let size: i32 = world.size();

    println!("Size: {size}");

    Ok(())
}
To get this to run, you'll need to do the following:
$ mkdir -p ~/projects/rust/basic-mpi
$ cd ~/projects/rust/basic-mpi
$ cargo init .
$ cargo add anyhow
$ cargo add mpi
$ module purge
$ module load openmpi/openmpi-5.0.7-gcc-14.2.0-cuda-12.8
$ cargo run
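Note that a bare `cargo run` launches just a single process, so you'd only see one line with a size of 1. To get a size of 3 like below, you'd build the binary and launch it under `mpirun` yourself - a minimal sketch (the proper, cluster-wide way is covered later in this chapter):
$ cargo build --release
$ mpirun -np 3 ./target/release/basic-mpi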
For an MPI run with size 3, you'd get:
Size: 3
Size: 3
Size: 3
Neat, but why run the same thing in multiple places?
Rank-Based Inter-process Logic and Communication
Things start to get interesting quickly when you consider the rank:
use mpi::traits::*;
use anyhow::{Context, Result};

fn main() -> Result<()> {
    let universe = mpi::initialize()
        .context("Failed to initialize MPI.")?;
    let world = universe.world();

    let size: i32 = world.size();
    let rank: i32 = world.rank();

    println!("Size: {size} - Rank: {rank}");

    Ok(())
}
Running this with three processes again, we get a different result (the exact output order may vary between runs, since each process prints independently):
Size: 3 - Rank: 0
Size: 3 - Rank: 1
Size: 3 - Rank: 2
If you've been paying close attention, it probably just clicked why this is a big deal - you can diverge in logic based on the rank.
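As a quick sketch of what that divergence can look like (hypothetical coordinator/worker logic, not from the repository):
use mpi::traits::*;
use anyhow::{Context, Result};

fn main() -> Result<()> {
    let universe = mpi::initialize()
        .context("Failed to initialize MPI.")?;
    let world = universe.world();

    if world.rank() == 0 {
        // Exactly one process takes the "coordinator" branch...
        println!("I'm the coordinator (rank 0)");
    } else {
        // ...while every other process takes the "worker" branch.
        println!("I'm a worker (rank {})", world.rank());
    }

    Ok(())
}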
Here's a full example for this logic at work:
use mpi::traits::*;
use rand::prelude::*;
use anyhow::{Context, Result};

const NUM_ELEMENTS: usize = 1_000_000;

fn calculate_random_avg(n: usize) -> f64 {
    let mut arr = vec![0.0; n];
    rand::rng().fill(&mut arr[..]);
    arr.into_iter().sum::<f64>() / (n as f64)
}

fn main() -> Result<()> {
    let universe = mpi::initialize()
        .context("Failed to initialize MPI.")?;
    let world = universe.world();

    let size: i32 = world.size();
    let rank: i32 = world.rank();

    // Split the work evenly; rank 0 also picks up the remainder.
    let local_elements = NUM_ELEMENTS / (size as usize) + (rank == 0)
        .then_some(NUM_ELEMENTS % (size as usize))
        .unwrap_or(0);

    // Every process computes its own local average and sends it to rank 0.
    let local_avg = calculate_random_avg(local_elements);
    world.process_at_rank(0).send(&local_avg);

    // Rank 0, and only rank 0, collects one result per process and reports.
    if rank == 0 {
        let global_avg = (0..size)
            .fold(0f64, |acc, _| {
                acc + world.any_process().receive::<f64>().0
            }) / (size as f64);

        println!("Global average of {global_avg} over {NUM_ELEMENTS} elements");
        println!("Computed on {size} MPI processes");
    }

    Ok(())
}
Lots of things just got added, so let's break it down.
- Main can fail. That's why it returns a `Result<()>`, and also why the `universe` has a `.context(...)?` snippet - that wraps the error with additional information and early-returns if it indeed contains an error.
- Each and every node calculates the average of `NUM_ELEMENTS / size` elements (with rank 0 also taking the remainder), and then sends its result to the node with rank 0. There's a worked example of this split right after this list.
- The node with rank 0, and only that node, receives the results from each node (including itself) and prints the result.
- Since our `main` function returns a `Result<()>`, we need to finish the function by returning the `Result::Ok(())` variant, which can be shortened to `Ok(())` since we declared we'd be returning a `Result` in the function definition.
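To make the split concrete: with `NUM_ELEMENTS` at 1_000_000 and a size of 3, every rank computes 1_000_000 / 3 = 333_333 elements, and rank 0 additionally picks up the remainder 1_000_000 % 3 = 1 for a total of 333_334 - so all 1_000_000 elements are accounted for.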
Building and Executing a Rust MPI Binary
Now, the above steps still work - but what's the point of running this on one process?
Let's get started with a basic 2-process PBS batchfile:
#!/bin/bash
#PBS -N basic-mpi
#PBS -l select=2:ncpus=1:mpiprocs=1
#PBS -l walltime=00:10:00
#PBS -j oe
#PBS -o basic-mpi.out
# Change to the directory from which the job was submitted
cd $PBS_O_WORKDIR
# Load MPI module (adjust to your cluster's modules)
echo ""
echo "[ Loading Modules ]"
module purge
module load openmpi/openmpi-5.0.7-gcc-14.2.0-cuda-12.8
# Build the Rust binary and record its path
echo ""
echo "[ Building Program ]"
cargo build --release
BIN=./target/release/basic-mpi
# Run with 2 processes (1 per node)
echo ""
echo "[ Starting Program ]"
mpirun -np 2 -hostfile $PBS_NODEFILE $BIN
Let's note two things. Firstly, we aren't using `cargo run` here; we're building a binary with the release profile. This is important because not only does `mpirun` need a binary it can launch on every node, but a binary built this way is also significantly better optimized.
Secondly, we're asking for 2 MPI processes in two places, and the two must stay in sync (see the sketch after this list):
- The PBS directive (`select=2`)
- `mpirun` with `-np 2`
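For example, scaling the same job up to 4 single-CPU MPI processes means changing both in lockstep - a sketch, with everything else in the batchfile unchanged:
# In the PBS directives:
#PBS -l select=4:ncpus=1:mpiprocs=1

# And in the launch line:
mpirun -np 4 -hostfile $PBS_NODEFILE $BIN
If the two numbers drift apart, you'll typically either hit a "not enough slots" error from mpirun or leave some of the nodes you requested sitting idle.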
Final Results
Testing this, we get a successful output:
$ qsub run.pbs
69937.cm
$ cat basic-mpi.out
[ Loading Modules ]
Loading openmpi/openmpi-4.1.8-gcc-11.4.0-cuda-11.8
  Loading requirement: gcc/gcc-11.4.0 cuda/cuda-11.8
[ Building Program ]
   Compiling basic-mpi v0.1.0 (/nfs/ihfs/home_metis/z1994244/projects/rust/basic-mpi)
    Finished `release` profile [optimized] target(s) in 0.42s
[ Starting Program ]
Global average of 0.4995294844859745 over 1000000 elements
Computed on 2 MPI processes
With this, we've successfully run a multi-node Rust program with distributed memory!
This approach is hyper-modern - you get the low-level performance of C and OpenMPI with the safety and opinionated predictability of Rust.
With those benefits, Rust positions itself as a serious contender in the HPC space.