OpenMC Guide
Parallel Computing in OpenMC
Why Parallel Monte Carlo
Monte Carlo particle transport is inherently parallel: each particle history samples a random walk through the geometry independently of every other history. The only synchronization point in a criticality calculation is the end of each batch, where the fission bank is redistributed and k-effective is updated. This makes Monte Carlo codes scale efficiently across many processors, provided the particle population is large enough to keep all workers busy and the geometry data fits in memory.
OpenMC supports two complementary parallelism models. OpenMP uses shared memory within a single compute node — all threads see the same geometry and material data, and particle histories are divided among threads with minimal overhead. MPI distributes work across separate processes that may reside on different nodes of a cluster; each MPI rank loads its own copy of the geometry and cross-section data, and ranks communicate only to exchange fission bank sites at batch boundaries. A hybrid approach combining MPI across nodes with OpenMP within each node is standard on HPC systems.
OpenMP (Shared Memory)
OpenMP parallelism is the simplest to use and requires no special installation beyond an OpenMC build compiled with OpenMP support (the default on conda-forge). Each thread tracks particles independently, sharing the same geometry and tally data structures in memory. Thread counts are specified either through the Python API or the command line.
Python API
import openmc
model = openmc.Model(geometry=geometry, materials=materials, settings=settings)
# Run with 8 OpenMP threads
model.run(threads=8)
# Or equivalently from the command line:
# openmc -s 8
The optimal thread count is typically equal to the number of physical CPU cores on the machine. Hyperthreading (SMT) rarely helps Monte Carlo workloads because the computation is memory-bandwidth-limited, and two threads sharing a core compete for cache. For a 16-core workstation, start with threads=16 and benchmark; going to 32 threads on a 16-core/32-thread CPU usually yields little or no improvement.
The particle population per batch should be large enough that each thread has at least a few thousand histories to track. With 10,000 particles and 8 threads, each thread handles roughly 1,250 particles per batch — adequate but tight. For production runs with many tallies, 50,000–100,000 particles per batch is typical.
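The per-worker arithmetic is easy to automate. Below is a small sketch (the helper name and the 1,000-history floor are assumptions for illustration, not part of OpenMC) that flags under-loaded runs before launch:

```python
# Check that each worker (thread or rank) gets enough histories per batch.
# Hypothetical helper; the 1,000-history floor is an assumed rule of thumb.
def histories_per_worker(particles, workers, minimum=1000):
    """Return (histories per worker, whether the workload meets the minimum)."""
    share = particles // workers
    return share, share >= minimum

share, ok = histories_per_worker(10_000, 8)   # 1,250 per thread: adequate but tight
share, ok = histories_per_worker(100_000, 8)  # 12,500 per thread: comfortable
```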
MPI (Distributed Memory)
MPI parallelism distributes the simulation across separate processes, each with its own memory space. This is essential on clusters where compute nodes do not share memory. Each MPI rank loads all geometry, materials, and cross-section data independently, so the per-node memory requirement is roughly the serial memory footprint (the particle bank is split, but the cross-section data is duplicated).
Command-Line MPI Execution
# Run OpenMC on 4 MPI ranks
mpiexec -n 4 openmc
# With a Python script that builds the model
mpiexec -n 4 python build_and_run.py
# On a cluster with a job scheduler (SLURM example)
# srun -n 64 openmc
Python API with MPI
import openmc
model = openmc.Model(geometry=geometry, materials=materials, settings=settings)
# Run with 4 MPI ranks from within Python
model.run(mpi_args=['mpiexec', '-n', '4'])
# Note: when running under mpiexec directly (mpiexec -n 4 python script.py),
# do NOT pass mpi_args — MPI is already initialized by the launcher.
In criticality mode, at each batch boundary every MPI rank sends its locally produced fission sites to rank 0, which combines and redistributes them uniformly across all ranks for the next batch. This communication step is fast relative to the transport time, but it means that the total number of particles per batch must be large enough to give each rank a meaningful share. A rule of thumb is at least 1,000 particles per rank per batch; fewer than that and the overhead of fission bank redistribution dominates.
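The gather-and-redistribute step can be sketched in plain Python. This is a schematic of the load-balancing idea only, not OpenMC's fission bank code: real bank entries carry position, energy, and weight, and real code handles uneven remainders.

```python
# Schematic of fission-bank redistribution at a batch boundary.
# Illustration only, not OpenMC's implementation.
def redistribute(local_banks):
    """Combine per-rank fission banks and split them evenly for the next batch."""
    combined = [site for bank in local_banks for site in bank]  # gather on rank 0
    n_ranks = len(local_banks)
    share = len(combined) // n_ranks  # assumes the total divides evenly
    return [combined[i * share:(i + 1) * share] for i in range(n_ranks)]

# Four ranks produced uneven numbers of fission sites this batch
banks = [[("site", rank, i) for i in range(1000 + 500 * rank)] for rank in range(4)]
new_banks = redistribute(banks)  # every rank now starts the next batch with an equal share
```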
Hybrid MPI + OpenMP
On multi-node clusters, the standard approach is one or a few MPI ranks per node with OpenMP threads filling the remaining cores. This minimizes memory duplication (fewer copies of cross-section data) while still utilizing all available cores. A 64-core node might run 2 MPI ranks × 32 threads, or 4 ranks × 16 threads, depending on the memory footprint.
Hybrid Execution
# 4 MPI ranks, each with 8 OpenMP threads (32 cores total)
mpiexec -n 4 openmc -s 8
# SLURM job script for 2 nodes, 64 cores per node
# 2 ranks per node × 32 threads = 128 cores total
# srun --nodes=2 --ntasks-per-node=2 openmc -s 32
# Environment variable alternative for threads:
# export OMP_NUM_THREADS=8
# mpiexec -n 4 openmc
The optimal split between MPI ranks and threads depends on the problem. Geometry-heavy models with many cells and surfaces benefit from fewer ranks (less memory duplication). Tally-heavy models where each thread must accumulate into shared tally bins may benefit from more ranks and fewer threads, since MPI ranks have fully independent tally arrays while OpenMP threads must use atomic operations or thread-local accumulation.
Scaling and Performance
Strong scaling measures how simulation wall time decreases as the number of processors increases for a fixed problem size. Ideal strong scaling halves the time when the processor count doubles. Monte Carlo transport typically achieves 85–95% parallel efficiency up to hundreds of cores, with degradation at higher counts due to communication overhead and load imbalance in the fission bank redistribution.
Weak scaling holds the work per processor constant and measures how wall time changes as both problem size and processor count grow together. Monte Carlo weak scaling is generally excellent because adding more particles is embarrassingly parallel — the only limiting factor is the batch-boundary synchronization.
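Strong-scaling results reduce to two ratios: speedup relative to a baseline, and efficiency relative to ideal speedup. A small helper with made-up timings (not OpenMC benchmark data) makes the definitions concrete:

```python
# Speedup and parallel efficiency from measured wall times.
# Helper and timings are illustrative, not OpenMC benchmark data.
def strong_scaling(base_time, base_procs, times):
    """Map processor count -> (speedup, efficiency) relative to a baseline run."""
    return {
        procs: (base_time / t, (base_time / t) / (procs / base_procs))
        for procs, t in times.items()
    }

# 1000 s serial baseline; measured wall times at 2, 4, and 8 processors
results = strong_scaling(1000.0, 1, {2: 520.0, 4: 270.0, 8: 145.0})
# results[2] -> speedup ~1.92, efficiency ~0.96
```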
Performance Settings
import openmc
settings = openmc.Settings()
# Population large enough for good parallel efficiency:
# At least 1000 particles per MPI rank, ideally more
settings.particles = 100000 # 100k particles per batch
settings.batches = 200
settings.inactive = 50
# Verbosity 7 prints timing breakdowns (transport, I/O, communication)
settings.verbosity = 7
# For reproducibility across different parallel decompositions,
# OpenMC uses a skip-ahead random number generator. The same
# seed produces the same results regardless of thread/rank count.
settings.seed = 1
Reproducibility: OpenMC's random number generator guarantees identical results for the same seed regardless of the number of threads or MPI ranks. This is achieved through a skip-ahead algorithm that assigns each particle a deterministic subsequence. Changing the particle count or batch count will change results, but changing the parallel decomposition will not.
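The skip-ahead idea can be illustrated with a toy linear congruential generator. The constants and stride below are values commonly quoted for OpenMC's 63-bit LCG, used here as an assumption; this sketch is not OpenMC's production generator.

```python
# Toy illustration of seed reproducibility via skip-ahead.
# Constants/stride are assumed values, not OpenMC's production code.
M = 2**63                    # modulus
A = 2806196910506780709      # multiplier
C = 1                        # increment
STRIDE = 152917              # samples reserved per particle history (assumed)

def skip_ahead(seed, n):
    """Advance the LCG by n steps in O(log n) via repeated squaring."""
    a_n, c_n = 1, 0
    a, c = A, C
    while n > 0:
        if n & 1:
            a_n = a * a_n % M
            c_n = (a * c_n + c) % M
        c = (a + 1) * c % M
        a = a * a % M
        n >>= 1
    return (a_n * seed + c_n) % M

def particle_seed(seed, particle_index):
    """Starting RNG state for one particle, independent of thread/rank layout."""
    return skip_ahead(seed, particle_index * STRIDE)
```

Because each particle's substream depends only on the global seed and the particle index, the same histories are generated no matter how particles are divided among threads or ranks.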
Memory Considerations
Each MPI rank loads the full geometry, material, and cross-section data, which for a large reactor model with many nuclides can reach several gigabytes. The particle bank is split across ranks, but this is usually a small fraction of total memory. OpenMP threads within a rank share all data, so increasing threads does not increase memory.
For memory-constrained systems, reducing the number of MPI ranks per node and increasing threads is the primary strategy. If the cross-section data alone exceeds available memory per rank, consider reducing the number of nuclides tracked, using natural-element cross sections where isotopic detail is unnecessary (e.g., structural materials), or moving to nodes with more memory.
Tallies also consume memory proportional to the number of bins × the number of threads (each thread maintains a local accumulation array to avoid contention). A fine 3D mesh tally with many energy groups and scores can dominate memory usage. Reduce tally dimensionality or use triggers to stop the simulation once the desired precision is reached.
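A back-of-envelope estimate shows how quickly thread-local tally arrays grow. The helper and the storage layout (double-precision sum and sum-of-squares per bin, one private array per thread) are assumptions for illustration, not OpenMC internals:

```python
# Rough tally memory estimate (hypothetical helper, assumed layout).
def tally_memory_bytes(n_bins, n_scores, n_threads, bytes_per_value=8, moments=2):
    """Estimate memory for thread-local tally accumulation arrays.

    Assumes each thread keeps a private copy and each bin stores a sum and
    a sum-of-squares (moments=2) in double precision.
    """
    return n_bins * n_scores * moments * bytes_per_value * n_threads

# 100x100x100 mesh with 10 energy groups, 3 scores, 32 threads
est = tally_memory_bytes(100**3 * 10, 3, 32)  # ~15 GB before any transport begins
```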
Common Issues and Troubleshooting
Typical Parallel Problems
- Slow scaling beyond ~16 threads: Usually caused by NUMA effects on multi-socket systems. Pin threads to cores using OMP_PROC_BIND=spread and OMP_PLACES=cores.
- Out of memory with many MPI ranks: Each rank duplicates the full cross-section library. Reduce ranks per node and increase threads instead.
- Poor batch-to-batch timing consistency: Uneven fission source distribution can cause load imbalance. Increase inactive batches to allow the source to converge before active batches begin.
- MPI errors on Windows: OpenMC's MPI support is primarily tested on Linux. On Windows, use WSL2 for MPI runs.
- File I/O contention: On shared file systems (NFS, Lustre), many ranks writing statepoints simultaneously can bottleneck. Use settings.statepoint = {'batches': [settings.batches]} to write a single statepoint only at the final batch.
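The thread-pinning advice above can be collected into a single launch recipe. The numbers here are a starting point to benchmark on a hypothetical two-socket node, not universal defaults:

```shell
# Hybrid launch recipe for a two-socket node (example values; tune per system)
export OMP_NUM_THREADS=16      # threads per MPI rank
export OMP_PROC_BIND=spread    # spread threads to avoid NUMA hotspots
export OMP_PLACES=cores        # pin one thread per physical core
mpiexec -n 4 openmc -s "$OMP_NUM_THREADS"
```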