Getting distributed training jobs to run on huge clusters is hard! This is especially true when you start looking at more complex setups like distributed reinforcement learning. Debugging these kinds of jobs is frustrating, and the turnaround time for changes tends to be very slow.
Monarch is a distributed programming framework for PyTorch that makes the cluster programmable through a simple Python API.
