Summary: Advice on Reinforcement Learning Experimentation
During the BeNeRL seminar talks, researchers also briefly share their approach to RL experimentation. The list below keeps track of their main advice.
(BE = Benjamin Eysenbach, DP = Daniel Palenicek, AA = Ademi Adeniji, TD = Tal Daniel, HL = Hojoon Lee)
Managing experiments
Maintain a lab/experiment journal (BE)
Number every experiment you perform
Rule of thumb: a paper requires ~200-250 experiments in total (unfruitful research directions will stop earlier)
Before each experiment, write down your hypotheses (BE)
Determine what you want to get out of the next experiment
Think beyond "does my method outperform the baseline?" --> there are many more (interesting) questions
Add reminders to yourself (BE)
In your journal, add notes to yourself: when to check back on a certain experiment (with a date), what to look for, what to do next, etc.
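As a minimal sketch tying the journal, hypothesis, and reminder advice together, a numbered journal entry could look like the following; the field names and numbering scheme are illustrative, not something the speakers prescribe.

```python
from dataclasses import dataclass

@dataclass
class ExperimentEntry:
    """Illustrative journal entry: one record per numbered experiment."""
    number: int            # every experiment gets a unique number
    date: str              # when it was launched
    hypothesis: str        # what you expect to learn, written down beforehand
    command: str           # exact command / config used to launch it
    check_back_on: str     # reminder: when to look at it again
    what_to_look_for: str  # reminder: which metrics or plots to inspect
    conclusions: str = ""  # filled in after analysis
    next_steps: str = ""   # what to try next

# Hypothetical example entry.
entry = ExperimentEntry(
    number=42,
    date="2025-01-15",
    hypothesis="A larger replay buffer should stabilize Q-value estimates.",
    command="python train.py --buffer-size 1000000 --seed 0",
    check_back_on="2025-01-17",
    what_to_look_for="Q-value overestimation, return variance across seeds",
)
```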
Change one thing at a time (DP)
Don't change 5 things at once, since you will not know what broke your system.
Interpreting experiments
Log as much data as possible (BE, DP, TD, HL)
Don't only log learning curves. The more information you log, the more you can analyze
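As a minimal sketch of logging more than the learning curve (assuming wandb as the logger; the metric names are illustrative, not a fixed set), a single wandb.log call can carry losses, Q-statistics, entropy, and gradient norms alongside the return:

```python
import wandb

def log_training_metrics(step, episodic_return, critic_loss, actor_loss,
                         q_value_mean, policy_entropy, grad_norm):
    """Log richer diagnostics per step; assumes wandb.init() was called earlier."""
    wandb.log({
        "charts/episodic_return": episodic_return,
        "losses/critic_loss": critic_loss,
        "losses/actor_loss": actor_loss,
        "stats/q_value_mean": q_value_mean,
        "stats/policy_entropy": policy_entropy,
        "stats/grad_norm": grad_norm,
    }, step=step)
```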
Dig deep into your results
Analysis > Coding. Spend a significant part of your time on analyzing the output of an experiment (BE)
Be curious, try to learn what is going on. Don't only look at learning curves. (BE)
Use your debugger for careful inspection during training (note: jax.debug.breakpoint()) (DP)
Analyze sources of failure. (TD)
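A minimal sketch of the JAX debugging tools mentioned above, using a toy TD-error function: jax.debug.print and jax.debug.breakpoint work inside jit-compiled functions, where an ordinary Python print or breakpoint would not behave as expected.

```python
import jax
import jax.numpy as jnp

@jax.jit
def td_error(q_pred, q_target):
    err = q_pred - q_target
    # Prints at runtime even under jit (a plain print() would only run at trace time).
    jax.debug.print("mean TD error: {x}", x=err.mean())
    # Uncomment to drop into an interactive debugger inside the compiled function:
    # jax.debug.breakpoint()
    return err

td_error(jnp.array([1.0, 2.0]), jnp.array([0.5, 2.5]))
```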
Visualize (BE, DP, AA)
Try to visualize your results as much as possible
Automate this, possibly through your own plotting pipeline, such as: https://github.com/danielpalen/wandb_plot
Name your experiments thoughtfully, including the relevant hyperparameters, and use wandb to record command-line flags and save a code snapshot
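A minimal sketch of this bookkeeping with wandb (the project name and flags are illustrative): wandb.init can store the parsed command-line flags as the run config and snapshot the launching code, while the run name encodes the key hyperparameters.

```python
import argparse
import wandb

parser = argparse.ArgumentParser()
parser.add_argument("--env-id", default="HalfCheetah-v4")
parser.add_argument("--learning-rate", type=float, default=3e-4)
parser.add_argument("--seed", type=int, default=0)
args = parser.parse_args()

# Encode the relevant hyperparameters in the run name so runs are easy to tell apart.
run_name = f"sac_{args.env_id}_lr{args.learning_rate}_seed{args.seed}"

wandb.init(
    project="my-rl-project",   # illustrative project name
    name=run_name,
    config=vars(args),         # records all command-line flags
    save_code=True,            # snapshots the code that launched the run
)
```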
Don't draw premature conclusions from single seeds (DP, AA)
Have scripts ready to deploy your experiments to a cluster, and ensure that you always run multiple seeds
Unless debugging, always run experiments over multiple seeds, at least 4
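A minimal sketch of the multi-seed habit: launch the same configuration several times and aggregate before drawing conclusions. The training script and result paths here are hypothetical; on a cluster these would typically be separate job submissions.

```python
import subprocess
import numpy as np

SEEDS = [0, 1, 2, 3]  # at least 4 seeds, as suggested above

# Launch one run per seed (hypothetical train.py; submit as separate cluster jobs in practice).
for seed in SEEDS:
    subprocess.run(["python", "train.py", "--seed", str(seed)], check=True)

# Aggregate over seeds rather than reading a single curve (hypothetical result files).
returns = np.array([np.load(f"results/returns_seed{seed}.npy") for seed in SEEDS])
mean = returns.mean(axis=0)
stderr = returns.std(axis=0) / np.sqrt(len(SEEDS))
print(mean[-1], stderr[-1])  # final performance with a rough uncertainty estimate
```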
Implementations
Just use good implementations
Good implementations can matter more than model-based vs. model-free or on-policy vs. off-policy, e.g. DreamerV3 requires minimal tuning. (AA)
Start simple with the cleanest code possible (CleanRL, PufferLib). (TD)
For offline RL, the policy extraction temperature and the conservative regularization term matter (AA)
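A minimal sketch of where these two knobs typically enter, assuming an AWR/IQL-style extraction temperature and a CQL-style conservative term; the exact form depends on the algorithm and is not prescribed by the source.

```python
import numpy as np

def policy_extraction_weights(advantages, temperature=1.0, max_weight=100.0):
    # AWR/IQL-style extraction weights exp(A / temperature):
    # lower temperature is greedier, higher stays closer to behavior cloning.
    return np.minimum(np.exp(advantages / temperature), max_weight)

def conservative_penalty(q_all_actions, q_dataset_actions, alpha=1.0):
    # CQL-style regularizer: push Q down on out-of-distribution actions and
    # up on dataset actions; alpha sets the strength of conservatism.
    logsumexp_q = np.logaddexp.reduce(q_all_actions, axis=-1)
    return alpha * (logsumexp_q - q_dataset_actions).mean()
```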
For model-based RL, first make sure your algorithm works with ground-truth state (TD)
If time permits, try to re-implement algorithms from scratch to understand the details (HL)
Iterate fast, utilize GPU-based simulators: Isaac Lab, Brax / PufferLib (TD)
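The point of GPU-based simulators is that thousands of environments step in parallel on the accelerator. A toy JAX sketch of that idea (not the actual Isaac Lab or Brax API), using vmap to step a batch of trivial environments in one compiled call:

```python
import jax
import jax.numpy as jnp

def step(state, action):
    # Trivial stand-in for an environment step: real simulators implement the
    # physics, but expose the same "batched step on device" pattern.
    next_state = state + 0.1 * action
    reward = -jnp.sum(next_state ** 2)
    return next_state, reward

# vmap + jit: one compiled call steps every environment in the batch in parallel.
batched_step = jax.jit(jax.vmap(step))

num_envs = 4096
states = jnp.zeros((num_envs, 3))
actions = jnp.ones((num_envs, 3))
next_states, rewards = batched_step(states, actions)
print(rewards.shape)  # (4096,)
```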
Think big, start small (HL)