Summary: Advice on Reinforcement Learning Experimentation
During the BeNeRL seminar talks, researchers also briefly share their approach to RL experimentation. The list below keeps track of their main advice.
(BE = Benjamin Eysenbach, DP = Daniel Palenicek, AA = Ademi Adeniji)
Managing experiments
Maintain a lab/experiment journal (BE)
Number every experiment you perform
Rule of thumb: a paper requires ~200-250 experiments in total (unfruitful research directions will stop earlier)
Before each experiment, write down your hypotheses (BE)
Determine what you want to get out of the next experiment
Think beyond "does my method outperform the baseline?" --> there are many more (interesting) questions to ask
Add reminders to yourself (BE)
In your journal, add notes to yourself: when to check back on a certain experiment (with a date), what to look for, what to do next, etc. (a minimal journal-entry sketch follows at the end of this section)
Change one thing at a time (DP)
Don't change 5 things at once, since you will not know which change broke your system.
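To make the journaling advice above concrete, here is a minimal sketch of a numbered, dated journal entry kept as structured text; the file name, field names, and example values are assumptions, not part of the talks:

```python
# Append one structured entry per experiment to a plain-text journal (JSON lines).
# The file name, field names, and values are illustrative assumptions.
import json
from datetime import date

entry = {
    "id": 137,                                  # number every experiment you perform
    "date": str(date.today()),
    "hypothesis": "Layer norm in the critic reduces Q-value overestimation.",
    "change": "Added LayerNorm to the critic (one change at a time).",
    "check_back": "two weeks after launch",     # reminder: when to revisit this run
    "look_for": "Q-value bias compared to the run without layer norm.",
}

with open("experiment_journal.jsonl", "a") as f:
    f.write(json.dumps(entry) + "\n")
```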
Interpreting experiments
Log as much data as possible (BE, DP)
Don't only log learning curves. The more information you log, the more you can analyze later
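As an illustration, a hedged sketch of logging auxiliary training statistics alongside the return curve with Weights & Biases; the metric names and project name are assumptions, and the values are placeholders:

```python
# Sketch: log more than the learning curve so later analysis has something to dig into.
# Metric names, the project name, and the zero placeholder values are illustrative only.
import wandb

wandb.init(project="rl-experiments", mode="offline")  # offline mode so the sketch runs without an account

for step in range(1, 1001):
    # ... one training update would happen here ...
    wandb.log(
        {
            "train/episode_return": 0.0,
            "train/q_value_mean": 0.0,
            "train/critic_grad_norm": 0.0,
            "train/policy_entropy": 0.0,
            "train/abs_td_error": 0.0,
        },
        step=step,
    )
```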
Dig deep into your results (BE, DP)
Analysis > Coding. Spend a significant part of your time analyzing the output of each experiment
Be curious, try to learn what is going on. Don't only look at learning curves
Use your debugger for careful inspection during training (note: in JAX, jax.debug.breakpoint() works even inside jitted code)
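A minimal sketch of that jax.debug.breakpoint() note, pausing inside a jit-compiled update to inspect intermediate values; the loss and inputs are toy placeholders, not a real RL update:

```python
# Sketch: pause inside a jitted update step to inspect intermediate values.
# The "loss" and the arrays are toy placeholders.
import jax
import jax.numpy as jnp

@jax.jit
def update(params, batch):
    loss = jnp.mean((params * batch) ** 2)  # placeholder computation
    jax.debug.breakpoint()                  # drops into an interactive debugger, even under jit
    return loss

update(jnp.ones(4), jnp.arange(4.0))
```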
Visualize (BE, DP, AA)
Try to visualize your results as much as possible
Automate this, possibly through your own plotting pipeline, such as: https://github.com/danielpalen/wandb_plot
Name your experiments thoughtfully, including the relevant hyperparameters in the name, and use wandb to record your command-line flags and save a code snapshot
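A hedged sketch of that bookkeeping with wandb; the flag names, project name, and run-name pattern are assumptions:

```python
# Sketch: descriptive run names, all command-line flags in the config, and a code snapshot.
# The flags, project name, and naming pattern are illustrative assumptions.
import argparse
import wandb

parser = argparse.ArgumentParser()
parser.add_argument("--env", default="HalfCheetah-v4")
parser.add_argument("--lr", type=float, default=3e-4)
parser.add_argument("--seed", type=int, default=0)
args = parser.parse_args()

run = wandb.init(
    project="rl-experiments",
    name=f"sac_{args.env}_lr{args.lr}_seed{args.seed}",  # relevant hyperparameters in the run name
    config=vars(args),                                    # record every command-line flag
)
run.log_code(".")                                         # save a snapshot of the current code
```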
Don't draw premature conclusions from single seeds (DP, AA)
Have scripts ready to deploy your experiments to a cluster, and ensure that you always run multiple seeds
Unless debugging, always run experiments over multiple seeds (at least 4)
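A minimal sketch of a seed-sweep launcher in that spirit; "train.py" and its flags are hypothetical, and on a real cluster the subprocess call would typically be replaced by a job-submission command:

```python
# Sketch: run the same configuration over several seeds instead of trusting a single run.
# "train.py" and its flags are hypothetical placeholders; swap the subprocess call for
# your cluster's submission command (e.g. sbatch) when deploying for real.
import subprocess

NUM_SEEDS = 4  # at least 4 seeds outside of quick debugging runs

for seed in range(NUM_SEEDS):
    subprocess.run(
        ["python", "train.py", "--env", "HalfCheetah-v4", "--seed", str(seed)],
        check=True,
    )
```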
Implementations
For online RL - just use good implementations (AA)
Good implementations can matter more than the choice between model-based vs. model-free or on-policy vs. off-policy, e.g. DreamerV3 requires minimal tuning.
For offline RL - the policy extraction temperature and the conservative regularization term matter (AA)
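To make that last point concrete, here is a hedged numpy sketch of temperature-weighted policy extraction in the style of advantage-weighted regression (AWR/AWAC/IQL-like). The arrays are random placeholders standing in for critic and actor outputs; this illustrates where the temperature enters, not the speaker's exact method:

```python
# Sketch: the policy-extraction temperature controls how strongly the extracted policy
# concentrates on high-advantage dataset actions. All quantities are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
advantages = rng.normal(size=256)        # A(s, a) for a batch of dataset transitions
log_probs = rng.normal(size=256) - 1.0   # log pi(a | s) for the same state-action pairs
temperature = 1.0                        # lower -> greedier extraction, higher -> closer to the behavior policy

# Exponentiated-advantage weights, clipped for numerical stability.
weights = np.minimum(np.exp(advantages / temperature), 100.0)
policy_extraction_loss = -np.mean(weights * log_probs)
print(policy_extraction_loss)

# A conservative regularization term (CQL-style) would additionally penalize large Q-values
# on actions sampled away from the dataset; its coefficient is the second knob mentioned above.
```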