Find the optimal reward handler
epochs: 2000
The other values are based on the results of QWDC2
Cross-validation for the reward handler
| | L0 | L1 | L2 | L3 |
|---|---|---|---|---|
| learning rate | 0.8 | 0.8 | 0.8 | 0.8 |

| | E0 |
|---|---|
| epsilon | 0.08 |

| | ED0 |
|---|---|
| epsilon decay | decay-exp-1000 |

| | D0 |
|---|---|
| discount | 0.25 |

| | M0 |
|---|---|
| mapping | non-linear-3 |

| | R0 | R1 | R2 | R3 |
|---|---|---|---|---|
| reward handler | continuous-consider-all | reduced-push-reward | can-see | can-see |

| | F0 |
|---|---|
| fetch mode | eager |
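
As a rough illustration, the settings listed above could be expanded into one configuration per run. This is only a sketch under assumptions: the names `shared`, `build_runs`, and the dictionary keys are hypothetical, and it assumes that variant columns are paired by index (L0 with R0, L1 with R1, and so on) while single-column parameters are shared by every run. The notes do not state how the columns are combined, so a different scheme (for example a full Cartesian product) may be what was actually used.

```python
# Hypothetical encoding of the parameter tables; key names are illustrative only.
shared = {
    "epochs": 2000,
    "epsilon": 0.08,                    # E0
    "epsilon_decay": "decay-exp-1000",  # ED0
    "discount": 0.25,                   # D0
    "mapping": "non-linear-3",          # M0
    "fetch_mode": "eager",              # F0
}

learning_rates = [0.8, 0.8, 0.8, 0.8]   # L0..L3
reward_handlers = [                     # R0..R3
    "continuous-consider-all",
    "reduced-push-reward",
    "can-see",
    "can-see",
]


def build_runs(shared, learning_rates, reward_handlers):
    """Pair the i-th learning rate with the i-th reward handler (assumed pairing)."""
    runs = []
    for i, (lr, handler) in enumerate(zip(learning_rates, reward_handlers)):
        run = dict(shared)  # copy so each run can be modified independently
        run["name"] = f"L{i}-R{i}"
        run["learning_rate"] = lr
        run["reward_handler"] = handler
        runs.append(run)
    return runs


runs = build_runs(shared, learning_rates, reward_handlers)
for run in runs:
    print(run["name"], run["learning_rate"], run["reward_handler"])
```

Because the learning rate is the same in every column, only the reward handler actually varies between runs under this pairing, which matches the stated goal of comparing reward handlers.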