Find the optimal reward handler
epochs: 2000
The other values are based on the results of QWDC2
Cross validation for reward handler

| | L0 | L1 | L2 | L3 |
|---|---|---|---|---|
| learning rate | 0.8 | 0.8 | 0.8 | 0.8 |

| | E0 |
|---|---|
| epsilon | 0.08 |

| | ED0 |
|---|---|
| epsilon decay | decay-exp-1000 |

| | D0 |
|---|---|
| discount | 0.25 |

| | M0 |
|---|---|
| mapping | non-linear-3 |

| | R0 | R1 | R2 | R3 |
|---|---|---|---|---|
| reward handler | continuous-consider-all | reduced-push-reward | can-see | can-see |

| | F0 |
|---|---|
| fetch mode | eager |
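
The tables above can be read as a small parameter grid in which only the reward handler varies. The sketch below shows one way such a grid could be expanded into individual run configurations. The dictionary layout and the `expand_grid` helper are hypothetical and not part of the original setup, and the sketch assumes that the runs are the Cartesian product of the distinct values of each parameter, which the notes do not state.

```python
from itertools import product

# Values copied from the tables above; the dictionary layout is an assumption.
EPOCHS = 2000
GRID = {
    "learning rate": [0.8, 0.8, 0.8, 0.8],
    "epsilon": [0.08],
    "epsilon decay": ["decay-exp-1000"],
    "discount": [0.25],
    "mapping": ["non-linear-3"],
    "reward handler": [
        "continuous-consider-all",
        "reduced-push-reward",
        "can-see",
        "can-see",
    ],
    "fetch mode": ["eager"],
}


def expand_grid(grid):
    """Return one dict per combination of distinct parameter values.

    Hypothetical helper: assumes the runs are the Cartesian product of
    the distinct values listed for each parameter.
    """
    names = list(grid)
    # dict.fromkeys drops repeated values while keeping their order
    distinct = [list(dict.fromkeys(grid[name])) for name in names]
    return [dict(zip(names, combo)) for combo in product(*distinct)]


configs = expand_grid(GRID)
for config in configs:
    print(EPOCHS, config)
```

Under that assumption the repeated `0.8` and `can-see` entries collapse, so the grid yields one configuration per distinct reward handler. If the columns are instead meant to be paired by index (L0 with R0, L1 with R1, and so on), `zip` over the value lists would be the closer match.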