Problem QRW5: Good solutions still get lost after some time when using the 'speed-bonus' reward handler.
Goal: Find out whether a steadier improvement can be achieved
Run multiple trainings with the 'speed-bonus' reward handler and cross-validate learning rate and epsilon. Use smaller values for learning rate and epsilon, as they promise more stability.
epoch count: 10k
Considers all simulation events when calculating the reward.
Possible simulation events created for an agent:
After every simulation step:
At simulation end:
(t * x) is the 'speed bonus', where
t = 1 - (s / max_s)
s: number of steps when the simulation ended
max_s: maximum number of steps for a simulation
This means the reward/penalty is higher the shorter the simulation ran: the agent gets a higher reward for quickly pushing the opponent out, and a higher penalty for quickly moving out of the field unforced. A sketch of this calculation follows below.
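A minimal sketch of the speed-bonus scaling under the definitions above. The function names and the base reward value `x` are illustrative assumptions, not taken from the project code.

```python
def speed_bonus_factor(steps: int, max_steps: int) -> float:
    """t = 1 - (s / max_s): closer to 1 the faster the simulation ended."""
    return 1.0 - (steps / max_steps)


def speed_bonus_reward(x: float, steps: int, max_steps: int) -> float:
    """Scale the base reward/penalty x (assumed value) by the speed bonus t.

    A short simulation (small `steps`) gives a large |t * x|: a fast win is
    rewarded more, a fast unforced exit is penalized more.
    """
    return speed_bonus_factor(steps, max_steps) * x


# Win (x = +1) after 200 of 1000 possible steps -> reward 0.8
print(speed_bonus_reward(1.0, 200, 1000))
# Unforced exit (x = -1) after 200 of 1000 possible steps -> penalty -0.8
print(speed_bonus_reward(-1.0, 200, 1000))
```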
| | L0 | L1 | L2 |
|---|---|---|---|
| learning rate | 0.15 | 0.1 | 0.05 |
| | E0 | E1 | E2 |
|---|---|---|---|
| epsilon | 0.015 | 0.01 | 0.005 |
| | D0 |
|---|---|
| discount | 0.3 |
| | M0 |
|---|---|
| mapping | non-linear-3 |
| | R0 | R1 | R2 |
|---|---|---|---|
| reward handler | speed-bonus | speed-bonus | speed-bonus |
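A sketch of the resulting cross-validation grid: every learning rate (L0-L2) is combined with every epsilon (E0-E2), while discount, mapping, and reward handler stay fixed. `run_training` is a hypothetical placeholder for the project's actual training entry point.

```python
from itertools import product

learning_rates = {"L0": 0.15, "L1": 0.1, "L2": 0.05}
epsilons = {"E0": 0.015, "E1": 0.01, "E2": 0.005}
discount = 0.3            # D0
mapping = "non-linear-3"  # M0
reward_handler = "speed-bonus"
epoch_count = 10_000

# 3 x 3 = 9 training runs, one per (learning rate, epsilon) combination.
for (l_name, lr), (e_name, eps) in product(learning_rates.items(), epsilons.items()):
    run_id = f"{l_name}-{e_name}"
    print(f"starting run {run_id}: learning_rate={lr}, epsilon={eps}")
    # run_training(run_id, learning_rate=lr, epsilon=eps,
    #              discount=discount, mapping=mapping,
    #              reward_handler=reward_handler, epochs=epoch_count)
```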