Problem QRW5: Good solutions are still lost after some time when using the 'speed-bonus' reward handler.
Goal: Find out whether a more consistent improvement can be achieved.
Run multiple trainings with the 'speed-bonus' reward handler and cross-validate learning rate and epsilon. Use smaller values for learning rate and epsilon, as they promise more stability. A sketch of the resulting parameter grid follows after the tables below.
epoch count: 10k
The 'speed-bonus' reward handler considers all simulation events when calculating the reward.
Possible simulation events created for an agent:
After every simulation step:
At simulation end:
(t * x) is the 'speed bonus'
t = 1 - (s / max_s)
s: Number of steps when the simulation ended
max_s: Max number of steps for a simulation
This means the reward/penalty is higher the shorter the simulation ran: the agent gets a higher reward for quickly pushing the opponent out, or a higher penalty for quickly moving out of the field unforced.
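
A minimal sketch of this scaling, assuming hypothetical names (`speed_bonus_factor`, `end_of_simulation_reward`, `base_reward`); the actual reward handler code may differ:

```python
def speed_bonus_factor(steps: int, max_steps: int) -> float:
    # t = 1 - (s / max_s): close to 1 when the simulation ends early,
    # close to 0 when it runs up to the step limit.
    return 1.0 - (steps / max_steps)

def end_of_simulation_reward(base_reward: float, steps: int, max_steps: int) -> float:
    # The reward/penalty x given at simulation end is scaled by the speed bonus t.
    return speed_bonus_factor(steps, max_steps) * base_reward

# Example: winning after 200 of 1000 possible steps keeps 80% of the base reward,
# winning only after 900 steps keeps just 10%.
print(end_of_simulation_reward(100.0, 200, 1000))  # 80.0
print(end_of_simulation_reward(100.0, 900, 1000))  # 10.0
```
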
|               | L0   | L1  | L2   |
|---------------|------|-----|------|
| learning rate | 0.15 | 0.1 | 0.05 |

|         | E0    | E1   | E2    |
|---------|-------|------|-------|
| epsilon | 0.015 | 0.01 | 0.005 |

|          | D0  |
|----------|-----|
| discount | 0.3 |

|         | M0           |
|---------|--------------|
| mapping | non-linear-3 |

|                | R0          | R1          | R2          |
|----------------|-------------|-------------|-------------|
| reward handler | speed-bonus | speed-bonus | speed-bonus |
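
A minimal sketch of how the cross-validation above could be expanded into individual training runs, assuming a full 3 x 3 grid over learning rate and epsilon; the config keys and the run loop are hypothetical and not taken from the actual training code:

```python
from itertools import product

# Candidate values from the tables above.
learning_rates = {"L0": 0.15, "L1": 0.1, "L2": 0.05}
epsilons = {"E0": 0.015, "E1": 0.01, "E2": 0.005}

configs = []
for (lr_id, lr), (eps_id, eps) in product(learning_rates.items(), epsilons.items()):
    configs.append({
        "id": f"{lr_id}-{eps_id}",
        "learning_rate": lr,
        "epsilon": eps,
        "discount": 0.3,              # D0
        "mapping": "non-linear-3",    # M0
        "reward_handler": "speed-bonus",
        "epoch_count": 10_000,
    })

print(len(configs))  # 9 training runs to compare for a more consistent improvement
```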