Problem after QRW06: results are still too unstable.
Goal: Find better values for L and E based on the results of QRW06.
QRW06 showed the best results for L0 and E0, which were the highest values tested for these parameters. For this training we try those values and higher ones.
epoch count: 10k
The reward handler considers all simulation events when calculating the reward.
Possible simulation events created for an agent:
- After every simulation step:
- At simulation end:
(t * x) is the 'speed bonus' applied to the reward or penalty x at simulation end, where:
- t = 1 - (s / max_s)
- s: number of steps when the simulation ended
- max_s: maximum number of steps for a simulation

This means the reward or penalty is higher the shorter the simulation ran: the agent gets a higher reward for quickly pushing the opponent out of the field, and a higher penalty for quickly moving out of the field unforced.
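A minimal sketch of this calculation, assuming x is the base reward or penalty for the simulation outcome; the function and parameter names below are illustrative, not taken from the training code:

```python
def speed_bonus(steps: int, max_steps: int, base_reward: float) -> float:
    """Scale the end-of-simulation reward/penalty by how fast it ended.

    t = 1 - (s / max_s) is close to 1 for short simulations and 0 when
    the step limit is reached, so short wins and short unforced exits
    both produce larger absolute rewards/penalties.
    """
    t = 1.0 - (steps / max_steps)
    return t * base_reward


# Example: winning (base reward 1.0) after 200 of 1000 possible steps
# gives 0.8; an unforced exit (base reward -1.0) at the same step gives -0.8.
```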
| | L0 | L1 | L2 |
|---|---|---|---|
| learning rate | 0.25 | 0.2 | 0.15 |
| | E0 | E1 | E2 |
|---|---|---|---|
| epsilon | 0.025 | 0.02 | 0.015 |

| | D0 |
|---|---|
| discount | 0.3 |

| | M0 |
|---|---|
| mapping | non-linear-3 |

| | R0 | R1 | R2 |
|---|---|---|---|
| reward handler | speed-bonus | speed-bonus | speed-bonus |
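For reference, a hedged sketch of how these parameter sets could be combined into three runs; the per-column pairing (L0/E0/R0 and so on) and the names below are assumptions, not taken from the training setup:

```python
# Assumed grouping: one run per column index; discount and mapping are
# shared across runs because only D0 and M0 are defined.
runs = [
    {"learning_rate": 0.25, "epsilon": 0.025, "discount": 0.3,
     "mapping": "non-linear-3", "reward_handler": "speed-bonus"},  # L0/E0/R0
    {"learning_rate": 0.20, "epsilon": 0.020, "discount": 0.3,
     "mapping": "non-linear-3", "reward_handler": "speed-bonus"},  # L1/E1/R1
    {"learning_rate": 0.15, "epsilon": 0.015, "discount": 0.3,
     "mapping": "non-linear-3", "reward_handler": "speed-bonus"},  # L2/E2/R2
]

# Standard tabular Q-learning update these parameters would feed into:
#   Q[s, a] += learning_rate * (reward + discount * max(Q[s_next]) - Q[s, a])
# epsilon is the exploration rate of an epsilon-greedy policy.
```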