number of epochs: 20k
Rewards are no longer clipped to a lower and upper bound.
The ranges for learning rate, epsilon, and discount were chosen based on the results of Q-CV4.
All values for mapping are used.
The reward calculation considers all simulation events.
Possible simulation events created for an agent:
After every simulation step:
At simulation end:
(t * x) is the 'speed bonus'
t = 1 - (s / max_s)
s: Number of steps when the simulation ended
max_s: Max number of steps for a simulation
This means the reward or penalty is higher the shorter the simulation ran: the agent gets a higher reward for quickly pushing the opponent out of the field, and a higher penalty for quickly moving out of the field unforced.
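A minimal sketch of how this speed-scaled end-of-simulation reward could be computed; the function and parameter names (`speed_scaled_reward`, `base_reward`, `steps`, `max_steps`) are illustrative assumptions, not taken from the project code:

```python
def speed_scaled_reward(base_reward: float, steps: int, max_steps: int) -> float:
    """Scale an end-of-simulation reward/penalty by the speed bonus t = 1 - (s / max_s)."""
    t = 1.0 - steps / max_steps
    # The fewer steps the simulation took, the larger t, so a fast win yields a
    # higher reward and a fast unforced exit from the field a higher penalty.
    return t * base_reward

# Example: ending after 30 of at most 100 steps scales the base value by t = 0.7.
print(speed_scaled_reward(base_reward=100.0, steps=30, max_steps=100))   # 70.0
print(speed_scaled_reward(base_reward=-100.0, steps=30, max_steps=100))  # -70.0
```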
|               | L0  |
|---------------|-----|
| learning rate | 0.7 |

|         | E0   |
|---------|------|
| epsilon | 0.05 |

|          | D0  | D1  |
|----------|-----|-----|
| discount | 0.5 | 0.8 |

|         | M0           | M1           | M2           | M3           |
|---------|--------------|--------------|--------------|--------------|
| mapping | non-linear-1 | non-linear-2 | non-linear-3 | non-linear-4 |
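For reference, the resulting cross-validation grid can be enumerated as the Cartesian product of these values; the variable names below are illustrative, only the values come from the tables above:

```python
from itertools import product

learning_rates = [0.7]                       # L0
epsilons = [0.05]                            # E0
discounts = [0.5, 0.8]                       # D0, D1
mappings = ["non-linear-1", "non-linear-2",  # M0, M1
            "non-linear-3", "non-linear-4"]  # M2, M3

# 1 * 1 * 2 * 4 = 8 configurations, each trained for 20k epochs.
configs = [
    {"learning_rate": lr, "epsilon": eps, "discount": d, "mapping": m}
    for lr, eps, d, m in product(learning_rates, epsilons, discounts, mappings)
]
print(len(configs))  # 8
```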