QMAP01 Cross validation mapping

Number of epochs: 20k

Rewards are no longer clipped to a lower and upper bound.

The ranges for learning rate, epsilon, and discount were chosen based on the results of Q-CV4.

All available values for the mapping parameter are used.
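To show where the three tuned hyperparameters enter the training, here is a minimal sketch of tabular Q-learning with epsilon-greedy action selection. The constants mirror the L0/E0/D0 values below; the project's actual training loop and state/action types are not shown in this document, so the function signatures are assumptions.

```python
import random

# Values from the QMAP01 grid (D1 would be discount = 0.8).
LEARNING_RATE = 0.7   # L0
EPSILON = 0.05        # E0
DISCOUNT = 0.5        # D0

def select_action(q, state, actions, epsilon=EPSILON):
    """Epsilon-greedy: explore with probability epsilon, else take
    the action with the highest Q-value for this state."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))

def q_update(q, state, action, reward, next_state, actions,
             alpha=LEARNING_RATE, gamma=DISCOUNT):
    """Standard Q-learning update: Q <- Q + alpha * (target - Q),
    where target = reward + gamma * max_a' Q(next_state, a')."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
```

With epsilon = 0.05 the agent takes a random action in roughly 5% of steps; the learning rate of 0.7 makes each update move the Q-value most of the way toward the observed target.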

training.simrunner.RewardHandlerName.continuous-consider-all

Considers all simulation events for calculating the reward.

Possible simulation events created for an agent:

  1. After every simulation step:

  2. At simulation end:

(t * x) is the 'speed bonus'

t = 1 - (s / max_s)

s:     Number of steps when the simulation ended
max_s: Maximum number of steps for a simulation

This means the reward or penalty is higher the shorter the simulation ran: the agent gets a higher reward for quickly pushing out the opponent, or a higher penalty for quickly moving out of the field unforced.
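The speed-bonus scaling above can be sketched as a small function. The factor t comes straight from the text; `base_reward` stands in for the 'x' in the formula, whose exact definition is not given here, so treating it as the unscaled end-of-simulation reward or penalty is an assumption.

```python
def speed_bonus_reward(base_reward: float, steps: int, max_steps: int) -> float:
    """Scale an end-of-simulation reward/penalty by the speed bonus.

    t = 1 - (s / max_s): the fewer steps the simulation took, the
    closer t is to 1, so both rewards and penalties are amplified
    when the match ends quickly.
    """
    t = 1.0 - (steps / max_steps)
    return t * base_reward
```

A win after 250 of 1000 steps keeps 75% of the base reward, while a win after 900 steps keeps only 10%; the same scaling applies to negative base rewards (penalties).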

training.parallel.ParallelConfig.q-map-0

key  parameter      value
L0   learning rate  0.7
E0   epsilon        0.05
D0   discount       0.5
D1   discount       0.8
M0   mapping        non-linear-1
M1   mapping        non-linear-2
M2   mapping        non-linear-3
M3   mapping        non-linear-4
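The parameter grid above expands to 1 x 1 x 2 x 4 = 8 training runs, matching the result labels below. A sketch of that cross product, using plain dicts (the real ParallelConfig API is not shown in this document, so this representation is an assumption):

```python
from itertools import product

# Hypothetical reconstruction of the QMAP01 parameter grid.
learning_rates = {"L0": 0.7}
epsilons = {"E0": 0.05}
discounts = {"D0": 0.5, "D1": 0.8}
mappings = {"M0": "non-linear-1", "M1": "non-linear-2",
            "M2": "non-linear-3", "M3": "non-linear-4"}

# Each run is (label, learning rate, epsilon, discount, mapping).
configs = [
    (f"{l}{e}{d}{m}", lv, ev, dv, mv)
    for (l, lv), (e, ev), (d, dv), (m, mv) in product(
        learning_rates.items(), epsilons.items(),
        discounts.items(), mappings.items())
]
```

The labels run from L0E0D0M0 to L0E0D1M3, with the mapping index varying fastest, as in the results listing.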

Results for: QMAP01 L0E0D0M0

q-values plot; videos 0-10

Results for: QMAP01 L0E0D0M1

q-values plot; videos 0-10

Results for: QMAP01 L0E0D0M2

q-values plot; videos 0-10

Results for: QMAP01 L0E0D0M3

q-values plot; videos 0-10

Results for: QMAP01 L0E0D1M0

q-values plot; videos 0-10

Results for: QMAP01 L0E0D1M1

q-values plot; videos 0-10

Results for: QMAP01 L0E0D1M2

q-values plot; videos 0-10

Results for: QMAP01 L0E0D1M3

q-values plot; videos 0-10