Q-CV7 Cross validation reward at end

number of epochs: 20k

Rewards are no longer clipped to a lower and upper bound.

Rewards are only calculated at the end of a simulation. Rewards on intermediate steps are zero.
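The terminal-only reward scheme can be sketched as follows; the helper name `step_rewards` is a hypothetical illustration, not part of the training code:

```python
def step_rewards(final_reward: float, n_steps: int) -> list[float]:
    # Intermediate steps carry zero reward; only the last step
    # carries the end-of-simulation reward.
    return [0.0] * (n_steps - 1) + [final_reward]
```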

The ranges for learning rate, epsilon and discount were chosen based on the results of Q-CV4.

training.simrunner.RewardHandlerName.end-consider-all

Considers all simulation events for calculating the reward.

Possible simulation events created for an agent:

  1. After every simulation step:

  2. At simulation end:


t = 1 - (s / max_s)

s:     Number of steps when the simulation ended
max_s: Max number of steps for a simulation

This means the reward or penalty is larger the shorter the simulation ran: the agent gets a higher reward for quickly pushing the opponent out of the field, and a higher penalty for quickly moving out of the field unforced.
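Assuming the final reward is scaled linearly by the factor t (the function name and the linear scaling are assumptions for illustration, not taken from the training code), this can be sketched as:

```python
def scaled_reward(base_reward: float, s: int, max_s: int) -> float:
    # t = 1 - (s / max_s): close to 1 for short simulations,
    # 0 when the simulation ran for the maximum number of steps.
    t = 1.0 - (s / max_s)
    return base_reward * t
```

For example, winning after 200 of 1000 steps yields 0.8 of the base reward, while losing after the same 200 steps yields 0.8 of the base penalty.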

training.parallel.ParallelConfig.q-cross-1

learning rate   L0: 0.7    L1: 0.8    L2: 0.9
epsilon         E0: 0.01   E1: 0.05   E2: 0.1
discount        D0: 0.95   D1: 0.99   D2: 0.995   D3: 0.999
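The grid above expands into 3 × 3 × 4 = 36 cross-validation combinations. A sketch of the expansion (the variable names and tuple layout are assumptions, not the actual ParallelConfig format):

```python
from itertools import product

learning_rates = {"L0": 0.7, "L1": 0.8, "L2": 0.9}
epsilons = {"E0": 0.01, "E1": 0.05, "E2": 0.1}
discounts = {"D0": 0.95, "D1": 0.99, "D2": 0.995, "D3": 0.999}

# One entry per combination: (name, learning rate, epsilon, discount).
configs = [
    (f"{l}{e}{d}", learning_rates[l], epsilons[e], discounts[d])
    for l, e, d in product(learning_rates, epsilons, discounts)
]
```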