Q-CV5 Cross validation reward unlimited

number of epochs: 20k

Rewards are no longer limited by a lower and upper limit. The reward handler is continuous-consider-all.

The ranges for learning rate, epsilon, and discount were chosen from the results of Q-CV4.

training.simrunner.RewardHandlerName.continuous-consider-all

Considers all simulation events for calculating the reward.

Possible simulation events created for an agent:

After every simulation step:
- Pushed the opponent: Reward = +0.5
- Is pushed by the opponent: Reward = -0.1
At simulation end:
- Winner by pushing the opponent: Reward = 100 + t * 50
- Loser without being pushed: Reward = -100 - t * 50
- Loser being pushed: Reward = -10

(t * x) is the 'speed bonus', where x is the bonus weight (50 in the formulas above)

t = 1 - (s / max_s)

s:     Number of steps when the simulation ended
max_s: Max number of steps for a simulation

This means the reward or penalty is higher the shorter the simulation ran. The agent gets a higher reward for pushing the opponent out quickly, and a higher penalty for quickly moving out of the field without being pushed.
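
The reward rules above can be sketched in Python as follows. This is only an illustration of the arithmetic, not the actual handler code: the names EndEvent, speed_factor, end_reward and step_reward are hypothetical, and the sketch assumes that both step events may occur in the same step and that their rewards are then added.

```python
from enum import Enum


class EndEvent(Enum):
    """Hypothetical names for the three end-of-simulation events."""
    WINNER_BY_PUSHING = "winner-by-pushing"
    LOSER_NOT_PUSHED = "loser-not-pushed"
    LOSER_PUSHED = "loser-pushed"


def speed_factor(steps: int, max_steps: int) -> float:
    """t = 1 - (s / max_s): close to 1 for short runs, 0 for full-length runs."""
    return 1.0 - steps / max_steps


def end_reward(event: EndEvent, steps: int, max_steps: int) -> float:
    """Reward at simulation end, including the speed bonus t * 50."""
    t = speed_factor(steps, max_steps)
    if event is EndEvent.WINNER_BY_PUSHING:
        return 100.0 + t * 50.0
    if event is EndEvent.LOSER_NOT_PUSHED:
        return -100.0 - t * 50.0
    if event is EndEvent.LOSER_PUSHED:
        return -10.0
    raise ValueError(f"unknown end event: {event}")


def step_reward(pushed_opponent: bool, is_pushed: bool) -> float:
    """Reward after a single simulation step.

    Assumption: if both events occur in one step, the rewards are summed.
    """
    reward = 0.0
    if pushed_opponent:
        reward += 0.5
    if is_pushed:
        reward -= 0.1
    return reward


if __name__ == "__main__":
    # A win after 250 of 1000 possible steps: t = 0.75
    print(end_reward(EndEvent.WINNER_BY_PUSHING, 250, 1000))
    # Leaving the field unforced at the very last step: t = 0
    print(end_reward(EndEvent.LOSER_NOT_PUSHED, 1000, 1000))
```

With these numbers a winner receives between 100 (win at the last step) and just under 150 (win almost immediately), while an unforced loser receives the mirrored penalty.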

training.parallel.ParallelConfig.q-cross-1

	L0	L1	L2
learning rate	0.7	0.8	0.9

	E0	E1	E2
epsilon	0.01	0.05	0.1

	D0	D1	D2	D3
discount	0.95	0.99	0.995	0.999
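
The following sketch enumerates the parameter combinations that result if the three tables above are fully crossed (3 * 3 * 4 = 36 combinations). That the full Cartesian product is used is an assumption, and the combination names such as L0E0D0 and the function build_grid are hypothetical, not taken from the ParallelConfig implementation.

```python
from itertools import product

# Values copied from the tables above
LEARNING_RATES = {"L0": 0.7, "L1": 0.8, "L2": 0.9}
EPSILONS = {"E0": 0.01, "E1": 0.05, "E2": 0.1}
DISCOUNTS = {"D0": 0.95, "D1": 0.99, "D2": 0.995, "D3": 0.999}


def build_grid() -> list[dict]:
    """Return one dict per combination of learning rate, epsilon and discount.

    Assumption: every value is combined with every other value.
    """
    grid = []
    for (l_key, lr), (e_key, eps), (d_key, disc) in product(
        LEARNING_RATES.items(), EPSILONS.items(), DISCOUNTS.items()
    ):
        grid.append(
            {
                "name": f"{l_key}{e_key}{d_key}",  # hypothetical naming scheme
                "learning_rate": lr,
                "epsilon": eps,
                "discount": disc,
            }
        )
    return grid


if __name__ == "__main__":
    grid = build_grid()
    print(len(grid))
    for combination in grid[:3]:
        print(combination)
```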