Q-CV7 Cross validation reward at end

number of epochs: 20k

Rewards are no longer limited by a lower and upper limit.

Rewards are only calculated at the end of a simulation. Rewards on intermediate steps are zero.

The ranges for learning rate, epsilon and discount were chosen from the results of Q-CV4

training.simrunner.RewardHandlerName.end-consider-all

Considers all simulation events for calculating the reward.

Possible simulation events created for an agent:

After every simulation step:
- Always zero
At simulation end:
- Winner by pushing the opponent: Reward = 100 + t * 50
- Looser without being pushed: Reward = -100 - t * 50
- Looser being pushed: Reward: -10
- Number of pushing the opponent. Reward = n * 0.5
- Number of being pushed by the opponent. Reward = n * (-0.1)

t = 1 - (s / max_s)

s:     Number of steps when th simulation ended   
max_s: Max number of steps for a simulation

Means, the reward/penalty is higher the shorter the simulation ran. The agent gets a higher reward when fast pushing out the opponent, or a higher penalty when fast moving unforced out of the field.

training.parallel.ParallelConfig.q-cross-1

	L0	L1	L2
learning rate	0.7	0.8	0.9

	E0	E1	E2
epsilon	0.01	0.05	0.1

	D0	D1	D2	D3
discount	0.95	0.99	0.995	0.999