Number of epochs: 20k
Rewards are no longer restricted to a fixed lower and upper bound.
The ranges for learning rate, epsilon, and discount were chosen from the results of Q-CV4.
Considers all sensors.
Calculates a reward after every step.
Positively rewarded actions.
Negatively rewarded actions.
Rewards are calculated after every step and might be different from zero.
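
The per-step, unbounded reward described above could enter a learning update roughly as in the sketch below. It assumes a tabular Q-learning agent with epsilon-greedy action selection, which the parameter names suggest but these notes do not confirm. The names `choose_action` and `q_update`, the dictionary-based Q-table, and the state and action labels are hypothetical and exist only for illustration.

```python
import random
from collections import defaultdict


def choose_action(q, state, actions, epsilon, rng=random):
    """Epsilon-greedy selection (assumed policy, not confirmed by the notes)."""
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore: pick a random action
    # exploit: pick the action with the highest current Q-value
    return max(actions, key=lambda a: q[(state, a)])


def q_update(q, state, action, reward, next_state, actions,
             learning_rate, discount):
    """One tabular Q-learning step; the reward is used as-is, with no bounds applied."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + discount * best_next
    q[(state, action)] += learning_rate * (td_target - q[(state, action)])
    return q[(state, action)]


if __name__ == "__main__":
    q = defaultdict(float)  # hypothetical Q-table, all values start at 0.0
    actions = ["left", "right"]  # hypothetical action labels
    # A large negative reward passes through unchanged because nothing bounds it.
    print(q_update(q, "s0", "left", -250.0, "s1", actions,
                   learning_rate=0.8, discount=0.99))
```

Because the reward is applied after every step and is not bounded, a single strongly negative or positive step can move a Q-value a long way when the learning rate is as high as 0.7 to 0.9.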
| | L0 | L1 | L2 |
| learning rate | 0.7 | 0.8 | 0.9 |

| | E0 | E1 | E2 |
| epsilon | 0.01 | 0.05 | 0.1 |

| | D0 | D1 | D2 | D3 |
| discount | 0.95 | 0.99 | 0.995 | 0.999 |
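
If every combination of the values above is evaluated, the setup amounts to 3 x 3 x 4 = 36 configurations. The sketch below enumerates them. It assumes a full grid is run, which these notes do not state explicitly. The `build_grid` function and the `L0-E0-D0` naming scheme are hypothetical and serve only as illustration.

```python
from itertools import product

# Values copied from the tables above; the dictionary keys are the column labels.
LEARNING_RATES = {"L0": 0.7, "L1": 0.8, "L2": 0.9}
EPSILONS = {"E0": 0.01, "E1": 0.05, "E2": 0.1}
DISCOUNTS = {"D0": 0.95, "D1": 0.99, "D2": 0.995, "D3": 0.999}
NUM_EPOCHS = 20_000  # "20k" epochs per run


def build_grid():
    """Return one config dict per (learning rate, epsilon, discount) combination."""
    grid = []
    for (l_key, lr), (e_key, eps), (d_key, disc) in product(
        LEARNING_RATES.items(), EPSILONS.items(), DISCOUNTS.items()
    ):
        grid.append({
            "name": f"{l_key}-{e_key}-{d_key}",  # hypothetical run label
            "learning_rate": lr,
            "epsilon": eps,
            "discount": disc,
            "epochs": NUM_EPOCHS,
        })
    return grid


if __name__ == "__main__":
    for config in build_grid():
        print(config)
```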