Ants – some notes on the learning process

Meanwhile, I managed to complete a 10^8-step training run of the 128/4 network with 128/64 memory (ant#302).
It shows nice behavior:

But it’s not as stable as I expected (learning with extrinsic rewards was much more stable):

* ant#303 is just a continuation of the ant#302 training, without configuration changes.

I guess it’s because of the constant learning_rate_schedule, so I’m trying ant#304 with the “linear” one (initialized from the latest ant#302) to get a more stable version. I will also try to bring back the extrinsic reward system (on top of the existing model / from scratch).
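
To make the difference between the two schedules concrete, here is a minimal sketch. It is not the ml-agents code; it only mirrors the documented idea that “linear” decays the learning rate towards (almost) zero by max_steps while “constant” keeps it fixed. The function names, the floor value min_lr and the initial rate of 3e-4 are assumptions for illustration.

```python
def constant_lr(step, max_steps, initial_lr):
    """Constant schedule: the step size never changes."""
    return initial_lr


def linear_lr(step, max_steps, initial_lr, min_lr=1e-10):
    """Linear schedule: decay from initial_lr towards zero at max_steps.

    min_lr is an assumed small floor so the rate never becomes exactly 0.
    """
    # Fraction of training already done, clamped to [0, 1].
    frac = min(max(step / max_steps, 0.0), 1.0)
    return max(initial_lr * (1.0 - frac), min_lr)


if __name__ == "__main__":
    max_steps = 10**8   # run length mentioned above
    initial_lr = 3e-4   # assumed value, not taken from the ant#302 config
    for step in (0, 25_000_000, 50_000_000, 75_000_000, 100_000_000):
        print(step,
              constant_lr(step, max_steps, initial_lr),
              linear_lr(step, max_steps, initial_lr))
```

With the linear schedule the updates near the end of the run are much smaller, so a late bad batch should disturb the policy less. Whether that is really the source of the instability still has to be confirmed by ant#304.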

  • Spawn collision check radius (privateSpaceSpaceRadius) increased to prevent scenarios where the agent can touch both resources at once: it just locks in that position until the end of the episode, giving a huge positive result, although this is actually a rare case of the environment configuration.
  • Finally managed to fix the spawn issue. But sometimes even 50 randomly generated spawn points are not enough to get a collision-free one. So the problem will only grow with the number of objects in the future.
    Solution to consider – Poisson disk sampling algorithm
  • Learning doesn’t progress without normalization of the rewards. No idea why yet. All rewards are in the range [-1; +1]; probably the cumulative reward should be in that range too (need to check the ml-agents code).
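
For the spawn problem, a rough sketch of Poisson disk sampling (Bridson-style, 2D) is below. Instead of retrying random points and hoping that one is collision-free, it generates a whole set of points that are guaranteed to be at least `radius` apart. This is only an illustration in Python, not the environment code: the function name, the rectangular area and the way `radius` would map to privateSpaceSpaceRadius are all assumptions.

```python
import math
import random


def poisson_disk_sample(width, height, radius, k=30, seed=None):
    """Return points in [0, width) x [0, height) that are >= radius apart."""
    rng = random.Random(seed)
    cell = radius / math.sqrt(2)  # cell diagonal == radius: max one point per cell
    cols = int(math.ceil(width / cell))
    rows = int(math.ceil(height / cell))
    grid = [[None] * cols for _ in range(rows)]

    def cell_of(p):
        # Clamp so that a point lying exactly on the far edge stays in the grid.
        return min(int(p[0] / cell), cols - 1), min(int(p[1] / cell), rows - 1)

    def fits(p):
        # Only the 5x5 block of cells around p can hold a point closer than radius.
        cx, cy = cell_of(p)
        for y in range(max(cy - 2, 0), min(cy + 3, rows)):
            for x in range(max(cx - 2, 0), min(cx + 3, cols)):
                q = grid[y][x]
                if q is not None and math.dist(p, q) < radius:
                    return False
        return True

    def accept(p):
        points.append(p)
        active.append(p)
        cx, cy = cell_of(p)
        grid[cy][cx] = p

    points, active = [], []
    accept((rng.uniform(0, width), rng.uniform(0, height)))

    while active:
        idx = rng.randrange(len(active))
        base = active[idx]
        for _ in range(k):
            # Candidate in the ring between radius and 2 * radius around base.
            angle = rng.uniform(0, 2 * math.pi)
            dist = rng.uniform(radius, 2 * radius)
            cand = (base[0] + dist * math.cos(angle),
                    base[1] + dist * math.sin(angle))
            if 0 <= cand[0] < width and 0 <= cand[1] < height and fits(cand):
                accept(cand)
                break
        else:
            # k failed attempts: no free space left around this point.
            active.pop(idx)
    return points


if __name__ == "__main__":
    spawn_points = poisson_disk_sample(20.0, 20.0, radius=2.0, seed=1)
    random.Random(1).shuffle(spawn_points)
    print(len(spawn_points), "collision-free candidates; first 3:", spawn_points[:3])
```

Taking the first N points of the shuffled result gives N spawn positions without any retry loop, and the cost stays roughly linear in the number of points, so it should scale better than rejection sampling as the number of objects grows.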