This is a new implementation of the Q-Learning algorithm, developed to address the convergence problem in a multi-agent system. It applies reinforcement learning to the Wumpus world.
The implementation introduces a new concept of learning based on knowledge sharing. Each agent benefits from the experience of the others and adds its own experience to the knowledge base. If an agent A performs an action a in state s and receives a good or bad reward, then when agent B arrives in the same state it can benefit from agent A's experience and immediately make a better decision by choosing a better action.
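As a rough sketch of this knowledge-sharing idea (the class and method names below are illustrative, not taken from the actual source code), and assuming the standard Q-Learning update rule with a learning rate alpha and a discount factor gamma (neither constant is given in this post), the shared table could be updated like this:

// Minimal sketch of the shared-experience idea (illustrative names only).
public class SharedQTable {
    // One Q-value array per state, shared by every agent.
    private final double[][] qValues; // [state][action]

    public SharedQTable(int numStates, int numActions) {
        qValues = new double[numStates][numActions];
    }

    // Agent A performs action a in state s, observes reward r and next state s2,
    // and writes the updated value back into the shared table. This assumes the
    // standard Q-Learning update with learning rate alpha and discount gamma.
    public synchronized void update(int s, int a, double r, int s2,
                                    double alpha, double gamma) {
        double best = Double.NEGATIVE_INFINITY;
        for (double q : qValues[s2]) best = Math.max(best, q);
        qValues[s][a] += alpha * (r + gamma * best - qValues[s][a]);
    }

    // Agent B, arriving later in state s, reads the same array and can
    // immediately exploit what agent A has already learned.
    public double[] valuesFor(int s) {
        return qValues[s];
    }
}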
The classical implementation uses one Hashtable (qtable) per agent, indexed by state and action. The new method uses a one-dimensional array in each state, indexed by actions. On a 5x5 grid there are 25 states, which means 25 arrays (25 Q-Tables), each indexed by the 8 actions (moveUp, moveDown, moveLeft, moveRight, shootUp, shootDown, shootLeft, shootRight).
In state s, each agent computes the Q-value for the action it performs and writes it into the Q-Table. When other agents pass through the same state, they can either choose a random action to explore among those with positive values in the Q-Table, or choose the best action to exploit, i.e. the one with the highest Q-value in the table. Negative values in the table mark impossible actions, such as bumping into a wall, shooting towards a wall or an empty cell with no Wumpus, or stepping into a pit.
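A minimal sketch of this selection rule could look like the following (the helper class and method names are hypothetical; the post does not show the actual code). Exploration skips entries marked impossible with a negative value, and exploitation takes the entry with the highest Q-value:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative action-selection helper over one state's Q-value array.
public class ActionSelector {
    private static final Random RNG = new Random();

    // Exploration: pick a random action among those not marked impossible
    // (negative values mark walls, pits, and useless shots).
    public static int explore(double[] qTable) {
        List<Integer> candidates = new ArrayList<>();
        for (int a = 0; a < qTable.length; a++) {
            if (qTable[a] >= 0) candidates.add(a);
        }
        return candidates.get(RNG.nextInt(candidates.size()));
    }

    // Exploitation: pick the action with the highest Q-value in the array.
    public static int exploit(double[] qTable) {
        int best = 0;
        for (int a = 1; a < qTable.length; a++) {
            if (qTable[a] > qTable[best]) best = a;
        }
        return best;
    }
}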
In the classical method, the convergence of the algorithm is strictly related to the filling rate of the Hashtables. In my method, because each state holds only a small array and all agents contribute to filling it, the more agents there are, the faster the Q-value arrays are filled, and the quicker the algorithm converges.
The current implementation is single-agent, developed in Java on the JADE platform. It is a simple implementation that uses simple structures and proposes a revolutionary method.
The Cell and Grid classes form the data structures. The Cell class holds all the information: the environment objects, the eight possible actions (of the Wumpus game) and the Q-Table are all encapsulated in it. Each Cell object represents a state with all its actions, perceptions and its Q-Table. A Grid object is a matrix of Cell objects.
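As an illustration of this layout (a sketch only; the field names are assumed, not copied from the real Cell and Grid classes):

// Sketch of the Cell/Grid layout described above (assumed field names).
class Cell {
    // Environment objects that may occupy the cell.
    boolean hasPit, hasWumpus, hasGold;
    // Perceptions available in this cell.
    boolean breeze, stench, glitter;
    // Q-Table for this state: one value per action, indexed
    // 0..7 as moveUp, moveDown, moveLeft, moveRight,
    // shootUp, shootDown, shootLeft, shootRight.
    double[] qTable = new double[8];
}

class Grid {
    // The grid is a matrix of cells; each cell is one state.
    Cell[][] cells;

    Grid(int size) {
        cells = new Cell[size][size];
        for (int i = 0; i < size; i++)
            for (int j = 0; j < size; j++)
                cells[i][j] = new Cell();
    }
}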
In the classical method, convergence to an optimal policy is usually reached only after a very large number of iterations (more than 1000). In my method, the hunter agent learns very quickly thanks to the encapsulation of the Q-Table inside a Cell object. For example, on a 5x5 grid, the agent switches from exploration to exploitation in a minimal number of iterations (about 20), and from then on it wins every time by following the optimal policy.
Here are the program classes: