Reinforcement Learning Techniques for Deciding on Optimisation Strategies

A benefit of applying learning methods to scheduling problems is that the learned agent adapts its behaviour to the problem at hand, even when the problem changes slightly, e.g. when student numbers or room capacities change. The agent can then re-solve the new problem by itself, without human intervention. This is a very useful feature for real-world applications.

Another potential advantage of reinforcement learning is in finding a compromise solution when part of the schedule has already been fixed. For example, half of the examinations may already have taken place when the administrator finds that the remainder of the exam schedule needs to be rearranged. In this situation, a well-trained agent can still find the optimal solution for the remaining exams, regardless of how well or badly the earlier exams were arranged.

Small scheduling problems can be solved directly through reinforcement learning. The agent learns Q(s, a) with temporal-difference methods such as Q-learning or SARSA. In this case, the agent builds a tabular Q-value function representing the expected return for each state-action pair, and the optimal solution can be read directly from the learned Q-table.
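As an illustration, the sketch below runs tabular Q-learning on a deliberately tiny, hypothetical version of the problem: three exams must be placed in two timeslots, and a clash between exams that share students is penalised. The exam names, conflict pair, and learning parameters are assumptions made for the example, not part of the problem described here.

```python
import random
from collections import defaultdict

# Toy problem (hypothetical data): 3 exams, 2 timeslots, and pairs of
# exams that share at least one student and therefore must not clash.
EXAMS = ["maths", "physics", "history"]
SLOTS = [0, 1]
CONFLICTS = {("maths", "physics")}  # assumed conflict pair

def step(state, action):
    """Assign the next exam to a slot; return (next_state, reward, done)."""
    exam = EXAMS[len(state)]
    reward = 0.0
    for other, slot in zip(EXAMS, state):
        if slot == action and ((exam, other) in CONFLICTS or (other, exam) in CONFLICTS):
            reward -= 1.0  # penalise a clash between conflicting exams
    next_state = state + (action,)
    done = len(next_state) == len(EXAMS)
    return next_state, reward, done

# Tabular Q-learning: Q maps (state, action) -> expected return.
Q = defaultdict(float)
alpha, gamma, epsilon = 0.1, 1.0, 0.1

for episode in range(5000):
    state, done = (), False
    while not done:
        if random.random() < epsilon:
            action = random.choice(SLOTS)                     # explore
        else:
            action = max(SLOTS, key=lambda a: Q[(state, a)])  # exploit
        next_state, reward, done = step(state, action)
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in SLOTS)
        # Q-learning temporal-difference update
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# Read a greedy schedule off the learned table.
state = ()
while len(state) < len(EXAMS):
    state += (max(SLOTS, key=lambda a: Q[(state, a)]),)
print(dict(zip(EXAMS, state)))
```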

The Monte Carlo method is another possible choice. In the exam scheduling problem the immediate reward R and the transition function T are known, so the agent can compute the next reward and state without performing a real action. The Monte Carlo method estimates the state values V(s), which avoids storing values over the action space and can therefore reduce the memory required.
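A minimal sketch of this idea is given below, reusing the same toy exam data assumed in the Q-learning example: complete episodes are simulated with the known model under a fixed (here random) scheduling policy, and Monte Carlo averaging of the returns estimates V(s) without any action values being stored.

```python
import random
from collections import defaultdict

# Monte Carlo prediction of V(s) for a fixed (random) scheduling policy,
# using the known reward and transition model to simulate whole episodes.
# The toy data mirrors the Q-learning sketch above (hypothetical).
EXAMS = ["maths", "physics", "history"]
SLOTS = [0, 1]
CONFLICTS = {("maths", "physics")}

def step(state, action):
    exam = EXAMS[len(state)]
    reward = -sum(1.0 for other, slot in zip(EXAMS, state)
                  if slot == action and
                  ((exam, other) in CONFLICTS or (other, exam) in CONFLICTS))
    next_state = state + (action,)
    return next_state, reward, len(next_state) == len(EXAMS)

V, returns = defaultdict(float), defaultdict(list)

for episode in range(5000):
    # Simulate one complete episode under the random policy.
    state, trajectory, done = (), [], False
    while not done:
        action = random.choice(SLOTS)
        next_state, reward, done = step(state, action)
        trajectory.append((state, reward))
        state = next_state
    # Work backwards accumulating the undiscounted return G.
    # (States never repeat within an episode here, so every-visit and
    # first-visit Monte Carlo coincide.)
    G = 0.0
    for s, r in reversed(trajectory):
        G += r
        returns[s].append(G)
        V[s] = sum(returns[s]) / len(returns[s])

print(V[()])  # estimated value of the empty (initial) schedule
```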

If the number of students or exams is large, the state and action spaces of the exam scheduling problem explode quickly, and it is not practical to store all of this information. Some parts of the state or action space are more important than others; for example, states close to a constraint boundary matter much more than those far away from the constraints. A good generalisation over the state space can therefore significantly reduce both the state space and the action space.

Function approximators (FAs) are a widely used way to store the values of observed states and to generalise to unseen states. A neural network can serve as the approximator, with its weights acting as the FA parameters that are updated during learning.
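To keep the sketch short, the example below uses a linear function approximator with a hand-crafted feature vector rather than a full neural network; the feature map (a bias term plus a count of constraint clashes) and the numbers in the usage example are assumptions chosen purely for illustration. The point is that many concrete states map to the same features, so the approximator generalises across them instead of storing a separate table entry for each.

```python
# Minimal sketch: a linear function approximator for Q(s, a), updated with a
# semi-gradient TD-style rule. The feature map is a hypothetical hand-crafted
# one that only records how close the action is to violating a constraint.
def features(state, action, conflicts, exams):
    """Return a small feature vector instead of enumerating every state."""
    exam = exams[len(state)]
    clashes = sum(1 for other, slot in zip(exams, state)
                  if slot == action and
                  ((exam, other) in conflicts or (other, exam) in conflicts))
    return [1.0, float(clashes)]          # bias term + clash count

def q_value(weights, phi):
    return sum(w * f for w, f in zip(weights, phi))

def td_update(weights, phi, target, alpha=0.05):
    """Semi-gradient update: move Q(s, a) towards the given target."""
    error = target - q_value(weights, phi)
    return [w + alpha * error * f for w, f in zip(weights, phi)]

# Example update (hypothetical numbers): maths already sits in slot 0, and
# placing physics in slot 0 as well yields a terminal reward of -1.
EXAMS, CONFLICTS = ["maths", "physics"], {("maths", "physics")}
w = [0.0, 0.0]
phi = features((0,), 0, CONFLICTS, EXAMS)   # physics into the same slot as maths
for _ in range(100):
    w = td_update(w, phi, target=-1.0)
print(q_value(w, phi))   # approaches -1: the FA has learned the clash penalty
```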

Other generalisation may take place at the semantic level of the problem. For example, constraints relating to invigilators, markers, and estates can all be generalised as "admin" constraints. Semantic-level generalisation can usually simplify the problem domain considerably; however, it requires human domain knowledge and assistance, so it is very problem-specific.

Another option is to use multi-agent reinforcement learning. Several RL agents learn policies under different constraints, with each agent serving a single constraint. If the state space is too large, it is better for each agent to learn only the region near its own constraint, or to combine this with the generalisation methods above. Each agent then learns a Q-value table from which policies can be generated.
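The sketch below illustrates this structure under assumed, simplified constraints: each constraint gets its own reward function and its own tabular Q-learner, and the agents are trained independently. The constraint names, reward shapes, and horizon are placeholders rather than anything prescribed above.

```python
import random
from collections import defaultdict

# One tabular learner per constraint, each with its own reward signal and its
# own Q-table. The two constraints below are hypothetical stand-ins.
def no_clash_reward(state, action):
    return 0.0 if action not in state else -1.0   # slot reuse as a stand-in for a clash

def room_capacity_reward(state, action):
    return 0.0 if action < 3 else -0.5            # assume only slots 0-2 have large rooms

CONSTRAINT_REWARDS = {
    "no_clash": no_clash_reward,
    "room_capacity": room_capacity_reward,
}

SLOTS = range(4)
q_tables = {name: defaultdict(float) for name in CONSTRAINT_REWARDS}
alpha, gamma, epsilon = 0.1, 1.0, 0.1

def learn(name, reward_fn, episodes=2000, horizon=3):
    """Train one constraint agent with tabular Q-learning."""
    Q = q_tables[name]
    for _ in range(episodes):
        state = ()
        for t in range(horizon):
            if random.random() < epsilon:
                action = random.choice(list(SLOTS))
            else:
                action = max(SLOTS, key=lambda a: Q[(state, a)])
            r = reward_fn(state, action)
            next_state = state + (action,)
            done = t == horizon - 1
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in SLOTS)
            Q[(state, action)] += alpha * (r + gamma * best_next - Q[(state, action)])
            state = next_state

for name, reward_fn in CONSTRAINT_REWARDS.items():
    learn(name, reward_fn)   # each agent learns only its own constraint
```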

The next step is to find solutions by combining the learned policies. A common method is to search the policy spaces for solutions that satisfy all constraints. A simplified method is to rank the constraints by importance (i.e. the weight wC): agents with higher importance then have higher priority when selecting their policies.
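One way to realise this priority scheme is sketched below: constraints are sorted by an assumed importance weight (standing in for wC), and each agent in turn filters the candidate actions down to those it rates near-optimal, so higher-priority constraints choose first. The Q-values and weights in the usage example are hypothetical; in practice the tables would come from the per-constraint agents above.

```python
from collections import defaultdict

def combined_action(state, actions, q_tables, weights, tolerance=1e-6):
    """Pick an action by letting constraints filter candidates in weight order."""
    candidates = list(actions)
    for name in sorted(weights, key=weights.get, reverse=True):
        Q = q_tables[name]
        best = max(Q[(state, a)] for a in candidates)
        # Keep only the actions this agent considers (near-)optimal.
        candidates = [a for a in candidates if Q[(state, a)] >= best - tolerance]
        if len(candidates) == 1:
            break
    return candidates[0]

# Tiny hand-filled example (hypothetical Q-values for the empty schedule):
# the clash agent rules out slot 0, then the room agent prefers slot 2.
slots = [0, 1, 2]
q_tables = {
    "no_clash":      defaultdict(float, {((), 0): -1.0, ((), 1): 0.0, ((), 2): 0.0}),
    "room_capacity": defaultdict(float, {((), 1): -0.5, ((), 2): 0.0}),
}
weights = {"no_clash": 10.0, "room_capacity": 1.0}
print(combined_action((), slots, q_tables, weights))  # -> 2
```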

Although this approach may not be theoretically sound, it is flexible because each constraint is learned separately.
