An Efficient Exploration Method for MDP Environments Using the k-Certainty Exploration Method and Dynamic Programming

Transactions of the Japanese Society for Artificial Intelligence 16:11-19 (2001)

Abstract

A common problem in reinforcement learning systems (e.g. Q-learning) is reducing the number of trials needed to converge to an optimal policy. The k-certainty exploration method was proposed as one solution to this problem. Miyazaki reported that this method can determine an optimal policy faster than Q-learning in Markov decision processes (MDPs). Although the method is already efficient, we propose an improvement that makes it more efficient. In the k-certainty exploration method, when there is no k-uncertainty rule (a rule that has not yet been selected k times) in the current state, the agent sometimes walks randomly until it finds a state where it can select a k-uncertainty rule. We regard this behavior as inefficient. To reduce this useless behavior, we propose combining the k-certainty exploration method with dynamic programming (DP). Miyazaki's system applies DP only after all rules have been executed at least k times, whereas our method uses the k-certainty exploration method together with DP during learning. Our method takes one of two kinds of action. When the agent can select k-uncertainty rules, one of them is selected at random, as in the k-certainty exploration method. When no k-uncertainty rule is available, the agent behaves differently from the k-certainty exploration method: it uses DP to compute an optimal policy for moving from the current state to a state in which some k-uncertainty rules remain. The model for DP is constructed from known states only. The outline is as follows. First, the agent builds a map consisting only of known states. In this map, the goals are states that still contain k-uncertainty rules, and arbitrary state values are assigned to them. Note that the map is not given from outside; it is built solely from experience. Next, the state values of the states in the map that contain only k-certainty rules are computed by DP (we used Policy Iteration). Finally, the agent keeps selecting greedy actions until it reaches a state in which it can select a k-uncertainty rule. With this improvement, we expect the agent to determine an optimal policy faster than with the k-certainty exploration method, and we have verified by computer simulation that our exploration method does determine an optimal policy faster than the k-certainty exploration method.
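To make the two action-selection cases concrete, the sketch below shows one possible reading of the procedure described in the abstract. It is not the paper's implementation: the names (KCertaintyDPExplorer, K, GAMMA), the assumption of deterministic transitions, and the use of a simple value-iteration-style sweep in place of the paper's Policy Iteration are all illustrative choices.

```python
import random
from collections import defaultdict

K = 3          # certainty threshold (assumed value, not from the paper)
GAMMA = 0.95   # discount factor for the planning step (assumed)

class KCertaintyDPExplorer:
    """Illustrative sketch of k-certainty exploration combined with DP.

    The experience map `model[s][a]` stores the successor observed for (s, a);
    as in the abstract, the map is built only from experience, never given
    from outside. Deterministic transitions are assumed for simplicity.
    """

    def __init__(self, actions):
        self.actions = actions
        self.counts = defaultdict(lambda: defaultdict(int))  # s -> a -> times tried
        self.model = defaultdict(dict)                        # s -> a -> observed s'

    def uncertain_actions(self, s):
        # "k-uncertainty rules": actions tried fewer than K times in state s.
        return [a for a in self.actions if self.counts[s][a] < K]

    def select_action(self, s):
        uncertain = self.uncertain_actions(s)
        if uncertain:
            # Case 1: behave like the plain k-certainty exploration method.
            return random.choice(uncertain)
        # Case 2: plan over known states toward any state that still has
        # k-uncertainty rules, then act greedily on the resulting values.
        values = self._plan()
        return max(self.model[s], key=lambda a: values.get(self.model[s][a], 0.0))

    def _plan(self):
        # States with remaining k-uncertainty rules are goals with a fixed
        # (arbitrary) value; other known states are backed up by a simple
        # value-iteration-style sweep (the paper uses Policy Iteration).
        values = {s: (1.0 if self.uncertain_actions(s) else 0.0) for s in self.model}
        for _ in range(100):
            for s in self.model:
                if self.uncertain_actions(s):
                    continue  # goal states keep their fixed value
                values[s] = GAMMA * max(
                    values.get(s2, 0.0) for s2 in self.model[s].values()
                )
        return values

    def update(self, s, a, s_next):
        # Record the transition so the experience map grows during learning.
        self.counts[s][a] += 1
        self.model[s][a] = s_next
```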
