This paper presents a GA-based multi-agent reinforcement learning bidding approach (GMARLB). GMARLB integrates reinforcement learning, bidding, and genetic algorithms. The general idea of our multi-agent system is as follows: there are a number of individual agents in a team, and each agent has two modules, a Q module and a CQ module. At each step, an agent selects actions to perform via its Q module, while its CQ module determines whether the agent should continue or relinquish control. Once an agent relinquishes control, a new agent is selected by a bidding algorithm. We applied GMARLB to the game of Backgammon. The experimental results show that GMARLB can achieve a superior level of game-playing performance, outperforming PubEval, while using zero built-in knowledge.
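The control-passing scheme summarized above (a Q module for action selection, a CQ module for deciding whether to keep control, and bidding to pick the next agent) can be sketched minimally as follows. All class, method, and parameter names here are illustrative, not taken from the paper:

```python
import random

class Agent:
    """One team member: a Q module (action values) and a CQ module (control value)."""
    def __init__(self, n_actions, seed=0):
        self.rng = random.Random(seed)
        self.q = [0.0] * n_actions   # Q module: estimated value of each action
        self.cq = 0.0                # CQ module: estimated value of keeping control

    def act(self, eps=0.1):
        """Epsilon-greedy action selection via the Q module."""
        if self.rng.random() < eps:
            return self.rng.randrange(len(self.q))
        return max(range(len(self.q)), key=self.q.__getitem__)

    def keeps_control(self):
        """CQ module decides whether to continue or relinquish control."""
        return self.cq >= 0.0

    def bid(self):
        """An agent's bid: here, its best available action value."""
        return max(self.q)

def select_by_bidding(agents):
    """When control is relinquished, the highest bidder takes over."""
    return max(agents, key=Agent.bid)
```

In this sketch an episode alternates between the controlling agent's `act` calls and `keeps_control` checks, with `select_by_bidding` invoked whenever control is given up.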
I argue for the role of reinforcement learning in the philosophy of mind. To start, I make several assumptions about the nature of reinforcement learning and its instantiation in minds like ours. I then review some of the contributions reinforcement learning methods have made across the so-called 'decision sciences.' Finally, I show how principles from reinforcement learning can shape philosophical debates regarding the nature of perception and characterisations of desire.
Relational database management systems (RDBMS) are the most popular type of database system, and it is important to protect their data from information leakage and corruption. An RDBMS can be attacked by an outsider or an insider; insider attacks are difficult to detect because their patterns are constantly changing and evolving. In this paper, we propose an adaptive database intrusion detection system that is resistant to potential insider misuse, using evolutionary reinforcement learning, which combines reinforcement learning and evolutionary learning. The model consists of two neural networks: an action network, which detects intrusions, and an evaluation network, which provides feedback on the action network's detections. Evolutionary learning is effective for dynamic and atypical patterns, and reinforcement learning enables online learning. Experimental results on virtual query data based on Transaction Processing Performance Council-E scenarios show that detection performance for abnormal queries improves as the proposed model learns intrusions adaptively. The proposed method achieves a peak performance of 94.86%, and we demonstrate its usefulness through 5-fold cross-validation.
Animals routinely adapt to changes in the environment in order to survive. Though reinforcement learning may play a role in such adaptation, it is not clear that it is the only mechanism involved, as it is not well suited to producing rapid, relatively immediate changes in strategy in response to environmental changes. This research proposes counterfactual reasoning as an additional mechanism that facilitates change detection. An experiment was conducted in which a task's state changed over time and participants had to detect the changes in order to perform well and earn monetary rewards. A cognitive model was constructed that combines reinforcement learning with counterfactual reasoning to quickly adjust the utility of task strategies in response to changes. The results show that the model accurately explains the human data and that counterfactual reasoning is key to reproducing the various effects observed in this change-detection paradigm.
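The core mechanism described above, a reinforcement update for the chosen strategy plus counterfactual updates for the strategies not taken, can be sketched as follows. This is a toy illustration: the function name, the single shared learning rate, and the dictionary representation are assumptions, not the paper's model:

```python
def update_utilities(utils, chosen, reward, counterfactuals, alpha=0.3):
    """Update strategy utilities after one trial.

    utils           : dict mapping strategy name -> current utility
    chosen          : strategy actually used; updated from the real reward
    counterfactuals : dict of foregone rewards ("what would I have earned?")
                      for strategies not taken, updated counterfactually
    """
    new = dict(utils)
    # Standard reinforcement update for the chosen strategy
    new[chosen] += alpha * (reward - new[chosen])
    # Counterfactual updates let unchosen strategies gain (or lose) utility
    # immediately, so a change in the environment is detected quickly
    for s, cf_reward in counterfactuals.items():
        if s != chosen:
            new[s] += alpha * (cf_reward - new[s])
    return new
```

After an environmental change, the counterfactual term raises the utility of the now-better strategy even while the agent is still choosing the old one, which is what speeds up switching.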
This paper addresses weighting and partitioning in complex reinforcement learning tasks, with the aim of facilitating learning. It presents some ideas regarding the weighting of multiple agents and extends them to partitioning an input/state space into multiple regions with differential weighting in these regions, exploiting the differing characteristics of regions and of agents to reduce the learning complexity of the agents (and their function approximators) and thus facilitate learning overall. It analyzes different ways of partitioning a reinforcement learning task and using agents selectively based on the partitioning. Based on this analysis, several heuristic methods are described and experimentally tested. We find that some off-line heuristic methods performed best, significantly better than single-agent models.
We investigate a simple stochastic model of social network formation by the process of reinforcement learning with discounting of the past. In the limit, for any value of the discounting parameter, small, stable cliques are formed. However, the time it takes to reach the limiting state in which cliques have formed is very sensitive to the discounting parameter. Depending on this value, the limiting result may or may not be a good predictor for realistic observation times.
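Reinforcement learning with discounting of the past, as in the abstract above, can be sketched as a Roth-Erev-style propensity update: each agent picks partners with probability proportional to accumulated edge weights, decays all weights, and reinforces the realized interaction. This is a minimal sketch of the general mechanism, not the paper's exact specification; all names and the uniform initial weights are illustrative:

```python
import random

def discounted_reinforcement(n_agents, steps, discount, seed=0):
    """Simulate partner choice with reinforcement and discounting.

    Returns the matrix w, where w[i][j] is agent i's propensity to
    interact with agent j after the given number of rounds.
    """
    rng = random.Random(seed)
    w = [[1.0] * n_agents for _ in range(n_agents)]  # uniform initial propensities
    for _ in range(steps):
        for i in range(n_agents):
            others = [j for j in range(n_agents) if j != i]
            weights = [w[i][j] for j in others]
            j = rng.choices(others, weights=weights)[0]  # pick a partner
            for k in range(n_agents):
                w[i][k] *= discount                      # discount the past
            w[i][j] += 1.0                               # reinforce this visit
    return w
```

With strong discounting, recent interactions dominate the propensities, so each agent's choices quickly lock onto a few partners, which is the clique-formation behaviour the model studies.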
A fundamental question in reading research concerns whether attention is allocated strictly serially, supporting lexical processing of one word at a time, or in parallel, supporting concurrent lexical processing of two or more words (Reichle, Liversedge, Pollatsek, & Rayner, 2009). The origins of this debate are reviewed. We then report three simulations to address this question using artificial reading agents (Liu & Reichle, 2010; Reichle & Laurent, 2006) that learn to dynamically allocate attention to 1–4 words to "read" as efficiently as possible. These simulation results indicate that the agents strongly preferred serial word processing, although they occasionally attended to more than one word concurrently. The reason for this preference is discussed, along with implications for the debate about how humans allocate attention during reading.
Smart power grids must constantly improve their efficiency and resilience to provide high-quality electrical power while managing faults and avoiding failures. Achieving this requires high component reliability, adequate maintenance, and a studied failure occurrence. Correct system operation involves those activities and novel methodologies to detect, classify, and isolate faults and failures, and to model and simulate processes with predictive algorithms and analytics. In this paper, we showcase the application of a complex-adaptive, self-organizing modeling method, Probabilistic Boolean Networks (PBNs), as a way toward understanding the dynamics of smart grid devices and modeling and characterizing their behavior. This work demonstrates that PBNs are equivalent to the standard reinforcement learning cycle, in which the agent/model interacts with its environment and receives feedback from it in the form of a reward signal. Different reward structures were created to characterize preferred behavior; this information can be used to guide the PBN away from fault conditions and failures.
A deep reinforcement learning-based computational guidance method is presented for identifying and resolving collision-avoidance conflicts among a variable number of fixed-wing UAVs in limited airspace. The cooperative guidance process for multiple aircraft is first analyzed by formulating flight scenarios as a multiagent Markov game and solving it with a machine learning algorithm. Furthermore, a self-learning framework based on the actor-critic model is established to train collision-avoidance decision-making neural networks. To achieve higher scalability, the neural network incorporates long short-term memory networks, and a coordination strategy is given. Additionally, a simulator suitable for multiagent high-density route scenes is designed for validation, in which every UAV runs the proposed algorithm onboard. Simulation results from several case studies show that the real-time guidance algorithm effectively reduces the collision probability of multiple UAVs in flight, even with a large number of aircraft.
The paper presents an approach for developing multi-agent reinforcement learning systems made up of a coalition of modular agents. We focus on learning to segment sequences (sequential decision tasks) to create modular structures, through a bidding process based on reinforcements received during task execution. The approach segments sequences (and divides them up among agents) to facilitate learning of the overall task. Notably, it does not rely on a priori knowledge or a priori structures. Initial experiments demonstrate the basic promise of the approach. This work shows how bidding and reinforcement learning can be usefully combined, pointing to a new research direction.
Autonomous underwater vehicles (AUVs) are widely used to accomplish various missions in the complex marine environment; the design of a control system for AUVs is particularly difficult due to high nonlinearity, variations in hydrodynamic coefficients, and external forces from ocean currents. In this paper, we propose a controller based on deep reinforcement learning (RL) in a simulation environment for studying the control performance of a vectored-thruster AUV. RL is an important artificial intelligence method that learns behavior through trial-and-error interactions with the environment, so it does not require an accurate AUV control model, which is very hard to establish. The proposed RL algorithm uses only information measurable by sensors inside the AUV as input, and the outputs of the designed controller are continuous control actions, i.e., the commands sent to the vectored thruster. Moreover, a reward function is developed for the deep RL controller that considers the different factors affecting the accuracy of AUV navigation control. To confirm the algorithm's effectiveness, a series of simulations is carried out in the designed simulation environment, which saves time and improves efficiency. Simulation results prove the feasibility of applying the deep RL algorithm to AUV control systems. Furthermore, our work provides an optional method for robot control problems facing rising technology requirements and complicated application environments.
After generalizing the Archimedean property of real numbers in such a way as to make it adaptable to non-numeric structures, we demonstrate that the real numbers cannot be used to accurately measure non-Archimedean structures. We argue that, since an agent with Artificial General Intelligence (AGI) should have no problem engaging in tasks that inherently involve non-Archimedean rewards, and since traditional reinforcement learning rewards are real numbers, traditional reinforcement learning probably will not lead to AGI. We indicate two possible ways traditional reinforcement learning could be altered to remove this roadblock.
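The difficulty with real-valued rewards can be illustrated with lexicographic (hence non-Archimedean) reward pairs, where the first component dominates the second no matter how large the second gets. This is a toy example in the spirit of the paper's argument, not its formal construction:

```python
def lex_better(r1, r2):
    """Lexicographic comparison of reward pairs: (a, b) beats (c, d)
    whenever a > c, regardless of how large d is."""
    return r1 > r2  # Python compares tuples lexicographically

def scalarize(r, M):
    """A candidate real-valued encoding u(a, b) = M*a + b for a fixed
    finite M (an illustrative scalarization, doomed to fail)."""
    return M * r[0] + r[1]
```

For any fixed finite M, a large enough second component defeats the encoding: the scalarized ordering disagrees with the lexicographic one, which is the Archimedean obstacle in miniature.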
Methods to improve reinforcement learning are identified and discussed in some detail. Each demonstrates, to some extent, the advantages of combining RL and symbolic methods. These methods point to the potential and the challenges of this line of research.
Sequential action makes up the bulk of human daily activity, yet much remains unknown about how people learn such actions. In one motor learning paradigm, the serial reaction time (SRT) task, people are taught a consistent sequence of button presses by cueing them with the next target response. However, the SRT task only records keypress response times to a cued target, and thus cannot reveal the full time-course of motion, including predictive movements. This paper describes a mouse movement trajectory SRT task in which the cursor must be moved to a cued location. We replicated keypress SRT results, but also found that predictive movement, before the next cue appears, increased during the experiment. Moreover, trajectory analyses revealed that people developed a centering strategy under uncertainty. In a second experiment, we made prediction explicit, no longer cueing targets. Thus, participants had to explore the response alternatives and learn via reinforcement, receiving rewards and penalties for correct and incorrect actions, respectively. Participants were not told whether the sequence of stimuli was deterministic, whether it would repeat, or how long it was. Given the difficulty of the task, it is unsurprising that some learners performed poorly. However, many learners performed remarkably well, and some acquired the full 10-item sequence within 10 repetitions. Comparing the detailed results of the high- and low-performers in this reinforcement learning task with the first experiment's cued trajectory SRT task, we found similarities between the two tasks, suggesting that the effects in Experiment 1 are due to predictive, rather than reactive, processes. Finally, we found that two standard model-free reinforcement learning models fit the high-performing participants well, while the four low-performing participants were better fit by a simple negative recency bias model.
Rapid and precise air-operation mission planning is a key technology for autonomous combat by unmanned aerial vehicles (UAVs). In this paper, an end-to-end UAV intelligent mission planning method based on deep reinforcement learning (DRL) is proposed to address the shortcomings of traditional intelligent optimization algorithms, such as reliance on simple, static, low-dimensional scenarios and poor scalability. Specifically, the suppression of enemy air defenses (SEAD) mission planning problem is described as a sequential decision-making problem and formalized as a Markov decision process. Then, a SEAD intelligent planning model based on the proximal policy optimization (PPO) algorithm is established, and a general intelligent planning architecture is proposed. Furthermore, three policy-training tricks, i.e., domain randomization, maximizing policy entropy, and underlying network parameter sharing, are introduced to improve the learning performance and generalizability of PPO. Experimental results show that the model is efficient and stable, and can adapt to unknown, continuous, high-dimensional environments. We conclude that the DRL-based UAV intelligent mission planning model has powerful planning performance and provides a new idea for research on UAV autonomy.
We welcome Soltis' use of evolutionary signaling theory, but question his interpretations of colic as a signal of vigor and his explanation of abnormal high-pitched crying as a signal of poor infant quality. Instead, we suggest that these phenomena may be suboptimal by-products of a generally adaptive learning process by which infants adjust their crying levels in relation to parental responsiveness.
…extraction from reinforcement learners. It addresses two approaches towards knowledge extraction: the extraction of explicit symbolic rules from neural reinforcement learners, and the extraction of complete plans from such learners. The advantages of such knowledge extraction include the improvement of learning (especially with the rule-extraction approach) and the improvement of the usability of the results of learning.
Recent years have yielded many discussions of how to endow autonomous agents with the ability to make ethical decisions, and the need for explicit ethical reasoning and transparency is a persistent theme in this literature. We present a modular and transparent approach for equipping autonomous agents with the ability to comply with ethical prescriptions while still enacting pre-learned optimal behaviour. Our approach relies on a normative supervisor module that integrates a theorem prover for defeasible deontic logic within the control loop of a reinforcement learning agent. The supervisor operates as both an event recorder and an on-the-fly compliance checker with respect to an external norm base. We successfully evaluated our approach with several tests using variations of the game Pac-Man, subject to a variety of "ethical" constraints.
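The supervisor's compliance-checking role can be caricatured as an action filter wrapped around the learned policy. The real system uses a theorem prover for defeasible deontic logic; this sketch substitutes plain predicate norms, and all names (including the fallback behaviour when every action is forbidden) are illustrative assumptions:

```python
def compliant_actions(actions, norms, state):
    """Keep the actions that no norm forbids in the current state.
    Each norm is a predicate norm(action, state) -> True if forbidden."""
    return [a for a in actions if not any(forbids(a, state) for forbids in norms)]

def supervised_step(policy, actions, norms, state):
    """Pick the best pre-learned action among the norm-compliant ones.
    policy maps (state, action) pairs to learned values."""
    allowed = compliant_actions(actions, norms, state)
    pool = allowed or actions  # fall back to all actions if none comply
    return max(pool, key=lambda a: policy.get((state, a), 0.0))
```

The point of the modular design is visible even in the toy version: the norm base can change without retraining the policy, and the policy can be retrained without touching the norms.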
Cheap talk has often been thought incapable of supporting the emergence of cooperation, because costless signals, easily faked, are unlikely to be reliable. I show how, in a social network model of cheap talk with reinforcement learning, cheap talk does enable the emergence of cooperation, provided that individuals also temporally discount the past. This establishes one mechanism that suffices to move a population of initially uncooperative individuals to a state of mutually beneficial cooperation, even in the absence of formal institutions.
Reinforcement learning approaches to cognitive modeling represent task acquisition as learning to choose the sequence of steps that accomplishes the task while maximizing a reward. However, an apparently unrecognized problem for modelers is choosing when, what, and how much to reward: when (the moment: end of trial, subtask, or some other interval of task performance), what (the objective function: e.g., performance time or performance accuracy), and how much (the magnitude: binary, categorical, or continuous values). In this article, we explore the problem space of these three parameters in the context of a task whose completion entails some combination of 36 state-action pairs, where all intermediate states (i.e., after the initial state and prior to the end state) represent progressive but partial completion of the task. Different choices produce profoundly different learning paths and outcomes, with the strongest effect for moment. Unfortunately, there is little discussion in the literature of the effect of such choices. This absence is disappointing, as the choice of when, what, and how much must be made by a modeler for every learning model.
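The three design choices above, when (moment), what (objective), and how much (magnitude), can be made explicit as parameters of a reward function. This schematic sketch uses illustrative names and event fields; it is not the article's implementation:

```python
def reward(event, moment="end_of_trial", objective="accuracy", magnitude="binary"):
    """Compute the reward for one event in a learning model.

    event: dict with 'phase' (e.g. 'subtask' or 'trial_end'),
    'correct' (bool), and 'rt' (response time in seconds).
    """
    # When: does this event trigger a reward at all?
    if moment == "end_of_trial" and event["phase"] != "trial_end":
        return 0.0
    # What: which objective function is being rewarded?
    raw = float(event["correct"]) if objective == "accuracy" else -event["rt"]
    # How much: binary vs. continuous magnitude
    if magnitude == "binary":
        return 1.0 if raw > 0 else 0.0
    return raw
```

Making the three parameters explicit like this is exactly what forces a modeler to state, rather than silently assume, the reward design for each model.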