DRON
Deep Reinforcement Opponent Network
Approach
- Jointly learn a policy and the opponent's behavior within a single DQN
- Use a Mixture-of-Experts architecture to discover different strategy patterns of opponents
Stated Questions
- how to combine the two networks
- what supervision signal to use.
- Predict Q-values only, since the goal is the best reward rather than accurately simulating the opponent
- Additionally predict extra information about the opponent when it is available, e.g., the type of its strategy (see the multitask loss sketch below)
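A minimal sketch of the multitask supervision option, assuming a PyTorch setup: the standard DQN TD loss is combined with an auxiliary cross-entropy loss for predicting the opponent's strategy type. The tensor shapes, the function name, and the weight `lam` are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def multitask_loss(q_pred, q_target, action, opp_type_logits, opp_type_label, lam=0.5):
    """DQN TD loss plus an auxiliary opponent-prediction loss (hypothetical shapes/weight).

    q_pred:          (batch, n_actions) predicted Q-values
    q_target:        (batch,) bootstrapped targets r + gamma * max_a' Q(s', a')
    action:          (batch,) actions taken
    opp_type_logits: (batch, n_types) predicted opponent strategy type
    opp_type_label:  (batch,) observed opponent type, used only when available
    """
    # Standard DQN regression on the chosen action's Q-value
    q_sa = q_pred.gather(1, action.unsqueeze(1)).squeeze(1)
    td_loss = F.smooth_l1_loss(q_sa, q_target)
    # Extra supervision: classify the opponent's strategy type
    aux_loss = F.cross_entropy(opp_type_logits, opp_type_label)
    return td_loss + lam * aux_loss
```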
Contents
- Two critical questions in opponent modeling are what variable(s) to model and how to use the predicted information
- To account for changing behavior, uncertainty in the opponent's strategy is modeled instead of classifying it into a set of stereotypes
- Domain knowledge is often required when prediction of the opponent is separated from learning the dynamics of the world; therefore, the policy and a probabilistic model of the opponent are learned jointly
- DRON consists of a Q-network (N_Q) that evaluates actions for a state and an opponent network (N_O) that learns a representation of the opponent's policy π_o (a minimal sketch follows)
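A minimal PyTorch sketch of the DRON-concat variant, assuming simple MLP encoders: the state representation from N_Q and the opponent representation from N_O are concatenated before the Q-value head. Layer sizes and module names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DRONConcat(nn.Module):
    """Encode the state (N_Q side) and opponent observations (N_O side) separately,
    then concatenate the hidden representations to predict Q-values."""
    def __init__(self, state_dim, opp_dim, n_actions, hidden=64):
        super().__init__()
        self.state_enc = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())  # N_Q encoder
        self.opp_enc = nn.Sequential(nn.Linear(opp_dim, hidden), nn.ReLU())      # N_O encoder
        self.q_head = nn.Linear(2 * hidden, n_actions)

    def forward(self, state, opp_features):
        h_s = self.state_enc(state)       # hidden state representation
        h_o = self.opp_enc(opp_features)  # hidden representation of the opponent's policy
        return self.q_head(torch.cat([h_s, h_o], dim=-1))
```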
(Figures: DRON model architectures | multitask variants)
(a) DRON-concat: it ignores the interaction between the world and the opponent
(b) DRON-MOE: Q-values have different distributions depending on φ, and each expert network captures one type of opponent strategy (see the sketch below)
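A minimal PyTorch sketch of the DRON-MOE idea, assuming K expert Q-heads over the state representation and a gate computed from the opponent representation φ; the gate's softmax weights mix the experts' Q-values. Names and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DRONMoE(nn.Module):
    """K expert Q-heads (one per opponent strategy type), mixed by a gate over phi(o)."""
    def __init__(self, state_dim, opp_dim, n_actions, n_experts=4, hidden=64):
        super().__init__()
        self.state_enc = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.opp_enc = nn.Sequential(nn.Linear(opp_dim, hidden), nn.ReLU())
        self.experts = nn.ModuleList([nn.Linear(hidden, n_actions) for _ in range(n_experts)])
        self.gate = nn.Linear(hidden, n_experts)

    def forward(self, state, opp_features):
        h_s = self.state_enc(state)
        h_o = self.opp_enc(opp_features)
        expert_q = torch.stack([e(h_s) for e in self.experts], dim=1)  # (batch, K, n_actions)
        w = F.softmax(self.gate(h_o), dim=-1)                          # (batch, K) mixture weights
        return (w.unsqueeze(-1) * expert_q).sum(dim=1)                 # gated mixture of expert Q-values
```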
Experiments
- Soccer game
- Trivia game
(Figures: experiment 1 | experiment 2)