DRON
Deep Reinforcement Opponent Network
Approach
- Jointly learn a policy and the opponent's behavior within a single DQN
- Use a Mixture-of-Experts architecture to discover different strategy patterns of opponents
Stated Questions
- how to combine the two networks
- what supervision signal to use.
- Predict Q-values only, since the goal is the best reward rather than accurately simulating the opponent
- Additionally predict extra information about the opponent when it is available, e.g., the type of its strategy (see the multitask loss sketch below)
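A minimal sketch of the multitask supervision option, assuming a PyTorch setup: the standard DQN TD loss is combined with an auxiliary cross-entropy loss for predicting the opponent's strategy type. The tensor shapes, the function name, and the weight `lam` are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def multitask_loss(q_pred, q_target, action, opp_type_logits, opp_type_label, lam=0.5):
    """DQN TD loss plus an auxiliary opponent-prediction loss (hypothetical shapes/weight).

    q_pred:          (batch, n_actions) predicted Q-values
    q_target:        (batch,) bootstrapped targets r + gamma * max_a' Q(s', a')
    action:          (batch,) actions taken
    opp_type_logits: (batch, n_types) predicted opponent strategy type
    opp_type_label:  (batch,) observed opponent type, used only when available
    """
    # Standard DQN regression on the chosen action's Q-value
    q_sa = q_pred.gather(1, action.unsqueeze(1)).squeeze(1)
    td_loss = F.smooth_l1_loss(q_sa, q_target)
    # Extra supervision: classify the opponent's strategy type
    aux_loss = F.cross_entropy(opp_type_logits, opp_type_label)
    return td_loss + lam * aux_loss
```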
Contents
- Two critical questions in opponent modeling are what variable(s) to model and how to use the predicted information
- To account for changing behavior, uncertainty in the opponent's strategy is modeled instead of classifying it into a set of stereotypes
- Domain knowledge is often required when prediction of the opponent is separated from learning the dynamics of the world; therefore, the policy and a probabilistic model of the opponent are learned jointly
- DRON consists of a Q-network (N_Q) that evaluates actions for a state and an opponent network (N_O) that learns a representation of the opponent's policy π_o (a minimal sketch follows)
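A minimal PyTorch sketch of the DRON-concat variant, assuming simple MLP encoders: the state representation from N_Q and the opponent representation from N_O are concatenated before the Q-value head. Layer sizes and module names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DRONConcat(nn.Module):
    """Encode the state (N_Q side) and opponent observations (N_O side) separately,
    then concatenate the hidden representations to predict Q-values."""
    def __init__(self, state_dim, opp_dim, n_actions, hidden=64):
        super().__init__()
        self.state_enc = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())  # N_Q encoder
        self.opp_enc = nn.Sequential(nn.Linear(opp_dim, hidden), nn.ReLU())      # N_O encoder
        self.q_head = nn.Linear(2 * hidden, n_actions)

    def forward(self, state, opp_features):
        h_s = self.state_enc(state)       # hidden state representation
        h_o = self.opp_enc(opp_features)  # hidden representation of the opponent's policy
        return self.q_head(torch.cat([h_s, h_o], dim=-1))
```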
(Figures: DRON model architectures | multitask variants)
(a) DRON-concat: it ignores the interaction between the world and the opponent
(b) DRON-MOE: Q-values have different distributions depending on φ, and each expert network captures one type of opponent strategy (see the sketch below)
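A minimal PyTorch sketch of the DRON-MOE idea, assuming K expert Q-heads over the state representation and a gate computed from the opponent representation φ; the gate's softmax weights mix the experts' Q-values. Names and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DRONMoE(nn.Module):
    """K expert Q-heads (one per opponent strategy type), mixed by a gate over phi(o)."""
    def __init__(self, state_dim, opp_dim, n_actions, n_experts=4, hidden=64):
        super().__init__()
        self.state_enc = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.opp_enc = nn.Sequential(nn.Linear(opp_dim, hidden), nn.ReLU())
        self.experts = nn.ModuleList([nn.Linear(hidden, n_actions) for _ in range(n_experts)])
        self.gate = nn.Linear(hidden, n_experts)

    def forward(self, state, opp_features):
        h_s = self.state_enc(state)
        h_o = self.opp_enc(opp_features)
        expert_q = torch.stack([e(h_s) for e in self.experts], dim=1)  # (batch, K, n_actions)
        w = F.softmax(self.gate(h_o), dim=-1)                          # (batch, K) mixture weights
        return (w.unsqueeze(-1) * expert_q).sum(dim=1)                 # gated mixture of expert Q-values
```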
Experiments
- Soccer game
- Trivia game
(Figures: experiment 1 | experiment 2)