Multi-Agent Reinforcement Learning


MARL (Multi-Agent Reinforcement Learning)

- These notes study multi-agent algorithms in reinforcement learning.
- It is the field in which multiple agents interact with one another, in cooperation or in competition, to find optimal behavior.
- To apply reinforcement learning in the real world, the characteristics of many different domains must be considered, so taking multiple agents into account is essential.

 



 

Example: OpenAI's hide-and-seek (tag) game from "Emergent Tool Use from Multi-Agent Interaction".

 

Simple Overview

What is Reinforcement Learning?

Reinforcement Learning

 

- An agent exploring an environment observes the current state and takes an action.
- The agent then receives a reward from the environment, together with the information that has changed as a result.

(If the action is good, the reward is a plus; if it is bad, the reward is a minus.)
- A reinforcement learning algorithm is a method by which the agent finds a policy, defined as a sequence of actions, that maximizes the reward accumulated in the future. (A toy sketch of this loop follows below.)
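
Below is a minimal sketch of this agent-environment loop in Python. The environment, the action set, and the reward rule (action 1 earns +1, action 0 earns -1, episodes last 10 steps) are invented purely for illustration; a real agent would choose actions from a learned policy instead of at random.

```python
import random

class ToyEnv:
    """A made-up environment: the state is just a step counter."""

    def reset(self):
        self.t = 0
        return self.t                        # initial state

    def step(self, action):
        self.t += 1
        reward = 1 if action == 1 else -1    # good action -> plus, bad action -> minus
        done = self.t >= 10                  # episode ends after 10 steps
        return self.t, reward, done          # changed state, reward, end-of-episode flag

env = ToyEnv()
state, done, total = env.reset(), False, 0
while not done:
    action = random.choice([0, 1])           # placeholder for the agent's policy
    state, reward, done = env.step(action)
    total += reward                           # the agent tries to maximize this accumulated reward
print("accumulated reward:", total)
```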

 


 

Q-learning

 

- The existing, traditional form of reinforcement learning is called Q-learning.
- The Q function gives the expected future reward that an agent in a given state can obtain by taking an action; the agent selects the action whose Q value is the maximum.
- In classical Q-learning, every time the agent acts, everything is recorded in the form of a table; a correct action earns a high reward, while a wrong action earns a negative one.
- Based on this, the agent finds and learns a policy that maximizes the accumulated reward (a small tabular sketch follows below).
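
The following is a minimal sketch of the tabular update that classical Q-learning performs; the state/action counts, learning rate, and the example transition are arbitrary illustration values.

```python
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.9                  # learning rate and discount factor
Q = np.zeros((n_states, n_actions))      # the "table": one value per (state, action) pair

def q_update(s, a, r, s_next):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

# example: in state 0, action 1 earned a reward of +1 and led to state 3
q_update(s=0, a=1, r=1.0, s_next=3)
print(Q[0, 1])   # 0.1 after this first update
```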

 


 

- However, Q-learning is difficult to apply to the real world, because it needs to know everything about the environment and the cost of full-width backups is high.
- Reinforcement learning today therefore uses Deep Q-learning, which brings in deep learning.

 

DQN

- A neural network is used for regression; the outputs of the network are the Q values for each possible action.

The goal is to find the action that maximizes the reward.
- The network is compared against the target Q value, and the weights are adjusted in the direction that reduces the difference (a sketch follows below).
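
Below is a minimal sketch of that update, assuming PyTorch is available. The network maps a state to one Q value per action, and one gradient step moves the predicted Q(s, a) toward the target r + gamma * max_a' Q(s', a'). A practical DQN would also use a replay buffer and a separate target network, which are omitted here; the layer sizes and the example transition are made up.

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99

# the network outputs one Q value per action
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(s, a, r, s_next, done):
    """One gradient step on a single (s, a, r, s') transition."""
    q_pred = q_net(s)[a]                        # predicted Q(s, a)
    with torch.no_grad():                       # target: r + gamma * max_a' Q(s', a')
        q_target = r + gamma * q_net(s_next).max() * (1.0 - done)
    loss = (q_pred - q_target) ** 2             # squared difference to be reduced
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                            # adjust the weights to shrink the difference
    return loss.item()

# example call with random tensors standing in for a real transition
dqn_update(torch.randn(state_dim), a=1, r=1.0, s_next=torch.randn(state_dim), done=0.0)
```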

 


 

 

What is an Agent?

- In general, an agent does not exist on its own.

It is part of an environment, or operates within one.
- It performs work on behalf of a user for a specific purpose.
- It has a knowledge base and reasoning capability, and solves a given problem through information exchange and communication with users, resources, or other agents.
- In addition, an agent itself recognizes changes in the environment, takes corresponding actions, and learns from experience.


 

Time-Invariance

- Time-invariance means that something does not change over time.
That is, performing the same action in the same state always produces the same reward distribution.

- Under this assumption, the Q function converges to the optimal value, and the agent can learn the optimal policy function.

- It is mathematically guaranteed that the policy will eventually converge to the optimum.

- In a multi-agent environment, however, the reward an agent receives depends not only on its own actions but also on the actions of the other agents, so the rewards become time-varying rather than time-invariant.

- In a time-invariant environment, the expected (average) value of a given state transition does not change over time: every transition has some stochastic element, so the time series looks noisy, but its mean is constant. In a time-varying environment, the expected value of a given state transition changes over time, and the mean or baseline of the time series drifts.

- The Q function tries to learn the expected values of state-action pairs, and convergence is guaranteed only when those values are time-invariant. In a multi-agent setting, the other agents' policies keep changing, so the expected state-action values change over time and convergence is no longer guaranteed. (In the figure, the upper, time-invariant series averages out to zero, while there is no guarantee that the lower one converges; a small simulation below illustrates the same point.)
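
The following is a small, made-up simulation of that contrast: the reward for a fixed state-action pair is drawn from a stationary distribution in one case, and from a distribution whose mean drifts (standing in for the other agents' changing policies) in the other. The running mean of the first settles near zero, while the second keeps moving.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5000

stationary = rng.normal(loc=0.0, scale=1.0, size=T)       # time-invariant reward stream (mean 0)
drift = np.linspace(0.0, 2.0, T)                            # other agents' policies slowly shift
nonstationary = rng.normal(loc=drift, scale=1.0)            # time-varying reward stream

running_mean = lambda x: np.cumsum(x) / np.arange(1, T + 1)
print("stationary, final running mean:    ", running_mean(stationary)[-1])     # close to 0
print("non-stationary, final running mean:", running_mean(nonstationary)[-1])  # follows the drift
```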

 


 


 

Single Agent → Multi Agent

IQL (Independent Q-Learning)

Moving from the single-agent reinforcement learning method to a multi-agent setting:
- The easiest way is to control each agent with its own separate DQN.
Each agent independently receives a state from the environment and takes an action.
- If you want to control all the agents with the same policy, you can model the multiple agents with a single DQN instance.

This method is called IQL (Independent Q-Learning); a sketch follows after the weaknesses below.

Weaknesses
- Interactions between agents do not affect each agent's decision-making.
- In the IQL algorithm, all agents operate independently: the influence of the other agents on an agent is not considered at all, and they are simply treated as noise.
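
A rough sketch of the IQL idea, using tabular Q-learning for brevity (independent DQNs would follow the same structure). Every agent keeps its own Q table and updates it from its own experience only; the other agents never appear in the update, which is exactly why their changing behavior ends up looking like noise. All sizes and hyperparameters here are arbitrary.

```python
import numpy as np

n_agents, n_states, n_actions = 3, 4, 2
alpha, gamma, eps = 0.1, 0.9, 0.1

# one independent Q table per agent (sharing a single table would model a shared policy)
Q = [np.zeros((n_states, n_actions)) for _ in range(n_agents)]

def select_action(i, s):
    """Epsilon-greedy action for agent i, based only on its own table and observation."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(Q[i][s].argmax())

def iql_update(i, s, a, r, s_next):
    """Agent i updates its own table; the other agents' actions are never seen."""
    td_target = r + gamma * Q[i][s_next].max()
    Q[i][s, a] += alpha * (td_target - Q[i][s, a])
```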

 


 

Problem

- In general, Q-learning does not work well in a multi-agent environment,
because the environment in which each agent learns its new policy is not the time-invariant environment that ordinary Q-learning assumes, but a time-varying one.
- In a time-varying environment, the expected value of the reward changes over time.
- To handle a time-varying environment, the Q function would need access to the joint action space of the other agents, but the size of this joint action space grows exponentially with the number of agents, so the computational cost becomes unrealistic even with a small number of agents (see the quick calculation below).
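
A quick back-of-the-envelope check of that exponential growth, with a made-up action set of size 5 per agent: the joint action space has |A| ** n_agents entries.

```python
n_actions = 5                      # actions available to each individual agent
for n_agents in (2, 5, 10, 20):
    joint = n_actions ** n_agents  # size of the joint action space
    print(f"{n_agents} agents -> {joint:,} joint actions")
# 10 agents already give 9,765,625 joint actions; 20 give roughly 9.5e13,
# far beyond what a table (or a naive Q function) could cover.
```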

 


 

To do

I will classify and study multi-agent reinforcement learning algorithms according to the problem each one addresses.

1. Modeling the relationships between agents.
Modeling the influence between agents using information such as the other agents' states and actions.

2. Credit assignment between agents.
Research on credit assignment between agents, i.e., how much each agent's actions contribute to solving the overall problem.

3. Communication between agents.
Communication between agents, used to overcome the constraint that each agent may only use its own information at execution time.

4. Exploration-exploitation dilemma.
Research that approaches the exploration-exploitation dilemma, a classic problem in reinforcement learning, in the MARL setting.

 

