MARL(Multi-Agent Reinforcement Learning)

- The goal is to study multi-agent algorithms in reinforcement learning.
- It is the field in which multiple agents interact with one another, whether cooperatively or competitively, and each searches for optimal behavior.
- To apply reinforcement learning to real problems, the characteristics of many different domains have to be considered, which makes taking multiple agents into account essential.

 

Example: OpenAI's hide-and-seek (tag) game, from "Emergent Tool Use from Multi-Agent Interaction".

 

Simple Overview

What is Reinforcement Learning?

 

- An agent exploring an environment recognizes the current state and takes an action.
- The agent then receives a reward from the environment, along with the information that has changed as a result. (A good action yields a positive reward; a bad action yields a negative one.)
- A reinforcement learning algorithm is a method by which the agent finds a policy, defined as a sequence of actions, that maximizes the reward it will accumulate in the future.

 

Q-learning

 

- Classic, tabular reinforcement learning is usually identified with Q-learning.
- The Q function estimates the expected future reward an agent in a given state can obtain by taking a given action; the agent then selects the action with the maximum Q value.
- In this traditional setting, every time the agent acts the result is recorded in a table: a correct action yields a high reward, and a wrong action yields a negative one.
- Based on this table, the agent finds and learns a policy that maximizes the accumulated reward.
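A minimal sketch of the tabular update described above; the state/action counts, learning rate, and discount factor are arbitrary values chosen for the example.

```python
import numpy as np

n_states, n_actions = 10, 4      # toy sizes, assumed for the example
alpha, gamma = 0.1, 0.99         # learning rate and discount factor

# The "table" that records a Q value for every (state, action) pair.
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q[s, a] toward the TD target
    r + gamma * max_a' Q[s', a']."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

# Example transition: in state 3 the agent took action 1,
# received reward +1, and landed in state 4.
q_update(s=3, a=1, r=1.0, s_next=4)
```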

 

- However, Q-learning is difficult to apply to the real world, because the agent needs to know everything about the environment and the full-width backup is expensive.
- Today, reinforcement learning typically uses Deep Q-learning, which approximates the Q function with a deep neural network.

 

DQN

- A neural network is used as a regression model: given a state, its outputs are the Q values for each possible action, and the goal is to find the action that maximizes the reward.
- The network's prediction is compared with the target Q value, and the weights are adjusted in the direction that reduces the difference.
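A minimal sketch of one DQN weight update, assuming PyTorch. The network sizes and hyperparameters are placeholders, and a real implementation would also use a replay buffer and periodic target-network updates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions = 4, 2   # placeholder sizes

# Online network: state in, one Q value per action out.
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
# Target network: a frozen copy used to compute the TD target.
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(states, actions, rewards, next_states, dones, gamma=0.99):
    # Q values the online network currently assigns to the taken actions.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # TD target from the target network: r + gamma * max_a' Q_target(s', a').
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * q_next
    # Adjust the weights in the direction that reduces the difference.
    loss = F.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```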

 

 

What is an Agent?

- In general, an agent does not exist independently; it is part of an environment, or operates inside one.
- It performs work on behalf of a user for a specific purpose.
- It has a knowledge base and a reasoning capability, and solves a given problem by exchanging information and communicating with users, resources, or other agents.
- In addition, an agent recognizes changes in its environment on its own, takes corresponding actions, and learns from experience.

 

Time-Invariant

- Time-invariant means something does not change over time: when the same action is taken in the same state, the same reward distribution is always produced.
- In such an environment the Q function generally converges to the optimal values, and the agent can learn the optimal policy function.
- It is mathematically guaranteed that the policy will eventually converge to the optimal one.
- In a multi-agent environment, however, the rewards an agent receives depend not only on its own actions but also on the actions of the other agents, so the environment becomes time-varying rather than time-invariant.
- As shown in the upper plot, the time-invariant environment is noisy but converges to zero when averaged, whereas there is no guarantee that the lower, time-varying one converges.

 

μ‹œλΆˆλ³€ ν™˜κ²½μ—μ„œ, μ£Όμ–΄μ§„ ν•œ μƒνƒœ μ „μ΄μ˜ κΈ°λŒ€ κ°€μΉ˜(평균 κ°€μΉ˜)λŠ” μ‹œκ°„이 μ§€λ‚˜λ„ λ³€ν•˜μ§€ μ•ŠλŠ”λ‹€(λΆˆλ³€). 
λͺ¨λ“  μƒνƒœ μ „μ΄μ—λŠ” μ–΄λŠ μ •λ„μ˜ ν™•λ₯ μ  μš”μ†Œκ°€ μžˆμ–΄μ„œ μ‹œκ³„μ—΄(time-series) κ·Έλž˜ν”„κ°€ μž‘음이 μ„žμΈ λͺ¨ μŠ΅μ΄κΈ΄ ν•˜μ§€λ§Œ, κ·Έλž˜λ„ μ‹œκ³„μ—΄μ˜ ν‰κ· μ€ μƒμˆ˜μ΄λ‹€. 
μ‹œλ³€ ν™˜κ²½μ—μ„œλŠ” μ£Όμ–΄μ§„ ν•œ μƒνƒœ μ „μ΄μ˜ κΈ°λŒ€ κ°€μΉ˜ κ°€ μ‹œκ°„에 λ”°λΌ λ³€ν•œλ‹€. 
μ•„λž˜ κ·Έλž˜ν”„λ₯Ό λ³΄λ©΄ μ‹€μ œλ‘œ μ‹œκ³„μ—΄μ˜ ν‰κ·  λ˜λŠ” κΈ°μ€€μ„ μ΄ μ‹œκ°„에 λ”°λΌ λ³€ν•œλ‹€. 
Q ν•¨μˆ˜λŠ” μƒνƒœ-λ™μž‘ μŒλ“€μ˜ κΈ°λŒ€ κ°€μΉ˜λ₯Ό λ°°μš°λ € ν•˜λ©°, μˆ˜λ ΄μ€ μƒνƒœ-λ™μž‘ κ°€μΉ˜λ“€μ΄ μ‹œλΆˆλ³€μΌ λ•Œλ§Œ λ³΄ μž₯λœλ‹€. 
닀쀑 μ—μ΄μ „νŠΈ μ„€μ •μ—μ„œλŠ” λ‹€λ₯Έ μ—μ΄μ „νŠΈλ“€μ˜ μ •μ±…이 λ³€ν•˜κΈ° λ•Œλ¬Έμ— μƒνƒœ-λ™μž‘ κ°€μΉ˜ κΈ°λŒ“값이 μ‹œκ°„에 λ”°λΌ λ³€ν•˜λ©°, λ”°λΌμ„œ μˆ˜λ ΄μ€ λ³΄μž₯λ˜μ§€ μ•ŠλŠ”λ‹€

 

Single Agent → Multi Agent

IQL(Independent Q-Learning)

From a single-agent reinforcement learning method to a multi-agent setting.
- The easiest way is to control each agent with its own separate DQN; each agent independently receives the state from the environment and takes its action.
- If you want to control all of the agents with the same policy, you can model the multiple agents with a single DQN instance.

This method is called IQL (Independent Q-Learning).

Weakness
- Interactions between agents do not affect each agent's decision-making.
- In IQL, all agents act independently; the influence the other agents have on an agent is not considered at all and is simply treated as noise, as the sketch below illustrates.

 

Problem

- In general, Q-learning does not work well in a multi-agent environment, because the environment in which each agent learns its new policy is not the time-invariant environment ordinary Q-learning assumes but a time-varying one.
- In a time-varying environment, the expected value of the rewards changes over time.
- To handle a time-varying environment, the Q function would need access to the joint action space of the other agents, but the size of that joint action space grows exponentially with the number of agents, so the computational cost becomes unrealistic even with a small number of agents (see the calculation below).

 

To do

Focusing on the problem each of them tries to solve, I will classify and study multi-agent reinforcement learning algorithms.

1. Modeling the relationships between agents.
Model the influence between agents using information such as the other agents' states and actions.

2. Credit assignment between agents.
Research on assigning credit among agents, i.e., how much each agent's actions contribute to solving the overall problem.

3. Communication between agents.
The problem of communication between agents, used to overcome the constraint that each agent may only use its own information at execution time.

4. Exploration-exploitation dilemma.
Studies that approach the exploration-exploitation dilemma, a classic problem of reinforcement learning, in the MARL setting.
