MARL(Multi-Agent Reinforcement Learning)
- Notes on studying multi-agent algorithms in reinforcement learning.
- A field in which multiple agents interact with each other, cooperatively or competitively, and find optimal behavior.
- To apply reinforcement learning in practice, the characteristics of many different domains must be considered, so handling multiple agents is essential.
Simple Overview
What is Reinforcement Learning?
- An agent exploring an environment recognizes the current state and takes a certain action.
- The agent then receives a reward from the environment along with the changed state information.
(A good action yields a positive reward; a bad action yields a negative reward.)
- A reinforcement learning algorithm is a method by which the agent finds a policy, defined as a sequence of actions, that maximizes the cumulative future reward.
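A minimal sketch of this interaction loop, assuming a hypothetical Gym-style `env` with `reset()`/`step()` methods and a placeholder random policy (all of these names are illustrative, not from the notes):

```python
import random

# A minimal sketch of the loop described above, assuming a hypothetical
# Gym-style environment whose step() returns (next_state, reward, done).
def run_episode(env, actions):
    state = env.reset()                    # observe the initial state
    total_reward, done = 0.0, False
    while not done:
        action = random.choice(actions)    # placeholder policy: act at random
        next_state, reward, done = env.step(action)
        total_reward += reward             # positive if good, negative if bad
        state = next_state                 # the environment has changed
    return total_reward
```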
- The classic, traditional form of reinforcement learning is Q-learning.
- The Q function estimates the expected future reward an agent in a given state can obtain through an action, and the agent selects the action with the maximum expected value.
- In this traditional form, every time the agent acts, everything is recorded in a table; right actions receive a high reward, and wrong actions a negative one.
- Based on this, the agent finds and learns a policy that maximizes accumulated rewards.
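A minimal sketch of this tabular update; the learning rate `alpha`, discount `gamma`, and the dictionary-backed table are illustrative assumptions, not details from the notes:

```python
from collections import defaultdict

Q = defaultdict(float)          # the table: one entry per (state, action)
alpha, gamma = 0.1, 0.99        # learning rate and discount (illustrative)

def q_update(state, action, reward, next_state, actions):
    # Target = immediate reward + discounted maximum future value,
    # i.e. the best the agent expects to do from the next state.
    best_next = max(Q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    # Nudge Q(s, a) toward the target; rewards raise it, penalties lower it.
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```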
- However, Q-learning is difficult to apply to the real world, because it needs to know everything about the environment and full-width backups carry a high cost.
- Today, reinforcement learning commonly uses Deep Q-learning, which applies deep learning.
- A regression neural network is used, and its outputs are the Q values for each possible action.
The goal is to find the action that maximizes reward.
- The network is compared against a target Q function, and its weights are adjusted in the direction that reduces the difference.
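A minimal PyTorch sketch of this setup; the layer sizes (4 state features, 2 actions), the separate target network, and the MSE loss are standard DQN assumptions rather than details from the notes:

```python
import torch
import torch.nn as nn

# Regression network: input is the state, output is one Q value per action.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())
gamma = 0.99

def dqn_loss(state, action, reward, next_state, done):
    # Shapes: state/next_state [B, 4] float, action [B, 1] long,
    # reward/done [B, 1] float.
    q_pred = q_net(state).gather(1, action)   # Q value of the taken action
    with torch.no_grad():
        # Target: reward + discounted max Q from the target network.
        q_next = target_net(next_state).max(dim=1, keepdim=True).values
        q_target = reward + gamma * q_next * (1 - done)
    # Backpropagating this loss adjusts weights to shrink the difference.
    return nn.functional.mse_loss(q_pred, q_target)
```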
What is an Agent?
- In general, an agent does not exist independently.
It is part of an environment, or operates within one.
- It performs work on behalf of a user for a specific purpose.
- It has a knowledge base and reasoning capability, and solves a given problem through information exchange and communication with users, resources, or other agents.
- In addition, agents themselves recognize environmental changes, take corresponding actions, and learn from experience.
Time-Invariant
- Time-invariance means something does not change over time.
That is, taking the same action in the same state always yields the same reward distribution.
- Under this assumption, the Q function generally converges to the optimal value, and the agent can learn the optimal policy.
- It is mathematically guaranteed that the policy will eventually converge to the optimum.
- However, in a multi-agent environment, the rewards an agent receives depend not only on its own actions but also on the actions of the other agents, so they become time-varying rather than time-invariant.
- As the figure's upper panel shows, the time-invariant environment is noisy but converges to zero on average, while there is no guarantee that the lower, time-varying one converges.
- In a time-invariant environment, the expected value (the mean) of a given state transition does not change over time.
- Every state transition has some stochastic noise, so the time-series plot looks noisy, but the mean of the series is constant.
- In a time-varying environment, the expected value of a given state transition changes over time: the mean or baseline of the series actually drifts.
- The Q function tries to learn the expected values of state-action pairs, and convergence is guaranteed only when those state-action values are time-invariant.
- In a multi-agent setting, the other agents' policies change, so the expected state-action values change over time, and therefore convergence is not guaranteed.
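A small sketch of this difference, with made-up reward processes (the `drift` parameter is purely illustrative): the running average of a stationary, time-invariant reward settles toward its fixed mean, while a drifting, time-varying reward gives the average nothing fixed to converge to.

```python
import random

def running_means(n=10_000, drift=0.0):
    mean, total, means = 0.0, 0.0, []
    for t in range(1, n + 1):
        reward = mean + random.gauss(0, 1)  # noisy reward around the mean
        mean += drift                       # drift = 0.0 keeps it stationary
        total += reward
        means.append(total / t)
    return means

stationary = running_means(drift=0.0)     # running average settles near 0
drifting = running_means(drift=0.001)     # baseline keeps moving; no guarantee
print(stationary[-1], drifting[-1])
```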
Single Agent → Multi Agent
IQL(Independent Q-Learning)
Moving from a single-agent reinforcement learning method to a multi-agent setting.
- The easiest way is to control each agent with its own separate DQN.
Each agent independently receives a state from the environment and takes an action.
- If you want to control all agents with the same policy, you can model the multiple agents with a single DQN instance.
This method is called IQL(Independent Q-Learning).
Weakness
- Interactions between agents do not influence each agent's decision-making.
- In IQL, every agent works independently; the influence the other agents have on it is not considered at all and is treated as mere noise.
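A minimal sketch of the IQL idea under stated assumptions (greedy tabular learners instead of full DQNs, hypothetical state/action types): each agent keeps its own independent value table and never models the others, whose influence shows up only as noise in its rewards.

```python
from collections import defaultdict

class IQLAgent:
    """One independent learner; other agents are treated as part of the environment."""
    def __init__(self, actions, alpha=0.1, gamma=0.99):
        self.Q = defaultdict(float)
        self.actions = actions
        self.alpha, self.gamma = alpha, gamma

    def act(self, state):
        # Greedy choice from this agent's own table only.
        return max(self.actions, key=lambda a: self.Q[(state, a)])

    def update(self, state, action, reward, next_state):
        best_next = max(self.Q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.Q[(state, action)] += self.alpha * (target - self.Q[(state, action)])

# Several agents trained side by side, each with its own independent learner.
agents = [IQLAgent(actions=[0, 1]) for _ in range(3)]
```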
Problem
- In general, Q-learning does not work well in a multi-agent environment,
because the environment in which each agent learns its new policy is not the time-invariant environment that ordinary Q-learning assumes, but a time-varying one.
- In a time-varying environment, the expected value of the reward varies over time.
- To handle a time-varying environment, the Q function needs access to the joint action space of the other agents, and the size of this joint action space grows exponentially with the number of agents, so the computational cost becomes unrealistic even with a small number of agents.
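A quick illustration of that blow-up (the 4-action count per agent is an arbitrary example): N agents with |A| actions each produce a joint action space of size |A|^N.

```python
num_actions = 4                          # |A| per agent (arbitrary example)
for n_agents in (1, 2, 5, 10, 20):
    print(n_agents, num_actions ** n_agents)
# With just 10 agents the joint space already holds 4**10 = 1,048,576 actions.
```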
To do
Focusing on the problem each one solves, I will classify multi-agent reinforcement learning algorithms and study them.
1. Modeling relationships between agents.
Model the influence between agents using information such as the other agents' states and actions.
2. Credit assignment between agents.
Studies on multi-agent credit assignment, which deals with how much each agent's actions contribute to solving the overall problem.
3. Communication between agents.
Communication between agents, used to overcome the restriction that each agent may only use its own information during execution.
4. Exploration-exploitation dilemma.
Studies that approach the exploration-exploitation dilemma, a classic problem of reinforcement learning, in the MARL setting.