OpenAI Gym CartPole-v0: running the code and its results

import gym
import time

env = gym.make('CartPole-v0')  # load the reinforcement learning environment

for i_episode in range(20):
    # Load a new episode (the initial environment) by calling reset
    observation = env.reset()
    for t in range(100):
        env.render()  # draw to the screen, using the observation obtained before the action is taken
        time.sleep(0.05)

        # Observation of the environment obtained before taking the action
        print('observation before action:')
        print(observation)
        action = env.action_space.sample()  # choose a random action
        observation, reward, done, info = env.step(action)  # send the chosen action to the environment
        time.sleep(0.05)

        # Observation of the environment obtained after taking the action
        print('observation after action:')
        print(observation)

        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
            
env.close()

Gym is an OpenAI library for understanding how reinforcement learning is implemented and for practicing it with simple exercises.

  • A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track.
  • The pendulum starts upright, and the goal is to prevent it from falling over by increasing and reducing the cart's velocity.
  • This environment corresponds to the version of the cart-pole problem

 

Episode termination conditions

  • The pole angle is more than 12 degrees from vertical.
  • The cart position is more than 2.4 from the center (the center of the cart reaches the edge of the display).
  • The episode length exceeds 200 steps.
  • The task is considered solved when the average reward over 100 consecutive trials is greater than or equal to 195.0.
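
The termination conditions above can be sketched as a small check on a single observation. This is only an illustration: `episode_should_end` and the constant names are hypothetical and not part of Gym, the angle is assumed to be given in radians, and the step limit is assumed to trigger once 200 steps have been taken.

```python
import math

# Thresholds taken from the list above (names are illustrative only).
ANGLE_LIMIT_RAD = math.radians(12)  # 12 degrees, roughly 0.2094 rad
POSITION_LIMIT = 2.4                # distance of the cart from the center
MAX_EPISODE_STEPS = 200             # CartPole-v0 step limit


def episode_should_end(observation, steps_taken):
    """Return True if any of the listed termination conditions holds."""
    cart_position, _cart_velocity, pole_angle, _tip_velocity = observation
    pole_fell = abs(pole_angle) > ANGLE_LIMIT_RAD
    cart_left_track = abs(cart_position) > POSITION_LIMIT
    time_limit_hit = steps_taken >= MAX_EPISODE_STEPS
    return pole_fell or cart_left_track or time_limit_hit


if __name__ == "__main__":
    print(episode_should_end([0.0, 0.0, 0.0, 0.0], steps_taken=10))
    print(episode_should_end([2.5, 0.0, 0.0, 0.0], steps_taken=10))
```

In practice the environment performs this check itself and reports it through `done`; the sketch only makes the thresholds explicit.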

 

When the agent takes an action on the environment through the env.step function, the function returns the information about the environment obtained after that action.

 

<Values returned by the CartPole environment>

observation : a 4-dimensional vector describing the environment; its elements are Cart Position, Cart Velocity, Pole Angle, and Pole Velocity At Tip.

reward is +1 at every time step as long as the pole has not fallen.

done is a boolean value indicating whether the current episode has ended.

(The episode is terminated when the pole falls over or the cart moves too far from the center.)
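
To make the four observation elements easier to read when printed, they can be given names. The following is a minimal sketch; `CartPoleObservation` and `label_observation` are hypothetical helpers, not part of Gym, and the order of the fields is assumed to follow the description above.

```python
from typing import NamedTuple, Sequence


class CartPoleObservation(NamedTuple):
    """Named view of the 4-dimensional CartPole observation."""
    cart_position: float
    cart_velocity: float
    pole_angle: float          # assumed to be in radians
    pole_tip_velocity: float


def label_observation(observation: Sequence[float]) -> CartPoleObservation:
    """Convert a raw 4-element observation into a named tuple."""
    if len(observation) != 4:
        raise ValueError("expected 4 values, got {}".format(len(observation)))
    return CartPoleObservation(*(float(v) for v in observation))


if __name__ == "__main__":
    # Example values chosen by hand, not taken from a real run.
    labeled = label_observation([0.03, -0.21, 0.04, 0.31])
    print(labeled.cart_position, labeled.pole_angle)
```

The array returned by `env.reset()` or `env.step()` could be passed to `label_observation` directly, since it only needs to be a sequence of four numbers.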

 

In the CartPole environment, the action space (action_space) available to the Agent consists of two values, 0 and 1.

0 means moving to the left, and 1 means moving to the right.
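
As an alternative to sampling a random action, a very simple hand-written policy can choose between the two actions. The sketch below pushes the cart toward the side the pole is leaning; `lean_following_action` is a hypothetical helper, and it assumes that a positive pole angle means the pole leans to the right.

```python
LEFT, RIGHT = 0, 1  # the two actions described above


def lean_following_action(observation):
    """Push the cart toward the side the pole is leaning.

    Assumes observation[2] is the pole angle and that a positive
    angle means the pole leans to the right.
    """
    pole_angle = observation[2]
    return RIGHT if pole_angle > 0 else LEFT


if __name__ == "__main__":
    print(lean_following_action([0.0, 0.0, 0.05, 0.0]))
    print(lean_following_action([0.0, 0.0, -0.05, 0.0]))
```

This is not a learned policy. It only shows how an observation can be mapped to one of the two action values.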

 

In CartPole-v0, the maximum reward obtainable in a single episode is 200, because an episode lasts at most 200 steps.
At every step the agent receives a reward of 1 as long as the pole has not fallen over,
and the problem is considered solved when the average reward over 100 consecutive episodes is 195 or more.
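
The "solved" criterion can be tracked with a rolling window over per-episode total rewards. This is a minimal sketch under that reading of the criterion; `SolvedTracker` is a hypothetical class, not something Gym provides.

```python
from collections import deque


class SolvedTracker:
    """Track whether the mean return of the last `window` episodes meets `threshold`."""

    def __init__(self, window=100, threshold=195.0):
        self.returns = deque(maxlen=window)  # keeps only the most recent episodes
        self.threshold = threshold

    def add(self, episode_return):
        """Record one episode's total reward and report the solved status."""
        self.returns.append(episode_return)
        return self.is_solved()

    def is_solved(self):
        # Require a full window so early lucky episodes do not count as solved.
        if len(self.returns) < self.returns.maxlen:
            return False
        return sum(self.returns) / len(self.returns) >= self.threshold


if __name__ == "__main__":
    tracker = SolvedTracker()
    for _ in range(100):
        solved = tracker.add(200.0)  # pretend every episode reached the step limit
    print(solved)
```

In a training loop, the sum of the rewards returned by `env.step` during an episode would be passed to `add` when `done` becomes True.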

 

The step function
When the chosen action is passed to the step function, it returns the following four values.
observation : an environment-specific observation, such as pixel data
reward : the reward received from the environment for taking that action
done : True when the episode has terminated (the goal was reached, or the agent lost a life)
info : additional information about the environment (scores and so on)

The Agent interacts with the Environment by choosing an action at each time step.
The Environment in turn receives the action from the Agent and returns a reward and an observation.
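
The four-value return described above matches older Gym releases. Newer releases (Gym 0.26 and later, and Gymnasium) return five values from step, splitting done into terminated and truncated. The sketch below normalizes both shapes into the four values used in this post; `normalize_step_result` is a hypothetical helper, not a Gym function.

```python
def normalize_step_result(result):
    """Return (observation, reward, done, info) for either step() return shape."""
    if len(result) == 4:
        # Older API: (observation, reward, done, info)
        observation, reward, done, info = result
    elif len(result) == 5:
        # Newer API: (observation, reward, terminated, truncated, info)
        observation, reward, terminated, truncated, info = result
        done = terminated or truncated
    else:
        raise ValueError("unexpected step() result of length {}".format(len(result)))
    return observation, reward, bool(done), info


if __name__ == "__main__":
    # Hand-made tuples standing in for real env.step(...) results.
    old_style = ([0.0, 0.0, 0.0, 0.0], 1.0, False, {})
    new_style = ([0.0, 0.0, 0.0, 0.0], 1.0, False, True, {})
    print(normalize_step_result(old_style)[2])
    print(normalize_step_result(new_style)[2])
```

With such a helper, the loop body could be written as `observation, reward, done, info = normalize_step_result(env.step(action))` regardless of the installed version. Note that in the newer releases reset also returns a tuple of observation and info, which this sketch does not cover.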
