728x90
๋ฐ˜์‘ํ˜•

๊ฐ•ํ™”ํ•™์Šต 3

[ ๊ฐ•ํ™”ํ•™์Šต ] 2. Multi-arm Bandits

Part โ… . Tabular Solution Methods ๊ฐ•ํ™”ํ•™์Šต์˜ simplest forms์— ๋Œ€ํ•˜์—ฌ ๋ฐฐ์šฐ๋Š” ์ฑ•ํ„ฐ๋‹ค. action-value function์ด array๋‚˜ table ํ˜•ํƒœ๋กœ ๋‚˜ํƒ€๋‚˜๊ธฐ์— ์ถฉ๋ถ„ํ•  ์ •๋„๋กœ ๊ทธ state์™€ action space๊ฐ€ ์ž‘์€ ํ˜•ํƒœ๋‹ค. ์ด๋Ÿฌํ•œ ๊ฒฝ์šฐ, optimal value function๊ณผ optimal policy๋ฅผ ์ฐพ์„ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’๋‹ค. ์ด๋Š” ์˜ค์ง approximate solutions๋งŒ ์ฐพ์•„๋‚ด๋Š” much larger problems๊ณผ ๋Œ€๋น„๋œ๋‹ค. ๊ฐ•ํ™”ํ•™์Šต์ด ๋‹ค๋ฅธ ํ•™์Šต๋“ค๊ณผ ๊ตฌ๋ถ„๋˜๋Š” ๊ฐ€์žฅ ์ค‘์š”ํ•œ ํŠน์ง•์€ correct actions์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์ œ๊ณตํ•˜์—ฌ instruct ํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ actions์„ ํ‰๊ฐ€ํ•œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ์ด๊ฒƒ์ด ๊ณง active exploration์˜..

[ ๊ฐ•ํ™”ํ•™์Šต ] 1. The Reinforcement Learning Problem

์ฃผ์–ด์ง„ ์–ด๋–ค ์ƒํ™ฉ(state)์—์„œ ๋ณด์ƒ(reward)์„ ์ตœ๋Œ€ํ™”ํ•  ์ˆ˜ ์žˆ๋Š” ํ–‰๋™(action)์— ๋Œ€ํ•ด ํ•™์Šต ๋‹ต์ด ์กด์žฌํ•˜๋Š” ํ›ˆ๋ จ๋ฐ์ดํ„ฐ๋ฅผ ํ† ๋Œ€๋กœ ํ•œ ํ•™์Šต์ด ์•„๋‹Œ ํ™˜๊ฒฝ๊ณผ์˜ ์ƒํ˜ธ์ž‘์šฉ์„ ํ†ตํ•ด ํ•™์Šต ํ˜„์žฌ ์„ ํƒํ•œ Action์ด ๋ฏธ๋ž˜์˜ ์ˆœ์ฐจ์  ๋ณด์ƒ์— ์˜ํ–ฅ (Delayed Reward) External Supervisor์ด ์กด์žฌํ•˜์ง€ ์•Š๋Š”๋‹ค. [ Trade-off between Exploitation and Exploration ] Agent๋Š” reward๋ฅผ ์–ป๊ธฐ ์œ„ํ•œ action์„ ์„ ํƒํ•˜๊ธฐ ์œ„ํ•ด ์ด๋ฏธ ๊ฒฝํ—˜ํ•œ ๊ฒƒ์„ exploitํ•˜๊ฑฐ๋‚˜ ๋ฏธ๋ž˜์— ๋” ๋‚˜์€ action selection์„ ์œ„ํ•œ environment์™€์˜ ์ƒํ˜ธ์ž‘์šฉ์„ ์œ„ํ•ด exploreํ•œ๋‹ค. ์œ„ ๋‘ ๋ฐฉ๋ฒ• ์ค‘์— ๋” ๋‚˜์€ ๊ฒฐ๊ณผ๋ฅผ ๋งŒ๋“ค ๋ฐฉ๋ฒ•์„ ํƒํ•˜์—ฌ์•ผ ํ•œ๋‹ค. ๊ฐ•ํ™”ํ•™์Šต ๊ตฌ์„ฑ์š”..

[ ๊ฐ•ํ™”ํ•™์Šต ] 0. Introduction

๊ฐ•ํ™”ํ•™์Šต ( Reinforcement Learning ) ํ™˜๊ฒฝ(Environment)์„ ํƒ์ƒ‰ํ•˜๋Š” ํ•™์Šต์ฃผ์ฒด(Agent)๋Š” ํ˜„์žฌ ์ƒํƒœ(State)๋ฅผ ์ธ์‹ํ•˜์—ฌ ์–ด๋–ค ํ–‰๋™(Action)์„ ์ทจํ•˜๊ณ , ํ™˜๊ฒฝ์œผ๋กœ๋ถ€ํ„ฐ ๋ณด์ƒ(Reward)๋ฅผ ์–ป๋Š”๋‹ค. ๊ฐ•ํ™”ํ•™์Šต์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ Agent๊ฐ€ ์•ž์œผ๋กœ ๋ˆ„์ ๋  Reward๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ์ผ๋ จ์˜ Actions๋กœ ์ •์˜๋˜๋Š” Policy๋ฅผ ์ฐพ๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. ํ˜„์žฌ ์„ ํƒํ•œ Action์ด ๋ฏธ๋ž˜์˜ ์ˆœ์ฐจ์  Reward์— ์˜ํ–ฅ์„ ๋ฏธ์นœ๋‹ค๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•˜๋‹ค. (Delayed Reward) ์œ„ ์„œ์ ๊ณผ ๋ฐ•์œ ์„ฑ ๊ต์ˆ˜๋‹˜์˜ ์„œ์ ์„ ์ฐธ๊ณ ํ•˜์—ฌ ๊ฐ•ํ™”ํ•™์Šต์— ๋Œ€ํ•œ ์ด๋ก ์ ์ธ ์ดํ•ด๋ฅผ, Python OpenAI Gym ๋ผ์ด๋ธŒ๋Ÿฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๊ทธ ๊ตฌํ˜„์„ ๋ชฉํ‘œ๋กœ ๊ณต๋ถ€ํ•˜๊ณ  ํ•ด๋‹น ๋‚ด์šฉ์„ ์ •..

728x90
๋ฐ˜์‘ํ˜•