A Neural Probabilistic Language Model

  • Yoshua Bengio,
  • Réjean Ducharme,
  • Pascal Vincent,
  • Christian Janvin

March 1, 2003

NPLM์€ ๋‹จ์–ด๋ฅผ ์ž„๋ฒ ๋”ฉํ•˜์—ฌ ๋ฒกํ„ฐ๋กœ ๋ฐ”๊พธ๋Š” ๊ณผ์ •์—์„œ ์‹ ๊ฒฝ๋ง ๊ธฐ๋ฐ˜์˜ ๊ธฐ๋ฒ•์„ ์ œ์‹œํ•˜์—ฌ ํ–ฅํ›„ Word2Vec์œผ๋กœ ๊ฐ€๋Š” ๊ธฐ๋ฐ˜์ด ๋˜์—ˆ๋‹ค๊ณ ํ•œ๋‹ค.

In short, the limitations of earlier n-gram language models:

  • Sentences containing an n-gram that never appears in the training data are assigned probability 0.
  • Because it is hard to push n beyond about 5, long-range dependencies within a sentence are difficult to capture.
  • Similarity between words/sentences is not taken into account.

neural net์„ ์“ฐ๊ธฐ ์ด์ „์—๋Š” smoothing( ์ž‘์€ ์ƒ์ˆ˜๋ฅผ ๋”ํ•ด์„œ 0์ด ์•ˆ๋‚˜์˜ค๋„๋ก) ๋˜๋Š” backoff๋ฅผ ์‚ฌ์šฉํ•ด์„œ data sparcity๋ฅผ ํ•ด๊ฒฐํ–ˆ๋‹ค. long-term dependencies ๋ฌธ์ œ๋Š” n๊ฐœ์˜ ํ† ํฐ๋งŒ ๊ฒ€์ƒ‰ํ•˜๋ฏ€๋กœ ๋‹ค์Œ ํ† ํฐ์€ ์ถ”๋ก ํ•  ์ˆ˜ ์—†๋‹ค๋Š” ๋ฌธ์ œ์ธ๋ฐ ์ด๊ฒƒ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด n์„ ๋Š˜๋ฆฌ๋ฉด ๋˜๋‹ค์‹œ data sparcity์™€ ๋งˆ์ฃผํ•˜๊ฒŒ ๋œ๋‹ค. n-gram์œผ๋กœ๋Š” long-term dependencies๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜ ์—†๋‹ค.

nplm์€ ์ด๋Ÿฌํ•œ ๋ฌธ์ œ์ ์„ ํ•ด๊ฒฐํ•˜๋Š” ๊ณผ์ •์—์„œ ๋‚˜ํƒ€๋‚ฌ๋‹ค.

Abstract

A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed approach allows to take advantage of longer contexts.


Overview

The Neural Probabilistic Language Model (NPLM) is one of the approaches that use a distributed representation.

💡
Distributed Representation: a distributed representation learns from text using the distributional hypothesis and represents a word's meaning spread across many dimensions of a vector.

Previously, one-hot encoding was used most of the time. In one-hot encoding, every word in the vocabulary is assigned its own unique index. Each word vector has the same length as the vocabulary, with a 1 at the index assigned to that word and 0 everywhere else. For example, given a vocabulary consisting of arsenal, chelsea, liverpool, and manchester, arsenal is represented by setting the first value to 1 and every other value to 0.
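A minimal sketch of this encoding with the same four-word vocabulary:

```python
vocab = ["arsenal", "chelsea", "liverpool", "manchester"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("arsenal"))  # [1, 0, 0, 0]
```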

Drawbacks of one-hot encoding

  1. As the vocabulary grows, the vector representing each word grows with it. With a very large word list, the vector needed to represent a single word also becomes very large.
  2. It cannot express the relationship between two words. Because each vector has its 1 in a different position and 0 everywhere else, the dot product of any two such vectors is always 0. That means the two vectors are orthogonal, so they sit independently of each other in the vector space. Consequently, one-hot vectors cannot capture the relationship between two words.

The second problem is the more serious one: in real usage, words are related to one another, and capturing those relationships matters a great deal. Distributed representations are used to solve these problems.
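Continuing the toy vocabulary above, the orthogonality problem shows up immediately in code:

```python
import numpy as np

arsenal = np.array([1, 0, 0, 0])
chelsea = np.array([0, 1, 0, 0])

# Any two distinct one-hot vectors are orthogonal, so their dot product
# is always 0 and carries no information about how related the words are.
print(arsenal @ chelsea)  # 0
```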

In NPLM, the current word is predicted from the previous (n-1) words. That is, the t-th word is predicted from words (t−1) down to (t−n+1). Written as a formula:
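$$\hat{P}(w_t \mid w_{t-1}, \ldots, w_{t-n+1})$$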

NPLM์€ ์œ„์˜ ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ ์„ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ํ•™์Šต์„ ํ•˜๊ฒŒ ๋œ๋‹ค. ์ด์ „์˜ (n-1) ๊ฐœ์˜ ๋‹จ์–ด๋ฅผ ์ด์šฉํ•ด์„œ ํ˜„์žฌ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๊ธฐ ๋•Œ๋ฌธ์—, n-gram ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•œ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

์œ„์˜ ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ ์„ ์ตœ๋Œ€ํ™”ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋ถ„์ž๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๊ณ  ๋ถ„๋ชจ๋ฅผ ์ตœ์†Œํ™”ํ•ด์•ผ ํ•œ๋‹ค.

NPLM์˜ ์ถœ๋ ฅ์ธต์—์„œ๋Š” V-์ฐจ์›์˜ ์ ์ˆ˜ ๋ฒกํ„ฐ(score vector)์— ํ•ด๋‹นํ•˜๋Š” ywty_wt๏ปฟ ๊ฐ’์„ softmax ํ•จ์ˆ˜๋ฅผ ํ†ต๊ณผ์‹œ์ผœ ๋‚˜์˜จ ํ™•๋ฅ ๋ฒกํ„ฐ๋ฅผ ์ œ๊ณตํ•œ๋‹ค. ํ™•๋ฅ  ๋ฒกํ„ฐ์—์„œ ๋†’์€ ํ™•๋ฅ ๊ฐ’์ด ์ œ๊ณต๋˜๋Š” ์ธ๋ฑ์Šค์˜ ๋‹จ์–ด๊ฐ€ (one-hot-vector๋ฅผ ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์—) ์‹ค์ œ ์ •๋‹ต์— ํ•ด๋‹นํ•˜๋Š” ๋‹จ์–ด์™€ ์ผ์น˜ํ•˜๋„๋ก ํ•™์Šต์„ ์ง„ํ–‰ํ•œ๋‹ค.

NPLM์˜ ์ž…๋ ฅ์ธต์—์„œ๋Š” (n-1)๊ฐœ์˜ ์„ ํ–‰ํ•˜๋Š” ๋‹จ์–ด๋“ค์˜ one-hot-vector๊ฐ€ ์ž…๋ ฅ์œผ๋กœ ๋“ค์–ด์˜จ๋‹ค. One-hot-vector์ด๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ ๋ฒกํ„ฐ๋Š” V-์ฐจ์›์ผ ๊ฒƒ์ด๋‹ค. ๊ฐ ๋ฒกํ„ฐ๋“ค์€ C ๏ปฟ์™€ ์—ฐ์‚ฐ์„ ํ†ตํ•ด์„œ ์ง€์ •๋œ ํฌ๊ธฐ์ธ mm๏ปฟ์ฐจ์›์˜ ์ž…๋ ฅ ๋ฒกํ„ฐ๋กœ ๋ณ€ํ™˜๋œ๋‹ค. ํ•ด๋‹น ๋ณ€ํ™˜์˜ ๊ณผ์ •์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

Because the input vector $x_t$ is $m$-dimensional, $C$ has shape $m \times |V|$ so that it can produce a result of that size. Multiplying the one-hot vector $w_t$ by $C$ is the same operation as looking up (referencing) the single column of $C$ at the index of $w_t$.
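A small sketch of this equivalence, assuming a toy vocabulary of size 4 and m = 3:

```python
import numpy as np

V, m = 4, 3                  # vocabulary size, embedding dimension
C = np.random.randn(m, V)    # projection matrix, one column per word

word_index = 2               # index of w_t in the vocabulary
w_t = np.zeros(V)
w_t[word_index] = 1.0        # one-hot vector for w_t

# Multiplying by the one-hot vector selects a single column of C,
# which is exactly a table lookup of that word's embedding.
x_t = C @ w_t
assert np.allclose(x_t, C[:, word_index])
```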


Curse of Dimensionality with Distributed Representations

The authors argue that the fundamental obstacle to learning is the growth in dimensionality that comes with feeding words to a computer. For example, with a vocabulary of 10,000 words, one-hot encoding sets only the element for the word in question to 1 and every other element to 0. Such sparse vectors are inefficient in many respects, and in real problems with very large vocabularies the computational cost grows sharply and hinders learning. The authors also note that expressing semantic and syntactic similarity between words is important, yet this too is handled poorly. The distributed representation was devised to overcome these problems.

Instead of giving a word a vector whose dimensionality equals the whole vocabulary size, as one-hot vectors do, a distributed representation uses a much smaller m-dimensional vector, turning a sparse vector into a dense one. Its elements are continuous rather than discrete variables, and the authors stress that continuous variables are far easier and more efficient to model, in terms of smoothness, than discrete ones. Although reducing dimensionality is the main goal, the fact that the elements are continuous also makes similarity and distance between vectors computable, which in turn means the word similarity mentioned earlier can be expressed.
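A minimal sketch of why dense, continuous vectors make similarity measurable; the two embeddings below are hypothetical stand-ins, not learned values:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two dense word vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat = np.array([0.21, -0.40, 0.13, 0.88])   # hypothetical learned embeddings
dog = np.array([0.19, -0.35, 0.20, 0.80])

print(cosine_similarity(cat, dog))  # close to 1.0 for similar words
```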

n-gram

A statistical language model is expressed as a product of conditional probabilities, each giving the next word given the words before it.
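Written out, the joint probability of a sequence $w_1, \ldots, w_T$ factorizes as

$$P(w_1, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})$$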

The n-gram model refines this by keeping the benefit of respecting word order and exploiting the fact that words close together in a sequence are statistically more dependent: it conditions only on the previous n-1 words, i.e. the context.
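The n-gram assumption truncates that conditioning context to the most recent $n-1$ words:

$$P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$$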

But what if some sequence of n words never occurs in the training corpus? Its probability cannot simply be set to 0. The authors point out that new word sequences can always appear, and the larger the context, the more frequent this becomes. The standard remedies are the back-off method (Katz, 1987), which falls back to a smaller context, and smoothing, which adds a particular value so the probability never becomes 0. The authors note, however, that the n-gram model still ignores any relation to words outside the context window and does not account for similarity between words.
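A minimal sketch of the back-off idea, falling back to shorter contexts when a longer n-gram was never seen. This is a simplified "stupid backoff"-style illustration with a made-up 0.4 penalty, not Katz's (1987) discounted scheme.

```python
from collections import Counter

def backoff_prob(trigram_counts, bigram_counts, unigram_counts, context, word):
    """Fall back to shorter contexts when the longer n-gram was never seen."""
    w1, w2 = context
    if trigram_counts[(w1, w2, word)] > 0:
        return trigram_counts[(w1, w2, word)] / bigram_counts[(w1, w2)]
    if bigram_counts[(w2, word)] > 0:
        return 0.4 * bigram_counts[(w2, word)] / unigram_counts[w2]
    return 0.4 * 0.4 * unigram_counts[word] / sum(unigram_counts.values())

tokens = "the cat sat on the mat the cat lay on the mat".split()
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
bi = Counter(zip(tokens, tokens[1:]))
uni = Counter(tokens)

print(backoff_prob(tri, bi, uni, ("the", "cat"), "sat"))  # 0.5, trigram was seen
print(backoff_prob(tri, bi, uni, ("on", "the"), "cat"))   # 0.2, backs off to the bigram (the, cat)
```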

NPLM์€ n-gram๋ชจ๋ธ์„ ๋ณธ์งˆ๋กœ ํ•˜๋˜, ์ด๋Ÿฌํ•œ ๋‹จ์ ๋“ค์„ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด ์•ž์—์„œ ์–ธ๊ธ‰ํ•œ ๋ถ„์‚ฐํ‘œํ˜„(Distributed Representation)์„ ์ด์šฉํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

What is NPLM?

NPLM (Neural Probabilistic Language Model) is an embedding technique introduced in 2003; it is an n-gram language model that predicts the n-th word from the sequence of the preceding n-1 words.

It addresses the problems of earlier language models:

  • assigning probability 0 to n-grams that never appear in the training data
  • the curse of dimensionality: setting n large makes the zero-probability case above occur frequently
  • the inability to compute similarity between words

$w_t$ is the word that appears at position $t$ in the sentence.

"index for $w_t$" is the one-hot vector that points to $w_t$.

The matrix $C$ stacks one vector per word as its rows; its initial values are assigned randomly.

$t$ is the position of the word to be predicted, and $n$ sets how many words are involved (the previous n−1 words are fed as input).

The one-hot vectors of words $t-(n-1)$ through $t-1$ are given as input.

  1. Input layer: the one-hot vectors marking the positions of words $t-(n-1)$ through $t-1$ are each multiplied with the matrix $C$, yielding those words' vectors. Each resulting vector is denoted $x_k$, and they are concatenated side by side into $x$.
  2. Hidden layer: a score vector is computed with the $\tanh$ function as $y = b + Wx + U\tanh(d + Hx)$, where $b$, $W$, $U$, $d$, $H$ are parameters ($b$ and $d$ are bias terms, $H$ holds the hidden-layer weights, and $W$ holds the weights of the optional direct connection from the input layer to the output layer); see the sketch after this list.
  3. Output layer: the softmax function is applied to $y$, the result is compared against the correct one-hot vector, and the model is trained via back-propagation.
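Putting the three layers together, here is a minimal NumPy sketch of one forward pass; the sizes, random parameters, and context indices are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Minimal NPLM forward pass for one context, following y = b + Wx + U*tanh(d + Hx).
V, m, n, h = 10, 5, 4, 16          # vocab size, embed dim, n-gram order, hidden units

rng = np.random.default_rng(0)
C = rng.normal(size=(V, m))        # word feature matrix (one row per word)
H = rng.normal(size=(h, (n - 1) * m))
d = np.zeros(h)
U = rng.normal(size=(V, h))
b = np.zeros(V)
W = rng.normal(size=(V, (n - 1) * m))   # optional direct input-to-output connections

context = [3, 7, 1]                # indices of the previous n-1 = 3 words
x = np.concatenate([C[i] for i in context])   # lookup + concatenate -> x

y = b + W @ x + U @ np.tanh(d + H @ x)        # score vector, one score per word
p = np.exp(y - y.max()); p /= p.sum()         # softmax -> P(w_t | context)
print(p.argmax(), p.sum())                    # predicted word index; probabilities sum to 1
```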

The attempt to represent words with vectors of dimension smaller than $|V|$ is what later made Word2Vec possible. The remaining problem is that, besides $C$, which holds the distributed representation, parameters such as $H$ and $U$ must also be updated, so the computational cost is still high. Word2Vec in fact reduced the number of parameters to be trained precisely to fix this.

Citations and sources

Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A Neural Probabilistic Language Model. The Journal of Machine Learning Research, 3, 1137–1155.
https://misconstructed.tistory.com/35
https://heehehe-ds.tistory.com/entry/NLP-NPLMNeural-Probabilistic-Language-Model
https://velog.io/@donggunseo/NPLM
https://nobase2dev.tistory.com/24