Posts in the 'Artificial_Intelligence🤖/Natural Language Processing' category (Page 3)

[Paper Review] It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners

2022.07.11

Schick, Timo, and Hinrich Schütze. "Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference." Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021. Schick, Timo, and Hinrich Schütze. "It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners." Proceedings of the 20..

Text Similarity, Semantic Similarity

2022.06.02

Text similarity. Cosine similarity -> the cosine of the angle between two vectors; Euclidean similarity -> the distance between two points = L2 distance; Manhattan similarity -> the shortest distance along a rectangular grid = L1 distance; Jaccard similarity -> computed from the sizes of the intersection and the union. These are techniques that indicate how similar two given sentences are to each other. The sentences taken as input below have the form ["Hello World", "Hello Word"]. ### Cosine similarity ### def cos_performance(sentences) : tfidf_vectorizer = TfidfVecto..
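The four measures listed above can be sketched as follows. This is a minimal illustration, not the post's own code: the excerpt builds TF-IDF vectors, whereas this sketch substitutes plain word-count vectors so that it runs with the standard library only. The helper names (`count_vectors`, `cosine_similarity`, and so on) are hypothetical.

```python
import math
from collections import Counter


def count_vectors(sentences):
    """Turn two sentences into aligned word-count vectors.

    Assumption: whitespace tokenisation and raw counts instead of TF-IDF.
    """
    tokens = [s.lower().split() for s in sentences]
    vocab = sorted(set(tokens[0]) | set(tokens[1]))
    counts = [Counter(t) for t in tokens]
    return [[c[w] for w in vocab] for c in counts]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def euclidean_distance(a, b):
    """L2 distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def jaccard_similarity(s1, s2):
    """Size of the token intersection divided by the size of the union."""
    set1, set2 = set(s1.lower().split()), set(s2.lower().split())
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0


sentences = ["Hello World", "Hello Word"]
vec_a, vec_b = count_vectors(sentences)
print("cosine   :", cosine_similarity(vec_a, vec_b))
print("euclidean:", euclidean_distance(vec_a, vec_b))
print("manhattan:", manhattan_distance(vec_a, vec_b))
print("jaccard  :", jaccard_similarity(*sentences))
```

Note that the Euclidean and Manhattan values are distances (smaller means more similar), while the cosine and Jaccard values are similarities (larger means more similar).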

Count-Based Word Representation

2022.03.22

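Count-based word representation counts how often words appear together within the context of a text. The co-occurrence counts are arranged in a matrix, and that matrix is then turned into numeric word vectors. Once text has been quantified this way, statistical methods can be applied to text data made up of many documents to indicate how important a word is within a particular document, to extract a document's keywords, to rank search-engine results, or to measure similarity between documents. Assigning each word a number such as 1, 2, 3 by mapping is a local representation. A distributed representation, by contrast, refers to the surrounding words in order to represent a given word. puppy(강아..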
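As a rough illustration of the co-occurrence counting described above, the sketch below builds a word-by-word co-occurrence matrix. The function name `cooccurrence_matrix`, the `window` parameter, and the sample sentences are assumptions made for this example and do not come from the post.

```python
def cooccurrence_matrix(sentences, window=1):
    """Count how often each pair of words appears within `window` tokens.

    Assumption: whitespace tokenisation and a symmetric context window.
    """
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for tokens in tokenized:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    matrix[index[word]][index[tokens[j]]] += 1
    return vocab, matrix


vocab, matrix = cooccurrence_matrix(["i like nlp", "i like deep learning"])
print(vocab)
for word, row in zip(vocab, matrix):
    # Each row is the count-based vector for one word.
    print(f"{word:>8}", row)
```

Each row of the matrix can serve as a count-based vector for the corresponding word, so words that appear in similar contexts end up with similar rows.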

Natural Language Processing with Disaster Tweets

2022.02.28

Natural Language Processing with Disaster Tweets: Predict which Tweets are about real disasters and which ones are not. https://www.kaggle.com/c/nlp-getting-started While studying NLP, I wanted to work through the papers one by one starting from the early ones, keep up with recent trends, and go through the whole process myself, from data processing to running a model, to experience firsthand how natural language is handled. In other words, I wanted to write the code for running an NLP model myself. Earlier, while studying the BERT model ..

New NLP Trends

2022.02.28

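Timkey, W. and van Schijndel, M. (2021) → Proposes the concept of rogue dimensions (a small number of dimensions). → Because the rogue dimensions dominate the model, proposes postprocessing techniques to control them. Paik, C., Aroca-Ouellette, S., Roncone, A., and Kann, K. (2021) → Builds CoDa (a dataset for distinguishing colors that humans can perceive). → Points out the limits of PLMs. (Nobody states clearly "it is exactly this!"; text alone was found to be insufficient for perceiving this kind of data.) They therefore explore ways of applying various forms of data to language models. Kalyan, A., Kumar, A., Chandrasekaran, A., Sabharwal, A., and Clark, P. (20..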

[Paper Review] Efficient Estimation of Word Representations in Vector Space

2022.02.20

2022.02.20 - [Artificial_Intelligence/Papers] - [Paper Review] Distributed Representations of Words and Phrases and their Compositionality Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advance..

[Paper Review] Distributed Representations of Words and Phrases and their Compositionality

2022.02.20

Distributed Representations of Words and Phrases and their Compositionality Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems 26 (2013). Abstract (Eng.) The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that cap..

[Paper Review] A Neural Probabilistic Language Model

2022.02.20

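A Neural Probabilistic Language Model Bengio, Yoshua, Réjean Ducharme, and Pascal Vincent. "A neural probabilistic language model." Advances in Neural Information Processing Systems 13 (2000). NPLM is said to have laid the groundwork for Word2Vec by proposing a neural-network-based technique for embedding words into vectors. Briefly (limitations of the earlier count-based n-gram models): a sentence containing an n-gram that does not occur in the training data is assigned a probability of 0. Because it is hard to set n to 5 or more, long-range dependencies within a sentence are hard to capture. Similarity between words or sentences is not taken into account. neural n..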
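The zero-probability problem mentioned above can be seen in a small count-based bigram model. This is a minimal sketch under assumptions: the toy corpus, the `<s>` and `</s>` boundary markers, and the function names are hypothetical, and no smoothing is applied.

```python
from collections import Counter


def train_bigram(corpus):
    """Collect context counts and bigram counts from a list of sentences."""
    contexts, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])           # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams


def sentence_probability(sentence, contexts, bigrams):
    """Unsmoothed maximum-likelihood probability of a sentence."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0                          # unseen context word
        prob *= bigrams[(prev, cur)] / contexts[prev]
    return prob


contexts, bigrams = train_bigram(["the cat sat", "the dog sat"])
p_seen = sentence_probability("the cat sat", contexts, bigrams)
p_unseen = sentence_probability("the cat ran", contexts, bigrams)
print(p_seen, p_unseen)
```

The second sentence differs from the training data by a single bigram, yet its probability collapses to zero, which is the behaviour that embedding-based models such as NPLM aim to avoid.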

Summary of Papers on Korean Document Summarization Representation

2022.02.10

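1) Extractive summarization: Extractive summarization builds a summary by picking a few important key sentences or phrases from the source text. As a result, every sentence or phrase in the resulting summary is taken from the source text. A representative algorithm for extractive summarization is TextRank, a machine learning algorithm. 2) Abstractive summarization: Abstractive summarization summarizes the source text by generating new sentences that reflect the key context, even if those sentences do not appear in the source. It resembles the way a person summarizes and is naturally more difficult than extractive summarization. This approach mainly uses artificial neural networks, and a representative model is seq2seq..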
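To make the extractive approach concrete, here is a simplified TextRank-style sketch: sentences are nodes, edges are weighted by word overlap, and a PageRank-like iteration ranks them. It is an illustration under assumptions rather than a reference implementation; the function names, the regex-based sentence splitter, and the `top_k`, `damping`, and `iterations` parameters are hypothetical choices.

```python
import math
import re


def split_sentences(text):
    """Naive sentence splitter (assumption: sentences end with . ! or ?)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Word overlap normalised by log sentence lengths, TextRank style."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if not wa or not wb:
        return 0.0
    denom = math.log(len(wa)) + math.log(len(wb))
    return len(wa & wb) / denom if denom > 0 else 0.0


def textrank_summary(text, top_k=2, damping=0.85, iterations=50):
    """Return the top_k highest-ranked sentences in their original order."""
    sentences = split_sentences(text)
    n = len(sentences)
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0)
            for i in range(n)
        ]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k]
    return [sentences[i] for i in sorted(ranked)]


text = ("The cat sat on the mat. The cat chased the mouse on the mat. "
        "Stock prices fell sharply today. "
        "The mouse hid under the mat from the cat.")
summary = textrank_summary(text, top_k=2)
print(summary)
```

Because the summary is assembled only from sentences returned by `split_sentences`, every output sentence appears verbatim in the source, which is the defining property of extractive summarization.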

DBLP DataSet Processing / Parsing Large JSON

2022.01.12

I chose DBLP as the dataset for studying graph embedding and fetched it. https://www.aminer.org/citation I went to this site and downloaded the data. The problem is that the data has to be preprocessed, but it is 16.1GB. Having to process data that most editors cannot even open felt daunting. So my idea was to split the data by size and then touch up the split pieces a little by hand. The program I used is GSplit 3. I first split the downloaded DBLP JSON file into 1GB pieces. Because this splits by size rather than by dictionary, the JSON format gets broken. Therefore, ..
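As an alternative to splitting by size, which is a different technique from the GSplit approach described above, a file like this can be read incrementally so that records are never cut in half. The sketch below assumes the file is either a JSON array of objects or a sequence of objects separated by commas or newlines; the actual layout of the DBLP dump should be checked first. The function name `iter_json_objects` and the `chunk_size` parameter are hypothetical.

```python
import io
import json


def iter_json_objects(stream, chunk_size=1 << 20):
    """Yield top-level JSON objects from a text stream, one at a time.

    Assumption: records are JSON objects inside one array, or objects
    separated by commas/whitespace/newlines.
    """
    decoder = json.JSONDecoder()
    buffer, pos = "", 0
    while True:
        # Skip array brackets and separators between records.
        while pos < len(buffer) and buffer[pos] in " \t\r\n,[]":
            pos += 1
        if pos < len(buffer):
            try:
                obj, end = decoder.raw_decode(buffer, pos)
            except json.JSONDecodeError:
                pass  # record is incomplete; read more data below
            else:
                yield obj
                pos = end
                continue
        chunk = stream.read(chunk_size)
        if not chunk:
            if pos < len(buffer):
                raise ValueError("truncated JSON record at end of input")
            return
        # Drop what was already consumed and append the new chunk.
        buffer = buffer[pos:] + chunk
        pos = 0


# Small in-memory stand-in for the real file (sample data is made up).
sample = io.StringIO('[{"id": 1, "title": "a"},\n{"id": 2, "refs": [1]},\n{"id": 3}]')
records = list(iter_json_objects(sample, chunk_size=5))
print(records)
```

For a real file, the same generator could be fed an open file handle, for example `open(path, encoding="utf-8")` with a path of your own, and each yielded record processed or written out before the next one is read, so memory use stays bounded by the chunk size plus the largest single record.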
