Simple Natural Language Analysis

2024. 1. 5. 09:15ㆍArtificial_Intelligence/Natural Language Processing

1. Finding the most frequent words for each label

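from collections import Counter
import pandas as pd

df = pd.read_csv('train.csv')

unique_labels = df['label'].unique()

for label in unique_labels:
    # Keep only the rows that belong to the current label
    temp_df = df[df['label'] == label]
    # Join all sentences and split on whitespace to get word tokens
    words = ' '.join(temp_df['sentence'].dropna().astype(str)).split()
    word_counts = Counter(words)
    # Five most frequent (word, count) pairs
    most_common_words = word_counts.most_common(5)
    print(f"Most frequent words in '{label}': {most_common_words}")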
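To see what the loop above computes without needing train.csv, here is a minimal sketch on a few in-memory rows. The sample sentences, the labels, the stop-word set, and the helper name `top_words_per_label` are all made up for illustration. Dropping stop words is an optional extra step, added because plain whitespace splitting tends to push function words to the top of the counts.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows standing in for train.csv: (sentence, label)
rows = [
    ("the movie was great great fun", "positive"),
    ("great acting and a great story", "positive"),
    ("the plot was boring", "negative"),
    ("boring and slow and boring", "negative"),
]

# Hypothetical stop-word set; adjust it to the actual corpus
STOP_WORDS = {"the", "was", "and", "a"}


def top_words_per_label(rows, n=5, stop_words=frozenset()):
    """Return {label: [(word, count), ...]} with the n most frequent words."""
    counters = defaultdict(Counter)
    for sentence, label in rows:
        # Whitespace tokenization, same idea as the str.split() call above
        tokens = [w for w in sentence.lower().split() if w not in stop_words]
        counters[label].update(tokens)
    return {label: counter.most_common(n) for label, counter in counters.items()}


result = top_words_per_label(rows, n=2, stop_words=STOP_WORDS)
for label, words in result.items():
    print(f"Most frequent words in '{label}': {words}")
```

Grouping with one `Counter` per label makes a single pass over the data instead of filtering the DataFrame once per label, which may matter when there are many labels.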

2. Checking the label distribution

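import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')
# Number of rows per label, sorted from most to least frequent
labels = df['label'].value_counts()
print(labels)

# Cast the index to str so numeric labels are drawn as categories
plt.bar(labels.index.astype(str), labels.values)
plt.xlabel('label')
plt.ylabel('count')
plt.show()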
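Raw counts can hide how imbalanced the labels are, so a relative view is sometimes handy. The sketch below uses `value_counts(normalize=True)` on a small in-memory DataFrame. The sample labels are invented and stand in for train.csv, and the largest-to-smallest ratio is only a rough indicator, not a standard metric.

```python
import pandas as pd

# Hypothetical labels standing in for df['label'] from train.csv
df = pd.DataFrame({"label": ["a", "b", "a", "c", "a", "b", "a", "a"]})

counts = df["label"].value_counts()                # absolute counts
ratios = df["label"].value_counts(normalize=True)  # fraction of rows per label

# Both Series share the same index, so they line up row by row
summary = pd.DataFrame({"count": counts, "ratio": ratios.round(3)})
print(summary)

# Ratio between the largest and smallest class as a rough imbalance indicator
imbalance = counts.max() / counts.min()
print(f"largest / smallest class: {imbalance:.1f}")
```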

3. Extracting lemmas and stems from sentences

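import pandas as pd
from konlpy.tag import Kkma, Komoran, Okt, Hannanum

# Commonly used tagger methods:
#   .morphs(text)                -> split the text into morphemes
#   .phrases(text)               -> extract phrases (Okt only)
#   okt.morphs(text, stem=True)  -> morphemes with stemming applied
#   .nouns(text)                 -> extract nouns only
#   .pos(text)                   -> tag each morpheme with its part of speech
okt = Okt()
kkma = Kkma()
komoran = Komoran()
hannanum = Hannanum()

df = pd.read_csv('train.csv')
# Tokenize every sentence into morphemes; pass stem=True to get stems instead
tokenized_texts = df['sentence'].apply(okt.morphs)
print(tokenized_texts)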

4. Searching for a word in a DataFrame

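import pandas as pd

df = pd.read_csv('train.csv')

def search_topic(df, topic):
    """Return the index labels of rows whose sentence contains topic (case-insensitive)."""
    topic_index = []

    for i, row in df.iterrows():
        # str() guards against missing or non-string values
        if topic.lower() in str(row['sentence']).lower():
            topic_index.append(i)

    return topic_index

topic_index = search_topic(df, 'school')

print(topic_index)
print(df.loc[topic_index, 'sentence'])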
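The same search can also be written without an explicit loop. This is a minimal sketch, assuming the sentence column holds strings or missing values. `Series.str.contains` with `case=False` does the case-insensitive match, `regex=False` treats the query as a literal substring, and `na=False` skips missing values. The sample DataFrame and the function name `search_topic_vectorized` are illustrative only.

```python
import pandas as pd

# Hypothetical sample data standing in for train.csv
df = pd.DataFrame({
    "sentence": ["I walk to School every day", "The weather is nice",
                 None, "school starts at nine"],
    "label": [0, 1, 1, 0],
})


def search_topic_vectorized(df, topic, column="sentence"):
    """Return index labels of rows whose column contains topic (case-insensitive)."""
    mask = df[column].str.contains(topic, case=False, regex=False, na=False)
    return df.index[mask].tolist()


topic_index = search_topic_vectorized(df, "school")
print(topic_index)
print(df.loc[topic_index, "sentence"])
```

On large DataFrames this usually runs faster than iterating with `iterrows()`, because the string matching is applied to the whole column at once.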

License: Attribution, NonCommercial
