Natural Language Processing with Disaster Tweets

Predict which Tweets are about real disasters and which ones are not

https://www.kaggle.com/c/nlp-getting-started

 


 

NLP ๊ณต๋ถ€๋ฅผ ํ•˜๋ฉด์„œ ์ดˆ๊ธฐ ๋…ผ๋ฌธ๋ถ€ํ„ฐ ํ•˜๋‚˜์”ฉ ๋ณด๋ฉด์„œ ์ž‘์„ฑํ•ด๋ณด๊ณ , ์ตœ์‹  ํŠธ๋ Œ๋“œ๋ฅผ ๊ณต๋ถ€ํ•ด๊ฐ€๋ฉด์„œ,

์ง์ ‘ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ๋ถ€ํ„ฐ ๋ชจ๋ธ์„ ๋Œ๋ ค๋ณด๊ณ , ์ž์—ฐ์–ด๋ฅผ ์–ด๋–ป๊ฒŒ ์ฒ˜๋ฆฌํ•˜๋Š”์ง€ ๊ณผ์ •์„ ์ง์ ‘ ๊ฒฝํ—˜ํ•ด ๋ณด๊ณ  ์‹ถ์—ˆ๋‹ค. 

์ฆ‰, NLP ๋ชจ๋ธ์„ ๋Œ๋ฆฌ๊ธฐ ์œ„ํ•œ ์ง์ ‘ ์ฝ”๋”ฉ์„ ํ•˜๊ณ  ์‹ถ์—ˆ๋‹ค.

 

๊ธฐ์กด์— BERT Model์„ ๊ณต๋ถ€ํ•˜๋ฉด์„œ ์ธํ„ฐ๋„ท์— ์žˆ๋Š” ์˜ˆ์ œ๋กœ ๋„ค์ด๋ฒ„ ๋ฆฌ๋ทฐ ๊ฐ์„ฑ๋ถ„๋ฅ˜๋ฅผ ํ•ด๋ณด์•˜์—ˆ๋Š”๋ฐ,

๋‹ค๋ฅธ ๋ฌธ์ œ๋ฅผ ์ง์ ‘ ํ’€์–ด๋ณด๊ณ  ์‹ถ์–ด์„œ Kaggle์„ ๋’ค์ ๊ฑฐ๋ฆฌ๋‹ค๊ฐ€ ์ž์—ฐ์–ด์ฒ˜๋ฆฌ์˜ ํƒ€์ดํƒ€๋‹‰ ๊ฐ™์€ NLP ๊ณต๋ถ€ ๋Œ€ํšŒ๊ฐ€ ์žˆ์—ˆ๋‹ค.

Kaggle NLP Competition

์ด๋ฅผ ๊ธฐ์กด์— ์ž‘์„ฑํ–ˆ๋˜ ์ฝ”๋“œ๋“ค์„ ์‘์šฉํ•ด์„œ ์ง์ ‘ ๋Œ€ํšŒ์— ์ฐธ๊ฐ€ํ•˜์—ฌ ์ฝ”๋”ฉ์„ ํ•ด๋ณด์•˜๋‹ค.

 

์ด ๋Œ€ํšŒ๋Š” ํŠธ์œ„ํ„ฐ์— ์žฌ๋‚œ ํŠธ์œ—์„ ๋ถ„๋ฅ˜ํ•˜๋Š” ๋Œ€ํšŒ์ธ๋ฐ, ํŠธ์œ—์„ ๋ณด๊ณ  ์‹ค์ œ๋กœ ์žฌ๋‚œ์ƒํ™ฉ์ธ์ง€, ๊ฑฐ์ง“ ์žฌ๋‚œ์ƒํ™ฉ์ธ์ง€ ๊ตฌ๋ถ„ํ•˜์—ฌ ๋ผ๋ฒจ๋งํ•˜๋Š” Getting Started competitions์ด๋‹ค.

 

Train ๋ฐ์ดํ„ฐ์…‹์€ ์ด๋Ÿฌํ•œ ํ˜•์‹์˜ CSV ํŒŒ์ผ์ด๋‹ค.

import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
# pytorch_transformers is the older name of today's "transformers" package
from pytorch_transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import Adam
import torch.nn.functional as F

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
train_data.head()  # 7,613 rows, 5 columns (id, keyword, location, text, target)

train_df = train_data.dropna()  # drop every row containing a missing value
test_df = test_data.dropna()
train_df.shape  # (5080, 5)

 ํ•„์š”ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ๋ถˆ๋Ÿฌ์˜ค๊ณ  ๋ฐ์ดํ„ฐ์…‹ ํŒŒ์ผ์„ ๋ถ„์„ํ•ด๋ณด๋‹ˆ 7,613๊ฐœ์˜ Rows๊ฐ€ ์žˆ์—ˆ๋‹ค.

๊ทธ๋ฆฌ ๋งŽ์ง€์•Š์€ ๋ฐ์ดํ„ฐ์ด๊ธฐ์— NLP ๊ณต๋ถ€๋ฅผ ํ•˜๋ฉด์„œ ๋ชจ๋ธ์„ ๋Œ๋ ค๋ณด๊ธฐ์— ์ข‹์€ ๋ฐ์ดํ„ฐ ๊ฐ™์•˜๋‹ค.

๊ทธ๋Ÿฌ๋‚˜ ๋‹ค๋ฅธ ํ• ์ผ๋“ค๋„ ์žˆ๊ณ  ์ด ๋Œ€ํšŒ์— ๋งŽ์€ ์‹œ๊ฐ„์„ ํˆฌ์žํ•  ์ˆ˜ ์—†๋Š” ์ƒํ™ฉ์ด๋ฉฐ, ๊ทธ๋ ‡๊ธฐ์— ์งง์€ ์‹œ๊ฐ„์— ํ’€๊ธฐ ์œ„ํ•ด ๊ฐ„๋‹จํ•œ ์ „์ฒ˜๋ฆฌ๋ฅผ ์ง„ํ–‰ํ•ด ์ฃผ์—ˆ๋‹ค.

์ผ๋‹จ, ๊ฒฐ์ธก์น˜๋ฅผ ์ „๋ถ€ ์ œ๊ฑฐํ•˜์˜€๋‹ค.

๊ฒฐ์ธก์น˜๋ฅผ ๋Œ€์ฒดํ•˜๊ณ  ์ด ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์„ํ•˜๊ธฐ์—๋Š” ์‹œ๊ฐ„์ด ์—†์—ˆ๋”ฐ..

๊ทธ๋žฌ๋”๋‹ˆ 7,613 -> 5,080์œผ๋กœ ๋ฐ์ดํ„ฐ๊ฐ€ ์ค„์—ˆ๋‹ค. 5000๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋กœ ์ž˜ ๋ ๋ ค๋‚˜ ๊ฑฑ์ •ํ–ˆ์—ˆ๋‹ค.

class load_dataset(Dataset):
    def __init__(self, df):
        self.df = df

    def __len__(self):
        return len(self.df)

    def __getitem__(self, item):
        text = self.df.iloc[item, 3]   # "text" column
        label = self.df.iloc[item, 4]  # "target" column
        return text, label

 ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ๊ฐ„๋‹จํ•˜๊ฒŒ classํ•˜๋‚˜๋ฅผ ์ •์˜ํ•˜์—ฌ text์™€ label์ด return ๋˜๋„๋ก ๋งŒ๋“ค์—ˆ๋‹ค.

 

train_dataset = load_dataset(train_df)
train_loader = DataLoader(train_dataset, shuffle=True)

๊ทธ๋Ÿฌ๊ณ  ๋ฐ์ดํ„ฐ๋ฅผ ๋„ฃ์–ด์„œ keywords, location, Id Column ๊ฐ’๋“ค์€ ์ „๋ถ€ ์ œ๊ฑฐํ•˜๊ณ  ๊ฐ„๋‹จํ•˜๊ฒŒ Text, Label ๊ฐ’๋งŒ ๋ฝ‘์•˜๋‹ค.

 

device = torch.device("cuda")
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertForSequenceClassification.from_pretrained('bert-base-multilingual-cased')
model.to(device)

optimizer = Adam(model.parameters(), lr=1e-6)

""" ๋ชจ๋ธ ํ•™์Šต """
model.train()

๊ทธ๋Ÿฌ๊ณ  ๋ชจ๋ธ์„ ๋Œ๋ฆฌ๊ธฐ ์œ„ํ•ด pretrained ๋œ Bert์˜ Classification ๋ชจ๋ธ์„ ๋ถˆ๋Ÿฌ์™€์„œ ์ด๋ฅผ CUDA์— ์˜ฌ๋ฆฌ๊ณ  ์ค€๋น„๋ฅผ ํ•˜์˜€๋‹ค.

total_correct = 0
total_len = 0
total_loss = 0
count = 0

for epoch in range(1):
    for text, label in train_loader:
        optimizer.zero_grad()

        # Encode each sentence token by token,
        # adding the special tokens ([CLS], [SEP], ...)
        encoding_list = [tokenizer.encode(x, add_special_tokens=True) for x in text]

        # Truncate to 256 tokens, then right-pad every sequence to length 256
        padding_list = [x[:256] + [0] * (256 - len(x[:256])) for x in encoding_list]

        sample = torch.tensor(padding_list)
        sample = sample.to(device)
        label = label.to(device)

        outputs = model(sample, labels=label)
        loss, logits = outputs
        predict = torch.argmax(F.softmax(logits, dim=1), dim=1)
        correct = predict.eq(label)

        total_correct += correct.sum()
        total_len += len(label)
        total_loss += loss.item()

        loss.backward()
        optimizer.step()

        if count % 1000 == 0:
            print(f'Epoch : {epoch+1}, Iteration : {count}')
            print(f'Train Loss : {total_loss/1000}')
            print(f'Accuracy : {total_correct/total_len}\n')

        count += 1

model.eval()
torch.save(model, 'predict_of_tweet_model.pth')  # saved for the predict script

๊ทธ๋Ÿฌ๊ณ  ์ด๋ ‡๊ฒŒ ๋Œ๋ ค์„œ ๋Œ€ํšŒ๋ฅผ ์œ„ํ•œ BERT ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ณ , ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•˜๊ณ  ์ด ๋ชจ๋ธ์„ ์ €์žฅํ•˜์—ฌ ์‚ฌ์šฉํ•˜์˜€๋‹ค.

์—ํฌํฌ๋ฅผ ์—ฌ๋Ÿฌ๋ฒˆ ๋Œ๋ฆด๊ฑธ ๊ทธ๋žฌ๋‚˜ ์ผ๋‹จ ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ๋„ ๊ฐ„๋‹จํ•˜๊ฒŒ ํ–ˆ์œผ๋‹ˆ ์—ํฌํฌ๋„ ํ•œ๋ฒˆ๋งŒ ๋Œ๋ ธ๋‹ค.

์•„๋งˆ Accuracy๊ฐ€ 80% ์ด์ƒ์œผ๋กœ ๋‚˜์™”๋˜๊ฑฐ ๊ฐ™๋‹ค.

๊ทธ๋ž˜์„œ ๋งŒ์กฑํ•ด์„œ ๋ฐ”๋กœ ๋๋‚ด๊ณ  Predict๋กœ ๋„˜์–ด๊ฐ”๋‹ค.

(์‹œ๊ฐ„์„ ๋งŽ์ด ํˆฌ์žํ•  ์—ฌ์œ ๊ฐ€ ์—†์—ˆ๊ธฐ์—..)

 

""" Predict.py """
import pandas as pd
import torch
from torch.utils.data import DataLoader
from pytorch_transformers import BertTokenizer
import csv
import re

test_data = pd.read_csv('test.csv')
test_df = test_data.fillna("null")
model = torch.load('predict_of_tweet_model.pth')

๋˜‘๊ฐ™์ด ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋กœ๋“œํ•˜๊ณ , test ๋ฐ์ดํ„ฐ์…‹๊ณผ ๋ชจ๋ธ์„ ๋ถˆ๋Ÿฌ์˜ค๊ณ , ๊ฐ„๋‹จํ•˜๊ฒŒ ๊ฒฐ์ธก์น˜๋ฅผ "null"๋กœ ๋Œ€์ฒดํ•ด์คฌ๋‹ค.

 

model.to(torch.device('cpu'))
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

test_dataset = test_df["text"] #2,158 Rows
processing_test_dataset = test_dataset.reset_index()["text"]
processing_test_dataset = DataLoader(processing_test_dataset)

test ๋ฐ์ดํ„ฐ์…‹์—๋Š” 2,158๊ฐœ์˜ Rows๊ฐ€ ์žˆ์—ˆ๋‹ค.

๊ทผ๋ฐ ์ด์ƒํ•˜๊ฒŒ ์—ฌ๊ธฐ์„œ๋Š” CUDA๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ์ดˆ๊ณผ๊ฐ€ ๊ณ„์†๋– ์„œ ๋ฐฉ๋ฒ•์ฐพ๋‹ค๊ฐ€ ๊ทธ๋ƒฅ CPU๋กœ ๋Œ๋ ค์คฌ๋‹ค.

id๋„ 0๋ถ€ํ„ฐ ์žˆ๋Š”๊ฒŒ ์•„๋‹Œ ๋œจ๋ฌธ๋œจ๋ฌธ ์žˆ์–ด์„œ ์ธ๋ฑ์Šค ์ดˆ๊ธฐํ™”ํ•ด์ฃผ๊ณ  ๋ถˆ๋Ÿฌ์™”๋‹ค.

 

final = []
with torch.no_grad():  # no gradients needed at inference time; also saves memory
    for text in processing_test_dataset:
        encoding_list = [tokenizer.encode(x, add_special_tokens=True) for x in text]
        padding_list = [x[:256] + [0] * (256 - len(x[:256])) for x in encoding_list]

        sample = torch.tensor(padding_list)
        sample = sample.to(torch.device('cpu'))
        outputs = model(sample)
        predict = torch.argmax(outputs[0], dim=1)
        print(f'predict -> {predict}')
        final.append(predict.int())

์ด๋Ÿฌ๊ณ  ์ด์ „ ๋ชจ๋ธ ํ•™์Šตํ• ๋•Œ์™€ ๊ฐ™์ด ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ํ›„์— ์ด๋ฅผ ๋ชจ๋ธ์— ๋„ฃ๊ณ  Predict๋ฅผ ์ง„ํ–‰ํ•˜์˜€๋‹ค.

 

์ด๋ ‡๊ฒŒ ๋‚˜์˜จ predict ๊ฐ’๋“ค์€ list๋กœ appendํ•ด์„œ ์ €์žฅํ•ด ์ฃผ์—ˆ๋Š”๋ฐ,

0๊ณผ 1๋กœ ๋‚˜์˜ค๋Š”๊ฒŒ์•„๋‹Œ

"Tensor[1], dtype=uint32" ๋ญ ์ด๋Ÿฐ์‹์œผ๋กœ ๋˜์–ด ๋‚˜์˜ค๊ธฐ์— ๋˜ํ•œ๋ฒˆ ์ฒ˜๋ฆฌ๋ฅผ ํ•ด์ฃผ์—ˆ๋‹ค.

 

re_pattern = re.compile(r'\d')
target_list = list()

for x in final:
    digits = re.findall(re_pattern, str(x))
    target_list.append(digits[0])  # first digit is the predicted class (0 or 1)

์–˜๋„ค๋“ค์„ ์ฒ˜๋ฆฌํ•˜๋ ค๊ณ  ์ •๊ทœ์‹์„ ๋„ฃ์–ด์„œ ์ˆซ์ž๋งŒ ๋ฝ‘์•„์ฃผ๋ฉด [์˜ˆ์ธกํƒ€์ผ“๊ฐ’(0 or 1), 3, 2] ์ด๋Ÿฐ์‹์œผ๋กœ ๋‚˜์˜จ๋‹ค. ๋’ค์— ๋ฐ์ดํ„ฐํƒ€์ž…์— ๋ถ™๋Š” ์ˆซ์ž๊นŒ์ง€ ๋”ฐ๋ผ ๋‚˜์˜ค๊ธฐ์— ๊ทธ์ค‘ ์ฒซ๋ฒˆ์งธ ์ˆซ์ž๋งŒ ๋ฝ‘์•„์„œ ์ƒˆ๋กœ์šด ๋ฆฌ์ŠคํŠธ์— ์ €์žฅํ•ด์ฃผ์—ˆ๋‹ค.

์ด๋ ‡๊ฒŒ๋˜๋ฉด list์—๋Š” [1,1,0,0,0,0,0,1,0,1,0,1,...] ์ด๋Ÿฐ์‹์œผ๋กœ ์˜ˆ์ธก๊ฐ’๋“ค์ด ๋‚˜์˜ค๊ฒŒ ๋œ๋‹ค.

 

with open('predict.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(target_list)

data_id = pd.DataFrame(test_data['id'])
data_target = pd.DataFrame(target_list, columns=['target'])
result = pd.concat([data_id, data_target], axis=1)
result.to_csv("result.csv", index=False)  # submission wants id,target without the index

๊ทธ๋Ÿฌ๋ฉด ์ด ๋‚˜์˜จ ๊ฐ’๊ณผ ๊ธฐ์กด์˜ test ๋ฐ์ดํ„ฐ์…‹์— ์žˆ๋Š” id์™€ ํ•ฉ์ณ์ฃผ๋ฉด ์ตœ์ข… ์ œ์ถœ ํŒŒ์ผ์ด ๋งŒ๋“ค์–ด์ง„๋‹ค.

์ œ์ถœ ํŒŒ์ผ ํ˜•์‹

์ด์ œ ์ด ํŒŒ์ผ์„ Kaggle์— ๋ณด๋‚ด์„œ ์ ์ˆ˜๋ฅผ ํ™•์ธํ•ด ๋ณด์•˜๋‹ค.

81์ ์ด ๋‚˜์™”๋‹ค.

์ด ์ ์ˆ˜๋Š” 81.428% ๋งž์ถ”์—ˆ๋‹ค๋Š” ๊ฒƒ ๊ฐ™๋‹ค.

๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ์—„์ฒญ ๊ฐ„๋‹จํ•˜๊ฒŒ ํ–ˆ๋Š”๋ฐ๋„ ์ƒ๊ฐ๋ณด๋‹ค ๋†’์€ ์ ์ˆ˜๊ฐ€ ๋‚˜์™”๋‹ค.

์—ฌ๊ธฐ์„œ BERT ๋ชจ๋ธ์˜ ์–ด๋งˆ๋ฌด์‹œํ•œ ์„ฑ๋Šฅ์„ ๋А๋‚„์ˆ˜๊ฐ€ ์žˆ์—ˆ๋‹ค..

 

์ด๋ ‡๊ฒŒ ํ•ด์„œ NLP ๋Œ€ํšŒ๋ฅผ ํ•œ๋ฒˆ ์ฐธ๊ฐ€ํ•ด ๋ณด์•˜๋Š”๋ฐ, ๊ทธ์ „๊นŒ์ง€๋งŒํ•ด๋„ NLP ์–ด๋ ต๊ณ  ์žฌ๋ฏธ์—†๊ฒŒ ๋А๊ปด์กŒ๋Š”๋ฐ ๋ง‰์ƒ ์ง์ ‘ ํ’€์–ด๋ณด๋‹ˆ ๋น„์ „๊ณผ ํฌ๊ฒŒ ๋‹ค๋ฅผ๊ฒŒ ์—†๋‹ค๊ณ  ๋А๊ปด์กŒ๋‹ค. ๋” ๊ณต๋ถ€ํ•˜๊ณ  ๋” ํŒŒ๋ด์•ผ๊ฒ ๋‹ค. 

Liky