ํ…์ŠคํŠธ ์œ ์‚ฌ๋„

์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„ (Cosine Similarity) -> ๋‘ ๊ฐœ์˜ ๋ฒกํ„ฐ ๊ฐ’์˜ Cos ๊ฐ๋„
์œ ํด๋ฆฌ๋””์–ธ ์œ ์‚ฌ๋„ (Euclidean  Similarity) -> ๋‘ ๊ฐœ์˜ ์  ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ = L2 ๊ฑฐ๋ฆฌ
๋งจํ•˜ํƒ„ ์œ ์‚ฌ๋„ (Menhattan Similarity) -> ์‚ฌ๊ฐ ๊ฒฉ์ž ์ตœ๋‹จ ๊ฑฐ๋ฆฌ = L1 ๊ฑฐ๋ฆฌ
์ž์นด๋“œ ์œ ์‚ฌ๋„ (Jaccard Similarity) -> ๊ต์ง‘ํ•ฉ๊ณผ ํ•ฉ์ง‘ํ•ฉ์˜ ํฌ๊ธฐ๋กœ ๊ณ„์‚ฐ

๋‘ ๋ฌธ์žฅ์ด ์ฃผ์–ด์กŒ์„ ๋•Œ, ๋‘ ๋ฌธ์žฅ์ด ์„œ๋กœ ์–ผ๋งˆ๋‚˜ ์œ ์‚ฌํ•œ์ง€ ๋‚˜ํƒ€๋‚ด์ฃผ๋Š” ๊ธฐ๋ฒ•

 

์•„๋ž˜์—์„œ ์ž…๋ ฅ๊ฐ’์œผ๋กœ ๋ฐ›๋Š” Sentences๋Š” ["Hello World", "Hello Word"] ํ˜•์‹์ด๋‹ค.

### ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„ ###
def cos_performance(sentences) :

    tfidf_vectorizer = TfidfVectorizer()
     # ๋ฌธ์žฅ ๋ฒกํ„ฐํ™”(์‚ฌ์ „ ๋งŒ๋“ค๊ธฐ)

    tfidf_matrix = tfidf_vectorizer.fit_transform(sentences)

    cos_similar = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
    return cos_similar[0][0]
### ์œ ํด๋ฆฌ๋””์–ธ ์œ ์‚ฌ๋„ (๋‘ ์  ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ ๊ตฌํ•˜๊ธฐ) ###
def euclidean_performance(sentences) :
    tfidf_vectorizer = TfidfVectorizer()

    tfidf_matrix = tfidf_vectorizer.fit_transform(sentences)

    ## ์ •๊ทœํ™” ##
    tfidf_normalized = tfidf_matrix/np.sum(tfidf_matrix)

    euc_d_norm = euclidean_distances(tfidf_normalized[0:1],tfidf_normalized[1:2])

    return euc_d_norm[0][0]
### ๋งจํ•˜ํƒ„ ์œ ์‚ฌ๋„(๊ฒฉ์ž๋กœ ๋œ ๊ฑฐ๋ฆฌ์—์„œ์˜ ์ตœ๋‹จ๊ฑฐ๋ฆฌ) ###
def manhattan_performance(sentences) :
    tfidf_vectorizer = TfidfVectorizer()

    tfidf_matrix = tfidf_vectorizer.fit_transform(sentences)

    ## ์ •๊ทœํ™” ##
    tfidf_normalized = tfidf_matrix/np.sum(tfidf_matrix)

    manhattan_d = manhattan_distances(tfidf_normalized[0:1],tfidf_normalized[1:2])

    return manhattan_d[0][0]
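
Jaccard similarity from the list above has no TF-IDF implementation here, since it works on token sets rather than vectors. A minimal sketch (splitting on whitespace is my assumption; any tokenizer would do):

### Jaccard similarity (intersection over union of token sets) ###
def jaccard_performance(sentences):
    # Tokenize each sentence into a set of lowercase words
    tokens_a = set(sentences[0].lower().split())
    tokens_b = set(sentences[1].lower().split())

    # |A ∩ B| / |A ∪ B|
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

Called on the example input, each function returns a single score:

sentences = ["Hello World", "Hello Word"]
print(cos_performance(sentences))        # cosine similarity (larger = more similar)
print(euclidean_performance(sentences))  # L2 distance (smaller = more similar)
print(manhattan_performance(sentences))  # L1 distance (smaller = more similar)
print(jaccard_performance(sentences))    # Jaccard similarity (larger = more similar)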

 

Sentence Transformer ์‚ฌ์šฉ

  • sentence-transformers/all-MiniLM-L6-v2
  • sentence-transformers/bert-base-nli-mean-tokens

HuggingFace์— ์˜ฌ๋ผ์™€์žˆ๋Š” ๋ฌธ์žฅ ์œ ์‚ฌ๋„ ์ธก์ •์„ ์œ„ํ•œ Sentence Transformer ์ด๋‹ค.

 

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def sentence_transformer(sentences):
    torch.manual_seed(42)  # fix the seed for reproducibility
    model_name = "sentence-transformers/bert-base-nli-mean-tokens"  # or "sentence-transformers/all-MiniLM-L6-v2"
    device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

    # Mean pooling: average the token embeddings, weighted by the attention mask
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output[0]  # first element holds all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.to(device)

    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    encoded_input = encoded_input.to(device)

    # Forward pass without gradients
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Pool token embeddings into one vector per sentence
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # L2-normalize so dot products equal cosine similarities
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

    return sentence_embeddings

์ด๋ ‡๊ฒŒ ๋ฌธ์žฅ๋“ค์˜ ๋ฌธ์žฅ ์œ ์‚ฌ๋„๋ฅผ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค.

NLI ๋ฐ์ดํ„ฐ์…‹์„ ๋„ฃ์–ด์„œ ์ ์ˆ˜๋ฅผ ๊ตฌํ•˜๊ณ , ์ด๋ฅผ ์‹œ๊ฐํ™” ํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™์ด ์ถœ๋ ฅํ•  ์ˆ˜ ์žˆ๋‹ค.

(์—ฐ๊ด€, ๋ชจํ˜ธ, ๋ชจ์ˆœ ๋ฐ์ดํ„ฐ์…‹)

 

 

WordNet ์˜๋ฏธ ์œ ์‚ฌ๋„ ์ธก์ •

 

๊ฒฝ๋กœ ๊ฑฐ๋ฆฌ ๊ธฐ๋ฐ˜ ์œ ์‚ฌ๋„
  • ๋‹จ์–ด ๊ฐ„ ์ƒ/ํ•˜ ์œ„๊ณ„๊ตฌ์กฐ์—์„œ์˜ ์ตœ๋‹จ ๊ฒฝ๋กœ์˜ ๊ฑฐ๋ฆฌ ๊ธฐ๋ฐ˜
  • ๊ฒฝ๋กœ ๊ฑฐ๋ฆฌ๊ฐ€ ๊ฐ€๊นŒ์›€ โ†’ ์œ ์‚ฌํ•จ
  • ๊ฒฝ๋กœ ๊ฑฐ๋ฆฌ๊ฐ€ ๋ฉˆ โ†’ ์œ ์‚ฌํ•˜์ง€์•Š์Œ
  • 0~1 ์‚ฌ์ด์˜ ์‹ค์ˆ˜๊ฐ’์œผ๋กœ ํ‘œ์ค€ํ™” ๋Œ
Leacock Chordorow ์œ ์‚ฌ๋„
  • ๋‹จ์–ด ๊ฐ„ ์œ„๊ณ„๊ตฌ์กฐ์—์„œ์˜ ์ตœ๋‹จ ๊ฑฐ๋ฆฌ์™€ ๋‹จ์–ด ์˜๋ฏธ๊ฐ€ ๋ฐœ์ƒํ•˜๋Š” ์ตœ๋Œ€ ๊นŠ์ด ๊ธฐ๋ฐ˜
  • ํ‘œ์ค€ํ™” ๋˜์–ด์žˆ์ง€ ์•Š์Œ.
Wu-Palmer ์œ ์‚ฌ๋„
  • ๋‹จ์–ด ์œ„๊ณ„๊ตฌ์กฐ์—์„œ ๋‘ ๋‹จ์–ด์˜ ๊นŠ์ด์™€ ๋‹จ์–ด ๊ฐ„์˜ ์ตœ์†Œ ๊ณตํ†ต ํฌํ•จ ๊ธฐ๋ฐ˜
from nltk.corpus import wordnet
# nltk.download('wordnet')  # run once if the corpus is not installed

#๊ฒฝ๋กœ ๊ฑฐ๋ฆฌ ๊ธฐ๋ฐ˜ ์œ ์‚ฌ๋„
right_whale = wordnet.synset('right_whale.n.01')
orca = wordnet.synset('orca.n.01')
right_whale.path_similarity(orca)

#Leacock Chordorow ์œ ์‚ฌ๋„
right_whale = wordnet.synset('right_whale.n.01')
orca = wordnet.synset('orca.n.01')
right_whale.lch_similarity(orca)

#Wu-Palmer ์œ ์‚ฌ๋„
right_whale = wordnet.synset('right_whale.n.01')
orca = wordnet.synset('orca.n.01')
right_whale.wup_similarity(orca)

 

Finally, the same TF-IDF cosine similarity can be computed over a whole dataset and visualized as a heatmap:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')  # assumes a 'sentence' column

# TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['sentence'])

# Compute pairwise cosine similarities
cosine_similarities = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_similarities)

# Visualize the cosine similarities as a heatmap
plt.imshow(cosine_similarities, interpolation='nearest')
plt.colorbar()
plt.show()