Introduction
In this article, we will explore how to find and rank similar sentences in a corpus of documents with respect to a given list of topic keywords. This problem involves natural language processing (NLP) techniques, specifically text feature extraction and similarity measurement.
We’ll use the popular scikit-learn library for Python, which provides efficient implementations of various algorithms used in machine learning and NLP tasks.
Preparing the Data
To start solving this problem, we need to prepare our data. We have a list of color keywords (colors), a list of thing keywords (things), and a corpus of documents (corpus); each topic's keywords are what we will use for feature extraction.
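The original data is not reproduced here, so the snippet below defines an illustrative stand-in: the two keyword lists are taken from the topic tables below, while the corpus sentences (apart from the two that appear later in this article) are placeholders invented for the sake of a runnable example.

corpus = [
    'i ate a red apple.',                                 # sentence used in the example below (from the article)
    'the rainbow after the rain',                         # placeholder
    'she wore a purple dress',                            # placeholder
    'he put a pickle on the plate',                       # placeholder
    'the kid read the book the little red riding hood',  # sentence returned in the example below (from the article)
    'the book has a yellow cover',                        # placeholder
    'the weather is nice today',                          # placeholder
]
colors = ['red', 'apple', 'rainbow', 'purple']
things = ['apple', 'pickle', 'tomato', 'rainbow', 'book', 'yellow']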
We’ll create two matrices, one per topic (colors and things), each holding the occurrence counts of that topic’s keywords: one row per keyword, one column per document in the corpus.
# colors
         0  1  2  3  4  5  6
red      1  0  0  0  1  0  0
apple    1  0  0  0  0  0  0
rainbow  0  1  0  0  0  0  0
purple   0  0  1  0  0  0  0

# things
         0  1  2  3  4  5  6
apple    1  0  0  0  1  0  0
pickle   0  0  0  1  0  0  0
tomato   0  0  0  0  0  0  0
rainbow  0  0  1  0  0  0  0
book     0  0  0  0  1  1  0
yellow   0  0  0  0  0  1  0
Count Vectorizer
Next, we’ll use a count vectorizer to get word occurrences for all unique words in our corpus. This will create a matrix where each row corresponds to a document and each column corresponds to a unique word.
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np   # used below for np.fill_diagonal
import pandas as pd  # used below for the similarity DataFrames

cv = CountVectorizer()
out = cv.fit_transform(corpus).toarray()  # fit the vectorizer and get the document-term count matrix as a dense array
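For reference, the fitted vectorizer exposes its learned vocabulary as a dict that maps each unique word to its column index in out; the filtering step in the next section relies on this mapping (the example indices are illustrative and depend on your corpus):

print(cv.vocabulary_)         # e.g. {'ate': 2, 'red': 14, ...} -- exact indices depend on the corpus
print(cv.vocabulary_['red'])  # the column of `out` that holds counts for the word 'red'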
Filtering by Topic
To filter the vectorized data for a specific topic, we’ll use the vectorizer’s vocabulary: we look up the column indexes of all topic keywords and then slice the vectorized data down to those columns.
combined = colors + things  # combine the keyword lists from all topics
c = [(k, v) for k, v in cv.vocabulary_.items() if k in combined]  # (word, column index) pairs for every topic keyword found in the corpus
cdf = pd.DataFrame(out[:, [i[1] for i in c]], columns=[i[0] for i in c]).T  # keyword-by-document matrix: rows are keywords, columns are documents
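As a quick sanity check, slicing cdf by a topic yields a keyword-occurrence view like the topic tables shown at the start of the article (rows are that topic’s keywords, columns are documents):

cdf.loc[cdf.index.isin(things)]  # the same expression the helper functions below use to filter by topic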
Cosine Similarity
To measure the similarity between two documents, we’ll take the dot product of their keyword-count vectors: documents that share more topic keywords get a higher score. (Full cosine similarity additionally divides by the product of the vectors’ magnitudes; a normalized variant is shown after the next snippet.)
def get_similarity_table(topic):
    df = cdf.loc[cdf.index.isin(topic)]  # keep only the rows for this topic's keywords
    cnd = df.values
    similarity = cnd.T @ cnd  # dot product of document keyword vectors -> similarity matrix
    dd = pd.DataFrame(similarity, index=corpus, columns=corpus)  # label rows and columns with the sentences
    return dd

get_similarity_table(things)
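If you want true cosine similarity as described above (dot product divided by the vector magnitudes), scikit-learn provides a built-in helper. Here is a minimal sketch; get_cosine_table is a hypothetical name, not part of the original article:

from sklearn.metrics.pairwise import cosine_similarity

def get_cosine_table(topic):
    df = cdf.loc[cdf.index.isin(topic)]   # keep only the rows for this topic's keywords
    sim = cosine_similarity(df.values.T)  # dot products divided by vector magnitudes; documents with no topic keywords score 0
    return pd.DataFrame(sim, index=corpus, columns=corpus)

get_cosine_table(things)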
Finding the Most Similar Document
To find the most similar document to a given document, we’ll look up that sentence’s row in the similarity table and take the column with the maximum value, after zeroing the diagonal so a sentence cannot match itself.
def get_similar_review(s, topic):
    df = cdf.loc[cdf.index.isin(topic)]  # keep only the rows for this topic's keywords
    cnd = df.values
    similarity = cnd.T @ cnd  # dot product of document keyword vectors -> similarity matrix
    np.fill_diagonal(similarity, 0)  # zero the diagonal so a sentence is never returned as its own match
    dd = pd.DataFrame(similarity, index=corpus, columns=corpus)  # label rows and columns with the sentences
    return dd.loc[s].idxmax()  # pick the row for sentence s and return the column name with the highest score
s = 'i ate a red apple.'
get_similar_review(s, colors)
#'the kid read the book the little red riding hood'
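To rank every sentence instead of returning only the best match, you can sort the sentence’s row of the similarity table. This is a small sketch along the same lines; get_ranked_reviews is a hypothetical helper, not from the original article:

def get_ranked_reviews(s, topic):
    df = cdf.loc[cdf.index.isin(topic)]  # keep only the rows for this topic's keywords
    similarity = df.values.T @ df.values
    np.fill_diagonal(similarity, 0)      # exclude the sentence itself from the ranking
    dd = pd.DataFrame(similarity, index=corpus, columns=corpus)
    return dd.loc[s].sort_values(ascending=False)  # all sentences, best match first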
Conclusion
In this article, we explored how to find and rank similar sentences in a corpus of documents with respect to a given list of topic keywords, using the scikit-learn library for Python and its efficient implementations of common machine learning and NLP building blocks.
We built a keyword-occurrence matrix for each topic (colors and things), used a count vectorizer to count word occurrences across the corpus, filtered the vectorized data down to the topic keywords, and scored document similarity with dot products of the resulting keyword vectors.
Finally, we found the most similar document to a given document by looking up that sentence’s row in the similarity table and taking the column with the maximum value.
This approach can be used in various applications such as text classification, clustering, and retrieval.