Introduction
In this article, we will explore how to find and rank similar sentences in a corpus of documents with respect to a given list of topic keywords. This problem involves natural language processing (NLP) techniques, specifically text feature extraction and similarity measurement.
We’ll use the popular scikit-learn library for Python, which provides efficient implementations of various algorithms used in machine learning and NLP tasks.
Preparing the Data
To start solving this problem, we need to prepare our data. We have a list of color keywords (colors), a list of thing keywords (things), and a corpus of documents (corpus); each topic's keywords are what we will use for feature extraction.
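The original data is not reproduced here, so the snippet below defines an illustrative stand-in: the two keyword lists are taken from the topic tables below, while the corpus sentences (apart from the two that appear later in this article) are placeholders invented for the sake of a runnable example.

corpus = [
    'i ate a red apple.',                                 # sentence used in the example below (from the article)
    'the rainbow after the rain',                         # placeholder
    'she wore a purple dress',                            # placeholder
    'he put a pickle on the plate',                       # placeholder
    'the kid read the book the little red riding hood',  # sentence returned in the example below (from the article)
    'the book has a yellow cover',                        # placeholder
    'the weather is nice today',                          # placeholder
]
colors = ['red', 'apple', 'rainbow', 'purple']
things = ['apple', 'pickle', 'tomato', 'rainbow', 'book', 'yellow']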
We’ll create two matrices, one per topic (colors and things), each holding the occurrence counts of that topic’s keywords: one row per keyword, one column per document in the corpus.
# colors
         0  1  2  3  4  5  6
red      1  0  0  0  1  0  0
apple    1  0  0  0  0  0  0
rainbow  0  1  0  0  0  0  0
purple   0  0  1  0  0  0  0

# things
         0  1  2  3  4  5  6
apple    1  0  0  0  1  0  0
pickle   0  0  0  1  0  0  0
tomato   0  0  0  0  0  0  0
rainbow  0  0  1  0  0  0  0
book     0  0  0  0  1  1  0
yellow   0  0  0  0  0  1  0
Count Vectorizer
Next, we’ll use a count vectorizer to get word occurrences for all unique words in our corpus. This will create a matrix where each row corresponds to a document and each column corresponds to a unique word.
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np   # used below for np.fill_diagonal
import pandas as pd  # used below for the similarity DataFrames

cv = CountVectorizer()
out = cv.fit_transform(corpus).toarray()  # fit the vectorizer and get the document-term count matrix as a dense array
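For reference, the fitted vectorizer exposes its learned vocabulary as a dict that maps each unique word to its column index in out; the filtering step in the next section relies on this mapping (the example indices are illustrative and depend on your corpus):

print(cv.vocabulary_)         # e.g. {'ate': 2, 'red': 14, ...} -- exact indices depend on the corpus
print(cv.vocabulary_['red'])  # the column of `out` that holds counts for the word 'red'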
Filtering by Topic
To filter the vectorized data for a specific topic, we’ll use the vectorizer’s vocabulary: we look up the column indexes of all topic keywords and then slice the vectorized data down to those columns.
combined = colors + things  # combine the keyword lists from all topics
c = [(k, v) for k, v in cv.vocabulary_.items() if k in combined]  # (word, column index) pairs for every topic keyword found in the corpus
cdf = pd.DataFrame(out[:, [i[1] for i in c]], columns=[i[0] for i in c]).T  # keyword-by-document matrix: rows are keywords, columns are documents
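As a quick sanity check, slicing cdf by a topic yields a keyword-occurrence view like the topic tables shown at the start of the article (rows are that topic’s keywords, columns are documents):

cdf.loc[cdf.index.isin(things)]  # the same expression the helper functions below use to filter by topic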
Cosine Similarity
To measure the similarity between two documents, we’ll take the dot product of their keyword-count vectors: documents that share more topic keywords get a higher score. (Full cosine similarity additionally divides by the product of the vectors’ magnitudes; a normalized variant is shown after the next snippet.)
def get_similarity_table(topic):
    df = cdf.loc[cdf.index.isin(topic)]  # keep only the rows for this topic's keywords
    cnd = df.values
    similarity = cnd.T @ cnd  # dot product of document keyword vectors -> similarity matrix
    dd = pd.DataFrame(similarity, index=corpus, columns=corpus)  # label rows and columns with the sentences
    return dd

get_similarity_table(things)
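If you want true cosine similarity as described above (dot product divided by the vector magnitudes), scikit-learn provides a built-in helper. Here is a minimal sketch; get_cosine_table is a hypothetical name, not part of the original article:

from sklearn.metrics.pairwise import cosine_similarity

def get_cosine_table(topic):
    df = cdf.loc[cdf.index.isin(topic)]   # keep only the rows for this topic's keywords
    sim = cosine_similarity(df.values.T)  # dot products divided by vector magnitudes; documents with no topic keywords score 0
    return pd.DataFrame(sim, index=corpus, columns=corpus)

get_cosine_table(things)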
Finding the Most Similar Document
To find the most similar document to a given document, we’ll look up that sentence’s row in the similarity table and take the column with the maximum value, after zeroing the diagonal so a sentence cannot match itself.
def get_similar_review(s, topic):
    df = cdf.loc[cdf.index.isin(topic)]  # keep only the rows for this topic's keywords
    cnd = df.values
    similarity = cnd.T @ cnd  # dot product of document keyword vectors -> similarity matrix
    np.fill_diagonal(similarity, 0)  # zero the diagonal so a sentence is never returned as its own match
    dd = pd.DataFrame(similarity, index=corpus, columns=corpus)  # label rows and columns with the sentences
    return dd.loc[s].idxmax()  # pick the row for sentence s and return the column name with the highest score
s = 'i ate a red apple.'
get_similar_review(s, colors)
#'the kid read the book the little red riding hood'
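To rank every sentence instead of returning only the best match, you can sort the sentence’s row of the similarity table. This is a small sketch along the same lines; get_ranked_reviews is a hypothetical helper, not from the original article:

def get_ranked_reviews(s, topic):
    df = cdf.loc[cdf.index.isin(topic)]  # keep only the rows for this topic's keywords
    similarity = df.values.T @ df.values
    np.fill_diagonal(similarity, 0)      # exclude the sentence itself from the ranking
    dd = pd.DataFrame(similarity, index=corpus, columns=corpus)
    return dd.loc[s].sort_values(ascending=False)  # all sentences, best match first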
Conclusion
In this article, we explored how to find and rank similar sentences in a corpus of documents with respect to a given list of topic keywords, using the scikit-learn library for Python and its efficient implementations of common machine learning and NLP building blocks.
We built a keyword-occurrence matrix for each topic (colors and things), used a count vectorizer to count word occurrences across the corpus, filtered the vectorized data down to the topic keywords, and scored document similarity with dot products of the resulting keyword vectors.
Finally, we found the most similar document to a given document by looking up that sentence’s row in the similarity table and taking the column with the maximum value.
This approach can be used in various applications such as text classification, clustering, and retrieval.