Create Key Phrases Using Python Without NLP Libraries
The simplest way to get key phrases out of your text using Python

There are several NLP libraries to work with, for example, Natural Language Toolkit (NLTK), TextBlob, CoreNLP, Gensim, and spaCy. You can use these excellent libraries to process your text and extract keywords or key phrases.
There are many ways to parse text. In this article, I'm going to show you the easiest way to parse your text and get the 10 best key phrases out of it without using any NLP libraries.
However, we will need two modules from the Python standard library for pre-processing and sorting the data.
Libraries Required
import re
import heapq
Suppose we have the following text block, which has about 600 words. Let's load it into a string (a triple-quoted string lets us keep the paragraph break):
text = """Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. This was due to both the steady increase in computational power (see Moore’s law) and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.[3] Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. The cache language models upon which many speech recognition systems now rely are examples of such statistical models. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks. Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed. These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in the success of these systems. As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data. Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms can learn from data that has not been hand-annotated with the desired answers or use a combination of annotated and non-annotated data. Generally, this task is much more difficult than supervised learning and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results if the algorithm used has a low enough time complexity to be practical. In the 2010s, representation learning and deep neural network-style machine learning methods became widespread in natural language processing, due in part to a flurry of results showing that such techniques[4][5] can achieve state-of-the-art results in many natural language tasks, for example in language modeling,[6] parsing,[7][8] and many others.
Popular techniques include the use of word embeddings to capture semantic properties of words, and an increase in end-to-end learning of a higher-level task (e.g., question answering) instead of relying on a pipeline of separate intermediate tasks (e.g., part-of-speech tagging and dependency parsing). In some areas, this shift has entailed substantial changes in how NLP systems are designed, such that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural language processing. For instance, the term neural machine translation (NMT) emphasizes the fact that deep learning-based approaches to machine translation directly learn sequence-to-sequence transformations, obviating the need for intermediate steps such as word alignment and language modeling that was used in statistical machine translation (SMT)."""
Now, split the text string into sentences. (We won't use the sentence list in the key-phrase extraction below, but splitting into sentences is a common first pre-processing step.)
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
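To see what this regular expression actually does, you can try it on a small sample first. The sample sentence below is made up for illustration and is not part of the article's text:
import re

sample = "Machine learning is powerful. It changed NLP! Do you agree?"
print(re.split(r' *[\.\?!][\'"\)\]]* *', sample))
# ['Machine learning is powerful', 'It changed NLP', 'Do you agree', '']
Note the trailing empty string that appears when the text ends with a punctuation mark; you may want to filter empty entries out if you work with the sentence list.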
For pre-processing, we are going to lowercase the text and split it into words (word_tokenize).
clean_text = text.lower()
word_tokenize = clean_text.split()
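One caveat: because we only split on whitespace, punctuation stays attached to the words, so "rules." and "rules" are counted as different tokens. If you want to avoid that, you could add an optional cleaning step like the one below before splitting. This is my own addition, not part of the original walkthrough, and it will slightly change the phrases extracted later:
# optional: replace everything except word characters, whitespace and hyphens with a space
clean_text = re.sub(r'[^\w\s-]', ' ', text.lower())
word_tokenize = clean_text.split()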
We also need to exclude the stopwords of the language we are working with. You can get stopwords for your desired language from the Countwordsfree website: https://countwordsfree.com/stopwords
stop_words = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]
We have put a list of common English stopwords in a Python list. You can add other languages' stopwords by appending them to this list.
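If you download a stopword list for another language as a plain text file with one word per line, you could load and append it like this. The filename here is just a placeholder:
# append stopwords from a file; 'stopwords_extra.txt' is a placeholder filename
with open('stopwords_extra.txt', encoding='utf-8') as f:
    stop_words += [line.strip() for line in f if line.strip()]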
Next, we are going to count how many times each non-stopword appears, storing the counts in a dictionary.
word2count = {}
for word in word_tokenize:
    if word not in stop_words:
        if word not in word2count.keys():
            word2count[word] = 1
        else:
            word2count[word] += 1
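As a side note, the same frequency count can be written more compactly with collections.Counter from the standard library; the loop above does the same thing as:
from collections import Counter

# frequency count of all non-stopword tokens
word2count = dict(Counter(word for word in word_tokenize if word not in stop_words))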
After that, we are going to turn the counts into a weighted histogram by dividing each count by the largest count (taking the maximum once, before the loop).
# weighted histogram: normalize each count by the highest count
max_count = max(word2count.values())
for key in word2count.keys():
    word2count[key] = word2count[key] / max_count
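A quick way to sanity-check the weights is to look at the highest-weighted words. This check is just for inspection and isn't needed for the rest of the code:
# peek at the ten highest-weighted words
print(heapq.nlargest(10, word2count, key=word2count.get))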
Now, we are going to extract the key phrases. First, set the parameters.
min_keywords = 2
max_keywords = 3
Here we require each key phrase to contain a minimum of two and a maximum of three non-stopwords. The phrase itself may be longer, because it can also include stopwords in between.
Now comes the main part: extracting the key phrases.
# Initialize the candidate list
candidates = []
# Split the text to get a list of lowercase words
sl = text.lower().split()
for num_keywords in range(min_keywords, max_keywords + 1):
    # For each possible starting position in the word list
    for i in range(0, len(sl) - num_keywords):
        # Position i marks the first word of the candidate. Proceed only if it's not a stopword
        if sl[i] not in stop_words:
            candidate = sl[i]
            # Initialize j (the pointer to the next word) to 1
            j = 1
            # Initialize the keyword counter. This counts the non-stopword words in the candidate
            keyword_counter = 1
            contains_stopword = False
            # Until the keyword count reaches num_keywords or the end of the text is reached
            while keyword_counter < num_keywords and i + j < len(sl):
                # Add the next word to the candidate
                candidate = candidate + ' ' + sl[i + j]
                # If it's not a stopword, increase the keyword counter. If it is, turn on the flag
                if sl[i + j] not in stop_words:
                    keyword_counter += 1
                else:
                    contains_stopword = True
                # Next position
                j += 1
            # Add the candidate to the list only if:
            # 1) it contains at least one stopword, AND
            # 2) the last word is not a stopword, AND
            # 3) it contains exactly num_keywords non-stopwords (to avoid duplicates)
            if contains_stopword and candidate.split()[-1] not in stop_words and keyword_counter == num_keywords:
                candidates.append(candidate)
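Before scoring, it can be worth checking how many candidates were collected and what they look like (again, purely a sanity check of my own):
# how many candidate phrases did we collect, and what do they look like?
print(len(candidates))
print(candidates[:5])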
After that, we have to score each key phrase using the word weights we calculated earlier.
# scoring best keyphrases
key2score = {}
for key_phrase in candidates:
    for keyword in key_phrase.split():
        if keyword in word2count.keys():
            if key_phrase not in key2score.keys():
                key2score[key_phrase] = word2count[keyword]
            else:
                key2score[key_phrase] += word2count[keyword]
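If you prefer a more compact form, the scoring can also be written as a dictionary comprehension. Note that, unlike the loops above, this scores each distinct phrase exactly once even if it was collected several times:
# compact variant of the scoring step
key2score = {phrase: sum(word2count.get(word, 0) for word in phrase.split())
             for phrase in candidates}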
Now we just have to pick the ten highest-scoring key phrases and look at the results.
#best keyphrases
best_ten_keyphrases = heapq.nlargest(10, key2score, key=key2score.get)
print('------------- BEST 10 KeyPhrases --------------')
print(best_ten_keyphrases)
Output
------------- BEST 10 KeyPhrases --------------
['learning and deep neural', 'approaches may be viewed', 'may be viewed as a new', 'viewed as a new paradigm', 'paradigm distinct from statistical', 'instance, the term neural', '(nmt) emphasizes the fact', 'emphasizes the fact that deep', 'fact that deep learning-based', 'learning-based approaches to machine']
We have got the 10 best key phrases from a 600-word text with a very simple bit of code.
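By the way, heapq.nlargest(10, key2score, key=key2score.get) is equivalent to sorting the phrases by score and taking the first ten, so you could also write:
# same ten phrases, using sorted() instead of a heap
best_ten_keyphrases = sorted(key2score, key=key2score.get, reverse=True)[:10]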
Here is the full code:
import re
import heapq

# get text
text = "Your text to get key phrases"

# split into sentences
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

# lowercase and split into words
clean_text = text.lower()
word_tokenize = clean_text.split()

#stop_words = nltk.corpus.stopwords.words('english')
stop_words = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]
#stop_words += additional_stopwords.split()

# histogram
word2count = {}
for word in word_tokenize:
    if word not in stop_words:
        if word not in word2count.keys():
            word2count[word] = 1
        else:
            word2count[word] += 1

# weighted histogram: normalize each count by the highest count
max_count = max(word2count.values())
for key in word2count.keys():
    word2count[key] = word2count[key] / max_count

# keyphrase extraction
# For each possible number of keywords in the adjoined candidate
min_keywords = 2
max_keywords = 3
# Initialize the candidate list
candidates = []
# Split the text to get a list of lowercase words
sl = text.lower().split()
for num_keywords in range(min_keywords, max_keywords + 1):
    # For each possible starting position in the word list
    for i in range(0, len(sl) - num_keywords):
        # Position i marks the first word of the candidate. Proceed only if it's not a stopword
        if sl[i] not in stop_words:
            candidate = sl[i]
            # Initialize j (the pointer to the next word) to 1
            j = 1
            # Initialize the keyword counter. This counts the non-stopword words in the candidate
            keyword_counter = 1
            contains_stopword = False
            # Until the keyword count reaches num_keywords or the end of the text is reached
            while keyword_counter < num_keywords and i + j < len(sl):
                # Add the next word to the candidate
                candidate = candidate + ' ' + sl[i + j]
                # If it's not a stopword, increase the keyword counter. If it is, turn on the flag
                if sl[i + j] not in stop_words:
                    keyword_counter += 1
                else:
                    contains_stopword = True
                # Next position
                j += 1
            # Add the candidate to the list only if:
            # 1) it contains at least one stopword, AND
            # 2) the last word is not a stopword, AND
            # 3) it contains exactly num_keywords non-stopwords (to avoid duplicates)
            if contains_stopword and candidate.split()[-1] not in stop_words and keyword_counter == num_keywords:
                candidates.append(candidate)

# scoring best keyphrases
key2score = {}
for key_phrase in candidates:
    for keyword in key_phrase.split():
        if keyword in word2count.keys():
            if key_phrase not in key2score.keys():
                key2score[key_phrase] = word2count[keyword]
            else:
                key2score[key_phrase] += word2count[keyword]

# best keyphrases
best_ten_keyphrases = heapq.nlargest(10, key2score, key=key2score.get)
print('------------- BEST 10 KeyPhrases --------------')
print(best_ten_keyphrases)
Conclusion
There are many more methods for extracting key phrases, but most of the ones I found return only single keywords or full sentences. Here, I have demonstrated a really simple way of finding key phrases or keywords. I hope this helps you and encourages you to learn more about natural language processing and to build more interesting projects.