How does the bag-of-words model work in NLP?

It's no easy task to teach a machine learning system the complexity of human language. Bag-of-words models can help by turning text into numerical representations.

The bag-of-words model is a popular text modeling technique used in natural language processing. It's an effective way to extract patterns from text, a task that can otherwise be challenging and compute-intensive.

The bag-of-words (BoW) model uses various methods -- including tokenization and vectorization -- to process text. The natural language processing (NLP) technique has many use cases, such as identifying sentiment, classifying text and detecting spam. However, there are certain limitations and alternative approaches that NLP engineers should consider before using the BoW model.

How does the bag-of-words model work?

The BoW model processes text by converting the words within a text sequence into numbers that represent how often each word occurs. Each word is checked against a dictionary, an established list of words the model can detect. The model can then focus on specific words of interest while ignoring other words and language considerations, such as grammar.

Converting natural language into a numerical representation helps the NLP system understand the importance of different text sequences, such as sentences.

BoW models typically employ five general processes:

1. Establish a dictionary

A dictionary defines the words the model looks for and acts upon. For example, a BoW model built for sentiment analysis or customer satisfaction might establish a dictionary that includes words such as excellent, wonderful, disappointing and slow. NLP engineers and business leaders typically create dictionaries during model development.
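For illustration, a small dictionary can be represented as a simple Python set. The words below are hypothetical examples chosen for a sentiment use case, not a standard lexicon:

    # Hypothetical sentiment dictionary: the words the model will look for.
    # Real projects often start from an existing lexicon and refine it over time.
    SENTIMENT_DICTIONARY = {
        "excellent", "wonderful", "great",        # positive indicators
        "disappointing", "slow", "terrible",      # negative indicators
    }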

2. Tokenize the text

Tokenization is the act of dividing text into elements called tokens. Tokens can include individual words, punctuation marks and meaningful parts of words. For example, if the text contains a sentence such as, "The car drives fast on the road," tokenization would create a set of individual elements such as the, car, drives, fast, on, the and road.
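A minimal tokenizer can be sketched with Python's standard library. Lowercasing the text and splitting on runs of letters is just one simple convention; production systems often use more sophisticated tokenizers:

    import re

    def tokenize(text):
        """Split text into lowercase word tokens, dropping punctuation."""
        return re.findall(r"[a-z']+", text.lower())

    print(tokenize("The car drives fast on the road"))
    # ['the', 'car', 'drives', 'fast', 'on', 'the', 'road']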

3. Create a vocabulary

The vocabulary process evaluates tokens and identifies unique words. For example, if the text contains the two sentences, "The car drives fast on the road," and, "The bumpy road will break the car," the vocabulary would include the, car, drives, fast, on, road, bumpy, will and break. Repeated words are only counted once in the vocabulary.
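Continuing the sketch, the vocabulary is simply the set of unique tokens across all sentences, assuming the tokenize helper from the previous step:

    sentences = [
        "The car drives fast on the road",
        "The bumpy road will break the car",
    ]

    vocabulary = sorted({token for s in sentences for token in tokenize(s)})
    print(vocabulary)
    # ['break', 'bumpy', 'car', 'drives', 'fast', 'on', 'road', 'the', 'will']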

4. Count word occurrence

When a word receives a vocabulary entry, the occurrence of that word is also counted. For the two previous sentences, the number of occurrences would be the (4), car (2), drives (1), fast (1), on (1), road (2), bumpy (1), will (1) and break (1).
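In code, collections.Counter handles this step directly for the same two sentences, again assuming the tokenize helper and sentences list from the earlier sketches:

    from collections import Counter

    word_counts = Counter(token for s in sentences for token in tokenize(s))
    print(word_counts)
    # Counter({'the': 4, 'car': 2, 'road': 2, 'drives': 1, 'fast': 1,
    #          'on': 1, 'bumpy': 1, 'will': 1, 'break': 1})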

5. Generate vector representation

Once word occurrences are counted, each text sequence is converted into a numerical vector, with one position per vocabulary word holding that word's count. Vectorization enables the model to compare text sequences mathematically and weigh the relative importance of words. For example, even though the word the appears four times in the two example sentences, its importance to the overall text might be less than words such as car or break.
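Putting the pieces together, each sentence can be mapped to a vector of counts over the shared vocabulary. The sketch below continues the earlier hypothetical helpers; libraries such as scikit-learn's CountVectorizer perform the same conversion out of the box:

    from collections import Counter

    def vectorize(sentence, vocabulary):
        """Return one count per vocabulary word for this sentence."""
        counts = Counter(tokenize(sentence))
        return [counts[word] for word in vocabulary]

    for s in sentences:
        print(vectorize(s, vocabulary))
    # Vocabulary: ['break', 'bumpy', 'car', 'drives', 'fast', 'on', 'road', 'the', 'will']
    # [0, 0, 1, 1, 1, 1, 1, 2, 0]   <- "The car drives fast on the road"
    # [1, 1, 1, 0, 0, 0, 1, 2, 1]   <- "The bumpy road will break the car"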

Bag-of-words model use cases

The BoW model gives machine learning systems a simple numerical view of which words appear in a text and how often. This makes it well suited to a variety of uses:

  • Search and text classification. Based on vocabulary and context, BoW models can categorize documents into specific topics or coverage areas, such as news, business, weather and sports. Classification is useful for search and content aggregation platforms.
  • Language determination. BoW models can determine a text's language by identifying vocabulary words, which can help with automated translation and geolocation.
  • Sentiment analysis. BoW models can evaluate the positive and negative words within a vocabulary to gauge sentiment about a topic. This is useful in automated surveys and other user feedback tools, as shown in the sketch after this list.
  • Spam detection. By analyzing particular words often contained in spam, BoW models can identify the presence of unwanted or malicious content.
  • Topic discovery. BoW models can analyze text from different documents to identify common themes or topics that might not be obvious or intuitive to casual readers.
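As a rough illustration of sentiment classification on top of BoW features, the sketch below uses scikit-learn's CountVectorizer and a naive Bayes classifier. The labeled reviews are hypothetical and far too small for a real model:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical labeled reviews; a real system would need far more data.
    reviews = [
        "excellent service and wonderful staff",
        "the delivery was slow and disappointing",
        "wonderful experience and excellent value",
        "slow response and disappointing support",
    ]
    labels = ["positive", "negative", "positive", "negative"]

    vectorizer = CountVectorizer()                # builds the BoW count vectors
    features = vectorizer.fit_transform(reviews)

    classifier = MultinomialNB().fit(features, labels)
    print(classifier.predict(vectorizer.transform(["wonderful and excellent support"])))
    # ['positive']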

Strengths and limitations of the bag-of-words model

The BoW model is a reliable and well-accepted means of processing text data. Common strengths include the following:

  • Simplicity. BoW models require relatively little compute power and are typically simple to design and deploy. Dictionary maintenance can require recurring attention, but effort is minimal unless the model's purpose radically changes.
  • Flexibility. BoW models are adaptable to various NLP uses such as search, text classification and sentiment analysis.
  • Explainability. Many machine learning models have opaque processing and decision-making. In contrast, BoW models are straightforward in their text representation, making output easy to understand and explain. This can make BoW models appealing to organizations concerned with compliance and governance.
  • Scalability. BoW models typically handle data efficiently, and their simplicity makes it easy to scale for compute and storage when handling large documents or data sets.

However, the technique also has limitations that technology and business leaders should understand:

  • Insensitivity. BoW models are insensitive to word order and ignore punctuation and grammar rules. This poses problems when establishing word relationships; for example, sentences made of the same words can carry very different meanings, as in the classic "Let's eat, Grandma" versus "Let's eat Grandma" punctuation problem.
  • Limited semantics. BoW models gauge the presence and frequency of words but cannot capture their meanings within phrases. For instance, BoW models struggle with homographs -- words with identical spellings but different meanings. To a BoW model, the word bat is the same word regardless of whether it means a flying mammal or baseball equipment.
  • High dimensionality. Text with many unique words can result in data sets with large numbers of features or variables, which affects compute demands and performance. High dimensionality can lead to overfitting, where there is high accuracy on training data but low accuracy on new data.
  • Hard-to-add words. Adding new words to established BoW models can be difficult because new words are often ignored or redirected to an "unknown" token. To add new words, NLP engineers must reprocess all text from scratch.

Alternative approaches to the BoW model can mitigate some of these challenges. One alternative is the bag-of-n-grams, which uses two- and three-word phrases to capture and analyze relationships between words that often appear together.
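As a quick sketch of the idea, scikit-learn's CountVectorizer can count two- and three-word phrases instead of single words by setting its ngram_range parameter:

    from sklearn.feature_extraction.text import CountVectorizer

    ngram_vectorizer = CountVectorizer(ngram_range=(2, 3))   # bigrams and trigrams
    ngram_vectorizer.fit(["The car drives fast on the road"])
    print(ngram_vectorizer.get_feature_names_out())
    # ['car drives' 'car drives fast' 'drives fast' 'drives fast on' ...]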

Another alternative is term frequency-inverse document frequency (TF-IDF). TF-IDF produces a weighted word count: a word's frequency within a document is scaled down according to how many documents contain that word. This makes rare or unusual words more important and effectively discounts common or stop words, which can help overcome some of the insensitivity found in BoW models.
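A bare-bones TF-IDF calculation over the two earlier example sentences looks roughly like the following; real libraries, such as scikit-learn's TfidfVectorizer, add smoothing and normalization on top of this basic idea:

    import math

    documents = [
        ["the", "car", "drives", "fast", "on", "the", "road"],
        ["the", "bumpy", "road", "will", "break", "the", "car"],
    ]

    def tf_idf(word, doc, docs):
        tf = doc.count(word) / len(doc)          # term frequency within this document
        df = sum(word in d for d in docs)        # number of documents containing the word
        idf = math.log(len(docs) / df)           # inverse document frequency
        return tf * idf

    print(round(tf_idf("the", documents[0], documents), 3))     # 0.0 -- appears in every document
    print(round(tf_idf("drives", documents[0], documents), 3))  # 0.099 -- rarer, so weighted higher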

Stephen J. Bigelow, senior technology editor at Informa TechTarget, has more than 30 years of technical writing experience in the PC and technology industry.
