Filtering Words from a Status Column in Pandas DataFrame with Regex

Filtering Words into a New Column with Pandas

In this article, we’ll explore how to filter certain words from a status column in a pandas DataFrame and create a new column based on the filtered values.

Problem Statement

Suppose you have a pandas DataFrame with a Status column that contains strings describing an athlete’s condition for a game. You want to create a new column called Game_Status that filters through the Status column, identifying whether the athlete is likely to play or not. The filtering words/phrases can be defined in a list.

Solution

To implement this solution, we’ll use the pandas str.extract method along with regular expressions (regex). We’ll define a list of filtering words/phrases and then apply them to the Status column using regex patterns.

Installing Required Libraries

Before we begin, make sure you have the necessary libraries installed. You can install them via pip:

pip install pandas numpy

Defining Filtering Words/Phrases

First, let’s define a list of filtering words/phrases. We’ll use this list to create our regex pattern.

import numpy as np

# Define filtering words/phrases
gamestatuswords = ['out', 'questionable', 'doubtful','locker room','won\'t return','won\'t play','fractured','sprained','hyperextended','bruised',
                    'probable to return', 'uncertain', 'game-time decision','miss','weeks', 'GTD','suspended','suspension','day-to-day',
                    'game time decision', 'broken', 'torn', 'separated', 'ACL','unlikely to play','will not play','without timetable','retire',
                    'ejected','ejection','probable','hopeful','will play', 'available to play','will start','plans to play','good to go','cleared']

Creating a New Column with Filtering Words/Phrases

Now that we have our filtering words/phrases, let’s create the regex pattern and apply it to the Status column using pandas’ str.extract method.

import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({
    'Status': [
        'probable for Thursday',
        'plans to play on Tuesday',
        'questionable Tuesday',
        'won\'t play on Sunday',
        'will start on Saturday',
        'is questionable',
        'will not play Thursday',
        'questionable Monday',
        'good to go vs. CLE',
        'probable vs. Cavaliers',
        'questionable Monday',
        'will not play Saturday',
        'drops 35/16/7 on Mavs',
        'will play vs. DAL',
        'probable vs. Mavericks',
        'will play vs. Knicks',
        "'hopeful' to play Tues",
        'will play on Sunday',
        'will play on Saturday',
        'headed toward a GTD?'
    ]
})

# Create the Game_Status column
df['Game-status'] = (df.Status
                   .str.extract(fr'\b({"|".join(gamestatuswords)})\b')[0]
                   .fillna('')
               )

Output

The resulting DataFrame will have a new Game_status column that filters through the Status column, identifying whether the athlete is likely to play or not.

                      Status    Game-status
4      probable for Thursday       probable
7   plans to play on Tuesday  plans to play
8       questionable Tuesday   questionable
13      won't play on Sunday     won't play
15    will start on Saturday     will start
16           is questionable   questionable
17    will not play Thursday  will not play
32       questionable Monday   questionable
35        good to go vs. CLE     good to go
36    probable vs. Cavaliers       probable
37       questionable Monday   questionable
40    will not play Saturday  will not play
41     drops 35/16/7 on Mavs               
42         will play vs. DAL      will play
43    probable vs. Mavericks       probable
45      will play vs. Knicks      will play
47    'hopeful' to play Tues        hopeful
52       will play on Sunday      will play
55     will play on Saturday      will play
56      headed toward a GTD?            GTD

Conclusion

In this article, we explored how to filter certain words from a status column in a pandas DataFrame and create a new column based on the filtered values. We used the str.extract method along with regular expressions (regex) to implement this solution. By defining a list of filtering words/phrases and creating a regex pattern, we were able to identify whether an athlete is likely to play or not.

Next Steps

This solution can be applied to various real-world scenarios where data filtering is required. For example, in sports analytics, you might want to filter athletes based on their injury status before calculating their expected performance.

Additionally, this article touched upon the basics of regular expressions and how they can be used with pandas for data manipulation. If you’re interested in learning more about regex patterns or other pandas features, feel free to explore the pandas documentation or check out our upcoming articles on these topics.


Last modified on 2025-01-29