Identifying All Entries Within 7 Days of Some Dates Using pandas: A Comparative Analysis of Two Efficient Solutions

In this article, we’ll explore a problem involving two dataframes: one holding trade records and one holding viewing (browsing) history. For each trade, we want to identify every view entry that occurred in the 7 days up to and including the trade date.

Introduction

To solve this problem, we can use pandas, a powerful library in Python for data manipulation and analysis. We’ll break down the solution into steps and explore various approaches.

Problem Statement

We have two dataframes:

trade: records trade dates (date), person IDs (person), code numbers (code), and values (value1).
view: records viewing history dates (date), person IDs (person), code numbers (code), and values (value).

Our goal is to generate, for each trade, a list of the related views: those by the same person for the same code that occurred within the 7 days before the trade.
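The article does not show the input data, so the snippet below is a minimal sketch of what the two dataframes might look like. The rows are an assumption, reconstructed from the result table in the Output section; the column names follow the code used later in the article.

```python
import pandas as pd

# Hypothetical sample data, reconstructed from the article's output table.
trade = pd.DataFrame({
    "date": pd.to_datetime(["2019-08-31", "2019-09-01", "2019-09-04"]),
    "person": [1, 1, 2],
    "code": [123, 123, 456],
    "value1": [1, 2, 3],
})

view = pd.DataFrame({
    "date": pd.to_datetime(["2019-08-29", "2019-08-30", "2019-09-01",
                            "2019-08-31", "2019-09-01", "2019-09-03"]),
    "person": [1, 1, 1, 2, 2, 2],
    "code": [123, 123, 123, 456, 456, 456],
    "value": [1, 3, 5, 4, 7, 9],
})
```

Both date columns are stored as datetimes here, which is what the comparisons in the solutions rely on.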

Initial Approach

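A natural first attempt is to merge the two dataframes with pandas’ merge_asof function. This approach has a built-in limitation: merge_asof pairs each row on the left with at most one row on the right (the nearest match), so it cannot collect every view that falls inside a trade’s 7-day window. Whichever dataframe is placed on the left, the views are not assigned to trades as expected.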
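The limitation can be seen in a small sketch. The data below is hypothetical (reconstructed from the article’s output table), and the exact merge_asof call is an assumption, since the article does not show the one that was tried.

```python
import pandas as pd

# Hypothetical sample data (an assumption, not from the original article).
trade = pd.DataFrame({
    "date": pd.to_datetime(["2019-08-31", "2019-09-01", "2019-09-04"]),
    "person": [1, 1, 2],
    "code": [123, 123, 456],
    "value1": [1, 2, 3],
})
view = pd.DataFrame({
    "date": pd.to_datetime(["2019-08-29", "2019-08-30", "2019-09-01",
                            "2019-08-31", "2019-09-01", "2019-09-03"]),
    "person": [1, 1, 1, 2, 2, 2],
    "code": [123, 123, 123, 456, 456, 456],
    "value": [1, 3, 5, 4, 7, 9],
})

# merge_asof needs both inputs sorted on the key used for the "as of" match.
asof = pd.merge_asof(
    trade.sort_values("date"),
    view.rename(columns={"date": "view_date"}).sort_values("view_date"),
    left_on="date",
    right_on="view_date",
    by=["person", "code"],            # exact match on these columns
    tolerance=pd.Timedelta(days=7),   # look back at most 7 days
)

# One row per trade: only the most recent view survives, not the full list.
print(asof[["date", "person", "code", "view_date", "value"]])
```

The result has exactly one row per trade, so the earlier views inside each window are lost.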

Alternative Solution Using janitor

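The problem can be solved efficiently with the conditional_join function from the janitor library (installed as the pyjanitor package). We’ll create a helper column in the trade dataframe that marks the start of the 7-day window, then join it to the view dataframe on four conditions: the view date is on or after the window start, the view date is on or before the trade date, and both person and code are equal.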

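import pandas as pd
import janitor  # registers conditional_join as a DataFrame method

# Make sure both date columns are real datetimes so they can be compared
trade['date'] = pd.to_datetime(trade['date'])
view['date'] = pd.to_datetime(view['date'])

out = (trade
      # helper column: start of the 7-day window before each trade
      .assign(start_date=lambda d: d['date'].sub(pd.DateOffset(days=7)))
      .conditional_join(view.rename(columns={'date': 'view_dates',
                                             'value': 'view_values'}),
                        ('start_date', 'view_dates', '<='),  # view on or after window start
                        ('date', 'view_dates', '>='),        # view on or before trade date
                        ('person', 'person', '=='),
                        ('code', 'code', '=='),
                        right_columns=['view_dates', 'view_values']
                       )
      .drop(columns='start_date')
      .assign(view_dates=lambda d: d['view_dates'].dt.strftime('%Y-%m-%d'))
      # one row per trade, with its matching views collected into lists
      .groupby(list(trade), as_index=False).agg(list)
)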
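To make the window logic concrete, the helper column and the two date conditions can be checked in isolation with plain pandas. This is only an illustrative sketch: the trade rows and the single view date are hypothetical, and the person and code conditions are left out on purpose.

```python
import pandas as pd

# Hypothetical trades (an assumption, not from the original article).
trade = pd.DataFrame({
    "date": pd.to_datetime(["2019-08-31", "2019-09-01", "2019-09-04"]),
    "person": [1, 1, 2],
    "code": [123, 123, 456],
    "value1": [1, 2, 3],
})

# Same helper column as in the janitor chain: window start = trade date - 7 days.
windows = trade.assign(start_date=lambda d: d["date"].sub(pd.DateOffset(days=7)))

# A hypothetical view date, tested against every trade window.
view_date = pd.Timestamp("2019-08-24")
in_window = windows["start_date"].le(view_date) & windows["date"].ge(view_date)

print(windows[["date", "start_date"]].assign(in_window=in_window))
```

Both bounds are inclusive, so a view exactly 7 days before a trade, or on the trade date itself, counts as a match.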

Alternative Solution Using pandas

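Another approach is a pure pandas solution. We first merge every combination of trades and views that share a person ID and code number, then keep only the rows whose view date falls inside the trade’s 7-day window. Because the merge materializes all combinations before filtering, it is less efficient than the janitor method and can use a lot of memory on large inputs.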

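out = (trade
      # all trade/view pairs that share the same person and code
      .merge(view.rename(columns={'date': 'view_dates', 'value': 'view_values'}),
             on=['person', 'code'])
      # keep views from the window start up to and including the trade date
      # (inclusive on both ends, matching the janitor conditions)
      .loc[lambda d: d['date'].ge(d['view_dates']) &
            d['date'].sub(pd.DateOffset(days=7)).le(d['view_dates'])
           ]
      .assign(view_dates=lambda d: d['view_dates'].dt.strftime('%Y-%m-%d'))
      # one row per trade, with its matching views collected into lists
      .groupby(list(trade), as_index=False).agg(list)
)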
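As a self-contained check, the merge-then-filter idea can be run end to end on a small example. The input rows are hypothetical, reconstructed from the article’s output table, and the sketch uses inclusive bounds on both ends of the window so that a view on the trade date itself is kept.

```python
import pandas as pd

# Hypothetical sample data (an assumption, not from the original article).
trade = pd.DataFrame({
    "date": pd.to_datetime(["2019-08-31", "2019-09-01", "2019-09-04"]),
    "person": [1, 1, 2],
    "code": [123, 123, 456],
    "value1": [1, 2, 3],
})
view = pd.DataFrame({
    "date": pd.to_datetime(["2019-08-29", "2019-08-30", "2019-09-01",
                            "2019-08-31", "2019-09-01", "2019-09-03"]),
    "person": [1, 1, 1, 2, 2, 2],
    "code": [123, 123, 123, 456, 456, 456],
    "value": [1, 3, 5, 4, 7, 9],
})

# Step 1: every trade/view pair for the same person and code.
pairs = trade.merge(
    view.rename(columns={"date": "view_dates", "value": "view_values"}),
    on=["person", "code"],
)

# Step 2: keep pairs where the view lies in [trade date - 7 days, trade date].
window_start = pairs["date"] - pd.Timedelta(days=7)
mask = pairs["view_dates"].ge(window_start) & pairs["view_dates"].le(pairs["date"])

# Step 3: collect the surviving views into lists, one row per trade.
out = (
    pairs[mask]
    .assign(view_dates=lambda d: d["view_dates"].dt.strftime("%Y-%m-%d"))
    .groupby(list(trade), as_index=False)
    .agg(list)
)
print(out)
```

Note that both this sketch and an inner conditional join only return trades that have at least one matching view; trades with none would need to be merged back in separately if they should appear in the result.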

Output

Both solutions produce the same output:

date        person  code  value1  view_dates                            view_values
2019-08-31  1       123   1       [2019-08-29, 2019-08-30]              [1, 3]
2019-09-01  1       123   2       [2019-08-29, 2019-08-30, 2019-09-01]  [1, 3, 5]
2019-09-04  2       456   3       [2019-08-31, 2019-09-01, 2019-09-03]  [4, 7, 9]

Conclusion

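Collecting, for each trade, all viewing-history entries that fall inside a time window is a many-to-many matching problem that merge_asof cannot handle, but it can be solved with either janitor’s conditional_join or a plain pandas merge followed by a filter. The janitor solution is the more efficient of the two, since the pure pandas approach builds every trade/view combination per person and code before discarding the rows outside the window.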

We hope this article has provided you with a clear understanding of how to solve this problem using pandas and the janitor library.

Last modified on 2025-04-16