Drop Duplicates based on Nearest Datetime condition

Question

import pandas as pd

def nearest(items, pivot):
    return min(items, key=lambda x: abs(x - pivot))

df = pd.read_csv("C:/Files/input.txt", dtype=str)
duplicatesDf = df[df.duplicated(subset=['CLASS_ID', 'START_TIME', 'TEACHER_ID'], keep=False)]
duplicatesDf['START_TIME'] =  pd.to_datetime(duplicatesDf['START_TIME'], format='%Y/%m/%d %H:%M:%S.%f')
print duplicatesDf
print df['START_TIME'].dt.date

df:
ID,CLASS_ID,START_TIME,TEACHER_ID,END_TIME
1,123,2020/06/01 20:47:26.000,o1,2020/06/02 00:00:00.000
2,123,2020/06/01 20:47:26.000,o1,2020/06/04 20:47:26.000
3,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:47:26.000
4,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:40:00.000
5,456,2020/06/01 20:47:26.000,o5,2020/06/08 20:00:26.000

So, I've got a dataframe like mentioned above. As you can see, I have multiple records with the same CLASS_ID,START_DATE and TEACHER_ID. Whenever, multiple records like these are present, I would like to retain only 1 record based on the condition that, the retained record should have its END_DATE nearest to its START_DATE(by minute level precision).

In this case, for CLASS_ID 123, the record with ID 1 will be retained, as its END_DATE 2020/06/02 00:00:00.000 is nearest to its START_DATE 2020/06/01 20:47:26.000 as compared to record with ID 2 whose END_DATE is 2020/06/04 20:47:26.000. Similarly for CLASS_ID 789, record with ID 4 will be retained.

Hence the expected output will be:

ID,CLASS_ID,START_TIME,TEACHER_ID,END_TIME
1,123,2020/06/01 20:47:26.000,o1,2020/06/02 00:00:00.000
4,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:40:00.000
5,456,2020/06/01 20:47:26.000,o5,2020/06/08 20:00:26.000

I've been going through the following links, https://stackoverflow.com/a/32237949, https://stackoverflow.com/a/33043374 to find a solution but have unfortunately reached an impasse.

Hence, would some kind soul mind helping me out a bit. Many thanks.

score 0 · Accepted Answer · edited Jun 20 '20 at 09:12

0

IIUC, we can use .loc and idxmin() after creating a condtional column to measure the elapsed time between the start and the end, we will apply idxmin() as a groupby operation on your CLASS_ID column.

df.loc[
    df.assign(mins=(df["END_TIME"] - df["START_TIME"]))
    .groupby("CLASS_ID")["mins"]
    .idxmin()
]


   ID  CLASS_ID          START_TIME TEACHER_ID            END_TIME
0   1       123 2020-06-01 20:47:26         o1 2020-06-02 00:00:00
4   5       456 2020-06-01 20:47:26         o5 2020-06-08 20:00:26
3   4       789 2020-06-01 20:47:26         o3 2020-06-03 14:40:00

in steps.

Time Delta.

print(df.assign(mins=(df["END_TIME"] - df["START_TIME"]))[['CLASS_ID','mins']])

   CLASS_ID            mins
0       123 0 days 03:12:34
1       123 3 days 00:00:00
2       789 1 days 18:00:00
3       789 1 days 17:52:34
4       456 6 days 23:13:00

minimum index from time delta column whilst grouping with CLASS_ID

print(df.assign(mins=(df["END_TIME"] - df["START_TIME"]) )
    .groupby("CLASS_ID")["mins"]
    .idxmin())

CLASS_ID
123    0
456    4
789    3
Name: mins, dtype: int64

edited Jun 20 '20 at 09:12

Community

1
1

answered Jun 16 '20 at 19:06

Umar.H

22,559
7
39
74

Would you mind helping me in casting the timestamp columns in my dataframe from string to datetime? I need to read the whole file as a string as I'm doing further processings in the middle. – AlanC Jun 16 '20 at 19:17
@Aron do you mean `df['START_DATE'] = pd.to_datetime(df['START_DATE'])`? – Umar.H Jun 16 '20 at 19:18
yes. unfortunately its throwing a warning. Something like `A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead` – AlanC Jun 16 '20 at 19:19
did you copy your dataframe from another dataframe ? use `new_df = df.copy()` when copying to a new variable – Umar.H Jun 16 '20 at 19:20
No. I'm directly reading to the dataframe from the file. Then when I execute the `to_datetime` command, I get this warning – AlanC Jun 16 '20 at 19:22
strange, instead in your read method, pass in `parse_dates` i.e `pd.read_csv(...,parse_dates=['START_TIME','END_TIME'])` @Aron – Umar.H Jun 16 '20 at 19:25
@Aron you are very welcome my friend, good luck in your analysis – Umar.H Jun 16 '20 at 19:30

Drop Duplicates based on Nearest Datetime condition

1 Answers1

Time Delta.