1

I have a dataset with many project's monthly expendituries (cost curve), like this one:

Project Date Expenditure(USD)
Project A 12-2020 500
Project A 01-2021 1257
Project A 02-2021 125889
Project A 03-2021 102447
Project A 04-2021 1248
Project A 05-2021 1222
Project A 06-2021 856
Project B 01-2021 5589
Project B 02-2021 52874
Project B 03-2021 5698745
Project B 04-2021 2031487
Project B 05-2021 2359874
Project B 06-2021 25413
Project B 07-2021 2014
Project B 08-2021 2569

Using python, I want to create a "Period" column that replace the month value for a integer that represents the count of months of the project, like this: Where the line is the first month of the Project A (12-2020) the code should put 1 in the "Period" column, the second month (01-2021) is 2, the third (02-2021) is 3, etc. because I need to focus on the number of months that the projects of my dataframe had an expediture (month 1, month 2, month 3...)

Project Date Period Expenditure(USD)
Project A 12-2020 1 500
Project A 01-2021 2 1257
Project A 02-2021 3 125889
Project A 03-2021 4 102447
Project A 04-2021 5 1248
Project A 05-2021 6 1222
Project A 06-2021 7 856
Project B 01-2021 1 5589
Project B 02-2021 2 52874
Project B 03-2021 3 5698745
Project B 04-2021 4 2031487
Project B 05-2021 5 2359874
Project B 06-2021 6 25413
Project B 07-2021 7 2014
Project B 08-2021 8 2569
conradoov
  • 13
  • 3

1 Answers1

0

The easiest thing is for you to calculate for each row:

  1. The start date of the corresponding project.
  2. The months since current date and start date of the project.

Below is a sample code that does that for you:

import pandas as pd
import numpy as np

df = pd.DataFrame( [ ["Project A", "12-2020", 500], ["Project A", "01-2021", 1257], ["Project A", "02-2021", 125889], ["Project A", "03-2021", 102447], ["Project A", "04-2021", 1248], ["Project A", "05-2021", 1222], ["Project A", "06-2021", 856], ["Project B", "01-2021", 5589], ["Project B", "02-2021", 52874], ["Project B", "03-2021", 5698745], ["Project B", "04-2021", 2031487], ["Project B", "05-2021", 2359874], ["Project B", "06-2021", 25413], ["Project B", "07-2021", 2014], ["Project B", "08-2021", 2569], ], columns=["Project", "Date", "Expenditure(USD)"], )

df["Date"] = pd.to_datetime(df["Date"], format="%m-%Y") # Convert date column type

get the start date of the project

i.e find the lowest date of rows that have the same project as the current row

df["Project Start Date"] = df.apply(lambda row: min(df[df["Project"] == row["Project"]]["Date"]), axis=1)

calculate the period

i.e. the number of months of the current date since the start of the project + 1

df["Period"] = ((df["Date"] - df["Project Start Date"]) / np.timedelta64(1, "M") + 1).round().astype(int)

print(df)

It gives you the following:

      Project        Date  Expenditure(USD) Project Start Date  Period
0   Project A  2020-12-01               500         2020-12-01       1
1   Project A  2021-01-01              1257         2020-12-01       2
2   Project A  2021-02-01            125889         2020-12-01       3
3   Project A  2021-03-01            102447         2020-12-01       4
4   Project A  2021-04-01              1248         2020-12-01       5
5   Project A  2021-05-01              1222         2020-12-01       6
6   Project A  2021-06-01               856         2020-12-01       7
7   Project B  2021-01-01              5589         2021-01-01       1
8   Project B  2021-02-01             52874         2021-01-01       2
9   Project B  2021-03-01           5698745         2021-01-01       3
10  Project B  2021-04-01           2031487         2021-01-01       4
11  Project B  2021-05-01           2359874         2021-01-01       5
12  Project B  2021-06-01             25413         2021-01-01       6
13  Project B  2021-07-01              2014         2021-01-01       7
14  Project B  2021-08-01              2569         2021-01-01       8
Bruno Lubascher
  • 3,618
  • 1
  • 14
  • 36