0

I have a pandas dataframe which looks like this after the following code:

df['row_l0'] = df.groupby('Category',sort=False).ngroup()+1
df['row_l1'] = df.groupby(['Category','Process'],sort=False).ngroup()+1
df['row_l2'] = df.groupby(['Process','Parent'],sort=False).ngroup()+1

    Category    Process Parent      row_l0  row_l1  row_l2
127         A       a.5     a.5.4       1       5       24
128         A       a.6     a.6.1       1       6       25
129         A       a.6     a.6.2       1       6       26
130         A       a.6     a.6.3       1       6       26
131         A       a.6     a.6.4       1       6       27
132         A       a.6     a.6.5       1       6       27
133         B       b.1     b.1.1       2       7       28
134         B       b.2     b.2.1       2       8       29
135         B       b.2     b.2.2       2       8       29

For clarity, row_l0 relates to Category, row_l1 relates to Process and row_l2 to Parent.

The row_l0 is correct, but I can't seem to be able to reset the count/grouping for the subsequent groups (row_l1 and row_l2) when I get to category B (and beyond).

E.g. at index 133, row_l0 is correctly assigned 2, but for row_l1 instead of continuing from 6 to 7, it should reset due to being a new Category and assigned 1

Ideally, I'd end up with something like:

    Category    Process Parent      row_l0  row_l1  row_l2
127         A       a.5     a.5.4       1       5       24
128         A       a.6     a.6.1       1       6       25
129         A       a.6     a.6.2       1       6       26
130         A       a.6     a.6.3       1       6       26
131         A       a.6     a.6.4       1       6       27
132         A       a.6     a.6.5       1       6       27
133         B       b.1     b.1.1       2       1       1
134         B       b.2     b.2.1       2       2       2
135         B       b.2     b.2.2       2       2       3

I've been trying a mixture of combinations using groupby, rank and ngroup, and adapted various other questions (or this one) but can't seem to get it working.

Laurent
  • 12,287
  • 7
  • 21
  • 37
Maitiu
  • 118
  • 2
  • 10

2 Answers2

1

With the following dataframe:

import pandas as pd

df = pd.DataFrame(
    {
        "Category": ["A", "A", "A", "A", "A", "A", "B", "B", "B"],
        "Process": ["a.5", "a.6", "a.6", "a.6", "a.6", "a.6", "b.1", "b.2", "b.2"],
        "Parent": [
            "a.5.4",
            "a.6.1",
            "a.6.2",
            "a.6.3",
            "a.6.4",
            "a.6.5",
            "b.1.1",
            "b.2.1",
            "b.2.2",
        ],
    },
)

Here is one way to do it:

df["row_l0"] = df["Category"].apply(
    lambda x: {col: i + 1 for i, col in enumerate(df["Category"].unique())}[x]
)

df["row_l1"] = df["Process"].apply(lambda x: x[-1])

df["row_l2"] = [
    j + 1
    for count in df["Parent"].str[0].value_counts().to_dict().values()
    for j in range(count)
]
print(df)
# Output
  Category Process Parent  row_l0 row_l1  row_l2
0        A     a.5  a.5.4       1      5       1
1        A     a.6  a.6.1       1      6       2
2        A     a.6  a.6.2       1      6       3
3        A     a.6  a.6.3       1      6       4
4        A     a.6  a.6.4       1      6       5
5        A     a.6  a.6.5       1      6       6
6        B     b.1  b.1.1       2      1       1
7        B     b.2  b.2.1       2      2       2
8        B     b.2  b.2.2       2      2       3
Laurent
  • 12,287
  • 7
  • 21
  • 37
  • The df['row_l1'] calculation doesn't work as a general solution, because in some cases there won't always be a number on the last char, or it could be a number not wanted (e.g. it might be d.0 instead of d.1). But I think I can adapt the logic for `row_l2` for a more general `row_l1` solution – Maitiu Jul 10 '22 at 10:52
  • Right, this is the way to go, indeed. Let me know, otherwise. Cheers. – Laurent Jul 11 '22 at 18:05
0

To achieve the counting restarts you need to nest the groupbys.

One way to do that would be:

df["row_l0"] = df.groupby("Category").ngroup() + 1
for col, i in ("Process", 1), ("Parent", 2):
    df[f"row_l{i}"] = (
        df.groupby("Category")[col]
          .transform(lambda ser: ser.groupby(ser).ngroup() + 1)
    )

Or if you only want to group by Category once:

df["row_l0"] = df["row_l1"] = df["row_l2"] = 0
for i, (_, sdf) in enumerate(df.groupby("Category"), start=1):
    idx = sdf.index
    df.loc[idx, "row_l0"] = i
    df.loc[idx, "row_l1"] = sdf.groupby("Process").ngroup() + 1
    df.loc[idx, "row_l2"] = sdf.groupby("Parent").ngroup() + 1
Timus
  • 10,974
  • 5
  • 14
  • 28