Counting the number of rows that fulfill a condition and assigning to each row a number from 1 to nRows pandas

Question

So again, i have another question related to this: I'm processing a DataFrame, which looks like the following:

the thing is that now I want to add an additional column, called 'position', in which, according to the contributor_id, and the number of edits, the number of the corresponding row appears. The thing is that now, I don't want the count of rows to reestart until the value in nEdits is greater than 0, and This number must be reinitiated to 1 when the contributor_id changes:

    contributor_id  timestamp   nEdits   Position
0           8        2018-01-01   1          1
1           8        2018-02-01   1          2
2           8        2018-03-01   1          3
3           8        2018-04-01   1          4
4           8        2018-05-01   1          5
5           8        2018-06-01   1          6
6           8        2018-07-01   1          7
7           8        2018-08-01   1          8
8       26424341     2018-01-01   0          0
9       26424341     2018-02-01   0          0
10      26424341     2018-03-01   11         1
11      26424341     2018-04-01   34         2
12      26424341     2018-05-01   42         3
13      26424341     2018-06-01   46         4
14      26424341     2018-07-01   50         5
15      26424341     2018-08-01   54         6
16      26870381     2018-01-01   465        1
17      26870381     2018-02-01   566        2
18      26870381     2018-03-01   601        3

The idea I got from some answers to compute the position column is to do: df.groupby("contributor_id").position.cumsum() But I don't know how to include the condition that nEdits must be greater than 0 in order to reestart the count.

Better is `df["Position"] = df.groupby("contributor_id").cumcount() + 1` — jezrael, Dec 07 '18 at 11:19
I made some changes in the question.. any ideas on how could I implement them? — HRDSL, Dec 07 '18 at 11:42

jezrael · Accepted Answer · 2018-12-07T12:17:59.987

Use GroupBy.cumcount by column contributor_id and helper Series for distingoush multiple 0 in same group:

m = df['nEdits'] == 0
df['Position1'] = np.where(m, 0, df.groupby([m.ne(m.shift()).cumsum(), 
                                             'contributor_id']).cumcount() + 1)
print (df)
    contributor_id   timestamp  nEdits  Position  Position1
0                8  2018-01-01       1         1          1
1                8  2018-02-01       1         2          2
2                8  2018-03-01       1         3          3
3                8  2018-04-01       1         4          4
4                8  2018-05-01       1         5          5
5                8  2018-06-01       1         6          6
6                8  2018-07-01       1         7          7
7                8  2018-08-01       1         8          8
8         26424341  2018-01-01       0         0          0
9         26424341  2018-02-01       0         0          0
10        26424341  2018-03-01      11         1          1
11        26424341  2018-04-01      34         2          2
12        26424341  2018-05-01       0         3          0 <- added 0 for more general data
13        26424341  2018-06-01      46         4          1
14        26424341  2018-07-01      50         5          2
15        26424341  2018-08-01      54         6          3
16        26870381  2018-01-01     465         1          1
17        26870381  2018-02-01     566         2          2
18        26870381  2018-03-01     601         3          3

Detail:

print (m.ne(m.shift()).cumsum())
0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     2
9     2
10    3
11    3
12    4
13    5
14    5
15    5
16    5
17    5
18    5
Name: nEdits, dtype: int32

Check difference:

m = df['nEdits'] == 0
df['Position1'] = np.where(m, 0, df.groupby([m, 'contributor_id']).cumcount() + 1)
print (df)
    contributor_id   timestamp  nEdits  Position  Position1
0                8  2018-01-01       1         1          1
1                8  2018-02-01       1         2          2
2                8  2018-03-01       1         3          3
3                8  2018-04-01       1         4          4
4                8  2018-05-01       1         5          5
5                8  2018-06-01       1         6          6
6                8  2018-07-01       1         7          7
7                8  2018-08-01       1         8          8
8         26424341  2018-01-01       0         0          0
9         26424341  2018-02-01       0         0          0
10        26424341  2018-03-01      11         1          1
11        26424341  2018-04-01      34         2          2
12        26424341  2018-05-01       0         3          0
13        26424341  2018-06-01      46         4          3 <-not new group
14        26424341  2018-07-01      50         5          4 <-not new group
15        26424341  2018-08-01      54         6          5 <-not new group
16        26870381  2018-01-01     465         1          1
17        26870381  2018-02-01     566         2          2
18        26870381  2018-03-01     601         3          3

The approach worked perfectly. Thank you very much for your help! — HRDSL, Dec 08 '18 at 10:44

Counting the number of rows that fulfill a condition and assigning to each row a number from 1 to nRows pandas

1 Answers1