Count unique register in numpy array

Question

I have a numpy array containing the letters "a", "b" or "c":

import numpy as np

my_array = np.array(["a", "a", "c", "c", "a"]) # In this example "b" is not present

I want to build a function f that counts the occurrences of each letter in the array. For my example, f should return [3, 0, 2], meaning that "a" appeared 3 times, "b" 0 times and "c" 2 times.

I'm looking for a solution (if possible) that uses numpy functions and not explicit for loops over the array. Maybe a kind of group by.

https://stackoverflow.com/questions/28663856/how-to-count-the-occurrence-of-certain-item-in-an-ndarray — MegaIng, Feb 09 '22 at 20:36

Lucas Roberts · Accepted Answer · 2022-02-09T20:22:40.943

Counter from the collections module in the standard library will do that for you.

import numpy as np 
my_array = np.array(["a", "a", "c", "c", "a"])
from collections import Counter
cnt = Counter(my_array)
cnt 
#  Counter({'a': 3, 'c': 2})

Note that the Counter does not store counts for items that did not appear in the data. If you look up such a key, it returns 0 without adding the key:

>>> cnt['b']
0

If you want to wrap that in a function and you already have a list of keys (not all of which may be present in your array data), Counter alone will not store the missing keys with 0 counts. If you want the keys and their 0s to be populated, something like this:

import numpy as np
from collections import Counter
from typing import Dict, Any


def counter_function(data, keys) -> Dict[Any, int]:
    cnt = Counter(data)
    for key in keys:
        cnt[key] = cnt[key]  # reading a missing key gives 0; assigning stores it
    return cnt

my_array = np.array(["a", "a", "c", "c", "a"])
so_counter = counter_function(my_array, ["a", "b", "c"])
so_counter
# Counter({'a': 3, 'c': 2, 'b': 0})

will do it for you.
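Since the question asks for a plain list such as [3, 0, 2], here is a minimal sketch that returns the counts in key order. The name `counts_in_order` is hypothetical and not part of the answer above; it relies only on the fact that a Counter returns 0 for missing keys.

```python
from collections import Counter

import numpy as np


def counts_in_order(data, keys):
    """Return the count of each key, in the order given by `keys`.

    Hypothetical helper for illustration only.
    """
    # tolist() converts the array elements to plain Python strings
    cnt = Counter(data.tolist())
    # Looking up a missing key in a Counter returns 0
    return [cnt[key] for key in keys]


my_array = np.array(["a", "a", "c", "c", "a"])
result = counts_in_order(my_array, ["a", "b", "c"])
print(result)  # [3, 0, 2]
```

This still loops over the keys in Python, but not over the array itself.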

score 2 · Answer 2 · answered Feb 09 '22 at 20:36

You can also use np.unique with return_counts=True, and just convert it to a dict with dict + zip:

dct = dict(zip(*np.unique(my_array, return_counts=True)))

Output:

>>> dct
{'a': 3, 'c': 2}

For smaller arrays, Lucas's answer is faster, but for large arrays, numpy is much more efficient.
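If keys that are absent from the array should appear with a count of 0, the result of np.unique can be combined with a known list of keys. The following is a rough sketch under that assumption; `unique_counts_with_zeros` is a hypothetical name, not something from the answer.

```python
import numpy as np


def unique_counts_with_zeros(data, keys):
    """Count occurrences with np.unique, filling in 0 for absent keys.

    Hypothetical helper; assumes the caller knows the full set of keys.
    """
    values, counts = np.unique(data, return_counts=True)
    # Convert to plain Python types so the dict keys are ordinary strings
    found = dict(zip(values.tolist(), counts.tolist()))
    return {key: found.get(key, 0) for key in keys}


my_array = np.array(["a", "a", "c", "c", "a"])
dct_full = unique_counts_with_zeros(my_array, ["a", "b", "c"])
print(dct_full)  # {'a': 3, 'b': 0, 'c': 2}
```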

This method does not handle the `0` key counts by default and the `collections.Counter` approach does even if it does not store the `0`s in the structure. The `dict` will raise a `KeyError` for `0` key counts. – Lucas Roberts Feb 10 '22 at 15:46
Ah, I didn't realize that the OP wanted that. Good thinking. – Feb 10 '22 at 15:52

Warren Weckesser · Answer 3 · 2022-02-09T20:55:34.187

If my_array has a typical length of about 10 or more, it can be worthwhile to convert your array to the integers [0, 1, 2] and then apply bincount().

Here's an example with your my_array:

In [31]: my_array = np.array(["a", "a", "c", "c", "a"])

In [32]: b = my_array.view(np.int32) - ord('a')

In [33]: b
Out[33]: array([0, 0, 2, 2, 0], dtype=int32)

In [34]: np.bincount(b, minlength=3)
Out[34]: array([3, 0, 2])

Here's a timing comparison of that method and collections.Counter using an input with length 100:

In [34]: rng = np.random.default_rng()

In [35]: a = rng.choice(['a', 'a', 'b', 'c'], size=100)

In [36]: %timeit Counter(a)
32.1 µs ± 723 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [37]: %timeit b = a.view(np.int32) - ord('a'); np.bincount(b, minlength=3)
3.86 µs ± 50.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The approach with bincount() is much faster.

It is also faster than using np.unique() with the parameter return_counts=True:

In [41]: %timeit values, counts = np.unique(a, return_counts=True)
19.7 µs ± 274 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
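The view(np.int32) trick above relies on the labels being single characters that are consecutive starting at "a". For an arbitrary known set of labels, one possible generalization (a sketch, not part of the answer as written) maps each element to its position in the key list with np.searchsorted and then applies bincount(). The name `bincount_labels` is hypothetical, and the sketch assumes every element of the array occurs in `keys`.

```python
import numpy as np


def bincount_labels(data, keys):
    """Count occurrences of each key in `data`, in the order of `keys`.

    Hypothetical helper; assumes every element of `data` is in `keys`.
    """
    keys = np.asarray(keys)
    order = np.argsort(keys)          # searchsorted needs sorted keys
    sorted_keys = keys[order]
    # Position of each element of data within the sorted keys
    idx = np.searchsorted(sorted_keys, data)
    counts_sorted = np.bincount(idx, minlength=len(keys))
    # Undo the sort so counts line up with the original key order
    counts = np.empty(len(keys), dtype=counts_sorted.dtype)
    counts[order] = counts_sorted
    return counts


my_array = np.array(["a", "a", "c", "c", "a"])
counts = bincount_labels(my_array, ["a", "b", "c"])
print(counts.tolist())  # [3, 0, 2]
```

Timings for this variant were not measured here, so the numbers above apply only to the original view-based version.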

Thanks Warren this is a nice solution-and fast! The question asked for handling of `0`s for specific keys and this ***does not*** handle that case, unless I'm mistaken? The `collections.Counter` approach will but it does not store the `0` values. If storing the `0`s is required then I think you need to pass keys and store similar to how I've done. — Lucas Roberts, Feb 11 '22 at 02:27
The question says *'I have a numpy array with letters "a", "b" or "c"...'*, but then qualifies that by saying that a letter (e.g. `"b"`) might be missing. The code in my answer handles that correctly, and you can see that for the given example, `my_array`, it returns exactly what the OP asked for, `[3, 0, 2]`. It is certainly possible that @Andrex has a more general problem in mind, and the description in the question doesn't reflect the full generality. If that is the case, @Andrex should probably update the question. — Warren Weckesser, Feb 11 '22 at 03:03
