For simplicity, I'll assume that interpolation isn't needed, and
that it suffices to find the individual nearest to the
$q^\text{th}$ quantile point, where $0 \leqslant q \leqslant 1.$
Suppose that the population consists of $N$ individuals, sorted in
ascending order of the values of some attribute. Suppose that there
are $r$ different attribute values, and that $m_i$ individuals have the
$i^\text{th}$ value of the attribute, for $i = 1, 2, \ldots, r.$
Then $m_1 + m_2 + \cdots + m_r = N.$
Represent the $k^\text{th}$ individual as the centre of a notional
continuous interval $[k - \tfrac12, k + \tfrac12),$ for
$k = 1, 2, \ldots, N.$ Then the entire population occupies the
interval $[\tfrac12, N + \tfrac12),$ and the $q^\text{th}$ quantile
point is at $Nq + \tfrac12.$ We simplistically replace this with
the nearest integer, rounding down in the ambiguous case when $Nq$
is an integer. Thus we take the $q^\text{th}$ quantile to be
individual number $\left\lfloor{Nq}\right\rfloor + 1,$ for
$q \in [0, 1),$ or number $N,$ in the special case $q = 1.$
Define the partial sums $M_i = m_1 + m_2 + \cdots + m_i,$ for
$i = 0, 1, \ldots, r.$ These form a strictly increasing sequence
$M = (M_0, M_1, \ldots, M_r),$ where $M_0 = 0$ and $M_r = N.$ For
$k = 1, 2, \ldots, N,$ therefore, there exists a unique positive
integer $i = f(k, M) \leqslant r$ such that
$M_{i-1} < k \leqslant M_i.$ That means that the $k^\text{th}$
individual in the population has the $i^\text{th}$ attribute value.
In terms of this function $f,$ if $s$ is the list of attribute
values sorted into ascending order, then the $q^\text{th}$ quantile
value of the attribute is (ignoring the special case $q = 1$):
$$
s[f(\left\lfloor{Nq}\right\rfloor + 1, M)].
$$
Here's a toy Python 3 module that does the job. I haven't tried it on any
large arrays. For all I know, the way I've coded it may use tons of resources. (You'll surely need to recode it anyway, for instance to use interpolation.)
"""Compute quantiles: see https://math.stackexchange.com/q/3721765."""
all = ['weighted']
import math, operator, itertools
class weighted(object):
"""
Structure of repeated attribute values in ascending order.
"""
def __init__(self, x, w):
"""
Create sorted data from unsorted attribute values and their "weights".
"""
self.xs, self.ws = zip(*sorted(zip(x, w), key=operator.itemgetter(0)))
self.subtotals = list(itertools.accumulate(self.ws))
self.N = self.subtotals[-1]
def individual(self, q):
"""
Identify individual member of population nearest to the q'th quantile.
"""
return math.floor(q * self.N) + 1 if q < 1 else self.N
def attribute(self, k):
"""
Compute attribute index of k'th individual member of the population.
"""
for i, M in enumerate(self.subtotals):
if M >= k:
return i
def quantile(self, q):
"""
Compute q'th quantile value of the attribute.
"""
return self.xs[self.attribute(self.individual(q))]
def main():
print('median = {}'.format(weighted([6, 4, 2],[1, 3, 5]).quantile(.5)))
if name == 'main':
main()
Version 0.2
This is still a toy implementation. In particular, it still might be hugely
inefficient (I haven't given any thought to that question), and it
still hasn't been tested on any large datasets. What is nice about
it is that the new class multilist is obviously capable of
being considerably elaborated. (No doubt I'll tinker with it a lot,
but there isn't likely to be any good reason to post my tinkerings here.)
I'm not sure how to post code in Maths.SE, so the indentation of the
code isn't quite consistent.
"""Lists of items with multiplicity, analogous to multisets."""
all = ['individual', 'multilist', 'quantile']
import math, itertools
def individual(q, N):
"""
Number (1 to N) of individual near q'th quantile of population of size N.
"""
return math.floor(q*N) + 1 if q < 1 else N
def quantile(x, q):
"""
Compute the q'th quantile value of the given sorted (N.B.!) multilist x.
"""
return x[individual(q, len(x))]
class multilist(object):
"""
List of elements with multiplicity: similar to a multiset, whence the name.
The multiplicity of each element is a positive integer. The purpose of the
multilist is to behave like a list in which each element occurs many times,
without actually having to store all of those occurrences.
"""
def init(self, x, w):
"""
Create multilist from list of values and list of their multiplicities.
"""
self.items = x
self.times = w
self.subtotals = list(itertools.accumulate(self.times))
def len(self):
"""
Get the number of items in a list with multiplicities.
The syntax needed to call this function is "len(x)", where x is the
name of the multilist.
"""
return self.subtotals[-1]
def getitem(self, k):
"""
Find the k'th item in a list with multiplicities.
If the multiplicities are m_1, m_2, ..., m_r (note that Python indices
are 1 less, running from 0 to r - 1), and subtotals M_0, M_1, ..., M_r,
where M_i = m_1 + m_2 + ... + m_i (i = 0, 1, ..., r), then we want the
unique i (but the Python code uses i - 1) such that M_{i-1} < k <= M_i.
The syntax needed to call this function is "x[k]", where x is the name
of the multilist, and 1 <= k <= len(x).
"""
for i, M in enumerate(self.subtotals):
if M >= k:
return self.items[i]
def sorted(self):
"""
Return a sorted copy of the given multilist.
Note on the implementation: by default, 2-tuples in Python are compared
lexicographically, i.e. by the first element, or the second in the case
of a tie; so there is no need for parameter key=operator.itemgetter(0).
"""
return multilist(*zip(*sorted(zip(self.items, self.times))))
def main():
data = multilist([6, 4, 2], [1, 3, 5]).sorted()
print('median = {}'.format(quantile(data, .5)))
if name == 'main':
main()
xws = sorted(zip(x, w), key=operator.itemgetter(0)), and instead of using numpy, write your own Python code to calculate the quantiles of a sorted list of (number, weight) tuples likexws? This approach might be slow, but at least I don't think it risks running out of memory. (Sadly I'm not likely to have time today to try to write some code myself, but it's an interesting question.) – Calum Gilhooley Jun 19 '20 at 15:17