
First, the proviso that I'm only an aspiring mathematician. Secondly, forgive how I've articulated this question in Python, but it is essentially a computational problem I'm looking to solve.

I wonder if there's an efficient way to calculate quantile breaks for a weighted array (i.e. a NumPy array, so possibly a mathematical vector?). By a weighted array, consider an array $x = [x_1, x_2, \ldots, x_n]$ which has a corresponding array of weights $w = [w_1, w_2, \ldots, w_n]$. In my current workflow I unpack $x$ into a new array $xw$ in which each element $x_i$ is repeated $w_i$ times, and I then calculate its distribution statistics (e.g. quartiles). But the unpacking is very computationally expensive, so I'm exploring the possibility of an alternative.

To describe the problem in Python (also see here):

import numpy as np
import random

# random log distribution
x = np.random.lognormal(mean = 7, sigma = 0.7, size = 10000)
x = np.int64(np.ceil(x))

View histogram:

import matplotlib
import matplotlib.pyplot as plt
tmp = plt.hist(x, bins=1000, alpha=0.6, color='c', ec='c', log=True)

[log-scale histogram of x]

Apply weights w to array x:

def weighted_array(arr, weights):
    repeated_array = list(map(lambda f, i: [f] * int(i), arr, weights))
    return np.array([item for sublist in repeated_array for item in sublist])

weighted_array([6,4,2], [1,3,5])  # for example
#> array([6, 4, 4, 4, 2, 2, 2, 2, 2])

For simplicity let's weight x array by itself (i.e. w = x)

xw = weighted_array(x, x)
len(xw)
#> 14092084

stats = np.quantile(xw, [.05, .25, .5, .75, .95])
print(stats)
#> [ 563. 1101. 1771. 2854. 5523.]

The process of generating xw is very expensive for large arrays of large numbers and easily exhausts system memory. So I wonder if there is a mathematical way to calculate stats from the original x and w arrays, without having to apply the weights to generate xw?

Thanks in advance!

geotheory
  • Perhaps you could start with something like (in Python 3) xws = sorted(zip(x, w), key=operator.itemgetter(0)), and instead of using numpy, write your own Python code to calculate the quantiles of a sorted list of (number, weight) tuples like xws? This approach might be slow, but at least I don't think it risks running out of memory. (Sadly I'm not likely to have time today to try to write some code myself, but it's an interesting question.) – Calum Gilhooley Jun 19 '20 at 15:17
  • Not a bad idea. I'll see what I can come up with. Thanks Calum. – geotheory Jun 19 '20 at 17:17
  • 2
    There are many web results for "find the median from a histogram". Most of them can easily be modified to find quartiles, or any other percentile. None require that your weights be integers. – Ethan Bolker Jun 23 '20 at 00:33
  • @EthanBolker That comment could be elaborated into a short but good answer. – Calum Gilhooley Jun 23 '20 at 00:48
  • For what it's worth this is what I came up with in the end https://math.stackexchange.com/a/3731398/77811 – geotheory Jun 23 '20 at 12:49
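For reference, the histogram idea in Ethan Bolker's comment can be sketched like this (a minimal sketch; the function name hist_quantile is my own, and it works for real-valued weights too):

```python
import numpy as np

def hist_quantile(values, weights, q):
    # Sort by value, accumulate the weights, and return the first value
    # whose cumulative weight reaches q times the total weight.
    order = np.argsort(values)
    v = np.asarray(values)[order]
    cw = np.cumsum(np.asarray(weights, dtype=float)[order])
    return v[np.searchsorted(cw, q * cw[-1])]

hist_quantile([6, 4, 2], [1.5, 3.25, 5.0], .5)  # -> 2
```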

3 Answers

import numpy as np
your_data    = [ 1.7 , 2.2 , 3.9 ]
your_weights = [ 2 , 1 , 5 ]
xw = np.repeat( your_data , your_weights )

You should find that xw is

[ 1.7 , 1.7 , 2.2 , 3.9 , 3.9 , 3.9 , 3.9 , 3.9 ]

Unfortunately numpy doesn't have built-in weighted versions of everything, but you can put things together in this way.
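For instance, a minimal sketch combining np.repeat with np.quantile (integer weights assumed):

```python
import numpy as np

x = np.array([6, 4, 2])
w = np.array([1, 3, 5])

# Expand each value by its integer weight, then use numpy's quantile directly.
xw = np.repeat(x, w)
print(np.quantile(xw, [.25, .5, .75]))  # -> [2. 2. 4.]
```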

Quillo

For simplicity, I'll assume that interpolation isn't needed, and that it suffices to find the individual nearest to the $q^\text{th}$ quantile point, where $0 \leqslant q \leqslant 1.$

Suppose that the population consists of $N$ individuals, sorted in ascending order of the values of some attribute. Suppose that there are $r$ different attribute values, and that $m_i$ individuals have the $i^\text{th}$ value of the attribute, for $i = 1, 2, \ldots, r.$ Then $m_1 + m_2 + \cdots + m_r = N.$

Represent the $k^\text{th}$ individual as the centre of a notional continuous interval $[k - \tfrac12, k + \tfrac12),$ for $k = 1, 2, \ldots, N.$ Then the entire population occupies the interval $[\tfrac12, N + \tfrac12),$ and the $q^\text{th}$ quantile point is at $Nq + \tfrac12.$ We simplistically replace this with the nearest integer, rounding down in the ambiguous case when $Nq$ is an integer. Thus we take the $q^\text{th}$ quantile to be individual number $\left\lfloor{Nq}\right\rfloor + 1,$ for $q \in [0, 1),$ or number $N,$ in the special case $q = 1.$

Define the partial sums $M_i = m_1 + m_2 + \cdots + m_i,$ for $i = 0, 1, \ldots, r.$ These form a strictly increasing sequence $M = (M_0, M_1, \ldots, M_r),$ where $M_0 = 0$ and $M_r = N.$ For $k = 1, 2, \ldots, N,$ therefore, there exists a unique positive integer $i = f(k, M) \leqslant r$ such that $M_{i-1} < k \leqslant M_i.$ That means that the $k^\text{th}$ individual in the population has the $i^\text{th}$ attribute value.

In terms of this function $f,$ if $s$ is the list of attribute values sorted into ascending order, then the $q^\text{th}$ quantile value of the attribute is (ignoring the special case $q = 1$): $$ s[f(\left\lfloor{Nq}\right\rfloor + 1, M)]. $$
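The formula above can be sketched compactly with bisect (a minimal sketch; the helper name weighted_quantile is my own, and integer weights are assumed, as in the toy module below):

```python
import bisect
import itertools

def weighted_quantile(x, w, q):
    # Sort attribute values, form the partial sums M_1, ..., M_r, then
    # evaluate s[f(floor(Nq) + 1, M)] as described in the text.
    xs, ws = zip(*sorted(zip(x, w)))
    M = list(itertools.accumulate(ws))
    N = M[-1]
    k = int(q * N) + 1 if q < 1 else N   # individual number floor(Nq) + 1
    return xs[bisect.bisect_left(M, k)]  # smallest i with M_i >= k

weighted_quantile([6, 4, 2], [1, 3, 5], .5)  # -> 2
```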

Here's a toy Python 3 module that does the job. I haven't tried it on any large arrays. For all I know, the way I've coded it may use tons of resources. (You'll surely need to recode it anyway, for instance to use interpolation.)

"""Compute quantiles: see https://math.stackexchange.com/q/3721765."""

all = ['weighted']

import math, operator, itertools

class weighted(object): """ Structure of repeated attribute values in ascending order. """

def __init__(self, x, w):
    &quot;&quot;&quot;
    Create sorted data from unsorted attribute values and their &quot;weights&quot;.
    &quot;&quot;&quot;
    self.xs, self.ws = zip(*sorted(zip(x, w), key=operator.itemgetter(0)))
    self.subtotals = list(itertools.accumulate(self.ws))
    self.N = self.subtotals[-1]

def individual(self, q):
    &quot;&quot;&quot;
    Identify individual member of population nearest to the q'th quantile.
    &quot;&quot;&quot;
    return math.floor(q * self.N) + 1 if q &lt; 1 else self.N

def attribute(self, k):
    &quot;&quot;&quot;
    Compute attribute index of k'th individual member of the population.
    &quot;&quot;&quot;
    for i, M in enumerate(self.subtotals):
        if M &gt;= k:
            return i

def quantile(self, q):
    &quot;&quot;&quot;
    Compute q'th quantile value of the attribute.
    &quot;&quot;&quot;
    return self.xs[self.attribute(self.individual(q))]

def main(): print('median = {}'.format(weighted([6, 4, 2],[1, 3, 5]).quantile(.5)))

if name == 'main': main()

Version 0.2

This is still a toy implementation. In particular, it still might be hugely inefficient (I haven't given any thought to that question), and it still hasn't been tested on any large datasets. What is nice about it is that the new class multilist is obviously capable of being considerably elaborated. (No doubt I'll tinker with it a lot, but there isn't likely to be any good reason to post my tinkerings here.)


"""Lists of items with multiplicity, analogous to multisets."""

all = ['individual', 'multilist', 'quantile']

import math, itertools

def individual(q, N): """ Number (1 to N) of individual near q'th quantile of population of size N. """ return math.floor(q*N) + 1 if q < 1 else N

def quantile(x, q): """ Compute the q'th quantile value of the given sorted (N.B.!) multilist x. """ return x[individual(q, len(x))]

class multilist(object): """ List of elements with multiplicity: similar to a multiset, whence the name.

The multiplicity of each element is a positive integer. The purpose of the
multilist is to behave like a list in which each element occurs many times,
without actually having to store all of those occurrences.
&quot;&quot;&quot;

def init(self, x, w): """ Create multilist from list of values and list of their multiplicities. """ self.items = x self.times = w self.subtotals = list(itertools.accumulate(self.times))

def len(self): """ Get the number of items in a list with multiplicities.

The syntax needed to call this function is &quot;len(x)&quot;, where x is the
name of the multilist.
&quot;&quot;&quot;
return self.subtotals[-1]

def getitem(self, k): """ Find the k'th item in a list with multiplicities.

If the multiplicities are m_1, m_2, ..., m_r (note that Python indices
are 1 less, running from 0 to r - 1), and subtotals M_0, M_1, ..., M_r,
where M_i = m_1 + m_2 + ... + m_i (i = 0, 1, ..., r), then we want the
unique i (but the Python code uses i - 1) such that M_{i-1} &lt; k &lt;= M_i.

The syntax needed to call this function is &quot;x[k]&quot;, where x is the name
of the multilist, and 1 &lt;= k &lt;= len(x).
&quot;&quot;&quot;
for i, M in enumerate(self.subtotals):
    if M &gt;= k:
        return self.items[i]

def sorted(self): """ Return a sorted copy of the given multilist.

Note on the implementation: by default, 2-tuples in Python are compared
lexicographically, i.e. by the first element, or the second in the case
of a tie; so there is no need for parameter key=operator.itemgetter(0).
&quot;&quot;&quot;
return multilist(*zip(*sorted(zip(self.items, self.times))))

def main(): data = multilist([6, 4, 2], [1, 3, 5]).sorted() print('median = {}'.format(quantile(data, .5)))

if name == 'main': main()

  • This is excellent work. I think it gets me most of the way, so many thanks. – geotheory Jun 21 '20 at 01:18
  • I hope @EthanBolker can be persuaded to write an answer (based on his comment) that can be accepted in place of this one. I was intrigued by the idea of a sequence in which items could be virtually repeated, while only being stored once. Perhaps indeed that idea does have some applications, but, as an answer to this question, it showed tunnel vision. It is both simpler and more general to simply sort the lists of attributes and their weights (perhaps in the way shown here, unless there's a better way to do it in Python, as I hoped there would be), and apply formulae for positive real weights. – Calum Gilhooley Jun 23 '20 at 11:23

I'm not sure if it's the best approach, but this is what I came up with.

import numpy as np
import random
import pandas as pd

d = pd.DataFrame(data = {
    'x': np.random.lognormal(mean = 7, sigma = 0.7, size = 1000),
    'wgts': np.int64(np.random.sample(1000) * 1000)
})

Sorted-Cumulative-Weight-Breaks™ method

quantile = sum(d['wgts']) / 4
d = d.sort_values('x')                             # sort
d['cum_wgts'] = np.cumsum(d.wgts)                  # cumulative weights
d['qtl'] = np.int64(d.cum_wgts / quantile)         # quantile binning
d['new_qtl'] = np.invert(d.qtl.eq(d.qtl.shift()))  # to filter at breaks
quartiles2 = d[d.new_qtl].x

My original method:

def weighted_array(arr, weights):
    repeated_array = list(map(lambda f, i: [f] * int(i), arr, weights))
    return np.array([item for sublist in repeated_array for item in sublist])

xw = weighted_array(d.x, d.wgts)
quartiles1 = np.quantile(xw, [0, .25, .5, .75, 1])

Results comparison:

print(np.int64(quartiles1))
print(np.int64(quartiles2))
#> [  170   679  1161  1860 12613]
#> [  170   679  1161  1860 12613]

Views much appreciated.

geotheory
  • This method also handles floating point weights, but I wonder if it will fail with some edge cases (e.g. individual lines reflecting multiple quantiles). – geotheory Jun 23 '20 at 12:58