I would shuffle the array as follows:
Code
def weighted_shuffle(array)
  # sort by decreasing weight so the cumulative scan below tends to
  # terminate early
  arr = array.sort_by { |h| -h[:weight] }
  tot_wt = arr.reduce(0) { |t,h| t + h[:weight] }
  ndx_left = arr.each_index.to_a
  arr.size.times.with_object([]) do |_,a|
    cum = 0
    rn = (tot_wt > 0) ? rand(tot_wt) : 0
    # select the first not-yet-chosen element whose cumulative weight
    # reaches rn
    ndx = ndx_left.find { |i| rn <= (cum += arr[i][:weight]) }
    a << arr[ndx]
    # remove the selected index and its weight from further consideration
    tot_wt -= arr[ndx_left.delete(ndx)][:weight]
  end
end
Examples
elements = [
  { :id => "ID_1", :weight => 100 },
  { :id => "ID_2", :weight => 200 },
  { :id => "ID_3", :weight => 600 }
]
def display(arr, n)
  n.times { p weighted_shuffle(arr).map { |h| h[:id] } }
end
display(elements,10)
["ID_3", "ID_2", "ID_1"]
["ID_1", "ID_3", "ID_2"]
["ID_1", "ID_3", "ID_2"]
["ID_3", "ID_2", "ID_1"]
["ID_3", "ID_2", "ID_1"]
["ID_2", "ID_3", "ID_1"]
["ID_2", "ID_3", "ID_1"]
["ID_3", "ID_1", "ID_2"]
["ID_3", "ID_1", "ID_2"]
["ID_3", "ID_2", "ID_1"]
n = 10_000
pos = elements.each_index.with_object({}) { |i,h| h[i] = Hash.new(0) }
n.times { weighted_shuffle(elements).each_with_index { |h,i|
  pos[i][h[:id]] += 1 } }
pos.each { |_,h| h.each_key { |k| h[k] = (h[k]/n.to_f).round(3) } }
#=> {0=>{"ID_3"=>0.661, "ID_2"=>0.224, "ID_1"=>0.115},
# 1=>{"ID_2"=>0.472, "ID_3"=>0.278, "ID_1"=>0.251},
# 2=>{"ID_1"=>0.635, "ID_2"=>0.304, "ID_3"=>0.061}}
This says that, of the 10,000 times weighted_shuffle was called, the first element selected was "ID_3" 66.1% of the time, "ID_2" 22.4% of the time and "ID_1" the remaining 11.5% of the time. "ID_2" was selected second 47.2% of the time, and so on.
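As a sanity check, the exact position probabilities can be computed by brute force, enumerating all permutations and multiplying weight/remaining-total at each draw. Something like the following would do (exact_position_probs is just an illustrative helper, not part of the method):
def exact_position_probs(elements)
  pos = Hash.new { |h,k| h[k] = Hash.new(0.0) }
  elements.permutation.each do |perm|
    rem  = elements.reduce(0) { |t,h| t + h[:weight] }
    prob = 1.0
    # probability of drawing this particular permutation without replacement
    perm.each { |h| prob *= h[:weight].fdiv(rem); rem -= h[:weight] }
    perm.each_with_index { |h,i| pos[i][h[:id]] += prob }
  end
  pos.each { |_,h| h.each_key { |k| h[k] = h[k].round(3) } }
end
exact_position_probs(elements)
  #=> {0=>{"ID_1"=>0.111, "ID_2"=>0.222, "ID_3"=>0.667},
  #    1=>{"ID_2"=>0.472, "ID_3"=>0.274, "ID_1"=>0.254},
  #    2=>{"ID_3"=>0.06,  "ID_2"=>0.306, "ID_1"=>0.635}}
These agree closely with the simulated frequencies above.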
Explanation
arr is the array of hashes to be shuffled. The shuffle is performed in arr.size steps. At each step I randomly draw an element of arr, without replacement, using the weights provided. If the values h[:weight] sum to tot over all elements h of arr that have not been previously selected, the probability that any particular one of those hashes h is selected is h[:weight]/tot. The selection at each step is made by finding the first cumulative weight p for which rand(tot) <= p (see the worked example after the sorted output below). This step is made more efficient by pre-sorting the elements by decreasing weight, which is done in the first line of the method:
elements.sort_by { |h| -h[:weight] }
  #=> [{ :id => "ID_3", :weight => 600 },
  #    { :id => "ID_2", :weight => 200 },
  #    { :id => "ID_1", :weight => 100 }]
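To make the cumulative scan concrete, here is one draw worked by hand (taking the random number to be 650, purely for illustration):
arr = [{ :id => "ID_3", :weight => 600 },
       { :id => "ID_2", :weight => 200 },
       { :id => "ID_1", :weight => 100 }]
rn  = 650   # suppose rand(900) returned 650
cum = 0
arr.find { |h| rn <= (cum += h[:weight]) }
  #=> { :id => "ID_2", :weight => 200 }
  # the cumulative weights are 600, 800, 900; 800 is the first to reach 650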
This is implemented using an array of the indices of arr that have not yet been selected, called ndx_left, over which the iteration is performed. After the hash h at index i is selected, tot is updated by subtracting h[:weight] and i is deleted from ndx_left.
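Continuing the hand-worked draw, after "ID_2" (at index 1 of arr) is selected the bookkeeping proceeds as follows:
ndx_left = [0, 1, 2]
tot_wt   = 900
ndx      = 1
tot_wt -= arr[ndx_left.delete(ndx)][:weight]   # delete(1) returns 1
ndx_left   #=> [0, 2]
tot_wt     #=> 700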
Variant
The following is a variant of the method above:
def weighted_shuffle_variant(array)
  arr = array.sort_by { |h| -h[:weight] }
  tot_wt = arr.reduce(0) { |t,h| t + h[:weight] }
  n = arr.size
  n.times.with_object([]) do |_,a|
    cum = 0
    rn = (tot_wt > 0) ? rand(tot_wt) : 0
    h, ndx = arr.each_with_index.find { |h,_| rn <= (cum += h[:weight]) }
    a << h
    tot_wt -= h[:weight]
    # replace the selected element with the last one and pop it off the end
    arr[ndx] = arr.pop
  end
end
Rather than maintaining an array of indices of the elements of arr that have not yet been selected, arr is modified in place, its size reduced by one each time an element is selected. If the element arr[i] is selected, the last element is copied to offset i and then removed from the end of arr:
arr[i] = arr.pop
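For example, with a toy array, supposing the element at index 1 was just selected:
arr = [:a, :b, :c, :d]
arr[1] = arr.pop   # :d overwrites the selected element :b
arr                #=> [:a, :d, :c]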
Benchmark
The approach of replicating each element h of elements h[:weight] times, shuffling, and then uniqifying the result is excruciatingly inefficient. If that's not obvious, here's a benchmark. I've compared my weighted_shuffle with @Mori's solution, which is representative of the "replicate, shuffle, delete" approach:
def mori_shuffle(array)
  array.flat_map { |h| [h[:id]] * h[:weight] }.shuffle.uniq
end
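To see why it degrades, note that the intermediate array contains one element per unit of weight. For the elements above:
elements.flat_map { |h| [h[:id]] * h[:weight] }.size
  #=> 900
With 100 elements and 6-digit weights, that intermediate array can hold tens of millions of elements, all of which must then be shuffled and de-duplicated.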
require 'benchmark'

def test_em(nelements, ndigits)
  puts "\nelements.size=>#{nelements}, weights have #{ndigits} digits\n\n"
  mx = 10**ndigits
  elements = nelements.times.map { |i| { id: i, weight: rand(mx) } }
  Benchmark.bm(16) do |x|
    x.report("mori_shuffle")     { mori_shuffle(elements) }
    x.report("weighted_shuffle") { weighted_shuffle(elements) }
  end
end
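The timings below come from calls such as the following, one per elements-size/digits combination shown (this driver is my summary of the runs, inferred from the output headings):
[[3,1], [3,2], [3,3], [3,4],
 [20,2], [20,3], [20,4],
 [100,2], [100,3], [100,4], [100,5], [100,6]].each { |n,d| test_em(n,d) }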
elements.size=>3, weights have 1 digits
user system total real
mori_shuffle 0.000000 0.000000 0.000000 ( 0.000068)
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000051)
elements.size=>3, weights have 2 digits
user system total real
mori_shuffle 0.000000 0.000000 0.000000 ( 0.000035)
weighted_shuffle 0.010000 0.000000 0.010000 ( 0.000026)
elements.size=>3, weights have 3 digits
user system total real
mori_shuffle 0.000000 0.000000 0.000000 ( 0.000161)
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000027)
elements.size=>3, weights have 4 digits
user system total real
mori_shuffle 0.000000 0.000000 0.000000 ( 0.000854)
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000026)
elements.size=>20, weights have 2 digits
user system total real
mori_shuffle 0.000000 0.000000 0.000000 ( 0.000089)
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000090)
elements.size=>20, weights have 3 digits
user system total real
mori_shuffle 0.000000 0.000000 0.000000 ( 0.000771)
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000071)
elements.size=>20, weights have 4 digits
user system total real
mori_shuffle 0.000000 0.000000 0.000000 ( 0.005895)
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000073)
elements.size=>100, weights have 2 digits
user system total real
mori_shuffle 0.000000 0.000000 0.000000 ( 0.000446)
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000683)
elements.size=>100, weights have 3 digits
user system total real
mori_shuffle 0.010000 0.000000 0.010000 ( 0.003765)
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000659)
elements.size=>100, weights have 4 digits
user system total real
mori_shuffle 0.030000 0.010000 0.040000 ( 0.034982)
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000638)
elements.size=>100, weights have 5 digits
user system total real
mori_shuffle 0.550000 0.040000 0.590000 ( 0.593190)
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000623)
elements.size=>100, weights have 6 digits
user system total real
mori_shuffle 5.560000 0.380000 5.940000 ( 5.944749)
weighted_shuffle 0.010000 0.000000 0.010000 ( 0.000636)
Comparison of weighted_shuffle and weighted_shuffle_variant
Considering that the benchmark engine is all warmed up, I may as well compare the two methods I suggested. The results are similar, with weighted_shuffle having a consistent edge.
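A driver along the lines of test_em would do; here is a sketch of what it might look like (the method name test_variants is just for illustration):
def test_variants(nelements, ndigits)
  puts "\nelements.size=>#{nelements}, weights have #{ndigits} digits\n\n"
  mx = 10**ndigits
  elements = nelements.times.map { |i| { id: i, weight: rand(mx) } }
  Benchmark.bm(24) do |x|
    x.report("weighted_shuffle")         { weighted_shuffle(elements) }
    x.report("weighted_shuffle_variant") { weighted_shuffle_variant(elements) }
  end
end
Here are some typical results: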
elements.size=>20, weights have 3 digits
user system total real
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000062)
weighted_shuffle_variant 0.000000 0.000000 0.000000 ( 0.000108)
elements.size=>20, weights have 4 digits
user system total real
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000060)
weighted_shuffle_variant 0.000000 0.000000 0.000000 ( 0.000089)
elements.size=>100, weights have 2 digits
user system total real
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000666)
weighted_shuffle_variant 0.000000 0.000000 0.000000 ( 0.000871)
elements.size=>100, weights have 4 digits
user system total real
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000625)
weighted_shuffle_variant 0.000000 0.000000 0.000000 ( 0.000803)
elements.size=>100, weights have 6 digits
user system total real
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000664)
weighted_shuffle_variant 0.000000 0.000000 0.000000 ( 0.000773)
As compared to weighted_shuffle, weighted_shuffle_variant does not maintain an array of indices of the not-yet-selected elements of (a copy of) elements, which saves some time. Instead, it overwrites each selected element with the last element of the array and pops the last element, so the array shrinks by one at each step. Unfortunately, that destroys the ordering of elements by decreasing weight, whereas weighted_shuffle preserves that ordering and therefore tends to terminate its cumulative scans early. On balance, keeping the elements sorted appears to matter more than avoiding the index bookkeeping.
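Here is a small demonstration of how the variant loses the ordering, supposing the heaviest element is selected first:
arr = [{ :id => "ID_3", :weight => 600 },
       { :id => "ID_2", :weight => 200 },
       { :id => "ID_1", :weight => 100 }]
arr[0] = arr.pop
arr
  #=> [{ :id => "ID_1", :weight => 100 }, { :id => "ID_2", :weight => 200 }]
  # no longer sorted by decreasing weight, so later cumulative scans tend
  # to examine more elements before finding the one selected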