Given percentages table which has a list of categories and their desired percentages, and raw data table (with item_id & category fields), how can I create a random sample of N items (e.g. N = 10) from the data table while ensuring the percentage of items for each category in the final sample is distributed according to percentages table. I.e. random sample with particular categories distribution.
Here we are basically trying to reproduce category distribution from a particular data table in a smaller sample.
with percentages as (
select 'A' as category, 0.3 as category_percentage UNION ALL
'B' , 0.5 UNION ALL
'C' , 0.2
),
data as (
Select 1 as item_id, 'A' as category UNION ALL
Select 2, 'B' UNION ALL
Select 3, 'B' UNION ALL
Select 4, 'C' UNION ALL
Select 5, 'A' UNION ALL
Select 6, 'C' UNION ALL
Select 7, 'B' UNION ALL
Select 8, 'B' UNION ALL
Select 9, 'B' UNION ALL
Select 10, 'A' UNION ALL
Select 11, 'A' UNION ALL
Select 12, 'C' UNION ALL
Select 13, 'C' UNION ALL
Select 14, 'B' UNION ALL
Select 15, 'A' UNION ALL
Select 16, 'C' UNION ALL
Select 17, 'A' UNION ALL
Select 18, 'B'
)
So in example above for N =10 it would create a sample with ten items where 30% (3 items) would belong to category A, 50% (5 items) to category B and 20% (2 items) to category C.
example output (random sample of 10 items with categories distributed based on percentages table):
item_id, category
5 'A'
10 'A'
15 'A'
3 'B'
8 'B'
14 'B'
10 'B'
18 'B'
6 'C'
13 'C'
And related question, if each item in data also has a score field, how can you sample top scored items from data table (i.e. pick top ranked instead of random items) while keeping the category distribution from percentages table
This question is related but in our case the percentages for each category are coming from a separate table Stratified random sampling with BigQuery?