
I have always been interested in learning how I can make a custom Probability Distribution that corresponds to some particular situation (e.g. constraints).

For example - suppose I have a single dice that has 100 sides and the following conditions:

  • Condition 1: This dice has the highest probability of landing on 50
  • Condition 2: Sides closer to 50 have higher probabilities than sides farther away from 50 (i.e. the probability is strictly decreasing with distance from 50, e.g. Probability of landing on side 50 > 49 > 48 > ... AND Probability of landing on side 50 > 51 > 52 > ...)
  • Condition 3: Sides at equal distance to 50 have the same probabilities (e.g. Probability of landing on side 49 = Probability of landing on side 51, Probability of landing on side 48 = Probability of landing on side 52, etc.)

My Question: I want to make a separate dice that corresponds to each one of these situations below:

  • Situation 1: I want to create a dice that satisfies Condition 1, Condition 2 and Condition 3. How can I define a Probability Distribution Function for this situation?
  • Situation 2: I want to create a dice that satisfies Condition 1, Condition 2 and Condition 3 AND the probability of landing on side 50 is given by $p_{50} = 0.5$. How can I define a Probability Distribution Function for this situation?
  • Situation 3: I want to create a dice that satisfies Condition 1, Condition 2 and Condition 3 AND the probability of landing on side 50 is given by $p_{50} = 0.5$ AND the probability of landing on side 49 = side 51 = 0.3. How can I define a Probability Distribution Function for this situation?

I am not sure how to solve these kinds of questions analytically. Ideally, I would be interested in defining an exact theoretical probability distribution corresponding to each situation (e.g. a multinomial distribution with certain properties).

What I tried so far: For the time being, I tried to solve this question by simulation (e.g. Situation 1). Using the R programming language, I simulated numbers from a Normal Distribution (centered around 50), truncated the results (i.e. only allowed numbers between 0 and 100), and calculated the probabilities of landing between any given ranges:

    # define mean, standard deviation of a normal distribution with a large number of simulations
    mean <- 50
    sd <- 15
    n <- 100000

    # simulate from this normal distribution
    set.seed(123)
    numbers <- rnorm(n, mean, sd)

    # truncate the distribution (i.e. only keep numbers between 0 and 100)
    numbers <- ifelse(numbers < 0, 0, ifelse(numbers > 100, 100, numbers))

    # define the intervals
    min_interval <- seq(0, 99, by = 1)
    max_interval <- seq(1, 100, by = 1)

    count <- vector("numeric", length(min_interval))
    percentage <- vector("numeric", length(min_interval))

    # calculate the count and percentage of numbers in each interval
    for (i in seq_along(min_interval)) {
      count[i] <- sum(numbers >= min_interval[i] & numbers < max_interval[i])
      percentage[i] <- count[i] / length(numbers) * 100
    }

    # store results
    df <- data.frame(min_interval = min_interval, max_interval = max_interval, count = count, percentage = percentage)

    # sort results
    df <- df[order(-df$percentage), ]

As we can see, the results of this simulation approximately correspond to Situation 1 (Condition 2 and Condition 3 are not fully met):

# plot results
plot(density(numbers))

[Image: density plot of the simulated numbers]

We can see that numbers around 50 have higher probabilities (i.e. percentage/100) compared to numbers further away from 50 (even though Condition 2 and Condition 3 are not fully met):

# view results

    head(df)
       min_interval max_interval count percentage
    51           50           51  2714      2.714
    49           48           49  2632      2.632
    50           49           50  2628      2.628
    53           52           53  2626      2.626
    48           47           48  2615      2.615
    54           53           54  2611      2.611

    tail(df)
        min_interval max_interval count percentage
    95            94           95    22      0.022
    3              2            3    19      0.019
    100           99          100    16      0.016
    4              3            4    14      0.014
    98            97           98    12      0.012
    99            98           99    10      0.010
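As a cross-check on the simulated percentages, the same bucket probabilities can be computed exactly from the normal CDF rather than estimated. The following is only a minimal sketch, assuming the same mean 50 and standard deviation 15 and ignoring the clamping at 0 and 100; the names `mu`, `sigma`, `exact` and `df_exact` are illustrative and not part of the code above.

```r
# Exact probability of each unit bucket [i, i+1) under Normal(50, 15),
# computed from the CDF instead of by simulation.
mu <- 50
sigma <- 15
edges <- 0:100
exact <- diff(pnorm(edges, mean = mu, sd = sigma))  # 100 bucket probabilities

df_exact <- data.frame(min_interval = 0:99,
                       max_interval = 1:100,
                       probability = exact)

# The buckets [49, 50) and [50, 51) tie for the largest probability (about 0.0266).
head(df_exact[order(-df_exact$probability), ], 2)
```

The tie between those two buckets suggests that this bucketing is centered between two outcomes rather than on a single value, which is one reason Condition 1 is only approximately met.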

I also included an optional visualization:

    library(ggplot2)

    ggplot(df, aes(x = min_interval, y = percentage / 100)) +
      geom_bar(stat = "identity", fill = "steelblue") +
      labs(x = "Min Interval", y = "Probability",
           title = "Approximation of Discrete Probability Distribution for Situation 1") +
      theme_minimal()

    ggplot(df, aes(x = min_interval, y = percentage)) +
      geom_bar(stat = "identity", fill = "steelblue", color = "steelblue", width = 1) +
      labs(x = "Min Interval", y = "Percentage",
           title = "Approximation of Discrete Probability Distribution for Situation 1") +
      theme_minimal()

[Image: bar chart of the estimated probability for each interval]

But is there a way to mathematically (i.e. analytically) calculate these probabilities for Situation 1, Situation 2 and Situation 3? Can a system of equations be created corresponding to Situation 1, Situation 2 and Situation 3 alongside a set of constraints - such that these probabilities can be calculated analytically? Perhaps this can be done with a Multinomial Distribution? Maybe an Exponential Decay Function that can be used such that it passes through all points?

Thanks!

  • Notes:
  • My own previous (incorrect) attempt at approaching a similar question: "How to Define a Bell Curve"
  • Is this current question even possible? Will it require complex non-linear optimization algorithms?
stats_noob
  • You talk about "distance" and "sides" as if the die is a geometric object, but in your analysis, you simply assume that these properties reduce down to the absolute value of the difference of the face value from $50$, rather than specifying the geometry and labeling of an actual die. Then you impose a criterion that makes no sense: condition 3 would imply that the sum of the probabilities on three faces exceeds $1$. The probability of obtaining 49, 50, or 51 would be $0.5 + 0.3 + 0.3 > 1.$ – heropup Nov 11 '23 at 17:31
  • @ heropup: thank you for your reply! My analysis (i.e. the R computer code) was a very crude way of solving this problem that is inexact. I was just looking for something to get started with. – stats_noob Nov 11 '23 at 17:53
  • Just to clarify, I meant that: There are 3 general conditions (Condition 1, Condition 2, Condition 3). Then, there are 3 separate situations (Situation 1, Situation 2, Situation 3). I want to create a new dice for each of these 3 situations, but each dice obeys the 3 general conditions. – stats_noob Nov 11 '23 at 17:57
  • Is this an attempt to rewrite https://math.stackexchange.com/questions/4804681/how-to-define-a-bell-curve ? or https://math.stackexchange.com/questions/4805161/is-there-a-discrete-version-of-the-normal-distribution ? – Gerry Myerson Nov 12 '23 at 02:53
  • @Gerry Myerson: thank you for your reply! The question you linked was my initial attempt at solving this question. This new question is a later attempt after doing more work. I am actually planning on deleting the older question you linked. – stats_noob Nov 12 '23 at 02:55
  • It might be better to keep all the related questions up and each linked to the others, so people could see where you're coming from and what has already been accepted or rejected before they reinvent the wheel. – Gerry Myerson Nov 12 '23 at 03:00
  • @ Gerry Myerson: Great suggestion! I will work on this – stats_noob Nov 12 '23 at 03:22
  • It seems to me that your "truncation" of the distribution gathered all outcomes less than $0$ into the "$0$" bucket while discarding all outcomes greater than $100$. So if there was anything to truncate, this "truncation" will unbalance the distribution slightly. – David K Nov 12 '23 at 23:57
  • It also seems to me that you have a bucket for results $49\leq x<50$ and a bucket for results $50\leq x<51$ which have equal probability according to a normal distribution with mean $50$, and these buckets each have greater probability than any other bucket. So you have a symmetric discrete distribution, but it isn't centered at the discrete value $50$, it's centered between the two equally likely outcomes $49$ and $50$ with all other probabilities symmetric around that center ($p_{48}=p_{51}$, for example). This is not easy to see from the simulation because the simulation is, ahem, random. – David K Nov 13 '23 at 00:01
  • @ David K: thank you so much for your comments- much appreciated. – stats_noob Nov 13 '23 at 04:21

2 Answers


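The probabilities have to add up to $1$. First subtract any fixed probabilities, e.g. $p_{50}=0.5$, which leaves $0.5$ for the remaining sides.

Then put $p_{1}=k, p_{2}=2k$, etc., up to $p_{49}=49k$.

Then $2(k+2k+3k+\dots+49k)=0.5$, where the factor $2$ accounts for the mirrored sides above 50.

This gives $k=\frac{1}{4900}$; make the probabilities for 51, 52, etc. symmetrical with those for 49, 48, etc.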

(the sum of numbers 1 to $n$ is $\frac{1}{2}n(n+1)$)

You can adjust this method depending on what central values are being fixed.
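The construction above can be sketched in R. This is only an illustration under stated assumptions: the variable names `p50`, `k`, `probs` and `roll` are made up here, and side 100, which has no mirror image below 50, ends up with probability 0 because the other 99 sides already sum to 1.

```r
# Linear-weights die on sides 1..100 with the central probability fixed.
p50 <- 0.5
k <- (1 - p50) / (2 * sum(1:49))   # from 2 * (k + 2k + ... + 49k) = 1 - p50

probs <- numeric(100)
probs[50] <- p50
probs[1:49] <- k * (1:49)          # p_1 = k, p_2 = 2k, ..., p_49 = 49k
probs[51:99] <- rev(probs[1:49])   # mirror: p_51 = p_49, ..., p_99 = p_1
# probs[100] stays 0: the remaining sides already use up all the probability.

k            # 1/4900
sum(probs)   # 1, up to floating-point rounding

# Roll the die ten times.
roll <- sample(1:100, size = 10, replace = TRUE, prob = probs)
```

Changing `p50`, or fixing more central values before solving for `k`, adapts the same idea to other constraints.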

John Hunter
  • @ John Hunter: thank you so much for your answer! Can your answer be used to create a multinomial distribution? Thanks! – stats_noob Nov 22 '23 at 02:35
  • @stats noob I don't think you can have p(50) = 0.5 and still have a binomial shaped distribution, but you could use the Binomial B(99,0.5) or B(100,0.5) probabilities, they give the right shape and have 100 different probs, that peak at 50. It's the same as the probabilities of getting k heads when flipping a coin 100 times, 50 heads is most likely, just that the peak probability won't be as high as 0.5, it's just too high – John Hunter Nov 23 '23 at 08:44

Situation 3 does not obey the laws of probability since $$p(50)+p(49)+p(51)=0.5+0.3+0.3=1.1$$

When creating a probability distribution, ensure normalization: the sum of probabilities in the sample space must be 1.

Situation 2 is asymmetrical, so modelling it analytically becomes more difficult: there are 49 outcomes below 50 but 50 outcomes above it. $$1<2<\dots<49: 49 \text{ outcomes}$$ $$51<52<\dots<100: 50 \text{ outcomes}$$

With 101 outcomes, the modelling is easier:

Binomial distribution: taking $n=100$ trials and probability of success $p=\frac{1}{2}$, and labelling the outcome of $x$ successes ($x=0,\dots,100$) as side $x+1$, gives a symmetric distribution on 101 outcomes with: $$p(1)<p(2)<\dots<p(50)<p(51),\quad p(51)>p(52)>\dots>p(101)$$
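A binomial count over 100 trials takes exactly 101 values (0 to 100), so one way to sketch a 101-sided binomial die in R is with `dbinom`, shifting the count by one so the sides are labelled 1 to 101. The names `sides` and `probs` are illustrative.

```r
# Binomial-shaped die: side s corresponds to s - 1 successes in 100 fair trials.
sides <- 1:101
probs <- dbinom(sides - 1, size = 100, prob = 0.5)
names(probs) <- sides

sum(probs)         # 1, up to floating-point rounding
which.max(probs)   # the unique peak is at side 51
probs["51"]        # about 0.0796, far below 0.5
```

The peak probability is fixed by the binomial shape, so this family cannot also satisfy a requirement such as a central probability of 0.5.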

Geometric parameter: generalizing $p(51)=p, p(50)=p(52)=kp, p(49)=p(53)=k^2p$ gives: $$p(51-n)=p(51+n)=k^np$$

Summing all the probabilities to 1 gives: $$p+k(2p)+k^2(2p)+\dots+k^{50}(2p)=1$$

In your case the peak probability is $p=0.5$, so $2p=1$: $$0.5+k+k^2+\dots+k^{50}=1$$ $$k+k^2+\dots+k^{50}=0.5$$

This is now a geometric series with first term $k$, common ratio $k$, and 50 terms. Use the formula for the sum of a geometric series, $$\frac{k(1-k^{50})}{1-k}=0.5,$$ to find the value of $k$ and you are done.
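The equation for $k$ has no simple closed form, but it can be solved numerically. Below is a minimal sketch for the 101-outcome model with the peak at side 51, using base R's `uniroot`; the helper `geom_sum` and the names `p_centre`, `n` and `probs` are hypothetical and introduced only for this illustration.

```r
# Solve k + k^2 + ... + k^50 = 0.5 for k in (0, 1).
geom_sum <- function(k, terms = 50) sum(k^(1:terms))

# The sum rises from 0 at k = 0 to 50 at k = 1, so the root is bracketed.
k <- uniroot(function(k) geom_sum(k) - 0.5,
             interval = c(0, 1), tol = 1e-12)$root

# Build the die: p(51) = 0.5 and p(51 - n) = p(51 + n) = 0.5 * k^n.
p_centre <- 0.5
n <- 1:50
probs <- c(rev(p_centre * k^n), p_centre, p_centre * k^n)
names(probs) <- 1:101

k            # very close to 1/3, because k^50 is negligible
sum(probs)   # 1, up to the solver tolerance
```

With $k$ close to $\frac{1}{3}$, the two neighbours of the peak each get a probability of about $0.167$, and the probabilities fall off by a factor of roughly three per step.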

Starlight
  • I think I wrote Situation 3 incorrectly. What I meant was: P(49) = P(50) = 0.3 ... I meant that P(49) + P(50) = 0.3, i.e. each equals 0.15. Sorry for this. Given these clarifications, is Situation 3 possible? – stats_noob Nov 22 '23 at 02:34
  • For situation 2, you are modelling it as a Binomial Distribution. But I thought it should be a Multinomial Distribution because there are multiple outcomes? – stats_noob Nov 22 '23 at 02:34