I am trying to sample with replacement in Scala/Spark, specifying a sampling probability for each class.

This is how I would do it in R.

# Vector to sample from
x <- c("User1","User2","User3","User4","User5")

# Occurrences from which to obtain sampling probabilities
y <- c(2,4,4,3,2)

# Calculate sampling probabilities
p <- y / sum(y)

# Draw sample with replacement of size 10
s <- sample(x, 10, replace = TRUE, prob = p)

# Which yields (for example):
[1] "User5" "User1" "User1" "User5" "User2" "User4" "User4" "User2" "User1" "User3"

How can I do the same in Scala / Spark?

Stefano

1 Answer

Scala:

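import scala.reflect.ClassTag
import scala.util.Random

// Draws n items from data with replacement; item i has probability weights(i) / weights.sum.
// The ClassTag context bound is needed so that Array.fill can build an Array[T].
def weightedSampleWithReplacement[T: ClassTag](data: Array[T],
                                               weights: Array[Double],
                                               n: Int,
                                               random: Random): Array[T] = {
  // Running totals: 0, w0, w0 + w1, ..., sum of all weights
  val cumWeights = weights.scanLeft(0.0)(_ + _)
  // Normalize so that the last entry is 1.0
  val cumProbs = cumWeights.map(_ / cumWeights.last)
  Array.fill(n) {
    // r is uniform in [0, 1); pick the bucket whose upper bound first exceeds r
    val r = random.nextDouble()
    data(cumProbs.indexWhere(r < _) - 1)
  }
}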
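
For reference, here is a minimal sketch of the same cumulative-probability approach applied to the data from the question. The object name `WeightedSampleDemo`, the helper name `weightedSample`, and the seed `42` are my own choices for illustration; the helper is fixed to `String` items to keep the example short. The draws depend on the seed, so your output will differ from the R example.

```scala
import scala.util.Random

object WeightedSampleDemo {
  // Same technique as above, specialised to String items (hypothetical helper name).
  def weightedSample(data: Array[String],
                     weights: Array[Double],
                     n: Int,
                     random: Random): Array[String] = {
    val cumWeights = weights.scanLeft(0.0)(_ + _)
    val cumProbs = cumWeights.map(_ / cumWeights.last)
    Array.fill(n) {
      val r = random.nextDouble()
      data(cumProbs.indexWhere(r < _) - 1)
    }
  }

  // Data from the question: users and their occurrence counts
  val users: Array[String] = Array("User1", "User2", "User3", "User4", "User5")
  val counts: Array[Double] = Array(2.0, 4.0, 4.0, 3.0, 2.0)

  // Seeded so that repeated runs give the same draws; the seed is arbitrary
  val sample: Array[String] = weightedSample(users, counts, 10, new Random(42L))

  def main(args: Array[String]): Unit =
    println(sample.mkString(" "))
}
```

Note that the weights do not have to be normalised beforehand, since the function divides by the total itself. That is why the raw counts can be passed directly instead of `p`.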

Spark has an RDD.sample() method that can sample with replacement (via its withReplacement argument), though not with weights. You could probably adapt that method along the lines above to support weights.

Sean Owen