
Suppose that I am trying to build a random forest by subsampling the data and choosing a single feature per tree randomly. For example, suppose there is some dataset,

$D = \{(x_{1},y_{1}),\ldots,(x_{N},y_{N})\}$, where $x_{i} \in \mathbb{R}^{D}$ and $y_{i} \in \mathbb{R}$ for $i = 1,\ldots,N$. We construct each tree as follows:

  1. First we randomly sample one feature index $j \in \{1,\ldots,D\}$.
  2. Then we draw a sample $\tilde D_{k}$ of size $M \le N$ from the data with replacement; denote the sampled indices by $k_{1},\ldots,k_{M}$.
  3. Keep only the $j^{\text{th}}$ feature of the $M$ samples: $\tilde D^{(j)}_{k} = \{(x^{(j)}_{k_{1}},y_{k_{1}}),\ldots,(x^{(j)}_{k_{M}},y_{k_{M}})\}$.
  4. Then we build a decision tree on $\tilde D_{k}^{(j)}$.
  5. Finally, average $R$ such trees to form the random forest.
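In case it helps to make the setup concrete, the steps above can be sketched in Python/NumPy. To keep it self-contained I use a depth-1 regression stump in place of a full decision tree; the function names (`fit_stump`, `fit_very_random_forest`, `predict`) are mine, not part of the question:

```python
import numpy as np

def fit_stump(x, y):
    """Fit a one-split regression stump on a single feature.
    Returns (threshold, left_mean, right_mean)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = (xs[0] - 1.0, ys.mean(), ys.mean())  # degenerate split: predict the mean
    best_sse = np.inf
    for i in range(1, len(xs)):
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse = sse
            best = ((xs[i - 1] + xs[i]) / 2.0, left.mean(), right.mean())
    return best

def fit_very_random_forest(X, y, R, M, rng):
    """Steps 1-5: R trees, each built on one random feature
    and a bootstrap sample of size M drawn with replacement."""
    trees = []
    for _ in range(R):
        j = rng.integers(X.shape[1])            # step 1: random feature index
        k = rng.integers(len(X), size=M)        # step 2: M indices, with replacement
        thr, lo, hi = fit_stump(X[k, j], y[k])  # steps 3-4: tree on the j-th feature only
        trees.append((j, thr, lo, hi))
    return trees

def predict(trees, X):
    # step 5: average the R per-tree predictions
    preds = [np.where(X[:, j] <= thr, lo, hi) for j, thr, lo, hi in trees]
    return np.mean(preds, axis=0)
```

A "traditional" random forest would instead grow each tree to depth greater than one and re-draw a random feature subset at every split, which is where I expect the bias/variance comparison to come in.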

We were asked: for which class of conditional distributions of $Y \mid X = x$ is this very random forest unbiased? I am wondering what is meant by a "class" of conditional distributions here. Could someone shed some light on this please?
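To make my reading of the question precise (this is my interpretation, not wording from the assignment): I take "unbiased" to mean that for every $x$ the averaged predictor $\hat f_{R}$ satisfies

$$\mathbb{E}\big[\hat f_{R}(x)\big] = \mathbb{E}[Y \mid X = x],$$

where the expectation is over the training data, the bootstrap resampling, and the random feature choices. So the question seems to ask for which family of conditional distributions $Y \mid X = x$ this equality holds.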

Also, how do the bias and variance of this RF compare with those of a traditional RF? I assume I will need to look at generalization bounds, but I am not sure. Any pointers would be appreciated.

user1234
