
I'm trying to understand how the SHAP algorithm computes, in polynomial time, an estimate of the feature attribution function that satisfies the Shapley value properties (specifically for tree-based models!).

A simplified version of the feature attribution function for feature $i$ of the model $f$'s prediction for the input instance $x$ is: $\phi_i(f, x)=\sum_{S \subseteq N \setminus \{i\}}[f_x(S\cup\{i\})-f_x(S)]$ - i.e. sum, over every set of features $S$ that does not include feature $i$, the change in the model prediction when $i$ is added to $S$.

The naive exponential estimation for a tree-based model (Algorithm 1 in the paper) estimates $f_x(S)$ by recursively following the decision path for $x$ if the split feature is in $S$, and taking the weighted average of both branches if the split feature is not in $S$ (the weights are the coverage of training samples in each branch). We do that for every set $S$ in the powerset of features.
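To make the setup concrete, here is a rough Python sketch of Algorithm 1 as I understand it (the dict-based tree layout is made up for illustration; it is not shap's internal representation):

```python
from itertools import combinations

def f_x(node, x, S):
    """Expected output when only the features in S are fixed to x's values.

    Assumed node layout (hypothetical, not shap's): internal nodes are dicts
    with "feature", "threshold", "left", "right"; every node carries "cover"
    (number of training samples reaching it); leaves carry "value".
    """
    if "value" in node:                      # leaf
        return node["value"]
    left, right = node["left"], node["right"]
    if node["feature"] in S:                 # feature fixed: follow x's decision path
        child = left if x[node["feature"]] <= node["threshold"] else right
        return f_x(child, x, S)
    total = left["cover"] + right["cover"]   # feature free: coverage-weighted average
    return (left["cover"] * f_x(left, x, S) + right["cover"] * f_x(right, x, S)) / total

def naive_phi(root, x, i, n):
    """Simplified (unweighted) attribution for feature i: O(2^n) evaluations of f_x."""
    others = [j for j in range(n) if j != i]
    return sum(f_x(root, x, set(S) | {i}) - f_x(root, x, set(S))
               for k in range(len(others) + 1)
               for S in combinations(others, k))
```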

So far so good. But now the paper describes Algorithm 2 - a polynomial method for computing the same values, basically by "running simultaneously for all $2^N$ subsets"! I can't understand that, no matter how much I try (see my notebook).

Can someone enlighten me and explain how that works? How are we tracking the number of sets and the coverage along every path?

Note that the real algorithm is much more complex than what I'm presenting here, as the real feature attribution function weights each set $S$ in proportion to its size $|S|$ - but I'm not interested in that complexity, only in the basic simplified setup I've presented here.

Thanks for your efforts - this is something that's been eluding me for years.

ihadanny

3 Answers


First, some notation.

A decision tree is a binary tree in which each internal node is labelled by one of $x_1,\ldots,x_n$, and has two outgoing edges labelled $0$ and $1$. Leaves are labelled with real numbers. To evaluate the decision tree on an input $(x_1,\ldots,x_n) \in \{0,1\}^n$, we just follow the tree from the root to a leaf, and output the value on the leaf. From now on, we fix a decision tree $T$, as well as a reference output $y$.

A decision tree is irredundant if no root-to-leaf path contains the same variable twice. We can easily make a tree irredundant in linear time, and so we assume that $T$ is irredundant. (For some reason, the paper doesn't make this assumption.)

Given a subset $S \subseteq [n]$, let $f(S)$ denote the expected output of $T$ when run on a random input $x$ subject to $x_S = y_S$ (i.e., $x_i = y_i$ for all $i \in S$). Alternatively, it is the expected value of the following process. Start at the root and proceed to a leaf. At any given node, if the node queries a variable $i \in S$, go to the branch labelled $y_i$. Otherwise, take a random branch. Thus $f([n])$ is the value of $T$ on $y$, and $f(\emptyset)$ is the value of $T$ on a random input.

For $i \in [n]$, we define the Shapley value $\phi(i)$ to be $$ \phi(i) = \sum_{i \notin S} \frac{|S|!(n-|S|-1)!}{n!} (f(S \cup \{i\}) - f(S)). $$ For the actual algorithm, the coefficients won't matter much. Indeed, we will calculate directly $$ \alpha_+(i,\ell) = \sum_{\substack{i \in S \\ |S|=\ell}} f(S), \\ \alpha_-(i,\ell) = \sum_{\substack{i \notin S \\ |S|=\ell}} f(S), $$ from which we can determine $\phi(i)$ using the formula. In the same way, we can compute other power indices, such as the Banzhaf index.
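To spell out "the formula": grouping the sum defining $\phi(i)$ according to $\ell = |S|$ gives $$ \phi(i) = \sum_{\ell=0}^{n-1} \frac{\ell!(n-\ell-1)!}{n!} \bigl(\alpha_+(i,\ell+1) - \alpha_-(i,\ell)\bigr), $$ since each term $f(S \cup \{i\})$ with $|S| = \ell$ is counted exactly once in $\alpha_+(i,\ell+1)$, and each term $f(S)$ exactly once in $\alpha_-(i,\ell)$.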

Let $v$ be any node in $T$. Define $\alpha_{\pm}(i,\ell,v,s_0,s_1)$ as the same value as above, when run on the decision tree rooted at $v$, given that we have already put $s_1$ variables inside $S$, and $s_0$ variables outside $S$. Thus we are interested in $\alpha_-(i,\ell,r,1,0)$ and $\alpha_+(i,\ell,r,0,1)$, where $r$ is the root of $T$.

If $v$ is a leaf then $\alpha_{\pm}(i,\ell,v,s_0,s_1)$ is just the label of $v$ multiplied by the number of possible sets $S$, which is $\binom{n-s_0-s_1}{\ell-s_1}$. Now suppose that $v$ is an internal node labelled with $x_j$ and having children $v_0,v_1$. If $j = i$ then $$ \alpha_+(i,\ell,v,s_0,s_1) = \alpha_+(i,\ell,v_{y_i},s_0,s_1), \\ \alpha_-(i,\ell,v,s_0,s_1) = \frac{\alpha_-(i,\ell,v_0,s_0,s_1) + \alpha_-(i,\ell,v_1,s_0,s_1)}{2}. $$ (In fact, it doesn't matter whether we choose $\alpha_+$ or $\alpha_-$ on the right-hand side.) If $j \neq i$ then $$ \alpha_{\pm}(i,\ell,v,s_0,s_1) = \alpha_{\pm}(i,\ell,v_{y_j},s_0,s_1+1) + \frac{\alpha_{\pm}(i,\ell,v_0,s_0+1,s_1) + \alpha_{\pm}(i,\ell,v_1,s_0+1,s_1)}{2}. $$
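Here is a direct Python transcription of this recursion (the tuple encoding of the tree is my own assumption, and the recursion is left unmemoized for readability; memoizing on $(v, s_0, s_1)$ for each fixed $i$ and $\ell$ is what brings the running time down to polynomial):

```python
from math import comb, factorial

def alpha(v, i, ell, s0, s1, y, n, plus):
    """Sum of f(S) over size-ell sets S (with i in S if plus, else i not in S),
    for the subtree at v, with s1 variables already placed in S and s0 outside.

    Assumed encoding: a leaf is ("leaf", value); an internal node is
    ("node", j, v0, v1), querying variable j with 0/1-labelled edges."""
    if v[0] == "leaf":
        k = ell - s1                         # variables still to be chosen for S
        return v[1] * comb(n - s0 - s1, k) if k >= 0 else 0.0
    _, j, v0, v1 = v
    if j == i:
        if plus:                             # i in S: follow the branch y_i
            return alpha((v0, v1)[y[i]], i, ell, s0, s1, y, n, True)
        # i not in S: x_i is random, so average the two branches
        return (alpha(v0, i, ell, s0, s1, y, n, False)
                + alpha(v1, i, ell, s0, s1, y, n, False)) / 2
    # j != i: either j joins S (follow y_j) or stays outside (random branch)
    return (alpha((v0, v1)[y[j]], i, ell, s0, s1 + 1, y, n, plus)
            + (alpha(v0, i, ell, s0 + 1, s1, y, n, plus)
               + alpha(v1, i, ell, s0 + 1, s1, y, n, plus)) / 2)

def shapley(root, i, y, n):
    """Assemble phi(i) from alpha_+(i, l+1) and alpha_-(i, l)."""
    return sum(factorial(l) * factorial(n - l - 1) / factorial(n)
               * (alpha(root, i, l + 1, 0, 1, y, n, True)
                  - alpha(root, i, l, 1, 0, y, n, False))
               for l in range(n))
```

For example, for the one-variable tree `T = ("node", 0, ("leaf", 0.0), ("leaf", 1.0))` and `y = [1]`, `shapley(T, 0, [1], 1)` returns $0.5$, matching $f(\{i\}) - f(\emptyset) = 1 - \tfrac12$.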

The actual algorithm in the paper is some sort of mild optimization of this idea.

Yuval Filmus

A slight optimization over @yuval-filmus's excellent answer.

In his comment here: https://github.com/slundberg/shap/issues/24, @sh1ng suggests an improvement. Rephrasing a bit, he is saying that we can use the path depth $d$ instead of $n$, and that we can also save ourselves the iteration over sets with $|S|=l$, thus improving the time complexity.

Suppose we have a leaf node at depth $d$. The dynamic-programming recursion ensures that we handle every set $S$ with size $|S|=s_1 \leq d$ by collecting $s_1$ features along the path into our set and $s_0$ features outside our set, such that $s_0+s_1=d$.

Since we have covered every possible set of features along the path, the iteration over $l$ is only used to take into account the features that are not along the path, counting the sets with sizes $|S|=l \geq s_1$, of which there are ${n-d} \choose {l-s_1}$. Note that since our tree is irredundant (each feature appears at most once on any path), we know that $d \leq n$, so the expression makes sense and the binomial coefficient's arguments are never negative.

We will now show, by induction over the number of features $n$, that we can:

  1. use $d$ instead of $n$ in the Shapley weight, and
  2. skip the iteration over $l$ that accounts for the features not along the path.

If $n=d$ the claim is trivial. We will use the Shapley weight $sw = \frac{s_1!(n-s_1-1)!}{n!} = \frac{s_1!(d-s_1-1)!}{d!}$. The count is also trivial: ${{n-d} \choose {l-s_1}} = 1$.

Now suppose $n=d+1$. Then:

  • There is ${{n-d} \choose {l-s_1}}=1$ set with the same size $l=s_1$, and its Shapley weight is $\frac{s_1!(n-s_1-1)!}{n!} = \frac{s_1!(d-s_1)!}{(d+1)!}=sw\frac{d-s_1}{d+1}$.
  • There is also ${{n-d} \choose {l-s_1}}=1$ set with size $l=s_1+1$, and its Shapley weight is $\frac{(s_1+1)!(n-(s_1+1)-1)!}{n!} = \frac{(s_1+1)!(d-s_1-1)!}{(d+1)!}=sw\frac{s_1+1}{d+1}$.
  • When we sum these 2 sets in the $l$ iteration, we get exactly the total contribution $sw\frac{d-s_1}{d+1}+sw\frac{s_1+1}{d+1} = sw$.
  • These are the only 2 valid values of $l$ for our path.

It is important to notice that these 2 sets have exactly the same contribution $f(S)$ as in the case where $n=d$, as they land in exactly the same leaf of the tree.

It is also important to notice that the same claim holds whether $i$ is in our set (the $\alpha_+$ code branch) or outside it (the $\alpha_-$ code branch).

Now if $n=d+2$, exactly the same argument holds for each of these 2 sets, and we get 4 sets: 2 sets with weights summing to $sw\frac{d-s_1}{d+1}$ and 2 sets with weights summing to $sw\frac{s_1+1}{d+1}$, so the sum is again $sw$.
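As a quick sanity check (my own small script, not from the paper or the shap code), the identity this induction establishes - $\sum_{l=s_1}^{s_1+n-d} {{n-d} \choose {l-s_1}}\frac{l!(n-l-1)!}{n!} = \frac{s_1!(d-s_1-1)!}{d!}$ - can be verified numerically:

```python
from math import comb, factorial

def summed_weight(n, d, s1):
    """Sum over all sizes l of (#sets of size l reaching the leaf) x (Shapley weight for size l)."""
    return sum(comb(n - d, l - s1) * factorial(l) * factorial(n - l - 1) / factorial(n)
               for l in range(s1, s1 + n - d + 1))

def depth_weight(d, s1):
    """The depth-based weight sw used instead of iterating over l."""
    return factorial(s1) * factorial(d - s1 - 1) / factorial(d)

for n in range(1, 10):
    for d in range(1, n + 1):
        for s1 in range(d):                  # need d - s1 - 1 >= 0
            assert abs(summed_weight(n, d, s1) - depth_weight(d, s1)) < 1e-12
print("the depth-based weight matches the full sum over l")
```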

Code is available in this notebook. Credit goes to Yuval and sh1ng.

ihadanny

In this paper: https://arxiv.org/pdf/2007.14045.pdf, a polynomial-time algorithm is given to compute the SHAP score for deterministic and decomposable Boolean circuits. This algorithm can also be used for decision trees, since there is a simple polynomial-time algorithm for translating such trees into deterministic and decomposable Boolean circuits. For the same reason, the algorithm can also be used for ordered binary decision diagrams (OBDDs) and free binary decision diagrams (FBDDs).

Alan