Difference between indicator column and categorical identity column in tensorflow

Question

I am learning TensorFlow and came across the different feature columns it provides. Two of these are categorical_identity_column and indicator_column. Both appear to be defined in the same way. As far as I understand, both convert a categorical column to a one-hot encoded column.

So my question is: what is the difference between the two, and when should I use one rather than the other?

score 10 · Answer 1 · edited Aug 02 '20 at 12:23

indicator_column encodes the input to a multi-hot representation, not one-hot encoding.

The following example makes this clearer:

name = indicator_column(categorical_column_with_vocabulary_list(
    'name', ['bob', 'george', 'wanda']))
columns = [name, ...]
features = tf.parse_example(..., features=make_parse_example_spec(columns))
dense_tensor = input_layer(features, columns)
dense_tensor == [[1, 0, 0]]  # If "name" bytes_list is ["bob"]
dense_tensor == [[1, 0, 1]]  # If "name" bytes_list is ["bob", "wanda"]
dense_tensor == [[2, 0, 0]]  # If "name" bytes_list is ["bob", "bob"]

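The last two examples show what multi-hot encoding means: each position holds the number of times the corresponding vocabulary entry appears in the input. For example, the input ["bob", "wanda"] is encoded as [[1, 0, 1]], and ["bob", "bob"] as [[2, 0, 0]].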
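To see the counting behaviour without running TensorFlow, here is a minimal pure-Python sketch. The helper multi_hot is hypothetical and is not part of the TensorFlow API. It only mimics the three outputs shown above, and it assumes that values outside the vocabulary are simply ignored.

```python
from collections import Counter


def multi_hot(values, vocabulary):
    """Hypothetical helper (not a TensorFlow function).

    Returns one count per vocabulary entry, mimicking the dense output
    shown in the example above. Values that are not in the vocabulary
    are ignored (an assumption of this sketch).
    """
    counts = Counter(values)
    return [counts[token] for token in vocabulary]


vocabulary = ["bob", "george", "wanda"]

print(multi_hot(["bob"], vocabulary))           # [1, 0, 0]
print(multi_hot(["bob", "wanda"], vocabulary))  # [1, 0, 1]
print(multi_hot(["bob", "bob"], vocabulary))    # [2, 0, 0]
```

A strict one-hot encoding could not represent the second and third inputs, which is why the result is described as multi-hot.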

score 2 · Answer 2 · edited Oct 26 '18 at 18:52

Regarding the question in the comments above (by Ankit Seth), the docs here say the following about deep models (as opposed to "wide", i.e. linear):

tf.estimator.DNNClassifier and tf.estimator.DNNRegressor: Only accept dense columns. Other column types must be wrapped in either an indicator_column or embedding_column.

And if you try to pass a categorical column directly to a deep model, TF will throw the following error:

ValueError: Items of feature_columns must be a _DenseColumn. You can wrap a categorical column with an embedding_column or indicator_column.

score 1 · Answer 3 · answered Apr 03 '18 at 22:20

You would use categorical_column_with_* to get a _CategoricalColumn to feed into a linear model; such a column represents each category by an integer ID, often looked up from a vocabulary.

indicator_column, on the other hand, wraps a given categorical column in a multi-hot representation and produces an _IndicatorColumn; you would use it to feed the feature into a DNN, for example. embedding_column is analogous, but you'd use it when the input is sparse, since it maps each category to a learned dense vector instead of a multi-hot vector.
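The two stages described in this answer can be sketched in plain Python, as an illustration only. The names lookup_ids and indicator_from_ids are hypothetical and do not exist in TensorFlow. The first stands in for what a categorical column conceptually yields (integer IDs), the second for the dense vector that an indicator column builds from those IDs.

```python
def lookup_ids(values, vocabulary):
    """Hypothetical stand-in for a categorical column: map each value
    to its integer ID (its position in the vocabulary). Values outside
    the vocabulary are dropped, which is an assumption of this sketch."""
    index = {token: i for i, token in enumerate(vocabulary)}
    return [index[v] for v in values if v in index]


def indicator_from_ids(ids, num_categories):
    """Hypothetical stand-in for an indicator column: turn a list of
    integer IDs into a dense multi-hot vector of length num_categories."""
    dense = [0] * num_categories
    for i in ids:
        dense[i] += 1
    return dense


vocabulary = ["bob", "george", "wanda"]

ids = lookup_ids(["bob", "wanda"], vocabulary)
print(ids)                                       # [0, 2]
print(indicator_from_ids(ids, len(vocabulary)))  # [1, 0, 1]
```

The list of IDs is the compact form that suits a linear model, while the fixed-length vector is the dense form that a DNN expects.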
