I am learning TensorFlow and came across the different feature columns it provides. Two of these types are categorical_column_with_identity and indicator_column. Both appear to be defined in the same way. As far as I understand, both convert a categorical column to a one-hot encoded column.

So my question is: what is the difference between the two? When should I use one, and when the other?

Stephen Rauch
Ankit Seth

3 Answers

indicator_column encodes the input to a multi-hot representation, not one-hot encoding.

The following example makes this clearer:

name = indicator_column(categorical_column_with_vocabulary_list(
    'name', ['bob', 'george', 'wanda']))
columns = [name, ...]
features = tf.parse_example(..., features=make_parse_example_spec(columns))
dense_tensor = input_layer(features, columns)

dense_tensor == [[1, 0, 0]]  # If "name" bytes_list is ["bob"]
dense_tensor == [[1, 0, 1]]  # If "name" bytes_list is ["bob", "wanda"]
dense_tensor == [[2, 0, 0]]  # If "name" bytes_list is ["bob", "bob"]

The last two examples show what is meant by multi-hot encoding. For example, if the input is ["bob", "wanda"], the encoding will be [[1, 0, 1]], and a repeated value is counted, so ["bob", "bob"] gives [[2, 0, 0]].
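To make the counting behaviour concrete, here is a minimal plain-Python sketch of the same idea. It does not call TensorFlow; the helper name `multi_hot` is hypothetical, and silently ignoring values outside the vocabulary is an assumption of this sketch.

```python
from collections import Counter


def multi_hot(values, vocabulary):
    """Count how often each vocabulary entry occurs in `values`."""
    counts = Counter(values)
    # One slot per vocabulary entry, in vocabulary order.
    # Values that are not in the vocabulary are simply ignored (sketch assumption).
    return [counts[word] for word in vocabulary]


vocabulary = ["bob", "george", "wanda"]
print(multi_hot(["bob"], vocabulary))           # [1, 0, 0]
print(multi_hot(["bob", "wanda"], vocabulary))  # [1, 0, 1]
print(multi_hot(["bob", "bob"], vocabulary))    # [2, 0, 0]
```

With a single value per example the result is an ordinary one-hot vector, which is probably why the two encodings are easy to confuse.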

Zephyr
m.elahi

Regarding the question in the comments above (by Ankit Seth), the docs here say the following about deep models (as opposed to "wide", i.e. linear):

tf.estimator.DNNClassifier and tf.estimator.DNNRegressor: Only accept dense columns. Other column types must be wrapped in either an indicator_column or embedding_column.

And if you try to pass a categorical column directly to a deep model, TF will throw the following error:

ValueError: Items of feature_columns must be a _DenseColumn. You can wrap a categorical column with an embedding_column or indicator_column.

Stephen Rauch
Milad Shahidi

You would use categorical_column_with_* to get a _CategoricalColumn to feed into a linear model; such a column returns the category IDs themselves, often looked up through a vocabulary.

On the other hand, indicator_column wraps a given categorical column in a multi-hot representation and is what you would use to feed the feature into a DNN, for example; it produces an _IndicatorColumn. embedding_column is analogous, but you'd use it if your input is sparse.
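As a rough illustration of the difference, the plain-Python sketch below shows the integer IDs a categorical column works with and the dense vector an embedding lookup would produce from them. Nothing here is TensorFlow API: `to_ids`, `embed_mean`, and the 2-dimensional `table` are hypothetical, and averaging the selected rows is an assumption of this sketch (a mean combiner).

```python
def to_ids(values, vocabulary):
    """Map each string to its integer position in the vocabulary."""
    index = {word: i for i, word in enumerate(vocabulary)}
    # Out-of-vocabulary values are dropped (sketch assumption).
    return [index[v] for v in values if v in index]


def embed_mean(ids, table):
    """Average the embedding rows selected by `ids`."""
    dim = len(table[0])
    if not ids:
        return [0.0] * dim
    return [sum(table[i][d] for i in ids) / len(ids) for d in range(dim)]


vocabulary = ["bob", "george", "wanda"]
# Hypothetical embedding table; a real model would learn these values.
table = [[1.0, 0.0], [0.0, 1.0], [3.0, 2.0]]

ids = to_ids(["bob", "wanda"], vocabulary)
print(ids)                     # [0, 2]
print(embed_mean(ids, table))  # [2.0, 1.0]
```

The indicator representation has one slot per vocabulary entry, so its width grows with the vocabulary, while the embedding width stays fixed at whatever dimension you choose.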

Ethereal