TensorFlow | one hot encoding of categorical features in TensorFlow

In TensorFlow Categorical values can be transformed to one-hot-encoded vectors by using tf.feature_column.categorical_column_with_vocabulary_list function with tf.feature_column.indicator_column function.

Import libraries and load the sample dataset into dataframe


import tensorflow as tf
import pandas as pd

df = pd.read_csv('https://storage.googleapis.com/gcptutorials.com/dataset/sample.csv')


Visualize dataframe records


print(df.head())

======Output======
survived	sex	age	n_siblings_spouses	parch	fare	class	deck	embark_town	alone
0	0	male	22.0	1	0	7.2500	Third	unknown	Southampton	n
1	1	female	38.0	1	0	71.2833	First	C	Cherbourg	n
2	1	female	26.0	0	0	7.9250	Third	unknown	Southampton	y
3	1	female	35.0	1	0	53.1000	First	C	Southampton	n
4	0	male	28.0	0	0	8.4583	Third	unknown	Queenstown	y

Review and categorize features into lists


CAT_COLUMNS = ['sex', 'n_siblings_spouses', 'parch', 'class', 'deck',
               'embark_town', 'alone']
NUM_COLUMNS = ['age', 'fare']

Create feature columns for estimators
  • For categorical features, provide vocab and feature name to the function tf.feature_column.categorical_column_with_vocabulary_list. Here vocab is list of unique values for that particular feature
  • Use tf.feature_column.indicator_column to convert VocabularyListCategoricalColumn to one-hot-encode vector
  • For numerical features, provide feature name and data type to function tf.feature_column.numeric_column
  • Append all features to the list, refer below code for all of the steps.
  • 
    
    feature_cols = []
    
    # Create IndicatorColumn for categorical features
    for feature in CAT_COLUMNS:
      vocab = df[feature].unique()
      feature_cols.append(tf.feature_column.indicator_column(
          tf.feature_column.categorical_column_with_vocabulary_list(feature, vocab)))
    
    # Create NumericColumn for numerical features
    for feature in NUM_COLUMNS:
      feature_cols.append(tf.feature_column.numeric_column(feature, dtype=tf.float32))
    
    print(feature_cols)
    
    =======Output======
    [IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)), IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='n_siblings_spouses', vocabulary_list=(1, 0, 3, 4, 2, 5, 8), dtype=tf.int64, default_value=-1, num_oov_buckets=0)), IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='parch', vocabulary_list=(0, 1, 2, 5, 3, 4), dtype=tf.int64, default_value=-1, num_oov_buckets=0)), IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='class', vocabulary_list=('Third', 'First', 'Second'), dtype=tf.string, default_value=-1, num_oov_buckets=0)), IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='deck', vocabulary_list=('unknown', 'C', 'G', 'A', 'B', 'D', 'F', 'E'), dtype=tf.string, default_value=-1, num_oov_buckets=0)), IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Southampton', 'Cherbourg', 'Queenstown', 'unknown'), dtype=tf.string, default_value=-1, num_oov_buckets=0)), IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='alone', vocabulary_list=('n', 'y'), dtype=tf.string, default_value=-1, num_oov_buckets=0)), NumericColumn(key='age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]
    [IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)), IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='n_siblings_spouses', vocabulary_list=(1, 0, 3, 4, 2, 5, 8), dtype=tf.int64, default_value=-1, num_oov_buckets=0)), IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='parch', vocabulary_list=(0, 1, 2, 5, 3, 4), dtype=tf.int64, default_value=-1, num_oov_buckets=0)), IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='class', vocabulary_list=('Third', 'First', 'Second'), dtype=tf.string, default_value=-1, num_oov_buckets=0)), IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='deck', vocabulary_list=('unknown', 'C', 'G', 'A', 'B', 'D', 'F', 'E'), dtype=tf.string, default_value=-1, num_oov_buckets=0)), IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Southampton', 'Cherbourg', 'Queenstown', 'unknown'), dtype=tf.string, default_value=-1, num_oov_buckets=0)), IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='alone', vocabulary_list=('n', 'y'), dtype=tf.string, default_value=-1, num_oov_buckets=0)), NumericColumn(key='age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), NumericColumn(key='fare', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]
    
    

    View output after applying feature column transformations

  • Take input from the dataframe
  • Provide input to tf.keras.layers.DenseFeatures, below are the output snapshots for features transformation
  • one-hot-encoding for the first feature column
    
    row = dict(df.head(1))
    tf.keras.layers.DenseFeatures(feature_cols[0])(row).numpy()
    
    ======Output======
    array([[1., 0.]], dtype=float32)
    
    

    one-hot-encoding for feature column "n_siblings_spouses"
    
    row = dict(df.head(1))
    tf.keras.layers.DenseFeatures(feature_cols[1])(row).numpy()
    
    ======Output======
    array([[1., 0., 0., 0., 0., 0., 0.]], dtype=float32)
    
    

    one-hot-encoding for feature column "parch"
    
    row = dict(df.head(1))
    tf.keras.layers.DenseFeatures(feature_cols[2])(row).numpy()
    
    ======Output======
    array([[1., 0., 0., 0., 0., 0.]], dtype=float32)
    
    

    one-hot-encoding for feature column "class"
    
    row = dict(df.head(1))
    tf.keras.layers.DenseFeatures(feature_cols[3])(row).numpy()
    
    ======Output======
    array([[1., 0., 0.]], dtype=float32)
    
    

    one-hot-encoding for feature column "deck"
    
    row = dict(df.head(1))
    tf.keras.layers.DenseFeatures(feature_cols[4])(row).numpy()
    
    ======Output======
    array([[1., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)
    
    

    one-hot-encoding for feature column "embark_town"
    
    row = dict(df.head(1))
    tf.keras.layers.DenseFeatures(feature_cols[5])(row).numpy()
    
    ======Output======
    array([[1., 0., 0., 0.]], dtype=float32)
    
    

    one-hot-encoding for feature column "alone"
    
    row = dict(df.head(1))
    tf.keras.layers.DenseFeatures(feature_cols[6])(row).numpy()
    
    ======Output======
    array([[1., 0.]], dtype=float32)
    
    

    All of the feature column transformations

    
    row = dict(df.head(1))
    tf.keras.layers.DenseFeatures(feature_cols)(row).numpy()
    
    ======Output======
    array([[22.  ,  1.  ,  0.  ,  1.  ,  0.  ,  0.  ,  1.  ,  0.  ,  0.  ,
             0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  1.  ,  0.  ,  0.  ,  0.  ,
             7.25,  1.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  1.  ,
             0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  1.  ,  0.  ]], dtype=float32)
    
    

    Categories: TensorFlow