
Labelencoder That Keeps Missing Values As 'nan'

I am trying to use the label encoder in order to convert categorical data into numeric values. I need a LabelEncoder that keeps my missing values as 'NaN' so that I can use an Imputer afterwards.

Solution 1:

The first question is: do you wish to encode each column separately or encode them all with one encoding?

The expression df = df.astype(str).apply(LabelEncoder().fit_transform) implies that you encode all the columns separately.

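In that case you can do the following:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2.0, 1.0, np.nan]})  # sample data
# Encode only the non-null values of each column, keeping their original index
df = df.apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
))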
print(df)
Out:
     A  B    C
0  0.0  0  1.0
1  NaN  1  0.0
2  1.0  2  NaN

An explanation of how it works is given below. But first, a couple of drawbacks of this solution.

Drawbacks

First, the columns end up with mixed types: if a column contains a NaN value, the column gets a float type, because NaNs are floats in Python.

df.dtypes
A    float64
B      int64
C    float64
dtype: object

This seems meaningless for labels. Still, you can later ignore all the NaNs and convert the rest to integers.
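If the float columns bother you, one option is pandas' nullable integer dtype, which can hold integers and missing values side by side. A minimal sketch, assuming a pandas version that supports the "Int64" extension dtype; the frame below is typed in by hand to match the encoded output above:

```python
import numpy as np
import pandas as pd

# Encoded frame as printed above: NaN forces columns A and C to float64.
encoded = pd.DataFrame({
    "A": [0.0, np.nan, 1.0],
    "B": [0, 1, 2],
    "C": [1.0, 0.0, np.nan],
})

# "Int64" (capital I) is the nullable integer dtype; NaN becomes <NA>.
encoded_int = encoded.astype("Int64")

print(encoded_int.dtypes)
print(encoded_int)
```

Whether downstream tools accept this dtype depends on their version, so treat it as a convenience for inspecting labels rather than a required step.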

The second point: you will probably need to keep the LabelEncoder around, because it is often needed later, for instance for an inverse transform. But this solution doesn't keep the encoders; there is no variable holding them.

A simple, explicit solution is:

encoders = dict()

for col_name in df.columns:
    series = df[col_name]
    label_encoder = LabelEncoder()
    df[col_name] = pd.Series(
        label_encoder.fit_transform(series[series.notnull()]),
        index=series[series.notnull()].index
    )
    encoders[col_name] = label_encoder

print(df)
Out:
     A  B    C
0  0.0  0  1.0
1  NaN  1  0.0
2  1.0  2  NaN

- more code, but the result is the same

print(encoders)
Out:
{'A': LabelEncoder(), 'B': LabelEncoder(), 'C': LabelEncoder()}

- also, the encoders are available. So is the inverse transform (drop the NaNs first!):

encoders['B'].inverse_transform(df['B'])
Out:
array([1, 6, 9])
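Column B has no missing values, so the call above works directly. For a column that does contain NaNs, the "drop the NaNs first" step could look roughly like this. It is only a sketch: the names original, encoded, present and decoded are mine, and the sample column mirrors column A:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample column with a missing value (same content as column A).
original = pd.Series(["x", np.nan, "z"], name="A")

# Encode only the non-missing entries, keeping their index.
mask = original.notnull()
encoder = LabelEncoder()
encoded = pd.Series(
    encoder.fit_transform(original[mask]),
    index=original[mask].index,
).reindex(original.index)  # NaN reappears at position 1, dtype becomes float

# Inverse transform: drop NaN, cast back to int, decode, restore the index.
present = encoded.dropna()
decoded = pd.Series(
    encoder.inverse_transform(present.astype(int).to_numpy()),
    index=present.index,
).reindex(encoded.index)

print(decoded.tolist())
```

The cast to int matters because the NaN has turned the encoded column into floats, while the encoder expects the integer labels it produced.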

Other options exist as well, such as a registry superclass for encoders. They are compatible with the first solution and make it easier to iterate over the columns.

How it works

df.apply(lambda series: ...) applies a function that returns a pd.Series to each column, so it returns a dataframe with the new values.

Expression step by step:

pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
)

- series[series.notnull()] drops the NaN values, then feeds the rest to fit_transform.

- since the label encoder returns a numpy.array and discards the index, index=series[series.notnull()].index restores it so that the values line up correctly. If you don't set the index:

print(df)
Out:
     A  B    C
0    x  1  2.0
1  NaN  6  1.0
2    z  9  NaN
df = df.apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
))
print(df)
Out:
     A  B    C
0  0.0  0  1.0
1  1.0  1  0.0
2  NaN  2  NaN

- the values shift away from their correct positions, and an IndexError may even occur.

Single encoder for all columns

In that case, stack the dataframe, fit the encoder, then unstack it:

series_stack = df.stack().astype(str)
label_encoder = LabelEncoder()
df = pd.Series(
    label_encoder.fit_transform(series_stack),
    index=series_stack.index
).unstack()
print(df)
Out:
     A    B    C
0  5.0  0.0  2.0
1  NaN  3.0  1.0
2  6.0  4.0  NaN

- since the unstacked result contains NaNs in the cells that were missing, all values in the DataFrame are floats, so you may prefer to convert them.
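For completeness, here is a self-contained sketch of the single-encoder variant, followed by a conversion to integers. Two things in it are my own additions rather than part of the snippet above: the explicit dropna(), which assumes that stack() may keep missing values in some pandas versions, and the cast to the nullable "Int64" dtype:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "A": ["x", np.nan, "z"],
    "B": [1, 6, 9],
    "C": [2.0, 1.0, np.nan],
})

# Stack to one long Series indexed by (row, column). The explicit dropna()
# is an assumption: it guards against stack() keeping missing values.
stacked = df.stack().dropna().astype(str)

# One encoder shared by all columns.
encoder = LabelEncoder()
encoded = pd.Series(
    encoder.fit_transform(stacked),
    index=stacked.index,
).unstack()

# Missing cells come back as NaN (float); switch to nullable integers.
encoded = encoded.astype("Int64")
print(encoded)
```

Note that astype(str) makes the encoder see every value as a string, so the labels follow string sort order across all columns.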

Hope it helps.
