Labelencoder That Keeps Missing Values As 'nan'

May 30, 2024

I am trying to use the label encoder in order to convert categorical data into numeric values. I need a LabelEncoder that keeps my missing values as 'NaN' so that I can use an Imputer afterwards.

Solution 1:

The first question is: do you wish to encode each column separately or encode them all with one encoding?

The expression df = df.astype(str).apply(LabelEncoder().fit_transform) implies that you encode all the columns separately.

In that case, you can do the following:
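import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Encode only the non-null values of each column; NaN positions stay NaN.
df = df.apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
))
print(df)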
Out:
     A  B    C
0  0.0  0  1.0
1  NaN  1  0.0
2  1.0  2  NaN
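
For reference, here is a self-contained sketch of the same idea. The sample frame is an assumption reconstructed from the outputs shown in this answer, and encode_non_null is a hypothetical helper name, not part of pandas or scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample frame assumed from the outputs shown in this answer.
df = pd.DataFrame({
    "A": ["x", np.nan, "z"],
    "B": [1, 6, 9],
    "C": [2.0, 1.0, np.nan],
})


def encode_non_null(series):
    """Label-encode the non-null values of one column (hypothetical helper)."""
    mask = series.notnull()
    codes = LabelEncoder().fit_transform(series[mask])
    # Re-attach the original index so missing rows become NaN on alignment.
    return pd.Series(codes, index=series.index[mask])


encoded = df.apply(encode_non_null)
print(encoded)
```

Naming the function instead of using a lambda only makes the masking step easier to read; the behaviour should match the one-liner above.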

The explanation of how it works is below. But first, here are a couple of drawbacks of this solution.

Drawbacks

First, the columns end up with mixed types: if a column contains a NaN value, then the column has type float, because NaN is a float in Python.

df.dtypes
A    float64
B      int64
C    float64
dtype: object

That seems meaningless for labels. Still, you can later drop all the NaNs and convert the rest to integers.
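
One possible way to get integer labels without dropping the missing rows is the pandas nullable integer dtype "Int64". This is a sketch rather than part of the original answer; the encoded frame below is assumed to mirror the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical encoded frame, mirroring the output shown above.
encoded = pd.DataFrame({
    "A": [0.0, np.nan, 1.0],
    "B": [0, 1, 2],
    "C": [1.0, 0.0, np.nan],
})

# "Int64" (capital I) is the nullable integer dtype: NaN becomes <NA>.
as_int = encoded.astype("Int64")
print(as_int.dtypes)
print(as_int)
```

Whether a later imputation step accepts <NA> depends on the library and version you use, so keeping float64 with NaN may be the simpler choice.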

Second, you probably need to keep each LabelEncoder, because it is often required later, for instance for an inverse transform. This solution does not keep the encoders: there is no variable that holds them.

A simple, explicit solution is:

encoders = dict()

for col_name in df.columns:
    series = df[col_name]
    label_encoder = LabelEncoder()
    df[col_name] = pd.Series(
        label_encoder.fit_transform(series[series.notnull()]),
        index=series[series.notnull()].index
    )
    encoders[col_name] = label_encoder

print(df)
Out:
     A  B    C
0  0.0  0  1.0
1  NaN  1  0.0
2  1.0  2  NaN

- more code, but the result is the same

print(encoders)
Out:
{'A': LabelEncoder(), 'B': LabelEncoder(), 'C': LabelEncoder()}

- the encoders are now available too, and so is the inverse transform (drop the NaNs before calling it!):

encoders['B'].inverse_transform(df['B'])
Out:
array([1, 6, 9])
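
For a column that does contain NaN, the "drop first" step could look like the sketch below. The sample column is an assumption based on column A of this answer, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Column A as assumed from the outputs in this answer.
original = pd.Series(["x", np.nan, "z"], name="A")

# Fit on the non-null values only, as in the loop above.
mask = original.notnull()
encoder = LabelEncoder()
encoded = pd.Series(
    encoder.fit_transform(original[mask]),
    index=original.index[mask],
).reindex(original.index)  # NaN reappears at the missing position

# Drop NaN and cast back to int before inverse_transform,
# then reindex so the missing positions are NaN again.
present = encoded.dropna().astype(int)
decoded = pd.Series(
    encoder.inverse_transform(present),
    index=present.index,
).reindex(encoded.index)
print(decoded.tolist())
```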

Other options are available as well, such as a registry class that stores the encoders. They are compatible with the first solution, but iterating through the columns explicitly is easier.
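
As a rough illustration of the registry idea, the sketch below stores one encoder per column while still being passed to df.apply. EncoderRegistry and its method name are hypothetical, and the sample frame is assumed from the outputs in this answer.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


class EncoderRegistry:
    """Hypothetical helper that remembers one LabelEncoder per column."""

    def __init__(self):
        self.encoders = {}

    def fit_transform_column(self, series):
        mask = series.notnull()
        encoder = LabelEncoder()
        codes = encoder.fit_transform(series[mask])
        # df.apply passes each column with its name set, so use it as the key.
        self.encoders[series.name] = encoder
        return pd.Series(codes, index=series.index[mask])


# Sample frame assumed from the outputs shown in this answer.
df = pd.DataFrame({
    "A": ["x", np.nan, "z"],
    "B": [1, 6, 9],
    "C": [2.0, 1.0, np.nan],
})

registry = EncoderRegistry()
encoded = df.apply(registry.fit_transform_column)
print(encoded)
print(registry.encoders)
```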

How it works

df.apply(lambda series: ...) applies a function that returns a pd.Series to each column, so it returns a dataframe with the new values.

Expression step by step:

pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
)

- series[series.notnull()] drops the NaN values, then feeds the rest to fit_transform.

- as the label encoder returns a numpy.array and discards the index, index=series[series.notnull()].index restores it so that the columns are concatenated correctly. Without the index:

print(df)
Out:
     A  B    C
0    x  1  2.0
1  NaN  6  1.0
2    z  9  NaN
df = df.apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
))
print(df)
Out:
     A  B    C
0  0.0  0  1.0
1  1.0  1  0.0
2  NaN  2  NaN

- the values shift away from their correct positions, and an IndexError may even occur.

Single encoder for all columns

In that case, stack the dataframe, fit the encoder, then unstack it:

series_stack = df.stack().astype(str)
label_encoder = LabelEncoder()
df = pd.Series(
    label_encoder.fit_transform(series_stack),
    index=series_stack.index
).unstack()
print(df)
Out:
     A    B    C
0  5.0  0.0  2.0
1  NaN  3.0  1.0
2  6.0  4.0  NaN

- as the unstacked DataFrame contains NaNs, all of its values are floats, so you may prefer to convert them.
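
A possible round trip with the shared encoder is sketched below. The sample frame is assumed from the outputs in this answer; the explicit dropna() calls are an addition so that the result does not depend on how stack() treats missing values in a given pandas version.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample frame assumed from the outputs shown in this answer.
df = pd.DataFrame({
    "A": ["x", np.nan, "z"],
    "B": [1, 6, 9],
    "C": [2.0, 1.0, np.nan],
})

# Drop NaN explicitly before converting to strings.
series_stack = df.stack().dropna().astype(str)
label_encoder = LabelEncoder()
encoded = pd.Series(
    label_encoder.fit_transform(series_stack),
    index=series_stack.index,
).unstack()

# Nullable integers keep the missing cells as <NA> instead of floats.
encoded = encoded.astype("Int64")

# Decode: stack again, invert, unstack.
codes = encoded.stack().dropna().astype(int)
decoded = pd.Series(
    label_encoder.inverse_transform(codes),
    index=codes.index,
).unstack()
print(encoded)
print(decoded)
```

Note that the decoded values come back as strings, because the encoder was fitted on astype(str) data; restoring the original dtypes would need a separate conversion.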

Hope it helps.
