Labelencoder That Keeps Missing Values As 'nan'
Solution 1:
The first question is: do you wish to encode each column separately or encode them all with one encoding?
The expression df = df.astype(str).apply(LabelEncoder().fit_transform) implies that you encode all the columns separately.
In that case, you can do the following:
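import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Encode only the non-null values of each column and keep their original index
df = df.apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
))
print(df)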
Out:
     A  B    C
0  0.0  0  1.0
1  NaN  1  0.0
2  1.0  2  NaN
The explanation of how it works is below. But first, a couple of drawbacks of this solution.
Drawbacks
First, the columns end up with mixed types: if a column contains a NaN value, the column gets a float type, because NaN is a float in Python.
df.dtypes
A    float64
B      int64
C    float64
dtype: object
That is not very meaningful for labels. You can later ignore the NaNs and convert the rest to integers.
Second, you probably need to keep each LabelEncoder, because it is often required later, for instance for an inverse transform. This solution does not keep the encoders, so there is no variable holding them.
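One way to do that conversion, as a minimal sketch: pandas has a nullable integer dtype, "Int64" (capital I, available in reasonably recent pandas versions), which stores integers and a missing-value marker in the same column. The frame below is typed in by hand to mirror the encoded output above; the names encoded and encoded_int are only for illustration.

```python
import numpy as np
import pandas as pd

# Encoded labels as the per-column approach produces them:
# columns that had a missing value come back as float64.
encoded = pd.DataFrame({
    "A": [0.0, np.nan, 1.0],
    "B": [0, 1, 2],
    "C": [1.0, 0.0, np.nan],
})

# "Int64" is the nullable integer dtype: NaN becomes pd.NA
# and the remaining labels become real integers.
encoded_int = encoded.astype("Int64")

print(encoded_int.dtypes)
print(encoded_int)
```

The missing cells stay missing, so this keeps the information that a label was absent instead of replacing it with a sentinel value.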
A simple, explicit solution is:
encoders = dict()
for col_name in df.columns:
    series = df[col_name]
    label_encoder = LabelEncoder()
    df[col_name] = pd.Series(
        label_encoder.fit_transform(series[series.notnull()]),
        index=series[series.notnull()].index
    )
    encoders[col_name] = label_encoder
print(df)
Out:
     A  B    C
0  0.0  0  1.0
1  NaN  1  0.0
2  1.0  2  NaN
- more code, but the result is the same
print(encoders)
Out:
{'A': LabelEncoder(), 'B': LabelEncoder(), 'C': LabelEncoder()}
- and the encoders are now available. So is the inverse transform (drop the NaNs first!):
encoders['B'].inverse_transform(df['B'])
Out:
array([1, 6, 9])
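Column B has no missing values, so it can be passed to inverse_transform directly. For a column that does contain NaN, something like the following sketch could be used. The helper inverse_keep_nan is hypothetical (it is not part of scikit-learn), and the sample column is an assumption that mirrors column A.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample column with a missing value (assumed, mirrors column A).
original = pd.Series(["x", np.nan, "z"], name="A")

# Encode only the non-null entries, keeping their index.
mask = original.notnull()
encoder = LabelEncoder()
encoded = pd.Series(
    encoder.fit_transform(original[mask]),
    index=original[mask].index,
).reindex(original.index)  # NaN reappears at row 1, dtype becomes float


def inverse_keep_nan(encoder, encoded):
    """Hypothetical helper: decode labels and leave NaN where it was."""
    present = encoded.dropna().astype(int)  # labels must be integers again
    decoded = pd.Series(
        encoder.inverse_transform(present.to_numpy()),
        index=present.index,
    )
    return decoded.reindex(encoded.index)  # dropped rows come back as NaN


restored = inverse_keep_nan(encoder, encoded)
print(restored.tolist())
```

The cast back to int matters because the encoded column is float once it holds a NaN, while the encoder expects the integer labels it produced.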
Other options, such as a registry superclass for the encoders, are also possible and are compatible with the first solution, but the explicit loop makes it easier to iterate through the columns.
How it works
df.apply(lambda series: ...) applies a function that returns a pd.Series to each column, so it returns a DataFrame with the new values.
Expression step by step:
pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
)
- series[series.notnull()] drops the NaN values and feeds the rest to fit_transform.
- since the label encoder returns a numpy.array and discards the index, index=series[series.notnull()].index restores it so that the values are put back in the right rows. Without the index:
print(df)
Out:
     A  B    C
0    x  1  2.0
1  NaN  6  1.0
2    z  9  NaN
df = df.apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
))
print(df)
Out:
     A  B    C
0  0.0  0  1.0
1  1.0  1  0.0
2  NaN  2  NaN
- the values shift from their correct positions, and an IndexError may even occur.
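The effect of the index can be seen in isolation, without an encoder. In this small sketch the labels array is an assumption that stands in for what fit_transform would return for the two non-null values.

```python
import numpy as np
import pandas as pd

column = pd.Series(["x", np.nan, "z"])
kept = column[column.notnull()]   # index is [0, 2]
labels = np.array([0, 1])         # stand-in for the encoder output on kept

# Without the index, pandas numbers the labels 0 and 1, so the second
# label lands on row 1, which was the missing row.
shifted = pd.Series(labels).reindex(column.index)

# With the original index, the labels go back to rows 0 and 2.
aligned = pd.Series(labels, index=kept.index).reindex(column.index)

print(shifted.tolist())   # [0.0, 1.0, nan]
print(aligned.tolist())   # [0.0, nan, 1.0]
```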
Single encoder for all columns
In that case, stack the DataFrame, fit the encoder, then unstack it:
series_stack = df.stack().astype(str)
label_encoder = LabelEncoder()
df = pd.Series(
    label_encoder.fit_transform(series_stack),
    index=series_stack.index
).unstack()
print(df)
Out:
     A    B    C
0  5.0  0.0  2.0
1  NaN  3.0  1.0
2  6.0  4.0  NaN
- since the unstacked result contains NaNs, all values in the DataFrame are floats, so you may prefer to convert them.
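For completeness, here is a sketch of the shared-encoder round trip with integer labels. The sample frame is an assumption reconstructed from the printed input above. The explicit dropna() calls are there so the result does not depend on whether stack() in the installed pandas version discards missing cells by default, and "Int64" is the pandas nullable integer dtype.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed sample data, matching the input frame printed earlier.
df = pd.DataFrame({
    "A": ["x", np.nan, "z"],
    "B": [1, 6, 9],
    "C": [2.0, 1.0, np.nan],
})

# One long Series of all non-missing cells, as strings.
stacked = df.stack().dropna().astype(str)

encoder = LabelEncoder()
encoded = pd.Series(
    encoder.fit_transform(stacked),
    index=stacked.index,
).unstack().astype("Int64")   # integer labels, missing cells stay missing

# Decode: stack again, map the labels back, restore the frame's shape.
codes = encoded.stack().dropna().astype(int)
decoded = pd.Series(
    encoder.inverse_transform(codes.to_numpy()),
    index=codes.index,
).unstack()

print(encoded)
print(decoded)
```

Note that the decoded values are strings ('1', '2.0', 'x'), because everything was cast with astype(str) before fitting; the original numeric types are not recovered.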
Hope it helps.