LabelEncoder That Keeps Missing Values As 'nan'
Solution 1:
The first question is: do you wish to encode each column separately or encode them all with one encoding?
The expression df = df.astype(str).apply(LabelEncoder().fit_transform)
implies that you encode all the columns separately.
In that case you can do the following:
df = df.apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
))
print(df)
Out:
A B C
0 0.0 0 1.0
1 NaN 1 0.0
2 1.0 2 NaN
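If you want to run the snippet above on its own, here is a self-contained sketch. The sample frame is an assumption, reconstructed from the unencoded table printed in the "How it works" part of this answer, and encode_keep_nan is only an illustrative name for the lambda.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data assumed from the unencoded frame shown later in this answer.
df = pd.DataFrame({
    "A": ["x", np.nan, "z"],
    "B": [1, 6, 9],
    "C": [2.0, 1.0, np.nan],
})


def encode_keep_nan(series):
    """Label-encode the non-null values of one column; NaN rows stay NaN."""
    not_null = series[series.notnull()]
    # Reusing the original index lets pandas align the labels back to the
    # right rows; rows that were NaN get no label and therefore stay NaN.
    return pd.Series(LabelEncoder().fit_transform(not_null), index=not_null.index)


encoded = df.apply(encode_keep_nan)
print(encoded)
```

Writing the result to a new name (encoded) instead of overwriting df keeps the original values around for comparison.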
An explanation of how it works is below. But first, a couple of drawbacks of this solution.
Drawbacks
First, the column types end up mixed: if a column contains a NaN value, the column has type float, because NaN is a float in Python.
df.dtypes
A float64
B int64
C float64
dtype: object
A float type makes little sense for labels. Still, you can later ignore all the NaNs and convert the rest to integers.
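One way to get integer labels while keeping the missing markers is pandas' nullable integer dtype. This is a minimal sketch, not part of the original solution; the encoded frame below is typed in by hand to mirror the output above.

```python
import numpy as np
import pandas as pd

# Hand-written stand-in for the encoded frame shown above.
encoded = pd.DataFrame({
    "A": [0.0, np.nan, 1.0],
    "B": [0, 1, 2],
    "C": [1.0, 0.0, np.nan],
})

# "Int64" (capital I) is the nullable integer dtype: labels are stored as
# integers and missing entries are shown as <NA> instead of forcing float.
as_int = encoded.astype("Int64")
print(as_int.dtypes)
print(as_int)
```

The cast only works because every non-missing value is a whole number, which is always the case for LabelEncoder output.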
Second, you probably need to keep the LabelEncoder around, because it is often needed later, for instance to do an inverse transform. This solution doesn't keep the encoders, so you have no such variable.
A simple, explicit solution is:
encoders = dict()
for col_name in df.columns:
    series = df[col_name]
    label_encoder = LabelEncoder()
    df[col_name] = pd.Series(
        label_encoder.fit_transform(series[series.notnull()]),
        index=series[series.notnull()].index
    )
    encoders[col_name] = label_encoder
print(df)
Out:
A B C
0 0.0 0 1.0
1 NaN 1 0.0
2 1.0 2 NaN
- more code, but the result is the same
print(encoders)
Out:
{'A': LabelEncoder(), 'B': LabelEncoder(), 'C': LabelEncoder()}
- the encoders are now available too, and so is the inverse transform (drop the NaNs first!):
encoders['B'].inverse_transform(df['B'])
Out:
array([1, 6, 9])
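Column B has no missing values, so the call above works directly. For a column that does contain NaN, the "drop the NaNs first" step could look like the sketch below. The names original, valid and decoded are illustrative, and the sample column is assumed to match column A of the example frame.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed sample column with one missing value.
original = pd.Series(["x", np.nan, "z"], name="A")
mask = original.notnull()

# Encode as in the answer: fit on non-null values, keep their index.
encoder = LabelEncoder()
encoded = pd.Series(encoder.fit_transform(original[mask]), index=original[mask].index)
encoded = encoded.reindex(original.index)  # NaN reappears at row 1, dtype float

# inverse_transform expects integer labels, so drop NaN and cast to int first.
valid = encoded.dropna().astype(int)
decoded = pd.Series(encoder.inverse_transform(valid), index=valid.index)

# Reindex to put NaN back in the rows that were missing.
decoded = decoded.reindex(encoded.index)
print(decoded.tolist())
```

The same index trick is used in both directions: transform only the valid rows, then let pandas realign them.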
Other options, such as a registry superclass for the encoders, are also available. They are compatible with the first solution, but the explicit loop makes it easier to iterate through the columns.
How it works
df.apply(lambda series: ...) applies a function that returns a pd.Series to each column, so it returns a dataframe with the new values.
Expression step by step:
pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
)
- series[series.notnull()] drops the NaN values, then feeds the rest to fit_transform.
- as the label encoder returns a numpy.array and throws away the index, index=series[series.notnull()].index restores it so that the columns are put back together correctly. If you don't pass the index:
print(df)
Out:
A B C
0 x 1 2.0
1 NaN 6 1.0
2 z 9 NaN
df = df.apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
))
print(df)
Out:
A B C
0 0.0 0 1.0
1 1.0 1 0.0
2 NaN 2 NaN
- the values shift away from their correct positions, and an IndexError may even occur.
Single encoder for all columns
In that case, stack the dataframe, fit the encoder, then unstack it:
series_stack = df.stack().astype(str)
label_encoder = LabelEncoder()
df = pd.Series(
    label_encoder.fit_transform(series_stack),
    index=series_stack.index
).unstack()
print(df)
Out:
A B C
0 5.0 0.0 2.0
1 NaN 3.0 1.0
2 6.0 4.0 NaN
- as the unstacked result is a DataFrame containing NaNs, all of its values are floats, so you may prefer to convert them.
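As a sketch of that conversion for the single-encoder case, the version below casts the unstacked result to the nullable "Int64" dtype. Two things here are my additions rather than part of the code above: the explicit dropna() after stack(), so the result does not depend on whether stack() itself drops missing values in your pandas version, and the reindex() call, which guards against a row or column vanishing if it were entirely NaN. The sample frame is assumed from the example data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed sample data, same as the unencoded frame in this answer.
df = pd.DataFrame({
    "A": ["x", np.nan, "z"],
    "B": [1, 6, 9],
    "C": [2.0, 1.0, np.nan],
})

# One long Series of all non-missing cells, converted to strings so that a
# single encoder can handle the mixed column types.
stacked = df.stack().dropna().astype(str)

shared_encoder = LabelEncoder()
codes = pd.Series(shared_encoder.fit_transform(stacked), index=stacked.index)

# unstack() brings back the table shape with NaN in the missing cells;
# "Int64" then stores the labels as integers and shows the gaps as <NA>.
encoded = (
    codes.unstack()
    .reindex(index=df.index, columns=df.columns)
    .astype("Int64")
)
print(encoded)
```

Because one encoder saw every value, shared_encoder.inverse_transform can decode a label from any column.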
Hope it helps.