How To Drop Columns Based On Multiple Filters In A Dataframe Using Pyspark?
I have a list of valid values that a cell can have. If any cell in a column contains an invalid value, I need to drop the whole column. I understand there are existing answers for dropping rows with invalid values in a particular column, but here I need to drop entire columns instead.
Solution 1:
I am not only looking for a code solution, but for off-the-shelf functionality provided by PySpark.
Unfortunately, Spark is designed to operate in parallel on a row-by-row basis. Filtering out entire columns based on their contents is not something for which there is an "off-the-shelf" solution.
Nevertheless, here is one approach you can take:
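For concreteness, the snippets below assume a DataFrame along these lines. The question's actual data is not shown, so these rows are a reconstruction, chosen to be consistent with the outputs further down:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Hypothetical data: Column 2, Column 4, and Column 5 contain values
# outside the valid set, so they will be dropped.
df = spark.createDataFrame(
    [
        ("Ronaldo", "Messi",  "Messi",  "Virgil", "Pele"),
        ("Ronaldo", "Zidane", "Virgil", "Kaka",   "Maradona"),
        ("Ronaldo", "Messi",  "Messi",  "Messi",  "Zlatan"),
    ],
    ["Column 1", "Column 2", "Column 3", "Column 4", "Column 5"],
)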
First, collect the count of invalid values in each column.
from pyspark.sql.functions import col, lit, sum as _sum, when
valid = ['Messi', 'Ronaldo', 'Virgil']
# For each column, emit 0 when the value is in the valid set and 1 otherwise,
# then sum: the result is one aggregated row with the count of invalid values per column.
invalid_counts = df.select(
    *[_sum(when(col(c).isin(valid), lit(0)).otherwise(lit(1))).alias(c) for c in df.columns]
).collect()
print(invalid_counts)
#[Row(Column 1=0, Column 2=1, Column 3=0, Column 4=1, Column 5=3)]
This output will be a list with only one element. You can iterate over the items in this element to find the columns to keep.
valid_columns = [k for k,v in invalid_counts[0].asDict().items() if v == 0]
print(valid_columns)
#['Column 3', 'Column 1']
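As an aside, since the aggregation yields exactly one row, DataFrame.first() gets that Row directly, without indexing into a collected list. This is a minor variant, not part of the original answer:
# Equivalent: first() returns the single aggregated Row directly.
counts_row = df.select(
    *[_sum(when(col(c).isin(valid), lit(0)).otherwise(lit(1))).alias(c) for c in df.columns]
).first()
valid_columns = [k for k, v in counts_row.asDict().items() if v == 0]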
Now just select these columns from your original DataFrame. You can first sort valid_columns using list.index if you want to maintain the original column order.
# Reorder to match the original DataFrame's column order.
valid_columns = sorted(valid_columns, key=df.columns.index)
df.select(valid_columns).show()
#+--------+--------+
#|Column 1|Column 3|
#+--------+--------+
#| Ronaldo|   Messi|
#| Ronaldo|  Virgil|
#| Ronaldo|   Messi|
#+--------+--------+
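Putting the pieces together, here is a small helper, a sketch of the approach above rather than anything off-the-shelf from PySpark, that drops every column containing at least one value outside the allowed set:
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, lit, sum as _sum, when

def drop_invalid_columns(df: DataFrame, valid: list) -> DataFrame:
    """Drop every column of df that contains a value not in `valid`."""
    # One aggregated row: per-column count of values outside the allowed set.
    counts = df.select(
        *[_sum(when(col(c).isin(valid), lit(0)).otherwise(lit(1))).alias(c) for c in df.columns]
    ).first()
    # Keep columns with zero invalid values; iterating df.columns preserves order,
    # so no separate sorting step is needed.
    return df.select([c for c in df.columns if counts[c] == 0])

# Usage (with the hypothetical DataFrame above):
# drop_invalid_columns(df, ['Messi', 'Ronaldo', 'Virgil']).show()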