Pyspark Boolean Pivot
I have some data mimicking the following structure:
rdd = sc.parallelize([(0, 1), (0, 5), (0, 3), (1, 2), (1, 3), (2, 6)])
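For orientation, the result the answers aim for is one row per group and one boolean column per distinct value. The following is only a plain-Python sketch of that target, not Spark code. The helper name `boolean_pivot` and the dict-of-dicts layout are assumptions made for illustration, and the `value_` prefix follows the column names used in the answers.

```python
def boolean_pivot(pairs):
    """Map each group to {"value_<v>": bool} over every distinct value."""
    seen = set(pairs)                        # (group, value) pairs that occur
    groups = sorted({g for g, _ in pairs})   # one output row per group
    values = sorted({v for _, v in pairs})   # one output column per value
    return {
        g: {"value_%d" % v: (g, v) in seen for v in values}
        for g in groups
    }


# Same pairs as in the question.
pairs = [(0, 1), (0, 5), (0, 3), (1, 2), (1, 3), (2, 6)]
table = boolean_pivot(pairs)
for group, row in table.items():
    print(group, row)
```

A cell is True exactly when that (group, value) pair occurs in the data. That is what the pivoted Spark DataFrame should contain once nulls are replaced with False.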
Solution 1:
@Psidom's answer will only work on Spark version 2.3 and higher, as pyspark.sql.DataFrameNaFunctions did not support bool in prior versions.
This is what I get when I run that code in Spark 2.1:
import pyspark.sql.functions as F
(df_data.withColumn('value', F.concat(F.lit('value_'), df_data.value))
    .groupBy('group').pivot('value').agg(F.count('*').isNotNull())
    .na.fill(False).show())
#+-----+-------+-------+-------+-------+-------+
#|group|value_1|value_2|value_3|value_5|value_6|
#+-----+-------+-------+-------+-------+-------+
#|    0|   true|   null|   true|   true|   null|
#|    1|   null|   true|   true|   null|   null|
#|    2|   null|   null|   null|   null|   true|
#+-----+-------+-------+-------+-------+-------+
Here is an alternative solution that should work for Spark 2.2 and lower:
# first pivot and fill nulls with 0
df = df_data.groupBy('group').pivot('value').count().na.fill(0)
df.show()
#+-----+---+---+---+---+---+
#|group|  1|  2|  3|  5|  6|
#+-----+---+---+---+---+---+
#|    0|  1|  0|  1|  1|  0|
#|    1|  0|  1|  1|  0|  0|
#|    2|  0|  0|  0|  0|  1|
#+-----+---+---+---+---+---+
Now use select to rename the columns and cast the values from int to bool:
df.select(
    *[F.col(c) if c == 'group' else F.col(c).cast('boolean').alias('value_' + c)
      for c in df.columns]
).show()
+-----+-------+-------+-------+-------+-------+
|group|value_1|value_2|value_3|value_5|value_6|
+-----+-------+-------+-------+-------+-------+
| 0| true| false| true| true| false|
| 1| false| true| true| false| false|
| 2| false| false| false| false| true|
+-----+-------+-------+-------+-------+-------+
Solution 2:
Here is one way:
import pyspark.sql.functions as F
(df_data.withColumn('value', F.concat(F.lit('value_'), df_data.value))
    .groupBy('group').pivot('value').agg(F.count('*').isNotNull())
    .na.fill(False).show())
+-----+-------+-------+-------+-------+-------+
|group|value_1|value_2|value_3|value_5|value_6|
+-----+-------+-------+-------+-------+-------+
|    0|   true|  false|   true|   true|  false|
|    1|  false|   true|   true|  false|  false|
|    2|  false|  false|  false|  false|   true|
+-----+-------+-------+-------+-------+-------+