
Pyspark Boolean Pivot

I have some data mimicking the following structure: rdd = sc.parallelize( [ (0,1), (0,5), (0,3), (1,2), (1,3), (2,6) ]

Solution 1:

@Psidom's answer works only on Spark 2.3 and higher, because na.fill (pyspark.sql.DataFrameNaFunctions.fill) did not accept bool values in earlier versions.

This is what I get when I run that code in Spark 2.1; note that the nulls are not replaced with false:

import pyspark.sql.functions as F

(df_data.withColumn('value', F.concat(F.lit('value_'), df_data.value))
        .groupBy('group').pivot('value').agg(F.count('*').isNotNull())
        .na.fill(False).show())
#+-----+-------+-------+-------+-------+-------+
#|group|value_1|value_2|value_3|value_5|value_6|
#+-----+-------+-------+-------+-------+-------+
#|    0|   true|   null|   true|   true|   null|
#|    1|   null|   true|   true|   null|   null|
#|    2|   null|   null|   null|   null|   true|
#+-----+-------+-------+-------+-------+-------+
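If the same script has to run on clusters with different Spark versions, the 2.3 cutoff described above can be checked up front. This is a minimal sketch, assuming the version is available as a dotted string (for example from spark.version or sc.version); the helper name supports_bool_fill is hypothetical and not part of PySpark:

```python
def supports_bool_fill(version):
    """Return True if a Spark version string is 2.3 or newer."""
    # Keep only the major and minor components, e.g. "2.1.0" -> (2, 1).
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (2, 3)


for v in ("2.1.0", "2.2.1", "2.3.0", "3.4.1"):
    print(v, supports_bool_fill(v))
```

When the check returns False, na.fill(False) should not be relied on, and an approach that avoids filling with a bool is the safer choice.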

Here is an alternative solution that should work for Spark 2.2 and lower:

# first pivot and fill nulls with 0
df = df_data.groupBy('group').pivot('value').count().na.fill(0)
df.show()
#+-----+---+---+---+---+---+
#|group|  1|  2|  3|  5|  6|
#+-----+---+---+---+---+---+
#|    0|  1|  0|  1|  1|  0|
#|    1|  0|  1|  1|  0|  0|
#|    2|  0|  0|  0|  0|  1|
#+-----+---+---+---+---+---+

Now use select to rename the columns and cast the values from int to bool:

df.select(
    *[F.col(c) if c == 'group' else F.col(c).cast('boolean').alias('value_' + c)
      for c in df.columns]
).show()
+-----+-------+-------+-------+-------+-------+
|group|value_1|value_2|value_3|value_5|value_6|
+-----+-------+-------+-------+-------+-------+
|    0|   true|  false|   true|   true|  false|
|    1|  false|   true|   true|  false|  false|
|    2|  false|  false|  false|  false|   true|
+-----+-------+-------+-------+-------+-------+
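To sanity-check the expected output without a cluster, the same count, fill-with-zero, cast-to-boolean logic can be mimicked in plain Python. This is only an illustrative sketch, not Spark code; the function name boolean_pivot is made up for this example, and the sample pairs are the ones from the question:

```python
from collections import Counter

# Sample (group, value) pairs taken from the question.
pairs = [(0, 1), (0, 5), (0, 3), (1, 2), (1, 3), (2, 6)]


def boolean_pivot(pairs):
    """Pivot (group, value) pairs into {group: {'value_<v>': bool}}."""
    # Step 1: count how often each (group, value) combination occurs.
    counts = Counter(pairs)
    groups = sorted({g for g, _ in pairs})
    values = sorted({v for _, v in pairs})
    # Steps 2 and 3: a missing combination counts as 0, and bool(0) is False.
    return {
        g: {"value_%d" % v: bool(counts.get((g, v), 0)) for v in values}
        for g in groups
    }


result = boolean_pivot(pairs)
for group, row in result.items():
    print(group, row)
```

Each printed row should agree with the corresponding row of the Spark output above, which makes this a convenient reference when testing the DataFrame version on small inputs.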

Solution 2:

Here is one way:

import pyspark.sql.functions as F

(df_data.withColumn('value', F.concat(F.lit('value_'), df_data.value))
        .groupBy('group').pivot('value').agg(F.count('*').isNotNull())
        .na.fill(False).show())

+-----+-------+-------+-------+-------+-------+
|group|value_1|value_2|value_3|value_5|value_6|
+-----+-------+-------+-------+-------+-------+
|    0|   true|  false|   true|   true|  false|
|    1|  false|   true|   true|  false|  false|
|    2|  false|  false|  false|  false|   true|
+-----+-------+-------+-------+-------+-------+
