
Read a File in PySpark with Custom Column and Record Delimiter

Is there any way to use custom record delimiters while reading a CSV file in PySpark? In my file, records are separated by ** instead of newlines. Is there a way to use this custom delimiter when reading the file?

Solution 1:

I would read it as plain text into an RDD and then split on the characters that form your record separator. Afterwards, convert it to a DataFrame, like this:

rdd1 = (sc
        .textFile("/jupyter/nfs/test.txt")
        .flatMap(lambda line: line.split("**"))  # split each text line into records on **
        .map(lambda x: x.split(";"))             # split each record into its fields on ;
       )
df1 = rdd1.toDF(["a", "b", "c"])
df1.show()

+---+---+---+
|  a|  b|  c|
+---+---+---+
| a1| b1| c1|
| a2| b2| c2|
| a3| b2| c3|
+---+---+---+

or, alternatively, like this:


from pyspark.sql import functions as f

rdd2 = (sc
        .textFile("/jupyter/nfs/test.txt")
        .flatMap(lambda line: line.split("**"))  # split each text line into records on **
        .map(lambda x: [x])                      # wrap each record in a list for toDF
       )
df2 = (rdd2
       .toDF(["abc"])
       .withColumn("a", f.split(f.col("abc"), ";")[0])
       .withColumn("b", f.split(f.col("abc"), ";")[1])
       .withColumn("c", f.split(f.col("abc"), ";")[2])
       .drop("abc")
      )
df2.show()

+---+---+---+
|  a|  b|  c|
+---+---+---+
| a1| b1| c1|
| a2| b2| c2|
| a3| b2| c3|
+---+---+---+
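As a small refinement, the three withColumn calls each repeat the split expression. Computing the split once and selecting its elements reads a bit tidier; this is just a sketch of the same step, not a change in behavior:

from pyspark.sql import functions as f

cols = f.split(f.col("abc"), ";")  # build the split expression once
df2 = (rdd2
       .toDF(["abc"])
       .select(cols[0].alias("a"),
               cols[1].alias("b"),
               cols[2].alias("c"))
      )
df2.show()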

where, in each of these examples, test.txt looks like

a1;b1;c1**a2;b2;c2**a3;b2;c3
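If you would rather not split records yourself in a flatMap, Hadoop's TextInputFormat can do it for you: its textinputformat.record.delimiter setting accepts an arbitrary byte sequence. A minimal sketch under that assumption, reading the same test.txt:

rdd3 = (sc
        .newAPIHadoopFile(
            "/jupyter/nfs/test.txt",
            "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
            "org.apache.hadoop.io.LongWritable",
            "org.apache.hadoop.io.Text",
            conf={"textinputformat.record.delimiter": "**"})
        .map(lambda kv: kv[1].split(";"))  # values are whole records; keys are byte offsets, dropped here
       )
rdd3.toDF(["a", "b", "c"]).show()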

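Newer Spark versions also expose this directly on the DataFrame reader: the text source takes a lineSep option (Spark 2.4 or later, if memory serves; note that the csv reader's lineSep is limited to a single character, so it cannot take ** itself). A sketch assuming a recent Spark and an active SparkSession named spark:

from pyspark.sql import functions as f

df3 = (spark.read
       .option("lineSep", "**")        # records end at ** instead of newlines
       .text("/jupyter/nfs/test.txt")  # yields a single column named "value"
       .select(f.split("value", ";").alias("cols"))
       .select(f.col("cols")[0].alias("a"),
               f.col("cols")[1].alias("b"),
               f.col("cols")[2].alias("c"))
      )
df3.show()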