Read A File In PySpark With Custom Column And Record Delimiter
Is there any way to use a custom record delimiter while reading a CSV file in PySpark? In my file, records are separated by ** instead of newlines. Is there any way of using this custom delimiter?
Solution 1:
I would read it as a plain text file into an RDD, split on the character sequence that is your record break, and then convert it to a DataFrame, like this:
# Split each physical line on the "**" record separator,
# then split each record on ";" into its fields.
rdd1 = (sc
    .textFile("/jupyter/nfs/test.txt")
    .flatMap(lambda line: line.split("**"))
    .map(lambda x: x.split(";"))
)
df1 = rdd1.toDF(["a", "b", "c"])
df1.show()
+---+---+---+
| a| b| c|
+---+---+---+
| a1| b1| c1|
| a2| b2| c2|
| a3| b2| c3|
+---+---+---+
or like this:
from pyspark.sql import functions as f

# Keep each record as a single string column first...
rdd2 = (sc
    .textFile("/jupyter/nfs/test.txt")
    .flatMap(lambda line: line.split("**"))
    .map(lambda x: [x])
)
# ...then split it into columns with the DataFrame API.
df2 = (rdd2
    .toDF(["abc"])
    .withColumn("a", f.split(f.col("abc"), ";")[0])
    .withColumn("b", f.split(f.col("abc"), ";")[1])
    .withColumn("c", f.split(f.col("abc"), ";")[2])
    .drop("abc")
)
df2.show()
+---+---+---+
| a| b| c|
+---+---+---+
| a1| b1| c1|
| a2| b2| c2|
| a3| b2| c3|
+---+---+---+
where the test.txt file looks like:
a1;b1;c1**a2;b2;c2**a3;b2;c3
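For completeness: on Spark 2.4 and later, the text data source accepts a lineSep option, which should let you do the record splitting at read time and skip the flatMap entirely (the CSV reader also has a lineSep option in newer versions, but it is limited to a single character, so it won't take "**"). A minimal sketch, assuming a SparkSession named spark and the same test.txt:

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Treat "**" as the record separator while reading; each record
# arrives as one row in a single string column named "value".
raw = spark.read.option("lineSep", "**").text("/jupyter/nfs/test.txt")

# Split each record on ";" and name the three fields.
parts = f.split(f.col("value"), ";")
df3 = raw.select(
    parts[0].alias("a"),
    parts[1].alias("b"),
    parts[2].alias("c"),
)
df3.show()

Another RDD-level variant you sometimes see is sc.newAPIHadoopFile with the Hadoop textinputformat.record.delimiter setting, which makes Hadoop split records on "**" before they ever reach Python.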