Wednesday, April 2, 2025

Spark - Explode

The explode function in PySpark transforms an array or map column into multiple rows, one row per element. It is commonly used when working with nested data structures, such as JSON arrays or maps, inside a DataFrame.

1.) explode converts each element of an array or map column into a separate row.

2.) Rows whose array is empty or null are dropped by explode.

3.) When applied to a map, explode produces one row per key-value pair, in two columns named key and value.

4.) If you want to keep rows with empty or null arrays, use explode_outer; posexplode additionally returns each element's position (index) within the array. Both variants are sketched after the code example below.


from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("sangam_test_explode").getOrCreate()

# Sample data: id 3 has an empty array and will disappear after explode
data = [
    (1, ["apple", "banana", "cherry"]),
    (2, ["grape", "orange"]),
    (3, [])
]

df = spark.createDataFrame(data, ["id", "fruits"])
df.show(truncate=False)

# Each element of the fruits array becomes its own row; the empty array in row 3 is dropped
df_explode = df.withColumn("fruits", explode(col("fruits")))
df_explode.show(truncate=False)
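
Point 3 above can be illustrated with a map column. The sketch below reuses the SparkSession created above; the prices column and its contents are made-up illustration data.

from pyspark.sql.functions import col, explode

# Hypothetical map column: fruit name -> price (illustration only)
map_data = [
    (1, {"apple": 100, "banana": 40}),
    (2, {"grape": 80})
]
df_map = spark.createDataFrame(map_data, ["id", "prices"])

# explode on a map yields one row per entry, in two new columns: key and value
df_map_exploded = df_map.select("id", explode(col("prices")))
df_map_exploded.show(truncate=False)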


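Points 2 and 4 can be sketched with the same df defined above: explode_outer keeps the row whose array is empty (fruits becomes null), while posexplode adds a pos column holding each element's index but, like explode, drops the empty-array row.

from pyspark.sql.functions import col, explode_outer, posexplode

# explode_outer keeps id 3, with fruits = null
df.withColumn("fruits", explode_outer(col("fruits"))).show(truncate=False)

# posexplode returns two columns, pos and col (the element's index and value); id 3 is still dropped
df.select("id", posexplode(col("fruits"))).show(truncate=False)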



