Wednesday, April 2, 2025

Spark - Explode

 The explode function in PySpark is used to transform an array or map column into multiple rows. It is commonly used when dealing with nested data structures like JSON or arrays within a DataFrame.

1.) explode is used to convert array or map elements into separate rows.

2.) If the array is empty, that row is removed.

3.) When applied to a map, explode creates key-value pairs as rows.

4.) To keep rows whose array is empty or null, use explode_outer; posexplode, by contrast, returns each element's position (index) within the array and, like explode, drops such rows. See the sketch after the example below.


from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("sangam_test_explode").getOrCreate()

# Sample data: id 3 has an empty array
data = [
    (1, ["apple", "banana", "cherry"]),
    (2, ["grape", "orange"]),
    (3, [])
]

df = spark.createDataFrame(data, ["id", "fruits"])
df.show(truncate=False)

# explode() produces one row per array element; the row with the empty array is dropped
df_explode = df.withColumn("fruits", explode(col("fruits")))
df_explode.show(truncate=False)
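
As a quick follow-up (a minimal sketch reusing the df defined above), explode_outer keeps the row whose array is empty, while posexplode adds each element's position:

from pyspark.sql.functions import col, explode_outer, posexplode

# explode_outer keeps id 3, with fruits = NULL instead of dropping the row
df.withColumn("fruits", explode_outer(col("fruits"))).show(truncate=False)

# posexplode returns two columns, pos (index within the array) and col (the element);
# like explode, it drops the row whose array is empty
df.select("id", posexplode(col("fruits"))).show(truncate=False)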






Saturday, March 29, 2025

Understanding AST (Abstract Syntax Tree) in Spark

An Abstract Syntax Tree (AST) is a way to break down a query into a tree-like structure so that the system can process it more efficiently. This happens in Apache Spark, SQL databases (like Snowflake, PostgreSQL, MySQL), and other data processing engines.

1. AST in Apache Spark
When you write a PySpark DataFrame operation or an SQL query, Spark follows these steps:
Step 1: Parsing (Building the AST)
Spark reads the query and converts it into an AST, which represents its structure.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("AST_Example").getOrCreate()
df = spark.read.csv("data.csv", header=True)
df_filtered = df.select("name", "age").filter(df["age"] > 30)
df_filtered.explain(mode="extended") # Show execution plan

Step 2: Logical Plan (Understanding the Query)
Spark converts the AST into a Logical Plan, which describes what needs to be done, without deciding how to do it.
Project [name, age]
└── Filter (age > 30)
    └── Read data.csv

Step 3: Optimization (Making it Faster)
Spark improves the Logical Plan to make the query run faster and cheaper.
Some common optimizations (a small sketch of how to observe them follows below):
Predicate Pushdown → Moves filters closer to the data to reduce scanning.
Projection Pruning → Removes unnecessary columns.
Constant Folding → Simplifies calculations before execution (e.g., 5 + 10 is replaced with 15).
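
As an illustrative sketch of how these show up in practice (people.parquet is a hypothetical input file), the plans printed by explain() reveal each optimization:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.appName("optimizer_demo").getOrCreate()

df = spark.read.parquet("people.parquet")  # hypothetical input file

# Only "name" and "age" are read (projection pruning), the filter appears as
# PushedFilters on the scan (predicate pushdown), and lit(2) + lit(3) is
# replaced by the literal 5 (constant folding).
result = (df.filter(col("age") > 30)
            .select("name", (lit(2) + lit(3)).alias("bonus")))

result.explain(mode="extended")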

Step 4: Physical Plan (Deciding How to Execute)
Finally, Spark chooses the best way to run the query and creates a Physical Plan.
Example Physical Plan:
*(1) Project [name, age]
*(2) Filter (age#1 > 30)
*(3) Read CSV file


Why is AST Important?

Helps Spark and databases understand queries.

Optimizes performance before running the query.

Makes sure the query runs efficiently on large datasets.


Saturday, March 22, 2025

Azure - Databricks - Storing vs Registering a DataFrame in Delta Lake

 




When to Use What?

✅ Use "Storing DataFrame as Delta Table" if:

  • You don't need SQL access and will only use Spark for reading/writing.

  • The table is temporary or an intermediate dataset in a pipeline.

  • You want to manually control schema enforcement and versioning.

✅ Use "Registering DataFrame in Delta Lake" if:

  • You want the table available for SQL queries in Databricks or external BI tools.

  • You need governance, access control, and metadata tracking.

  • You plan to use Delta Sharing, Unity Catalog, or ADF for integration.
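
A minimal sketch of the two approaches, assuming an existing DataFrame df (the path and table name below are hypothetical):

# 1) Storing: write Delta files to a path; no entry is created in the metastore
df.write.format("delta").mode("overwrite").save("/mnt/datalake/sales_delta")

# Reading it back requires the path
df_path = spark.read.format("delta").load("/mnt/datalake/sales_delta")

# 2) Registering: save as a managed table so it is visible to SQL and governance tools
df.write.format("delta").mode("overwrite").saveAsTable("sales_delta")

# The registered table can now be queried by name
spark.sql("SELECT COUNT(*) FROM sales_delta").show()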











Monday, March 17, 2025

Azure : Azure Data Lake vs. Azure Blob Storage

 Azure Data Lake vs. Azure Blob Storage: Which One Should You Use?

Both Azure Data Lake Storage (ADLS) and Azure Blob Storage are powerful cloud storage solutions, but they serve different purposes. Let's break it down:


 Purpose & Use Cases

ADLS: Built for big data & analytics, perfect for handling massive datasets and integrating with tools like Databricks, Synapse, and Spark.

Blob Storage: A general-purpose object store, great for storing images, videos, backups, and web content.


Storage Structure & Performance

ADLS: Uses a hierarchical namespace, meaning files and folders act like a traditional file system; this makes operations like renaming and deleting much faster!

Blob Storage: Uses a flat structure, treating everything as an object, which is simpler but less efficient for managing large-scale structured data.


Access & Security

ADLS: Offers POSIX-like ACLs (fine-grained permissions) alongside Azure RBAC (Role-Based Access Control).

Blob Storage: More flexible, allowing access via SAS tokens, RBAC, and anonymous public access if needed.


Cost Considerations

ADLS: Slightly more expensive because of its advanced capabilities.

Blob Storage: Cheaper and more cost-effective for basic storage needs.


Which One Should You Choose?

 Pick ADLS if you're dealing with big data, analytics, or machine learning workloads.

 Pick Blob Storage if you need scalable storage for backups, media files, and general object storage.

Sunday, March 16, 2025

Spark - Constant Folding

Constant Folding: Spark's Hidden Efficiency Booster 🚀

Imagine this: you write a query in Spark that includes something like (2 + 3) * column_name. Now, wouldn't it be smarter to compute (2 + 3) just once, rather than doing the math every single time the query is run? That's exactly what constant folding does for you!

Spark's Catalyst Optimizer recognizes constant expressions like (2 + 3), evaluates them during the query optimization phase, and replaces them with their computed value, in this case 5. So, the query is transformed into SELECT 5 * column_name FROM table_name. Simple, efficient, and ready to blaze through execution! 🔥

This clever optimization reduces the computation Spark needs to perform when processing data, ensuring faster and more efficient query execution. It's like giving your queries a little brainpower boost before they hit the big leagues.
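
A small sketch of how to observe this (the view name t and column value are made up for illustration): the optimized logical plan printed by explain() shows the folded literal 5.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("constant_folding_demo").getOrCreate()

# Hypothetical single-column table registered as a temp view
spark.range(5).withColumnRenamed("id", "value").createOrReplaceTempView("t")

# Catalyst folds (2 + 3) into 5 during optimization, before any data is processed
spark.sql("SELECT (2 + 3) * value AS result FROM t").explain(mode="extended")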

Friday, March 14, 2025

Join Strategies in Apache Spark


🔹 Broadcast Hash Join

This join is ideal when one dataset is significantly smaller than the other. The smaller dataset is sent to all executor nodes, allowing each node to join its partitions of the larger dataset with the broadcasted data locally. Since no shuffle is required, this method is highly efficient for joins where one table is small enough to fit in memory.


🔹 Sort Merge Join

The default strategy when dealing with large datasets that cannot fit into memory. Both datasets are first sorted based on the join key, followed by a merge operation to match records efficiently. This approach requires shuffling for sorting, which can be computationally expensive. It is best used when neither table is small enough for broadcasting.


🔹 Shuffle Hash Join

Used when one dataset is smaller than the other, but not small enough for a broadcast join. The data is hashed and repartitioned based on the join keys, ensuring that matching records are processed within the same partition. This method is more efficient than Sort Merge Join but requires partitioning overhead. It is useful when working with medium-sized datasets.


🔹 Broadcast Nested Loop Join

This method is chosen when no other join strategy applies, often as a last resort. It broadcasts the smaller dataset and iterates through it for each row in the larger dataset, applying filtering conditions to retain only matching records. Since this approach does not rely on keys, it is commonly used in cross joins or joins with non-equi conditions where standard join algorithms are inefficient.


🔹 Cartesian Product Join

A cross join that generates all possible combinations between two datasets. Each row from the first dataset is paired with every row in the second dataset. This join strategy is explicitly used when a cross join is needed or when no join condition is provided. Since it produces an exponential number of rows, it is typically avoided for large datasets due to high computational cost.


🔹 Skew Join

Designed to handle data skew, where certain keys have a disproportionately large number of records. Spark detects skewed keys and splits their partitions into smaller chunks, distributing the workload evenly across the cluster. This ensures that the join operation runs efficiently without bottlenecking specific partitions. Skew join optimization is particularly useful when a few keys dominate the dataset, causing slow performance in standard joins.



✔ Broadcast joins are fastest when one table is small.

✔ Sort Merge Join is default for large, unsorted datasets.

✔ Shuffle Hash Join is effective for medium-sized tables.

✔ Broadcast Nested Loop Join is a fallback when other joins aren't feasible.

✔ Cartesian Join should be used cautiously due to high computation.

✔ Skew Join prevents bottlenecks caused by uneven data distribution.


Quick recap:

👉 Broadcast Hash Join – 🚀 When one table is small, broadcast it! Eliminates shuffle, boosts speed!

👉 Sort Merge Join – 📊 For large datasets. Sort both sides before merging. Expensive shuffle, but reliable.

👉 Shuffle Hash Join – 🔄 Medium-sized tables? Hash & partition them efficiently. Faster than Sort Merge.

👉 Broadcast Nested Loop Join – ⚠️ Last resort! When no keys exist, broadcast small table & loop through rows. Costly!

👉 Skew Join – 🎯 Fix data skew by splitting large keys into smaller partitions. No more stragglers!
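
A minimal sketch of how these strategies can be requested explicitly (table names, paths, and keys are hypothetical; Spark treats hints as suggestions and may still choose a different plan):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join_strategy_demo").getOrCreate()

orders = spark.read.parquet("orders.parquet")        # hypothetical large table
customers = spark.read.parquet("customers.parquet")  # hypothetical small table

# Broadcast Hash Join: ship the small table to every executor, no shuffle of orders
bhj = orders.join(broadcast(customers), "customer_id")

# Sort Merge Join and Shuffle Hash Join via DataFrame join hints
smj = orders.join(customers.hint("merge"), "customer_id")
shj = orders.join(customers.hint("shuffle_hash"), "customer_id")

# Skew handling via Adaptive Query Execution
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

bhj.explain()  # the physical plan names the chosen strategy, e.g. BroadcastHashJoin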

 

Saturday, June 6, 2020

Hadoop - What is a Job in Hadoop ?

In the field of computer science, a job simply means a piece of a program, and the same idea applies to the Hadoop ecosystem. Here, a job essentially consists of a program, or set of programs, that needs to manipulate a certain piece of data. A job is therefore associated with certain parameters, the first being the set of programs themselves. In the Hadoop ecosystem this is typically a MapReduce program written in Java; however, non-Java jobs can also be submitted to the Hadoop cluster, which we shall discuss in detail later.




The important parameters that identify or make up a job are therefore the programs, the piece of data on which they have to work (the input file path), and the output file or directory where the result of the program's execution is stored. So, a job in the Hadoop ecosystem is a combination of all three.

When we speak specifically about a Java MapReduce job, the programs are submitted to the cluster in the form of a jar file, which packages all the classes that constitute the program to be executed on a particular set of data.
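
To give a flavour of a non-Java job, here is a minimal sketch of a Hadoop Streaming mapper written in Python (file names and paths below are hypothetical):

#!/usr/bin/env python
# mapper.py - minimal word-count mapper for Hadoop Streaming
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # Emit "word<TAB>1"; the framework groups by key before the reducer runs
        print(f"{word}\t1")

# Submitted to the cluster together with the input and output paths, e.g.:
# hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#   -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py \
#   -input /data/input -output /data/output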

Hadoop provides two major features to its users.

The first important feature is a distributed file system, called the Hadoop Distributed File System (HDFS). Once a file is submitted to the Hadoop cluster, it is divided into smaller pieces that reside on multiple data nodes. This service is the Hadoop Distributed File System.

The second important feature is a robust parallel processing framework called the MapReduce framework. Once you submit a job to the Hadoop cluster, the program that is supposed to manipulate the data runs on the multiple machines that hold the pieces of the input file. So, the two core services that make Hadoop so useful are the Hadoop Distributed File System and the parallel processing framework called MapReduce.
