Understanding AST (Abstract Syntax Tree) in Spark
An Abstract Syntax Tree (AST) is a tree-shaped representation of a query's structure that a system builds so it can analyze and optimize the query before running it. Apache Spark, SQL databases (such as Snowflake, PostgreSQL, and MySQL), and other data processing engines all work this way.
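Spark builds its AST inside the Catalyst parser, but the idea itself is language-agnostic. As a rough illustration (using Python's built-in ast module, not Spark), here is how a small expression becomes a tree:

```python
import ast

# Parse a small expression into an abstract syntax tree
tree = ast.parse("age > 30", mode="eval")

# The tree captures structure: a Compare node whose left side is the
# name "age" and whose right side is the constant 30
print(ast.dump(tree.body))
```

The printed node shows that the parser has already separated the column reference, the operator, and the literal — exactly the structure a query engine needs before it can optimize anything.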
1. AST in Apache Spark
When you write a PySpark DataFrame operation or an SQL query, Spark follows these steps:
Step 1: Parsing (Building the AST)
Spark reads the query and converts it into an AST, which represents its structure.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("AST_Example").getOrCreate()

# inferSchema=True so "age" is read as a number rather than a string
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df_filtered = df.select("name", "age").filter(df["age"] > 30)

df_filtered.explain(mode="extended")  # Show parsed, analyzed, optimized, and physical plans
Step 2: Logical Plan (Understanding the Query)
Spark converts the AST into a Logical Plan, which describes what needs to be done, without deciding how to do it.
Project [name, age]
└── Filter (age > 30)
    └── Read data.csv
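The logical plan is literally a tree of operator nodes. A minimal plain-Python sketch of that idea (these class names are hypothetical stand-ins, not Catalyst's real Scala classes):

```python
from dataclasses import dataclass

# Hypothetical node types for illustration only
@dataclass
class Read:
    path: str

@dataclass
class Filter:
    condition: str
    child: object

@dataclass
class Project:
    columns: list
    child: object

# The query above as a tree: Project -> Filter -> Read
plan = Project(["name", "age"], Filter("age > 30", Read("data.csv")))
print(plan)
```

Each node says *what* to do (project, filter, read) but nothing about *how* — no file formats, join strategies, or execution order details yet.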
Step 3: Optimization (Making it Faster)
Spark improves the Logical Plan to make the query run faster and cheaper.
Some common optimizations:
Predicate Pushdown → Moves filters closer to data to reduce scanning.
Column Pruning (also called Projection Pruning) → Removes columns the query never reads.
Constant Folding → Simplifies calculations before execution (e.g., 5 + 10 is replaced with 15).
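Constant folding is easy to observe outside Spark too. CPython's compiler, for example, folds constant arithmetic when it lowers an AST to bytecode — an analogy to Catalyst's rule, not Spark itself:

```python
import ast

# The AST still records the raw expression 5 + 10 as a BinOp node...
tree = ast.parse("5 + 10", mode="eval")
print(type(tree.body).__name__)  # BinOp

# ...but the compiler folds it to a single constant before execution
code = compile("5 + 10", "<expr>", "eval")
print(code.co_consts)  # the folded value 15 appears in the constants
```

Spark's Catalyst applies the same kind of rule to expressions inside the logical plan, so the work happens once at planning time instead of once per row.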
Step 4: Physical Plan (Deciding How to Execute)
Finally, Spark chooses the best way to run the query and creates a Physical Plan.
Example Physical Plan (simplified):
*(1) Project [name#0, age#1]
+- *(1) Filter (isnotnull(age#1) AND (age#1 > 30))
   +- FileScan csv [name#0, age#1]
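What this plan describes can be sketched in plain Python: execution proceeds bottom-up, scanning rows, filtering, then projecting (the rows here are made up for illustration):

```python
# Stand-in for the rows a CSV scan would produce
rows = [
    {"name": "Ann", "age": 34, "city": "Oslo"},
    {"name": "Bo", "age": 25, "city": "Lima"},
]

scanned = iter(rows)                               # FileScan csv
filtered = (r for r in scanned if r["age"] > 30)   # Filter (age > 30)
projected = [{"name": r["name"], "age": r["age"]}  # Project [name, age]
             for r in filtered]

print(projected)  # [{'name': 'Ann', 'age': 34}]
```

Note how the filter runs before the projection touches any row, and unused columns like "city" are dropped — the pushdown and pruning decisions made during optimization show up here as operator order.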
Why is the AST Important?
Gives Spark and databases a structured representation of the query to analyze.
Lets the engine optimize performance before running the query.
Helps queries run efficiently on large datasets.