Day 12: UDF vs Pandas UDF

Welcome to Day 12 of the Spark Mastery Series!
Today we dissect a topic that has ruined the performance of countless ETL pipelines:

UDFs (User Defined Functions)

A UDF seems innocent, but adding a single one can slow your entire job by 10x.

Let’s understand why and how to avoid that with better alternatives.

🌟 1. What is a UDF?

A UDF (User Defined Function) is a custom Python function that you apply to a Spark DataFrame column.

Example:

from pyspark.sql.functions import udf

@udf("string")
def reverse_name(name):
    return name[::-1]

This works…
But it's slow, because for every row Spark must:

  • Serialize the record and ship it to a Python worker
  • Execute the Python code
  • Serialize the result and convert it back to the JVM
  • Merge the result back into the DataFrame

Every record crosses the JVM → Python boundary and back → slow.
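
Applying the UDF looks like any other column expression, which is exactly why the cost is easy to miss. A minimal usage sketch (df and a name column are assumed for illustration):

from pyspark.sql.functions import col

# every "name" value is shipped to a Python worker, reversed there, and sent back
df = df.withColumn("reversed_name", reverse_name(col("name")))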

🌟 2. Built-in Functions: ALWAYS Preferred

These are the functions Spark provides internally:

df.withColumn("upper_name", upper(col("name")))

Why they are the fastest:

  • Implemented natively in Scala on the JVM
  • Executed via whole-stage code generation (no Python boundary)
  • Optimized by the Catalyst optimizer
  • Support predicate pushdown
  • Support column pruning

Rule:

If Spark already provides a built-in function, never write a UDF for it.
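
In fact, the reverse_name UDF above is unnecessary: Spark already ships a built-in reverse function that does the same thing without ever leaving the JVM.

from pyspark.sql.functions import col, reverse

# same result as the reverse_name UDF, but stays in the JVM and is optimized by Catalyst
df = df.withColumn("reversed_name", reverse(col("name")))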

🌟 3. Pandas UDF: The Best Alternative to Normal UDFs

Regular UDF = row-by-row processing in Python.

Pandas UDF = uses Apache Arrow to move data in batches and operates on whole pandas Series at once → much faster.

Example:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def multiply_by_two(s: pd.Series) -> pd.Series:
    # receives one Arrow batch of values as a pandas Series
    return s * 2

Spark sends data in Arrow batches, not row by row → a huge speed improvement.
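
Using it looks exactly like a built-in function; a small sketch, assuming df has a numeric amount column (names are illustrative):

from pyspark.sql.functions import col

# each Arrow batch of "amount" values arrives as one pandas Series, not one row at a time
df = df.withColumn("amount_x2", multiply_by_two(col("amount")))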

🌟 4. Types of Pandas UDFs

🟢 Scalar Pandas UDF
Operates element-wise, like a built-in column function.

@pandas_udf("double")
def add_one(s: pd.Series) -> pd.Series:
    return s + 1

🔵 Grouped Map UDF
Operates on a full pandas DataFrame for each group. The decorator form below is the legacy (pre-Spark-3.0) API; since Spark 3.0 the recommended replacement is groupBy().applyInPandas(), sketched after the use cases below.

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)  # legacy style, deprecated since Spark 3.0

Example use cases:

  • Time-series transformation
  • Per-user model training
  • Per-group cleaning
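
A minimal grouped map sketch with groupBy().applyInPandas(), assuming df has only user_id and amount columns (illustrative names):

import pandas as pd

def normalize_amount(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf contains all rows of one user_id as an ordinary pandas DataFrame
    pdf["amount"] = (pdf["amount"] - pdf["amount"].mean()) / pdf["amount"].std()
    return pdf

# the output schema must match the DataFrame returned by normalize_amount
df.groupBy("user_id").applyInPandas(normalize_amount, schema="user_id long, amount double")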

🔴 Grouped Aggregate UDF
Reduces each group down to a single value (again shown in the legacy decorator form):

@pandas_udf("double", PandasUDFType.GROUPED_AGG)  # legacy style, deprecated since Spark 3.0

Good for:

  • statistical aggregation
  • ML metrics
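
With Spark 3 type hints, a grouped aggregate is simply a Series-to-scalar Pandas UDF; a sketch using the same illustrative user_id and amount columns:

import pandas as pd
from pyspark.sql.functions import col, pandas_udf

@pandas_udf("double")
def mean_amount(s: pd.Series) -> float:
    # receives every "amount" value of one group as a single pandas Series
    return float(s.mean())

df.groupBy("user_id").agg(mean_amount(col("amount")).alias("avg_amount"))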

🌟 5. When Should You Use a Normal UDF?

Only when:

  • There is no built-in function for the logic
  • The logic cannot be vectorized
  • The logic relies on complex custom Python code (e.g., an external library)

Such cases are very rare in ETL pipelines.

🌟 6. Real Example: Performance Difference

Using UDF:

Time: 50 seconds

Using Pandas UDF:

Time: 8 seconds

Using built-in function:

Time: 1 second

This is why senior engineers avoid UDFs entirely unless there is truly no alternative.
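
The exact numbers depend on your data size and cluster, but the comparison is easy to reproduce. A minimal timing sketch on an assumed df with a numeric amount column (multiply_by_two is the Pandas UDF from earlier; multiply_by_two_py is a plain Python UDF added here only as the slow baseline):

import time

from pyspark.sql.functions import col, udf

@udf("double")
def multiply_by_two_py(v):
    # row-at-a-time Python UDF, used only for comparison
    return None if v is None else v * 2.0

def timed(label, transformed):
    start = time.time()
    transformed.count()  # count() forces the transformation to actually run
    print(label, round(time.time() - start, 1), "seconds")

timed("python udf ", df.withColumn("x2", multiply_by_two_py(col("amount"))))
timed("pandas udf ", df.withColumn("x2", multiply_by_two(col("amount"))))
timed("built-in   ", df.withColumn("x2", col("amount") * 2))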

🌟 7. Summary Guidelines

✔ Use built-in functions whenever possible
✔ Use a Pandas UDF when the logic is vectorizable
✔ Use a normal UDF only rarely
✔ Avoid UDFs on large datasets
✔ Avoid UDFs inside join and filter conditions (they block predicate pushdown)
✔ Evaluate the execution plan using .explain() (see the sketch below)
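
For the last point, comparing plans makes the overhead visible. As a rough guide, a row-at-a-time Python UDF usually appears as a BatchEvalPython node in the physical plan and a Pandas UDF as ArrowEvalPython, while a built-in stays inside whole-stage codegen (node names assumed from typical Spark 3 output):

from pyspark.sql.functions import col, reverse

df.withColumn("r", reverse_name(col("name"))).explain()  # expect a BatchEvalPython step
df.withColumn("r", reverse(col("name"))).explain()       # no Python step at all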

🚀 Summary

We learned:

  • The difference between a UDF and a Pandas UDF
  • Why Python UDFs are slow
  • When to avoid UDFs
  • When a Pandas UDF is the better choice
  • Best practices for performance

Follow for more content in this series, and let me know in the comments if I missed anything. Thank you!!
