Welcome to Day 12 of the Spark Mastery Series!
Today we dissect a topic that has ruined the performance of countless ETL pipelines:
UDFs (User Defined Functions)
A UDF seems innocent, but adding a single one can slow a job down by as much as 10x.
Let's understand why, and how to avoid it with better alternatives.
1. What is a UDF?
In PySpark, a UDF (User Defined Function) is a Python function that you register with Spark and apply to the columns of a DataFrame.
Example:
from pyspark.sql.functions import udf

@udf("string")
def reverse_name(name):
    # Spark passes NULL values as None, so guard before slicing
    return name[::-1] if name is not None else None
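This works…
But it's slow, because Spark must:
- Serialize each record and ship it from the JVM to a Python worker process
- Execute the Python code once per row
- Serialize the result and convert it back to JVM types
- Merge it with the rest of the DataFrame
Every record crosses the Python ↔ JVM boundary twice → slow.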
2. Built-in Functions: ALWAYS Preferred
These are the functions Spark provides in pyspark.sql.functions:
from pyspark.sql.functions import col, upper
df.withColumn("upper_name", upper(col("name")))
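Why they are fastest:
- Implemented in Scala and executed inside the JVM, with no Python round trip
- Compiled into generated code together with the rest of the query stage
- Fully visible to the Catalyst optimizer, which can simplify and reorder them
- Do not block predicate pushdown (a Python UDF in a filter is opaque to the data source)
- Work cleanly with column pruning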
Rule:
If Spark has a built-in function → NEVER write a UDF.
3. Pandas UDF: The Best Alternative to Normal UDFs
Regular UDF = row-by-row in Python.
Pandas UDF = uses Apache Arrow for vectorized operations → much faster.
Example:
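import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def multiply_by_two(values: pd.Series) -> pd.Series:
    # Called once per Arrow batch; values is a whole pandas Series
    return values * 2
Spark sends data to Python in Arrow batches, not row-by-row → huge speed improvement.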
4. Types of Pandas UDFs
🟢 Scalar Pandas UDF
Used like a built-in function: it takes one or more columns and returns a column of the same length.
@pandas_udf("double")
def add_one(col):
    return col + 1
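🔵 Grouped Map UDF
Operates on a full pandas DataFrame for each group.
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
(This decorator form is deprecated since Spark 3.0; the current equivalent is df.groupBy(...).applyInPandas(func, schema).)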
Example use cases:
- Time-series transformation
- Per-user model training
- Per-group cleaning
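🔴 Grouped Aggregate UDF
Reduces each group to a single value.
@pandas_udf("double", PandasUDFType.GROUPED_AGG)
(Since Spark 3.0, Python type hints of the form pd.Series -> float are preferred over PandasUDFType.)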
Good for:
- statistical aggregation
- ML metrics
5. When Should You Use a Normal UDF?
Only when:
- No built-in function covers the logic
- The logic cannot be vectorized with pandas
- It relies on a lot of custom Python logic
This is very rare in ETL pipelines.
6. Real Example: Performance Difference
Using UDF:
Time: 50 seconds
Using Pandas UDF:
Time: 8 seconds
Using built-in function:
Time: 1 second
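Exact numbers depend on data volume, cluster size, and Spark version, but this ordering is typical. It is the reason senior engineers avoid UDFs unless they are genuinely needed.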
7. Summary Guidelines
✔ Use built-in functions whenever possible
✔ Use Pandas UDF when logic is vectorizable
✔ Use normal UDF rarely
❌ Avoid UDFs on large data
❌ Avoid using UDF inside joins or filters
✔ Evaluate execution plan using .explain()
Summary
We learned:
- Difference between UDF and Pandas UDF
- Why Python UDF is slow
- When to avoid UDFs
- When Pandas UDF is best
- Best practices for performance
Follow for more such content. Let me know in the comments if I missed anything. Thank you!!