Analyze the pandas UDF used in the PySpark code below.

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql import Window

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf("double")
def mean_udf(v: pd.Series) -> float:
    return v.mean()

df.select(mean_udf(df['v'])).show()
df.groupby("id").agg(mean_udf(df['v'])).show()

w = Window \
    .partitionBy('id') \
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn('mean_v', mean_udf(df['v']).over(w)).show()
Which of the following statements regarding this code are valid?
Options
a. This type of UDF does not support partial aggregation.
b. All data for a group or window will be loaded into memory by this code.
c. Only unbounded windows are supported by this code.
d. Both a and b.
e. All of these.
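Note

For context on options a and b: built-in aggregate functions, unlike the grouped-aggregate pandas UDF above, do support partial aggregation. A minimal sketch (reusing df from the question and assuming the same active SparkSession) that computes the same per-group mean with the built-in avg:

from pyspark.sql.functions import avg

# Built-in aggregates support partial aggregation: each partition first
# computes a partial result before the shuffle, so no executor ever has
# to hold an entire group in memory at once.
df.groupby("id").agg(avg(df['v']).alias('mean_v')).show()

# The pandas UDF above instead receives every value of a group as a single
# pd.Series, so all data for that group must fit in memory.

Regarding option c, the Spark 2.4-era documentation for this UDF type states that only unbounded windows are supported; later Spark releases added support for bounded window frames, so the statement depends on the Spark version in use.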
Skills Covered
- IT-Programming Languages/Frameworks
Assessing
- Fundamentals
Question Type
- MCQ