Eager and Lazy Execution

Eager execution means a statement runs immediately, as soon as the thread assigned to it gets its turn. Lazy execution means the actual execution is deferred until all logical phases of the query have been analyzed and any applicable optimizations have been applied.

By design, Apache Spark treats transformations as lazy and actions as eager.

Transformations are operations that act on a dataframe and result in another dataframe. They can be further subdivided into wide and narrow transformations. Wide transformations are those that need data from all the workers (more precisely, from many partitions) to deliver the final output, so the data has to be shuffled between workers. If you are coming from an SSIS background, you can relate them to blocking transformations. Examples are distinct() and groupBy() followed by an aggregate such as sum(). Narrow transformations, on the other hand, are those that can operate within the same worker: each worker's output can simply be plugged into the final output without any dependency on the output of another worker, e.g. filter(), coalesce() etc.
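To make the narrow versus wide distinction concrete, here is a minimal sketch in plain Python. It is only an analogy and does not use Spark: the partitions are ordinary lists, and the names narrow_filter and wide_distinct are made up for this illustration. The modulo routing is an assumed stand-in for the hash partitioning a real shuffle would perform.

```python
# Three "partitions", as if the data were spread over three workers.
partitions = [[1, 2, 2, 5], [2, 3, 5, 8], [1, 8, 9]]


def narrow_filter(parts, predicate):
    # Narrow: output partition i depends only on input partition i,
    # so every worker can proceed without talking to the others.
    return [[x for x in part if predicate(x)] for part in parts]


def wide_distinct(parts):
    # Wide: equal values may live in different partitions, so rows are
    # first redistributed (shuffled) so that equal values land together.
    num = len(parts)
    shuffled = [[] for _ in range(num)]
    for part in parts:
        for x in part:
            shuffled[x % num].append(x)  # stand-in for hash partitioning
    # After the shuffle, removing duplicates locally is enough.
    return [sorted(set(part)) for part in shuffled]


filtered = narrow_filter(partitions, lambda x: x > 2)
distinct = wide_distinct(partitions)

# Deduplicating each partition without a shuffle leaves duplicates overall.
local_only = [x for part in partitions for x in set(part)]
```

Here filter can be answered partition by partition, while distinct cannot: local_only still contains repeated values because no single partition sees all the rows.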

Most of the transformations that correspond to SQL built-in functions are found in Spark's sql.functions module (pyspark.sql.functions in PySpark).

Actions are operations that operate on data but don't produce another dataframe. They either return a result to the driver, e.g. count(), first(), head() etc., or write the output to some storage using the DataFrameWriter methods.
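The lazy and eager split can be sketched with a toy class in plain Python. This is an analogy, not Spark's implementation: LazyFrame and its internal plan are hypothetical names invented here, and Spark additionally optimizes the plan before running it, which this sketch does not attempt.

```python
class LazyFrame:
    """Toy stand-in for a dataframe: transformations are only recorded."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # recorded (kind, function) steps, not yet run

    # Transformations: return a new LazyFrame and touch no data.
    def filter(self, predicate):
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def map(self, fn):
        return LazyFrame(self._rows, self._plan + (("map", fn),))

    def _execute(self):
        # Chain the recorded steps over the source rows.
        rows = iter(self._rows)
        for kind, fn in self._plan:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return rows

    # Actions: trigger execution and return a plain value.
    def count(self):
        return sum(1 for _ in self._execute())

    def first(self):
        return next(self._execute(), None)


calls = []  # records every row the predicate actually inspects


def is_even(n):
    calls.append(n)
    return n % 2 == 0


frame = LazyFrame(range(10)).filter(is_even).map(lambda n: n * 10)
calls_before_action = len(calls)  # still 0: nothing has executed
even_count = frame.count()        # the action runs the recorded plan
calls_after_action = len(calls)   # the predicate has now seen every row
```

Building frame inspects no rows at all; only count() causes the predicate to be called. Deferring the work this way is what gives an engine the chance to look at the whole plan before executing it.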
