
Posted to user@spark.apache.org by brccosta <br...@gmail.com> on 2016/09/13 12:27:38 UTC

Spark SQL - Actions and Transformations

Dear all,

We're running some tests with cache and persist on Datasets. With RDDs, we
know that transformations are lazy and are executed only when an action
occurs. So, for example, we call .cache() on an RDD after an action, which
in turn is executed as the last operation of a sequence of transformations.

However, which operations are lazy in Datasets and DataFrames? For
example, consider the following code (fragment):

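from pyspark.ml import PipelineModel
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, RegexTokenizer, StopWordsRemover

(df_train, df_test) = df.randomSplit([0.8, 0.2])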

r_tokenizer = RegexTokenizer(inputCol="review", outputCol="words_all",
gaps=False, pattern="\\p{L}+")
df_words_all = r_tokenizer.transform(df_train)

remover = StopWordsRemover(inputCol="words_all", outputCol="words_filtered")
df_filtered = remover.transform(df_words_all)
df_filtered = df_filtered.drop('words_all')

hashingTF = HashingTF(inputCol="words_filtered", outputCol="features")
df_features = hashingTF.transform(df_filtered)
df_features = df_features.drop('words_filtered')

lr = LogisticRegression(maxIter=iteractions, regParam=0.01)
model1 = lr.fit(df_features)

evaluator = BinaryClassificationEvaluator()
pipelineModel_features = PipelineModel(stages=[r_tokenizer, remover,
                                               hashingTF])
df_test_features = pipelineModel_features.transform(df_test)
predictions = model1.transform(df_test_features)
eval_test = evaluator.evaluate(predictions)

Will all transformations of df_train and df_test occur only when the
operations fit() and evaluate() are executed?



