Posted to user@spark.apache.org by Saif Addin <sa...@gmail.com> on 2019/03/16 20:49:02 UTC

Spark ML on Python has short memory?

Hi,

We're working on Spark NLP by including multiple ML Estimators and
Transformers.

We're taking a significant performance hit on the Python side, because the
columns are being recalculated recursively (and more than once each) on
every stage.transform() call.
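
For reference, this is roughly the shape of the Python side; the model path
and toy input below are made up for illustration, the real pipeline is the
Spark NLP one described above:

-----------------
# Minimal sketch of the setup (hypothetical model path, toy data)
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("repro").getOrCreate()
data = spark.createDataFrame([("Some example text",)], ["text"])

# Load a fitted pipeline and transform; each stage should in principle
# run once per partition when an action is triggered.
pipeline = PipelineModel.load("/tmp/my_pipeline_model")  # hypothetical path
result_df = pipeline.transform(data)
result_df.show()
-----------------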

I have not been able to trace the root of the problem, since serialization
seems to happen on the JVM side, through the _jvm wrappers in PySpark ML.
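
If I read the PySpark wrapper code correctly (paraphrasing from memory, so
take it with a grain of salt), each Python-side transform() is just a thin
bridge into the JVM, which is why I suspect the recomputation happens there:

-----------------
# Paraphrased from pyspark.ml.wrapper.JavaTransformer in Spark 2.x
# (from memory -- not our code). `self._java_obj` is the Py4J handle
# to the Scala transformer; the actual work is delegated to the JVM,
# and a new Python DataFrame is wrapped around the JVM result.
def _transform(self, dataset):
    self._transfer_params_to_java()
    return DataFrame(self._java_obj.transform(dataset._jdf),
                     dataset.sql_ctx)
-----------------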

Printing a log line each time a stage is actually executed, and loading the
same *PipelineModel* in both Scala and Python, I get the following output
in Scala:

-----------------
scala> val result = pipeline.transform(data).cache()

annotating REGEX_TOKENIZER_b39e97328de5
annotating WORD_EMBEDDINGS_MODEL_82c6ed12d8f5
annotating NerDLModel_7b95d7750b3b
annotating NER_CONVERTER_f12a17e51b45
annotating DEEP SENTENCE DETECTOR_4a08f41f1d47
annotating LEMMATIZER_eff31d5f9d97
annotating STEMMER_552360206a2d
annotating POS_2b9b0142f847
annotating SPELL_7c55d8e48423

result: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text:
string, document:
array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>,sentence_embeddings:array<float>>>
... 9 more fields]
-----------------

while on Python, cache() does not just execute the cache() operation: it
also retraces multiple evaluations of the earlier columns, as if it had a
shorter memory:

-------------------
result_df.show()

[Stage 37:===================>                                      (1 + 2)
/ 3]Really annotating REGEX_TOKENIZER_b39e97328de5
annotating REGEX_TOKENIZER_b39e97328de5
annotating WORD_EMBEDDINGS_MODEL_82c6ed12d8f5
annotating REGEX_TOKENIZER_b39e97328de5
annotating REGEX_TOKENIZER_b39e97328de5
annotating WORD_EMBEDDINGS_MODEL_82c6ed12d8f5
annotating NerDLModel_7b95d7750b3b
[Stage 37:======================================>                   (2 + 1)
/ 3]Really annotating REGEX_TOKENIZER_b39e97328de5
annotating REGEX_TOKENIZER_b39e97328de5
annotating REGEX_TOKENIZER_b39e97328de5
annotating WORD_EMBEDDINGS_MODEL_82c6ed12d8f5
annotating NerDLModel_7b95d7750b3b
annotating NER_CONVERTER_f12a17e51b45
annotating REGEX_TOKENIZER_b39e97328de5
annotating REGEX_TOKENIZER_b39e97328de5
annotating REGEX_TOKENIZER_b39e97328de5
annotating REGEX_TOKENIZER_b39e97328de5
annotating WORD_EMBEDDINGS_MODEL_82c6ed12d8f5
annotating NerDLModel_7b95d7750b3b
annotating NER_CONVERTER_f12a17e51b45
annotating DEEP SENTENCE DETECTOR_4a08f41f1d47
annotating REGEX_TOKENIZER_b39e97328de5
annotating LEMMATIZER_eff31d5f9d97
annotating REGEX_TOKENIZER_b39e97328de5
annotating STEMMER_552360206a2d
annotating REGEX_TOKENIZER_b39e97328de5
annotating REGEX_TOKENIZER_b39e97328de5
annotating REGEX_TOKENIZER_b39e97328de5
annotating REGEX_TOKENIZER_b39e97328de5
annotating WORD_EMBEDDINGS_MODEL_82c6ed12d8f5
annotating NerDLModel_7b95d7750b3b
annotating NER_CONVERTER_f12a17e51b45
annotating DEEP SENTENCE DETECTOR_4a08f41f1d47
annotating REGEX_TOKENIZER_b39e97328de5
annotating POS_2b9b0142f847
annotating REGEX_TOKENIZER_b39e97328de5
annotating SPELL_7c55d8e48423
-------------------
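
For completeness, this is the kind of check I can think of to see whether
the cache is actually populated before show() (a sketch only; pipeline and
data as in the snippet above):

-----------------
result_df = pipeline.transform(data).cache()

# Force one full materialization so later actions should hit the cache.
result_df.count()

# Inspect the physical plan; after the count(), an InMemoryRelation
# should show up here if the cache is being used.
result_df.explain(True)

result_df.show()
-----------------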

If you have any insights into how we can help trace the problem, we would
greatly appreciate them!
I have tried various approaches, such as transforming step by step (roughly
as sketched below) or calling cache() on the input, but none of them seem
to have any impact.
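
For reference, the step-by-step variant looked roughly like this (a sketch;
pipeline and data as in the snippet above):

-----------------
# Transform stage by stage, caching and materializing after each one,
# hoping to stop upstream stages from being re-evaluated.
intermediate = data
for stage in pipeline.stages:
    intermediate = stage.transform(intermediate).cache()
    intermediate.count()  # force materialization before the next stage
result_df = intermediate
-----------------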

Best,
Saif