Posted to user@spark.apache.org by ma...@wunderlich.com on 2022/02/18 06:42:10 UTC

Encoders.STRING() causing performance problems in Java application


Hello,

I am working on optimising the performance of a Java ML/NLP application 
based on Spark / SparkNLP. For prediction, I am applying a trained model 
on a Spark dataset which consists of one column with only one row. The 
dataset is created like this:

     List<String> textList = Collections.singletonList(text);
     Dataset<Row> dataset = sparkSession
         .createDataset(textList, Encoders.STRING())
         .withColumnRenamed(COL_VALUE, COL_TEXT);

The predictions are created like this:

     PipelineModel fittedPipeline = pipeline.fit(dataset);

     Dataset<Row> prediction = fittedPipeline.transform(dataset);

We noticed that the performance isn't quite as good as expected. After 
profiling the application with VisualVM, I noticed that the problem is 
with org.apache.spark.sql.Encoders.STRING() in the creation of the 
dataset, which by itself takes up about 75% of the time for the whole 
prediction method call.

So, is there a simpler and more efficient way of creating the required 
dataset, consisting of one column and one String row?

Thanks a lot.

Cheers,

Martin

Re: Encoders.STRING() causing performance problems in Java application

Posted by ma...@wunderlich.com.


Addendum: I have tried to replace toLocalIterator() with a foreach() call 
on the dataset directly, but this hasn't improved the performance.

If the foreach() call is the issue, there probably isn't much that can be 
done to further improve things, other than perhaps trying to batch the 
prediction calls instead of running them line by line on the input file.

Cheers,

Martin

On 2022-02-18 09:41, martin@wunderlich.com wrote:

> I have been able to partially fix this issue by creating a static final 
> field (i.e. a constant) for Encoders.STRING(). This removes the 
> bottleneck associated with instantiating this Encoder. However, this 
> only moved the performance issue to these two methods:
> 
> org.apache.spark.sql.SparkSession.createDataset (in the code from my original message)
> 
> org.apache.spark.sql.Dataset.toLocalIterator()
> 
> (approx. 40% of execution time each)
> 
> The second one is called when extracting the prediction results from 
> the dataset:
> 
> Dataset<Row> datasetWithPredictions = predictor.predict(text);
> 
> Dataset<Row> tokensWithPredictions = datasetWithPredictions
>     .select(TOKEN_RESULT, TOKEN_BEGIN, TOKEN_END, PREDICTION_RESULT);
> 
> Iterator<Row> rowIt = tokensWithPredictions.toLocalIterator();
> 
> while (rowIt.hasNext()) {
>     Row row = rowIt.next();
>     [...] // do stuff here to convert the row
> }
> 
> Any ideas of how I might be able to further optimize this?
> 
> Cheers,
> 
> Martin

Re: Encoders.STRING() causing performance problems in Java application

Posted by ma...@wunderlich.com.


I have been able to partially fix this issue by creating a static final 
field (i.e. a constant) for Encoders.STRING(). This removes the 
bottleneck associated with instantiating this Encoder. However, this 
only moved the performance issue to these two methods:

org.apache.spark.sql.SparkSession.createDataset (in the code from my original message)

org.apache.spark.sql.Dataset.toLocalIterator()

(approx. 40% of execution time each)

The second one is called when extracting the prediction results from the 
dataset:

Dataset<Row> datasetWithPredictions = predictor.predict(text);

Dataset<Row> tokensWithPredictions = datasetWithPredictions
    .select(TOKEN_RESULT, TOKEN_BEGIN, TOKEN_END, PREDICTION_RESULT);

Iterator<Row> rowIt = tokensWithPredictions.toLocalIterator();

while (rowIt.hasNext()) {
    Row row = rowIt.next();
    [...] // do stuff here to convert the row
}

Any ideas of how I might be able to further optimize this?

Cheers,

Martin
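
For readers following along, the static final workaround described in this message amounts to obtaining a costly object once per class instead of once per call. The sketch below shows that pattern in plain Java, without Spark on the classpath. CostlyEncoder and the two encodeWith... methods are hypothetical stand-ins and not Spark API; in the real application the constant would presumably hold the result of Encoders.STRING().

```java
final class EncoderConstantSketch {

    /** Hypothetical stand-in for an object that is costly to obtain, such as an encoder. */
    static final class CostlyEncoder {
        /** Counts constructions so the two variants can be compared. */
        static int instancesCreated = 0;

        CostlyEncoder() {
            instancesCreated++; // the real construction cost would be paid here
        }

        String encode(String value) {
            return "[" + value + "]"; // placeholder for the actual encoding work
        }
    }

    /** Created once when the class is initialized and reused by every call. */
    private static final CostlyEncoder SHARED_ENCODER = new CostlyEncoder();

    /** Pays the construction cost on every single call. */
    static String encodeWithFreshEncoder(String text) {
        return new CostlyEncoder().encode(text);
    }

    /** Reuses the constant, so nothing is constructed per call. */
    static String encodeWithSharedEncoder(String text) {
        return SHARED_ENCODER.encode(text);
    }

    public static void main(String[] args) {
        int before = CostlyEncoder.instancesCreated;
        for (int i = 0; i < 1000; i++) {
            encodeWithSharedEncoder("text " + i);
        }
        System.out.println("created by shared variant: "
                + (CostlyEncoder.instancesCreated - before));

        before = CostlyEncoder.instancesCreated;
        for (int i = 0; i < 1000; i++) {
            encodeWithFreshEncoder("text " + i);
        }
        System.out.println("created by fresh variant: "
                + (CostlyEncoder.instancesCreated - before));
    }
}
```

This only removes a per-call construction cost. It does not touch the cost of launching a job for every single row, which is where the remaining time in createDataset and toLocalIterator is reported.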


Re: Encoders.STRING() causing performance problems in Java application

Posted by Sean Owen <sr...@gmail.com>.

Oh, yes of course. If you run an entire distributed Spark job for one row,
over and over, that's much slower. It would make much more sense to run the
whole data set at once - the point is parallelism here.
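
To make the point about per-job overhead concrete, here is a minimal plain-Java sketch. It assumes that every call to the model pays a fixed cost (for example planning and scheduling a job) on top of the per-row work. predictBatch is a hypothetical stand-in for a call such as fittedPipeline.transform(dataset), and the string length is only a placeholder for a prediction; nothing here is Spark API.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchingSketch {

    /** Counts how often the simulated fixed per-call overhead was paid. */
    static int jobLaunches = 0;

    /** Hypothetical stand-in for one model call on a whole dataset: one "job" per call. */
    static List<Integer> predictBatch(List<String> texts) {
        jobLaunches++; // fixed cost, paid once per call regardless of batch size
        List<Integer> results = new ArrayList<>();
        for (String text : texts) {
            results.add(text.length()); // placeholder for the per-row model work
        }
        return results;
    }

    /** One model call per input line, the pattern described in this thread. */
    static List<Integer> predictOneByOne(List<String> texts) {
        List<Integer> results = new ArrayList<>();
        for (String text : texts) {
            results.addAll(predictBatch(List.of(text)));
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("first line", "second line", "third line");

        jobLaunches = 0;
        List<Integer> perLineResults = predictOneByOne(lines);
        int perLineLaunches = jobLaunches;

        jobLaunches = 0;
        List<Integer> batchedResults = predictBatch(lines);
        int batchedLaunches = jobLaunches;

        System.out.println("same results: " + perLineResults.equals(batchedResults));
        System.out.println("launches per line: " + perLineLaunches
                + ", launches batched: " + batchedLaunches);
    }
}
```

Both variants return the same results, but the per-line variant pays the fixed cost once for every input line while the batched variant pays it once in total.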

On Mon, Feb 21, 2022 at 2:36 AM <ma...@wunderlich.com> wrote:

> Thanks a lot, Sean, for the comments. I realize I didn't provide enough
> background information to properly diagnose this issue.
>
> In the meantime, I have created some test cases to isolate the problem
> and run some specific performance tests. The numbers are quite
> revealing: running our Spark model on each String individually takes about
> 8 seconds for the test data, whereas it takes 88 ms when run on the entire
> data in a single Dataset. That is a factor of roughly 90x, and it gets even
> worse for larger datasets.
>
> So, the root cause here is the way the Spark model is being called for one
> string at a time by the self-built prediction pipeline (which also uses
> other ML techniques apart from Spark). This needs some refactoring...
>
> Thanks again for the help.
>
> Cheers,
>
> Martin
>
>
> On 2022-02-18 13:41, Sean Owen wrote:
>
> That doesn't make a lot of sense. Are you profiling the driver, rather
> than executors where the work occurs?
> Is your data set quite small such that small overheads look big?
> Do you even need Spark if your data is not distributed - coming from the
> driver anyway?
>
> The fact that a static final field did anything suggests something is
> amiss with your driver program. Are you perhaps inadvertently serializing
> your containing class with a bunch of other data by using its methods in a
> closure?
> If your data is small it's not surprising that the overhead could be in
> just copying the data around, the two methods you cite, rather than the
> compute.
> Too many things here to really say what's going on.

Re: Encoders.STRING() causing performance problems in Java application

Posted by ma...@wunderlich.com.


Thanks a lot, Sean, for the comments. I realize I didn't provide enough 
background information to properly diagnose this issue.

In the meantime, I have created some test cases to isolate the problem 
and run some specific performance tests. The numbers are quite 
revealing: running our Spark model on each String individually takes 
about 8 seconds for the test data, whereas it takes 88 ms when run on 
the entire data in a single Dataset. That is a factor of roughly 90x, 
and it gets even worse for larger datasets.

So, the root cause here is the way the Spark model is being called for 
one string at a time by the self-built prediction pipeline (which also 
uses other ML techniques apart from Spark). This needs some 
refactoring...

Thanks again for the help.

Cheers,

Martin

On 2022-02-18 13:41, Sean Owen wrote:

> That doesn't make a lot of sense. Are you profiling the driver, rather 
> than executors where the work occurs?
> Is your data set quite small such that small overheads look big?
> Do you even need Spark if your data is not distributed - coming from 
> the driver anyway?
> 
> The fact that a static final field did anything suggests something is 
> amiss with your driver program. Are you perhaps inadvertently 
> serializing your containing class with a bunch of other data by using 
> its methods in a closure?
> If your data is small it's not surprising that the overhead could be in 
> just copying the data around, the two methods you cite, rather than the 
> compute.
> Too many things here to really say what's going on.

Re: Encoders.STRING() causing performance problems in Java application

Posted by Sean Owen <sr...@gmail.com>.

That doesn't make a lot of sense. Are you profiling the driver, rather than
executors where the work occurs?
Is your data set quite small such that small overheads look big?
Do you even need Spark if your data is not distributed - coming from the
driver anyway?

The fact that a static final field did anything suggests something is amiss
with your driver program. Are you perhaps inadvertently serializing your
containing class with a bunch of other data by using its methods in a
closure?
If your data is small it's not surprising that the overhead could be in
just copying the data around, the two methods you cite, rather than the
compute.
Too many things here to really say what's going on.
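
The closure question can be illustrated without Spark. In the plain-Java sketch below, a lambda that calls an instance method captures `this`, so serializing the lambda also serializes the enclosing object with all of its fields, whereas a lambda that only calls a static method captures nothing. Predictor, its 1 MB field and SerializableFunction are hypothetical names chosen for illustration. Spark serializes the functions it ships to executors, so a similar effect may apply there, though whether it explains the numbers in this thread is not established.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;

final class ClosureCaptureSketch {

    /** A function type that can be serialized (hypothetical, for illustration only). */
    interface SerializableFunction<T, R> extends Function<T, R>, Serializable {}

    /** Hypothetical driver-side class holding a large field unrelated to the function. */
    static final class Predictor implements Serializable {
        private static final long serialVersionUID = 1L;

        private final byte[] unrelatedState = new byte[1_000_000];

        private String normalize(String s) {
            return s.trim().toLowerCase();
        }

        static String normalizeStatic(String s) {
            return s.trim().toLowerCase();
        }

        /** Calls an instance method, so the lambda captures `this`. */
        SerializableFunction<String, String> capturingThis() {
            return s -> normalize(s);
        }

        /** Calls only a static method, so the lambda captures nothing. */
        SerializableFunction<String, String> capturingNothing() {
            return s -> normalizeStatic(s);
        }
    }

    /** Serializes the object with Java serialization and returns the byte count. */
    static int serializedSize(Object o) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        Predictor predictor = new Predictor();
        System.out.println("lambda capturing this: "
                + serializedSize(predictor.capturingThis()) + " bytes");
        System.out.println("lambda capturing nothing: "
                + serializedSize(predictor.capturingNothing()) + " bytes");
    }
}
```

The first lambda drags the 1 MB array along with it, while the second serializes to a small fraction of that. Moving the logic into a static method, or copying the needed values into local variables before building the function, are two common ways to avoid the capture.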
