
Posted to user@spark.apache.org by gen tang <ge...@gmail.com> on 2015/08/18 16:39:44 UTC

Spark works with the data in another cluster(Elasticsearch)

Hi,

Currently, my data is in an Elasticsearch cluster, and I am trying to use
Spark to analyse it.
The Elasticsearch cluster and the Spark cluster are two separate
clusters, and I use the Hadoop input format (es-hadoop) to read the data from ES.
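
For readers unfamiliar with this setup, a read through the es-hadoop input format might look roughly like the PySpark sketch below. It assumes the es-hadoop jar is on the Spark classpath; the helper names, the host name, and the index/type are hypothetical placeholders, not details from this thread.

```python
def es_read_conf(nodes, resource, port=9200, query=None):
    """Build the es-hadoop settings for a read job (all values are strings)."""
    conf = {
        "es.nodes": nodes,        # address of the remote ES cluster
        "es.port": str(port),
        "es.resource": resource,  # what to read, e.g. "index/type"
    }
    if query is not None:
        # Optional: let ES filter documents before they cross the network.
        conf["es.query"] = query
    return conf


def read_es_rdd(sc, conf):
    """Read from ES as an RDD of (key, document) pairs via the Hadoop input format."""
    return sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=conf,
    )


# Hypothetical usage, given an existing SparkContext `sc`:
# es_rdd = read_es_rdd(sc, es_read_conf("es-host-1", "logs/event"))
```

Setting `es.query` is one way to reduce how much data has to travel between the two clusters, since the filtering then happens on the ES side.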

I am wondering how this setup affects the speed of the analysis.
If I understand correctly, Spark will read the data from the ES cluster and do
the computation on its own cluster (including writing shuffle results to its own
machines). Is this right? If so, I think the performance will be just a
little slower than with the data stored on the same cluster.

I would appreciate it if someone could share his/her experience of using
Spark with Elasticsearch.

Thanks a lot in advance for your help.

Cheers
Gen

Re: Spark works with the data in another cluster(Elasticsearch)

Posted by gen tang <ge...@gmail.com>.

Great advice.
Thanks a lot Nick.

In fact, if we call rdd.persist(DISK) at the beginning of the
program to avoid hitting the network again and again, the speed is not
affected much. In my case, the job took just 1 min longer than when
the data was in local HDFS.
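
As a rough illustration of the persist step, here is a minimal PySpark sketch. It assumes that "DISK" above refers to the `StorageLevel.DISK_ONLY` level; the helper name `persist_to_disk` is hypothetical.

```python
def persist_to_disk(rdd):
    """Mark an RDD to be kept on the Spark executors' local disks."""
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark import StorageLevel

    # persist() is lazy: the data is pulled from Elasticsearch by the first
    # action, and later actions can read the local copy instead.
    # DISK_ONLY is an assumption about the storage level meant in the post.
    return rdd.persist(StorageLevel.DISK_ONLY)


# Hypothetical usage, with `es_rdd` read from Elasticsearch beforehand:
# es_rdd = persist_to_disk(es_rdd)
# es_rdd.count()  # first action: reads from ES and fills the disk cache
```

Disk storage is a reasonable choice when the data set is larger than the memory available for caching on the Spark cluster.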

Cheers
Gen

On Tue, Aug 25, 2015 at 6:26 PM, Nick Pentreath <ni...@gmail.com>
wrote:

> While it's true locality might speed things up, I'd say it's a very bad
> idea to mix your Spark and ES clusters - if your ES cluster is serving
> production queries (and in particular using aggregations), you'll run into
> performance issues on your production ES cluster.
>
> ES-hadoop uses ES scan & scroll to pull data pretty efficiently, so
> pulling it across the network is not too bad. If you do need to avoid that,
> pull the data and write what you need to HDFS as, say, Parquet files (e.g. pull
> data daily and write it, so that all the data is available on your Spark
> cluster).
>
> And of course ensure that when you do pull data from ES to Spark, you
> cache it to avoid hitting the network again.

Re: Spark works with the data in another cluster(Elasticsearch)

Posted by Nick Pentreath <ni...@gmail.com>.

While it's true locality might speed things up, I'd say it's a very bad idea to mix your Spark and ES clusters - if your ES cluster is serving production queries (and in particular using aggregations), you'll run into performance issues on your production ES cluster.

ES-hadoop uses ES scan & scroll to pull data pretty efficiently, so pulling it across the network is not too bad. If you do need to avoid that, pull the data and write what you need to HDFS as, say, Parquet files (e.g. pull data daily and write it, so that all the data is available on your Spark cluster).

And of course ensure that when you do pull data from ES to Spark, you cache it to avoid hitting the network again.
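
The daily pull to Parquet could be sketched as follows in PySpark. This is only an outline under assumptions: it uses the es-hadoop Spark SQL data source (`org.elasticsearch.spark.sql`), `spark` stands for a SparkSession (or an SQLContext on older versions), and the function names, the `dt=` directory layout, and the paths are hypothetical.

```python
from datetime import date


def snapshot_path(base_dir, day):
    """Return the HDFS directory for one day's snapshot, e.g. <base>/dt=2015-08-25."""
    return "{}/dt={}".format(base_dir.rstrip("/"), day.isoformat())


def snapshot_es_to_parquet(spark, es_nodes, es_resource, base_dir, day, es_query=None):
    """Pull data from ES once and write it to HDFS as Parquet files."""
    reader = spark.read.format("org.elasticsearch.spark.sql").option("es.nodes", es_nodes)
    if es_query is not None:
        # Optional: restrict the pull (for example to one day's documents).
        reader = reader.option("es.query", es_query)
    df = reader.load(es_resource)

    path = snapshot_path(base_dir, day)
    # Overwrite so that re-running the job for the same day is safe.
    df.write.mode("overwrite").parquet(path)
    return path


# Hypothetical usage:
# snapshot_es_to_parquet(spark, "es-host-1", "logs/event",
#                        "hdfs:///data/es_snapshots/logs", date(2015, 8, 25))
```

Later jobs can then read the Parquet files from the Spark cluster's own HDFS without touching the ES cluster at all.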

On Tue, Aug 25, 2015 at 12:01 PM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> If the data is local to the machine, it will obviously be faster than
> pulling it over the network and storing it locally (in memory, on disk,
> etc.). Have a look at the notes on data locality:
> <http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html>
> Thanks
> Best Regards

Re: Spark works with the data in another cluster(Elasticsearch)

Posted by Akhil Das <ak...@sigmoidanalytics.com>.

If the data is local to the machine, it will obviously be faster than
pulling it over the network and storing it locally (in memory, on disk,
etc.). Have a look at the notes on data locality:
<http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html>

Thanks
Best Regards
