Posted to user@geode.apache.org by aashish choudhary <aa...@gmail.com> on 2019/06/24 17:03:46 UTC

Spark geode best practices

Hi,

We have been experiencing issues while connecting to Geode from Spark using
the putAll API. The issue is specific to one particular Spark job that loads
data into a replicated region. The exception we see on the server side is
that the default connection limit of 800 gets maxed out, and on the client
side we see a retry attempt against each server that also fails; yet when we
re-run the same job it completes without any issue.

The problem I can see in the code is that we connect to Geode by creating a
client cache inside forEachPartition, which I think could be the issue: for
each partition we open a new connection to Geode. In the stats file we can
see connections timing out, and there are also thread bursts, sometimes
exceeding 4000.

What is the recommended way to connect to Geode from Spark?

This one specific job, which writes to a replicated region, fails most of
the time. When we change the region type to partitioned, the job completes.
We have enabled disk persistence for both types of region.

Thoughts?



With best regards,
Ashish
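
For reference, a minimal sketch (not the poster's actual code; the locator
address, port, and region name are illustrative) of the alternative pattern:
create the Geode ClientCache once per executor JVM and reuse it across
partitions, flushing putAll in small batches, instead of building a new
connection pool inside every forEachPartition call.

import java.util.HashMap;
import java.util.Map;

import org.apache.geode.cache.Region;
import org.apache.geode.cache.client.ClientCache;
import org.apache.geode.cache.client.ClientCacheFactory;
import org.apache.geode.cache.client.ClientRegionShortcut;
import org.apache.spark.api.java.JavaPairRDD;

import scala.Tuple2;

public class GeodeWriter {

  // Created lazily once per executor JVM and shared by every partition/task.
  private static volatile Region<String, String> region;

  private static Region<String, String> getRegion() {
    if (region == null) {
      synchronized (GeodeWriter.class) {
        if (region == null) {
          ClientCache cache = new ClientCacheFactory()
              .addPoolLocator("locator-host", 10334)  // illustrative locator
              .setPoolReadTimeout(60000)              // allow time for putAll batches
              .create();
          region = cache
              .<String, String>createClientRegionFactory(ClientRegionShortcut.PROXY)
              .create("exampleRegion");               // illustrative region name
        }
      }
    }
    return region;
  }

  public static void write(JavaPairRDD<String, String> rdd) {
    rdd.foreachPartition(rows -> {
      Region<String, String> r = getRegion();
      Map<String, String> batch = new HashMap<>();
      while (rows.hasNext()) {
        Tuple2<String, String> t = rows.next();
        batch.put(t._1(), t._2());
        if (batch.size() == 1000) {  // flush small batches instead of one huge map
          r.putAll(batch);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        r.putAll(batch);
      }
    });
  }
}

With a single shared pool per executor JVM, the number of client connections
is bounded by the pool settings rather than by the number of partitions or
tasks.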

Re: Spark geode best practices

Posted by Jason Huynh <jh...@pivotal.io>.
Hi Ashish,

Do you have custom code that connects Spark to Geode?  I know there was a
geode-spark connector at one point and that it was forked:
https://github.com/Pivotal-Field-Engineering/geode-spark-connector (but
it looks like it hasn't been updated in a while). Just curious whether there
is some code we could look at.

On Mon, Jun 24, 2019 at 11:53 AM Anilkumar Gingade <ag...@pivotal.io>
wrote:

> Hi Ashish,
>
> How many threads at a time executing putAll jobs in a single client (spark
> job?)...
> Do you see read timeout exception in client logs...If so, can you try
> increasing the read timeout value. Or reducing the putAll size.
>
> In case of PutAll for partitioned region; the putAll (entries) size is
> broken down and sent to respective servers based on its data affinity; the
> reason its working with partitioned region.
>
> You can find more detail on how client-server connection works at:
>
> https://geode.apache.org/docs/guide/14/topologies_and_comm/topology_concepts/how_the_pool_manages_connections.html
>
> -Anil.
>
>
>
>
>
>
>
> On Mon, Jun 24, 2019 at 10:04 AM aashish choudhary <
> aashish.choudhary1@gmail.com> wrote:
>
>> Hi,
>>
>> We have been experiencing issues while connect to geode using putAll API
>> with spark. Issue is specific to one particular spark job which tries to
>> load data to a replicated region. Exception we see in the server side is
>> that default limit of 800 gets maxed out and on client side we see retry
>> attempt to each server but gets failed even though when we re ran the same
>> job it gets completed without any issue.
>>
>> In the code problem I could see is that we are connecting to geode using
>> client cache in forEachPartition which I think could be the issue. So for
>> each partition we are making a connection to geode. In stats file we could
>> see that connections getting timeout and there is thread burst also
>> sometimes >4000.
>>
>> What is the recommended way to connect to geode using spark?
>>
>> But this one specific job which gets failed most of the times and is a
>> replicated region. Also when we change the type of region to partitioned
>> then job gets completed. We have enabled disk persistence for both type of
>> regions.
>>
>> Thoughts?
>>
>>
>>
>> With best regards,
>> Ashish
>>
>

Re: Spark geode best practices

Posted by Xiaojian Zhou <gz...@pivotal.io>.
Then try the following two things instead:
1) Change your replicated region to a partitioned region (it will
automatically split your putAll into 113 smaller putAlls, one per bucket).
2) Increase your max connections to 5000 (a sketch of the server-side
setting follows below).
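
For the second suggestion, a sketch of where the 800-connection limit lives,
assuming a cache server started programmatically (a gfsh- or cache.xml-managed
server exposes the same max-connections setting; the locator address is
illustrative):

import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.server.CacheServer;

public class ServerWithMoreConnections {
  public static void main(String[] args) throws Exception {
    Cache cache = new CacheFactory()
        .set("locators", "locator-host[10334]")  // illustrative locator
        .create();

    CacheServer server = cache.addCacheServer();
    server.setPort(40404);
    server.setMaxConnections(5000);  // default client connection limit is 800
    server.start();
  }
}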


On Wed, Jun 26, 2019 at 12:15 PM aashish choudhary <
aashish.choudhary1@gmail.com> wrote:

> So are you saying that we should put in batches of 1~10k. But that I tried
> already atleast for 10k and it was failing with default readtimeout.
> Additionally it takes forever to put all 600k records into that region in
> batchmode.
>
> With best regards,
> Ashish
>
> On Wed, Jun 26, 2019, 11:38 PM Xiaojian Zhou <gz...@pivotal.io> wrote:
>
>> You can increase max connection size from default 800 to 5000. We did
>> that long time ago for customer.
>>
>> I noticed that your servers are using "replicated" region. In that case,
>> then the singlehop will not take effect. That's fine.
>>
>> If putAll map is too big, then it will hit read timeout issue, because it
>> will take longer time to process bigger map.
>> 600k in one map is too big. According to my test, 1k to 10k is the
>> comfortable size. Since increasing read timeout workaround your issue. I
>> feel size too big is probably the real root cause.
>>
>> So my suggestions:
>> 1) try to reduce your putAll map to 1k ~ 10K
>> 2) If still not working, increase max connection size from 800 to 5000.
>>
>> Regards
>> Gester Zhou
>>
>>
>>
>> On Wed, Jun 26, 2019 at 10:46 AM Charlie Black <cb...@pivotal.io> wrote:
>>
>>> Try batches that are small as a starting point - say 100.
>>>
>>> On Wed, Jun 26, 2019 at 10:33 AM aashish choudhary <
>>> aashish.choudhary1@gmail.com> wrote:
>>>
>>>> Yes we see exceeded max-connections error on server side.
>>>>
>>>> So I was trying to see how the putAll API works in general and from a
>>>> standard java client I was trying to simulate the behaviour that we see on
>>>> our server.
>>>> I tried to put 600k records using putAll on my local machine with 1
>>>> locator and 2 servers. Region type is replicate persistent and I could see
>>>> that local clientCache API getting crashed with some "pool unexpected"
>>>> error. We do see this error on our spark code as well. It then do a retry
>>>> and gets failed. However surprisingly data gets inserted in the region even
>>>> though clientCache java API was crashed.
>>>>
>>>> I tried to run it through in some batches but those also got failed and
>>>> it's too slow.
>>>>
>>>> Only way I was able to make it work by is increasing readtimeout to 60
>>>> seconds.
>>>>
>>>> Can someone share some tips on putAll API?
>>>> How to use it effectively?
>>>>
>>>>
>>>> With best regards,
>>>> Ashish
>>>>
>>>> On Wed, Jun 26, 2019, 6:20 AM Anilkumar Gingade <ag...@pivotal.io>
>>>> wrote:
>>>>
>>>>> Ashish,
>>>>>
>>>>> Do you see "exceeded max-connections" error...
>>>>>
>>>>> Operation/Job getting completed second time indicates, the server
>>>>> where the operation is executed first time may have issues, you may want to
>>>>> see the load on that server and if there are any memory issues.
>>>>>
>>>>> >>What is the recommended way to connect to geode using spark?
>>>>> Its more of how the geode is used in this context; is the spark
>>>>> processors are acting as geode's client or peer node. If its geode client,
>>>>> then its more about tuning client connections based on how/what operations
>>>>> are performed.
>>>>>
>>>>>  Anil
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jun 25, 2019 at 10:54 AM aashish choudhary <
>>>>> aashish.choudhary1@gmail.com> wrote:
>>>>>
>>>>>> We could also see below on server side logs as well.
>>>>>>
>>>>>> Rejected connection from Server connection from
>>>>>> >> [client host address=x.yx.x.x; client port=abc] because incoming
>>>>>> >> request was rejected by pool possibly due to thread exhaustion
>>>>>> >>
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 25, 2019, 7:27 AM aashish choudhary <
>>>>>> aashish.choudhary1@gmail.com> wrote:
>>>>>>
>>>>>>> As I mentioned earlier threads count could go to 4000 and we have
>>>>>>> seen readtimeout crossing default 10 seconds. We tried to increase read
>>>>>>> timeout to 30 seconds but that didn't work either. Record count is not more
>>>>>>> than 600k.
>>>>>>>
>>>>>>> Job gets successful in second attempt without changing anything
>>>>>>> which is bit weird.
>>>>>>>
>>>>>>> With best regards,
>>>>>>> Ashish
>>>>>>>
>>>>>>> On Tue, Jun 25, 2019, 12:23 AM Anilkumar Gingade <
>>>>>>> agingade@pivotal.io> wrote:
>>>>>>>
>>>>>>>> Hi Ashish,
>>>>>>>>
>>>>>>>> How many threads at a time executing putAll jobs in a single client
>>>>>>>> (spark job?)...
>>>>>>>> Do you see read timeout exception in client logs...If so, can you
>>>>>>>> try increasing the read timeout value. Or reducing the putAll size.
>>>>>>>>
>>>>>>>> In case of PutAll for partitioned region; the putAll (entries) size
>>>>>>>> is broken down and sent to respective servers based on its data affinity;
>>>>>>>> the reason its working with partitioned region.
>>>>>>>>
>>>>>>>> You can find more detail on how client-server connection works at:
>>>>>>>>
>>>>>>>> https://geode.apache.org/docs/guide/14/topologies_and_comm/topology_concepts/how_the_pool_manages_connections.html
>>>>>>>>
>>>>>>>> -Anil.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jun 24, 2019 at 10:04 AM aashish choudhary <
>>>>>>>> aashish.choudhary1@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> We have been experiencing issues while connect to geode using
>>>>>>>>> putAll API with spark. Issue is specific to one particular spark job which
>>>>>>>>> tries to load data to a replicated region. Exception we see in the server
>>>>>>>>> side is that default limit of 800 gets maxed out and on client side we see
>>>>>>>>> retry attempt to each server but gets failed even though when we re ran the
>>>>>>>>> same job it gets completed without any issue.
>>>>>>>>>
>>>>>>>>> In the code problem I could see is that we are connecting to geode
>>>>>>>>> using client cache in forEachPartition which I think could be the issue. So
>>>>>>>>> for each partition we are making a connection to geode. In stats file we
>>>>>>>>> could see that connections getting timeout and there is thread burst also
>>>>>>>>> sometimes >4000.
>>>>>>>>>
>>>>>>>>> What is the recommended way to connect to geode using spark?
>>>>>>>>>
>>>>>>>>> But this one specific job which gets failed most of the times and
>>>>>>>>> is a replicated region. Also when we change the type of region to
>>>>>>>>> partitioned then job gets completed. We have enabled disk persistence for
>>>>>>>>> both type of regions.
>>>>>>>>>
>>>>>>>>> Thoughts?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> With best regards,
>>>>>>>>> Ashish
>>>>>>>>>
>>>>>>>>
>>>
>>> --
>>> Charlie Black | cblack@pivotal.io
>>>
>>

Re: Spark geode best practices

Posted by Anthony Baker <ab...@pivotal.io>.
I typically recommend small batch sizes (~1000 keys) with multiple threads (10-100 depending on resources).

Have you checked whether you’ve saturated your network bandwidth?  If so, none of these ideas will help.


Anthony
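
A sketch of that approach (illustrative only; error handling and retries are
omitted): split the data into ~1000-entry maps and issue the putAll calls
from a bounded thread pool, for example load(region, data, 1000, 20).

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.geode.cache.Region;

public final class ParallelPutAll {

  // Write a large map as many small putAll calls spread over a bounded thread pool.
  public static <K, V> void load(Region<K, V> region, Map<K, V> data,
                                 int batchSize, int threads) throws InterruptedException {
    List<Map<K, V>> batches = new ArrayList<>();
    Map<K, V> current = new HashMap<>(batchSize);
    for (Map.Entry<K, V> e : data.entrySet()) {
      current.put(e.getKey(), e.getValue());
      if (current.size() == batchSize) {
        batches.add(current);
        current = new HashMap<>(batchSize);
      }
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }

    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (Map<K, V> batch : batches) {
      pool.submit(() -> region.putAll(batch));
    }
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.MINUTES);
  }
}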


> On Jun 26, 2019, at 12:15 PM, aashish choudhary <aa...@gmail.com> wrote:
> 
> So are you saying that we should put in batches of 1~10k. But that I tried already atleast for 10k and it was failing with default readtimeout. Additionally it takes forever to put all 600k records into that region in batchmode.
> 
> With best regards,
> Ashish
> 
> On Wed, Jun 26, 2019, 11:38 PM Xiaojian Zhou <gzhou@pivotal.io> wrote:
> You can increase max connection size from default 800 to 5000. We did that long time ago for customer. 
> 
> I noticed that your servers are using "replicated" region. In that case, then the singlehop will not take effect. That's fine. 
> 
> If putAll map is too big, then it will hit read timeout issue, because it will take longer time to process bigger map. 
> 600k in one map is too big. According to my test, 1k to 10k is the comfortable size. Since increasing read timeout workaround your issue. I feel size too big is probably the real root cause. 
> 
> So my suggestions:
> 1) try to reduce your putAll map to 1k ~ 10K
> 2) If still not working, increase max connection size from 800 to 5000. 
> 
> Regards
> Gester Zhou
> 
> 
> 
> On Wed, Jun 26, 2019 at 10:46 AM Charlie Black <cblack@pivotal.io> wrote:
> Try batches that are small as a starting point - say 100.   
> 
> On Wed, Jun 26, 2019 at 10:33 AM aashish choudhary <aashish.choudhary1@gmail.com> wrote:
> Yes we see exceeded max-connections error on server side.
> 
> So I was trying to see how the putAll API works in general and from a standard java client I was trying to simulate the behaviour that we see on our server.
> I tried to put 600k records using putAll on my local machine with 1 locator and 2 servers. Region type is replicate persistent and I could see that local clientCache API getting crashed with some "pool unexpected" error. We do see this error on our spark code as well. It then do a retry and gets failed. However surprisingly data gets inserted in the region even though clientCache java API was crashed. 
> 
> I tried to run it through in some batches but those also got failed and it's too slow.
> 
> Only way I was able to make it work by is increasing readtimeout to 60 seconds.
> 
> Can someone share some tips on putAll API?
> How to use it effectively?
> 
> 
> With best regards,
> Ashish
> 
> On Wed, Jun 26, 2019, 6:20 AM Anilkumar Gingade <agingade@pivotal.io> wrote:
> Ashish,
> 
> Do you see "exceeded max-connections" error...
> 
> Operation/Job getting completed second time indicates, the server where the operation is executed first time may have issues, you may want to see the load on that server and if there are any memory issues.
> 
> >>What is the recommended way to connect to geode using spark?
> Its more of how the geode is used in this context; is the spark processors are acting as geode's client or peer node. If its geode client, then its more about tuning client connections based on how/what operations are performed.
> 
>  Anil
> 
> 
> 
> 
> On Tue, Jun 25, 2019 at 10:54 AM aashish choudhary <aashish.choudhary1@gmail.com> wrote:
> We could also see below on server side logs as well. 
> Rejected connection from Server connection from
> >> [client host address=x.yx.x.x; client port=abc] because incoming
> >> request was rejected by pool possibly due to thread exhaustion
> >>
> 
> On Tue, Jun 25, 2019, 7:27 AM aashish choudhary <aashish.choudhary1@gmail.com> wrote:
> As I mentioned earlier threads count could go to 4000 and we have seen readtimeout crossing default 10 seconds. We tried to increase read timeout to 30 seconds but that didn't work either. Record count is not more than 600k.
> 
> Job gets successful in second attempt without changing anything which is bit weird.
> 
> With best regards,
> Ashish
> 
> On Tue, Jun 25, 2019, 12:23 AM Anilkumar Gingade <agingade@pivotal.io> wrote:
> Hi Ashish,
> 
> How many threads at a time executing putAll jobs in a single client (spark job?)...
> Do you see read timeout exception in client logs...If so, can you try increasing the read timeout value. Or reducing the putAll size.
> 
> In case of PutAll for partitioned region; the putAll (entries) size is broken down and sent to respective servers based on its data affinity; the reason its working with partitioned region.
> 
> You can find more detail on how client-server connection works at:
> https://geode.apache.org/docs/guide/14/topologies_and_comm/topology_concepts/how_the_pool_manages_connections.html
> 
> -Anil.
> 
> 
> 
> 
> 
> 
> 
> On Mon, Jun 24, 2019 at 10:04 AM aashish choudhary <aashish.choudhary1@gmail.com> wrote:
> Hi,
> 
> We have been experiencing issues while connect to geode using putAll API with spark. Issue is specific to one particular spark job which tries to load data to a replicated region. Exception we see in the server side is that default limit of 800 gets maxed out and on client side we see retry attempt to each server but gets failed even though when we re ran the same job it gets completed without any issue.
> 
> In the code problem I could see is that we are connecting to geode using client cache in forEachPartition which I think could be the issue. So for each partition we are making a connection to geode. In stats file we could see that connections getting timeout and there is thread burst also sometimes >4000.
> 
> What is the recommended way to connect to geode using spark?
> 
> But this one specific job which gets failed most of the times and is a replicated region. Also when we change the type of region to partitioned then job gets completed. We have enabled disk persistence for both type of regions.
> 
> Thoughts?
> 
> 
> 
> With best regards,
> Ashish
> 
> 
> -- 
> Charlie Black | cblack@pivotal.io

Re: Spark geode best practices

Posted by aashish choudhary <aa...@gmail.com>.
So are you saying that we should put in batches of 1k~10k? I already tried
that, at least with 10k, and it was failing with the default read timeout.
Additionally, it takes forever to put all 600k records into that region in
batch mode.

With best regards,
Ashish

On Wed, Jun 26, 2019, 11:38 PM Xiaojian Zhou <gz...@pivotal.io> wrote:

> You can increase max connection size from default 800 to 5000. We did that
> long time ago for customer.
>
> I noticed that your servers are using "replicated" region. In that case,
> then the singlehop will not take effect. That's fine.
>
> If putAll map is too big, then it will hit read timeout issue, because it
> will take longer time to process bigger map.
> 600k in one map is too big. According to my test, 1k to 10k is the
> comfortable size. Since increasing read timeout workaround your issue. I
> feel size too big is probably the real root cause.
>
> So my suggestions:
> 1) try to reduce your putAll map to 1k ~ 10K
> 2) If still not working, increase max connection size from 800 to 5000.
>
> Regards
> Gester Zhou
>
>
>
> On Wed, Jun 26, 2019 at 10:46 AM Charlie Black <cb...@pivotal.io> wrote:
>
>> Try batches that are small as a starting point - say 100.
>>
>> On Wed, Jun 26, 2019 at 10:33 AM aashish choudhary <
>> aashish.choudhary1@gmail.com> wrote:
>>
>>> Yes we see exceeded max-connections error on server side.
>>>
>>> So I was trying to see how the putAll API works in general and from a
>>> standard java client I was trying to simulate the behaviour that we see on
>>> our server.
>>> I tried to put 600k records using putAll on my local machine with 1
>>> locator and 2 servers. Region type is replicate persistent and I could see
>>> that local clientCache API getting crashed with some "pool unexpected"
>>> error. We do see this error on our spark code as well. It then do a retry
>>> and gets failed. However surprisingly data gets inserted in the region even
>>> though clientCache java API was crashed.
>>>
>>> I tried to run it through in some batches but those also got failed and
>>> it's too slow.
>>>
>>> Only way I was able to make it work by is increasing readtimeout to 60
>>> seconds.
>>>
>>> Can someone share some tips on putAll API?
>>> How to use it effectively?
>>>
>>>
>>> With best regards,
>>> Ashish
>>>
>>> On Wed, Jun 26, 2019, 6:20 AM Anilkumar Gingade <ag...@pivotal.io>
>>> wrote:
>>>
>>>> Ashish,
>>>>
>>>> Do you see "exceeded max-connections" error...
>>>>
>>>> Operation/Job getting completed second time indicates, the server where
>>>> the operation is executed first time may have issues, you may want to see
>>>> the load on that server and if there are any memory issues.
>>>>
>>>> >>What is the recommended way to connect to geode using spark?
>>>> Its more of how the geode is used in this context; is the spark
>>>> processors are acting as geode's client or peer node. If its geode client,
>>>> then its more about tuning client connections based on how/what operations
>>>> are performed.
>>>>
>>>>  Anil
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Jun 25, 2019 at 10:54 AM aashish choudhary <
>>>> aashish.choudhary1@gmail.com> wrote:
>>>>
>>>>> We could also see below on server side logs as well.
>>>>>
>>>>> Rejected connection from Server connection from
>>>>> >> [client host address=x.yx.x.x; client port=abc] because incoming
>>>>> >> request was rejected by pool possibly due to thread exhaustion
>>>>> >>
>>>>>
>>>>>
>>>>> On Tue, Jun 25, 2019, 7:27 AM aashish choudhary <
>>>>> aashish.choudhary1@gmail.com> wrote:
>>>>>
>>>>>> As I mentioned earlier threads count could go to 4000 and we have
>>>>>> seen readtimeout crossing default 10 seconds. We tried to increase read
>>>>>> timeout to 30 seconds but that didn't work either. Record count is not more
>>>>>> than 600k.
>>>>>>
>>>>>> Job gets successful in second attempt without changing anything which
>>>>>> is bit weird.
>>>>>>
>>>>>> With best regards,
>>>>>> Ashish
>>>>>>
>>>>>> On Tue, Jun 25, 2019, 12:23 AM Anilkumar Gingade <ag...@pivotal.io>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ashish,
>>>>>>>
>>>>>>> How many threads at a time executing putAll jobs in a single client
>>>>>>> (spark job?)...
>>>>>>> Do you see read timeout exception in client logs...If so, can you
>>>>>>> try increasing the read timeout value. Or reducing the putAll size.
>>>>>>>
>>>>>>> In case of PutAll for partitioned region; the putAll (entries) size
>>>>>>> is broken down and sent to respective servers based on its data affinity;
>>>>>>> the reason its working with partitioned region.
>>>>>>>
>>>>>>> You can find more detail on how client-server connection works at:
>>>>>>>
>>>>>>> https://geode.apache.org/docs/guide/14/topologies_and_comm/topology_concepts/how_the_pool_manages_connections.html
>>>>>>>
>>>>>>> -Anil.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jun 24, 2019 at 10:04 AM aashish choudhary <
>>>>>>> aashish.choudhary1@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We have been experiencing issues while connect to geode using
>>>>>>>> putAll API with spark. Issue is specific to one particular spark job which
>>>>>>>> tries to load data to a replicated region. Exception we see in the server
>>>>>>>> side is that default limit of 800 gets maxed out and on client side we see
>>>>>>>> retry attempt to each server but gets failed even though when we re ran the
>>>>>>>> same job it gets completed without any issue.
>>>>>>>>
>>>>>>>> In the code problem I could see is that we are connecting to geode
>>>>>>>> using client cache in forEachPartition which I think could be the issue. So
>>>>>>>> for each partition we are making a connection to geode. In stats file we
>>>>>>>> could see that connections getting timeout and there is thread burst also
>>>>>>>> sometimes >4000.
>>>>>>>>
>>>>>>>> What is the recommended way to connect to geode using spark?
>>>>>>>>
>>>>>>>> But this one specific job which gets failed most of the times and
>>>>>>>> is a replicated region. Also when we change the type of region to
>>>>>>>> partitioned then job gets completed. We have enabled disk persistence for
>>>>>>>> both type of regions.
>>>>>>>>
>>>>>>>> Thoughts?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> With best regards,
>>>>>>>> Ashish
>>>>>>>>
>>>>>>>
>>
>> --
>> Charlie Black | cblack@pivotal.io
>>
>

Re: Spark geode best practices

Posted by Xiaojian Zhou <gz...@pivotal.io>.
You can increase the max connection size from the default of 800 to 5000. We
did that a long time ago for a customer.

I noticed that your servers are using a "replicated" region. In that case,
single-hop will not take effect. That's fine.

If the putAll map is too big, it will hit the read timeout, because a bigger
map takes longer to process. 600k entries in one map is too big. According to
my tests, 1k to 10k is the comfortable size. Since increasing the read timeout
worked around your issue, I suspect the oversized map is the real root cause.

So my suggestions:
1) Try to reduce your putAll map to 1k ~ 10k entries (a batching sketch
follows below).
2) If that still doesn't work, increase the max connection size from 800 to
5000.

Regards
Gester Zhou
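
A small illustrative helper (not from the thread) for suggestion 1: break one
large map into fixed-size putAll calls so that each request stays well under
the pool read timeout. A call such as putAllInBatches(region, bigMap, 5000)
lands in the suggested 1k ~ 10k range.

import java.util.HashMap;
import java.util.Map;

import org.apache.geode.cache.Region;

public final class PutAllBatcher {

  // Send the entries as a series of bounded putAll calls instead of one large map.
  public static <K, V> void putAllInBatches(Region<K, V> region,
                                            Map<K, V> entries,
                                            int batchSize) {
    Map<K, V> batch = new HashMap<>(batchSize);
    for (Map.Entry<K, V> e : entries.entrySet()) {
      batch.put(e.getKey(), e.getValue());
      if (batch.size() == batchSize) {
        region.putAll(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      region.putAll(batch);
    }
  }
}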



On Wed, Jun 26, 2019 at 10:46 AM Charlie Black <cb...@pivotal.io> wrote:

> Try batches that are small as a starting point - say 100.
>
> On Wed, Jun 26, 2019 at 10:33 AM aashish choudhary <
> aashish.choudhary1@gmail.com> wrote:
>
>> Yes we see exceeded max-connections error on server side.
>>
>> So I was trying to see how the putAll API works in general and from a
>> standard java client I was trying to simulate the behaviour that we see on
>> our server.
>> I tried to put 600k records using putAll on my local machine with 1
>> locator and 2 servers. Region type is replicate persistent and I could see
>> that local clientCache API getting crashed with some "pool unexpected"
>> error. We do see this error on our spark code as well. It then do a retry
>> and gets failed. However surprisingly data gets inserted in the region even
>> though clientCache java API was crashed.
>>
>> I tried to run it through in some batches but those also got failed and
>> it's too slow.
>>
>> Only way I was able to make it work by is increasing readtimeout to 60
>> seconds.
>>
>> Can someone share some tips on putAll API?
>> How to use it effectively?
>>
>>
>> With best regards,
>> Ashish
>>
>> On Wed, Jun 26, 2019, 6:20 AM Anilkumar Gingade <ag...@pivotal.io>
>> wrote:
>>
>>> Ashish,
>>>
>>> Do you see "exceeded max-connections" error...
>>>
>>> Operation/Job getting completed second time indicates, the server where
>>> the operation is executed first time may have issues, you may want to see
>>> the load on that server and if there are any memory issues.
>>>
>>> >>What is the recommended way to connect to geode using spark?
>>> Its more of how the geode is used in this context; is the spark
>>> processors are acting as geode's client or peer node. If its geode client,
>>> then its more about tuning client connections based on how/what operations
>>> are performed.
>>>
>>>  Anil
>>>
>>>
>>>
>>>
>>> On Tue, Jun 25, 2019 at 10:54 AM aashish choudhary <
>>> aashish.choudhary1@gmail.com> wrote:
>>>
>>>> We could also see below on server side logs as well.
>>>>
>>>> Rejected connection from Server connection from
>>>> >> [client host address=x.yx.x.x; client port=abc] because incoming
>>>> >> request was rejected by pool possibly due to thread exhaustion
>>>> >>
>>>>
>>>>
>>>> On Tue, Jun 25, 2019, 7:27 AM aashish choudhary <
>>>> aashish.choudhary1@gmail.com> wrote:
>>>>
>>>>> As I mentioned earlier threads count could go to 4000 and we have seen
>>>>> readtimeout crossing default 10 seconds. We tried to increase read timeout
>>>>> to 30 seconds but that didn't work either. Record count is not more than
>>>>> 600k.
>>>>>
>>>>> Job gets successful in second attempt without changing anything which
>>>>> is bit weird.
>>>>>
>>>>> With best regards,
>>>>> Ashish
>>>>>
>>>>> On Tue, Jun 25, 2019, 12:23 AM Anilkumar Gingade <ag...@pivotal.io>
>>>>> wrote:
>>>>>
>>>>>> Hi Ashish,
>>>>>>
>>>>>> How many threads at a time executing putAll jobs in a single client
>>>>>> (spark job?)...
>>>>>> Do you see read timeout exception in client logs...If so, can you try
>>>>>> increasing the read timeout value. Or reducing the putAll size.
>>>>>>
>>>>>> In case of PutAll for partitioned region; the putAll (entries) size
>>>>>> is broken down and sent to respective servers based on its data affinity;
>>>>>> the reason its working with partitioned region.
>>>>>>
>>>>>> You can find more detail on how client-server connection works at:
>>>>>>
>>>>>> https://geode.apache.org/docs/guide/14/topologies_and_comm/topology_concepts/how_the_pool_manages_connections.html
>>>>>>
>>>>>> -Anil.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 24, 2019 at 10:04 AM aashish choudhary <
>>>>>> aashish.choudhary1@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We have been experiencing issues while connect to geode using putAll
>>>>>>> API with spark. Issue is specific to one particular spark job which tries
>>>>>>> to load data to a replicated region. Exception we see in the server side is
>>>>>>> that default limit of 800 gets maxed out and on client side we see retry
>>>>>>> attempt to each server but gets failed even though when we re ran the same
>>>>>>> job it gets completed without any issue.
>>>>>>>
>>>>>>> In the code problem I could see is that we are connecting to geode
>>>>>>> using client cache in forEachPartition which I think could be the issue. So
>>>>>>> for each partition we are making a connection to geode. In stats file we
>>>>>>> could see that connections getting timeout and there is thread burst also
>>>>>>> sometimes >4000.
>>>>>>>
>>>>>>> What is the recommended way to connect to geode using spark?
>>>>>>>
>>>>>>> But this one specific job which gets failed most of the times and is
>>>>>>> a replicated region. Also when we change the type of region to partitioned
>>>>>>> then job gets completed. We have enabled disk persistence for both type of
>>>>>>> regions.
>>>>>>>
>>>>>>> Thoughts?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> With best regards,
>>>>>>> Ashish
>>>>>>>
>>>>>>
>
> --
> Charlie Black | cblack@pivotal.io
>

Re: Spark geode best practices

Posted by Charlie Black <cb...@pivotal.io>.
Try small batches as a starting point - say 100.

On Wed, Jun 26, 2019 at 10:33 AM aashish choudhary <
aashish.choudhary1@gmail.com> wrote:

> Yes we see exceeded max-connections error on server side.
>
> So I was trying to see how the putAll API works in general and from a
> standard java client I was trying to simulate the behaviour that we see on
> our server.
> I tried to put 600k records using putAll on my local machine with 1
> locator and 2 servers. Region type is replicate persistent and I could see
> that local clientCache API getting crashed with some "pool unexpected"
> error. We do see this error on our spark code as well. It then do a retry
> and gets failed. However surprisingly data gets inserted in the region even
> though clientCache java API was crashed.
>
> I tried to run it through in some batches but those also got failed and
> it's too slow.
>
> Only way I was able to make it work by is increasing readtimeout to 60
> seconds.
>
> Can someone share some tips on putAll API?
> How to use it effectively?
>
>
> With best regards,
> Ashish
>
> On Wed, Jun 26, 2019, 6:20 AM Anilkumar Gingade <ag...@pivotal.io>
> wrote:
>
>> Ashish,
>>
>> Do you see "exceeded max-connections" error...
>>
>> Operation/Job getting completed second time indicates, the server where
>> the operation is executed first time may have issues, you may want to see
>> the load on that server and if there are any memory issues.
>>
>> >>What is the recommended way to connect to geode using spark?
>> Its more of how the geode is used in this context; is the spark
>> processors are acting as geode's client or peer node. If its geode client,
>> then its more about tuning client connections based on how/what operations
>> are performed.
>>
>>  Anil
>>
>>
>>
>>
>> On Tue, Jun 25, 2019 at 10:54 AM aashish choudhary <
>> aashish.choudhary1@gmail.com> wrote:
>>
>>> We could also see below on server side logs as well.
>>>
>>> Rejected connection from Server connection from
>>> >> [client host address=x.yx.x.x; client port=abc] because incoming
>>> >> request was rejected by pool possibly due to thread exhaustion
>>> >>
>>>
>>>
>>> On Tue, Jun 25, 2019, 7:27 AM aashish choudhary <
>>> aashish.choudhary1@gmail.com> wrote:
>>>
>>>> As I mentioned earlier threads count could go to 4000 and we have seen
>>>> readtimeout crossing default 10 seconds. We tried to increase read timeout
>>>> to 30 seconds but that didn't work either. Record count is not more than
>>>> 600k.
>>>>
>>>> Job gets successful in second attempt without changing anything which
>>>> is bit weird.
>>>>
>>>> With best regards,
>>>> Ashish
>>>>
>>>> On Tue, Jun 25, 2019, 12:23 AM Anilkumar Gingade <ag...@pivotal.io>
>>>> wrote:
>>>>
>>>>> Hi Ashish,
>>>>>
>>>>> How many threads at a time executing putAll jobs in a single client
>>>>> (spark job?)...
>>>>> Do you see read timeout exception in client logs...If so, can you try
>>>>> increasing the read timeout value. Or reducing the putAll size.
>>>>>
>>>>> In case of PutAll for partitioned region; the putAll (entries) size is
>>>>> broken down and sent to respective servers based on its data affinity; the
>>>>> reason its working with partitioned region.
>>>>>
>>>>> You can find more detail on how client-server connection works at:
>>>>>
>>>>> https://geode.apache.org/docs/guide/14/topologies_and_comm/topology_concepts/how_the_pool_manages_connections.html
>>>>>
>>>>> -Anil.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jun 24, 2019 at 10:04 AM aashish choudhary <
>>>>> aashish.choudhary1@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We have been experiencing issues while connect to geode using putAll
>>>>>> API with spark. Issue is specific to one particular spark job which tries
>>>>>> to load data to a replicated region. Exception we see in the server side is
>>>>>> that default limit of 800 gets maxed out and on client side we see retry
>>>>>> attempt to each server but gets failed even though when we re ran the same
>>>>>> job it gets completed without any issue.
>>>>>>
>>>>>> In the code problem I could see is that we are connecting to geode
>>>>>> using client cache in forEachPartition which I think could be the issue. So
>>>>>> for each partition we are making a connection to geode. In stats file we
>>>>>> could see that connections getting timeout and there is thread burst also
>>>>>> sometimes >4000.
>>>>>>
>>>>>> What is the recommended way to connect to geode using spark?
>>>>>>
>>>>>> But this one specific job which gets failed most of the times and is
>>>>>> a replicated region. Also when we change the type of region to partitioned
>>>>>> then job gets completed. We have enabled disk persistence for both type of
>>>>>> regions.
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>>
>>>>>>
>>>>>> With best regards,
>>>>>> Ashish
>>>>>>
>>>>>

-- 
Charlie Black | cblack@pivotal.io

Re: Spark geode best practices

Posted by aashish choudhary <aa...@gmail.com>.
Yes, we see the exceeded max-connections error on the server side.

To see how the putAll API works in general, I tried to simulate from a
standard Java client the behaviour that we see on our server. I put 600k
records using putAll on my local machine with 1 locator and 2 servers. The
region type is replicate persistent, and the local ClientCache crashed with a
"pool unexpected" error. We see this error in our Spark code as well; it then
does a retry and fails. Surprisingly, though, the data gets inserted into the
region even though the ClientCache Java API crashed.

I also tried running it in batches, but those failed too, and it's too slow.

The only way I was able to make it work was by increasing the read timeout to
60 seconds.

Can someone share some tips on the putAll API and how to use it effectively?


With best regards,
Ashish

On Wed, Jun 26, 2019, 6:20 AM Anilkumar Gingade <ag...@pivotal.io> wrote:

> Ashish,
>
> Do you see "exceeded max-connections" error...
>
> Operation/Job getting completed second time indicates, the server where
> the operation is executed first time may have issues, you may want to see
> the load on that server and if there are any memory issues.
>
> >>What is the recommended way to connect to geode using spark?
> Its more of how the geode is used in this context; is the spark processors
> are acting as geode's client or peer node. If its geode client, then its
> more about tuning client connections based on how/what operations are
> performed.
>
>  Anil
>
>
>
>
> On Tue, Jun 25, 2019 at 10:54 AM aashish choudhary <
> aashish.choudhary1@gmail.com> wrote:
>
>> We could also see below on server side logs as well.
>>
>> Rejected connection from Server connection from
>> >> [client host address=x.yx.x.x; client port=abc] because incoming
>> >> request was rejected by pool possibly due to thread exhaustion
>> >>
>>
>>
>> On Tue, Jun 25, 2019, 7:27 AM aashish choudhary <
>> aashish.choudhary1@gmail.com> wrote:
>>
>>> As I mentioned earlier threads count could go to 4000 and we have seen
>>> readtimeout crossing default 10 seconds. We tried to increase read timeout
>>> to 30 seconds but that didn't work either. Record count is not more than
>>> 600k.
>>>
>>> Job gets successful in second attempt without changing anything which is
>>> bit weird.
>>>
>>> With best regards,
>>> Ashish
>>>
>>> On Tue, Jun 25, 2019, 12:23 AM Anilkumar Gingade <ag...@pivotal.io>
>>> wrote:
>>>
>>>> Hi Ashish,
>>>>
>>>> How many threads at a time executing putAll jobs in a single client
>>>> (spark job?)...
>>>> Do you see read timeout exception in client logs...If so, can you try
>>>> increasing the read timeout value. Or reducing the putAll size.
>>>>
>>>> In case of PutAll for partitioned region; the putAll (entries) size is
>>>> broken down and sent to respective servers based on its data affinity; the
>>>> reason its working with partitioned region.
>>>>
>>>> You can find more detail on how client-server connection works at:
>>>>
>>>> https://geode.apache.org/docs/guide/14/topologies_and_comm/topology_concepts/how_the_pool_manages_connections.html
>>>>
>>>> -Anil.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Jun 24, 2019 at 10:04 AM aashish choudhary <
>>>> aashish.choudhary1@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We have been experiencing issues while connect to geode using putAll
>>>>> API with spark. Issue is specific to one particular spark job which tries
>>>>> to load data to a replicated region. Exception we see in the server side is
>>>>> that default limit of 800 gets maxed out and on client side we see retry
>>>>> attempt to each server but gets failed even though when we re ran the same
>>>>> job it gets completed without any issue.
>>>>>
>>>>> In the code problem I could see is that we are connecting to geode
>>>>> using client cache in forEachPartition which I think could be the issue. So
>>>>> for each partition we are making a connection to geode. In stats file we
>>>>> could see that connections getting timeout and there is thread burst also
>>>>> sometimes >4000.
>>>>>
>>>>> What is the recommended way to connect to geode using spark?
>>>>>
>>>>> But this one specific job which gets failed most of the times and is a
>>>>> replicated region. Also when we change the type of region to partitioned
>>>>> then job gets completed. We have enabled disk persistence for both type of
>>>>> regions.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>>
>>>>>
>>>>> With best regards,
>>>>> Ashish
>>>>>
>>>>

Re: Spark geode best practices

Posted by Anilkumar Gingade <ag...@pivotal.io>.
Ashish,

Do you see "exceeded max-connections" error...

Operation/Job getting completed second time indicates, the server where the
operation is executed first time may have issues, you may want to see the
load on that server and if there are any memory issues.

>>What is the recommended way to connect to geode using spark?
Its more of how the geode is used in this context; is the spark processors
are acting as geode's client or peer node. If its geode client, then its
more about tuning client connections based on how/what operations are
performed.

 Anil
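
A sketch of the client-side connection tuning being referred to, using the
ClientCacheFactory pool setters (the values are illustrative, not
recommendations):

import org.apache.geode.cache.client.ClientCache;
import org.apache.geode.cache.client.ClientCacheFactory;

public class TunedClient {
  public static void main(String[] args) {
    ClientCache cache = new ClientCacheFactory()
        .addPoolLocator("locator-host", 10334)  // illustrative locator
        .setPoolReadTimeout(60000)              // ms to wait for a server reply (default 10000)
        .setPoolMinConnections(1)
        .setPoolMaxConnections(50)              // cap the connections this client may open
        .setPoolRetryAttempts(1)                // avoid piling retries onto a loaded server
        .create();
    // ... create PROXY regions and perform putAll as usual ...
    cache.close();
  }
}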




On Tue, Jun 25, 2019 at 10:54 AM aashish choudhary <
aashish.choudhary1@gmail.com> wrote:

> We could also see below on server side logs as well.
>
> Rejected connection from Server connection from
> >> [client host address=x.yx.x.x; client port=abc] because incoming
> >> request was rejected by pool possibly due to thread exhaustion
> >>
>
>
> On Tue, Jun 25, 2019, 7:27 AM aashish choudhary <
> aashish.choudhary1@gmail.com> wrote:
>
>> As I mentioned earlier threads count could go to 4000 and we have seen
>> readtimeout crossing default 10 seconds. We tried to increase read timeout
>> to 30 seconds but that didn't work either. Record count is not more than
>> 600k.
>>
>> Job gets successful in second attempt without changing anything which is
>> bit weird.
>>
>> With best regards,
>> Ashish
>>
>> On Tue, Jun 25, 2019, 12:23 AM Anilkumar Gingade <ag...@pivotal.io>
>> wrote:
>>
>>> Hi Ashish,
>>>
>>> How many threads at a time executing putAll jobs in a single client
>>> (spark job?)...
>>> Do you see read timeout exception in client logs...If so, can you try
>>> increasing the read timeout value. Or reducing the putAll size.
>>>
>>> In case of PutAll for partitioned region; the putAll (entries) size is
>>> broken down and sent to respective servers based on its data affinity; the
>>> reason its working with partitioned region.
>>>
>>> You can find more detail on how client-server connection works at:
>>>
>>> https://geode.apache.org/docs/guide/14/topologies_and_comm/topology_concepts/how_the_pool_manages_connections.html
>>>
>>> -Anil.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Jun 24, 2019 at 10:04 AM aashish choudhary <
>>> aashish.choudhary1@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> We have been experiencing issues while connect to geode using putAll
>>>> API with spark. Issue is specific to one particular spark job which tries
>>>> to load data to a replicated region. Exception we see in the server side is
>>>> that default limit of 800 gets maxed out and on client side we see retry
>>>> attempt to each server but gets failed even though when we re ran the same
>>>> job it gets completed without any issue.
>>>>
>>>> In the code problem I could see is that we are connecting to geode
>>>> using client cache in forEachPartition which I think could be the issue. So
>>>> for each partition we are making a connection to geode. In stats file we
>>>> could see that connections getting timeout and there is thread burst also
>>>> sometimes >4000.
>>>>
>>>> What is the recommended way to connect to geode using spark?
>>>>
>>>> But this one specific job which gets failed most of the times and is a
>>>> replicated region. Also when we change the type of region to partitioned
>>>> then job gets completed. We have enabled disk persistence for both type of
>>>> regions.
>>>>
>>>> Thoughts?
>>>>
>>>>
>>>>
>>>> With best regards,
>>>> Ashish
>>>>
>>>

Re: Spark geode best practices

Posted by aashish choudhary <aa...@gmail.com>.
We can also see the following in the server-side logs:

Rejected connection from Server connection from [client host
address=x.yx.x.x; client port=abc] because incoming request was rejected by
pool possibly due to thread exhaustion


On Tue, Jun 25, 2019, 7:27 AM aashish choudhary <
aashish.choudhary1@gmail.com> wrote:

> As I mentioned earlier threads count could go to 4000 and we have seen
> readtimeout crossing default 10 seconds. We tried to increase read timeout
> to 30 seconds but that didn't work either. Record count is not more than
> 600k.
>
> Job gets successful in second attempt without changing anything which is
> bit weird.
>
> With best regards,
> Ashish
>
> On Tue, Jun 25, 2019, 12:23 AM Anilkumar Gingade <ag...@pivotal.io>
> wrote:
>
>> Hi Ashish,
>>
>> How many threads at a time executing putAll jobs in a single client
>> (spark job?)...
>> Do you see read timeout exception in client logs...If so, can you try
>> increasing the read timeout value. Or reducing the putAll size.
>>
>> In case of PutAll for partitioned region; the putAll (entries) size is
>> broken down and sent to respective servers based on its data affinity; the
>> reason its working with partitioned region.
>>
>> You can find more detail on how client-server connection works at:
>>
>> https://geode.apache.org/docs/guide/14/topologies_and_comm/topology_concepts/how_the_pool_manages_connections.html
>>
>> -Anil.
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Jun 24, 2019 at 10:04 AM aashish choudhary <
>> aashish.choudhary1@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We have been experiencing issues while connect to geode using putAll API
>>> with spark. Issue is specific to one particular spark job which tries to
>>> load data to a replicated region. Exception we see in the server side is
>>> that default limit of 800 gets maxed out and on client side we see retry
>>> attempt to each server but gets failed even though when we re ran the same
>>> job it gets completed without any issue.
>>>
>>> In the code problem I could see is that we are connecting to geode using
>>> client cache in forEachPartition which I think could be the issue. So for
>>> each partition we are making a connection to geode. In stats file we could
>>> see that connections getting timeout and there is thread burst also
>>> sometimes >4000.
>>>
>>> What is the recommended way to connect to geode using spark?
>>>
>>> But this one specific job which gets failed most of the times and is a
>>> replicated region. Also when we change the type of region to partitioned
>>> then job gets completed. We have enabled disk persistence for both type of
>>> regions.
>>>
>>> Thoughts?
>>>
>>>
>>>
>>> With best regards,
>>> Ashish
>>>
>>

Re: Spark geode best practices

Posted by aashish choudhary <aa...@gmail.com>.
As I mentioned earlier, the thread count can go up to 4000, and we have seen
the read timeout exceed the default 10 seconds. We tried increasing the read
timeout to 30 seconds, but that didn't work either. The record count is not
more than 600k.

The job succeeds on the second attempt without changing anything, which is a
bit weird.

With best regards,
Ashish

On Tue, Jun 25, 2019, 12:23 AM Anilkumar Gingade <ag...@pivotal.io>
wrote:

> Hi Ashish,
>
> How many threads at a time executing putAll jobs in a single client (spark
> job?)...
> Do you see read timeout exception in client logs...If so, can you try
> increasing the read timeout value. Or reducing the putAll size.
>
> In case of PutAll for partitioned region; the putAll (entries) size is
> broken down and sent to respective servers based on its data affinity; the
> reason its working with partitioned region.
>
> You can find more detail on how client-server connection works at:
>
> https://geode.apache.org/docs/guide/14/topologies_and_comm/topology_concepts/how_the_pool_manages_connections.html
>
> -Anil.
>
>
>
>
>
>
>
> On Mon, Jun 24, 2019 at 10:04 AM aashish choudhary <
> aashish.choudhary1@gmail.com> wrote:
>
>> Hi,
>>
>> We have been experiencing issues while connect to geode using putAll API
>> with spark. Issue is specific to one particular spark job which tries to
>> load data to a replicated region. Exception we see in the server side is
>> that default limit of 800 gets maxed out and on client side we see retry
>> attempt to each server but gets failed even though when we re ran the same
>> job it gets completed without any issue.
>>
>> In the code problem I could see is that we are connecting to geode using
>> client cache in forEachPartition which I think could be the issue. So for
>> each partition we are making a connection to geode. In stats file we could
>> see that connections getting timeout and there is thread burst also
>> sometimes >4000.
>>
>> What is the recommended way to connect to geode using spark?
>>
>> But this one specific job which gets failed most of the times and is a
>> replicated region. Also when we change the type of region to partitioned
>> then job gets completed. We have enabled disk persistence for both type of
>> regions.
>>
>> Thoughts?
>>
>>
>>
>> With best regards,
>> Ashish
>>
>

Re: Spark geode best practices

Posted by Anilkumar Gingade <ag...@pivotal.io>.
Hi Ashish,

How many threads are executing putAll operations at a time in a single
client (the Spark job)?
Do you see a read timeout exception in the client logs? If so, can you try
increasing the read timeout value, or reducing the putAll size?

In the case of putAll against a partitioned region, the putAll (entries) map
is broken down and sent to the respective servers based on data affinity;
that is why it works with a partitioned region.

You can find more detail on how client-server connection works at:
https://geode.apache.org/docs/guide/14/topologies_and_comm/topology_concepts/how_the_pool_manages_connections.html

-Anil.







On Mon, Jun 24, 2019 at 10:04 AM aashish choudhary <
aashish.choudhary1@gmail.com> wrote:

> Hi,
>
> We have been experiencing issues while connect to geode using putAll API
> with spark. Issue is specific to one particular spark job which tries to
> load data to a replicated region. Exception we see in the server side is
> that default limit of 800 gets maxed out and on client side we see retry
> attempt to each server but gets failed even though when we re ran the same
> job it gets completed without any issue.
>
> In the code problem I could see is that we are connecting to geode using
> client cache in forEachPartition which I think could be the issue. So for
> each partition we are making a connection to geode. In stats file we could
> see that connections getting timeout and there is thread burst also
> sometimes >4000.
>
> What is the recommended way to connect to geode using spark?
>
> But this one specific job which gets failed most of the times and is a
> replicated region. Also when we change the type of region to partitioned
> then job gets completed. We have enabled disk persistence for both type of
> regions.
>
> Thoughts?
>
>
>
> With best regards,
> Ashish
>