Posted to solr-user@lucene.apache.org by Abhishek Mishra <so...@gmail.com> on 2020/12/09 06:24:09 UTC

solrcloud with EKS kubernetes

Hello guys,
We are facing some issues (timeouts, for example) that occur very
inconsistently. Could they be related to EKS? We are using Solr 7.7
and ZooKeeper 3.4.13. Should we move to ECS?

Regards,
Abhishek

Re: solrcloud with EKS kubernetes

Posted by Abhishek Mishra <so...@gmail.com>.
Hi Jonathan,
That was really helpful. Some of the metrics, such as network bandwidth,
were crossing their thresholds.

Regards,
Abhishek

On Sat, Dec 26, 2020 at 7:54 PM Jonathan Tan <jt...@gmail.com> wrote:

> Hi Abhishek,
>
> Merry Christmas to you too!
> I think it's really a question regarding your indexing speed NFRs.
>
> Have you had a chance to take a look at your IOPS & write bytes/second
> graphs for that host & PVC?
>
> I'd suggest that's the first thing to go look at, so that you can find out
> whether you're actually IOPS bound or not.
> If you are, then it becomes a question of *how* you're indexing, and
> whether that can be "slowed down" or not.
>

Re: solrcloud with EKS kubernetes

Posted by Jonathan Tan <jt...@gmail.com>.
Hi Abhishek,

Merry Christmas to you too!
I think it's really a question regarding your indexing speed NFRs.

Have you had a chance to take a look at your IOPS & write bytes/second
graphs for that host & PVC?

I'd suggest that's the first thing to go look at, so that you can find out
whether you're actually IOPS bound or not.
If you are, then it becomes a question of *how* you're indexing, and
whether that can be "slowed down" or not.
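
For illustration, one way indexing could be "slowed down" is a client-side
token bucket that caps how many documents per second are sent to Solr. This
is only a sketch under assumptions: `TokenBucket`, `index_throttled`, and the
`send_batch` callback are hypothetical names, not part of Solr or any Solr
client, and a sensible rate would have to come from your own IOPS graphs.

```python
import time


class TokenBucket:
    """Allow at most `rate` units per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last refill, up to capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def acquire(self, amount=1):
        """Block until `amount` tokens are available, then consume them."""
        if amount > self.capacity:
            raise ValueError("amount exceeds bucket capacity")
        while True:
            self._refill()
            if self.tokens >= amount:
                self.tokens -= amount
                return
            # Sleep just long enough for the missing tokens to accumulate.
            self.sleep((amount - self.tokens) / self.rate)


def index_throttled(batches, send_batch, bucket):
    """Send each batch only after the bucket grants one token per document."""
    sent = 0
    for batch in batches:
        bucket.acquire(len(batch))
        send_batch(batch)  # hypothetical callback that posts the batch to Solr
        sent += len(batch)
    return sent
```

With `rate=100` and `capacity=100`, three batches of 100, 100, and 50
documents take roughly 1.5 seconds instead of being sent back to back.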



On Thu, Dec 24, 2020 at 5:55 PM Abhishek Mishra <so...@gmail.com>
wrote:

> Hi Jonathan,
> Merry Christmas.
> Thanks for the suggestion. To manage IOPS, can we apply some kind of
> rate limiting?
>
> Regards,
> Abhishek

Re: solrcloud with EKS kubernetes

Posted by Abhishek Mishra <so...@gmail.com>.
Hi Jonathan,
Merry Christmas.
Thanks for the suggestion. To manage IOPS, can we apply some kind of
rate limiting?

Regards,
Abhishek


On Thu, Dec 17, 2020 at 5:07 AM Jonathan Tan <jt...@gmail.com> wrote:

> Hi Abhishek,
>
> We're running Solr Cloud 8.6 on GKE.
> It's a 3-node cluster, with 4 CPUs (configured) and 8 GB of min & max JVM
> heap per node, all with anti-affinity so they never land on the same node.
> It's got 2 collections of ~13documents each, 6 shards, 3 replicas each;
> disk usage on each node is ~54 GB (we've got all the shards replicated to
> all nodes).
>
> We're also using a 200 GB zonal SSD, which *has* been necessary just so that
> we've got the right IOPS & bandwidth. (That's approximately 6000 IOPS for
> read & write each, and 96 MB/s for read & write each.)
>
> Various lessons learnt...
> You definitely don't want them ever on the same Kubernetes node. That's
> partly for resilience, but also because when one Solr node gets busy, they
> all tend to get busy, so you end up with resource contention. Recovery can
> also get very busy and resource-intensive, and again, sitting on the same
> node is problematic. We also saw the need to move to SSDs because of how
> IOPS-bound we were.
>
> Did I mention use SSDs? ;)
>
> Good luck!

Re: solrcloud with EKS kubernetes

Posted by Jonathan Tan <jt...@gmail.com>.
Hi Abhishek,

We're running Solr Cloud 8.6 on GKE.
It's a 3-node cluster, with 4 CPUs (configured) and 8 GB of min & max JVM
heap per node, all with anti-affinity so they never land on the same node.
It's got 2 collections of ~13documents each, 6 shards, 3 replicas each;
disk usage on each node is ~54 GB (we've got all the shards replicated to
all nodes).
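
As a quick sanity check on those numbers, the layout can be worked out as
follows. This is a back-of-the-envelope sketch that assumes replicas are
spread evenly and that the ~54 GB per node is all index data; the function
name is made up for illustration.

```python
def cores_per_node(collections, shards, replicas, nodes):
    """Number of Solr cores (shard replicas) per node, assuming an even spread."""
    total_cores = collections * shards * replicas
    if total_cores % nodes != 0:
        raise ValueError("replicas do not divide evenly across nodes")
    return total_cores // nodes


# Figures from the message: 2 collections, 6 shards, 3 replicas, 3 nodes.
per_node = cores_per_node(collections=2, shards=6, replicas=3, nodes=3)
avg_core_gb = 54 / per_node  # ~54 GB of disk used on each node
print(f"{per_node} cores per node, about {avg_core_gb:.1f} GB each")
```

So each node hosts 12 cores averaging about 4.5 GB, and every one of them
competes for the same disk during indexing or recovery.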

We're also using a 200 GB zonal SSD, which *has* been necessary just so that
we've got the right IOPS & bandwidth. (That's approximately 6000 IOPS for
read & write each, and 96 MB/s for read & write each.)

Various lessons learnt...
You definitely don't want them ever on the same Kubernetes node. That's
partly for resilience, but also because when one Solr node gets busy, they
all tend to get busy, so you end up with resource contention. Recovery can
also get very busy and resource-intensive, and again, sitting on the same
node is problematic. We also saw the need to move to SSDs because of how
IOPS-bound we were.

Did I mention use SSDs? ;)

Good luck!

On Mon, Dec 14, 2020 at 5:34 PM Abhishek Mishra <so...@gmail.com>
wrote:

> Hi Houston,
> Sorry for the late reply. Each shard is around 9 GB.
> Yes, we are giving the pods enough resources. We are currently
> using c5.4xlarge instances.
> Xms and Xmx are both 16 GB. The machine has 32 GB of memory and 16 cores.
> No, I haven't run it outside Kubernetes, but I have colleagues who did
> the same on 7.2 and didn't face any such issues.
> The storage volume is a 50 GB gp2 volume.
> It's not search queries where we see inconsistencies or timeouts.
> It seems some internal admin APIs sometimes have issues. For example,
> adding a new replica to the cluster sometimes results in inconsistencies,
> and recovery sometimes takes more than an hour.
>
> Regards,
> Abhishek

Re: solrcloud with EKS kubernetes

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
FWIW, I have seen Solr exhaust the IOPS burst quota on AWS, causing
slow replication and high latency for search and indexing operations.
You may want to dig into the CloudWatch metrics and see if you are
running into a similar issue. The default IOPS quota on gp2 is very
low (100?).
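
To put rough numbers on that quota, here is a small sketch. It assumes the
figures AWS publishes for gp2 volumes (a baseline of 3 IOPS per GiB with a
floor of 100, bursting to 3,000 IOPS from a bucket of 5.4 million I/O
credits); treat these constants as assumptions to verify against the current
EBS documentation. The function names are made up for illustration.

```python
# Assumed gp2 figures; check them against the current AWS EBS documentation.
GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000
GP2_BURST_IOPS = 3_000
GP2_CREDIT_BUCKET = 5_400_000


def gp2_baseline_iops(size_gib):
    """Baseline IOPS for a gp2 volume of the given size."""
    return min(GP2_MAX_IOPS, max(GP2_MIN_IOPS, GP2_IOPS_PER_GIB * size_gib))


def gp2_burst_seconds(size_gib, demand_iops=GP2_BURST_IOPS):
    """Seconds a full credit bucket lasts at `demand_iops` (capped at burst)."""
    baseline = gp2_baseline_iops(size_gib)
    demand = min(demand_iops, GP2_BURST_IOPS)
    if demand <= baseline:
        return float("inf")  # baseline already covers the demand
    # Credits drain at (demand - baseline) per second while bursting.
    return GP2_CREDIT_BUCKET / (demand - baseline)


size = 50  # the 50 GB gp2 volume mentioned in this thread, taken as GiB
print(gp2_baseline_iops(size), "baseline IOPS")
print(round(gp2_burst_seconds(size) / 60, 1), "minutes of full burst")
```

Under these assumptions a 50 GiB volume has a baseline of 150 IOPS and can
sustain 3,000 IOPS for a little over half an hour from a full bucket, after
which it drops back to the baseline. A recovery that runs for more than an
hour could plausibly hit that wall.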

Another thing to check is whether you have DNS TTLs configured for both
positive and negative lookups. When nodes go down and come back up
in Kubernetes, the DNS name of the pod remains the same but its IP can
change, and the JVM caches DNS lookups. This can cause timeouts.

On 12/14/20, Abhishek Mishra <so...@gmail.com> wrote:
> Hi Houston,
> Sorry for the late reply. Each shard is around 9 GB.
> Yes, we are giving the pods enough resources. We are currently
> using c5.4xlarge instances.
> Xms and Xmx are both 16 GB. The machine has 32 GB of memory and 16 cores.
> No, I haven't run it outside Kubernetes, but I have colleagues who did
> the same on 7.2 and didn't face any such issues.
> The storage volume is a 50 GB gp2 volume.
> It's not search queries where we see inconsistencies or timeouts.
> It seems some internal admin APIs sometimes have issues. For example,
> adding a new replica to the cluster sometimes results in inconsistencies,
> and recovery sometimes takes more than an hour.
>
> Regards,
> Abhishek


-- 
Regards,
Shalin Shekhar Mangar.

Re: solrcloud with EKS kubernetes

Posted by Abhishek Mishra <so...@gmail.com>.
Hi Houston,
Sorry for the late reply. Each shard is around 9 GB.
Yes, we are giving the pods enough resources. We are currently
using c5.4xlarge instances.
Xms and Xmx are both 16 GB. The machine has 32 GB of memory and 16 cores.
No, I haven't run it outside Kubernetes, but I have colleagues who did
the same on 7.2 and didn't face any such issues.
The storage volume is a 50 GB gp2 volume.
It's not search queries where we see inconsistencies or timeouts.
It seems some internal admin APIs sometimes have issues. For example,
adding a new replica to the cluster sometimes results in inconsistencies,
and recovery sometimes takes more than an hour.

Regards,
Abhishek

On Thu, Dec 10, 2020 at 10:23 AM Houston Putman <ho...@gmail.com>
wrote:

> Hello Abhishek,
>
> It's really hard to provide any advice without knowing any information
> about your setup/usage.
>
> Are you giving your Solr pods enough resources on EKS?
> Have you run Solr in the same configuration outside of Kubernetes in the
> past without timeouts?
> What type of storage volumes are you using to store your data?
> Are you using headless services to connect your Solr nodes, or ingresses?
>
> If this is the first time that you are using this data + Solr
> configuration, maybe it's just that your data within Solr isn't optimized
> for the type of queries that you are doing.
> If you have run it successfully in the past outside of Kubernetes, then I
> would look at the resources that you are giving your pods and the storage
> volumes that you are using.
> If you are using Ingresses, that might be causing slow connections between
> nodes, or between your client and Solr.
>
> - Houston

Re: solrcloud with EKS kubernetes

Posted by Houston Putman <ho...@gmail.com>.
Hello Abhishek,

It's really hard to provide any advice without knowing any information
about your setup/usage.

Are you giving your Solr pods enough resources on EKS?
Have you run Solr in the same configuration outside of Kubernetes in the
past without timeouts?
What type of storage volumes are you using to store your data?
Are you using headless services to connect your Solr nodes, or ingresses?

If this is the first time that you are using this data + Solr
configuration, maybe it's just that your data within Solr isn't optimized
for the type of queries that you are doing.
If you have run it successfully in the past outside of Kubernetes, then I
would look at the resources that you are giving your pods and the storage
volumes that you are using.
If you are using Ingresses, that might be causing slow connections between
nodes, or between your client and Solr.

- Houston

On Wed, Dec 9, 2020 at 3:24 PM Abhishek Mishra <so...@gmail.com> wrote:

> Hello guys,
> We are facing some issues (timeouts, for example) that occur very
> inconsistently. Could they be related to EKS? We are using Solr 7.7
> and ZooKeeper 3.4.13. Should we move to ECS?
>
> Regards,
> Abhishek
>