Posted to dev@accumulo.apache.org by Vincent Russell <vi...@gmail.com> on 2023/03/15 21:52:18 UTC

slow ingest speeds

Hello,

I am using Accumulo 2.0.1 with Hadoop 3.3.1.

I have two identical clusters with 28 tservers.

I have writers on both clusters, each set up with 10 batch writers and a
max memory of 50m.

However, one cluster is ingesting 10x faster than the other.

Is there anything I should check for?

I don't see any errors, but one thing that I noticed is that the slow site
has a lot of "Slow sync cost" info log messages from the tservers.

I see these messages on the fast cluster as well, but far fewer of them.
On the slow cluster they also appear to come from only two of the nodes,
whereas on the fast cluster they are more spread out.
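
One way to put numbers on that observation is to count the "Slow sync cost" lines per host and see how much of the total comes from the busiest two. The following is only a sketch: it matches on the message text quoted above and assumes nothing else about the tserver log format. How the lines are collected per host, the host names, and the sample lines are all hypothetical.

```python
from collections import Counter

# Message text as quoted in the post above; the rest of the log line is not assumed.
MARKER = "Slow sync cost"


def count_slow_syncs(lines_by_host):
    """Count log lines containing MARKER for each host.

    lines_by_host maps a host name to an iterable of that host's log lines.
    """
    counts = Counter()
    for host, lines in lines_by_host.items():
        counts[host] = sum(1 for line in lines if MARKER in line)
    return counts


def share_of_top_hosts(counts, n=2):
    """Return the fraction of all matching lines produced by the n busiest hosts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in counts.most_common(n)) / total


if __name__ == "__main__":
    # Hypothetical host names and log lines, for illustration only.
    sample = {
        "tserver-a": ["... Slow sync cost ...", "... Slow sync cost ..."],
        "tserver-b": ["... unrelated message ..."],
    }
    counts = count_slow_syncs(sample)
    print(counts, share_of_top_hosts(counts, n=1))
```

A share close to 1.0 for two hosts out of 28 would match the concentration described for the slow cluster, while a spread-out pattern would give a much lower value.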

Thank you in advance for your help,
Vincent

Re: slow ingest speeds

Posted by Vincent Russell <vi...@gmail.com>.
I just noticed that swappiness is set to 60 at the slow site and 1 at the
other site. I am going to work with the system administrators to change
this as soon as possible.
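
For anyone wanting to check the same setting, a minimal sketch follows. On Linux the current vm.swappiness value can be read from /proc/sys/vm/swappiness. The helper names, the host names, and the limit of 1 (taken from the fast site's value mentioned above) are assumptions for illustration, not a recommendation from this thread.

```python
from pathlib import Path

# Standard Linux location of the vm.swappiness setting.
SWAPPINESS_PATH = "/proc/sys/vm/swappiness"


def read_swappiness(path=SWAPPINESS_PATH):
    """Return the integer swappiness value stored in the given file."""
    return int(Path(path).read_text().strip())


def hosts_above(values_by_host, limit=1):
    """Return the sorted host names whose swappiness exceeds limit."""
    return sorted(host for host, value in values_by_host.items() if value > limit)


if __name__ == "__main__":
    # Hypothetical host names; the values are the ones reported in this message.
    print(hosts_above({"slow-site-node": 60, "fast-site-node": 1}))
```

Running read_swappiness() on each node and feeding the results to hosts_above() would list the hosts that differ from the fast site.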

Thanks,

On Thu, Mar 16, 2023 at 9:34 AM Vincent Russell <vi...@gmail.com>
wrote:

> Thank you, Dave. I didn't look at the slow sync cost messages when I
> shut those nodes down. I just monitored the ingest speed. I can try that
> again.
>
> I also shut down the tserver on one of those slow sync cost nodes, and ingest
> stopped for about 30 seconds and then continued at the same slow speed.
>
> Also, according to the Accumulo monitor, the tablets are pretty
> evenly distributed.
>
> I am going to try to move the node that's doing the ingesting to another
> host and see what happens.
>
> Thanks,

Re: slow ingest speeds

Posted by Vincent Russell <vi...@gmail.com>.
Thank you, Dave. I didn't look at the slow sync cost messages when I
shut those nodes down. I just monitored the ingest speed. I can try that
again.

I also shut down the tserver on one of those slow sync cost nodes, and ingest
stopped for about 30 seconds and then continued at the same slow speed.

Also, according to the Accumulo monitor, the tablets are pretty
evenly distributed.

I am going to try to move the node that's doing the ingesting to another
host and see what happens.

Thanks,

On Wed, Mar 15, 2023 at 7:26 PM Dave Marion <dm...@gmail.com> wrote:

> When you shut down the two datanodes, did you have the same "slow sync
> cost" messages concentrated on two nodes? If so, is it possible that a
> majority of the writes are going to a small set of tablet servers? You
> might be able to see this on the Monitor. Is it possible that tablets you
> are ingesting are collocated instead of spread out?

Re: slow ingest speeds

Posted by Dave Marion <dm...@gmail.com>.
When you shut down the two datanodes, did you have the same "slow sync
cost" messages concentrated on two nodes? If so, is it possible that a
majority of the writes are going to a small set of tablet servers? You
might be able to see this on the Monitor. Is it possible that tablets you
are ingesting are collocated instead of spread out?
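
As a rough illustration of that check, the sketch below flags tablet servers carrying far more than the average load. The per-server numbers would have to be read off the Monitor by hand or by other means; the input mapping, the server names, and the factor of 2 are hypothetical and not something Accumulo provides in this form.

```python
def hot_servers(load_by_server, factor=2.0):
    """Return the sorted servers whose load exceeds factor times the mean load.

    load_by_server maps a tablet server name to a number such as its ingest
    rate or its count of tablets for the table being written.
    """
    if not load_by_server:
        return []
    mean = sum(load_by_server.values()) / len(load_by_server)
    return sorted(
        server for server, load in load_by_server.items() if load > factor * mean
    )


if __name__ == "__main__":
    # Hypothetical ingest rates (entries per second) for four tablet servers.
    rates = {"ts1": 100, "ts2": 100, "ts3": 1000, "ts4": 0}
    print(hot_servers(rates))
```

An empty result suggests the writes are spread evenly enough that collocated tablets are unlikely to be the cause.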

On Wed, Mar 15, 2023 at 7:01 PM Vincent Russell <vi...@gmail.com>
wrote:

> I stopped the two data nodes and it had no effect.
>
> Thanks,

Re: slow ingest speeds

Posted by Vincent Russell <vi...@gmail.com>.
I stopped the two data nodes and it had no effect.

Thanks,

On Wed, Mar 15, 2023 at 6:53 PM Vincent Russell <vi...@gmail.com>
wrote:

> Yes. We have HDFS rack awareness set up to divide the blocks equally.
> And according to the NameNode HTTP page, it doesn't look like those nodes
> have a much higher number of blocks than other nodes.
>
> I can try temporarily shutting down one of the data nodes to see what that
> does.
>
> We did already lose a node on the cluster a few days ago.  I'm currently
> waiting for the system administrators to replace a disk.
>
> Thanks,

Re: slow ingest speeds

Posted by Vincent Russell <vi...@gmail.com>.
Yes. We have HDFS rack awareness set up to divide the blocks equally. And
according to the NameNode HTTP page, it doesn't look like those nodes have
a much higher number of blocks than other nodes.

I can try temporarily shutting down one of the data nodes to see what that
does.

We did already lose a node on the cluster a few days ago.  I'm currently
waiting for the system administrators to replace a disk.

Thanks,

On Wed, Mar 15, 2023 at 5:59 PM Dave Marion <dm...@gmail.com> wrote:

> Sounds like you have a hot spot on those two datanode hosts, either because
> the blocks that it's writing to are all (or a majority) located there, or
> there is some type of issue with the host. Stopping the DN processes on
> those two hosts should confirm this, unless the hot spot moves. Do you have
> the HDFS rack script set up appropriately to distribute the blocks for
> files across the hosts?

Re: slow ingest speeds

Posted by Dave Marion <dm...@gmail.com>.
Sounds like you have a hot spot on those two datanode hosts, either because
the blocks that it's writing to are all (or a majority) located there, or
there is some type of issue with the host. Stopping the DN processes on
those two hosts should confirm this, unless the hot spot moves. Do you have
the HDFS rack script set up appropriately to distribute the blocks for
files across the hosts?

On Wed, Mar 15, 2023 at 5:52 PM Vincent Russell <vi...@gmail.com>
wrote:

> Hello,
>
> I am using Accumulo 2.0.1 with Hadoop 3.3.1.
>
> I have two identical clusters with 28 tservers.
>
> I have writers on both clusters, each set up with 10 batch writers and a
> max memory of 50m.
>
> However, one cluster is ingesting 10x faster than the other.
>
> Is there anything I should check for?
>
> I don't see any errors, but one thing that I noticed is that the slow site
> has a lot of "Slow sync cost" info log messages from the tservers.
>
> I see these messages on the fast cluster as well, but far fewer of them.
> On the slow cluster they also appear to come from only two of the nodes,
> whereas on the fast cluster they are more spread out.
>
> Thank you in advance for your help,
> Vincent
>