Posted to user@accumulo.apache.org by Arshak Navruzyan <ar...@gmail.com> on 2014/01/12 18:28:21 UTC

ISAM file location vs. read performance

One aspect of Accumulo architecture is still unclear to me.  Would you
achieve better scan performance if you could guarantee that the tablet and
its ISAM file lived on the same node?  I'm guessing ISAM files are not
splittable, so they pretty much stay on one HDFS data node (plus the replica
copies). Or is the theory that SATA and a 10 Gbps network provide more or
less the same throughput?

I generally understand that as the table grows and Accumulo creates more
splits (tablets) you get better distribution over the cluster, but it seems
like data location would still be important.  HBase folks seem to think
that you can approximately double your throughput if you let the region
server read the file directly (dfs.client.read.shortcircuit=true) as opposed
to going through the data node
(http://files.meetup.com/1350427/hug_ebay_jdcryans.pdf).  Perhaps this is
due more to HDFS overhead?

I do get that one really nice thing about Accumulo's architecture is that
it costs almost nothing to reassign a tablet to a different tserver, which
is a huge problem for other systems.

Re: ISAM file location vs. read performance

Posted by Eric Newton <er...@gmail.com>.

You may find org.apache.accumulo.server.util.LocalityCheck useful.

-Eric




Re: ISAM file location vs. read performance

Posted by Arshak Navruzyan <ar...@gmail.com>.

I did some manual testing on this to see where HDFS is placing blocks in
relation to the location of the tablets.  I used the following command to
determine where HDFS places the replicas of the various blocks of the RFiles.

hadoop fsck /accumulo/tables/a -locations -blocks -files

From my limited testing, it appears that John's observation that a "tserver
will ultimately end up major compacting its files, ensuring locality" is
indeed true.  In all cases, the node that was responsible for the tablet
held a copy of all the blocks of that RFile.

More extensive testing in bigger environments would probably still be
helpful before we write this into the documentation.  I'm also not sure what
happens during tserver failures/reassignments.

One thing that would make testing much easier is if "getsplits -v" reported
the HDFS location of the tablet.  Right now you have to trawl through
!METADATA to figure it out.
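
The manual check described above can be roughly automated. The following is
a minimal sketch, not a tested tool: it assumes fsck prints one summary line
per file followed by one line per block with a bracketed list of
host:port replica locations. That layout differs between Hadoop versions,
so the regular expressions, the sample text, the tablet directory names and
the IP addresses are all illustrative assumptions. The host serving each
tablet still has to be looked up separately.

```python
import re
from collections import defaultdict

# ASSUMED layout of `hadoop fsck <path> -files -blocks -locations` output.
# Real output varies by Hadoop version; adjust the patterns to match yours.
SAMPLE = """\
/accumulo/tables/a/t-0001/F0001.rf 201326592 bytes, 2 block(s):  OK
0. blk_1001_1 len=134217728 repl=3 [10.0.0.1:50010, 10.0.0.2:50010, 10.0.0.3:50010]
1. blk_1002_1 len=67108864 repl=3 [10.0.0.1:50010, 10.0.0.4:50010, 10.0.0.2:50010]

/accumulo/tables/a/t-0002/F0002.rf 1048576 bytes, 1 block(s):  OK
0. blk_1003_1 len=1048576 repl=3 [10.0.0.2:50010, 10.0.0.3:50010, 10.0.0.4:50010]
"""

FILE_LINE = re.compile(r"^(/\S+) \d+ bytes, \d+ block\(s\)")
BLOCK_LINE = re.compile(r"^\d+\. \S+ len=\d+ repl=\d+ \[(.*)\]")
HOST = re.compile(r"(\d+\.\d+\.\d+\.\d+):\d+")


def block_hosts(fsck_output):
    """Map each file path to a list of host sets, one set per block."""
    files = defaultdict(list)
    current = None
    for line in fsck_output.splitlines():
        m = FILE_LINE.match(line)
        if m:
            current = m.group(1)
            continue
        m = BLOCK_LINE.match(line)
        if m and current is not None:
            files[current].append(set(HOST.findall(m.group(1))))
    return dict(files)


def fully_local(fsck_output, host):
    """For each file, report whether `host` holds a replica of every block."""
    return {path: all(host in hosts for hosts in blocks)
            for path, blocks in block_hosts(fsck_output).items()}


if __name__ == "__main__":
    for path, local in sorted(fully_local(SAMPLE, "10.0.0.1").items()):
        print(path, "fully local" if local else "NOT fully local")
```

In practice you would feed the real fsck output into fully_local() once per
tablet, passing the address of the tserver that currently hosts it.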



Re: ISAM file location vs. read performance

Posted by Sean Busbey <bu...@cloudera.com>.

On Mon, Jan 13, 2014 at 12:25 PM, Arshak Navruzyan <ar...@gmail.com> wrote:

> Thanks for all the explanations.  Perhaps this is something we should
> clearly spell out in the documentation once all the facts are in.  I'll
> keep a task open for now. (
> https://issues.apache.org/jira/browse/ACCUMULO-2185)
>
>
>
Perfect. Thanks Arshak!

If you're willing to take on some profiling to get answers for our
documentation, I'd be happy to help you along the way. Just send me a ping
off-list.

-Sean

Re: ISAM file location vs. read performance

Posted by Arshak Navruzyan <ar...@gmail.com>.

Thanks for all the explanations.  Perhaps this is something we should
clearly spell out in the documentation once all the facts are in.  I'll
keep a task open for now
(https://issues.apache.org/jira/browse/ACCUMULO-2185).



Re: ISAM file location vs. read performance

Posted by Donald Miner <dm...@clearedgeit.com>.

HDFS-385 ( https://issues.apache.org/jira/plugins/servlet/mobile#issue/HDFS-385 ) is for custom pluggable block placement policies, and there has been some talk (I think) about using it to improve mean time to recovery and data locality in HBase.

Basically this would allow Accumulo to have a policy for its blocks and control its own destiny, instead of things like the rebalancer screwing things up.

I honestly don't know much else about this. I just thought it might be relevant to the conversation.


Re: ISAM file location vs. read performance

Posted by Josh Elser <jo...@gmail.com>.

On 1/12/14, 6:17 PM, Sean Busbey wrote:
> On Sun, Jan 12, 2014 at 4:42 PM, William Slacum
> <wilhelm.von.cloud@accumulo.net> wrote:
>
>     Some data on short circuit reads would be great to have.
>
>
> What kind of data are you looking for? Just HDFS read rates? or
> specifically Accumulo when set up to make use of it?

I believe what Bill means, and what I'm also curious about, is
specifically the impact on performance for Accumulo's workload: a merged
read over multiple files. An easy test might be to create multiple
RFiles (1 to 10 files?) which contain interspersed data. Test some sort
of random-read and random-seek+sequential-read workloads, from 1 to 10
RFiles, and with short-circuit reads on and off.

Perhaps a slightly more accurate test would be to raise the compaction
ratio on a table, bulk import the RFiles into it, and then just use the
regular client API.
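
To make the "merged read over multiple files" workload concrete, here is a
toy model in plain Python. It is only a sketch of the access pattern, not
Accumulo's iterator stack and not a benchmark: sorted in-memory lists stand
in for RFiles, and the function names and row-key format are made up for
illustration. The point it shows is that a single seek followed by a
sequential scan has to position into every file and then interleave them,
so every file's locality matters to the read.

```python
import heapq
from bisect import bisect_left
from itertools import islice


def make_runs(num_keys, num_files):
    """Deal sorted keys round-robin into sorted runs (interspersed data)."""
    keys = ["row%06d" % i for i in range(num_keys)]
    return [keys[i::num_files] for i in range(num_files)]


def seek_and_scan(runs, start_key, count):
    """Seek each run to start_key, then read `count` keys in merged order."""
    # One seek per run: every "file" is touched even for a short scan.
    tails = [run[bisect_left(run, start_key):] for run in runs]
    # k-way merge of the already-sorted tails.
    return list(islice(heapq.merge(*tails), count))


if __name__ == "__main__":
    for num_files in range(1, 11):
        runs = make_runs(1000, num_files)
        print(num_files, seek_and_scan(runs, "row000500", 3))
```

A real test along the lines above would replace the lists with RFiles and
time the scans with short-circuit reads enabled and disabled.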

>     I'm unsure of how correct the "compaction leading to eventual
>     locality" postulation is. It seems, to me at least, that in the case
>     of a multi-block file, the file system would eventually try to
>     distribute those blocks rather than leave them all on a single host.
>
>
>
>
> I know in HBase set ups, it's common to either disable the HDFS Balancer
> or just disable for a namespace containing the part of the filesystem
> that handles HBase. Otherwise, when the blocks are moved off to other
> hosts you get performance degradation until compaction can happen again.
> I would expect the same thing ought to be done for Accumulo.

AFAIK, HBase also does a lot more to assign regions according to where
the blocks that serve them live, no? To my knowledge, Accumulo
doesn't do anything like this. I don't want users to think that
disabling the HDFS balancer is a good idea for Accumulo unless we have
actual evidence.

Re: ISAM file location vs. read performance

Posted by Sean Busbey <bu...@cloudera.com>.

On Sun, Jan 12, 2014 at 4:42 PM, William Slacum <
wilhelm.von.cloud@accumulo.net> wrote:

> Some data on short circuit reads would be great to have.
>
>
What kind of data are you looking for? Just HDFS read rates? or
specifically Accumulo when set up to make use of it?

> I'm unsure of how correct the "compaction leading to eventual locality"
> postulation is. It seems, to me at least, that in the case of a multi-block
> file, the file system would eventually try to distribute those blocks
> rather than leave them all on a single host.
>
>
>

I know in HBase setups, it's common to either disable the HDFS Balancer or
just disable it for a namespace containing the part of the filesystem that
handles HBase. Otherwise, when the blocks are moved off to other hosts you
get performance degradation until compaction can happen again. I would
expect the same thing ought to be done for Accumulo.

Re: ISAM file location vs. read performance

Posted by William Slacum <wi...@accumulo.net>.

Some data on short circuit reads would be great to have.

I'm unsure of how correct the "compaction leading to eventual locality"
postulation is. It seems, to me at least, that in the case of a multi-block
file, the file system would eventually try to distribute those blocks
rather than leave them all on a single host.

One quick correction: "not splittable" means that the file can't be
processed (i.e., MapReduce'd over) in chunks, not that the file won't be
split into blocks.
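
As a rough illustration of that distinction, the number of HDFS blocks a
file occupies depends only on its size and the block size, whether or not
its format is splittable. The 128 MiB default below is an assumption
(block size is configurable per cluster and per file), and the helper name
is made up for this sketch.

```python
def num_blocks(file_bytes, block_bytes=128 * 1024 * 1024):
    """Blocks needed for a file of file_bytes; the last block may be short."""
    # Integer ceiling division; an empty file occupies no blocks.
    return (file_bytes + block_bytes - 1) // block_bytes


if __name__ == "__main__":
    one_gib = 1024 ** 3
    print(num_blocks(one_gib))  # a 1 GiB RFile spans 8 blocks of 128 MiB
```

So a large compacted RFile is still stored as several blocks, each of which
HDFS can place and move independently.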




Re: ISAM file location vs. read performance

Posted by Arshak Navruzyan <ar...@gmail.com>.

John,

Thanks for the explanation.  I had to look up the HDFS block distribution
documentation and it now makes complete sense.

"the 1st replica is placed on the local machine"

So since the compacted RFile is not splittable by HDFS, this ensures that
the whole thing will be available where the Accumulo tablet is running.

Maybe I can test out the shortcircuit reads and report back.

Thanks,

Arshak



Re: ISAM file location vs. read performance

Posted by John Vines <vi...@apache.org>.

I'm not certain about our performance with short-circuit reads, aside from
them being better.

But because of the way HDFS writes get distributed, a tablet server has a
strong probability of reading locally. This is because a tserver will
ultimately end up major compacting its files, ensuring locality. So simply
ingesting constantly will lead to eventual locality if it wasn't there
before. It just so happens that those reads go through a datanode, but not
necessarily over the network.
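
The reasoning above can be sketched as a toy simulation. This is a
simplified model and not HDFS code: it assumes every tserver runs alongside
a datanode, that the first replica of each block lands on the writing node,
and that the remaining replicas go to other nodes chosen at random (rack
awareness and the balancer are ignored). All names here are hypothetical.

```python
import random


def write_file(writer, nodes, num_blocks, replication=3, rng=random):
    """Place each block: first replica on the writer, the rest elsewhere."""
    others = [n for n in nodes if n != writer]
    return [{writer, *rng.sample(others, replication - 1)}
            for _ in range(num_blocks)]


def locality(blocks, host):
    """Fraction of a file's blocks that have a replica on `host`."""
    return sum(host in replicas for replicas in blocks) / len(blocks)


if __name__ == "__main__":
    rng = random.Random(0)
    nodes = ["node%d" % i for i in range(10)]

    # The tablet is hosted on node0, which wrote (compacted) the file.
    rfile = write_file("node0", nodes, num_blocks=8, rng=rng)
    print("on original host:", locality(rfile, "node0"))

    # The tablet is reassigned to node5: only chance replicas are local.
    print("after reassignment:", locality(rfile, "node5"))

    # node5 major compacts, rewriting the file from its own node.
    rfile = write_file("node5", nodes, num_blocks=8, rng=rng)
    print("after major compaction:", locality(rfile, "node5"))
```

Under these assumptions locality is complete right after a compaction and
drops after a reassignment until the new host compacts again.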

Sent from my phone, please pardon the typos and brevity.