Posted to user@accumulo.apache.org by Mohammad Kargar <mk...@phemi.com> on 2017/10/19 21:50:50 UTC

Accumulo as a Column Storage

AFAIK, in Accumulo we can use "locality groups" to group sets of columns
together on disk, which would make it more like a column-oriented database.
Since locality groups are defined per column family, I was wondering: what
if we treat column families like column qualifiers (creating one column
family per qualifier) and assign each family to a different locality group?
This way all the data in a given column would sit next to each other on
disk, which should make it easier for analytical applications to query the
data.
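
To make the idea concrete, here is a minimal sketch of the
one-family-per-column layout. The table name "events", the column names, and
the "lg_" prefix are made up for illustration. The only Accumulo piece it
assumes is the shell's setgroups command, which takes group=family[,family]
pairs and a -t table option.

```python
def one_group_per_column(columns, prefix="lg_"):
    """Map each logical column to its own column family and locality group."""
    # Assumption: the family name equals the column name, and the group
    # name is the family name with a hypothetical prefix.
    return {f"{prefix}{col}": [col] for col in columns}


def setgroups_command(table, groups):
    """Render the mapping as an Accumulo shell `setgroups` invocation."""
    pairs = " ".join(f"{name}={','.join(fams)}" for name, fams in groups.items())
    return f"setgroups -t {table} {pairs}"


groups = one_group_per_column(["age", "city", "score"])
print(setgroups_command("events", groups))
# prints: setgroups -t events lg_age=age lg_city=city lg_score=score
```

Data that is already on disk would only be regrouped once it is rewritten by
a compaction.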

Any thoughts?

Thanks,
Mohammad

Re: Accumulo as a Column Storage

Posted by Keith Turner <ke...@deenlo.com>.
On Thu, Oct 19, 2017 at 9:05 PM, Christopher <ct...@apache.org> wrote:
> There's no expected scaling issue with having each column qualifier in its
> own unique column family, regardless of how large the number of these
> becomes. I've ingested random data like this before for testing, and it
> works fine.
>
> However, there may be an issue trying to create a very large number of
> locality groups. Locality groups are named, and you must explicitly
> configure them to store particular column families. That configuration is
> typically stored in ZooKeeper, and the configuration storage (in ZooKeeper,
> and/or in your conf/accumulo-site.xml file) does not scale as well as the
> data storage (HDFS) does. Where, and how, it will break, is probably
> system-dependent and not directly known (at least, not known by me). I would
> expect dozens, and possibly hundreds, of locality groups to work okay, but
> thousands seems like it's too many (but I haven't tried).

Seeking to a random location is O(F*L), where F is the number of files
and L is the number of locality groups used.  So if a tablet had 10
files and 10 locality groups were being used, then a seek on the
tablet would result in 100 seeks at the lowest levels.
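
As a quick worked version of that arithmetic, here is a small sketch. It
reads "locality groups used" as the groups a scan actually has to open, which
is an interpretation and not something confirmed here; the function name is
made up.

```python
def low_level_seeks(files, groups_read):
    """Low-level seeks for one tablet seek: one per (file, group) pair."""
    # Assumption: each file is seeked once per locality group the scan opens.
    return files * groups_read


print(low_level_seeks(10, 10))  # 100, the example above
print(low_level_seeks(10, 1))   # 10, if only one group has to be opened
```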

After the initial seek, scanning over locality groups uses a heap of
heaps.  An outer heap selects the min key from all files.  Within each
file there is a heap that selects the min key from each loc group.
So scanning is either O(log2(F) * log2(L)) or O(log2(F) + log2(L)); I'm
not sure which.

Scanning over lots of locality groups is probably pretty efficient, but
doing lots of random seeks over lots of loc groups may not be.
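
The heap of heaps described above can be sketched as a toy model. This is
not Accumulo's iterator code: FileReader and scan are hypothetical names, and
keys are plain strings. In this sketch each returned key costs one inner heap
update (O(log2(L))) plus one outer heap update (O(log2(F))), so the per-key
cost is additive here; whether the real implementation behaves the same way
is not confirmed.

```python
import heapq


class FileReader:
    """Toy stand-in for one file: an inner heap over its locality groups."""

    def __init__(self, groups):
        # groups: one sorted list of keys per locality group.
        self._iters = [iter(g) for g in groups]
        self._heap = []
        for idx, it in enumerate(self._iters):
            first = next(it, None)
            if first is not None:
                self._heap.append((first, idx))
        heapq.heapify(self._heap)

    def peek(self):
        """Smallest key still unread in this file, or None if exhausted."""
        return self._heap[0][0] if self._heap else None

    def pop(self):
        """Return the smallest key and advance the group it came from."""
        key, idx = self._heap[0]
        nxt = next(self._iters[idx], None)
        if nxt is None:
            heapq.heappop(self._heap)                  # group exhausted
        else:
            heapq.heapreplace(self._heap, (nxt, idx))  # O(log L)
        return key


def scan(files):
    """Outer heap picks the file whose current minimum key is smallest."""
    outer = [(f.peek(), i) for i, f in enumerate(files) if f.peek() is not None]
    heapq.heapify(outer)
    while outer:
        _, i = outer[0]
        yield files[i].pop()
        nxt = files[i].peek()
        if nxt is None:
            heapq.heappop(outer)                       # file exhausted
        else:
            heapq.heapreplace(outer, (nxt, i))         # O(log F)


file_a = FileReader([["r1:age", "r3:age"], ["r1:city", "r2:city"]])
file_b = FileReader([["r2:age"], ["r3:city"]])
merged = list(scan([file_a, file_b]))
print(merged)
```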

>
>
> On Thu, Oct 19, 2017 at 6:47 PM Mohammad Kargar <mk...@phemi.com> wrote:
>>
>> That makes sense. So this means that there's no limit or concerns on
>> having, potentially, a large number of column families (holding only one
>> column qualifier), right?
>>
>> On Thu, Oct 19, 2017 at 3:06 PM, Josh Elser <el...@apache.org> wrote:
>>>
>>> Yup, that's the intended use case. You have the flexibility to determine
>>> what column families make sense to group together. Your only "cost" in
>>> changing your mind is the speed at which you can re-compact your data.
>>>
>>> There is one concern which comes to mind. Though making many locality
>>> groups does increase the speed at which you can read from specific columns,
>>> it decreases the speed at which you can read from _all_ columns. So, you can
>>> do this trick to make Accumulo act more like a columnar database, but beware
>>> that you're going to have an impact if you still have a use-case where you
>>> read more than just one or two columns at a time.
>>>
>>> Does that make sense?
>>>
>>>
>>> On 10/19/17 5:50 PM, Mohammad Kargar wrote:
>>>>
>>>> AFAIK in Accumulo we can use "locality groups" to group sets of columns
>>>> together on disk which would make it more like  a column-oriented database.
>>>> Considering that "locality groups" are per column family, I was wondering
>>>> what if we treat column families like column qualifiers (creating one column
>>>> family per each qualifier) and assigning each to a different locality group.
>>>> This way all the data in a given column will be next to each other on disk
>>>> which makes it easier for analytical applications to query the data.
>>>>
>>>> Any thoughts?
>>>>
>>>> Thanks,
>>>> Mohammad
>>>>
>>
>

Re: Accumulo as a Column Storage

Posted by Mohammad Kargar <mk...@phemi.com>.
"dozens, and possibly hundreds of locality groups" per table or per
Accumulo instance?

On Thu, Oct 19, 2017 at 6:05 PM, Christopher <ct...@apache.org> wrote:

> There's no expected scaling issue with having each column qualifier in its
> own unique column family, regardless of how large the number of these
> becomes. I've ingested random data like this before for testing, and it
> works fine.
>
> However, there may be an issue trying to create a very large number of
> locality groups. Locality groups are named, and you must explicitly
> configure them to store particular column families. That configuration is
> typically stored in ZooKeeper, and the configuration storage (in ZooKeeper,
> and/or in your conf/accumulo-site.xml file) does not scale as well as the
> data storage (HDFS) does. Where, and how, it will break, is probably
> system-dependent and not directly known (at least, not known by me). I
> would expect dozens, and possibly hundreds, of locality groups to work
> okay, but thousands seems like it's too many (but I haven't tried).
>
>
> On Thu, Oct 19, 2017 at 6:47 PM Mohammad Kargar <mk...@phemi.com> wrote:
>
>> That makes sense. So this means that there's no limit or concerns on
>> having, potentially, a large number of column families (holding only one
>> column qualifier), right?
>>
>> On Thu, Oct 19, 2017 at 3:06 PM, Josh Elser <el...@apache.org> wrote:
>>
>>> Yup, that's the intended use case. You have the flexibility to determine
>>> what column families make sense to group together. Your only "cost" in
>>> changing your mind is the speed at which you can re-compact your data.
>>>
>>> There is one concern which comes to mind. Though making many locality
>>> groups does increase the speed at which you can read from specific columns,
>>> it decreases the speed at which you can read from _all_ columns. So, you
>>> can do this trick to make Accumulo act more like a columnar database, but
>>> beware that you're going to have an impact if you still have a use-case
>>> where you read more than just one or two columns at a time.
>>>
>>> Does that make sense?
>>>
>>>
>>> On 10/19/17 5:50 PM, Mohammad Kargar wrote:
>>>
>>>> AFAIK in Accumulo we can use "locality groups" to group sets of columns
>>>> together on disk which would make it more like  a column-oriented database.
>>>> Considering that "locality groups" are per column family, I was wondering
>>>> what if we treat column families like column qualifiers (creating one
>>>> column family per each qualifier) and assigning each to a different
>>>> locality group. This way all the data in a given column will be next to
>>>> each other on disk which makes it easier for analytical applications to
>>>> query the data.
>>>>
>>>> Any thoughts?
>>>>
>>>> Thanks,
>>>> Mohammad
>>>>
>>>>
>>

Re: Accumulo as a Column Storage

Posted by Christopher <ct...@apache.org>.
There's no expected scaling issue with having each column qualifier in its
own unique column family, regardless of how large the number of these
becomes. I've ingested random data like this before for testing, and it
works fine.

However, there may be an issue trying to create a very large number of
locality groups. Locality groups are named, and you must explicitly
configure them to store particular column families. That configuration is
typically stored in ZooKeeper, and the configuration storage (in ZooKeeper,
and/or in your conf/accumulo-site.xml file) does not scale as well as the
data storage (HDFS) does. Where, and how, it will break, is probably
system-dependent and not directly known (at least, not known by me). I
would expect dozens, and possibly hundreds, of locality groups to work
okay, but thousands seems like it's too many (but I haven't tried).
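
To give a feel for how that configuration grows, here is a rough sketch. It
assumes locality groups are expressed through the per-table properties
table.groups.enabled and table.group.<name>; the column names and the "lg_"
prefix are made up, and the size figure is just a character count, not a
measurement of ZooKeeper usage.

```python
def locality_group_properties(columns, prefix="lg_"):
    """Build the table properties for one locality group per column family."""
    # One property per group, naming the single family it stores.
    props = {f"table.group.{prefix}{c}": c for c in columns}
    # Plus one property listing every enabled group name.
    props["table.groups.enabled"] = ",".join(f"{prefix}{c}" for c in columns)
    return props


for n in (10, 100, 1000):
    cols = [f"col{i:04d}" for i in range(n)]  # hypothetical column names
    props = locality_group_properties(cols)
    chars = sum(len(k) + len(v) for k, v in props.items())
    print(f"{n} groups -> {len(props)} properties, about {chars} characters")
```

The property count grows linearly with the number of groups, and the single
table.groups.enabled value grows along with it.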

On Thu, Oct 19, 2017 at 6:47 PM Mohammad Kargar <mk...@phemi.com> wrote:

> That makes sense. So this means that there's no limit or concerns on
> having, potentially, a large number of column families (holding only one
> column qualifier), right?
>
> On Thu, Oct 19, 2017 at 3:06 PM, Josh Elser <el...@apache.org> wrote:
>
>> Yup, that's the intended use case. You have the flexibility to determine
>> what column families make sense to group together. Your only "cost" in
>> changing your mind is the speed at which you can re-compact your data.
>>
>> There is one concern which comes to mind. Though making many locality
>> groups does increase the speed at which you can read from specific columns,
>> it decreases the speed at which you can read from _all_ columns. So, you
>> can do this trick to make Accumulo act more like a columnar database, but
>> beware that you're going to have an impact if you still have a use-case
>> where you read more than just one or two columns at a time.
>>
>> Does that make sense?
>>
>>
>> On 10/19/17 5:50 PM, Mohammad Kargar wrote:
>>
>>> AFAIK in Accumulo we can use "locality groups" to group sets of columns
>>> together on disk which would make it more like  a column-oriented database.
>>> Considering that "locality groups" are per column family, I was wondering
>>> what if we treat column families like column qualifiers (creating one
>>> column family per each qualifier) and assigning each to a different
>>> locality group. This way all the data in a given column will be next to
>>> each other on disk which makes it easier for analytical applications to
>>> query the data.
>>>
>>> Any thoughts?
>>>
>>> Thanks,
>>> Mohammad
>>>
>>>
>

Re: Accumulo as a Column Storage

Posted by Mohammad Kargar <mk...@phemi.com>.
That makes sense. So this means that there's no limit or concern with
having a potentially large number of column families (each holding only
one column qualifier), right?

On Thu, Oct 19, 2017 at 3:06 PM, Josh Elser <el...@apache.org> wrote:

> Yup, that's the intended use case. You have the flexibility to determine
> what column families make sense to group together. Your only "cost" in
> changing your mind is the speed at which you can re-compact your data.
>
> There is one concern which comes to mind. Though making many locality
> groups does increase the speed at which you can read from specific columns,
> it decreases the speed at which you can read from _all_ columns. So, you
> can do this trick to make Accumulo act more like a columnar database, but
> beware that you're going to have an impact if you still have a use-case
> where you read more than just one or two columns at a time.
>
> Does that make sense?
>
>
> On 10/19/17 5:50 PM, Mohammad Kargar wrote:
>
>> AFAIK in Accumulo we can use "locality groups" to group sets of columns
>> together on disk which would make it more like  a column-oriented database.
>> Considering that "locality groups" are per column family, I was wondering
>> what if we treat column families like column qualifiers (creating one
>> column family per each qualifier) and assigning each to a different
>> locality group. This way all the data in a given column will be next to
>> each other on disk which makes it easier for analytical applications to
>> query the data.
>>
>> Any thoughts?
>>
>> Thanks,
>> Mohammad
>>
>>

Re: Accumulo as a Column Storage

Posted by Josh Elser <el...@apache.org>.
Yup, that's the intended use case. You have the flexibility to determine 
what column families make sense to group together. Your only "cost" in 
changing your mind is the speed at which you can re-compact your data.

There is one concern that comes to mind. Although making many locality
groups does increase the speed at which you can read from specific
columns, it decreases the speed at which you can read from _all_
columns. So you can use this trick to make Accumulo act more like a
columnar database, but beware that you will see a performance hit if you
still have a use case where you read more than just one or two columns
at a time.

Does that make sense?

On 10/19/17 5:50 PM, Mohammad Kargar wrote:
> AFAIK in Accumulo we can use "locality groups" to group sets of columns 
> together on disk which would make it more like  a column-oriented 
> database. Considering that "locality groups" are per column family, I 
> was wondering what if we treat column families like column qualifiers 
> (creating one column family per each qualifier) and assigning each to a 
> different locality group. This way all the data in a given column will 
> be next to each other on disk which makes it easier for analytical 
> applications to query the data.
> 
> Any thoughts?
> 
> Thanks,
> Mohammad
>