Posted to user@accumulo.apache.org by Yamini Joshi <yamini.1691@gmail.com> on 2016/10/20 17:22:35 UTC

Net ColumnFamily Count

Hello all

I am trying to find the number of times a set of column families appear in
a set of records (irrespective of the rowIds). Is it possible to do this on
the server side? My concern is that if the set of column families is huge,
the computation might run into memory constraints on the server side. Also, we
might need to generate new keys, with the column family name as the key and the
count as the value.

Best regards,
Yamini Joshi

Re: Net ColumnFamily Count

Posted by Yamini Joshi <yamini.1691@gmail.com>.
I will take a look at it. Thanks Josh :)

Best regards,
Yamini Joshi

Re: Net ColumnFamily Count

Posted by Josh Elser <josh.elser@gmail.com>.
You can do a partial summation in an Iterator, but managing memory 
pressure (like you originally pointed out) would require some tricks.

In general, Iterators work well for performing partial computations and
letting the client perform a final computation over the batches.

https://blogs.apache.org/accumulo/entry/thinking_about_reads_over_accumulo 
might help
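
A minimal, untested sketch of what such a partial-summation iterator could look
like (CfCountIterator is a hypothetical class, not an existing Accumulo one).
It drains its whole scope in seek(), so memory use is bounded by the number of
distinct column families; a production version would also have to handle
iterator tear-down/re-seek and keep returned keys within the seeked range, which
is part of the "tricks" mentioned above:

    import java.io.IOException;
    import java.util.Collection;
    import java.util.Iterator;
    import java.util.Map;
    import java.util.TreeMap;
    import org.apache.accumulo.core.data.ByteSequence;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.WrappingIterator;
    import org.apache.hadoop.io.Text;

    public class CfCountIterator extends WrappingIterator {
        private TreeMap<Text,Long> counts;          // one entry per distinct column family
        private Iterator<Map.Entry<Text,Long>> out;
        private Map.Entry<Text,Long> top;

        @Override
        public void seek(Range range, Collection<ByteSequence> families, boolean inclusive)
                throws IOException {
            super.seek(range, families, inclusive);
            counts = new TreeMap<>();
            // Drain the underlying data within this iterator's scope, counting families.
            while (super.hasTop()) {
                counts.merge(super.getTopKey().getColumnFamily(), 1L, Long::sum);
                super.next();
            }
            out = counts.entrySet().iterator();
            top = out.hasNext() ? out.next() : null;
        }

        @Override public boolean hasTop() { return top != null; }
        @Override public void next() { top = out.hasNext() ? out.next() : null; }
        // Emit row = column family, value = partial count. NB: these synthetic keys
        // can fall outside the seeked range, which a real implementation must address.
        @Override public Key getTopKey() {
            return new Key(top.getKey(), new Text("count"), new Text(""));
        }
        @Override public Value getTopValue() {
            return new Value(Long.toString(top.getValue()).getBytes());
        }
    }

Each tablet then returns its own partial sums, and the client performs the final
summation across batches.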

Re: Net ColumnFamily Count

Posted by Yamini Joshi <yamini.1691@gmail.com>.
I want to push all the computation to the server. I am using a test DB, but
the DB is huge in the actual dev environment. I am also not sure that writing
to a new table is a good option: it is not a one-time operation, it needs to
be computed for every query that a user fires with a set of parameters.

I am back to square one. But I guess if there is no other option, I will
try to benchmark and keep you guys in the loop :)



Best regards,
Yamini Joshi

On Thu, Oct 20, 2016 at 4:22 PM, Josh Elser <jo...@gmail.com> wrote:

> I would like to inject some hesitation here. This is getting into what I'd
> call "advance Accumulo development".
>
> I'd encourage you to benchmark the simple implementation (bring back the
> columns you want to count to the client, and perform the summation there)
> and see if that runs in an acceptable amount of time.
>
> Creating a "pivot" table (where you move the column family from your
> source table to the row of a new table) is fairly straightforward to do,
> but you will run into problems in keeping both tables in sync with each
> other. :)
>
> ivan bella wrote:
>
>> I do not have any reference code for you. However basically you want to
>> write a program that scans from one table, creates new transformed Key
>> which you write as Mutations to another table. The transfomed Key
>> object's row would be the column family of the key you pulled from the
>> scan, and the value would be a 1 encoded using one of the encoders in
>> the LongCombiner class. You would create the new table you are going to
>> write to manually in the accumulo shell and set a SummingCombiner on the
>> majc, minc, and scan with the same encoder you used. Run your program,
>> compact the new table, and then scan it.
>>
>>
>> On October 20, 2016 at 4:07 PM Yamini Joshi <ya...@gmail.com>
>>> wrote:
>>>
>>> Alright! Do you happen to have some reference code that I can refer
>>> to? I am a newbie and I am not sure if by caching, aggregating and
>>> merge sort you mean to use some Accumulo wrapper or write a simple
>>> java code.
>>>
>>> Best regards,
>>> Yamini Joshi
>>>
>>> On Thu, Oct 20, 2016 at 2:49 PM, ivan bella <ivan@ivan.bella.name
>>> <ma...@ivan.bella.name>> wrote:
>>>
>>>     __
>>>
>>>     That is essentially the same thing, but instead of doing it within
>>>     an iterator, you are letting accumulo do the work! Perfect.
>>>
>>>     On October 20, 2016 at 3:38 PM yamini.1691@gmail.com
>>>>     <ma...@gmail.com> wrote:
>>>>
>>>>     I am wondering what the complexity would be for this and also how
>>>>     does it compare to creating a new table with the required revered
>>>>     data and calculating the sum using an iterator.
>>>>
>>>>     Sent from my iPhone
>>>>
>>>>     On Oct 20, 2016, at 2:07 PM, ivan bella <ivan@ivan.bella.name
>>>>     <ma...@ivan.bella.name>> wrote:
>>>>
>>>>     You could cache results in an internal map. Once the number of
>>>>>     entries in your map gets to a certain point, you could dump them
>>>>>     to a separate file in hdfs and then start building a new map.
>>>>>     Once you have completed the underlying scan, do a merge sort and
>>>>>     aggregation of the written files to start returning the keys. I
>>>>>     did something similar to this and it seems to work well. You
>>>>>     might want to use RFiles as the underlying format which would
>>>>>     enable reuse of some accumulo code when doing the merge sort.
>>>>>     Also it would allow more efficient reseeking into the rfiles if
>>>>>     your iterator gets torn down and reconstructed provided you
>>>>>     detect this and at least avoid redoing the entire scan.
>>>>>
>>>>>     On October 20, 2016 at 1:22 PM Yamini Joshi
>>>>>>     <yamini.1691@gmail.com <ma...@gmail.com>> wrote:
>>>>>>
>>>>>>     Hello all
>>>>>>
>>>>>>     I am trying to find the number of times a set of column
>>>>>>     families appear in a set of records (irrespective of the
>>>>>>     rowIds). Is it possible to do this on the server side? My
>>>>>>     concern is that if the set of column families is huge, it might
>>>>>>     face memory constraints on the server side. Also, we might need
>>>>>>     to generate new keys with columnfamily name as the key and
>>>>>>     count as the value.
>>>>>>
>>>>>>     Best regards,
>>>>>>     Yamini Joshi
>>>>>>
>>>>>
>>>
>>>

Re: Net ColumnFamily Count

Posted by Josh Elser <josh.elser@gmail.com>.
I would like to inject some hesitation here. This is getting into what
I'd call "advanced Accumulo development".

I'd encourage you to benchmark the simple implementation (bring back the 
columns you want to count to the client, and perform the summation 
there) and see if that runs in an acceptable amount of time.
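
For comparison, a minimal sketch of that simple client-side implementation
(table name and connector setup are hypothetical):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;

    public class ClientSideCount {
        public static Map<String,Long> count(Connector conn, Iterable<Text> families)
                throws Exception {
            Scanner scanner = conn.createScanner("sourceTable", Authorizations.EMPTY);
            for (Text cf : families)
                scanner.fetchColumnFamily(cf); // only ship the families of interest
            Map<String,Long> counts = new HashMap<>();
            for (Map.Entry<Key,Value> e : scanner)
                counts.merge(e.getKey().getColumnFamily().toString(), 1L, Long::sum);
            return counts;
        }
    }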

Creating a "pivot" table (where you move the column family from your 
source table to the row of a new table) is fairly straightforward to do, 
but you will run into problems in keeping both tables in sync with each 
other. :)

ivan bella wrote:
> I do not have any reference code for you. However, basically you want to
> write a program that scans one table and creates new, transformed Keys
> which you write as Mutations to another table. The transformed Key's
> row would be the column family of the key you pulled from the
> scan, and the value would be a 1 encoded using one of the encoders in
> the LongCombiner class. You would create the new table you are going to
> write to manually in the Accumulo shell and set a SummingCombiner on the
> majc, minc, and scan scopes with the same encoder you used. Run your
> program, compact the new table, and then scan it.
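
A hedged sketch of that recipe using the Java client API instead of the shell
(table names "sourceTable" and "cfCounts" are hypothetical, and the code is
untested):

    import java.util.Map;
    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.LongCombiner;
    import org.apache.accumulo.core.iterators.user.SummingCombiner;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;

    public class CfPivot {
        public static void pivot(Connector conn) throws Exception {
            // 1. Create the pivot table with a SummingCombiner on all scopes
            //    (scan, minc, majc), using the fixed-length long encoding.
            conn.tableOperations().create("cfCounts");
            IteratorSetting is = new IteratorSetting(10, "cfsum", SummingCombiner.class);
            SummingCombiner.setEncodingType(is, LongCombiner.Type.FIXEDLEN);
            SummingCombiner.setCombineAllColumns(is, true);
            conn.tableOperations().attachIterator("cfCounts", is);

            // 2. Scan the source table and write row = column family, value = 1.
            Scanner scanner = conn.createScanner("sourceTable", Authorizations.EMPTY);
            BatchWriter writer = conn.createBatchWriter("cfCounts", new BatchWriterConfig());
            for (Map.Entry<Key,Value> e : scanner) {
                Mutation m = new Mutation(e.getKey().getColumnFamily());
                m.put(new Text("count"), new Text(""),
                      new Value(LongCombiner.FIXED_LEN_ENCODER.encode(1L)));
                writer.addMutation(m);
            }
            writer.close();

            // 3. Compact so the combiner collapses the 1s; scanning cfCounts then
            //    yields one row per column family holding its total count.
            conn.tableOperations().compact("cfCounts", null, null, true, true);
        }
    }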

Re: Net ColumnFamily Count

Posted by Yamini Joshi <yamini.1691@gmail.com>.
Alright! Do you happen to have some reference code that I can look at? I am
a newbie, and I am not sure whether by caching, aggregating, and merge sort
you mean using some Accumulo wrapper or writing plain Java code.

Best regards,
Yamini Joshi

On Thu, Oct 20, 2016 at 2:49 PM, ivan bella <ivan@ivan.bella.name> wrote:

> That is essentially the same thing, but instead of doing it within an
> iterator, you are letting Accumulo do the work!  Perfect.

Re: Net ColumnFamily Count

Posted by yamini.1691@gmail.com.
I am wondering what the complexity of this would be, and how it compares to creating a new table with the required reversed data and calculating the sum using an iterator.

Sent from my iPhone

> On Oct 20, 2016, at 2:07 PM, ivan bella <ivan@ivan.bella.name> wrote:
> 
> You could cache results in an internal map.  Once the number of entries in your map gets to a certain point, you could dump them to a separate file in HDFS and then start building a new map.  Once you have completed the underlying scan, do a merge sort and aggregation of the written files to start returning the keys.  I did something similar to this and it seems to work well.  You might want to use RFiles as the underlying format, which would enable reuse of some Accumulo code when doing the merge sort.  It would also allow more efficient reseeking into the RFiles if your iterator gets torn down and reconstructed, provided you detect this and at least avoid redoing the entire scan.
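
A rough sketch of the spill step described above, assuming the public RFile API
added in Accumulo 1.8 (the class name, flush threshold, and file paths are
hypothetical; the merge sort over spill files and the tear-down handling are
omitted):

    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;
    import org.apache.accumulo.core.client.rfile.RFile;
    import org.apache.accumulo.core.client.rfile.RFileWriter;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.Text;

    public class SpillingCfCounter {
        private static final int MAX_ENTRIES = 100_000; // flush threshold, tune to heap
        private final TreeMap<Text,Long> counts = new TreeMap<>();
        private final FileSystem fs;
        private int spillNum = 0;

        public SpillingCfCounter(FileSystem fs) { this.fs = fs; }

        public void add(Text cf) throws IOException {
            counts.merge(new Text(cf), 1L, Long::sum); // copy: Text objects get reused
            if (counts.size() >= MAX_ENTRIES)
                spill();
        }

        private void spill() throws IOException {
            String file = "/tmp/cfcount-spill-" + (spillNum++) + ".rf"; // hypothetical path
            try (RFileWriter writer = RFile.newWriter().to(file).withFileSystem(fs).build()) {
                // TreeMap iteration is sorted, as RFile requires.
                for (Map.Entry<Text,Long> e : counts.entrySet())
                    writer.append(new Key(e.getKey()),
                                  new Value(Long.toString(e.getValue()).getBytes()));
            }
            counts.clear();
        }
    }

After the scan completes, the spill files plus the in-memory remainder would be
merge-sorted and aggregated (for example, with a scanner over the RFiles) before
returning keys.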