Posted to user@accumulo.apache.org by "mohit.kaushik" <mo...@orkash.com> on 2015/07/01 09:16:36 UTC

Accumulo indexing social media data

Hi,

We have a social media application that currently uses MongoDB to serve
documents, and we have decided to move it to Accumulo. I am designing the
schema and indexing approach, but I am having some difficulty managing
indexes and have a few concerns about generating UUIDs in Accumulo.

UUID: Data is indexed into MongoDB 24 hours a day. MongoDB generates
a 12-byte UUID that sorts on the current time and works well in a
multi-user, multi-process environment (<time> <MAC address> <process id>
<client counter>), which is perfect. But if I concatenate the time, MAC
address, process id, and client counter, the result is around 28 to 30
characters, which means around 60 bytes. And if I store it in reverse
order so that the latest document shows on top, the size would be doubled
(more than 120 bytes), as described by David Medinets. Is there any way
to store this UUID in fewer bytes, or any other efficient way to generate
a UUID that is reverse sorted on the current time?
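
One way to keep such an identifier at 12 bytes while making newer documents sort first is to store the time as a fixed-width binary value subtracted from its maximum. The sketch below is only an illustration, not MongoDB's or Accumulo's API: the layout (8-byte reversed timestamp plus 4-byte counter) and the name `reverse_time_id` are assumptions made for this example, and a multi-machine setup would also need machine and process bytes as MongoDB uses.

```python
import itertools
import struct
import time

MAX_MILLIS = 2**63 - 1        # largest signed 64-bit millisecond timestamp
_counter = itertools.count()  # per-process counter (hypothetical tie-breaker)


def reverse_time_id(millis=None):
    """Build a 12-byte id whose raw byte order sorts newest-first.

    Assumed layout: 8 bytes of (MAX_MILLIS - millis), big-endian,
    followed by a 4-byte big-endian counter.
    """
    if millis is None:
        millis = int(time.time() * 1000)
    reversed_time = MAX_MILLIS - millis
    counter = next(_counter) & 0xFFFFFFFF
    # ">" means big-endian with no padding, so the result is exactly 12 bytes.
    return struct.pack(">QI", reversed_time, counter)


older = reverse_time_id(1_435_734_996_000)
newer = reverse_time_id(1_435_734_997_000)
```

Because the bytes are compared lexicographically, `newer` sorts before `older` here. Ids created within the same millisecond fall back to counter order.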

Indexing: I need to retrieve documents from an index based on queries
on fields. I found two approaches to indexing documents in Accumulo:
(1) Term-based reverse indexing, and
(2) Document-partitioned indexing

As Adam describes in this video
(https://www.youtube.com/watch?v=Ck70G6OuGT4), if I use
document-partitioned indexing, the schema is:

                Document entry      Index entry
Row             <partition id>      <partition id>
CF              <doc>               <index>
CQ              <UUID> <field>      <term> <UUID> <field>
Value           <value>

If I only want to serve documents based on single-term queries, would it
be better to store <term> in the column family, so that I can restrict a
scan to a single term via the CF? That would reduce the data scanned by a
good factor. What are the other pros and cons of this approach?
And how should I decide on the partition id if I am storing tweets on a
3-node cluster?

Regards
Mohit Kaushik

Re: Accumulo indexing social media data

Posted by Christopher <ct...@apache.org>.

Sorry, I'm not that familiar with the D4M schema.

Regarding partitioning, I agree with Josh's response.


--
Christopher L Tubbs II
http://gravatar.com/ctubbsii

On Thu, Jul 2, 2015 at 8:34 AM, mohit.kaushik <mo...@orkash.com>
wrote:

>  Christopher,
>
> What I understood from Medinets' explanation of reverse sorting is that
> he first subtracts every character from 255 to produce the reverse-sorted
> form, and then appends the original UUID to that string. When I checked
> the D4M schema, it prints ??????? in front of the UUID, which I assume
> are the characters subtracted from 255, if I have understood correctly.
>
> The 60 or 120 bytes are not a problem in themselves. I just don't want
> to waste space for no benefit when it can be done in 12 or 13 bytes.
> And thanks, I looked at the Mongo driver code. I suspect that its
> encoding may not suit lexicographical sorting.
>
> Can you please provide some input on choosing the partition id?
>
> -Mohit Kaushik

Re: Accumulo indexing social media data

Posted by "mohit.kaushik" <mo...@orkash.com>.

Christopher,

What I understood from Medinets' explanation of reverse sorting is that
he first subtracts every character from 255 to produce the reverse-sorted
form, and then appends the original UUID to that string. When I checked
the D4M schema, it prints ??????? in front of the UUID, which I assume
are the characters subtracted from 255, if I have understood correctly.

The 60 or 120 bytes are not a problem in themselves. I just don't want
to waste space for no benefit when it can be done in 12 or 13 bytes.
And thanks, I looked at the Mongo driver code. I suspect that its
encoding may not suit lexicographical sorting.

Can you please provide some input on choosing the partition id?

-Mohit Kaushik

On 07/01/2015 09:55 PM, Christopher wrote:
> I'm not sure I understand why the size would be doubled.... if you
> store it in reverse order, it's not going to take up more bytes. Are
> you storing it forward *and* reverse? If so, why?
>
> Also, forgive me for asking, but 60 bytes doesn't seem problematic to
> me... that's going to be compressed on disk anyway. Why is 60 bytes
> too large for your use case?
>
> Also, if the MongoDB 12 byte UUID was sufficient, why aren't you using
> a UUID that is in that same format?
>
> Regarding serving documents based on a single term query... it seems
> to me that if that is your only requirement, then a row which looks
> like "<term> <UUID>" would be more appropriate, since the best way to
> support single-term query is to index on that term (UUID added only to
> enable rows to split).
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>


-- 

*Mohit Kaushik*
Software Engineer
A Square,Plot No. 278, Udyog Vihar, Phase 2, Gurgaon 122016, India
*Tel:*+91 (124) 4969352 | *Fax:*+91 (124) 4033553

<http://politicomapper.orkash.com>interactive social intelligence at work...

<https://www.facebook.com/Orkash2012> 
<http://www.linkedin.com/company/orkash-services-private-limited> 
<https://twitter.com/Orkash> <http://www.orkash.com/blog/> 
<http://www.orkash.com>
<http://www.orkash.com> ... ensuring Assurance in complexity and uncertainty

/This message including the attachments, if any, is a confidential 
business communication. If you are not the intended recipient it may be 
unlawful for you to read, copy, distribute, disclose or otherwise use 
the information in this e-mail. If you have received it in error or are 
not the intended recipient, please destroy it and notify the sender 
immediately. Thank you /

Re: Accumulo indexing social media data

Posted by Christopher <ct...@apache.org>.

I'm not sure I understand why the size would be doubled.... if you
store it in reverse order, it's not going to take up more bytes. Are
you storing it forward *and* reverse? If so, why?
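
To illustrate the point about size, here is a minimal sketch of the "subtract each byte from 255" idea on its own. It assumes fixed-length keys (for keys of different lengths, a key that is a prefix of another keeps its relative order, so the reversal is only exact at equal length). The helper name `complement` and the sample ids are made up for this example.

```python
def complement(key: bytes) -> bytes:
    """Return key with every byte b replaced by 255 - b (same length)."""
    return bytes(255 - b for b in key)


# Three fixed-length 12-byte ids in ascending order (made-up sample values).
ids = [bytes([0] * 11 + [n]) for n in (1, 2, 3)]
reversed_ids = [complement(i) for i in ids]
```

Sorting `reversed_ids` yields the ids in the opposite order, each still 12 bytes long, and applying `complement` again recovers the original id, so the forward form does not need to be stored alongside it.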

Also, forgive me for asking, but 60 bytes doesn't seem problematic to
me... that's going to be compressed on disk anyway. Why is 60 bytes
too large for your use case?

Also, if the MongoDB 12 byte UUID was sufficient, why aren't you using
a UUID that is in that same format?

Regarding serving documents based on a single term query... it seems
to me that if that is your only requirement, then a row which looks
like "<term> <UUID>" would be more appropriate, since the best way to
support single-term query is to index on that term (UUID added only to
enable rows to split).
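
A rough sketch of what a "<term> <UUID>" row layout gives you, using a sorted in-memory list as a stand-in for the table (this is not Accumulo client code). The NUL separator, the whitespace tokenizer, and the names `index_row`, `build_index`, and `lookup` are assumptions for illustration only.

```python
import bisect

SEP = "\x00"  # assumed separator between term and UUID in the row


def index_row(term: str, uuid: str) -> str:
    return f"{term}{SEP}{uuid}"


def build_index(docs: dict) -> list:
    """docs maps uuid -> text; returns the sorted list of index rows."""
    rows = {index_row(term, uuid)
            for uuid, text in docs.items()
            for term in text.lower().split()}
    return sorted(rows)


def lookup(index: list, term: str) -> list:
    """Range scan over rows that start with term + SEP; returns the UUIDs."""
    start = term + SEP
    end = term + chr(ord(SEP) + 1)  # sorts after every row for this term
    lo = bisect.bisect_left(index, start)
    hi = bisect.bisect_left(index, end)
    return [row[len(start):] for row in index[lo:hi]]


docs = {
    "u1": "accumulo index design",
    "u2": "mongodb index",
    "u3": "accumulo tablets",
}
index = build_index(docs)
```

A single-term query becomes one contiguous range over the sorted rows, for example `lookup(index, "accumulo")`.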

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii



Re: Accumulo indexing social media data

Posted by "mohit.kaushik" <mo...@orkash.com>.

Thanks Josh, I am testing the approach. I have one more consideration,
which is conditional mutations. I have stored the fields in the CQ
according to the following schema.

                Document entry      Index entry
Row             <partition id>      <partition id>
CF              <doc>               <index>
CQ              <UUID> <field>      <term> <UUID> <field>
Value           <value>


Documents have a url field. If the url already exists, I want the
mutation to be skipped rather than applied. But since I don't know the
partition id, how can I apply conditional mutations here to check for the
existence of the url?
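
To make the problem concrete: a conditional mutation checks columns in the one row it is written to, so the check needs a row that can be computed from the url alone. One possible direction, which is an assumption of this sketch and not something confirmed in this thread, is to derive the partition id from a hash of the url. The code simulates this with an in-memory dictionary; `FakeTable`, `put_if_absent`, the "url" column family, and `NUM_PARTITIONS` are hypothetical stand-ins, not Accumulo API.

```python
import zlib

NUM_PARTITIONS = 4  # assumed value for illustration


def partition_for_url(url: str) -> str:
    """Derive the partition id from the url so it can be recomputed later."""
    return f"{zlib.crc32(url.encode('utf-8')) % NUM_PARTITIONS:02d}"


class FakeTable:
    """In-memory stand-in for a table: {row: {(cf, cq): value}}."""

    def __init__(self):
        self.rows = {}

    def put_if_absent(self, row, cf, cq, value) -> bool:
        """Write only if (cf, cq) is absent in this row; report success."""
        columns = self.rows.setdefault(row, {})
        if (cf, cq) in columns:
            return False  # condition failed: the url is already recorded
        columns[(cf, cq)] = value
        return True


def add_document(table: FakeTable, url: str, uuid: str) -> bool:
    row = partition_for_url(url)
    return table.put_if_absent(row, "url", url, uuid)


table = FakeTable()
first = add_document(table, "http://example.com/a", "u1")
second = add_document(table, "http://example.com/a", "u2")
```

The second call is rejected because the same url maps to the same row. This simulation is single-threaded; a real check-and-write has to be performed atomically by the server.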

-Mohit Kaushik

On 07/06/2015 02:04 AM, Josh Elser wrote:
> If your primary search criterion is a single term, a term-based
> reverse index is going to serve you much better than a
> document-partitioned index.
>
> Document partitioned indexes can better support concurrency since you 
> have some amount of hash-partitioning involved in the partition ID 
> (sometimes you can include other data in the partition ID to further 
> restrict the "search space"). However, you always need to query each 
> partition to get an answer for a single term. You'll have much higher 
> latency using this approach than a term-partitioned index.
>
> To answer your question about choosing a partition ID, it typically
> revolves around the number of TabletServers you want a single query to
> parallelize across. For example, if you expect to have ~10 queries
> running at one time, you don't want each query to communicate with 90%
> of your TabletServers. If you only run one or two queries at a time,
> you would want to talk to as many TabletServers as you can.
>
> To further complicate things, you can also try to apply a partition ID 
> as a suffix on term-based indexes to work around queries such as "the" 
> or "and" which are prone to be extremely common terms. With a simple 
> term-based index, all records for this term would be contained in a 
> single Tablet on a single TabletServer. This ultimately comes down to 
> the amount and distribution of data you're storing.
>
> Come back with more information, and we can give some more
> recommendations. Honestly, you probably won't get this right the first
> time, but this is expected :). What you can do is:
>
> * Set some expectations on performance
> * Do some simple math on actual data (estimate parallelism, latency, etc)
> * Prototype and test it



Re: Accumulo indexing social media data

Posted by Josh Elser <jo...@gmail.com>.

If your primary search criterion is a single term, a term-based
reverse index is going to serve you much better than a
document-partitioned index.

Document partitioned indexes can better support concurrency since you 
have some amount of hash-partitioning involved in the partition ID 
(sometimes you can include other data in the partition ID to further 
restrict the "search space"). However, you always need to query each 
partition to get an answer for a single term. You'll have much higher 
latency using this approach than a term-partitioned index.
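
The fan-out described above can be sketched with a toy in-memory model (not Accumulo client code). The hash-based `partition_id`, the value of `NUM_PARTITIONS`, and the whitespace tokenizer are assumptions made for this illustration.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed value; in practice tuned to the cluster


def partition_id(uuid: str) -> int:
    """Stable hash of the document id, reduced to a partition number."""
    return zlib.crc32(uuid.encode("utf-8")) % NUM_PARTITIONS


def build_partitions(docs: dict) -> dict:
    """Return {partition id: {term: set of uuids}}, one small index each."""
    partitions = defaultdict(lambda: defaultdict(set))
    for uuid, text in docs.items():
        for term in text.lower().split():
            partitions[partition_id(uuid)][term].add(uuid)
    return partitions


def query(partitions: dict, term: str):
    """Look the term up in every partition; return (uuids, partitions visited)."""
    hits, visited = set(), 0
    for pid in range(NUM_PARTITIONS):
        visited += 1
        hits |= partitions.get(pid, {}).get(term, set())
    return hits, visited


docs = {
    "u1": "accumulo index design",
    "u2": "mongodb index",
    "u3": "accumulo tablets",
}
partitions = build_partitions(docs)
hits, visited = query(partitions, "index")
```

Even for a single term, `visited` equals `NUM_PARTITIONS`: a document containing the term could live in any partition, so none can be skipped.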

To answer your question about choosing a partition ID, it typically
revolves around the number of TabletServers you want a single query to
parallelize across. For example, if you expect to have ~10 queries
running at one time, you don't want each query to communicate with 90%
of your TabletServers. If you only run one or two queries at a time, you
would want to talk to as many TabletServers as you can.

To further complicate things, you can also try to apply a partition ID 
as a suffix on term-based indexes to work around queries such as "the" 
or "and" which are prone to be extremely common terms. With a simple 
term-based index, all records for this term would be contained in a 
single Tablet on a single TabletServer. This ultimately comes down to 
the amount and distribution of data you're storing.
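
As a rough illustration of the suffix idea, the sketch below appends a shard number to the term so that a very common term is spread over several rows instead of one. The row format, the value of `NUM_SHARDS`, and the function names are assumptions for this example, not an established Accumulo convention.

```python
import zlib

NUM_SHARDS = 8  # assumed; more shards spread a hot term over more tablets


def sharded_row(term: str, uuid: str) -> str:
    """Row made of the term, a NUL separator, and a shard derived from the uuid."""
    shard = zlib.crc32(uuid.encode("utf-8")) % NUM_SHARDS
    return f"{term}\x00{shard:02d}"


def rows_for_query(term: str) -> list:
    """All rows a single-term query has to read: one per shard."""
    return [f"{term}\x00{shard:02d}" for shard in range(NUM_SHARDS)]


uuids = [f"doc-{n}" for n in range(1000)]
rows_used = {sharded_row("the", u) for u in uuids}
```

The trade-off is visible in `rows_for_query`: writes for "the" are spread across up to `NUM_SHARDS` rows, but every query for that term now has to read all of them.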

Come back with more information, and we can give some more
recommendations. Honestly, you probably won't get this right the first
time, but this is expected :). What you can do is:

* Set some expectations on performance
* Do some simple math on actual data (estimate parallelism, latency, etc)
* Prototype and test it
