Posted to user@accumulo.apache.org by Aji Janis <aj...@gmail.com> on 2012/12/04 23:21:01 UTC

Map Reduce on accumulo

NOTE: I am fairly sure this hasn't been asked here yet. My apologies if it
has already been asked; in that case, please forward me a link to the
answers. Thank you.

Suppose my environment is set up as follows:
- 64 MB HDFS block size
- 5 tablet servers
- 10 tablets of 1 GB each per tablet server

And suppose I have a table like the one below:
rowA | f1 | q1 | v1
rowA | f1 | q2 | v2

rowB | f1 | q1 | v3

rowC | f1 | q1 | v4
rowC | f2 | q1 | v5
rowC | f3 | q3 | v6

From the little documentation I have found, I know all data about rowA will
go to one tablet, which may or may not contain data about other rows, i.e.,
it's all or none. So my questions are:

How are tablets mapped to a DataNode or HDFS block? One tablet is split
into multiple HDFS blocks (16 in this case, since 1 GB / 64 MB = 16), so
would those blocks be stored on the same or different DataNode(s), or does
it not matter?

In the example above, would all data about rowC (or rowA or rowB) go onto
the same HDFS block or different HDFS blocks?

When executing a MapReduce job, how many mappers would I get? (One per HDFS
block? One per tablet? One per server?)

Thank you in advance for any and all suggestions.

Re: Map Reduce on accumulo

Posted by Aji Janis <aj...@gmail.com>.
Thank you!


Re: Map Reduce on accumulo

Posted by Billie Rinaldi <bi...@apache.org>.
On Thu, Dec 6, 2012 at 2:32 PM, Aji Janis <aj...@gmail.com> wrote:

> Thank you for the clarification. You mentioned that "Using the input
> format, unless you override the autosplitting in it, you will get 1 mapper
> per tablet." in your initial response. Again, pardon me for the newbie
> question, but how do I find out if autosplitting is overriden or not?
>

You override autosplitting with the command
AccumuloInputFormat.disableAutoAdjustRanges(Configuration).  So if you
haven't done that, it will fit mappers to tablets.
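
Conceptually, the behavior described here can be sketched in plain Python
(this is not the Accumulo API; the tablet split points and ranges below are
made up for illustration):

```python
# Sketch of the input format's range handling (illustration only, not the
# real API). With auto-adjusting enabled (the default), each requested scan
# range is clipped to tablet boundaries, yielding one input split -- and so
# one mapper -- per overlapping tablet. Disabled, each range is one mapper.

def overlap(a, b):
    """Intersection of two half-open [start, end) string ranges, or None."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None

def input_splits(requested_ranges, tablets, auto_adjust=True):
    if not auto_adjust:
        return list(requested_ranges)          # one mapper per range
    splits = []
    for rng in requested_ranges:
        for tablet in tablets:
            piece = overlap(rng, tablet)
            if piece is not None:
                splits.append(piece)           # one mapper per tablet piece
    return splits

# Three tablets with hypothetical split points at rowB and rowD:
tablets = [("row", "rowB"), ("rowB", "rowD"), ("rowD", "rowZ")]
full_scan = [("row", "rowZ")]

print(len(input_splits(full_scan, tablets)))                     # 3 mappers
print(len(input_splits(full_scan, tablets, auto_adjust=False)))  # 1 mapper
```

A full-table scan over three tablets thus fits three mappers to three
tablets, while the same scan with auto-adjusting disabled stays one range,
one mapper.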

Billie




Re: Map Reduce on accumulo

Posted by Aji Janis <aj...@gmail.com>.
Thank you for the clarification. You mentioned that "Using the input
format, unless you override the autosplitting in it, you will get 1 mapper
per tablet." in your initial response. Again, pardon the newbie question,
but how do I find out whether autosplitting is overridden or not?

Aji




Re: Map Reduce on accumulo

Posted by John Vines <vi...@apache.org>.
Your first two presumptions are correct. You will get 3 mappers and each
mapper will have data for only one tablet.

Each mapper will function exactly as a scanner for the range of the tablet,
so you will get things in lexicographical order. So the mapper for tablet A
will get all items for rowA in order before getting items for rowB.
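
To make that ordering concrete, here is a tiny plain-Python illustration
(not Accumulo code) of lexicographic key order, using simplified keys like
those in the examples above:

```python
# Illustration only: a mapper reads its tablet's range like a scanner, so
# entries arrive sorted lexicographically by key. Keys here are simplified
# to (row, column family, column qualifier) tuples.

entries = [
    ("rowB", "f1", "q2"),
    ("rowA", "f3", "q3"),
    ("rowC", "f1", "q1"),
    ("rowA", "f1", "q1"),
    ("rowA", "f2", "q2"),
]

# Python tuple comparison is lexicographic, matching the scan order.
scan_order = sorted(entries)

rows_seen = [row for row, _, _ in scan_order]
print(rows_seen)  # ['rowA', 'rowA', 'rowA', 'rowB', 'rowC']
```

Every rowA entry arrives before any rowB or rowC entry, regardless of the
order in which the data was written.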

John



Re: Map Reduce on accumulo

Posted by Aji Janis <aj...@gmail.com>.
Thank you, John, for your response. I do have a few follow-up questions.
Let me use a better example. Let's say my table and tablet server
distributions are as follows:

---------------------------------------------
MyTable:

rowA | f1 | q1 | v1
rowA | f2 | q2 | v2
rowA | f3 | q3 | v3

rowB | f1 | q1 | v1
rowB | f1 | q2 | v2

rowC | f1 | q1 | v1

rowD | f1 | q1 | v1
rowD | f1 | q2 | v2

rowE | f1 | q1 | v1

---------------------------------------------

TabletServer1: Tablet A: rowA, rowC
TabletServer2: Tablet B: rowB
TabletServer2: Tablet C: rowD

--------------------------------------------

In this example, suppose I have a MapReduce job that reads from the table
above and writes to the MyTable2 table using
org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
and org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat.

Let's not focus on what the MapReduce job itself does. From your
explanation, it sounds like if autosplitting is not overridden, then we get
three mappers total. Is that right?

Further, would I be right in assuming that a mapper will NOT get data from
multiple tablets?

I am also very confused about what the order of input to the mapper will
be. Would the mapper for tablet A get:
- all data from rowA before it gets all data from rowC, or
- all data from rowC before it gets all data from rowA, or
- something like:
   rowA | f1 | q1 | v1
   rowA | f2 | q2 | v2
   rowC | f1 | q1 | v1
   rowA | f3 | q3 | v3

I know this is a lot of questions, but I would really like to get a good
understanding of the architecture. Thank you!
Aji




Re: Map Reduce on accumulo

Posted by John Vines <vi...@apache.org>.
A tablet consists of both an in-memory portion and zero to many files in
HDFS. Each file may be one or many HDFS blocks. Accumulo gets a performance
boost from the natural locality you get when you write data to HDFS, but if
a tablet migrates, that locality can be lost until the data is compacted
(rewritten). Locality could be retained thanks to data replication, but
Accumulo does not make an extraordinary effort to chase that little bit of
locality, since the data will eventually be rewritten and locality restored.

As for your example: if all data for a given row is inserted at the same
time, then it is guaranteed to be in the same file. There is no atomicity
guarantee regarding HDFS blocks, though, so depending on the block size and
the amount of data in the file (and its distribution), it is possible for
a few entries to straddle a block boundary even though they are adjacent.
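
As a back-of-the-envelope illustration of that last point (made-up byte
offsets, plain Python, not HDFS code): an entry whose bytes begin just
before a 64 MB block boundary and end after it necessarily straddles two
blocks.

```python
# Sketch with made-up sizes: which 64 MB HDFS blocks hold the bytes of an
# entry at a given offset in a file. Blocks split the file by byte position
# with no regard for entry boundaries, so adjacent entries can straddle a
# block boundary.

BLOCK = 64 * 1024 * 1024  # the 64 MB HDFS block size from the question

def blocks_touched(offset, length, block_size=BLOCK):
    """Indices of the HDFS blocks covering bytes [offset, offset + length)."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return list(range(first, last + 1))

# A 300-byte entry starting 100 bytes before the first block boundary:
print(blocks_touched(offset=BLOCK - 100, length=300))  # [0, 1]: two blocks

# 1 GB of file data covers 1024 MB / 64 MB = 16 blocks:
print(len(blocks_touched(offset=0, length=1024 * 1024 * 1024)))  # 16
```

The second print also answers the arithmetic in the original question: a
1 GB tablet's data occupies 16 blocks at a 64 MB block size.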

Using the input format, unless you override the autosplitting in it, you
will get 1 mapper per tablet. If you disable auto-splitting, then you get
one mapper per range you specify.

Hope this helps; let me know if you have other questions or need
clarification.

John

