Posted to user@accumulo.apache.org by David Medinets <da...@gmail.com> on 2012/04/16 20:38:50 UTC

Using AccumuloOutputFormat, All Records Stored In One Tablet (Node)

Hopefully I am doing something wrong that can be easily rectified. I
have a Hadoop job that is sending well over 200M entries into
Accumulo, but every entry is being sent to a single node. The table
was created by the Hadoop job.

How can I get the entries to be spread over several nodes?

Re: Using AccumuloOutputFormat, All Records Stored In One Tablet (Node)

Posted by Billie J Rinaldi <bi...@ugov.gov>.
On Monday, April 16, 2012 3:01:03 PM, "David Medinets" <da...@gmail.com> wrote:
> I'll ask another basic question. The row id values are stored as
> strings. So "1" and "1111" are sorted together. Let's say that I have
> five nodes. Would I run this?
> 
> addsplits 2 4 6 8 -t table

That syntax looks correct.  Those particular split points might or might not be what you want depending on the distribution of your data.
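To make the lexicographic behavior concrete: Accumulo compares row ids as sorted strings, and each split point ends a tablet's range, so tablets cover (-inf, "2"], ("2", "4"], and so on. Here is a minimal Java sketch (not Accumulo code, just the same binary-search partitioning idea) of where a few string row ids would land under splits 2 4 6 8:

```java
import java.util.Arrays;

// Models how split points partition lexicographically sorted row ids.
// Tablet i holds rows in (SPLITS[i-1], SPLITS[i]]; the last tablet
// holds everything after the final split point.
public class SplitDemo {
    static final String[] SPLITS = {"2", "4", "6", "8"}; // must be sorted

    // Index of the tablet that would hold the given row id.
    static int tabletFor(String row) {
        int i = Arrays.binarySearch(SPLITS, row);
        // a row equal to a split point falls in the tablet ending at it
        return i >= 0 ? i : -i - 1;
    }

    public static void main(String[] args) {
        // "1" and "1111" sort together: both land in the first tablet,
        // while "35" lands in the ("2","4"] tablet.
        System.out.println(tabletFor("1"));    // 0
        System.out.println(tabletFor("1111")); // 0
        System.out.println(tabletFor("35"));   // 1
        System.out.println(tabletFor("9"));    // 4 (after the last split)
    }
}
```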

Billie



Re: Using AccumuloOutputFormat, All Records Stored In One Tablet (Node)

Posted by David Medinets <da...@gmail.com>.
I'll ask another basic question. The row id values are stored as
strings. So "1" and "1111" are sorted together. Let's say that I have
five nodes. Would I run this?

addsplits 2 4 6 8 -t table


Re: Using AccumuloOutputFormat, All Records Stored In One Tablet (Node)

Posted by Billie J Rinaldi <bi...@ugov.gov>.
On Monday, April 16, 2012 2:55:48 PM, "David Medinets" <da...@gmail.com> wrote:
> argh ... Just to be clear. The splits are essentially partitions of
> the row id?

Yes, specified by the end of the range.

> Can I add splits after the data is ingested? If so, how can I
> redistribute?

Yes.  You can either add specific split points, or you can lower the split threshold based on the size of the table.  For example, if the table size is S bytes, and you ideally want to have T tablets, then set the table's split threshold to S/T.  These calculations are rarely exact, so I would start high on the split threshold, let it split out, see if the number of tablets is ok, then lower again if necessary.
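As a worked example of the S/T rule (the table size and tablet count here are hypothetical): a 100 GB table that you would like in roughly 200 tablets gives a threshold of 512 MB.

```java
// Billie's rule of thumb: pick a split threshold so a table of size
// S bytes splits into roughly T tablets.
public class SplitThreshold {
    static long thresholdFor(long tableBytes, long desiredTablets) {
        return tableBytes / desiredTablets;
    }

    public static void main(String[] args) {
        long s = 100L * 1024 * 1024 * 1024; // 100 GB table
        long t = 200;                        // want ~200 tablets
        System.out.println(thresholdFor(s, t)); // 536870912 bytes = 512 MB
        // Applied in the Accumulo shell with the table.split.threshold
        // property, e.g.: config -t mytable -s table.split.threshold=512M
        // (table name is a placeholder)
    }
}
```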

Billie



Re: Using AccumuloOutputFormat, All Records Stored In One Tablet (Node)

Posted by David Medinets <da...@gmail.com>.
argh ... Just to be clear. The splits are essentially partitions of the row id?

Can I add splits after the data is ingested? If so, how can I redistribute?


Re: Using AccumuloOutputFormat, All Records Stored In One Tablet (Node)

Posted by Eric Newton <er...@gmail.com>.
Create the table with splits, but this requires you to know something about
the distribution of your data.
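In shell terms, creating the table and pre-splitting it before ingest might look like the following (the table name and split points are placeholders; useful splits depend on the actual distribution of your row ids):

```
createtable mytable
addsplits 2 4 6 8 -t mytable
```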

-Eric
