Posted to user@hive.apache.org by Fabian Alenius <fa...@gmail.com> on 2012/08/11 19:16:22 UTC

Partitioning strings for bucketed table

Hi,

I'm trying to create an external bucketed table but I'm having trouble
recreating the behavior of the hive partitioner used to create
internal bucketed tables.

My bucket key is a String s. Currently in my partitioner I'm using the
following code, which is based on my findings in the Hive codebase:

  (s.hashCode() & Integer.MAX_VALUE) % numPartitions;

Unfortunately, when I run a select count(*) with TABLESAMPLE, about 1%
of the rows are missing from those coming into the mapper.

I suspect that I might need to wrap my String in a Writable before
calling hashCode(). Does anyone know exactly how to partition the data
so that it becomes compatible with hive bucketing?
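
One possible cause, stated as an assumption and not something confirmed
in this thread: if Hive hashes a string key over its UTF-8 bytes instead
of over its UTF-16 chars as String.hashCode() does, the two hashes agree
for pure-ASCII keys and can diverge for anything else. That would fit a
small fraction of rows landing in the wrong bucket. The sketch below
compares the two variants. The class and method names are hypothetical,
and it does not call any Hive or Hadoop API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for comparing two candidate bucket functions.
final class BucketSketch {

    // The partitioner from the message above: hash over UTF-16 chars.
    static int stringHashBucket(String s, int numBuckets) {
        return (s.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    // ASSUMPTION: a 31-multiplier hash over the UTF-8 encoded bytes,
    // starting from 0. Java bytes are signed, so any byte >= 0x80
    // contributes a negative term.
    static int utf8ByteHash(String s) {
        int r = 0;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            r = r * 31 + b;
        }
        return r;
    }

    static int utf8ByteHashBucket(String s, int numBuckets) {
        return (utf8ByteHash(s) & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 32;
        String[] keys = {"hello", "user_42", "caf\u00e9"};
        for (String key : keys) {
            int a = stringHashBucket(key, numBuckets);
            int b = utf8ByteHashBucket(key, numBuckets);
            System.out.println(key + ": charHash=" + a + " byteHash=" + b
                    + (a == b ? "" : "  <-- differs"));
        }
    }
}
```

Running a check like this over the actual bucket keys would show whether
the filtered rows are the ones containing non-ASCII characters.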


Regards,

Fabian

Re: Partitioning strings for bucketed table

Posted by Fabian Alenius <fa...@gmail.com>.
Just noticed that the missing rows are accounted for under the counter:

org.apache.hadoop.hive.ql.exec.FilterOperator$Counter - FILTERED

Is there any way to print these rows or get more information about why
they are being filtered?

Fabian
