Posted to user@hive.apache.org by Lars Francke <la...@gmail.com> on 2010/08/10 00:41:52 UTC

Simulating an auto-incrementing column

Hi,

I have a problem and I hope someone has an idea on how to solve it.

My dataset consists of simple key-value pairs of strings, imported
from PostgreSQL using Sqoop.

1) I need to count how often a key occurs -> Easy
2) I need to count how often a key-value pair occurs -> Easy

I need to output this data to PostgreSQL again, into two tables:

a) "keys" with the columns: id, key_name, count
b) "values" with the columns: id, key_id, value_name, count

Now the ids I'm referring to don't exist yet and I'm looking into
solutions to generate them. They have to be integers/longs but they
don't have to be in any order/pattern. I'm not concerned about
performance either as this query will be run monthly at most.
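
To make the target shape concrete, here is a minimal sketch in plain Python (not Hive) of what the two output tables would contain. It counts keys and key-value pairs in memory, hands out arbitrary integer ids from a counter, and fills key_id through a dictionary lookup, which stands in for the join. The function name build_tables and the sample data are hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import count


def build_tables(pairs):
    """Return (keys_rows, values_rows) for an iterable of (key, value) pairs.

    keys_rows:   (id, key_name, count)
    values_rows: (id, key_id, value_name, count)
    """
    pairs = list(pairs)
    key_counts = Counter(key for key, _ in pairs)
    pair_counts = Counter(pairs)

    # The ids only have to be unique integers, so a simple counter is enough.
    key_ids = {}
    keys_rows = []
    for new_id, (key, n) in zip(count(1), sorted(key_counts.items())):
        key_ids[key] = new_id
        keys_rows.append((new_id, key, n))

    # Each value row looks up the id of its key (the equivalent of the join).
    values_rows = []
    for new_id, ((key, value), n) in zip(count(1), sorted(pair_counts.items())):
        values_rows.append((new_id, key_ids[key], value, n))

    return keys_rows, values_rows


# Hypothetical sample data.
sample = [("a", "x"), ("a", "y"), ("b", "x"), ("a", "x")]
keys_rows, values_rows = build_tables(sample)
```

This only works when everything fits in memory on one machine; it is meant to show the relationship between the two tables, not a way to process the real dataset.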

Do you have any idea how I could introduce this new column into the
output of query 1)? I could easily introduce it into 2) with a join
then. I thought about using a custom reducer script, but apart from the
fact that I've never written one, it would require that there is only
one reducer so that I can simulate an auto-incrementer. My current best
idea is to write a regular MR job that processes the Hive output, but
I'd love to do everything in Hive if possible.
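
For reference, a custom reducer script of the kind described could look roughly like the sketch below. It assumes the rows arrive as tab-separated lines on standard input (the usual convention for Hive streaming scripts) and that exactly one reducer runs it, since every instance of the script would otherwise start counting from the same value. The names add_ids and run are hypothetical.

```python
import sys


def add_ids(lines, start=1):
    """Yield each input line with an incrementing integer id prepended.

    The id and the original row are separated by a tab, so the id becomes
    the first column of the output.
    """
    for row_id, line in enumerate(lines, start):
        yield "%d\t%s" % (row_id, line.rstrip("\n"))


def run(stdin, stdout):
    """Copy rows from stdin to stdout, numbering them as they pass through."""
    for out in add_ids(stdin):
        stdout.write(out + "\n")


# When saved as a standalone script, finish with:
#     run(sys.stdin, sys.stdout)
```

Such a script would typically be shipped with the query and invoked through Hive's TRANSFORM ... USING clause. The numbering is only unique if a single process sees all rows, which is why the single-reducer restriction matters.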

I might very well be approaching this problem completely wrong, so
don't hesitate to propose a better solution or bash me for my poor
understanding of Hive :)

Thanks for any input and help.

Cheers,
Lars

Re: Simulating an auto-incrementing column

Posted by Lars Francke <la...@gmail.com>.

Hi Tim,

> I had a similar need and came across
> https://issues.apache.org/jira/browse/HIVE-1304 but haven't got round
> to trying it yet.

Well, that looks exactly like what I'm looking for. The missing link
for me was this line:

set mapred.reduce.tasks=1;

I've used it before but I don't know why I didn't think of it now.
I've even mentioned it in my initial mail :(
Thanks! I'll go ahead and try it right now.

Cheers,
Lars

Re: Simulating an auto-incrementing column

Posted by Tim Robertson <ti...@gmail.com>.

Hi Lars,

I had a similar need and came across
https://issues.apache.org/jira/browse/HIVE-1304 but haven't got round
to trying it yet.

Cheers,
Tim

