Posted to common-user@hadoop.apache.org by Vadim Zaliva <kr...@gmail.com> on 2009/01/27 02:40:37 UTC

DBOutputFormat and auto-generated keys

Is it possible to obtain auto-generated IDs when writing data using
DBOutputFormat?

For example, is it possible to write a Mapper which stores records in
the DB and returns the auto-generated IDs of these records?

Let me explain what I am trying to achieve:

I have data like this:

<key, (value,value,value)>

which I would like to store in normalized form in two tables. The first
table will store keys (strings); each key will get a unique int id
auto-generated by MySQL. The second table will hold (key_id, value)
pairs, key_id being a foreign key pointing to the first table.
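
For concreteness, this is roughly the schema I have in mind, sketched as
plain JDBC against MySQL (table and column names are just placeholders):

// Rough sketch of the two-table schema; names are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateSchema {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb", "user", "pass");
        Statement st = conn.createStatement();
        // First table: one row per key, id auto-generated by MySQL.
        st.executeUpdate("CREATE TABLE key_table ("
                + " id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,"
                + " name VARCHAR(255) NOT NULL UNIQUE) ENGINE=InnoDB");
        // Second table: one row per (key_id, value), key_id pointing at key_table.id.
        st.executeUpdate("CREATE TABLE value_table ("
                + " key_id INT NOT NULL,"
                + " value VARCHAR(255),"
                + " FOREIGN KEY (key_id) REFERENCES key_table (id)) ENGINE=InnoDB");
        st.close();
        conn.close();
    }
}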

Sincerely,
Vadim

Re: DBOutputFormat and auto-generated keys

Posted by Kevin Peterson <kp...@biz360.com>.
On Mon, Jan 26, 2009 at 5:40 PM, Vadim Zaliva <kr...@gmail.com> wrote:

> Is it possible to obtain auto-generated IDs when writing data using
> DBOutputFormat?
>
> For example, is it possible to write a Mapper which stores records in
> the DB and returns the auto-generated IDs of these records?

...

> which I would like to store in normalized form in two tables. The first
> table will store keys (strings); each key will get a unique int id
> auto-generated by MySQL. The second table will hold (key_id, value)
> pairs, key_id being a foreign key pointing to the first table.
>

A mapper only gets one output format, and an output format can't pass any
data back into the map task, so that approach won't work. DBOutputFormat
doesn't provide any way to do it either.
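
For reference, this is roughly what the DBOutputFormat contract looks like
with the org.apache.hadoop.mapred.lib.db API in 0.19; the record class
itself is only an illustration:

// Record class used as the output key for DBOutputFormat.
// DBOutputFormat builds "INSERT INTO <table> (<fields>) VALUES (?, ?)"
// and calls write(PreparedStatement) to fill in the placeholders; there
// is no callback through which a generated id could flow back to the task.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class KeyValueRecord implements Writable, DBWritable {
    private String key;
    private String value;

    public void write(PreparedStatement stmt) throws SQLException {
        stmt.setString(1, key);
        stmt.setString(2, value);
    }

    // Only used on the DBInputFormat side; never invoked when writing.
    public void readFields(ResultSet rs) throws SQLException {
        key = rs.getString(1);
        value = rs.getString(2);
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(key);
        out.writeUTF(value);
    }

    public void readFields(DataInput in) throws IOException {
        key = in.readUTF();
        value = in.readUTF();
    }
}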

If you wanted to add this kind of functionality, you would need to write
your own output format that is aware of your foreign keys; it probably
wouldn't look much like DBOutputFormat, and it would quickly get very
complicated.
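
Just to give an idea of what that would involve, here is a rough sketch of
such a record writer, using plain JDBC and Statement.RETURN_GENERATED_KEYS
to read MySQL's auto-generated id back before inserting the dependent rows.
The KeyValuesRecord class and the table/column names are made up for the
example:

// Not part of Hadoop: a hand-rolled RecordWriter that writes to two
// tables, reading the generated key id back between the two inserts.
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;

public class TwoTableRecordWriter
        implements RecordWriter<KeyValuesRecord, NullWritable> {

    private final Connection conn;
    private final PreparedStatement insertKey;
    private final PreparedStatement insertValue;

    public TwoTableRecordWriter(String jdbcUrl, String user, String pass)
            throws SQLException {
        conn = DriverManager.getConnection(jdbcUrl, user, pass);
        conn.setAutoCommit(false);
        insertKey = conn.prepareStatement(
                "INSERT INTO key_table (name) VALUES (?)",
                Statement.RETURN_GENERATED_KEYS);
        insertValue = conn.prepareStatement(
                "INSERT INTO value_table (key_id, value) VALUES (?, ?)");
    }

    public void write(KeyValuesRecord rec, NullWritable ignored)
            throws IOException {
        try {
            // Insert the key and read back the AUTO_INCREMENT id.
            insertKey.setString(1, rec.getKey());
            insertKey.executeUpdate();
            ResultSet keys = insertKey.getGeneratedKeys();
            keys.next();
            long keyId = keys.getLong(1);
            keys.close();

            // Insert one row per value, pointing at the generated id.
            for (String value : rec.getValues()) {
                insertValue.setLong(1, keyId);
                insertValue.setString(2, value);
                insertValue.executeUpdate();
            }
        } catch (SQLException e) {
            throw new IOException(e.toString());
        }
    }

    public void close(Reporter reporter) throws IOException {
        try {
            conn.commit();
            conn.close();
        } catch (SQLException e) {
            throw new IOException(e.toString());
        }
    }
}

And that still leaves you to write the OutputFormat that hands this writer
its connection settings, and to deal with retried tasks inserting duplicate
rows, batching, and so on; that is where it gets complicated quickly.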

One possibility that comes to mind is writing a "HibernateOutputFormat" or
similar, which would give you a way to express the relationships between
tables, leaving you only the task of hooking your persistence logic up to a
Hadoop output format.
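
A very loose sketch of what the record writer for that might look like,
assuming the job's output key is already a mapped Hibernate entity whose
relationships and generated ids Hibernate takes care of. None of this
exists; it is just to show the shape:

// Hypothetical Hibernate-backed record writer; the entity mapping is
// assumed to be configured in hibernate.cfg.xml.
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;
import org.hibernate.cfg.Configuration;

public class HibernateRecordWriter<E>
        implements RecordWriter<E, NullWritable> {

    private final SessionFactory factory =
            new Configuration().configure().buildSessionFactory();
    private final Session session = factory.openSession();
    private final Transaction tx = session.beginTransaction();

    public void write(E entity, NullWritable ignored) throws IOException {
        // save() assigns the generated identifier to the entity, and any
        // mapped child rows are persisted with the right foreign key.
        session.save(entity);
    }

    public void close(Reporter reporter) throws IOException {
        tx.commit();
        session.close();
        factory.close();
    }
}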

I had a similar problem with writing out reports to be used by a Rails app,
and solved it by restructuring things so that I didn't need to write to two
tables from the same map task.