Posted to user@hbase.apache.org by Christopher Dorner <ch...@gmail.com> on 2011/10/01 13:05:38 UTC

Best way to write to multiple tables in one map-only job

Hello,

I am building an RDF store using HBase and experimenting with different 
index tables and schema designs.

For the input, I have a file where each line is an RDF triple in N3 format.

I need to write to multiple tables, since I am building several index 
tables. To reduce I/O and avoid reading the file several times, I want 
to do that in one map-only job. Eventually the file will contain a few 
million triples.

So far I am experimenting in pseudo-distributed mode, but I will be able 
to run it on our cluster soon.
Storing the data in the tables does not need to be speed-optimized at 
any cost; I just want to do it as simply and quickly as possible.


What is the best way to write to more than one table in one map task?

a)
I can either use MultiTableOutputFormat.class as the job's output format 
and write in map() using:
Put put = new Put(key);
put.add(kv); // add a prepared KeyValue to the Put
context.write(tableName, put); // tableName is an ImmutableBytesWritable

Can I write to, e.g., 6 tables in this way by creating a new Put for 
each table?

But how can I turn off autoFlush and set writeBufferSize in this case? 
I think autoflush is not a good idea when putting lots of values.
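
To make option a) concrete, here is roughly what I have in mind (only a 
sketch; the table and column names are invented, and the row-key and 
value variables are assumed to come from the parsed triple):

// Job setup for option a); map-only, writing through MultiTableOutputFormat.
Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "rdf-index-load");
job.setMapperClass(TripleMapper.class);            // hypothetical mapper
job.setOutputFormatClass(MultiTableOutputFormat.class);
job.setNumReduceTasks(0);                          // map-only

// Inside TripleMapper.map(): one Put per target table.
byte[] family = Bytes.toBytes("t");                // assumed column family
ImmutableBytesWritable spoTable =
    new ImmutableBytesWritable(Bytes.toBytes("spo_index"));
ImmutableBytesWritable ospTable =
    new ImmutableBytesWritable(Bytes.toBytes("osp_index"));
Put spoPut = new Put(spoRowKey);                   // row keys built from the triple
spoPut.add(family, Bytes.toBytes("o"), objectBytes);
context.write(spoTable, spoPut);
Put ospPut = new Put(ospRowKey);
ospPut.add(family, Bytes.toBytes("s"), subjectBytes);
context.write(ospTable, ospPut);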


b)
I can use an instance of HTable in the Mapper class. Then I can set 
autoFlush and writeBufferSize and write to the table using:
HTable table = new HTable(config, tableName);
table.put(put); // with autoFlush off, puts are buffered client-side

But it is recommended to use only one HTable instance per table, so I 
would need one
table = new HTable(config, tableName);
for each table I want to write to. Is that still fine with 6 tables?
I also stumbled upon HTablePool. Is it meant for scenarios like this?
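
From the docs, I would guess HTablePool is used roughly like this (only 
a sketch; the table name and pool size are made up):

HTablePool pool = new HTablePool(conf, 10);          // cache up to 10 HTables per table name
HTableInterface table = pool.getTable("spo_index");  // hypothetical table name
table.put(somePut);
pool.putTable(table);                                // return it to the pool instead of closing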


Thank You and Regards,
Christopher

RE: Best way to write to multiple tables in one map-only job

Posted by Michael Segel <mi...@hotmail.com>.
One other option...
Your map() method has NullWritable output and you handle the put() to the table(s) yourself within the map() method.
You can also set the autoflush within your job setup.
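
A rough skeleton of that pattern (only a sketch; the table names and the 
8 MB buffer size are arbitrary, and the job would use e.g. NullOutputFormat 
with zero reducers):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SelfWritingMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private HTable spoTable; // hypothetical index tables
  private HTable ospTable;

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = HBaseConfiguration.create(context.getConfiguration());
    spoTable = new HTable(conf, "spo_index");
    ospTable = new HTable(conf, "osp_index");
    for (HTable t : new HTable[] { spoTable, ospTable }) {
      t.setAutoFlush(false);                  // buffer puts client-side
      t.setWriteBufferSize(8 * 1024 * 1024);  // e.g. 8 MB
    }
  }

  @Override
  protected void map(LongWritable key, Text line, Context context)
      throws IOException {
    // parse the triple and build one Put per index table (parsing omitted),
    // then e.g.: spoTable.put(somePut); ospTable.put(anotherPut);
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    spoTable.flushCommits(); // push whatever is still buffered
    ospTable.flushCommits();
    spoTable.close();
    ospTable.close();
  }
}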



Re: Best way to write to multiple tables in one map-only job

Posted by Jean-Daniel Cryans <jd...@apache.org>.
From the code I gave the link to:

https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.java#L102
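
(Paraphrasing what the code around that line does when it creates a 
table; not an exact quote of the source:)

HTable table = new HTable(conf, tableName);
table.setAutoFlush(false); // MultiTableOutputFormat already disables autoflush for its tables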

Hope this helps,

J-D


Re: Best way to write to multiple tables in one map-only job

Posted by Christopher Dorner <ch...@gmail.com>.
Thank you for the hint.

What about autoflush then? Is that also something I can set using the 
config at job setup? Or does it only work with an HTable instance? 
Somehow I can't really find the right information :)

Regards,
Christopher



Re: Best way to write to multiple tables in one map-only job

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Options a) and b) are the same, since MultiTableOutputFormat internally
uses multiple HTables. See for yourself:

https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.java

Also, you can set the write buffer by setting
hbase.client.write.buffer on the configuration that you pass in at
job setup.
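
For example (the 8 MB value is arbitrary):

Configuration conf = HBaseConfiguration.create();
conf.setLong("hbase.client.write.buffer", 8 * 1024 * 1024); // in bytes
Job job = new Job(conf, "rdf-index-load"); // the output format's HTables pick this up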

Using HTablePool in a single-threaded application doesn't offer more
than just storage for your HTables.

Hope that helps,

J-D
