Posted to user@hbase.apache.org by Norbert Burger <no...@gmail.com> on 2012/08/17 21:17:17 UTC

issues copying data from one table to another

Hi folks -- we're running CDH3u3 (0.90.4).  I'm trying to export data
from an existing table that has far too many regions (2600+ for only 8
regionservers) into one with a more reasonable region count for this
cluster (256).  Overall data volume is approx. 3 TB.

I thought initially that I'd use the bulkload/importtsv approach, but
it turns out this table's schema has column qualifiers made from
timestamps, so it's impossible for me to specify a list of target
columns for importtsv.  From what I can tell, the TSV interchange
format requires your data to have the same column qualifiers throughout.

I took a look at CopyTable and Export/Import, which both appear to
wrap the HBase client API (emitting Puts from a mapper).  But I'm
seeing significant performance problems with this approach, to the
point that I'm not sure it's feasible.  Export appears to work OK, but
when I try importing the data back from HDFS, the rest of our cluster
drags to a halt -- client writes (even those not associated with the
Import) start timing out.  FWIW, Import already disables autoFlush
(via TableOutputFormat).

From [1], one option I could try would be to disable the WAL.  Are
there other techniques I should try?  Has anyone implemented a
bulkloader which doesn't use the TSV format?
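
For the WAL option, here's a rough, untested sketch of what I have in
mind against the 0.90 client API (table, family, and qualifier names
are just placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WalOffPutSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "new_table");   // placeholder table name
        try {
          table.setAutoFlush(false);          // what TableOutputFormat already does
          Put put = new Put(Bytes.toBytes("some-row-key"));
          put.setWriteToWAL(false);           // the option from [1]: skip the WAL
          put.add(Bytes.toBytes("cf"),                // placeholder family
                  Bytes.toBytes("1345230000000"),     // timestamp-style qualifier
                  Bytes.toBytes("some-value"));
          table.put(put);
          table.flushCommits();
        } finally {
          table.close();
        }
      }
    }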

Norbert

[1] http://hbase.apache.org/book/perf.writing.html

Re: issues copying data from one table to another

Posted by Norbert Burger <no...@gmail.com>.
On Fri, Aug 17, 2012 at 4:09 PM, anil gupta <an...@gmail.com> wrote:
> If you want to customize the bulkloader then you can write your own mapper
> to define the business logic for loading. You need to specify the mapper at
> the time of running importtsv by using the
> "-Dimporttsv.mapper.class=my.Mapper" property.

Thanks, Anil.  I had seen that section of the HBase book, but
glossed over the mapper class property until you pointed it out.

Norbert

Re: issues copying data from one table to another

Posted by anil gupta <an...@gmail.com>.
Hi Norbert,

If you want to customize the bulkloader, then you can write your own mapper
to define the business logic for loading.  You need to specify the mapper at
the time of running importtsv by using the
"-Dimporttsv.mapper.class=my.Mapper" property.

Refer to this link: http://hbase.apache.org/book.html#importtsv
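
A rough, untested sketch of what such a mapper could look like (the class
name, the "cf" column family, and the tab-separated rowkey/qualifier/value
layout are all assumptions about your data):

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (row key, Put) pairs like the stock TsvImporterMapper, but takes
    // the column qualifier from the data itself, so no fixed column list has
    // to be passed to importtsv.
    public class TimestampQualifierMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

      private static final byte[] FAMILY = Bytes.toBytes("cf");  // placeholder family

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // Assumed input layout: rowkey <TAB> timestamp-qualifier <TAB> value
        String[] fields = line.toString().split("\t", 3);
        if (fields.length < 3) {
          return;  // skip malformed lines rather than failing the job
        }
        byte[] row = Bytes.toBytes(fields[0]);
        Put put = new Put(row);
        put.add(FAMILY, Bytes.toBytes(fields[1]), Bytes.toBytes(fields[2]));
        context.write(new ImmutableBytesWritable(row), put);
      }
    }

If you also set -Dimporttsv.bulk.output=<hdfs dir>, the job should write
HFiles that completebulkload can move into the new table, so the load
bypasses the client write path entirely.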

HTH,
Anil




-- 
Thanks & Regards,
Anil Gupta

Re: issues copying data from one table to another

Posted by Norbert Burger <no...@gmail.com>.
On Sat, Aug 18, 2012 at 7:14 AM, Michael Segel
<mi...@hotmail.com> wrote:

Thanks.

> Just out of curiosity, what would happen if you could disable the table, alter the table's max file size and then attempt to merge regions?  Note: I've never tried this, don't know if it's possible, just thinking outside of the box...

Good idea.  In this case, I'm free to disable the old, region-full
table.  Unfortunately, I've already started writing data into the
newer, lower-region-count table, so at some point I'll need to export
the data anyway.

Does it make sense that these perf issues are caused by using the
HBase client API (vs. bulk export)?  My next thought was to write a
custom mapper for importtsv, as Anil suggested.

Norbert

Re: issues copying data from one table to another

Posted by Michael Segel <mi...@hotmail.com>.
Can you disable the table? 
How much free disk space do you have? 

Is this a production cluster?
Can you upgrade to CDH3u5?

Are you running a capacity scheduler or fair scheduler?

Just out of curiosity, what would happen if you could disable the table, alter the table's max file size and then attempt to merge regions?  Note: I've never tried this, don't know if it's possible, just thinking outside of the box...

Outside of that... the safest way to do this would be to export the table. You'll get 2800 mappers, so if you're using a scheduler, you can just put this into a queue that limits the number of concurrent mappers.

When you import the data into your new table, you can run it on an even more restrictive queue so that you have less of an impact on your system.  The downside is that it's going to take a bit longer to run.  Again, it's probably the safest way to do this....
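
For example, something along these lines (untested; the queue name, table
names, and HDFS path are placeholders, and if you're on the fair scheduler
you'd set -Dmapred.fairscheduler.pool instead of -Dmapred.job.queue.name):

    # Export the old table to HDFS on a low-capacity queue
    hbase org.apache.hadoop.hbase.mapreduce.Export \
      -Dmapred.job.queue.name=low_priority \
      old_table /backup/old_table

    # Import into the new, 256-region table on an even more restrictive queue
    hbase org.apache.hadoop.hbase.mapreduce.Import \
      -Dmapred.job.queue.name=low_priority \
      new_table /backup/old_table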

HTH, 

-Mike
