Posted to user@hbase.apache.org by Jinyuan Zhou <zh...@gmail.com> on 2013/05/17 16:52:23 UTC

bulk load skipping tsv files

Hi,
I wonder if there is a tool similar
to org.apache.hadoop.hbase.mapreduce.ImportTsv. ImportTsv reads from a TSV
file and creates HFiles that are ready to be loaded into the corresponding
regions by another
tool, org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles. What I want
is to read from an existing HBase table and create HFiles directly. I think
I know how to write such a class by following the steps in the ImportTsv
class, but I wonder if someone has already done this.
Thanks,
Jack

-- 
-- Jinyuan (Jack) Zhou
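
For what it's worth, here is a minimal sketch of such a job, assuming
0.94-era APIs: a TableMapper scans the source table, and
HFileOutputFormat.configureIncrementalLoad() wires up the sort and the
partitioner so each reducer produces HFiles aligned with one region of the
target table. The table names, the RecalcMapper class, and the output path
below are placeholders, not a tested tool:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TableToHFiles {

  // Scans the source table; the recalculation of values would go here.
  // For now it just copies each cell into a Put keyed on the same row.
  static class RecalcMapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    protected void map(ImmutableBytesWritable row, Result columns,
        Context context) throws IOException, InterruptedException {
      Put put = new Put(row.get());
      for (KeyValue kv : columns.raw()) {
        put.add(kv); // replace with recalculated values as needed
      }
      context.write(row, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "table-to-hfiles");
    job.setJarByClass(TableToHFiles.class);

    Scan scan = new Scan();
    scan.setCaching(500);       // bigger scanner caching for MR scans
    scan.setCacheBlocks(false); // don't churn the block cache

    TableMapReduceUtil.initTableMapperJob("source_table", scan,
        RecalcMapper.class, ImmutableBytesWritable.class, Put.class, job);

    // Sets the reducer, total-order partitioner and output format so the
    // HFiles line up with the current regions of the target table.
    HFileOutputFormat.configureIncrementalLoad(job,
        new HTable(conf, "target_table"));
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

After the job finishes, LoadIncrementalHFiles moves the files into the
regions, e.g.:
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <output-dir> target_table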

Re: bulk load skipping tsv files

Posted by Jinyuan Zhou <zh...@gmail.com>.
I had thought about coprocessors, but I had the impression that a
coprocessor is the last option one should try because it is so invasive to
the JVM running HBase. Not sure about the current status, though. However,
what a coprocessor can give me in this case is less network load. My
problem is HBase's housekeeping workload as a result of the increased data
volume, and for that part a coprocessor may not help.
Thanks,






-- 
-- Jinyuan (Jack) Zhou

Re: bulk load skipping tsv files

Posted by Ted Yu <yu...@gmail.com>.
Jinyuan:

bq. no new data needed, only some value will be changed by recalculation.

Have you considered using a coprocessor to fulfill the above task?

Cheers


Re: bulk load skipping tsv files

Posted by Jinyuan Zhou <zh...@gmail.com>.
Will try that. Thanks,





-- 
-- Jinyuan (Jack) Zhou

Re: bulk load skipping tsv files

Posted by Shahab Yunus <sh...@gmail.com>.
If I understood your use case correctly, and you don't need to maintain
older versions of the data, why don't you set the 'max versions' parameter
for your table's column families to 1? I suspect the growth in data size
even on updates is due to old versions being retained. Have you tried that?

Regards,
Shahab
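
To make that concrete, here is a minimal sketch of the schema change
through the Java admin API, assuming 0.94-era classes; the table name
'mytable' and family 'cf' are made up, and the same change can be made from
the HBase shell with alter:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class SetMaxVersions {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Hypothetical table/family names; keep only one version per cell.
    // Note: modifyColumn replaces the whole family descriptor, so carry
    // over any non-default settings the family already has.
    HColumnDescriptor cf = new HColumnDescriptor("cf");
    cf.setMaxVersions(1);

    admin.disableTable("mytable");      // table must be offline for schema changes
    admin.modifyColumn("mytable", cf);
    admin.enableTable("mytable");
    admin.close();
  }
}

Older versions are only physically removed when a major compaction rewrites
the store files, so the table may not shrink on disk right away.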



Re: bulk load skipping tsv files

Posted by Jinyuan Zhou <zh...@gmail.com>.
Actually, I want to update each row of a table each day. No new data is
needed; only some values will be changed by recalculation. It looks like
every time I do this, the data in the table doubles, even though it is an
update. I believe even an update results in new HFiles, and the cluster
then gets very busy splitting regions and doing related work. It takes
about an hour to update only about 250 million rows. I only need one
version, so I think it might be faster to store the recalculated results
in HFiles, truncate the original table, and then bulk load the HFiles into
the empty table.
Thanks,






-- 
-- Jinyuan (Jack) Zhou
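
For the bulk-load step of that plan, here is a minimal sketch using the
programmatic API; the output path and table name are placeholders. The same
can be done on the command line with:
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <hfile-dir> <table>

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "target_table"); // hypothetical name
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    // Moves each HFile into the region that covers its key range,
    // splitting any file that spans a region boundary.
    loader.doBulkLoad(new Path("/tmp/hfiles"), table); // hypothetical path
    table.close();
  }
}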

Re: bulk load skipping tsv files

Posted by Ted Yu <yu...@gmail.com>.
bq. What I want is to read from some hbase table and create hfiles directly

Can you describe your use case in more detail?

Thanks
