Posted to user@hbase.apache.org by Sever Fundatureanu <fu...@gmail.com> on 2012/07/26 18:39:09 UTC

Bulk loading disadvantages

Hello,

For the bulk-loading process, the HBase documentation mentions that in
a 2nd stage "the appropriate Region Server adopts the HFile, moving it
into its storage directory and making the data available to clients."
But in my experience the files also remain in the original location
from which they are "adopted". So I guess the data is actually copied
into the HBase directory, right? This means that, compared to
online importing, bulk loading essentially requires twice the
disk space on HDFS, right?
Another problem is data locality immediately after bulk loading
through MR. I understand that locality is regained over time through
compactions and splits. However, you don't run into this problem when
importing online, right?

Thanks in advance,
Sever

Re: Bulk loading disadvantages

Posted by Alex Baranau <al...@gmail.com>.
> Another problem is with data locality immediately after bulk loading
> through MR.

You might find this recent discussion about that useful: [1]

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr

[1] The start is here:
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201207.mbox/%3CCAA7+SiBcu_yB45=WEARKcpdW1hgnKsuV4ceVxhjF8k5yRwV1KA@mail.gmail.com%3E
but then the thread gets broken due to "FWD"/"RES" being added to the
subject. You can also find it here:
http://search-hadoop.com/?q=bulk+import+and+data+locality

On Fri, Jul 27, 2012 at 9:46 AM, Sever Fundatureanu <
fundatureanu.sever@gmail.com> wrote:

> After digging a bit I've found my problem comes from the following
> lines in the Store class:
>
> void bulkLoadHFile(String srcPathStr) throws IOException {
>     Path srcPath = new Path(srcPathStr);
>
>     // Move the file if it's on another filesystem
>     FileSystem srcFs = srcPath.getFileSystem(conf);
>     if (!srcFs.equals(fs)) {
>       LOG.info("File " + srcPath + " on different filesystem than " +
>           "destination store - moving to this filesystem.");
>       Path tmpPath = getTmpPath();
>       FileUtil.copy(srcFs, srcPath, fs, tmpPath, false, conf);
>       LOG.info("Copied to temporary path on dst filesystem: " + tmpPath);
>       srcPath = tmpPath;
>     }
>
> The equality for the 2 filesystems fails in my case and I get the
> following log:
>
> 2012-07-27 14:47:25,321 INFO
> org.apache.hadoop.hbase.regionserver.Store: File
>
> hdfs://fs0.cm.cluster:8020/user/sfu200/outputBsbm/string2Id/F/e6cf2d1b69354e268b79597bf3855357
> on different filesystem than destination store - moving to this
> filesystem.
> 2012-07-27 14:47:27,286 INFO
> org.apache.hadoop.hbase.regionserver.Store: Copied to temporary path
> on dst filesystem:
>
> hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9
> 2012-07-27 14:47:27,286 DEBUG
> org.apache.hadoop.hbase.regionserver.Store: Renaming bulk load file
>
> hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9
> to
> hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F/c4bbf70a6654422db81884f15f34c712
> 2012-07-27 14:47:27,297 INFO
> org.apache.hadoop.hbase.regionserver.StoreFile: HFile Bloom filter
> type for c4bbf70a6654422db81884f15f34c712: NONE, but ROW specified in
> column family configuration
> 2012-07-27 14:47:27,297 INFO
> org.apache.hadoop.hbase.regionserver.Store: Moved hfile
>
> hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9
> into store directory
>
> hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F
> - updating store file list.
> 2012-07-27 14:47:27,297 INFO
> org.apache.hadoop.hbase.regionserver.Store: Successfully loaded store
> file
> hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9
> into store F (new location:
>
> hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F/c4bbf70a6654422db81884f15f34c712)
>
> In my hbase-site.xml I have:
> <property>
>       <name>hbase.rootdir</name>
>       <value>hdfs://fs0.cm.cluster:8020/hbase</value>
>       <description>The directory shared by RegionServers.
>       </description>
>     </property>
>
> and in my hdfs-site.xml I have:
> <property>
>   <name>fs.default.name</name>
>   <value>hdfs://fs0.cm.cluster:8020</value>
> </property>
>
> As you can see they point to the same namenode. So I really don't
> understand why the above check fails..
>
> Regards,
> Sever
>
> On Fri, Jul 27, 2012 at 1:17 PM, Sever Fundatureanu
> <fu...@gmail.com> wrote:
> > Hi Anil,
> >
> > I am using HBase 0.94.0 with Hadoop 1.0.0. The directories are indeed
> > the ones mentioned by Bijeet. I can also add that I am doing the 2nd
> > stage programmatically by calling doBulkLoad(org.apache.hadoop.fs.Path
> > sourceDir, HTable table) on a LoadIncrementalHFiles object.
> >
> > Best,
> > Sever
> >
> >
> > On Fri, Jul 27, 2012 at 5:40 AM, Anil Gupta <an...@gmail.com>
> wrote:
> >> Hi Sever,
> >>
> >> That's a very interesting thing. Which Hadoop and HBase versions are you
> >> using? I am going to run bulk loads tomorrow. If you can tell me which
> >> directories in HDFS you compared with /hbase/$table then I will try to
> >> check the same.
> >>
> >> Best Regards,
> >> Anil
> >>
> >> On Jul 26, 2012, at 3:46 PM, Sever Fundatureanu <
> fundatureanu.sever@gmail.com> wrote:
> >>
> >>> On Thu, Jul 26, 2012 at 6:47 PM, Sateesh Lakkarsu <la...@gmail.com>
> wrote:
> >>>>>
> >>>>>
> >>>>> For the bulkloading process, the HBase documentation mentions that in
> >>>>> a 2nd stage "the appropriate Region Server adopts the HFile, moving
> it
> >>>>> into its storage directory and making the data available to clients."
> >>>>> But from my experience the files also remain in the original location
> >>>>> from where they are "adopted". So I guess the data is actually copied
> >>>>> into the HBase directory right? This means that, compared to the
> >>>>> online importing, when bulk loading you essentially need twice the
> >>>>> disk space on HDFS, right?
> >>>>>
> >>>>
> >>>> Yes, if you are generating HFiles on one cluster and loading into a
> >>>> separate hbase cluster. If they are co-located, its just a hdfs mv.
> >>>
> >>> Hmm, both the HFile generation and the HBase cluster runs on top of
> >>> the same HDFS cluster. I did a "du" on both the source HDFS directory
> >>> and the destination "/hbase" directory and I got the same sizes (+-
> >>> few bytes). I deleted the source directory from HDFS and then scanned
> >>> the table without any problems. Maybe there is a config parameter I'm
> >>> missing?
> >>>
> >>> Sever
> >
> >
> >
> > --
> > Sever Fundatureanu
> >
> > Vrije Universiteit Amsterdam
> > E-mail: fundatureanu.sever@gmail.com
>
>
>
> --
> Sever Fundatureanu
>
> Vrije Universiteit Amsterdam
> E-mail: fundatureanu.sever@gmail.com
>



-- 
Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr

Re: Bulk loading disadvantages

Posted by Sever Fundatureanu <fu...@gmail.com>.
After digging a bit I've found that my problem comes from the following
lines in the Store class:

void bulkLoadHFile(String srcPathStr) throws IOException {
    Path srcPath = new Path(srcPathStr);

    // Move the file if it's on another filesystem
    FileSystem srcFs = srcPath.getFileSystem(conf);
    if (!srcFs.equals(fs)) {
      LOG.info("File " + srcPath + " on different filesystem than " +
          "destination store - moving to this filesystem.");
      Path tmpPath = getTmpPath();
      FileUtil.copy(srcFs, srcPath, fs, tmpPath, false, conf);
      LOG.info("Copied to temporary path on dst filesystem: " + tmpPath);
      srcPath = tmpPath;
    }
    // ... (rest of the method omitted)

The equality check for the two filesystems fails in my case and I get the following log:

2012-07-27 14:47:25,321 INFO
org.apache.hadoop.hbase.regionserver.Store: File
hdfs://fs0.cm.cluster:8020/user/sfu200/outputBsbm/string2Id/F/e6cf2d1b69354e268b79597bf3855357
on different filesystem than destination store - moving to this
filesystem.
2012-07-27 14:47:27,286 INFO
org.apache.hadoop.hbase.regionserver.Store: Copied to temporary path
on dst filesystem:
hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9
2012-07-27 14:47:27,286 DEBUG
org.apache.hadoop.hbase.regionserver.Store: Renaming bulk load file
hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9
to hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F/c4bbf70a6654422db81884f15f34c712
2012-07-27 14:47:27,297 INFO
org.apache.hadoop.hbase.regionserver.StoreFile: HFile Bloom filter
type for c4bbf70a6654422db81884f15f34c712: NONE, but ROW specified in
column family configuration
2012-07-27 14:47:27,297 INFO
org.apache.hadoop.hbase.regionserver.Store: Moved hfile
hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9
into store directory
hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F
- updating store file list.
2012-07-27 14:47:27,297 INFO
org.apache.hadoop.hbase.regionserver.Store: Successfully loaded store
file hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9
into store F (new location:
hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F/c4bbf70a6654422db81884f15f34c712)

In my hbase-site.xml I have:
<property>
      <name>hbase.rootdir</name>
      <value>hdfs://fs0.cm.cluster:8020/hbase</value>
      <description>The directory shared by RegionServers.
      </description>
    </property>

and in my hdfs-site.xml I have:
<property>
  <name>fs.default.name</name>
  <value>hdfs://fs0.cm.cluster:8020</value>
</property>
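Incidentally, in Hadoop 1.x the canonical home for fs.default.name is core-site.xml rather than hdfs-site.xml, and not every client process necessarily loads hdfs-site.xml. A sketch of the conventional placement (same value as above):

```xml
<!-- core-site.xml: conventional location for the default filesystem URI -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://fs0.cm.cluster:8020</value>
</property>
```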

As you can see, they point to the same namenode, so I really don't
understand why the above check fails.
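One possible explanation, sketched below: Hadoop's FileSystem does not override equals(), so the `!srcFs.equals(fs)` check above falls back to object identity. If the region server's cached FileSystem handle and the one resolved from the source path are distinct instances (e.g. obtained via different Configuration objects), the check fails even though both point at the same namenode. The toy class below is hypothetical (FakeFileSystem is not a Hadoop class); it only illustrates identity-equals versus URI-equals:

```java
import java.net.URI;

// Hypothetical stand-in for a filesystem handle; like Hadoop 1.0's FileSystem,
// it does NOT override equals(), so comparison falls back to object identity.
class FakeFileSystem {
    final URI uri;
    FakeFileSystem(String uri) { this.uri = URI.create(uri); }
}

public class FsEquals {
    public static void main(String[] args) {
        // Two handles pointing at the same namenode.
        FakeFileSystem src = new FakeFileSystem("hdfs://fs0.cm.cluster:8020");
        FakeFileSystem dst = new FakeFileSystem("hdfs://fs0.cm.cluster:8020");

        System.out.println(src.equals(dst));          // false: identity comparison
        System.out.println(src.uri.equals(dst.uri));  // true: same URI, same namenode
    }
}
```

A URI-level comparison along these lines, rather than handle equality, is what would let the rename path kick in here; until then, a workaround is to ensure the source path's FileSystem handle is the same cached instance the store uses.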

Regards,
Sever

On Fri, Jul 27, 2012 at 1:17 PM, Sever Fundatureanu
<fu...@gmail.com> wrote:
> Hi Anil,
>
> I am using HBase 0.94.0 with Hadoop 1.0.0. The directories are indeed
> the ones mentioned by Bijeet. I can also add that I am doing the 2nd
> stage programmatically by calling doBulkLoad(org.apache.hadoop.fs.Path
> sourceDir, HTable table) on a LoadIncrementalHFiles object.
>
> Best,
> Sever
>
>
> On Fri, Jul 27, 2012 at 5:40 AM, Anil Gupta <an...@gmail.com> wrote:
>> Hi Sever,
>>
>> That's a very interesting thing. Which Hadoop and HBase versions are you using? I am going to run bulk loads tomorrow. If you can tell me which directories in HDFS you compared with /hbase/$table then I will try to check the same.
>>
>> Best Regards,
>> Anil
>>
>> On Jul 26, 2012, at 3:46 PM, Sever Fundatureanu <fu...@gmail.com> wrote:
>>
>>> On Thu, Jul 26, 2012 at 6:47 PM, Sateesh Lakkarsu <la...@gmail.com> wrote:
>>>>>
>>>>>
>>>>> For the bulkloading process, the HBase documentation mentions that in
>>>>> a 2nd stage "the appropriate Region Server adopts the HFile, moving it
>>>>> into its storage directory and making the data available to clients."
>>>>> But from my experience the files also remain in the original location
>>>>> from where they are "adopted". So I guess the data is actually copied
>>>>> into the HBase directory right? This means that, compared to the
>>>>> online importing, when bulk loading you essentially need twice the
>>>>> disk space on HDFS, right?
>>>>>
>>>>
>>>> Yes, if you are generating HFiles on one cluster and loading into a
>>>> separate hbase cluster. If they are co-located, its just a hdfs mv.
>>>
>>> Hmm, both the HFile generation and the HBase cluster runs on top of
>>> the same HDFS cluster. I did a "du" on both the source HDFS directory
>>> and the destination "/hbase" directory and I got the same sizes (+-
>>> few bytes). I deleted the source directory from HDFS and then scanned
>>> the table without any problems. Maybe there is a config parameter I'm
>>> missing?
>>>
>>> Sever
>
>
>
> --
> Sever Fundatureanu
>
> Vrije Universiteit Amsterdam
> E-mail: fundatureanu.sever@gmail.com



-- 
Sever Fundatureanu

Vrije Universiteit Amsterdam
E-mail: fundatureanu.sever@gmail.com

Re: Bulk loading disadvantages

Posted by Sever Fundatureanu <fu...@gmail.com>.
Hi Anil,

I am using HBase 0.94.0 with Hadoop 1.0.0. The directories are indeed
the ones mentioned by Bijeet. I can also add that I am doing the 2nd
stage programmatically by calling doBulkLoad(org.apache.hadoop.fs.Path
sourceDir, HTable table) on a LoadIncrementalHFiles object.

Best,
Sever


On Fri, Jul 27, 2012 at 5:40 AM, Anil Gupta <an...@gmail.com> wrote:
> Hi Sever,
>
> That's a very interesting thing. Which Hadoop and HBase versions are you using? I am going to run bulk loads tomorrow. If you can tell me which directories in HDFS you compared with /hbase/$table then I will try to check the same.
>
> Best Regards,
> Anil
>
> On Jul 26, 2012, at 3:46 PM, Sever Fundatureanu <fu...@gmail.com> wrote:
>
>> On Thu, Jul 26, 2012 at 6:47 PM, Sateesh Lakkarsu <la...@gmail.com> wrote:
>>>>
>>>>
>>>> For the bulkloading process, the HBase documentation mentions that in
>>>> a 2nd stage "the appropriate Region Server adopts the HFile, moving it
>>>> into its storage directory and making the data available to clients."
>>>> But from my experience the files also remain in the original location
>>>> from where they are "adopted". So I guess the data is actually copied
>>>> into the HBase directory right? This means that, compared to the
>>>> online importing, when bulk loading you essentially need twice the
>>>> disk space on HDFS, right?
>>>>
>>>
>>> Yes, if you are generating HFiles on one cluster and loading into a
>>> separate hbase cluster. If they are co-located, its just a hdfs mv.
>>
>> Hmm, both the HFile generation and the HBase cluster runs on top of
>> the same HDFS cluster. I did a "du" on both the source HDFS directory
>> and the destination "/hbase" directory and I got the same sizes (+-
>> few bytes). I deleted the source directory from HDFS and then scanned
>> the table without any problems. Maybe there is a config parameter I'm
>> missing?
>>
>> Sever



-- 
Sever Fundatureanu

Vrije Universiteit Amsterdam
E-mail: fundatureanu.sever@gmail.com

Re: Bulk loading disadvantages

Posted by Bijeet Singh <bi...@gmail.com>.
Anil,

The two directories in question here are -

  1.  the HDFS location where the MapReduce job creates the HFiles
  2.  the directory pointed to by hbase.rootdir in your HBase configuration -
      the default value is /hbase. Inside the HBase root directory, there are
      per-table subdirectories.

So for the kind of comparison that you mentioned, you need to look in the
directory <hbase.rootdir>/<table-name> and the
directory where you are creating the HFiles.

Bijeet



On Fri, Jul 27, 2012 at 9:10 AM, Anil Gupta <an...@gmail.com> wrote:

> Hi Sever,
>
> That's a very interesting thing. Which Hadoop and HBase versions are you
> using? I am going to run bulk loads tomorrow. If you can tell me which
> directories in HDFS you compared with /hbase/$table then I will try to
> check the same.
>
> Best Regards,
> Anil
>
> On Jul 26, 2012, at 3:46 PM, Sever Fundatureanu <
> fundatureanu.sever@gmail.com> wrote:
>
> > On Thu, Jul 26, 2012 at 6:47 PM, Sateesh Lakkarsu <la...@gmail.com>
> wrote:
> >>>
> >>>
> >>> For the bulkloading process, the HBase documentation mentions that in
> >>> a 2nd stage "the appropriate Region Server adopts the HFile, moving it
> >>> into its storage directory and making the data available to clients."
> >>> But from my experience the files also remain in the original location
> >>> from where they are "adopted". So I guess the data is actually copied
> >>> into the HBase directory right? This means that, compared to the
> >>> online importing, when bulk loading you essentially need twice the
> >>> disk space on HDFS, right?
> >>>
> >>
> >> Yes, if you are generating HFiles on one cluster and loading into a
> >> separate hbase cluster. If they are co-located, its just a hdfs mv.
> >
> > Hmm, both the HFile generation and the HBase cluster runs on top of
> > the same HDFS cluster. I did a "du" on both the source HDFS directory
> > and the destination "/hbase" directory and I got the same sizes (+-
> > few bytes). I deleted the source directory from HDFS and then scanned
> > the table without any problems. Maybe there is a config parameter I'm
> > missing?
> >
> > Sever
>

Re: Bulk loading disadvantages

Posted by Anil Gupta <an...@gmail.com>.
Hi Sever,

That's a very interesting thing. Which Hadoop and HBase versions are you using? I am going to run bulk loads tomorrow. If you can tell me which directories in HDFS you compared with /hbase/$table then I will try to check the same.

Best Regards,
Anil

On Jul 26, 2012, at 3:46 PM, Sever Fundatureanu <fu...@gmail.com> wrote:

> On Thu, Jul 26, 2012 at 6:47 PM, Sateesh Lakkarsu <la...@gmail.com> wrote:
>>> 
>>> 
>>> For the bulkloading process, the HBase documentation mentions that in
>>> a 2nd stage "the appropriate Region Server adopts the HFile, moving it
>>> into its storage directory and making the data available to clients."
>>> But from my experience the files also remain in the original location
>>> from where they are "adopted". So I guess the data is actually copied
>>> into the HBase directory right? This means that, compared to the
>>> online importing, when bulk loading you essentially need twice the
>>> disk space on HDFS, right?
>>> 
>> 
>> Yes, if you are generating HFiles on one cluster and loading into a
>> separate hbase cluster. If they are co-located, its just a hdfs mv.
> 
> Hmm, both the HFile generation and the HBase cluster runs on top of
> the same HDFS cluster. I did a "du" on both the source HDFS directory
> and the destination "/hbase" directory and I got the same sizes (+-
> few bytes). I deleted the source directory from HDFS and then scanned
> the table without any problems. Maybe there is a config parameter I'm
> missing?
> 
> Sever

Re: Bulk loading disadvantages

Posted by Sever Fundatureanu <fu...@gmail.com>.
On Thu, Jul 26, 2012 at 6:47 PM, Sateesh Lakkarsu <la...@gmail.com> wrote:
>>
>>
>> For the bulkloading process, the HBase documentation mentions that in
>> a 2nd stage "the appropriate Region Server adopts the HFile, moving it
>> into its storage directory and making the data available to clients."
>> But from my experience the files also remain in the original location
>> from where they are "adopted". So I guess the data is actually copied
>> into the HBase directory right? This means that, compared to the
>> online importing, when bulk loading you essentially need twice the
>> disk space on HDFS, right?
>>
>
> Yes, if you are generating HFiles on one cluster and loading into a
> separate hbase cluster. If they are co-located, its just a hdfs mv.

Hmm, both the HFile generation job and the HBase cluster run on top of
the same HDFS cluster. I did a "du" on both the source HDFS directory
and the destination "/hbase" directory and I got the same sizes (give
or take a few bytes). I deleted the source directory from HDFS and
then scanned the table without any problems. Maybe there is a config
parameter I'm missing?

Sever

Re: Bulk loading disadvantages

Posted by Sateesh Lakkarsu <la...@gmail.com>.
>
>
> For the bulkloading process, the HBase documentation mentions that in
> a 2nd stage "the appropriate Region Server adopts the HFile, moving it
> into its storage directory and making the data available to clients."
> But from my experience the files also remain in the original location
> from where they are "adopted". So I guess the data is actually copied
> into the HBase directory right? This means that, compared to the
> online importing, when bulk loading you essentially need twice the
> disk space on HDFS, right?
>

Yes, if you are generating HFiles on one cluster and loading them into a
separate HBase cluster. If they are co-located, it's just an HDFS mv.
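The co-located versus cross-cluster distinction can be illustrated with a local-filesystem sketch (java.nio standing in for HDFS; the paths are made up): a move is a metadata-only rename that keeps a single copy of the data, while a copy leaves the bytes in both places and doubles the space used.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class MoveVsCopy {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("bulkload");
        Path src = Files.write(dir.resolve("hfile"), new byte[1024]);

        // Cross-filesystem case: the HFile is copied, so the bytes exist twice.
        Path copied = Files.copy(src, dir.resolve("copied"));
        System.out.println(Files.exists(src) && Files.exists(copied)); // true

        // Co-located case: a move is just a rename; only one copy remains.
        Path moved = Files.move(src, dir.resolve("moved"));
        System.out.println(Files.exists(src));   // false
        System.out.println(Files.exists(moved)); // true
    }
}
```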

> Another problem is with data locality immediately after bulk loading
> through MR. I understand that the locality is obtained in time through
> compactions and splits. However you don't get this problem while
> importing online, right?

Yes