Posted to user@hbase.apache.org by Ophir Cohen <op...@gmail.com> on 2011/08/11 10:08:47 UTC

Bulk upload

Hi,
I started to use bulk upload and encountered a strange problem.
I'm using Cloudera cdh3-u1.

I'm using HFileOutputFormat.configureIncrementalLoad() to configure my job.
This method creates a partition file for the TotalOrderPartitioner and saves
it to HDFS.

When the TotalOrderPartitioner is initialized, it tries to find the path to
the file in the configuration:
public static String getPartitionFile(Configuration conf) {
  return conf.get(PARTITIONER_PATH, DEFAULT_PATH);
}

The strange thing is that this parameter is never assigned!
It looks to me like it should be set by
HFileOutputFormat.configureIncrementalLoad(), but it is not!

Then it falls back to the default ("_part" or something similar) and, of
course, does not find it...

BTW: when I manually set this parameter, it works great.
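A minimal sketch of the lookup and the manual workaround described above, using a plain Map in place of Hadoop's Configuration class. The property key and default value below are modeled on the TotalOrderPartitioner source and should be treated as assumptions; the exact names differ between Hadoop versions, and the HDFS path is purely hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the getPartitionFile() lookup quoted above; a plain
// Map stands in for org.apache.hadoop.conf.Configuration. The key and
// default are assumptions modeled on the TotalOrderPartitioner source.
public class PartitionFileLookup {
    static final String PARTITIONER_PATH = "total.order.partitioner.path";
    static final String DEFAULT_PATH = "_partition.lst";

    static String getPartitionFile(Map<String, String> conf) {
        // Configuration.get(key, default): falls back to the default
        // relative path when the property was never assigned.
        return conf.getOrDefault(PARTITIONER_PATH, DEFAULT_PATH);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Nothing set the property, so the partitioner looks for the
        // default relative path and never finds the real file on HDFS.
        System.out.println(getPartitionFile(conf));

        // The manual workaround from this thread: assign the property
        // yourself before the job runs (the path here is hypothetical).
        conf.put(PARTITIONER_PATH, "hdfs:///tmp/partitions_1234");
        System.out.println(getPartitionFile(conf));
    }
}
```

The fallback is why the failure is silent: the get() call always returns a usable-looking string, so nothing complains until the partitioner tries to open the file.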

Is that a bug, or am I missing something?
Thanks,
Ophir

Re: Bulk upload

Posted by Ophir Cohen <op...@gmail.com>.
Thanks for the answer - it's exactly what I encountered...
It looks like it still exists in cdh3-u1...
Ophir

On Tue, Aug 16, 2011 at 02:07, Jean-Daniel Cryans <jd...@apache.org> wrote:

> From this jira it was fixed in 0.21.0:
> https://issues.apache.org/jira/browse/MAPREDUCE-476
>
> I know CDH has it patched in, not sure about the others.
>
> J-D

Re: Bulk upload

Posted by Jean-Daniel Cryans <jd...@apache.org>.
From this JIRA, it was fixed in 0.21.0:
https://issues.apache.org/jira/browse/MAPREDUCE-476

I know CDH has it patched in, not sure about the others.

J-D

On Thu, Aug 11, 2011 at 1:28 AM, Ophir Cohen <op...@gmail.com> wrote:
> I did some more tests and found the problem: in a local run the distributed
> cache does not work.
>
> On full cluster it works.
> Sorry for your time...
> Ophir
>
> PS
> Is there any way to use the distributed cache locally as well (i.e., when I'm
> running MR from IntelliJ IDEA)?

Re: Bulk upload

Posted by Ophir Cohen <op...@gmail.com>.
I did some more tests and found the problem: in a local run the distributed
cache does not work.

On full cluster it works.
Sorry for your time...
Ophir

PS:
Is there any way to use the distributed cache locally as well (i.e., when I'm
running MR from IntelliJ IDEA)?
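For the PS above, one common workaround is to detect local mode and read the side file straight off the local filesystem instead of going through the distributed cache. A sketch, again modeling Configuration with a plain Map; the "mapred.job.tracker" property and its "local" value are assumptions from the classic mapred API of that era (newer Hadoop uses "mapreduce.framework.name" instead):

```java
import java.util.Map;

// Sketch of a local-mode check for the LocalJobRunner case described in
// the PS. A plain Map stands in for org.apache.hadoop.conf.Configuration;
// the property name and "local" default mirror the classic mapred API
// and are assumptions, not verified against every Hadoop version.
public class LocalModeCheck {
    static boolean isLocalMode(Map<String, String> conf) {
        return "local".equals(conf.getOrDefault("mapred.job.tracker", "local"));
    }
    // In a task's setup() one could then branch: open the partition file
    // directly from the local filesystem when isLocalMode(...) is true,
    // and resolve it through the distributed cache otherwise.
}
```

That keeps runs launched from an IDE working without changing the cluster code path.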

On Thu, Aug 11, 2011 at 11:20, Ophir Cohen <op...@gmail.com> wrote:

> Now I see that it uses the distributed cache - but for some reason
> the TotalOrderPartitioner does not grab it.
> Ophir

Re: Bulk upload

Posted by Ophir Cohen <op...@gmail.com>.
Now I see that it uses the distributed cache - but for some reason
the TotalOrderPartitioner does not grab it.
Ophir
