Posted to user@hbase.apache.org by Stuti Awasthi <st...@hcl.com> on 2011/11/10 08:31:50 UTC

MR - Input from Hbase output to HDFS

Hi
Currently I am understanding HBase MapReduce support. I followed http://hbase.apache.org/book/mapreduce.example.html and executed it successfully.
But I am not sure what changes need to be made to an MR job so that it takes its input from an HBase table and writes its output to HDFS.

How do I set the output directory? I tried to set it with JobConf, but it gives me an error that the output directory is not set.
Please suggest.

Regards,
Stuti Awasthi
HCL Comnet Systems and Services Ltd
F-8/9 Basement, Sec-3,Noida.


________________________________
::DISCLAIMER::
-----------------------------------------------------------------------------------------------------------------------

The contents of this e-mail and any attachment(s) are confidential and intended for the named recipient(s) only.
It shall not attach any liability on the originator or HCL or its affiliates. Any views or opinions presented in
this email are solely those of the author and may not necessarily reflect the opinions of HCL or its affiliates.
Any form of reproduction, dissemination, copying, disclosure, modification, distribution and / or publication of
this message without the prior written consent of the author of this e-mail is strictly prohibited. If you have
received this email in error please delete it and notify the sender immediately. Before opening any mail and
attachments please check them for viruses and defect.

-----------------------------------------------------------------------------------------------------------------------

Re: MR - Input from Hbase output to HDFS

Posted by Harsh J <ha...@cloudera.com>.
When using HBase, consider using the new API primarily.

The mapred.* package upstream in Hadoop is not deprecated anymore, however.
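For reference, the two API generations Harsh mentions live in different packages, and mixing them is what produces the compile errors discussed in this thread. A quick side-by-side (package names only; both are real Hadoop classes):

```java
// Old ("mapred") API - driven by JobConf/JobClient:
//   org.apache.hadoop.mapred.FileInputFormat
//   org.apache.hadoop.mapred.FileOutputFormat
//
// New ("mapreduce") API - driven by Job, and the one that HBase's
// TableMapReduceUtil expects:
//   org.apache.hadoop.mapreduce.lib.input.FileInputFormat
//   org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
```

The classes share names, so which API you get is determined entirely by which import you pick.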

On 22-Nov-2011, at 1:21 AM, Denis Kreis wrote:

> Hi
> 
> Is org.apache.hadoop.mapred.FileInputFormat to be considered
> as obsolete/deprecated?
> 
> Thanks!
> 
> 2011/11/15 Stuti Awasthi <st...@hcl.com>
> 
>> Sure Doug,
>> Thanks
>> 
>> -----Original Message-----
>> From: Doug Meil [mailto:doug.meil@explorysmedical.com]
>> Sent: Monday, November 14, 2011 9:08 PM
>> To: user@hbase.apache.org
>> Subject: Re: MR - Input from Hbase output to HDFS
>> 
>> 
>> Glad you worked through that and everything is working.  I will add an
>> example of HBase-MapReduce-to-HDFS in the book.
>> 
>> 
>> 
>> 
>> 
>> On 11/14/11 1:24 AM, "Stuti Awasthi" <st...@hcl.com> wrote:
>> 
>>> Hi,
>>> I think the issue is with the filesystem configuration: the config is
>>> picking up the HBaseConfiguration. When I changed my output directory
>>> path to an absolute HDFS path:
>>> FileOutputFormat.setOutputPath(job, new
>>> Path("hdfs://master:54310/MR/stuti3"));
>>>
>>> the MR job runs successfully and I can see the stuti3 directory in
>>> HDFS at the desired path.
>>> 
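An alternative to hard-coding the `hdfs://` prefix into every path is to make HDFS the job's default filesystem. A sketch, assuming the NameNode URI `hdfs://master:54310` taken from the message above (adjust for your cluster):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Sketch: make HDFS the default filesystem for the job, so that a bare
// Path("/stuti2") resolves against HDFS rather than the local
// filesystem of the machine submitting the job.
// "fs.default.name" is the 0.20.x-era property name; the NameNode URI
// is an assumption based on the hdfs://master:54310 path in this thread.
Configuration config = HBaseConfiguration.create();
config.set("fs.default.name", "hdfs://master:54310");
```

This is a configuration fragment, not a complete program; the rest of the job setup is unchanged.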
>>> 
>>> -----Original Message-----
>>> From: Stuti Awasthi
>>> Sent: Monday, November 14, 2011 11:40 AM
>>> To: user@hbase.apache.org
>>> Subject: RE: MR - Input from Hbase output to HDFS
>>> 
>>> Hi Joey,
>>> Thanks for pointing this out. After importing "FileOutputFormat" as you
>>> suggested, I am able to run the MR job from Eclipse (Windows). The only
>>> problem is that I cannot see the output directory this code creates.
>>> HDFS and HBase are on a Linux machine.
>>> 
>>> Code:
>>>     Configuration config = HBaseConfiguration.create();
>>>     config.set("hbase.zookeeper.quorum", "master");
>>>     config.set("hbase.zookeeper.property.clientPort", "2181");
>>>
>>>     Job job = new Job(config, "Hbase_Read_Write");
>>>     job.setJarByClass(ReadWriteDriver.class);
>>>     Scan scan = new Scan();
>>>     scan.setCaching(500);
>>>     scan.setCacheBlocks(false);
>>>     TableMapReduceUtil.initTableMapperJob("users", scan,
>>>         ReadWriteMapper.class, Text.class, IntWritable.class, job);
>>>     job.setOutputFormatClass(TextOutputFormat.class);
>>>     FileOutputFormat.setOutputPath(job, new Path("/stuti2"));
>>> 
>>> After executing this code, the MR job runs successfully, but when I
>>> look in HDFS no "/stuti2" directory has been created. I also looked in
>>> the local filesystem of both the Linux machine and the Windows machine,
>>> but cannot find the output folder anywhere.
>>> 
>>> Eclipse console Output :
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:java.version=1.6.0_27
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:java.vendor=Sun Microsystems Inc.
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:java.home=C:\Program Files\Java\jdk1.6.0_27\jre
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:java.class.path=D:\workspace\Hbase\MRHbaseReadWrite\bin;D:\
>>> wor
>>> kspace\Hbase\MRHbaseReadWrite\lib\commons-cli-1.2.jar;D:\workspace\Hbas
>>> e\M
>>> RHbaseReadWrite\lib\commons-httpclient-3.0.1.jar;D:\workspace\Hbase\MRH
>>> bas
>>> eReadWrite\lib\commons-logging-1.0.4.jar;D:\workspace\Hbase\MRHbaseRead
>>> Wri
>>> te\lib\hadoop-0.20.2-core.jar;D:\workspace\Hbase\MRHbaseReadWrite\lib\h
>>> bas
>>> e-0.90.3.jar;D:\workspace\Hbase\MRHbaseReadWrite\lib\log4j-1.2.15.jar;D
>>> :\w orkspace\Hbase\MRHbaseReadWrite\lib\zookeeper-3.3.2.jar
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:java.library.path=C:\Program
>>> Files\Java\jdk1.6.0_27\jre\bin;C:\Windows\Sun\Java\bin;C:\Windows\syste
>>> m32 ;C:\Windows;C:/Program Files/Java/jre6/bin/client;C:/Program
>>> Files/Java/jre6/bin;C:/Program
>>> Files/Java/jre6/lib/i386;C:\Windows\system32;C:\Windows;C:\Windows\Syst
>>> em3 2\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program
>>> Files\Java\jdk1.6.0_27;C:\Program
>>> Files\TortoiseSVN\bin;C:\cygwin\bin;D:\apache-maven-3.0.3\bin;D:\eclips
>>> e;;
>>> .
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:java.io.tmpdir=C:\Users\STUTIA~1\AppData\Local\Temp\
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:java.compiler=<NA>
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:os.name=Windows 7
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:os.arch=x86
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:os.version=6.1
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:user.name=stutiawasthi
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:user.home=C:\Users\stutiawasthi
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:user.dir=D:\workspace\Hbase\MRHbaseReadWrite
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Initiating client
>>> connection,
>>> connectString=master:2181 sessionTimeout=180000 watcher=hconnection
>>> 11/11/14 11:21:45 INFO zookeeper.ClientCnxn: Opening socket connection
>>> to server master/10.33.64.235:2181
>>> 11/11/14 11:21:45 INFO zookeeper.ClientCnxn: Socket connection
>>> established to master/10.33.64.235:2181, initiating session
>>> 11/11/14 11:21:45 INFO zookeeper.ClientCnxn: Session establishment
>>> complete on server master/10.33.64.235:2181, sessionid =
>>> 0x33879243de00ec, negotiated timeout = 180000
>>> 11/11/14 11:21:46 INFO mapred.JobClient: Running job: job_local_0001
>>> 11/11/14 11:21:46 INFO zookeeper.ZooKeeper: Initiating client
>>> connection,
>>> connectString=master:2181 sessionTimeout=180000 watcher=hconnection
>>> 11/11/14 11:21:46 INFO zookeeper.ClientCnxn: Opening socket connection
>>> to server master/10.33.64.235:2181
>>> 11/11/14 11:21:46 INFO zookeeper.ClientCnxn: Socket connection
>>> established to master/10.33.64.235:2181, initiating session
>>> 11/11/14 11:21:46 INFO zookeeper.ClientCnxn: Session establishment
>>> complete on server master/10.33.64.235:2181, sessionid =
>>> 0x33879243de00ed, negotiated timeout = 180000
>>> 11/11/14 11:21:46 INFO zookeeper.ZooKeeper: Initiating client
>>> connection,
>>> connectString=master:2181 sessionTimeout=180000 watcher=hconnection
>>> 11/11/14 11:21:46 INFO zookeeper.ClientCnxn: Opening socket connection
>>> to server master/10.33.64.235:2181
>>> 11/11/14 11:21:46 INFO zookeeper.ClientCnxn: Socket connection
>>> established to master/10.33.64.235:2181, initiating session
>>> 11/11/14 11:21:46 INFO zookeeper.ClientCnxn: Session establishment
>>> complete on server master/10.33.64.235:2181, sessionid =
>>> 0x33879243de00ee, negotiated timeout = 180000
>>> 11/11/14 11:21:46 INFO mapred.MapTask: io.sort.mb = 100
>>> 11/11/14 11:21:46 INFO mapred.MapTask: data buffer = 79691776/99614720
>>> 11/11/14 11:21:46 INFO mapred.MapTask: record buffer = 262144/327680
>>> ...............................................
>>> 11/11/14 11:21:46 INFO mapred.MapTask: Finished spill 0
>>> 11/11/14 11:21:46 INFO mapred.TaskRunner:
>>> Task:attempt_local_0001_m_000000_0 is done. And is in the process of
>>> commiting
>>> 11/11/14 11:21:46 INFO mapred.LocalJobRunner:
>>> 11/11/14 11:21:46 INFO mapred.TaskRunner: Task
>>> 'attempt_local_0001_m_000000_0' done.
>>> 11/11/14 11:21:46 INFO mapred.LocalJobRunner:
>>> 11/11/14 11:21:46 INFO mapred.Merger: Merging 1 sorted segments
>>> 11/11/14 11:21:46 INFO mapred.Merger: Down to the last merge-pass, with
>>> 1 segments left of total size: 103 bytes
>>> 11/11/14 11:21:46 INFO mapred.LocalJobRunner:
>>> 11/11/14 11:21:46 INFO mapred.TaskRunner:
>>> Task:attempt_local_0001_r_000000_0 is done. And is in the process of
>>> commiting
>>> 11/11/14 11:21:46 INFO mapred.LocalJobRunner:
>>> 11/11/14 11:21:46 INFO mapred.TaskRunner: Task
>>> attempt_local_0001_r_000000_0 is allowed to commit now
>>> 11/11/14 11:21:46 INFO output.FileOutputCommitter: Saved output of task
>>> 'attempt_local_0001_r_000000_0' to /stuti2
>>> 11/11/14 11:21:46 INFO mapred.LocalJobRunner: reduce > reduce
>>> 11/11/14 11:21:46 INFO mapred.TaskRunner: Task
>>> 'attempt_local_0001_r_000000_0' done.
>>> 11/11/14 11:21:47 INFO mapred.JobClient:  map 100% reduce 100%
>>> 11/11/14 11:21:47 INFO mapred.JobClient: Job complete: job_local_0001
>>> 11/11/14 11:21:47 INFO mapred.JobClient: Counters: 12
>>> 11/11/14 11:21:47 INFO mapred.JobClient:   FileSystemCounters
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     FILE_BYTES_READ=40923
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=82343
>>> 11/11/14 11:21:47 INFO mapred.JobClient:   Map-Reduce Framework
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     Reduce input groups=5
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     Combine output records=0
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     Map input records=5
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     Reduce shuffle bytes=0
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     Reduce output records=5
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     Spilled Records=10
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     Map output bytes=91
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     Combine input records=0
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     Map output records=5
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     Reduce input records=5
>>> 
>>> 
>>> Please Suggest
>>> 
>>> -----Original Message-----
>>> From: Joey Echeverria [mailto:joey@cloudera.com]
>>> Sent: Friday, November 11, 2011 10:38 PM
>>> To: user@hbase.apache.org
>>> Subject: Re: MR - Input from Hbase output to HDFS
>>> 
>>> There are two APIs (old and new), and you appear to be mixing them.
>>> TableMapReduceUtil only works with the new API. The solution is to
>>> import the new version of FileOutputFormat which takes a Job:
>>> 
>>> 
>>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>>> 
>>> -Joey
>>> 
>>> On Fri, Nov 11, 2011 at 12:55 AM, Stuti Awasthi <st...@hcl.com>
>>> wrote:
>>>> The method "setOutputPath(JobConf, Path)" takes a JobConf as a
>>>> parameter, not the Job object.
>>>> At least that is the error I'm getting while compiling against the
>>>> Hadoop 0.20.2 jar in Eclipse.
>>>> 
>>>> FileOutputFormat.setOutputPath(conf, new Path("/output"));
>>>> 
>>>> -----Original Message-----
>>>> From: Prashant Sharma [mailto:prashant.iiith@gmail.com]
>>>> Sent: Friday, November 11, 2011 11:20 AM
>>>> To: user@hbase.apache.org
>>>> Subject: Re: MR - Input from Hbase output to HDFS
>>>> 
>>>> Hi stuti,
>>>> I was wondering why you are not using the job object to set the
>>>> output path, like this:
>>>> 
>>>> FileOutputFormat.setOutputPath(job, new Path("outputReadWrite") );
>>>> 
>>>> 
>>>> thanks
>>>> 
>>>> On Fri, Nov 11, 2011 at 10:43 AM, Stuti Awasthi
>>>> <st...@hcl.com>wrote:
>>>> 
>>>>> Hi Andrei,
>>>>> Well, I am a bit confused. When I use JobConf and associate it with
>>>>> JobClient to run the job, I get the error that the "Input directory
>>>>> is not set".
>>>>> Since I want my input to come from an HBase table, which I already
>>>>> configured with "TableMapReduceUtil.initTableMapperJob", I don't want
>>>>> to set an input directory via JobConf.
>>>>> How do I combine the two so that I can read input from HBase and
>>>>> write output to HDFS?
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Andrei Cojocaru [mailto:majormax@gmail.com]
>>>>> Sent: Thursday, November 10, 2011 7:09 PM
>>>>> To: user@hbase.apache.org
>>>>> Subject: Re: MR - Input from Hbase output to HDFS
>>>>> 
>>>>> Stuti,
>>>>> 
>>>>> I don't see you associating JobConf with Job anywhere.
>>>>> -Andrei
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Joseph Echeverria
>>> Cloudera, Inc.
>>> 443.305.9434
>>> 
>> 
>> 
>> 
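Putting the thread's resolution together, here is a driver sketch that uses only the new (`org.apache.hadoop.mapreduce`) API: the new-API `FileOutputFormat` that Joey points to, plus the absolute HDFS output path that made Stuti's job work. The class and mapper names (`ReadWriteDriver`, `ReadWriteMapper`) and the connection settings are taken from the code in this thread; `ReadWriteMapper` itself is not shown here, and the whole thing should be treated as an untested sketch against Hadoop 0.20.x / HBase 0.90.x:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; // new API: takes a Job
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ReadWriteDriver {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        config.set("hbase.zookeeper.quorum", "master");
        config.set("hbase.zookeeper.property.clientPort", "2181");

        Job job = new Job(config, "Hbase_Read_Write");
        job.setJarByClass(ReadWriteDriver.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // larger scanner caching for MR scans
        scan.setCacheBlocks(false);  // don't pollute the region server block cache

        // HBase is the input side: no input directory is set at all.
        TableMapReduceUtil.initTableMapperJob("users", scan,
                ReadWriteMapper.class, Text.class, IntWritable.class, job);

        // HDFS is the output side: an absolute hdfs:// path ensures the
        // output cannot silently land on the local filesystem.
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job,
                new Path("hdfs://master:54310/MR/stuti3"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note there is no JobConf or JobClient anywhere: the "Input/Output directory is not set" errors in this thread came from mixing the old-API `org.apache.hadoop.mapred` classes into a job configured through `Job`.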


Re: MR - Input from Hbase output to HDFS

Posted by Denis Kreis <de...@gmail.com>.
Hi

Is org.apache.hadoop.mapred.FileInputFormat to be considered
as obsolete/deprecated?

Thanks!


RE: MR - Input from Hbase output to HDFS

Posted by Stuti Awasthi <st...@hcl.com>.
Sure Doug,
Thanks

>11/11/14 11:21:46 INFO mapred.Merger: Merging 1 sorted segments
>11/11/14 11:21:46 INFO mapred.Merger: Down to the last merge-pass, with 
>1 segments left of total size: 103 bytes
>11/11/14 11:21:46 INFO mapred.LocalJobRunner:
>11/11/14 11:21:46 INFO mapred.TaskRunner:
>Task:attempt_local_0001_r_000000_0 is done. And is in the process of 
>commiting
>11/11/14 11:21:46 INFO mapred.LocalJobRunner:
>11/11/14 11:21:46 INFO mapred.TaskRunner: Task
>attempt_local_0001_r_000000_0 is allowed to commit now
>11/11/14 11:21:46 INFO output.FileOutputCommitter: Saved output of task 
>'attempt_local_0001_r_000000_0' to /stuti2
>11/11/14 11:21:46 INFO mapred.LocalJobRunner: reduce > reduce
>11/11/14 11:21:46 INFO mapred.TaskRunner: Task 
>'attempt_local_0001_r_000000_0' done.
>11/11/14 11:21:47 INFO mapred.JobClient:  map 100% reduce 100%
>11/11/14 11:21:47 INFO mapred.JobClient: Job complete: job_local_0001
>11/11/14 11:21:47 INFO mapred.JobClient: Counters: 12
>11/11/14 11:21:47 INFO mapred.JobClient:   FileSystemCounters
>11/11/14 11:21:47 INFO mapred.JobClient:     FILE_BYTES_READ=40923
>11/11/14 11:21:47 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=82343
>11/11/14 11:21:47 INFO mapred.JobClient:   Map-Reduce Framework
>11/11/14 11:21:47 INFO mapred.JobClient:     Reduce input groups=5
>11/11/14 11:21:47 INFO mapred.JobClient:     Combine output records=0
>11/11/14 11:21:47 INFO mapred.JobClient:     Map input records=5
>11/11/14 11:21:47 INFO mapred.JobClient:     Reduce shuffle bytes=0
>11/11/14 11:21:47 INFO mapred.JobClient:     Reduce output records=5
>11/11/14 11:21:47 INFO mapred.JobClient:     Spilled Records=10
>11/11/14 11:21:47 INFO mapred.JobClient:     Map output bytes=91
>11/11/14 11:21:47 INFO mapred.JobClient:     Combine input records=0
>11/11/14 11:21:47 INFO mapred.JobClient:     Map output records=5
>11/11/14 11:21:47 INFO mapred.JobClient:     Reduce input records=5
>
>
>Please Suggest
>
>-----Original Message-----
>From: Joey Echeverria [mailto:joey@cloudera.com]
>Sent: Friday, November 11, 2011 10:38 PM
>To: user@hbase.apache.org
>Subject: Re: MR - Input from Hbase output to HDFS
>
>There are two APIs (old and new), and you appear to be mixing them.
>TableMapReduceUtil only works with the new API. The solution is to 
>import the new version of FileOutputFormat which takes a Job:
>
>
>import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>
>-Joey
>
>On Fri, Nov 11, 2011 at 12:55 AM, Stuti Awasthi <st...@hcl.com>
>wrote:
>> The method "setOutputPath(JobConf, Path)" takes a JobConf as a
>> parameter, not the Job object.
>> At least this is the error I'm getting while compiling against the
>> Hadoop 0.20.2 jar in Eclipse.
>>
>> FileOutputFormat.setOutputPath(conf, new Path("/output"));
>>
>> -----Original Message-----
>> From: Prashant Sharma [mailto:prashant.iiith@gmail.com]
>> Sent: Friday, November 11, 2011 11:20 AM
>> To: user@hbase.apache.org
>> Subject: Re: MR - Input from Hbase output to HDFS
>>
>> Hi stuti,
>> I was wondering why you are not using the job object to set the output
>> path like this.
>>
>> FileOutputFormat.setOutputPath(job, new Path("outputReadWrite") );
>>
>>
>> thanks
>>
>> On Fri, Nov 11, 2011 at 10:43 AM, Stuti Awasthi
>><st...@hcl.com>wrote:
>>
>>> Hi Andrei,
>>> Well, I am a bit confused. When I use JobConf and associate it with
>>> JobClient to run the job, I get the error that "Input directory is
>>> not set".
>>> I want my input to come from the HBase table, which I already
>>> configured with "TableMapReduceUtil.initTableMapperJob"; I don't want
>>> to set an input directory via JobConf.
>>> How do I mix these two so that I can get input from HBase and write
>>> output to HDFS?
>>>
>>> Thanks
>>>
>>> -----Original Message-----
>>> From: Andrei Cojocaru [mailto:majormax@gmail.com]
>>> Sent: Thursday, November 10, 2011 7:09 PM
>>> To: user@hbase.apache.org
>>> Subject: Re: MR - Input from Hbase output to HDFS
>>>
>>> Stuti,
>>>
>>> I don't see you associating JobConf with Job anywhere.
>>> -Andrei
>>>
>>> ::DISCLAIMER::
>>>
>>> -----------------------------------------------------------------------------------------------------------------------
>>>
>>> The contents of this e-mail and any attachment(s) are confidential 
>>> and intended for the named recipient(s) only.
>>> It shall not attach any liability on the originator or HCL or its 
>>> affiliates. Any views or opinions presented in this email are solely 
>>> those of the author and may not necessarily reflect the opinions of 
>>> HCL or its affiliates.
>>> Any form of reproduction, dissemination, copying, disclosure, 
>>> modification, distribution and / or publication of this message 
>>> without the prior written consent of the author of this e-mail is 
>>> strictly prohibited. If you have received this email in error please 
>>> delete it and notify the sender immediately. Before opening any mail 
>>> and attachments please check them for viruses and defect.
>>>
>>>
>>> -----------------------------------------------------------------------------------------------------------------------
>>>
>>
>
>
>
>--
>Joseph Echeverria
>Cloudera, Inc.
>443.305.9434
>



Re: MR - Input from Hbase output to HDFS

Posted by Doug Meil <do...@explorysmedical.com>.
Glad you worked through that and everything is working. I will add an
Hbase-to-HDFS MR example to the book.





On 11/14/11 1:24 AM, "Stuti Awasthi" <st...@hcl.com> wrote:

>Hi,
>I think the issue was with the filesystem configuration: the config object
>is an HBaseConfiguration, so the job was not picking up the HDFS settings.
>When I changed my output directory to an absolute HDFS path:
>FileOutputFormat.setOutputPath(job, new Path("hdfs://master:54310/MR/stuti3"));
>
>the MR job runs successfully and I am able to see the stuti3 directory
>inside HDFS at the desired path.
>



RE: MR - Input from Hbase output to HDFS

Posted by Stuti Awasthi <st...@hcl.com>.
Hi,
I think the issue was with the filesystem configuration: the config object is an HBaseConfiguration, so the job was not picking up the HDFS settings. When I changed my output directory to an absolute HDFS path:
FileOutputFormat.setOutputPath(job, new Path("hdfs://master:54310/MR/stuti3"));

The MR job runs successfully and I am able to see the stuti3 directory inside HDFS at the desired path.
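For anyone landing on this thread later, the pieces discussed above combine into a driver roughly like the following. This is a sketch assembled from the code posted in this thread: the table name "users", the mapper class ReadWriteMapper (not shown here), the "master" quorum host, and the hdfs://master:54310 URI are this thread's examples, not fixed values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;  // new API: takes a Job
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ReadWriteDriver {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        config.set("hbase.zookeeper.quorum", "master");
        config.set("hbase.zookeeper.property.clientPort", "2181");

        Job job = new Job(config, "Hbase_Read_Write");
        job.setJarByClass(ReadWriteDriver.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // larger scan batches for MR
        scan.setCacheBlocks(false);  // don't pollute the region server block cache

        // Input: the "users" HBase table -- no input directory is needed.
        TableMapReduceUtil.initTableMapperJob("users", scan,
                ReadWriteMapper.class, Text.class, IntWritable.class, job);

        // Output: a fully qualified HDFS path, so the job does not fall
        // back to the client machine's local default filesystem.
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job,
                new Path("hdfs://master:54310/MR/stuti3"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Running this requires the Hadoop and HBase jars on the classpath and a reachable cluster, so treat it as a template rather than a standalone program.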



RE: MR - Input from Hbase output to HDFS

Posted by Stuti Awasthi <st...@hcl.com>.
Hi Joey,
Thanks for pointing this out. After importing "FileOutputFormat" as you suggested, I am able to run the MR job from Eclipse (Windows); the only problem is that I am not able to find the output directory this code creates. HDFS and HBase are on a Linux machine.

Code :
		Configuration config = HBaseConfiguration.create();
		config.set("hbase.zookeeper.quorum", "master");
		config.set("hbase.zookeeper.property.clientPort", "2181");
			
		Job job = new Job(config, "Hbase_Read_Write");
		job.setJarByClass(ReadWriteDriver.class);
		Scan scan = new Scan();
		scan.setCaching(500);
		scan.setCacheBlocks(false);
		TableMapReduceUtil.initTableMapperJob("users", scan,ReadWriteMapper.class, Text.class, IntWritable.class, job);
		job.setOutputFormatClass(TextOutputFormat.class);
		FileOutputFormat.setOutputPath(job, new Path("/stuti2"));

After executing this code, the MR job runs successfully, but when I look in HDFS no "/stuti2" directory has been created. I also looked in the local filesystem of both the Linux machine and the Windows machine, but was not able to find the output folder anywhere.
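One way to see why the job can "succeed" while nothing appears in HDFS: a Path with no scheme is resolved against the job's default filesystem, and an HBaseConfiguration alone does not point fs.default.name at the cluster, so "/stuti2" resolves against the client machine's local filesystem. The resolution behavior can be illustrated with plain java.net.URI (a standalone sketch; the hostnames are the ones from this thread):

```java
import java.net.URI;

public class WhereDidMyOutputGo {
    public static void main(String[] args) {
        // Fallback default filesystem when no core-site.xml is on the classpath.
        URI localDefault = URI.create("file:///");
        // The intended cluster filesystem (from this thread's fix).
        URI clusterFs = URI.create("hdfs://master:54310/");

        // A scheme-less path inherits whatever the default filesystem is:
        System.out.println(localDefault.resolve("/stuti2"));
        // A fully qualified path names its filesystem explicitly:
        System.out.println(clusterFs.resolve("/MR/stuti3"));
    }
}
```

The second form is why "hdfs://master:54310/MR/stuti3" lands in HDFS regardless of what the client's default filesystem happens to be.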
	
Eclipse console Output :
11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client environment:java.version=1.6.0_27
11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc.
11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client environment:java.home=C:\Program Files\Java\jdk1.6.0_27\jre
11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client environment:java.class.path=D:\workspace\Hbase\MRHbaseReadWrite\bin;D:\workspace\Hbase\MRHbaseReadWrite\lib\commons-cli-1.2.jar;D:\workspace\Hbase\MRHbaseReadWrite\lib\commons-httpclient-3.0.1.jar;D:\workspace\Hbase\MRHbaseReadWrite\lib\commons-logging-1.0.4.jar;D:\workspace\Hbase\MRHbaseReadWrite\lib\hadoop-0.20.2-core.jar;D:\workspace\Hbase\MRHbaseReadWrite\lib\hbase-0.90.3.jar;D:\workspace\Hbase\MRHbaseReadWrite\lib\log4j-1.2.15.jar;D:\workspace\Hbase\MRHbaseReadWrite\lib\zookeeper-3.3.2.jar
11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client environment:java.library.path=C:\Program Files\Java\jdk1.6.0_27\jre\bin;C:\Windows\Sun\Java\bin;C:\Windows\system32;C:\Windows;C:/Program Files/Java/jre6/bin/client;C:/Program Files/Java/jre6/bin;C:/Program Files/Java/jre6/lib/i386;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\Java\jdk1.6.0_27;C:\Program Files\TortoiseSVN\bin;C:\cygwin\bin;D:\apache-maven-3.0.3\bin;D:\eclipse;;.
11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=C:\Users\STUTIA~1\AppData\Local\Temp\
11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client environment:os.name=Windows 7
11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client environment:os.arch=x86
11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client environment:os.version=6.1
11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client environment:user.name=stutiawasthi
11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client environment:user.home=C:\Users\stutiawasthi
11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client environment:user.dir=D:\workspace\Hbase\MRHbaseReadWrite
11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=master:2181 sessionTimeout=180000 watcher=hconnection
11/11/14 11:21:45 INFO zookeeper.ClientCnxn: Opening socket connection to server master/10.33.64.235:2181
11/11/14 11:21:45 INFO zookeeper.ClientCnxn: Socket connection established to master/10.33.64.235:2181, initiating session
11/11/14 11:21:45 INFO zookeeper.ClientCnxn: Session establishment complete on server master/10.33.64.235:2181, sessionid = 0x33879243de00ec, negotiated timeout = 180000
11/11/14 11:21:46 INFO mapred.JobClient: Running job: job_local_0001
11/11/14 11:21:46 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=master:2181 sessionTimeout=180000 watcher=hconnection
11/11/14 11:21:46 INFO zookeeper.ClientCnxn: Opening socket connection to server master/10.33.64.235:2181
11/11/14 11:21:46 INFO zookeeper.ClientCnxn: Socket connection established to master/10.33.64.235:2181, initiating session
11/11/14 11:21:46 INFO zookeeper.ClientCnxn: Session establishment complete on server master/10.33.64.235:2181, sessionid = 0x33879243de00ed, negotiated timeout = 180000
11/11/14 11:21:46 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=master:2181 sessionTimeout=180000 watcher=hconnection
11/11/14 11:21:46 INFO zookeeper.ClientCnxn: Opening socket connection to server master/10.33.64.235:2181
11/11/14 11:21:46 INFO zookeeper.ClientCnxn: Socket connection established to master/10.33.64.235:2181, initiating session
11/11/14 11:21:46 INFO zookeeper.ClientCnxn: Session establishment complete on server master/10.33.64.235:2181, sessionid = 0x33879243de00ee, negotiated timeout = 180000
11/11/14 11:21:46 INFO mapred.MapTask: io.sort.mb = 100
11/11/14 11:21:46 INFO mapred.MapTask: data buffer = 79691776/99614720
11/11/14 11:21:46 INFO mapred.MapTask: record buffer = 262144/327680
...............................................
11/11/14 11:21:46 INFO mapred.MapTask: Finished spill 0
11/11/14 11:21:46 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
11/11/14 11:21:46 INFO mapred.LocalJobRunner: 
11/11/14 11:21:46 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
11/11/14 11:21:46 INFO mapred.LocalJobRunner: 
11/11/14 11:21:46 INFO mapred.Merger: Merging 1 sorted segments
11/11/14 11:21:46 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 103 bytes
11/11/14 11:21:46 INFO mapred.LocalJobRunner: 
11/11/14 11:21:46 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
11/11/14 11:21:46 INFO mapred.LocalJobRunner: 
11/11/14 11:21:46 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
11/11/14 11:21:46 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to /stuti2
11/11/14 11:21:46 INFO mapred.LocalJobRunner: reduce > reduce
11/11/14 11:21:46 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
11/11/14 11:21:47 INFO mapred.JobClient:  map 100% reduce 100%
11/11/14 11:21:47 INFO mapred.JobClient: Job complete: job_local_0001
11/11/14 11:21:47 INFO mapred.JobClient: Counters: 12
11/11/14 11:21:47 INFO mapred.JobClient:   FileSystemCounters
11/11/14 11:21:47 INFO mapred.JobClient:     FILE_BYTES_READ=40923
11/11/14 11:21:47 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=82343
11/11/14 11:21:47 INFO mapred.JobClient:   Map-Reduce Framework
11/11/14 11:21:47 INFO mapred.JobClient:     Reduce input groups=5
11/11/14 11:21:47 INFO mapred.JobClient:     Combine output records=0
11/11/14 11:21:47 INFO mapred.JobClient:     Map input records=5
11/11/14 11:21:47 INFO mapred.JobClient:     Reduce shuffle bytes=0
11/11/14 11:21:47 INFO mapred.JobClient:     Reduce output records=5
11/11/14 11:21:47 INFO mapred.JobClient:     Spilled Records=10
11/11/14 11:21:47 INFO mapred.JobClient:     Map output bytes=91
11/11/14 11:21:47 INFO mapred.JobClient:     Combine input records=0
11/11/14 11:21:47 INFO mapred.JobClient:     Map output records=5
11/11/14 11:21:47 INFO mapred.JobClient:     Reduce input records=5


Please suggest.
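For reference, a complete driver written purely against the new org.apache.hadoop.mapreduce API might look like the sketch below. The class, table, and path names are taken from this thread; note that it imports FileOutputFormat from org.apache.hadoop.mapreduce.lib.output and avoids JobConf entirely. This is an untested sketch against Hadoop 0.20.2 / HBase 0.90.3, not a verified fix:

```java
// Sketch of a new-API-only driver; class names follow the thread.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
// The NEW FileOutputFormat, not org.apache.hadoop.mapred.FileOutputFormat:
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ReadWriteDriver {
	public static void main(String[] args) throws Exception {
		Configuration config = HBaseConfiguration.create();
		Job job = new Job(config, "Hbase_Read_Write");
		job.setJarByClass(ReadWriteDriver.class);

		Scan scan = new Scan();
		scan.setCaching(500);        // rows fetched per RPC during the table scan
		scan.setCacheBlocks(false);  // don't pollute the block cache from MR scans

		// The HBase table "users" is the input; no FileInputFormat path is needed.
		TableMapReduceUtil.initTableMapperJob("users", scan,
				ReadWriteMapper.class, Text.class, IntWritable.class, job);

		// Plain-text output on whatever filesystem fs.default.name points at.
		job.setOutputFormatClass(TextOutputFormat.class);
		FileOutputFormat.setOutputPath(job, new Path("/stuti2"));

		if (!job.waitForCompletion(true)) {
			throw new IOException("error with job!");
		}
	}
}
```

One thing worth checking: the console above shows job_local_0001, i.e. the job ran in the LocalJobRunner with the client-side configuration, in which case "/stuti2" resolves against whatever filesystem the client's fs.default.name names rather than the cluster's HDFS. Making sure the cluster's core-site.xml and hbase-site.xml are on the Eclipse classpath would rule that out.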

-----Original Message-----
From: Joey Echeverria [mailto:joey@cloudera.com] 
Sent: Friday, November 11, 2011 10:38 PM
To: user@hbase.apache.org
Subject: Re: MR - Input from Hbase output to HDFS

There are two APIs (old and new), and you appear to be mixing them.
TableMapReduceUtil only works with the new API. The solution is to import the new version of FileOutputFormat which takes a Job:


import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

-Joey

On Fri, Nov 11, 2011 at 12:55 AM, Stuti Awasthi <st...@hcl.com> wrote:
> The method " setOutputPath (JobConf,Path)" take JobConf as a parameter not the Job object.
> At least this is the error Im getting while compiling with Hadoop 0.20.2 jar with eclipse.
>
> FileOutputFormat.setOutputPath(conf, new Path("/output"));
>
> -----Original Message-----
> From: Prashant Sharma [mailto:prashant.iiith@gmail.com]
> Sent: Friday, November 11, 2011 11:20 AM
> To: user@hbase.apache.org
> Subject: Re: MR - Input from Hbase output to HDFS
>
> Hi stuti,
> I was wondering why  you are not using job object to set output path like this.
>
> FileOutputFormat.setOutputPath(job, new Path("outputReadWrite") );
>
>
> thanks
>
> On Fri, Nov 11, 2011 at 10:43 AM, Stuti Awasthi <st...@hcl.com>wrote:
>
>> Hi Andrie,
>> Well I am bit confused. When I use Jobconf , and associate with 
>> JobClient to run the job then I get the error that "Input directory is not set".
>> Since I want my input to be taken by Hbase table which I already 
>> configured with "TableMapReduceUtil.initTableMapperJob". I don't want 
>> to set input directory via jobconf.
>> How to mix these 2 so that I can get input from Hbase and write ouput 
>> to HDFS.
>>
>> Thanks
>>
>> -----Original Message-----
>> From: Andrei Cojocaru [mailto:majormax@gmail.com]
>> Sent: Thursday, November 10, 2011 7:09 PM
>> To: user@hbase.apache.org
>> Subject: Re: MR - Input from Hbase output to HDFS
>>
>> Stuti,
>>
>> I don't see you associating JobConf with Job anywhere.
>> -Andrei
>>
>>
>



--
Joseph Echeverria
Cloudera, Inc.
443.305.9434

Re: MR - Input from Hbase output to HDFS

Posted by Joey Echeverria <jo...@cloudera.com>.
There are two APIs (old and new), and you appear to be mixing them.
TableMapReduceUtil only works with the new API. The solution is to
import the new version of FileOutputFormat which takes a Job:


import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

-Joey
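To make the distinction concrete: the two FileOutputFormat classes live in different packages and take different first arguments, so mixing them fails to compile or silently configures the wrong job. A minimal sketch of the contrast (signatures paraphrased, not copied from the javadoc):

```java
// Old API ("mapred"): configures a JobConf, submitted via JobClient.runJob(conf)
//   org.apache.hadoop.mapred.FileOutputFormat.setOutputPath(JobConf conf, Path out)
//
// New API ("mapreduce"): configures a Job, submitted via job.waitForCompletion(true)
//   org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.setOutputPath(Job job, Path out)
//
// TableMapReduceUtil.initTableMapperJob(...) configures a Job, so the new-API
// class is the one to import alongside it:
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
```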

On Fri, Nov 11, 2011 at 12:55 AM, Stuti Awasthi <st...@hcl.com> wrote:
> The method " setOutputPath (JobConf,Path)" take JobConf as a parameter not the Job object.
> At least this is the error Im getting while compiling with Hadoop 0.20.2 jar with eclipse.
>
> FileOutputFormat.setOutputPath(conf, new Path("/output"));
>
> -----Original Message-----
> From: Prashant Sharma [mailto:prashant.iiith@gmail.com]
> Sent: Friday, November 11, 2011 11:20 AM
> To: user@hbase.apache.org
> Subject: Re: MR - Input from Hbase output to HDFS
>
> Hi stuti,
> I was wondering why  you are not using job object to set output path like this.
>
> FileOutputFormat.setOutputPath(job, new Path("outputReadWrite") );
>
>
> thanks
>
> On Fri, Nov 11, 2011 at 10:43 AM, Stuti Awasthi <st...@hcl.com>wrote:
>
>> Hi Andrie,
>> Well I am bit confused. When I use Jobconf , and associate with
>> JobClient to run the job then I get the error that "Input directory is not set".
>> Since I want my input to be taken by Hbase table which I already
>> configured with "TableMapReduceUtil.initTableMapperJob". I don't want
>> to set input directory via jobconf.
>> How to mix these 2 so that I can get input from Hbase and write ouput
>> to HDFS.
>>
>> Thanks
>>
>> -----Original Message-----
>> From: Andrei Cojocaru [mailto:majormax@gmail.com]
>> Sent: Thursday, November 10, 2011 7:09 PM
>> To: user@hbase.apache.org
>> Subject: Re: MR - Input from Hbase output to HDFS
>>
>> Stuti,
>>
>> I don't see you associating JobConf with Job anywhere.
>> -Andrei
>>
>>
>



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434

RE: MR - Input from Hbase output to HDFS

Posted by Stuti Awasthi <st...@hcl.com>.
The method setOutputPath(JobConf, Path) takes a JobConf as a parameter, not a Job object.
At least this is the error I'm getting while compiling against the Hadoop 0.20.2 jar in Eclipse.

FileOutputFormat.setOutputPath(conf, new Path("/output"));

-----Original Message-----
From: Prashant Sharma [mailto:prashant.iiith@gmail.com] 
Sent: Friday, November 11, 2011 11:20 AM
To: user@hbase.apache.org
Subject: Re: MR - Input from Hbase output to HDFS

Hi stuti,
I was wondering why  you are not using job object to set output path like this.

FileOutputFormat.setOutputPath(job, new Path("outputReadWrite") );


thanks

On Fri, Nov 11, 2011 at 10:43 AM, Stuti Awasthi <st...@hcl.com>wrote:

> Hi Andrie,
> Well I am bit confused. When I use Jobconf , and associate with 
> JobClient to run the job then I get the error that "Input directory is not set".
> Since I want my input to be taken by Hbase table which I already 
> configured with "TableMapReduceUtil.initTableMapperJob". I don't want 
> to set input directory via jobconf.
> How to mix these 2 so that I can get input from Hbase and write ouput 
> to HDFS.
>
> Thanks
>
> -----Original Message-----
> From: Andrei Cojocaru [mailto:majormax@gmail.com]
> Sent: Thursday, November 10, 2011 7:09 PM
> To: user@hbase.apache.org
> Subject: Re: MR - Input from Hbase output to HDFS
>
> Stuti,
>
> I don't see you associating JobConf with Job anywhere.
> -Andrei
>
>

Re: MR - Input from Hbase output to HDFS

Posted by Prashant Sharma <pr...@gmail.com>.
Hi Stuti,
I was wondering why you are not using the Job object to set the output path, like
this.

FileOutputFormat.setOutputPath(job, new Path("outputReadWrite") );


thanks

On Fri, Nov 11, 2011 at 10:43 AM, Stuti Awasthi <st...@hcl.com>wrote:

> Hi Andrie,
> Well I am bit confused. When I use Jobconf , and associate with JobClient
> to run the job then I get the error that "Input directory is not set".
> Since I want my input to be taken by Hbase table which I already configured
> with "TableMapReduceUtil.initTableMapperJob". I don't want to set input
> directory via jobconf.
> How to mix these 2 so that I can get input from Hbase and write ouput to
> HDFS.
>
> Thanks
>
> -----Original Message-----
> From: Andrei Cojocaru [mailto:majormax@gmail.com]
> Sent: Thursday, November 10, 2011 7:09 PM
> To: user@hbase.apache.org
> Subject: Re: MR - Input from Hbase output to HDFS
>
> Stuti,
>
> I don't see you associating JobConf with Job anywhere.
> -Andrei
>
>

RE: MR - Input from Hbase output to HDFS

Posted by Stuti Awasthi <st...@hcl.com>.
Hi Andrei,
Well, I am a bit confused. When I use JobConf and associate it with JobClient to run the job, I get the error "Input directory is not set". Since I want my input to be taken from an HBase table, which I have already configured with "TableMapReduceUtil.initTableMapperJob", I don't want to set the input directory via JobConf.
How do I mix these two so that I can get input from HBase and write output to HDFS?

Thanks

-----Original Message-----
From: Andrei Cojocaru [mailto:majormax@gmail.com]
Sent: Thursday, November 10, 2011 7:09 PM
To: user@hbase.apache.org
Subject: Re: MR - Input from Hbase output to HDFS

Stuti,

I don't see you associating JobConf with Job anywhere.
-Andrei


Re: MR - Input from Hbase output to HDFS

Posted by Andrei Cojocaru <ma...@gmail.com>.
Stuti,

I don't see you associating JobConf with Job anywhere.
-Andrei

RE: MR - Input from Hbase output to HDFS

Posted by Stuti Awasthi <st...@hcl.com>.
Hi Prashant,
Yes, I am running as root, and new Path("/outputReadWrite") represents the path in HDFS where I want my output directory to be created.

-----Original Message-----
From: Prashant Sharma [mailto:prashant.iiith@gmail.com] 
Sent: Thursday, November 10, 2011 1:55 PM
To: user@hbase.apache.org
Subject: Re: MR - Input from Hbase output to HDFS

>
> new Path("/outputReadWrite"))

I am afraid are you running as root ?

On Thu, Nov 10, 2011 at 1:31 PM, Stuti Awasthi <st...@hcl.com> wrote:

> Hi Tim,
>
> My Job driver class looks like this :
>
> Job job = new Job(config, "Hbase_Read_Write");
>                job.setJarByClass(ReadWriteDriver.class);
>                JobConf conf = new JobConf(ReadWriteDriver.class);
>
>                Scan scan = new Scan();
>                scan.setCaching(500);
>                scan.setCacheBlocks(false);
>
>                TableMapReduceUtil.initTableMapperJob("users", scan,
>                                ReadWriteMapper.class, Text.class, 
> IntWritable.class, job);
>
>                job.setOutputFormatClass(TextOutputFormat.class);
>                FileOutputFormat.setOutputPath(conf, new 
> Path("/outputReadWrite"));
>
>                boolean b;
>                try {
>                        b = job.waitForCompletion(true);
>                        if (!b) {
>                                throw new IOException("error with job!");
>                        }
>                } catch (InterruptedException e) {
>                        e.printStackTrace();
>                } catch (ClassNotFoundException e) {
>                        e.printStackTrace();
>                }
>
> But getting error :
>
> Exception in thread "main"
> org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
>        at
> org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:120)
>        at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:770)
>        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:448)
>        at readwrite.ReadWriteDriver.main(ReadWriteDriver.java:46)
>
> -----Original Message-----
> From: Tim Robertson [mailto:timrobertson100@gmail.com]
> Sent: Thursday, November 10, 2011 1:15 PM
> To: user@hbase.apache.org
> Subject: Re: MR - Input from Hbase output to HDFS
>
> Hi Stuti,
>
> I would have thought it was something like:
>  conf.setOutputFormat(TextOutputFormat.class);
>  FileOutputFormat.setOutputPath(conf, new Path(<YOUR_LOCATION>));
>
> Cheers,
> Tim
>
>
>
>
> On Thu, Nov 10, 2011 at 8:31 AM, Stuti Awasthi <st...@hcl.com>
> wrote:
> > Hi
> > Currently I am understading Hbase MapReduce support. I followed
> http://hbase.apache.org/book/mapreduce.example.html and executed it 
> successfully.
> > But I am not sure what changes to be done to  MR which takes input 
> > from
> Hbase table and put output to HDFS.
> >
> > How to set output dir . I tried to set with JobConf but it gives me
> error that output directory is not set.
> > Please Suggest.
> >
> > Regards,
> > Stuti Awasthi
> > HCL Comnet Systems and Services Ltd
> > F-8/9 Basement, Sec-3,Noida.
> >
> >
> >
>

Re: MR - Input from Hbase output to HDFS

Posted by Prashant Sharma <pr...@gmail.com>.
>
> new Path("/outputReadWrite"))

I am afraid are you running as root ?

On Thu, Nov 10, 2011 at 1:31 PM, Stuti Awasthi <st...@hcl.com> wrote:

> Hi Tim,
>
> My Job driver class looks like this :
>
> Job job = new Job(config, "Hbase_Read_Write");
>                job.setJarByClass(ReadWriteDriver.class);
>                JobConf conf = new JobConf(ReadWriteDriver.class);
>
>                Scan scan = new Scan();
>                scan.setCaching(500);
>                scan.setCacheBlocks(false);
>
>                TableMapReduceUtil.initTableMapperJob("users", scan,
>                                ReadWriteMapper.class, Text.class,
> IntWritable.class, job);
>
>                job.setOutputFormatClass(TextOutputFormat.class);
>                FileOutputFormat.setOutputPath(conf, new
> Path("/outputReadWrite"));
>
>                boolean b;
>                try {
>                        b = job.waitForCompletion(true);
>                        if (!b) {
>                                throw new IOException("error with job!");
>                        }
>                } catch (InterruptedException e) {
>                        e.printStackTrace();
>                } catch (ClassNotFoundException e) {
>                        e.printStackTrace();
>                }
>
> But getting error :
>
> Exception in thread "main"
> org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
>        at
> org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:120)
>        at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:770)
>        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:448)
>        at readwrite.ReadWriteDriver.main(ReadWriteDriver.java:46)
>
> -----Original Message-----
> From: Tim Robertson [mailto:timrobertson100@gmail.com]
> Sent: Thursday, November 10, 2011 1:15 PM
> To: user@hbase.apache.org
> Subject: Re: MR - Input from Hbase output to HDFS
>
> Hi Stuti,
>
> I would have thought it was something like:
>  conf.setOutputFormat(TextOutputFormat.class);
>  FileOutputFormat.setOutputPath(conf, new Path(<YOUR_LOCATION>));
>
> Cheers,
> Tim
>
>
>
>
> On Thu, Nov 10, 2011 at 8:31 AM, Stuti Awasthi <st...@hcl.com>
> wrote:
> > Hi
> > Currently I am understading Hbase MapReduce support. I followed
> http://hbase.apache.org/book/mapreduce.example.html and executed it
> successfully.
> > But I am not sure what changes to be done to  MR which takes input from
> Hbase table and put output to HDFS.
> >
> > How to set output dir . I tried to set with JobConf but it gives me
> error that output directory is not set.
> > Please Suggest.
> >
> > Regards,
> > Stuti Awasthi
> > HCL Comnet Systems and Services Ltd
> > F-8/9 Basement, Sec-3,Noida.
> >
> >
> >
>

RE: MR - Input from Hbase output to HDFS

Posted by Stuti Awasthi <st...@hcl.com>.
Hi Tim,

My Job driver class looks like this :

Job job = new Job(config, "Hbase_Read_Write");
		job.setJarByClass(ReadWriteDriver.class);
		JobConf conf = new JobConf(ReadWriteDriver.class);
		
		Scan scan = new Scan();
		scan.setCaching(500);
		scan.setCacheBlocks(false);

		TableMapReduceUtil.initTableMapperJob("users", scan,
				ReadWriteMapper.class, Text.class, IntWritable.class, job);
		
		job.setOutputFormatClass(TextOutputFormat.class);
		FileOutputFormat.setOutputPath(conf, new Path("/outputReadWrite"));
		
		boolean b;
		try {
			b = job.waitForCompletion(true);
			if (!b) {
				throw new IOException("error with job!");
			}
		} catch (InterruptedException e) {
			e.printStackTrace();
		} catch (ClassNotFoundException e) {
			e.printStackTrace();
		}

But getting error :

Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
	at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:120)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:770)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:448)
	at readwrite.ReadWriteDriver.main(ReadWriteDriver.java:46)

-----Original Message-----
From: Tim Robertson [mailto:timrobertson100@gmail.com] 
Sent: Thursday, November 10, 2011 1:15 PM
To: user@hbase.apache.org
Subject: Re: MR - Input from Hbase output to HDFS

Hi Stuti,

I would have thought it was something like:
  conf.setOutputFormat(TextOutputFormat.class);
  FileOutputFormat.setOutputPath(conf, new Path(<YOUR_LOCATION>));

Cheers,
Tim




On Thu, Nov 10, 2011 at 8:31 AM, Stuti Awasthi <st...@hcl.com> wrote:
> Hi
> Currently I am understading Hbase MapReduce support. I followed http://hbase.apache.org/book/mapreduce.example.html and executed it successfully.
> But I am not sure what changes to be done to  MR which takes input from Hbase table and put output to HDFS.
>
> How to set output dir . I tried to set with JobConf but it gives me error that output directory is not set.
> Please Suggest.
>
> Regards,
> Stuti Awasthi
> HCL Comnet Systems and Services Ltd
> F-8/9 Basement, Sec-3,Noida.
>
>
>

Re: MR - Input from Hbase output to HDFS

Posted by Tim Robertson <ti...@gmail.com>.
Hi Stuti,

I would have thought it was something like:
  conf.setOutputFormat(TextOutputFormat.class);
  FileOutputFormat.setOutputPath(conf, new Path(<YOUR_LOCATION>));

Cheers,
Tim
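Tim's two lines belong to the old mapred API. For completeness, a driver built entirely on that API would look roughly like the sketch below; MyMapper and MyReducer are hypothetical placeholders, and this style cannot be combined with the new-API TableMapReduceUtil.initTableMapperJob, which expects a Job:

```java
// Hypothetical old-API (org.apache.hadoop.mapred) driver skeleton showing
// where the conf.setOutputFormat / FileOutputFormat.setOutputPath lines fit.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;

public class OldApiDriver {
	public static void main(String[] args) throws Exception {
		JobConf conf = new JobConf(OldApiDriver.class);
		conf.setJobName("old_api_example");
		conf.setMapperClass(MyMapper.class);    // placeholder mapper
		conf.setReducerClass(MyReducer.class);  // placeholder reducer
		conf.setOutputFormat(TextOutputFormat.class);
		// The old API also needs an input path, which is what produces the
		// "Input directory is not set" error when omitted:
		FileInputFormat.setInputPaths(conf, new Path("/input"));
		FileOutputFormat.setOutputPath(conf, new Path("/output"));
		JobClient.runJob(conf);                 // old-API job submission
	}
}
```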




On Thu, Nov 10, 2011 at 8:31 AM, Stuti Awasthi <st...@hcl.com> wrote:
> Hi
> Currently I am understading Hbase MapReduce support. I followed http://hbase.apache.org/book/mapreduce.example.html and executed it successfully.
> But I am not sure what changes to be done to  MR which takes input from Hbase table and put output to HDFS.
>
> How to set output dir . I tried to set with JobConf but it gives me error that output directory is not set.
> Please Suggest.
>
> Regards,
> Stuti Awasthi
> HCL Comnet Systems and Services Ltd
> F-8/9 Basement, Sec-3,Noida.
>
>
>