Posted to user@pig.apache.org by Anze <an...@volja.net> on 2010/10/25 16:01:18 UTC

loading from HBase - Pig 0.7

Hi all!

I am struggling to find a working solution to load data from HBase directly. I 
am using Cloudera CDH3b3 which comes with Pig 0.7. What would be the easiest 
way to load data from HBase? 
If it matters: we need the rows to be included, too.

I have checked ElephantBird, but it seems to require Pig 0.6. I could 
downgrade, but it seems... well... :)

On the other hand, loading from HBase with rows is only added in Pig 0.8:
https://issues.apache.org/jira/browse/PIG-915
https://issues.apache.org/jira/browse/PIG-1205
But judging from the last issue Pig 0.8 requires HBase 0.20.6? 

I can install latest Pig from source if needed, but I'd rather leave Hadoop 
and HBase at their versions (0.20.2 and 0.89.20100924 respectively).

Should I write my own UDF? I'd appreciate some pointers.
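
To make this concrete, what I am hoping to be able to write is something
like the following (Pig 0.8 syntax, as far as I can tell from the tickets;
the table, column family and column names are made-up placeholders):
-----
-- 'mytable', 'cf', 'col1'/'col2' are placeholders; per PIG-1205, the
-- -loadKey option should prepend the row key to each tuple
raw = LOAD 'hbase://mytable'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
          'cf:col1 cf:col2', '-loadKey')
      AS (rowkey:chararray, col1:chararray, col2:chararray);
DUMP raw;
-----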

Thanks,

Anze

Re: loading from HBase - Pig 0.7

Posted by Anze Skerlavaj <an...@spletoholik.si>.
Dmitriy, thanks for answering! I will try it and post here how it goes... 
Right now I'm in a middle of Pig 0.7 session (I gave up and exported data from 
HBase to HDFS). Next week... :)

Anze


On Thursday 28 October 2010, Dmitriy Ryaboy wrote:
> It works with 20.2, and the error trace you pasted appears to be
> completely independent of HBaseStorage..
> 
> I see that you are using the snapshot jar -- try putting your hadoop
> jars and various dependencies on your classpath, and only using the
> -nohadoop jar that pig also builds.
> 
> -D
> 
> On Thu, Oct 28, 2010 at 1:42 AM, Anze <an...@volja.net> wrote:
> > Does anyone know, should Pig (0.8 - svn trunk) work with Hadoop 0.20.2?
> > 
> > I still can't start the Pig...
> > 
> > Thanks,
> > 
> > Anze
> > 
> > On Wednesday 27 October 2010, Anze wrote:
> >> Thanks, I guess I would trip over that later on - but for this immediate
> >> problem it doesn't help (of course, because Pig fails at the start, when
> >> I'm not working with HBase yet).
> >> 
> >> I have tracked the error message to HDataStorage.init() and added some
> >> debugging info:
> >> -----
> >>     public void init() {
> >>         // check if name node is set, if not we set local as fail back
> >>         String nameNode = this.properties.getProperty(FILE_SYSTEM_LOCATION);
> >>         System.out.println("NAMENODE: " + nameNode); // debug
> >>         if (nameNode == null || nameNode.length() == 0) {
> >>             nameNode = "local";
> >>         }
> >>         this.configuration = ConfigurationUtil.toConfiguration(this.properties);
> >>         try {
> >>             if (this.uri != null) {
> >>                 this.fs = FileSystem.get(this.uri, this.configuration);
> >>             } else {
> >>                 this.fs = FileSystem.get(this.configuration);
> >>             }
> >>         } catch (IOException e) {
> >>             e.printStackTrace(); // debug
> >>             throw new RuntimeException("Failed to create DataStorage", e);
> >>         }
> >>         short defaultReplication = fs.getDefaultReplication();
> >>         properties.setProperty(DEFAULT_REPLICATION_FACTOR_KEY,
> >>                                Short.valueOf(defaultReplication).toString());
> >>     }
> >> -----
> >> 
> >> The run now looks like this:
> >> -----
> >> root:/opt/pig# bin/pig
> >> PIG_HOME: /opt/pig/bin/..
> >> PIG_CONF_DIR: /opt/pig/bin/../conf
> >> 2010-10-27 10:18:18,728 [main] INFO  org.apache.pig.Main - Logging error
> >> messages to: /opt/pig/pig_1288167498720.log
> >> 2010-10-27 10:18:18,940 [main] INFO
> >> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> >> Connecting to hadoop file system at: hdfs://<MY NAMENODE>:8020/
> >> NAMENODE: hdfs://<MY NAMENODE>:8020/
> >> java.io.IOException: Call to <MY NAMENODE>/10.0.0.3:8020 failed on local
> >> exception: java.io.EOFException
> >>         at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
> >>         at org.apache.hadoop.ipc.Client.call(Client.java:743)
> >>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> >>         at $Proxy0.getProtocolVersion(Unknown Source)
> >>         at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
> >>         at
> >> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
> >>         at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
> >>         at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
> >>         at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
> >>         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
> >>         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
> >>         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
> >>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
> >>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
> >>         at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:73)
> >>         at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
> >>         at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:212)
> >>         at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:132)
> >>         at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
> >>         at org.apache.pig.PigServer.<init>(PigServer.java:225)
> >>         at org.apache.pig.PigServer.<init>(PigServer.java:214)
> >>         at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
> >>         at org.apache.pig.Main.run(Main.java:450)
> >>         at org.apache.pig.Main.main(Main.java:107)
> >> Caused by: java.io.EOFException
> >>         at java.io.DataInputStream.readInt(DataInputStream.java:375)
> >>         at
> >> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
> >>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
> >> 2010-10-27 10:18:19,124 [main] ERROR org.apache.pig.Main - ERROR 2999:
> >> Unexpected internal error. Failed to create DataStorage
> >> Details at logfile: /opt/pig/pig_1288167498720.log
> >> -----
> >> 
> >> I have replaced the name of my server with <MY NAMENODE> in the above
> >> listing. BTW, this works as it should:
> >> # hadoop fs -ls hdfs://<MY NAMENODE>:8020/
> >> 
> >> I would appreciate some pointers, I have no idea what is causing this...
> >> 
> >> Anze
> >> 
> >> On Wednesday 27 October 2010, Dmitriy Ryaboy wrote:
> >> > The same way you have /etc/hadoop/conf on the classpath, you want to
> >> > put the hbase conf directory on the classpath.
> >> > 
> >> > -D
> >> > 
> >> > On Tue, Oct 26, 2010 at 11:50 PM, Anze <an...@volja.net> wrote:
> >> > >> ... You have all the conf files in PIG_CLASSPATH right?
> >> > > 
> >> > > I think I do:
> >> > > ***
> >> > > PIG_HOME: /opt/pig/bin/..
> >> > > PIG_CONF_DIR: /opt/pig/bin/../conf
> >> > > dry run:
> >> > > /usr/lib/jvm/java-6-sun/bin/java -Xmx1000m
> >> > > -Dpig.log.dir=/opt/pig/bin/../logs -Dpig.log.file=pig.log
> >> > > -Dpig.home.dir=/opt/pig/bin/.. -Dpig.root.logger=INFO,console,DRFA
> >> > > -classpath /opt/pig/bin/../conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/etc/hadoop/conf:/opt/pig/bin/../build/classes:/opt/pig/bin/../build/test/classes:/opt/pig/bin/../pig-*-core.jar:/opt/pig/bin/../build/pig-0.8.0-SNAPSHOT.jar:/opt/pig/bin/../lib/automaton.jar:/opt/pig/bin/../lib/hbase-0.20.6.jar:/opt/pig/bin/../lib/hbase-0.20.6-test.jar:/opt/pig/bin/../lib/zookeeper-hbase-1329.jar
> >> > > org.apache.pig.Main
> >> > > ***
> >> > > 
> >> > > Generated log file contains:
> >> > > ***
> >> > > Error before Pig is launched
> >> > > ----------------------------
> >> > > ERROR 2999: Unexpected internal error. Failed to create DataStorage
> >> > > 
> >> > > java.lang.RuntimeException: Failed to create DataStorage
> >> > > 
> >> > >        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
> >> > >        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
> >> > >        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:212)
> >> > >        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:132)
> >> > >        at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
> >> > >        at org.apache.pig.PigServer.<init>(PigServer.java:225)
> >> > > 
> >> > >        at org.apache.pig.PigServer.<init>(PigServer.java:214)
> >> > >        at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
> >> > >        at org.apache.pig.Main.run(Main.java:450)
> >> > >        at org.apache.pig.Main.main(Main.java:107)
> >> > > 
> >> > > Caused by: java.io.IOException: Call to
> >> > > namenode.admundus.com/10.0.0.3:8020 failed on local exception:
> >> > > java.io.EOFException
> >> > > 
> >> > >        at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
> >> > >        at org.apache.hadoop.ipc.Client.call(Client.java:743)
> >> > >        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> >> > >        at $Proxy0.getProtocolVersion(Unknown Source)
> >> > >        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
> >> > >        at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
> >> > >        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
> >> > >        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
> >> > >        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
> >> > >        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
> >> > >        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
> >> > >        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
> >> > >        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
> >> > >        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
> >> > >        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
> >> > >        ... 9 more
> >> > > Caused by: java.io.EOFException
> >> > > 
> >> > >        at java.io.DataInputStream.readInt(DataInputStream.java:375)
> >> > >        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
> >> > >        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
> >> > > 
> >> > > ================================================================================
> >> > > 
> >> > > And the Pig complains:
> >> > > ***
> >> > > log4j:WARN No appenders could be found for logger
> >> > > (org.apache.hadoop.conf.Configuration).
> >> > > log4j:WARN Please initialize the log4j system properly.
> >> > > 2010-10-27 08:46:44,762 [main] INFO  org.apache.pig.Main - Logging
> >> > > error messages to: /opt/pig/bin/pig_1288162004754.log
> >> > > 2010-10-27 08:46:44,970 [main] INFO
> >> > > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> >> > > Connecting to hadoop file system at: hdfs://...:8020/
> >> > > 2010-10-27 08:46:45,158 [main] ERROR org.apache.pig.Main - ERROR
> >> > > 2999: Unexpected internal error. Failed to create DataStorage
> >> > > Details at logfile: /opt/pig/bin/pig_1288162004754.log
> >> > > ***
> >> > > 
> >> > > Any idea what is wrong? I have searched the net and most answers
> >> > > talk about incompatible versions of Hadoop and Pig (but the posts
> >> > > are old).
> >> > > 
> >> > > Thanks,
> >> > > 
> >> > > Anze
> >> > > 
> >> > > On Tuesday 26 October 2010, Dmitriy Ryaboy wrote:
> >> > >> Yeah pig 8 is not officially released yet, it will be cut at the
> >> > >> end of the month or beginning of next month.
> >> > >> 
> >> > >> Failed to create DataStorage sounds vaguely familiar.. can you send
> >> > >> the full pig session and the full error? I think it's not
> >> > >> connecting to hbase on the client-side, or something along those
> >> > >> lines. You have all the conf files in PIG_CLASSPATH right?
> >> > >> 
> >> > >> -D
> >> > >> 
> >> > >> On Tue, Oct 26, 2010 at 6:32 AM, Anze <an...@volja.net> wrote:
> >> > >> > Hmmm, not quite there yet. :-/
> >> > >> > 
> >> > >> > I installed:
> >> > >> > - HBase 0.20.6
> >> > >> > - Cloudera CDH3b3 Hadoop (0.20.2)
> >> > >> > - Pig 0.8 (since official download is empty (?) I fetched the Pig
> >> > >> > trunk from SVN and built it)
> >> > >> > 
> >> > >> > Now it complains about "Failed to create DataStorage". Any ideas?
> >> > >> > Should I upgrade Hadoop too?
> >> > >> > 
> >> > >> > This is getting a bit complicated to install. :)
> >> > >> > 
> >> > >> > I would appreciate some pointers - google revealed nothing
> >> > >> > useful.
> >> > >> > 
> >> > >> > Thanks,
> >> > >> > 
> >> > >> > Anze
> >> > >> > 
> >> > >> > On Tuesday 26 October 2010, Anze wrote:
> >> > >> >> Great! :)
> >> > >> >> 
> >> > >> >> Thanks for helping me out.
> >> > >> >> 
> >> > >> >> All the best,
> >> > >> >> 
> >> > >> >> Anze
> >> > >> >> 
> >> > >> >> On Tuesday 26 October 2010, Dmitriy Ryaboy wrote:
> >> > >> >> > I think that you might be able to get away with 20.2 if you
> >> > >> >> > don't use the filtering options.
> >> > >> >> > 
> >> > >> >> > > On Mon, Oct 25, 2010 at 3:39 PM, Anze <an...@volja.net> wrote:
> >> > >> >> > > Dmitriy, thanks for the answer!
> >> > >> >> > > 
> >> > >> >> > > The problem with upgrading to HBase 0.20.6 is that cloudera
> >> > >> >> > > doesn't ship it yet and we would like to keep our install at
> >> > >> >> > > "official" versions, even if beta. Of course, since this is
> >> > >> >> > > a development / testing cluster, we could bend the rules if
> >> > >> >> > > really necessary...
> >> > >> >> > > 
> >> > >> >> > > I have written a small MR job (actually, just "M" job :)
> >> > >> >> > > that exports the tables to files (allowing me to use Pig
> >> > >> >> > > 0.7), but that is a bit cumbersome and slow.
> >> > >> >> > > 
> >> > >> >> > > If I install the latest Pig (0.8), will it work at all with
> >> > >> >> > > HBase 0.20.2? In other words, are scan filters (which were
> >> > >> >> > > fixed in 0.20.6) needed as part of user-defined parameters
> >> > >> >> > > or as part of Pig optimizations in reading from HBase? Hope
> >> > >> >> > > my question makes sense...
> >> > >> >> > > 
> >> > >> >> > > :)
> >> > >> >> > > 
> >> > >> >> > > Thanks again,
> >> > >> >> > > 
> >> > >> >> > > Anze
> >> > >> >> > > 
> >> > >> >> > > On Tuesday 26 October 2010, Dmitriy Ryaboy wrote:
> >> > >> >> > >> Anze, the reason we bumped up to 20.6 in the ticket was
> >> > >> >> > >> because HBase 20.2 had a bug in it. Ask the HBase folks,
> >> > >> >> > >> but I'd say you should upgrade.
> >> > >> >> > >> FWIW we upgraded to 20.6 from 20.2 a few months back and
> >> > >> >> > >> it's been working smoothly.
> >> > >> >> > >> 
> >> > >> >> > >> The Elephant-Bird hbase loader for pig 0.6 does add row
> >> > >> >> > >> keys and most of the other features we added to the
> >> > >> >> > >> built-in loader for pig 0.8 (notably, it does not do
> >> > >> >> > >> storage). But I don't recommend downgrading to pig 0.6, as
> >> > >> >> > >> 7 and especially 8 are great improvements to the software.
> >> > >> >> > >> 
> >> > >> >> > >> -D
> >> > >> >> > >> 
> >> > >> >> > >> On Mon, Oct 25, 2010 at 7:01 AM, Anze <an...@volja.net> wrote:
> >> > >> >> > >> > Hi all!
> >> > >> >> > >> > 
> >> > >> >> > >> > I am struggling to find a working solution to load data
> >> > >> >> > >> > from HBase directly. I am using Cloudera CDH3b3 which
> >> > >> >> > >> > comes with Pig 0.7. What would be the easiest way to
> >> > >> >> > >> > load data from HBase? If it matters: we need the rows to
> >> > >> >> > >> > be included, too.
> >> > >> >> > >> > 
> >> > >> >> > >> > I have checked ElephantBird, but it seems to require Pig
> >> > >> >> > >> > 0.6. I could downgrade, but it seems... well... :)
> >> > >> >> > >> > 
> >> > >> >> > >> > On the other hand, loading from HBase with rows is only
> >> > >> >> > >> > added in Pig 0.8:
> >> > >> >> > >> > https://issues.apache.org/jira/browse/PIG-915
> >> > >> >> > >> > https://issues.apache.org/jira/browse/PIG-1205
> >> > >> >> > >> > But judging from the last issue Pig 0.8 requires HBase
> >> > >> >> > >> > 0.20.6?
> >> > >> >> > >> > 
> >> > >> >> > >> > I can install latest Pig from source if needed, but I'd
> >> > >> >> > >> > rather leave Hadoop and HBase at their versions (0.20.2
> >> > >> >> > >> > and 0.89.20100924 respectively).
> >> > >> >> > >> > 
> >> > >> >> > >> > Should I write my own UDF? I'd appreciate some pointers.
> >> > >> >> > >> > 
> >> > >> >> > >> > Thanks,
> >> > >> >> > >> > 
> >> > >> >> > >> > Anze


Re: loading from HBase - Pig 0.7

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
It works with 20.2, and the error trace you pasted appears to be
completely independent of HBaseStorage..

I see that you are using the snapshot jar -- try putting your hadoop
jars and various dependencies on your classpath, and only using the
-nohadoop jar that pig also builds.
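
Something along these lines is what I mean -- a sketch only, since the
exact paths and jar names depend on your build and install (the ones below
are placeholders for a CDH-style layout):
-----
# put the cluster's own hadoop conf and jars on pig's classpath...
export PIG_CLASSPATH=/etc/hadoop/conf:/usr/lib/hadoop/hadoop-core.jar:/usr/lib/hadoop/lib/*
# ...and run against the pig jar built without bundled hadoop
# (the -nohadoop / *-core.jar from the build), not pig-0.8.0-SNAPSHOT.jar
bin/pig
-----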

-D

On Thu, Oct 28, 2010 at 1:42 AM, Anze <an...@volja.net> wrote:
>
> Does anyone know, should Pig (0.8 - svn trunk) work with Hadoop 0.20.2?
>
> I still can't start the Pig...
>
> Thanks,
>
> Anze
>
>
> On Wednesday 27 October 2010, Anze wrote:
>> Thanks, I guess I would trip over that later on - but for this immediate
>> problem it doesn't help (of course, because Pig fails at the start, when
>> I'm not working with HBase yet).
>>
>> I have tracked the error message to HDataStorage.init() and added some
>> debugging info:
>> -----
>>     public void init() {
>>         // check if name node is set, if not we set local as fail back
>>         String nameNode = this.properties.getProperty(FILE_SYSTEM_LOCATION);
>>         System.out.println("NAMENODE: " + nameNode); // debug
>>         if (nameNode == null || nameNode.length() == 0) {
>>             nameNode = "local";
>>         }
>>         this.configuration = ConfigurationUtil.toConfiguration(this.properties);
>>         try {
>>             if (this.uri != null) {
>>                 this.fs = FileSystem.get(this.uri, this.configuration);
>>             } else {
>>                 this.fs = FileSystem.get(this.configuration);
>>             }
>>         } catch (IOException e) {
>>             e.printStackTrace(); // debug
>>             throw new RuntimeException("Failed to create DataStorage", e);
>>         }
>>         short defaultReplication = fs.getDefaultReplication();
>>         properties.setProperty(DEFAULT_REPLICATION_FACTOR_KEY,
>>                                Short.valueOf(defaultReplication).toString());
>>     }
>> -----
>>
>> The run now looks like this:
>> -----
>> root:/opt/pig# bin/pig
>> PIG_HOME: /opt/pig/bin/..
>> PIG_CONF_DIR: /opt/pig/bin/../conf
>> 2010-10-27 10:18:18,728 [main] INFO  org.apache.pig.Main - Logging error
>> messages to: /opt/pig/pig_1288167498720.log
>> 2010-10-27 10:18:18,940 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
>> to hadoop file system at: hdfs://<MY NAMENODE>:8020/
>> NAMENODE: hdfs://<MY NAMENODE>:8020/
>> java.io.IOException: Call to <MY NAMENODE>/10.0.0.3:8020 failed on local
>> exception: java.io.EOFException
>>         at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
>>         at org.apache.hadoop.ipc.Client.call(Client.java:743)
>>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>>         at $Proxy0.getProtocolVersion(Unknown Source)
>>         at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
>>         at
>> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
>>         at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
>>         at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
>>         at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
>>         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
>>         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
>>         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
>>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
>>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
>>         at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:73)
>>         at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
>>         at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:212)
>>         at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:132)
>>         at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
>>         at org.apache.pig.PigServer.<init>(PigServer.java:225)
>>         at org.apache.pig.PigServer.<init>(PigServer.java:214)
>>         at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
>>         at org.apache.pig.Main.run(Main.java:450)
>>         at org.apache.pig.Main.main(Main.java:107)
>> Caused by: java.io.EOFException
>>         at java.io.DataInputStream.readInt(DataInputStream.java:375)
>>         at
>> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
>>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
>> 2010-10-27 10:18:19,124 [main] ERROR org.apache.pig.Main - ERROR 2999:
>> Unexpected internal error. Failed to create DataStorage
>> Details at logfile: /opt/pig/pig_1288167498720.log
>> -----
>>
>> I have replaced the name of my server with <MY NAMENODE> in the above
>> listing. BTW, this works as it should:
>> # hadoop fs -ls hdfs://<MY NAMENODE>:8020/
>>
>> I would appreciate some pointers, I have no idea what is causing this...
>>
>> Anze
>>
>> On Wednesday 27 October 2010, Dmitriy Ryaboy wrote:
>> > The same way you have /etc/hadoop/conf on the classpath, you want to
>> > put the hbase conf directory on the classpath.
>> >
>> > -D
>> >
>> > On Tue, Oct 26, 2010 at 11:50 PM, Anze <an...@volja.net> wrote:
>> > >> ... You have all the conf files in PIG_CLASSPATH right?
>> > >
>> > > I think I do:
>> > > ***
>> > > PIG_HOME: /opt/pig/bin/..
>> > > PIG_CONF_DIR: /opt/pig/bin/../conf
>> > > dry run:
>> > > /usr/lib/jvm/java-6-sun/bin/java -Xmx1000m
>> > > -Dpig.log.dir=/opt/pig/bin/../logs -Dpig.log.file=pig.log
>> > > -Dpig.home.dir=/opt/pig/bin/.. -Dpig.root.logger=INFO,console,DRFA
>> > > -classpath /opt/pig/bin/../conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/etc/hadoop/conf:/opt/pig/bin/../build/classes:/opt/pig/bin/../build/test/classes:/opt/pig/bin/../pig-*-core.jar:/opt/pig/bin/../build/pig-0.8.0-SNAPSHOT.jar:/opt/pig/bin/../lib/automaton.jar:/opt/pig/bin/../lib/hbase-0.20.6.jar:/opt/pig/bin/../lib/hbase-0.20.6-test.jar:/opt/pig/bin/../lib/zookeeper-hbase-1329.jar
>> > > org.apache.pig.Main
>> > > ***
>> > >
>> > > Generated log file contains:
>> > > ***
>> > > Error before Pig is launched
>> > > ----------------------------
>> > > ERROR 2999: Unexpected internal error. Failed to create DataStorage
>> > >
>> > > java.lang.RuntimeException: Failed to create DataStorage
>> > >
>> > >        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
>> > >        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
>> > >        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:212)
>> > >        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:132)
>> > >        at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
>> > >        at org.apache.pig.PigServer.<init>(PigServer.java:225)
>> > >
>> > >        at org.apache.pig.PigServer.<init>(PigServer.java:214)
>> > >        at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
>> > >        at org.apache.pig.Main.run(Main.java:450)
>> > >        at org.apache.pig.Main.main(Main.java:107)
>> > >
>> > > Caused by: java.io.IOException: Call to
>> > > namenode.admundus.com/10.0.0.3:8020 failed on local exception:
>> > > java.io.EOFException
>> > >
>> > >        at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
>> > >        at org.apache.hadoop.ipc.Client.call(Client.java:743)
>> > >        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>> > >        at $Proxy0.getProtocolVersion(Unknown Source)
>> > >        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
>> > >        at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
>> > >        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
>> > >        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
>> > >        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
>> > >        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
>> > >        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
>> > >        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
>> > >        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
>> > >        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
>> > >        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
>> > >        ... 9 more
>> > > Caused by: java.io.EOFException
>> > >
>> > >        at java.io.DataInputStream.readInt(DataInputStream.java:375)
>> > >        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
>> > >
>> > >        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
>> > >
>> > > ================================================================================
>> > >
>> > > And the Pig complains:
>> > > ***
>> > > log4j:WARN No appenders could be found for logger
>> > > (org.apache.hadoop.conf.Configuration).
>> > > log4j:WARN Please initialize the log4j system properly.
>> > > 2010-10-27 08:46:44,762 [main] INFO  org.apache.pig.Main - Logging
>> > > error messages to: /opt/pig/bin/pig_1288162004754.log
>> > > 2010-10-27 08:46:44,970 [main] INFO
>> > > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>> > > Connecting to hadoop file system at: hdfs://...:8020/
>> > > 2010-10-27 08:46:45,158 [main] ERROR org.apache.pig.Main - ERROR 2999:
>> > > Unexpected internal error. Failed to create DataStorage
>> > > Details at logfile: /opt/pig/bin/pig_1288162004754.log
>> > > ***
>> > >
>> > > Any idea what is wrong? I have searched the net and most answers talk
>> > > about incompatible versions of Hadoop and Pig (but the posts are old).
>> > >
>> > > Thanks,
>> > >
>> > > Anze
>> > >
>> > > On Tuesday 26 October 2010, Dmitriy Ryaboy wrote:
>> > >> Yeah pig 8 is not officially released yet, it will be cut at the end
>> > >> of the month or beginning of next month.
>> > >>
>> > >> Failed to create DataStorage sounds vaguely familiar.. can you send
>> > >> the full pig session and the full error? I think it's not connecting
>> > >> to hbase on the client-side, or something along those lines. You have
>> > >> all the conf files in PIG_CLASSPATH right?
>> > >>
>> > >> -D
>> > >>
>> > >> On Tue, Oct 26, 2010 at 6:32 AM, Anze <an...@volja.net> wrote:
>> > >> > Hmmm, not quite there yet. :-/
>> > >> >
>> > >> > I installed:
>> > >> > - HBase 0.20.6
>> > >> > - Cloudera CDH3b3 Hadoop (0.20.2)
>> > >> > - Pig 0.8 (since official download is empty (?) I fetched the Pig
>> > >> > trunk from SVN and built it)
>> > >> >
>> > >> > Now it complains about "Failed to create DataStorage". Any ideas?
>> > >> > Should I upgrade Hadoop too?
>> > >> >
>> > >> > This is getting a bit complicated to install. :)
>> > >> >
>> > >> > I would appreciate some pointers - google revealed nothing useful.
>> > >> >
>> > >> > Thanks,
>> > >> >
>> > >> > Anze
>> > >> >
>> > >> > On Tuesday 26 October 2010, Anze wrote:
>> > >> >> Great! :)
>> > >> >>
>> > >> >> Thanks for helping me out.
>> > >> >>
>> > >> >> All the best,
>> > >> >>
>> > >> >> Anze
>> > >> >>
>> > >> >> On Tuesday 26 October 2010, Dmitriy Ryaboy wrote:
>> > >> >> > I think that you might be able to get away with 20.2 if you don't
>> > >> >> > use the filtering options.
>> > >> >> >
>> > >> >> > On Mon, Oct 25, 2010 at 3:39 PM, Anze <an...@volja.net> wrote:
>> > >> >> > > Dmitriy, thanks for the answer!
>> > >> >> > >
>> > >> >> > > The problem with upgrading to HBase 0.20.6 is that cloudera
>> > >> >> > > doesn't ship it yet and we would like to keep our install at
>> > >> >> > > "official" versions, even if beta. Of course, since this is a
>> > >> >> > > development / testing cluster, we could bend the rules if
>> > >> >> > > really necessary...
>> > >> >> > >
>> > >> >> > > I have written a small MR job (actually, just "M" job :) that
>> > >> >> > > exports the tables to files (allowing me to use Pig 0.7), but
>> > >> >> > > that is a bit cumbersome and slow.
>> > >> >> > >
>> > >> >> > > If I install the latest Pig (0.8), will it work at all with
>> > >> >> > > HBase 0.20.2? In other words, are scan filters (which were
>> > >> >> > > fixed in 0.20.6) needed as part of user-defined parameters or
>> > >> >> > > as part of Pig optimizations in reading from HBase? Hope my
>> > >> >> > > question makes sense...
>> > >> >> > >
>> > >> >> > > :)
>> > >> >> > >
>> > >> >> > > Thanks again,
>> > >> >> > >
>> > >> >> > > Anze
>> > >> >> > >
>> > >> >> > > On Tuesday 26 October 2010, Dmitriy Ryaboy wrote:
>> > >> >> > >> Anze, the reason we bumped up to 20.6 in the ticket was
>> > >> >> > >> because HBase 20.2 had a bug in it. Ask the HBase folks, but
>> > >> >> > >> I'd say you should upgrade.
>> > >> >> > >> FWIW we upgraded to 20.6 from 20.2 a few months back and it's
>> > >> >> > >> been working smoothly.
>> > >> >> > >>
>> > >> >> > >> The Elephant-Bird hbase loader for pig 0.6 does add row keys
>> > >> >> > >> and most of the other features we added to the built-in
>> > >> >> > >> loader for pig 0.8 (notably, it does not do storage). But I
>> > >> >> > >> don't recommend downgrading to pig 0.6, as 7 and especially 8
>> > >> >> > >> are great improvements to the software.
>> > >> >> > >>
>> > >> >> > >> -D
>> > >> >> > >>
>> > >> >> > >> On Mon, Oct 25, 2010 at 7:01 AM, Anze <an...@volja.net> wrote:
>> > >> >> > >> > Hi all!
>> > >> >> > >> >
>> > >> >> > >> > I am struggling to find a working solution to load data from
>> > >> >> > >> > HBase directly. I am using Cloudera CDH3b3 which comes with
>> > >> >> > >> > Pig 0.7. What would be the easiest way to load data from
>> > >> >> > >> > HBase? If it matters: we need the rows to be included, too.
>> > >> >> > >> >
>> > >> >> > >> > I have checked ElephantBird, but it seems to require Pig
>> > >> >> > >> > 0.6. I could downgrade, but it seems... well... :)
>> > >> >> > >> >
>> > >> >> > >> > On the other hand, loading from HBase with rows is only
>> > >> >> > >> > added in Pig 0.8:
>> > >> >> > >> > https://issues.apache.org/jira/browse/PIG-915
>> > >> >> > >> > https://issues.apache.org/jira/browse/PIG-1205
>> > >> >> > >> > But judging from the last issue Pig 0.8 requires HBase
>> > >> >> > >> > 0.20.6?
>> > >> >> > >> >
>> > >> >> > >> > I can install latest Pig from source if needed, but I'd
>> > >> >> > >> > rather leave Hadoop and HBase at their versions (0.20.2 and
>> > >> >> > >> > 0.89.20100924 respectively).
>> > >> >> > >> >
>> > >> >> > >> > Should I write my own UDF? I'd appreciate some pointers.
>> > >> >> > >> >
>> > >> >> > >> > Thanks,
>> > >> >> > >> >
>> > >> >> > >> > Anze
>
>

Re: loading from HBase - Pig 0.7

Posted by Anze <an...@volja.net>.
Does anyone know, should Pig (0.8 - svn trunk) work with Hadoop 0.20.2?

I still can't start the Pig...

Thanks,

Anze


On Wednesday 27 October 2010, Anze wrote:
> Thanks, I guess I would trip over that later on - but for this immediate
> problem it doesn't help (of course, because Pig fails at the start, when
> I'm not working with HBase yet).
> 
> I have tracked the error message to HDataStorage.init() and added some
> debugging info:
> -----
>     public void init() {
>         // check if name node is set, if not we set local as fail back
>         String nameNode = this.properties.getProperty(FILE_SYSTEM_LOCATION);
>         System.out.println("NAMENODE: " + nameNode); // debug
>         if (nameNode == null || nameNode.length() == 0) {
>             nameNode = "local";
>         }
>         this.configuration = ConfigurationUtil.toConfiguration(this.properties);
>         try {
>             if (this.uri != null) {
>                 this.fs = FileSystem.get(this.uri, this.configuration);
>             } else {
>                 this.fs = FileSystem.get(this.configuration);
>             }
>         } catch (IOException e) {
>             e.printStackTrace(); // debug
>             throw new RuntimeException("Failed to create DataStorage", e);
>         }
>         short defaultReplication = fs.getDefaultReplication();
>         properties.setProperty(DEFAULT_REPLICATION_FACTOR_KEY,
>                                Short.valueOf(defaultReplication).toString());
>     }
> -----
> 
> The run now looks like this:
> -----
> root:/opt/pig# bin/pig
> PIG_HOME: /opt/pig/bin/..
> PIG_CONF_DIR: /opt/pig/bin/../conf
> 2010-10-27 10:18:18,728 [main] INFO  org.apache.pig.Main - Logging error
> messages to: /opt/pig/pig_1288167498720.log
> 2010-10-27 10:18:18,940 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
> to hadoop file system at: hdfs://<MY NAMENODE>:8020/
> NAMENODE: hdfs://<MY NAMENODE>:8020/
> java.io.IOException: Call to <MY NAMENODE>/10.0.0.3:8020 failed on local
> exception: java.io.EOFException
>         at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
>         at org.apache.hadoop.ipc.Client.call(Client.java:743)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>         at $Proxy0.getProtocolVersion(Unknown Source)
>         at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
>         at
> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
>         at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
>         at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
>         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
>         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
>         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
>         at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:73)
>         at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
>         at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:212)
>         at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:132)
>         at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
>         at org.apache.pig.PigServer.<init>(PigServer.java:225)
>         at org.apache.pig.PigServer.<init>(PigServer.java:214)
>         at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
>         at org.apache.pig.Main.run(Main.java:450)
>         at org.apache.pig.Main.main(Main.java:107)
> Caused by: java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:375)
>         at
> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
> 2010-10-27 10:18:19,124 [main] ERROR org.apache.pig.Main - ERROR 2999:
> Unexpected internal error. Failed to create DataStorage
> Details at logfile: /opt/pig/pig_1288167498720.log
> -----
> 
> I have replaced the name of my server with <MY NAMENODE> in the above
> listing. BTW, this works as it should:
> # hadoop fs -ls hdfs://<MY NAMENODE>:8020/
> 
> I would appreciate some pointers, I have no idea what is causing this...
> 
> Anze
> 
> On Wednesday 27 October 2010, Dmitriy Ryaboy wrote:
> > The same way you have /etc/hadoop/conf on the classpath, you want to
> > put the hbase conf directory on the classpath.
> > 
> > -D
> > 
> > On Tue, Oct 26, 2010 at 11:50 PM, Anze <an...@volja.net> wrote:
> > >> ... You have all the conf files in PIG_CLASSPATH right?
> > > 
> > > I think I do:
> > > ***
> > > PIG_HOME: /opt/pig/bin/..
> > > PIG_CONF_DIR: /opt/pig/bin/../conf
> > > dry run:
> > > /usr/lib/jvm/java-6-sun/bin/java -Xmx1000m
> > > -Dpig.log.dir=/opt/pig/bin/../logs -Dpig.log.file=pig.log
> > > -Dpig.home.dir=/opt/pig/bin/.. -Dpig.root.logger=INFO,console,DRFA
> > > -classpath /opt/pig/bin/../conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/etc/hadoop/conf:/opt/pig/bin/../build/classes:/opt/pig/bin/../build/test/classes:/opt/pig/bin/../pig-*-core.jar:/opt/pig/bin/../build/pig-0.8.0-SNAPSHOT.jar:/opt/pig/bin/../lib/automaton.jar:/opt/pig/bin/../lib/hbase-0.20.6.jar:/opt/pig/bin/../lib/hbase-0.20.6-test.jar:/opt/pig/bin/../lib/zookeeper-hbase-1329.jar
> > > org.apache.pig.Main
> > > ***
> > > 
> > > Generated log file contains:
> > > ***
> > > Error before Pig is launched
> > > ----------------------------
> > > ERROR 2999: Unexpected internal error. Failed to create DataStorage
> > > 
> > > java.lang.RuntimeException: Failed to create DataStorage
> > > 
> > >        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
> > >        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
> > >        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:212)
> > >        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:132)
> > >        at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
> > >        at org.apache.pig.PigServer.<init>(PigServer.java:225)
> > > 
> > >        at org.apache.pig.PigServer.<init>(PigServer.java:214)
> > >        at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
> > >        at org.apache.pig.Main.run(Main.java:450)
> > >        at org.apache.pig.Main.main(Main.java:107)
> > > 
> > > Caused by: java.io.IOException: Call to
> > > namenode.admundus.com/10.0.0.3:8020 failed on local exception:
> > > java.io.EOFException
> > > 
> > >        at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
> > >        at org.apache.hadoop.ipc.Client.call(Client.java:743)
> > >        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> > >        at $Proxy0.getProtocolVersion(Unknown Source)
> > >        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
> > >        at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
> > >        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
> > >        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
> > >        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
> > >        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
> > >        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
> > >        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
> > >        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
> > >        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
> > >        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
> > >        ... 9 more
> > > Caused by: java.io.EOFException
> > > 
> > >        at java.io.DataInputStream.readInt(DataInputStream.java:375)
> > >        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
> > > 
> > >        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
> > > 
> > > ================================================================================
> > > 
> > > And the Pig complains:
> > > ***
> > > log4j:WARN No appenders could be found for logger
> > > (org.apache.hadoop.conf.Configuration).
> > > log4j:WARN Please initialize the log4j system properly.
> > > 2010-10-27 08:46:44,762 [main] INFO  org.apache.pig.Main - Logging
> > > error messages to: /opt/pig/bin/pig_1288162004754.log
> > > 2010-10-27 08:46:44,970 [main] INFO
> > > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> > > Connecting to hadoop file system at: hdfs://...:8020/
> > > 2010-10-27 08:46:45,158 [main] ERROR org.apache.pig.Main - ERROR 2999:
> > > Unexpected internal error. Failed to create DataStorage
> > > Details at logfile: /opt/pig/bin/pig_1288162004754.log
> > > ***
> > > 
> > > Any idea what is wrong? I have searched the net and most answers talk
> > > about incompatible versions of Hadoop and Pig (but the posts are old).
> > > 
> > > Thanks,
> > > 
> > > Anze
> > > 
> > > On Tuesday 26 October 2010, Dmitriy Ryaboy wrote:
> > >> Yeah pig 8 is not officially released yet, it will be cut at the end
> > >> of the month or beginning of next month.
> > >> 
> > >> Failed to create DataStorage sounds vaguely familiar.. can you send
> > >> the full pig session and the full error? I think it's not connecting
> > >> to hbase on the client-side, or something along those lines. You have
> > >> all the conf files in PIG_CLASSPATH right?
> > >> 
> > >> -D
> > >> 
> > >> On Tue, Oct 26, 2010 at 6:32 AM, Anze <an...@volja.net> wrote:
> > >> > Hmmm, not quite there yet. :-/
> > >> > 
> > >> > I installed:
> > >> > - HBase 0.20.6
> > >> > - Cloudera CDH3b3 Hadoop (0.20.2)
> > >> > - Pig 0.8 (since official download is empty (?) I fetched the Pig
> > >> > trunk from SVN and built it)
> > >> > 
> > >> > Now it complains about "Failed to create DataStorage". Any ideas?
> > >> > Should I upgrade Hadoop too?
> > >> > 
> > >> > This is getting a bit complicated to install. :)
> > >> > 
> > >> > I would appreciate some pointers - google revealed nothing useful.
> > >> > 
> > >> > Thanks,
> > >> > 
> > >> > Anze
> > >> > 
> > >> > On Tuesday 26 October 2010, Anze wrote:
> > >> >> Great! :)
> > >> >> 
> > >> >> Thanks for helping me out.
> > >> >> 
> > >> >> All the best,
> > >> >> 
> > >> >> Anze
> > >> >> 
> > >> >> On Tuesday 26 October 2010, Dmitriy Ryaboy wrote:
> > >> >> > I think that you might be able to get away with 20.2 if you don't
> > >> >> > use the filtering options.
> > >> >> > 
> > >> >> > On Mon, Oct 25, 2010 at 3:39 PM, Anze <an...@volja.net> wrote:
> > >> >> > > Dmitriy, thanks for the answer!
> > >> >> > > 
> > >> >> > > The problem with upgrading to HBase 0.20.6 is that cloudera
> > >> >> > > doesn't ship it yet and we would like to keep our install at
> > >> >> > > "official" versions, even if beta. Of course, since this is a
> > >> >> > > development / testing cluster, we could bend the rules if
> > >> >> > > really necessary...
> > >> >> > > 
> > >> >> > > I have written a small MR job (actually, just "M" job :) that
> > >> >> > > exports the tables to files (allowing me to use Pig 0.7), but
> > >> >> > > that is a bit cumbersome and slow.
> > >> >> > > 
> > >> >> > > If I install the latest Pig (0.8), will it work at all with
> > >> >> > > HBase 0.20.2? In other words, are scan filters (which were
> > >> >> > > fixed in 0.20.6) needed as part of user-defined parameters or
> > >> >> > > as part of Pig optimizations in reading from HBase? Hope my
> > >> >> > > question makes sense...
> > >> >> > > 
> > >> >> > > :)
> > >> >> > > 
> > >> >> > > Thanks again,
> > >> >> > > 
> > >> >> > > Anze
> > >> >> > > 
> > >> >> > > On Tuesday 26 October 2010, Dmitriy Ryaboy wrote:
> > >> >> > >> Anze, the reason we bumped up to 20.6 in the ticket was
> > >> >> > >> because HBase 20.2 had a bug in it. Ask the HBase folks, but
> > >> >> > >> I'd say you should upgrade.
> > >> >> > >> FWIW we upgraded to 20.6 from 20.2 a few months back and it's
> > >> >> > >> been working smoothly.
> > >> >> > >> 
> > >> >> > >> The Elephant-Bird hbase loader for pig 0.6 does add row keys
> > >> >> > >> and most of the other features we added to the built-in
> > >> >> > >> loader for pig 0.8 (notably, it does not do storage). But I
> > >> >> > >> don't recommend downgrading to pig 0.6, as 7 and especially 8
> > >> >> > >> are great improvements to the software.
> > >> >> > >> 
> > >> >> > >> -D
> > >> >> > >> 
> > >> >> > >> On Mon, Oct 25, 2010 at 7:01 AM, Anze <an...@volja.net> wrote:
> > >> >> > >> > Hi all!
> > >> >> > >> > 
> > >> >> > >> > I am struggling to find a working solution to load data from
> > >> >> > >> > HBase directly. I am using Cloudera CDH3b3 which comes with
> > >> >> > >> > Pig 0.7. What would be the easiest way to load data from
> > >> >> > >> > HBase? If it matters: we need the rows to be included, too.
> > >> >> > >> > 
> > >> >> > >> > I have checked ElephantBird, but it seems to require Pig
> > >> >> > >> > 0.6. I could downgrade, but it seems... well... :)
> > >> >> > >> > 
> > >> >> > >> > On the other hand, loading from HBase with rows is only
> > >> >> > >> > added in Pig 0.8:
> > >> >> > >> > https://issues.apache.org/jira/browse/PIG-915
> > >> >> > >> > https://issues.apache.org/jira/browse/PIG-1205
> > >> >> > >> > But judging from the last issue Pig 0.8 requires HBase
> > >> >> > >> > 0.20.6?
> > >> >> > >> > 
> > >> >> > >> > I can install latest Pig from source if needed, but I'd
> > >> >> > >> > rather leave Hadoop and HBase at their versions (0.20.2 and
> > >> >> > >> > 0.89.20100924 respectively).
> > >> >> > >> > 
> > >> >> > >> > Should I write my own UDF? I'd appreciate some pointers.
> > >> >> > >> > 
> > >> >> > >> > Thanks,
> > >> >> > >> > 
> > >> >> > >> > Anze


Re: loading from HBase - Pig 0.7

Posted by Anze <an...@volja.net>.
Thanks, I guess I would trip over that later on - but for this immediate 
problem it doesn't help (of course, because Pig fails at the start, when I'm 
not working with HBase yet).

I have tracked the error message to HDataStorage.init() and added some 
debugging info:
-----
    public void init() {
        // check if name node is set, if not we set local as fail back
        String nameNode = this.properties.getProperty(FILE_SYSTEM_LOCATION);
        System.out.println("NAMENODE: " + nameNode); // debug
        if (nameNode == null || nameNode.length() == 0) {
            nameNode = "local";
        }
        this.configuration = 
ConfigurationUtil.toConfiguration(this.properties);
        try {
            if (this.uri != null) {
                this.fs = FileSystem.get(this.uri, this.configuration);
            } else {
                this.fs = FileSystem.get(this.configuration);
            }
        } catch (IOException e) {
            e.printStackTrace(); // debug
            throw new RuntimeException("Failed to create DataStorage", e);
        }
        short defaultReplication = fs.getDefaultReplication();
        properties.setProperty(DEFAULT_REPLICATION_FACTOR_KEY,
                               Short.valueOf(defaultReplication).toString());
    }
-----

The run now looks like this:
-----
root:/opt/pig# bin/pig
PIG_HOME: /opt/pig/bin/..
PIG_CONF_DIR: /opt/pig/bin/../conf
2010-10-27 10:18:18,728 [main] INFO  org.apache.pig.Main - Logging error 
messages to: /opt/pig/pig_1288167498720.log
2010-10-27 10:18:18,940 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
hadoop file system at: hdfs://<MY NAMENODE>:8020/
NAMENODE: hdfs://<MY NAMENODE>:8020/
java.io.IOException: Call to <MY NAMENODE>/10.0.0.3:8020 failed on local 
exception: java.io.EOFException
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
        at org.apache.hadoop.ipc.Client.call(Client.java:743)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy0.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
        at 
org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
        at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
        at 
org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:73)
        at 
org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
        at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:212)
        at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:132)
        at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
        at org.apache.pig.PigServer.<init>(PigServer.java:225)
        at org.apache.pig.PigServer.<init>(PigServer.java:214)
        at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
        at org.apache.pig.Main.run(Main.java:450)
        at org.apache.pig.Main.main(Main.java:107)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at 
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
2010-10-27 10:18:19,124 [main] ERROR org.apache.pig.Main - ERROR 2999: 
Unexpected internal error. Failed to create DataStorage
Details at logfile: /opt/pig/pig_1288167498720.log
-----

I have replaced the name of my server with <MY NAMENODE> in the above listing. 
BTW, this works as it should:
# hadoop fs -ls hdfs://<MY NAMENODE>:8020/

I would appreciate some pointers, I have no idea what is causing this...

Anze


On Wednesday 27 October 2010, Dmitriy Ryaboy wrote:
> The same way you have /etc/hadoop/conf on the classpath, you want to
> put the hbase conf directory on the classpath.
> 
> -D
> 
> On Tue, Oct 26, 2010 at 11:50 PM, Anze <an...@volja.net> wrote:
> >> ... You have all the conf files in PIG_CLASSPATH right?
> > 
> > I think I do:
> > ***
> > PIG_HOME: /opt/pig/bin/..
> > PIG_CONF_DIR: /opt/pig/bin/../conf
> > dry run:
> > /usr/lib/jvm/java-6-sun/bin/java -Xmx1000m
> > -Dpig.log.dir=/opt/pig/bin/../logs -Dpig.log.file=pig.log
> > -Dpig.home.dir=/opt/pig/bin/.. -Dpig.root.logger=INFO,console,DRFA
> > -classpath /opt/pig/bin/../conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/etc/hadoop/conf:/opt/pig/bin/../build/classes:/opt/pig/bin/../build/test/classes:/opt/pig/bin/../pig-*-core.jar:/opt/pig/bin/../build/pig-0.8.0-SNAPSHOT.jar:/opt/pig/bin/../lib/automaton.jar:/opt/pig/bin/../lib/hbase-0.20.6.jar:/opt/pig/bin/../lib/hbase-0.20.6-test.jar:/opt/pig/bin/../lib/zookeeper-hbase-1329.jar
> > org.apache.pig.Main
> > ***
> > 
> > Generated log file contains:
> > ***
> > Error before Pig is launched
> > ----------------------------
> > ERROR 2999: Unexpected internal error. Failed to create DataStorage
> > 
> > java.lang.RuntimeException: Failed to create DataStorage
> >        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
> >        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
> >        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:212)
> >        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:132)
> >        at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
> >        at org.apache.pig.PigServer.<init>(PigServer.java:225)
> >        at org.apache.pig.PigServer.<init>(PigServer.java:214)
> >        at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
> >        at org.apache.pig.Main.run(Main.java:450)
> >        at org.apache.pig.Main.main(Main.java:107)
> > Caused by: java.io.IOException: Call to
> > namenode.admundus.com/10.0.0.3:8020 failed on local exception:
> > java.io.EOFException
> >        at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
> >        at org.apache.hadoop.ipc.Client.call(Client.java:743)
> >        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> >        at $Proxy0.getProtocolVersion(Unknown Source)
> >        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
> >        at
> > org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
> >        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
> >        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
> >        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
> >        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
> >        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
> >        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
> >        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
> >        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
> >        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
> >        ... 9 more
> > Caused by: java.io.EOFException
> >        at java.io.DataInputStream.readInt(DataInputStream.java:375)
> >        at
> > org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
> >        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
> > ================================================================================
> > 
> > And the Pig complains:
> > ***
> > log4j:WARN No appenders could be found for logger
> > (org.apache.hadoop.conf.Configuration).
> > log4j:WARN Please initialize the log4j system properly.
> > 2010-10-27 08:46:44,762 [main] INFO  org.apache.pig.Main - Logging error
> > messages to: /opt/pig/bin/pig_1288162004754.log
> > 2010-10-27 08:46:44,970 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> > Connecting to hadoop file system at: hdfs://...:8020/
> > 2010-10-27 08:46:45,158 [main] ERROR org.apache.pig.Main - ERROR 2999:
> > Unexpected internal error. Failed to create DataStorage
> > Details at logfile: /opt/pig/bin/pig_1288162004754.log
> > ***
> > 
> > Any idea what is wrong? I have searched the net and most answers talk
> > about incompatible versions of Hadoop and Pig (but the posts are old).
> > 
> > Thanks,
> > 
> > Anze
> > 
> > On Tuesday 26 October 2010, Dmitriy Ryaboy wrote:
> >> Yeah pig 8 is not officially released yet, it will be cut at the end
> >> of the month or beginning of next month.
> >> 
> >> Failed to create DataStorage sounds vaguely familiar.. can you send
> >> the full pig session and the full error? I think it's not connecting
> >> to hbase on the client-side, or something along those lines. You have
> >> all the conf files in PIG_CLASSPATH right?
> >> 
> >> -D
> >> 
> >> On Tue, Oct 26, 2010 at 6:32 AM, Anze <an...@volja.net> wrote:
> >> > Hmmm, not quite there yet. :-/
> >> > 
> >> > I installed:
> >> > - HBase 0.20.6
> >> > - Cloudera CDH3b3 Hadoop (0.20.2)
> >> > - Pig 0.8 (since official download is empty (?) I fetched the Pig
> >> > trunk from SVN and built it)
> >> > 
> >> > Now it complains about "Failed to create DataStorage". Any ideas?
> >> > Should I upgrade Hadoop too?
> >> > 
> >> > This is getting a bit complicated to install. :)
> >> > 
> >> > I would appreciate some pointers - google revealed nothing useful.
> >> > 
> >> > Thanks,
> >> > 
> >> > Anze
> >> > 
> >> > On Tuesday 26 October 2010, Anze wrote:
> >> >> Great! :)
> >> >> 
> >> >> Thanks for helping me out.
> >> >> 
> >> >> All the best,
> >> >> 
> >> >> Anze
> >> >> 
> >> >> On Tuesday 26 October 2010, Dmitriy Ryaboy wrote:
> >> >> > I think that you might be able to get away with 20.2 if you don't
> >> >> > use the filtering options.
> >> >> > 
> >> >> > On Mon, Oct 25, 2010 at 3:39 PM, Anze <an...@volja.net> wrote:
> >> >> > > Dmitriy, thanks for the answer!
> >> >> > > 
> >> >> > > The problem with upgrading to HBase 0.20.6 is that cloudera
> >> >> > > doesn't ship it yet and we would like to keep our install at
> >> >> > > "official" versions, even if beta. Of course, since this is a
> >> >> > > development / testing cluster, we could bend the rules if really
> >> >> > > necessary...
> >> >> > > 
> >> >> > > I have written a small MR job (actually, just "M" job :) that
> >> >> > > exports the tables to files (allowing me to use Pig 0.7), but
> >> >> > > that is a bit cumbersome and slow.
> >> >> > > 
> >> >> > > If I install the latest Pig (0.8), will it work at all with HBase
> >> >> > > 0.20.2? In other words, are scan filters (which were fixed in
> >> >> > > 0.20.6) needed as part of user-defined parameters or as part of
> >> >> > > Pig optimizations in reading from HBase? Hope my question makes
> >> >> > > sense...
> >> >> > > 
> >> >> > > :)
> >> >> > > 
> >> >> > > Thanks again,
> >> >> > > 
> >> >> > > Anze
> >> >> > > 
> >> >> > > On Tuesday 26 October 2010, Dmitriy Ryaboy wrote:
> >> >> > >> Anze, the reason we bumped up to 20.6 in the ticket was because
> >> >> > >> HBase 20.2 had a bug in it. Ask the HBase folks, but I'd say you
> >> >> > >> should upgrade.
> >> >> > >> FWIW we upgraded to 20.6 from 20.2 a few months back and it's
> >> >> > >> been working smoothly.
> >> >> > >> 
> >> >> > >> The Elephant-Bird hbase loader for pig 0.6 does add row keys and
> >> >> > >> most of the other features we added to the built-in loader for
> >> >> > >> pig 0.8 (notably, it does not do storage). But I don't
> >> >> > >> recommend downgrading to pig 0.6, as 7 and especially 8 are
> >> >> > >> great improvements to the software.
> >> >> > >> 
> >> >> > >> -D
> >> >> > >> 
> >> >> > >> On Mon, Oct 25, 2010 at 7:01 AM, Anze <an...@volja.net> wrote:
> >> >> > >> > Hi all!
> >> >> > >> > 
> >> >> > >> > I am struggling to find a working solution to load data from
> >> >> > >> > HBase directly. I am using Cloudera CDH3b3 which comes with
> >> >> > >> > Pig 0.7. What would be the easiest way to load data from
> >> >> > >> > HBase? If it matters: we need the rows to be included, too.
> >> >> > >> > 
> >> >> > >> > I have checked ElephantBird, but it seems to require Pig 0.6.
> >> >> > >> > I could downgrade, but it seems... well... :)
> >> >> > >> > 
> >> >> > >> > On the other hand, loading from HBase with rows is only added
> >> >> > >> > in Pig 0.8: https://issues.apache.org/jira/browse/PIG-915
> >> >> > >> > https://issues.apache.org/jira/browse/PIG-1205
> >> >> > >> > But judging from the last issue Pig 0.8 requires HBase 0.20.6?
> >> >> > >> > 
> >> >> > >> > I can install latest Pig from source if needed, but I'd rather
> >> >> > >> > leave Hadoop and HBase at their versions (0.20.2 and
> >> >> > >> > 0.89.20100924 respectively).
> >> >> > >> > 
> >> >> > >> > Should I write my own UDF? I'd appreciate some pointers.
> >> >> > >> > 
> >> >> > >> > Thanks,
> >> >> > >> > 
> >> >> > >> > Anze


Re: loading from HBase - Pig 0.7

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
The same way you have /etc/hadoop/conf on the classpath, you want to
put the HBase conf directory on the classpath.
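
For instance, a minimal sketch -- /etc/hbase/conf is an assumption here,
point it at wherever your HBase conf actually lives:

-----
# hypothetical paths; adjust to your layout
export PIG_CLASSPATH=/etc/hbase/conf:/etc/hadoop/conf:$PIG_CLASSPATH
bin/pig
-----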

-D


Re: loading from HBase - Pig 0.7

Posted by Anze <an...@volja.net>.
> ... You have all the conf files in PIG_CLASSPATH right?

I think I do:
***
PIG_HOME: /opt/pig/bin/..
PIG_CONF_DIR: /opt/pig/bin/../conf
dry run:
/usr/lib/jvm/java-6-sun/bin/java -Xmx1000m -Dpig.log.dir=/opt/pig/bin/../logs
-Dpig.log.file=pig.log -Dpig.home.dir=/opt/pig/bin/..
-Dpig.root.logger=INFO,console,DRFA -classpath
/opt/pig/bin/../conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/etc/hadoop/conf:
/opt/pig/bin/../build/classes:/opt/pig/bin/../build/test/classes:
/opt/pig/bin/../pig-*-core.jar:/opt/pig/bin/../build/pig-0.8.0-SNAPSHOT.jar:
/opt/pig/bin/../lib/automaton.jar:/opt/pig/bin/../lib/hbase-0.20.6.jar:
/opt/pig/bin/../lib/hbase-0.20.6-test.jar:/opt/pig/bin/../lib/zookeeper-hbase-1329.jar
org.apache.pig.Main
***

Generated log file contains:
***
Error before Pig is launched
----------------------------
ERROR 2999: Unexpected internal error. Failed to create DataStorage

java.lang.RuntimeException: Failed to create DataStorage
        at 
org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
        at 
org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
        at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:212)
        at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:132)
        at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
        at org.apache.pig.PigServer.<init>(PigServer.java:225)
        at org.apache.pig.PigServer.<init>(PigServer.java:214)
        at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
        at org.apache.pig.Main.run(Main.java:450)
        at org.apache.pig.Main.main(Main.java:107)
Caused by: java.io.IOException: Call to namenode.admundus.com/10.0.0.3:8020 
failed on local exception: java.io.EOFException
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
        at org.apache.hadoop.ipc.Client.call(Client.java:743)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy0.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
        at 
org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
        at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
        at 
org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
        ... 9 more
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at 
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
================================================================================

And the Pig complains:
***
log4j:WARN No appenders could be found for logger 
(org.apache.hadoop.conf.Configuration).
log4j:WARN Please initialize the log4j system properly.
2010-10-27 08:46:44,762 [main] INFO  org.apache.pig.Main - Logging error 
messages to: /opt/pig/bin/pig_1288162004754.log
2010-10-27 08:46:44,970 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
hadoop file system at: hdfs://...:8020/
2010-10-27 08:46:45,158 [main] ERROR org.apache.pig.Main - ERROR 2999: 
Unexpected internal error. Failed to create DataStorage
Details at logfile: /opt/pig/bin/pig_1288162004754.log
***

Any idea what is wrong? I have searched the net and most answers talk about 
incompatible versions of Hadoop and Pig (but the posts are old).

Thanks,

Anze




Re: loading from HBase - Pig 0.7

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Yeah, Pig 0.8 is not officially released yet; it will be cut at the end
of the month or the beginning of next month.

"Failed to create DataStorage" sounds vaguely familiar... can you send
the full Pig session and the full error? I think it's not connecting
to HBase on the client side, or something along those lines. You have
all the conf files in PIG_CLASSPATH, right?

-D


Re: loading from HBase - Pig 0.7

Posted by Anze <an...@volja.net>.
Hmmm, not quite there yet. :-/

I installed:
- HBase 0.20.6
- Cloudera CDH3b3 Hadoop (0.20.2) 
- Pig 0.8 (since the official download is empty (?), I fetched the Pig trunk
from SVN and built it -- see the sketch below)
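
Roughly, the checkout and build went along these lines (a sketch -- the
trunk URL is the standard ASF one, and "ant jar" is the usual Pig build
target; adjust paths to taste):

-----
# check out Pig trunk and build the jar with ant
svn co http://svn.apache.org/repos/asf/pig/trunk pig-trunk
cd pig-trunk
ant jar   # should end up as build/pig-0.8.0-SNAPSHOT.jar
-----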

Now it complains about "Failed to create DataStorage". Any ideas? Should I
upgrade Hadoop too?

This is getting a bit complicated to install. :)

I would appreciate some pointers - Google revealed nothing useful.

Thanks,

Anze




Re: loading from HBase - Pig 0.7

Posted by Anze <an...@volja.net>.
Great! :)

Thanks for helping me out.

All the best,

Anze

On Tuesday 26 October 2010, Dmitriy Ryaboy wrote:
> I think that you might be able to get away with 0.20.2 if you don't use
> the filtering options.


Re: loading from HBase - Pig 0.7

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
I think that you might be able to get away with 0.20.2 if you don't use
the filtering options.
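
Concretely, something like this (a sketch -- the table and column names are
made up, and the -gt/-lt option spellings are the ones the Pig 0.8
HBaseStorage takes, so double-check them against your build):

-----
-- plain column load: no scan filter involved, should be fine on HBase 0.20.2
raw = LOAD 'hbase://mytable'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:col1 cf:col2');

-- row-key range options push down a scan filter; this is the part that
-- needs the 0.20.6 fix
ranged = LOAD 'hbase://mytable'
         USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
             'cf:col1 cf:col2', '-gt row_aaa -lt row_zzz');
-----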



Re: loading from HBase - Pig 0.7

Posted by Anze <an...@volja.net>.
Dmitriy, thanks for the answer!

The problem with upgrading to HBase 0.20.6 is that Cloudera doesn't ship it
yet, and we would like to keep our install at "official" versions, even if
they are beta. Of course, since this is a development / testing cluster, we
could bend the rules if really necessary...

I have written a small MR job (actually, just an "M" job :) that exports the
tables to files (allowing me to use Pig 0.7), but that is a bit cumbersome
and slow.

If I install the latest Pig (0.8), will it work at all with HBase 0.20.2?
In other words, are the scan filters (which were fixed in 0.20.6) only used
when I pass filter parameters myself, or does Pig also use them internally
to optimize its reads from HBase? Hope my question makes sense... :)

Thanks again,

Anze




Re: loading from HBase - Pig 0.7

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Anze, the reason we bumped up to 0.20.6 in the ticket was that HBase
0.20.2 had a bug in it. Ask the HBase folks, but I'd say you should
upgrade.
FWIW, we upgraded from 0.20.2 to 0.20.6 a few months back and it's been
working smoothly.

The Elephant-Bird HBase loader for Pig 0.6 does add row keys and most
of the other features we added to the built-in loader for Pig 0.8
(notably, it does not do storage). But I don't recommend downgrading
to Pig 0.6, as 0.7 and especially 0.8 are great improvements to the
software.
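
With the 0.8 built-in loader, including the row key looks roughly like this
(a sketch -- the table and columns are made up; -loadKey is the option
PIG-1205 added, so check its exact syntax against your build):

-----
-- -loadKey makes the row key the first field of each tuple
users = LOAD 'hbase://users'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'info:name info:age', '-loadKey')
        AS (id:bytearray, name:chararray, age:chararray);
-----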

-D

