
Posted to user@pig.apache.org by Rob Stewart <ro...@googlemail.com> on 2010/01/14 15:49:23 UTC

Pig DataGenerator as a MR Job

Hi there.

I am well underway with comparing Pig, Hive, JAQL etc...

The DataGenerator is proving a valuable tool for me. Thanks for that.

I have one query. I am able to use it in local mode, no problem, and some
experiments are complete.

However, I cannot seem to use it in MapReduce mode on the cluster. These are
the contents of my "generateData" file:
------------------
export pigjar=$HOME/installation/pig/pig-0.5.0/pig-0.5.0-core.jar
export zipfjar=$HOME/installation/pig/pig-0.5.0/sdsuLibJKD14.jar
export datagenjar=$HOME/rs46/installation/DataGenerator/dist/MyPig.jar
export conf_file=/usr/lib/hadoop/conf/hadoop-site.xml
export HADOOP_CLASSPATH=$pigjar:$zipfjar:$datagenjar
/usr/lib/hadoop/bin/hadoop jar $datagenjar \
  org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file \
  -m 1 -rows 10000000 -f words.dat s:8:50:z:0
------------------

The error I receive when trying to run it with the "-m 1" option (in cluster
mode):
Caused by: java.lang.ClassNotFoundException: sdsu.algorithms.data.Zipf

So in local mode it successfully picks up the jar file sdsuLibJKD14.jar,
but when running in cluster mode the jar is not found on the classpath?


thanks.

Rob Stewart
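
For readers following along: a later reply in this thread explains that
HADOOP_CLASSPATH only makes jars discoverable locally, and that jars needed on
the worker nodes have to be listed, comma-separated, in the -libjars parameter.
A minimal sketch of that conversion follows. The helper name to_libjars and the
example paths are hypothetical, not part of Pig or Hadoop.

```shell
#!/bin/sh
# Hypothetical helper: turn a colon-separated classpath into the
# comma-separated list that -libjars expects.
to_libjars() {
    printf '%s\n' "$1" | tr ':' ','
}

# Example with placeholder paths (not the poster's real layout).
# prints: /opt/pig/pig-core.jar,/opt/pig/sdsuLibJKD14.jar
to_libjars "/opt/pig/pig-core.jar:/opt/pig/sdsuLibJKD14.jar"
```

The result could then be passed as the value of -libjars. Where that option has
to appear on the command line is what the rest of the thread works out.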

Re: Pig DataGenerator as a MR Job

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Thanks for persevering, Rob! :)

-D


Re: Pig DataGenerator as a MR Job

Posted by Rob Stewart <ro...@googlemail.com>.

Cheers Alan,

Done.

Rob.



Re: Pig DataGenerator as a MR Job

Posted by Alan Gates <ga...@yahoo-inc.com>.

Rob,

Feel free to update the wiki with your findings.  You don't have to be  
a committer to change the wiki.

Alan.


Re: Pig DataGenerator as a MR Job

Posted by Rob Stewart <ro...@googlemail.com>.

Hello Dmitriy!

I have it solved; it was just a bit of trial and error based on the Hive bug
report/fix I found.

The report is indeed correct. The following works:

hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator \
  -libjars $zipfjar -conf $conf_file -rows 10000000 -m 3 \
  -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0

This puts the Pig wiki out of date for Hadoop 0.20, although it is still
relevant for Hadoop 0.18 and earlier.

May I propose that you update the wiki as such:
------------------------
DataGenerator Usage:

For 0.18.0:
hadoop jar -libjars $zipfjar $datagenjar \
  org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file [options] colspec...

For 0.20.0:
hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator \
  -libjars $zipfjar -conf $conf_file [options] colspec...
--------------

Sound OK?


Rob Stewart
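
The two layouts proposed above could be wrapped in one script. This is only a
sketch: datagen_cmd is a hypothetical helper, the jar and config names are
placeholders, and it prints the command line rather than running it. It covers
only the two versions discussed in this thread (0.18 as described on the Pig
wiki, 0.20 as found above) and refuses anything else.

```shell
#!/bin/sh
# Hypothetical helper: $1 is the Hadoop version string, the remaining
# arguments are DataGenerator options and column specs.
datagen_cmd() {
    version=$1
    shift
    main=org.apache.pig.test.utils.datagen.DataGenerator
    case "$version" in
        0.18*)
            # Layout from the Pig wiki: -libjars directly after "hadoop jar".
            echo hadoop jar -libjars "$zipfjar" "$datagenjar" "$main" \
                -conf "$conf_file" "$@"
            ;;
        0.20*)
            # Layout that worked above: -libjars after the job jar and main class.
            echo hadoop jar "$datagenjar" "$main" -libjars "$zipfjar" \
                -conf "$conf_file" "$@"
            ;;
        *)
            echo "datagen_cmd: argument order not known for Hadoop $version" >&2
            return 1
            ;;
    esac
}

# Placeholder values for illustration only.
zipfjar=sdsuLibJKD14.jar
datagenjar=MyPig.jar
conf_file=hadoop-site.xml

datagen_cmd 0.20.1 -m 3 -rows 10000000 -f words.dat s:8:50:z:0
```

Printing instead of executing makes it easy to check the argument order before
submitting a long-running job.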


2010/1/14 Rob Stewart <ro...@googlemail.com>

> Yeah, unfortunately your suggestion does not work, and neither does the
> order given on the Pig wiki. Instead, see the Hadoop wiki for -libjars
> usage:
>
> hadoop jar hadoop-examples.jar wordcount -files cachefile.txt -libjars
> mylib.jar input output
>
> So I tried this:
> hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator
> -conf $conf_file -rows 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat
> -libjars $zipfjar s:8:50:z:0
>
> However, the DataGenerator does not accept it as one of its options:
> ---------
> Couldn't parse the command line arguments, Found unknown option (-libjars)
> at position 5
> ---------
>
> I'd be happy/surprised to hear from anyone who can use the format given on
> the Pig wiki for the DataGenerator, in cluster mode (using -m parameter).
>
> Any more suggestions, Dmitriy? Thanks for your help, it's much
> appreciated!
>
> Rob
>
>
>
> 2010/1/14 Dmitriy Ryaboy <dv...@gmail.com>
>
>> Sorry if I am not reading carefully enough -- but the bug report you
>> cite seems to indicate you want
>>
>> hadoop jar org.apache.pig.test.utils.datagen.DataGenerator -libjars
>> $zipfjar $datagenjar -conf $conf_file -rows
>> 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
>>
>> (possibly separating zipfjar and datagenjar with commas if that patch
>> was applied to your version of 20)
>>
>> which I don't see in the list of things you tried?
>>
>> -D
>>
>> On Thu, Jan 14, 2010 at 10:13 AM, Rob Stewart
>> <ro...@googlemail.com> wrote:
>> > Hi Dmitriy,
>> >
>> > No, I do think that there was a change in 0.20.0
>> >
>> > See the error I get:
>> > Exception in thread "main" java.io.IOException: Error opening job jar:
>> > -libjars
>> >
>> > This is what I am trying to run:
>> > hadoop jar -libjars $zipfjar $datagenjar
>> > org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows
>> > 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
>> >
>> > The $zipfjar has only one jar file in this classpath. It seems that there
>> > was a change to hadoop 0.20.0, not allowing for the option -libjars
>> > immediately after "hadoop jar".
>> >
>> > This is the extract from the Hive bug report I was talking about:
>> > -------------
>> >
>> >
>> > In hadoop-20 - the -libjars has to come after the jar file/class
>> >
>> > Please try applying this patch to bin/ext/cli.sh
>> >
>> > --- cli.sh  (revision 789726)
>> > +++ cli.sh  (working copy)
>> > @@ -10,7 +10,7 @@
>> >     exit 3;
>> >   fi
>> >
>> > -  exec $HADOOP jar $AUX_JARS_CMD_LINE ${HIVE_LIB}/hive_cli.jar $CLASS $HIVE_OPTS "$@"
>> > +  exec $HADOOP jar ${HIVE_LIB}/hive_cli.jar $CLASS $AUX_JARS_CMD_LINE $HIVE_OPTS "$@"
>> >  }
>> >
>> > ----------------
>> >
>> > I have also tried:
>> > hadoop jar -libjars [full_location_to_sdsuLibJKD14.jar] $datagenjar
>> > org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows
>> > 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
>> >
>> > This gives the same error.
>> >
>> >
>> >
>> > Rob
>> >
>> > 2010/1/14 Dmitriy Ryaboy <dv...@gmail.com>
>> >
>> >> I think the link you sent got malformatted, but try separating the
>> >> jars with a comma
>> >> http://issues.apache.org/jira/browse/HADOOP-4864
>> >>
>> >> On Thu, Jan 14, 2010 at 7:40 AM, Rob Stewart
>> >> <ro...@googlemail.com> wrote:
>> >> > Hi Dmitriy,
>> >> >
>> >> > OK, well it seems that since 0.20.0 the order as specified on the Pig wiki
>> >> > is no longer relevant:
>> >> > hadoop jar -libjars $zipfjar $datagenjar
>> >> > org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file [options]
>> >> > colspec...
>> >> >
>> >> > See this patch over at Hive for 0.20.0:
>> >> >
>> >> > http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/200907.mbox/<DFD95197F3AE8C45B0A96C2F4BA3A2556C8358C30B@SC-MBXC1.TheFacebook.com>
>> >> >
>> >> > I have tried a few combinations, but I can't seem to fit in the "-libjars
>> >> > $zipfjar" in anywhere now.
>> >> >
>> >> > Any ideas?
>> >> >
>> >> > Thanks for your help.
>> >> >
>> >> > Rob
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > 2010/1/14 Dmitriy Ryaboy <dv...@gmail.com>
>> >> >
>> >> >> Rob,
>> >> >> You need to tell Hadoop which jars you need it to ship to the worker
>> >> >> nodes. You include datagen.jar, etc, on the classpath, which makes
>> >> >> them discoverable locally, but you aren't telling Hadoop to ship them.
>> >> >> You want to list them, comma-separated, in the -libjars parameter.
>> >> >>
>> >> >> -D
>> >> >>
>> >> >> On Thu, Jan 14, 2010 at 6:49 AM, Rob Stewart
>> >> >> <ro...@googlemail.com> wrote:
>> >> >> > Hi there.
>> >> >> >
>> >> >> > I am well underway with comparing Pig, Hive, JAQL etc...
>> >> >> >
>> >> >> > The DataGenerator is proving a valuable tool for me. Thanks for that.
>> >> >> >
>> >> >> > I have one query. I am able to use it in local mode, no problem, and some
>> >> >> > experiments are complete.
>> >> >> >
>> >> >> > However, I cannot seem to use it in MapReduce mode on the cluster. This is
>> >> >> > my file "generateData" contents:
>> >> >> > ------------------
>> >> >> > export pigjar=$HOME/installation/pig/pig-0.5.0/pig-0.5.0-core.jar
>> >> >> > export zipfjar=$HOME/installation/pig/pig-0.5.0/sdsuLibJKD14.jar
>> >> >> > export datagenjar=$HOME/rs46/installation/DataGenerator/dist/MyPig.jar
>> >> >> > export conf_file=/usr/lib/hadoop/conf/hadoop-site.xml
>> >> >> > export HADOOP_CLASSPATH=$pigjar:$zipfjar:$datagenjar
>> >> >> > /usr/lib/hadoop/bin/hadoop jar $datagenjar
>> >> >> > org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -m 1 -rows
>> >> >> > 10000000 -f words.dat s:8:50:z:0
>> >> >> > ------------------
>> >> >> >
>> >> >> > The error I receive when trying to run it with "-m 1" option (in cluster
>> >> >> > mode):
>> >> >> > Caused by: java.lang.ClassNotFoundException: sdsu.algorithms.data.Zipf
>> >> >> >
>> >> >> > So in local mode, it successfully picks up the jar file sdsuLibJKD14.jar,
>> >> >> > but when running it in cluster mode, this classpath is not found?
>> >> >> >
>> >> >> >
>> >> >> > thanks.
>> >> >> >
>> >> >> > Rob Stewart
>> >> >> >
>> >> >>
>> >> >
>> >>
>> >
>>
>
>

Re: Pig DataGenerator as a MR Job

Posted by Rob Stewart <ro...@googlemail.com>.

Yeah, unfortunately your suggestion does not work, and neither does the
order given on the Pig wiki. Instead, see the Hadoop wiki for -libjars
usage:

hadoop jar hadoop-examples.jar wordcount -files cachefile.txt -libjars
mylib.jar input output

So I tried this:
hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf
$conf_file -rows 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat
-libjars $zipfjar s:8:50:z:0

However, the DataGenerator does not accept it as one of its options:
---------
Couldn't parse the command line arguments, Found unknown option (-libjars)
at position 5
---------

I'd be happy/surprised to hear from anyone who can use the format given on
the Pig wiki for the DataGenerator, in cluster mode (using -m parameter).

Any more suggestions, Dmitriy? Thanks for your help, it's much
appreciated!

Rob



2010/1/14 Dmitriy Ryaboy <dv...@gmail.com>

> Sorry if I am not reading carefully enough -- but the bug report you
> cite seems to indicate you want
>
> hadoop jar org.apache.pig.test.utils.datagen.DataGenerator -libjars
> $zipfjar $datagenjar -conf $conf_file -rows
> 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
>
> (possibly separating zipfjar and datagenjar with commas if that patch
> was applied to your version of 20)
>
> which I don't see in the list of things you tried?
>
> -D
>
> On Thu, Jan 14, 2010 at 10:13 AM, Rob Stewart
> <ro...@googlemail.com> wrote:
> > Hi Dmitriy,
> >
> > No, I do think that there was a change in 0.20.0
> >
> > See the error I get:
> > Exception in thread "main" java.io.IOException: Error opening job jar:
> > -libjars
> >
> > This is what I am trying to run:
> > hadoop jar -libjars $zipfjar $datagenjar
> > org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows
> > 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
> >
> > The $zipfjar has only one jar file in this classpath. It seems that there
> > was a change to hadoop 0.20.0, not allowing for the option -libjars
> > immediately after "hadoop jar".
> >
> > This is the extract from the Hive bug report I was talking about:
> > -------------
> >
> >
> > In hadoop-20 - the -libjars has to come after the jar file/class
> >
> > Please try applying this patch to bin/ext/cli.sh
> >
> > --- cli.sh  (revision 789726)
> > +++ cli.sh  (working copy)
> > @@ -10,7 +10,7 @@
> >     exit 3;
> >   fi
> >
> > -  exec $HADOOP jar $AUX_JARS_CMD_LINE ${HIVE_LIB}/hive_cli.jar $CLASS $HIVE_OPTS "$@"
> > +  exec $HADOOP jar ${HIVE_LIB}/hive_cli.jar $CLASS $AUX_JARS_CMD_LINE $HIVE_OPTS "$@"
> >  }
> >
> > ----------------
> >
> > I have also tried:
> > hadoop jar -libjars [full_location_to_sdsuLibJKD14.jar] $datagenjar
> > org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows
> > 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
> >
> > This gives the same error.
> >
> >
> >
> > Rob
> >
> > 2010/1/14 Dmitriy Ryaboy <dv...@gmail.com>
> >
> >> I think the link you sent got malformatted, but try separating the
> >> jars with a comma
> >> http://issues.apache.org/jira/browse/HADOOP-4864
> >>
> >> On Thu, Jan 14, 2010 at 7:40 AM, Rob Stewart
> >> <ro...@googlemail.com> wrote:
> >> > Hi Dmitriy,
> >> >
> >> > OK, well it seems that since 0.20.0 the order as specified on the Pig
> >> wiki
> >> > is no longer relevant:
> >> > hadoop jar -libjars $zipfjar $datagenjar
> >> > org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file [options]
> >> > colspec...
> >> >
> >> > See this patch over at Hive for 0.20.0:
> >> >
> http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/200907.mbox/<
> >> > DFD95197F3AE8C45B0A96C2F4BA3A2556C8358C30B@SC-MBXC1.TheFacebook.com>
> >> >
> >> > I have tried a few combinations, but I can't seem to fit in the
> "-libjars
> >> > $zipfjar" in anywhere now.
> >> >
> >> > Any ideas?
> >> >
> >> > Thanks for your help.
> >> >
> >> > Rob
> >> >
> >> >
> >> >
> >> >
> >> > 2010/1/14 Dmitriy Ryaboy <dv...@gmail.com>
> >> >
> >> >> Rob,
> >> >> You need to tell Hadoop which jars you need it to ship to the worker
> >> >> nodes. You include datagen.jar, etc, on the classpath, which makes
> >> >> them discoverable locally, but you aren't telling Hadoop to ship
> them.
> >> >> You want to list them, comma-separated, in the -libjars parameter.
> >> >>
> >> >> -D
> >> >>
> >> >> On Thu, Jan 14, 2010 at 6:49 AM, Rob Stewart
> >> >> <ro...@googlemail.com> wrote:
> >> >> > Hi there.
> >> >> >
> >> >> > I am well underway with comparing Pig, Hive, JAQL etc...
> >> >> >
> >> >> > The DataGenerator is proving a valuable tool for me. Thanks for
> that.
> >> >> >
> >> >> > I have one query. I am able to use it in local mode, no problem,
> and
> >> some
> >> >> > experiments are complete.
> >> >> >
> >> >> > However, I cannot seem to use it in MapReduce mode on the cluster.
> >> This
> >> >> is
> >> >> > my file "generateData" contents:
> >> >> > ------------------
> >> >> > export pigjar=$HOME/installation/pig/pig-0.5.0/pig-0.5.0-core.jar
> >> >> > export zipfjar=$HOME/installation/pig/pig-0.5.0/sdsuLibJKD14.jar
> >> >> > export
> datagenjar=$HOME/rs46/installation/DataGenerator/dist/MyPig.jar
> >> >> > export conf_file=/usr/lib/hadoop/conf/hadoop-site.xml
> >> >> > export HADOOP_CLASSPATH=$pigjar:$zipfjar:$datagenjar
> >> >> > /usr/lib/hadoop/bin/hadoop jar $datagenjar
> >> >> > org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -m
> 1
> >> >> -rows
> >> >> > 10000000 -f words.dat s:8:50:z:0
> >> >> > ------------------
> >> >> >
> >> >> > The error I receive when trying to run it with "-m 1" option (in
> >> cluster
> >> >> > mode):
> >> >> > Caused by: java.lang.ClassNotFoundException:
> sdsu.algorithms.data.Zipf
> >> >> >
> >> >> > So in local mode, it successfully picks up the jar file
> >> sdsuLibJKD14.jar
> >> >> ,
> >> >> > but when running it in cluster mode, this classpath is not found?
> >> >> >
> >> >> >
> >> >> > thanks.
> >> >> >
> >> >> > Rob Stewart
> >> >> >
> >> >>
> >> >
> >>
> >
>

Re: Pig DataGenerator as a MR Job

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Sorry if I am not reading carefully enough -- but the bug report you
cite seems to indicate you want

hadoop jar org.apache.pig.test.utils.datagen.DataGenerator -libjars
$zipfjar $datagenjar -conf $conf_file -rows
10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0

(possibly separating zipfjar and datagenjar with commas if that patch
was applied to your version of 0.20)

which I don't see in the list of things you tried?

-D

On Thu, Jan 14, 2010 at 10:13 AM, Rob Stewart
<ro...@googlemail.com> wrote:
> Hi Dmitriy,
>
> No, I do think that there was a change in 0.20.0
>
> See the error I get:
> Exception in thread "main" java.io.IOException: Error opening job jar:
> -libjars
>
> This is what I am trying to run:
> hadoop jar -libjars $zipfjar $datagenjar
> org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows
> 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
>
> The $zipfjar has only one jar file in this classpath. It seems that there
> was a change to hadoop 0.20.0, not allowing for the option -libjars
> immediately after "hadoop jar".
>
> This is the extract from the Hive bug report I was talking about:
> -------------
>
>
> In hadoop-20 - the -libjars has to come after the jar file/class
>
> Please try applying this patch to bin/ext/cli.sh
>
> --- cli.sh  (revision 789726)
> +++ cli.sh  (working copy)
> @@ -10,7 +10,7 @@
>     exit 3;
>   fi
>
> -  exec $HADOOP jar $AUX_JARS_CMD_LINE ${HIVE_LIB}/hive_cli.jar $CLASS $HIVE_OPTS "$@"
> +  exec $HADOOP jar ${HIVE_LIB}/hive_cli.jar $CLASS $AUX_JARS_CMD_LINE $HIVE_OPTS "$@"
>  }
>
> ----------------
>
> I have also tried:
> hadoop jar -libjars [full_location_to_sdsuLibJKD14.jar] $datagenjar
> org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows
> 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
>
> This gives the same error.
>
>
>
> Rob
>
> 2010/1/14 Dmitriy Ryaboy <dv...@gmail.com>
>
>> I think the link you sent got malformatted, but try separating the
>> jars with a comma
>> http://issues.apache.org/jira/browse/HADOOP-4864
>>
>> On Thu, Jan 14, 2010 at 7:40 AM, Rob Stewart
>> <ro...@googlemail.com> wrote:
>> > Hi Dmitriy,
>> >
>> > OK, well it seems that since 0.20.0 the order as specified on the Pig
>> wiki
>> > is no longer relevant:
>> > hadoop jar -libjars $zipfjar $datagenjar
>> > org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file [options] colspec...
>> >
>> > See this patch over at Hive for 0.20.0:
>> > http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/200907.mbox/<
>> > DFD95197F3AE8C45B0A96C2F4BA3A2556C8358C30B@SC-MBXC1.TheFacebook.com>
>> >
>> > I have tried a few combinations, but I can't seem to fit in the "-libjars
>> > $zipfjar" in anywhere now.
>> >
>> > Any ideas?
>> >
>> > Thanks for your help.
>> >
>> > Rob
>> >
>> >
>> >
>> >
>> > 2010/1/14 Dmitriy Ryaboy <dv...@gmail.com>
>> >
>> >> Rob,
>> >> You need to tell Hadoop which jars you need it to ship to the worker
>> >> nodes. You include datagen.jar, etc, on the classpath, which makes
>> >> them discoverable locally, but you aren't telling Hadoop to ship them.
>> >> You want to list them, comma-separated, in the -libjars parameter.
>> >>
>> >> -D
>> >>
>> >> On Thu, Jan 14, 2010 at 6:49 AM, Rob Stewart
>> >> <ro...@googlemail.com> wrote:
>> >> > Hi there.
>> >> >
>> >> > I am well underway with comparing Pig, Hive, JAQL etc...
>> >> >
>> >> > The DataGenerator is proving a valuable tool for me. Thanks for that.
>> >> >
>> >> > I have one query. I am able to use it in local mode, no problem, and
>> some
>> >> > experiments are complete.
>> >> >
>> >> > However, I cannot seem to use it in MapReduce mode on the cluster.
>> This
>> >> is
>> >> > my file "generateData" contents:
>> >> > ------------------
>> >> > export pigjar=$HOME/installation/pig/pig-0.5.0/pig-0.5.0-core.jar
>> >> > export zipfjar=$HOME/installation/pig/pig-0.5.0/sdsuLibJKD14.jar
>> >> > export datagenjar=$HOME/rs46/installation/DataGenerator/dist/MyPig.jar
>> >> > export conf_file=/usr/lib/hadoop/conf/hadoop-site.xml
>> >> > export HADOOP_CLASSPATH=$pigjar:$zipfjar:$datagenjar
>> >> > /usr/lib/hadoop/bin/hadoop jar $datagenjar
>> >> > org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -m 1
>> >> -rows
>> >> > 10000000 -f words.dat s:8:50:z:0
>> >> > ------------------
>> >> >
>> >> > The error I receive when trying to run it with "-m 1" option (in
>> cluster
>> >> > mode):
>> >> > Caused by: java.lang.ClassNotFoundException: sdsu.algorithms.data.Zipf
>> >> >
>> >> > So in local mode, it successfully picks up the jar file
>> sdsuLibJKD14.jar
>> >> ,
>> >> > but when running it in cluster mode, this classpath is not found?
>> >> >
>> >> >
>> >> > thanks.
>> >> >
>> >> > Rob Stewart
>> >> >
>> >>
>> >
>>
>

Re: Pig DataGenerator as a MR Job

Posted by Rob Stewart <ro...@googlemail.com>.

Hi Dmitriy,

No, I do think that there was a change in 0.20.0

See the error I get:
Exception in thread "main" java.io.IOException: Error opening job jar:
-libjars

This is what I am trying to run:
hadoop jar -libjars $zipfjar $datagenjar
org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows
10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0

$zipfjar points to a single jar file. It seems that a change in Hadoop
0.20.0 no longer allows the -libjars option immediately after
"hadoop jar".

This is the extract from the Hive bug report I was talking about:
-------------


In hadoop-20 - the -libjars has to come after the jar file/class

Please try applying this patch to bin/ext/cli.sh

--- cli.sh  (revision 789726)
+++ cli.sh  (working copy)
@@ -10,7 +10,7 @@
     exit 3;
   fi

-  exec $HADOOP jar $AUX_JARS_CMD_LINE ${HIVE_LIB}/hive_cli.jar $CLASS $HIVE_OPTS "$@"
+  exec $HADOOP jar ${HIVE_LIB}/hive_cli.jar $CLASS $AUX_JARS_CMD_LINE $HIVE_OPTS "$@"
 }

----------------
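
Applied to the DataGenerator, the ordering that patch describes would look
roughly like the sketch below: jar and class first, then the generic options
(-libjars, -conf), then the DataGenerator's own options and column spec. The
paths are hypothetical placeholders, and the sketch assumes the DataGenerator
passes its arguments through Hadoop's generic option handling, which its
acceptance of -conf suggests but this thread does not confirm.

```shell
# Hypothetical locations; substitute your own.
datagenjar=/path/to/MyPig.jar
zipfjar=/path/to/sdsuLibJKD14.jar
conf_file=/path/to/hadoop-site.xml
main_class=org.apache.pig.test.utils.datagen.DataGenerator

# Jar file and class come first, generic options next,
# tool-specific options and the column spec last.
cmd="hadoop jar $datagenjar $main_class -libjars $zipfjar -conf $conf_file -m 1 -rows 10000000 -f words.dat s:8:50:z:0"

# Print rather than run, so the ordering can be checked before submitting.
echo "$cmd"
```

Printing the command first is only a convenience for checking the argument
order; once it looks right, the same string can be run directly.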

I have also tried:
hadoop jar -libjars [full_location_to_sdsuLibJKD14.jar] $datagenjar
org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows
10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0

This gives the same error.



Rob

2010/1/14 Dmitriy Ryaboy <dv...@gmail.com>

> I think the link you sent got malformatted, but try separating the
> jars with a comma
> http://issues.apache.org/jira/browse/HADOOP-4864
>
> On Thu, Jan 14, 2010 at 7:40 AM, Rob Stewart
> <ro...@googlemail.com> wrote:
> > Hi Dmitriy,
> >
> > OK, well it seems that since 0.20.0 the order as specified on the Pig
> wiki
> > is no longer relevant:
> > hadoop jar -libjars $zipfjar $datagenjar
> > org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file [options] colspec...
> >
> > See this patch over at Hive for 0.20.0:
> > http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/200907.mbox/<
> > DFD95197F3AE8C45B0A96C2F4BA3A2556C8358C30B@SC-MBXC1.TheFacebook.com>
> >
> > I have tried a few combinations, but I can't seem to fit in the "-libjars
> > $zipfjar" in anywhere now.
> >
> > Any ideas?
> >
> > Thanks for your help.
> >
> > Rob
> >
> >
> >
> >
> > 2010/1/14 Dmitriy Ryaboy <dv...@gmail.com>
> >
> >> Rob,
> >> You need to tell Hadoop which jars you need it to ship to the worker
> >> nodes. You include datagen.jar, etc, on the classpath, which makes
> >> them discoverable locally, but you aren't telling Hadoop to ship them.
> >> You want to list them, comma-separated, in the -libjars parameter.
> >>
> >> -D
> >>
> >> On Thu, Jan 14, 2010 at 6:49 AM, Rob Stewart
> >> <ro...@googlemail.com> wrote:
> >> > Hi there.
> >> >
> >> > I am well underway with comparing Pig, Hive, JAQL etc...
> >> >
> >> > The DataGenerator is proving a valuable tool for me. Thanks for that.
> >> >
> >> > I have one query. I am able to use it in local mode, no problem, and
> some
> >> > experiments are complete.
> >> >
> >> > However, I cannot seem to use it in MapReduce mode on the cluster.
> This
> >> is
> >> > my file "generateData" contents:
> >> > ------------------
> >> > export pigjar=$HOME/installation/pig/pig-0.5.0/pig-0.5.0-core.jar
> >> > export zipfjar=$HOME/installation/pig/pig-0.5.0/sdsuLibJKD14.jar
> >> > export datagenjar=$HOME/rs46/installation/DataGenerator/dist/MyPig.jar
> >> > export conf_file=/usr/lib/hadoop/conf/hadoop-site.xml
> >> > export HADOOP_CLASSPATH=$pigjar:$zipfjar:$datagenjar
> >> > /usr/lib/hadoop/bin/hadoop jar $datagenjar
> >> > org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -m 1
> >> -rows
> >> > 10000000 -f words.dat s:8:50:z:0
> >> > ------------------
> >> >
> >> > The error I receive when trying to run it with "-m 1" option (in
> cluster
> >> > mode):
> >> > Caused by: java.lang.ClassNotFoundException: sdsu.algorithms.data.Zipf
> >> >
> >> > So in local mode, it successfully picks up the jar file
> sdsuLibJKD14.jar
> >> ,
> >> > but when running it in cluster mode, this classpath is not found?
> >> >
> >> >
> >> > thanks.
> >> >
> >> > Rob Stewart
> >> >
> >>
> >
>

Re: Pig DataGenerator as a MR Job

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

I think the link you sent got mangled, but try separating the
jars with a comma:
http://issues.apache.org/jira/browse/HADOOP-4864

On Thu, Jan 14, 2010 at 7:40 AM, Rob Stewart
<ro...@googlemail.com> wrote:
> Hi Dmitriy,
>
> OK, well it seems that since 0.20.0 the order as specified on the Pig wiki
> is no longer relevant:
> hadoop jar -libjars $zipfjar $datagenjar
> org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file [options] colspec...
>
> See this patch over at Hive for 0.20.0:
> http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/200907.mbox/<
> DFD95197F3AE8C45B0A96C2F4BA3A2556C8358C30B@SC-MBXC1.TheFacebook.com>
>
> I have tried a few combinations, but I can't seem to fit in the "-libjars
> $zipfjar" in anywhere now.
>
> Any ideas?
>
> Thanks for your help.
>
> Rob
>
>
>
>
> 2010/1/14 Dmitriy Ryaboy <dv...@gmail.com>
>
>> Rob,
>> You need to tell Hadoop which jars you need it to ship to the worker
>> nodes. You include datagen.jar, etc, on the classpath, which makes
>> them discoverable locally, but you aren't telling Hadoop to ship them.
>> You want to list them, comma-separated, in the -libjars parameter.
>>
>> -D
>>
>> On Thu, Jan 14, 2010 at 6:49 AM, Rob Stewart
>> <ro...@googlemail.com> wrote:
>> > Hi there.
>> >
>> > I am well underway with comparing Pig, Hive, JAQL etc...
>> >
>> > The DataGenerator is proving a valuable tool for me. Thanks for that.
>> >
>> > I have one query. I am able to use it in local mode, no problem, and some
>> > experiments are complete.
>> >
>> > However, I cannot seem to use it in MapReduce mode on the cluster. This
>> is
>> > my file "generateData" contents:
>> > ------------------
>> > export pigjar=$HOME/installation/pig/pig-0.5.0/pig-0.5.0-core.jar
>> > export zipfjar=$HOME/installation/pig/pig-0.5.0/sdsuLibJKD14.jar
>> > export datagenjar=$HOME/rs46/installation/DataGenerator/dist/MyPig.jar
>> > export conf_file=/usr/lib/hadoop/conf/hadoop-site.xml
>> > export HADOOP_CLASSPATH=$pigjar:$zipfjar:$datagenjar
>> > /usr/lib/hadoop/bin/hadoop jar $datagenjar
>> > org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -m 1
>> -rows
>> > 10000000 -f words.dat s:8:50:z:0
>> > ------------------
>> >
>> > The error I receive when trying to run it with "-m 1" option (in cluster
>> > mode):
>> > Caused by: java.lang.ClassNotFoundException: sdsu.algorithms.data.Zipf
>> >
>> > So in local mode, it successfully picks up the jar file sdsuLibJKD14.jar
>> ,
>> > but when running it in cluster mode, this classpath is not found?
>> >
>> >
>> > thanks.
>> >
>> > Rob Stewart
>> >
>>
>

Re: Pig DataGenerator as a MR Job

Posted by Rob Stewart <ro...@googlemail.com>.

Hi Dmitriy,

OK, well it seems that since 0.20.0 the order specified on the Pig wiki
no longer works:
hadoop jar -libjars $zipfjar $datagenjar
org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file [options] colspec...

See this patch over at Hive for 0.20.0:
http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/200907.mbox/<DFD95197F3AE8C45B0A96C2F4BA3A2556C8358C30B@SC-MBXC1.TheFacebook.com>

I have tried a few combinations, but I can't seem to fit "-libjars
$zipfjar" in anywhere now.

Any ideas?

Thanks for your help.

Rob




2010/1/14 Dmitriy Ryaboy <dv...@gmail.com>

> Rob,
> You need to tell Hadoop which jars you need it to ship to the worker
> nodes. You include datagen.jar, etc, on the classpath, which makes
> them discoverable locally, but you aren't telling Hadoop to ship them.
> You want to list them, comma-separated, in the -libjars parameter.
>
> -D
>
> On Thu, Jan 14, 2010 at 6:49 AM, Rob Stewart
> <ro...@googlemail.com> wrote:
> > Hi there.
> >
> > I am well underway with comparing Pig, Hive, JAQL etc...
> >
> > The DataGenerator is proving a valuable tool for me. Thanks for that.
> >
> > I have one query. I am able to use it in local mode, no problem, and some
> > experiments are complete.
> >
> > However, I cannot seem to use it in MapReduce mode on the cluster. This
> is
> > my file "generateData" contents:
> > ------------------
> > export pigjar=$HOME/installation/pig/pig-0.5.0/pig-0.5.0-core.jar
> > export zipfjar=$HOME/installation/pig/pig-0.5.0/sdsuLibJKD14.jar
> > export datagenjar=$HOME/rs46/installation/DataGenerator/dist/MyPig.jar
> > export conf_file=/usr/lib/hadoop/conf/hadoop-site.xml
> > export HADOOP_CLASSPATH=$pigjar:$zipfjar:$datagenjar
> > /usr/lib/hadoop/bin/hadoop jar $datagenjar
> > org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -m 1
> -rows
> > 10000000 -f words.dat s:8:50:z:0
> > ------------------
> >
> > The error I receive when trying to run it with "-m 1" option (in cluster
> > mode):
> > Caused by: java.lang.ClassNotFoundException: sdsu.algorithms.data.Zipf
> >
> > So in local mode, it successfully picks up the jar file sdsuLibJKD14.jar
> ,
> > but when running it in cluster mode, this classpath is not found?
> >
> >
> > thanks.
> >
> > Rob Stewart
> >
>

Re: Pig DataGenerator as a MR Job

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Rob,
You need to tell Hadoop which jars you need it to ship to the worker
nodes. You include datagen.jar, etc., on the classpath, which makes
them discoverable locally, but you aren't telling Hadoop to ship them.
You want to list them, comma-separated, in the -libjars parameter.
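
As a minimal sketch of the difference: HADOOP_CLASSPATH is colon-separated
and only affects the JVM that submits the job, while -libjars expects a
comma-separated list of jars to ship. The paths below are hypothetical
placeholders, and deriving one list from the other is just an illustration,
not something the DataGenerator requires.

```shell
# Hypothetical jar locations; substitute your own.
pigjar=/path/to/pig-0.5.0-core.jar
zipfjar=/path/to/sdsuLibJKD14.jar

# Colon-separated: seen only by the local client JVM.
client_classpath="$pigjar:$zipfjar"

# Comma-separated: the form -libjars expects for jars sent to the workers.
libjars=$(printf '%s' "$client_classpath" | tr ':' ',')

echo "HADOOP_CLASSPATH=$client_classpath"
echo "-libjars $libjars"
```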

-D

On Thu, Jan 14, 2010 at 6:49 AM, Rob Stewart
<ro...@googlemail.com> wrote:
> Hi there.
>
> I am well underway with comparing Pig, Hive, JAQL etc...
>
> The DataGenerator is proving a valuable tool for me. Thanks for that.
>
> I have one query. I am able to use it in local mode, no problem, and some
> experiments are complete.
>
> However, I cannot seem to use it in MapReduce mode on the cluster. This is
> my file "generateData" contents:
> ------------------
> export pigjar=$HOME/installation/pig/pig-0.5.0/pig-0.5.0-core.jar
> export zipfjar=$HOME/installation/pig/pig-0.5.0/sdsuLibJKD14.jar
> export datagenjar=$HOME/rs46/installation/DataGenerator/dist/MyPig.jar
> export conf_file=/usr/lib/hadoop/conf/hadoop-site.xml
> export HADOOP_CLASSPATH=$pigjar:$zipfjar:$datagenjar
> /usr/lib/hadoop/bin/hadoop jar $datagenjar
> org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -m 1 -rows
> 10000000 -f words.dat s:8:50:z:0
> ------------------
>
> The error I receive when trying to run it with "-m 1" option (in cluster
> mode):
> Caused by: java.lang.ClassNotFoundException: sdsu.algorithms.data.Zipf
>
> So in local mode, it successfully picks up the jar file sdsuLibJKD14.jar ,
> but when running it in cluster mode, this classpath is not found?
>
>
> thanks.
>
> Rob Stewart
>