Posted to user@pig.apache.org by felix gao <gr...@gmail.com> on 2010/10/07 03:09:09 UTC

Pig Streaming with Python Scripts

I have a Python script defined as:
import sys

for line in sys.stdin:
    if not line:
        break
    sys.stdout.write(line)

My data, test, looks like:
({(19199vzFj6+uRbJf,7388,9074,50|22598,1267739954,0.0020,365,1,1)},1L)


My Pig script is:

temp = STREAM test THROUGH GroupStreamer as
(test_bag:chararray, num_entries:long);

When I run it, the job fails with:
===== Task Information Header =====
Command: TestStream.py (stdin-org.apache.pig.builtin.PigStreaming/stdout-org.apache.pig.builtin.PigStreaming)
Start time: Wed Oct 06 17:57:52 PDT 2010
=====          * * *          =====
/Users/felixgao/Documents/data/logs/TestStream.py: line 1: import: command not found
/Users/felixgao/Documents/data/logs/TestStream.py: line 9: syntax error near unexpected token `if'
/Users/felixgao/Documents/data/logs/TestStream.py: line 9: `    if not line:'
2010-10-06 17:57:52,690 [Thread-21] ERROR org.apache.pig.impl.streaming.ExecutableManager - 'TestStream.py ' failed with exit status: 2
2010-10-06 17:57:52,697 [Thread-14] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
org.apache.pig.backend.executionengine.ExecException: ERROR 2090: Received Error while processing the reduce plan: 'TestStream.py ' failed with exit status: 2
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:465)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:401)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:381)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:250)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)

What did I do wrong here?




Another question: if I define my alias as
temp = STREAM Test THROUGH GroupStreamer
as (test_grp_cnt:bag {test_none_zero: tuple(f1:chararray, f2:int, f3:int,
f4:chararray, f5:int, f6:double, f7:int, f8:int, f9:int)}, num_entries:long);
I get
2010-10-06 17:38:57,092 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " ";" "; "" at line 76, column 179.
Was expecting one of:
    ")" ...
    "," ...

What is the correct way of specifying a bag of tuples based on my data
sample?

Thanks,

Felix

Re: Pig Streaming with Python Scripts

Posted by Alan Gates <ga...@yahoo-inc.com>.

I don't think Pig understands that this is a Python script. What
happens if you put #!/bin/python (or whatever is appropriate on your
system) at the beginning of your GroupStreamer? Alternatively, you
could explicitly call python on this file in your command by saying

STREAM test THROUGH `/bin/python GroupStreamer`

Alan.
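
For reference, "import: command not found" is the kind of message a shell prints when it is handed Python source to run, which fits the diagnosis above. A minimal sketch of the same pass-through script with an interpreter line added might look like the following. The `#!/usr/bin/env python` path is an assumption (use whatever matches your system), and `echo_lines` is a hypothetical helper name introduced only for illustration.

```python
#!/usr/bin/env python
"""Pass-through streaming script: copies every input line to output."""
import sys


def echo_lines(src, dst):
    """Write each line from src to dst unchanged; return the line count."""
    count = 0
    for line in src:
        # Iteration stops on its own at end of input, so no explicit
        # empty-line check or break is needed here.
        dst.write(line)
        count += 1
    return count


if __name__ == "__main__":
    # The streaming operator feeds records on stdin and reads results
    # back from stdout.
    echo_lines(sys.stdin, sys.stdout)
```

The interpreter line only takes effect when the file itself is executable (for example after `chmod +x TestStream.py`). Calling python explicitly in the STREAM command avoids depending on it.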
