Posted to user@pig.apache.org by CJ Niemira <cn...@gmail.com> on 2011/07/15 21:42:10 UTC

PigStorageWithInputPath

I'm attempting to get the PigStorageWithInputPath example (
http://wiki.apache.org/pig/PigStorageWithInputPath) working, but I must be
missing something. It works fine if I specify a single file, but if I use a
glob in my load command all of the records end up with the first filename
only.


My loadfunc class:

package com.test;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.log4j.Logger;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

public class PigStorageWithInputPath extends PigStorage {

    static Logger logger = Logger.getLogger(PigStorageWithInputPath.class);

    Path path = null;

    @Override
    public void prepareToRead(@SuppressWarnings("rawtypes") RecordReader reader, PigSplit split) {
        super.prepareToRead(reader, split);
        // Remember the path of the split this reader was opened on.
        path = ((FileSplit) split.getWrappedSplit()).getPath();
        logger.info("path @prepareToRead=" + path);
    }

    @Override
    public Tuple getNext() throws IOException {
        Tuple myTuple = super.getNext();
        // Append the remembered source path to every tuple PigStorage reads.
        if (myTuple != null)
            myTuple.append(path.toString());
        return myTuple;
    }
}



My test data is three files. Each file has three lines in the format "file
n\tline n".

-bash-3.2$ hadoop fs -ls /test
Found 3 items
-rw-r--r--   1 test supergroup         42 2011-07-15 15:00 /test/file1
-rw-r--r--   1 test supergroup         42 2011-07-15 15:00 /test/file2
-rw-r--r--   1 test supergroup         42 2011-07-15 15:00 /test/file3



My Pig script:

register /tmp/udf.jar;
a = load '/test/*' using com.test.PigStorageWithInputPath();
dump a;



The output:

(file 1,line 1,hdfs://localhost/test/file1)
(file 1,line 2,hdfs://localhost/test/file1)
(file 1,line 3,hdfs://localhost/test/file1)
(file 2,line 1,hdfs://localhost/test/file1)
(file 2,line 2,hdfs://localhost/test/file1)
(file 2,line 3,hdfs://localhost/test/file1)
(file 3,line 1,hdfs://localhost/test/file1)
(file 3,line 2,hdfs://localhost/test/file1)
(file 3,line 3,hdfs://localhost/test/file1)



The last item in each line is "file1," even though the contents clearly came
from other files. What I'm expecting to see is this:

(file 1,line 1,hdfs://localhost/test/file1)
(file 1,line 2,hdfs://localhost/test/file1)
(file 1,line 3,hdfs://localhost/test/file1)
(file 2,line 1,hdfs://localhost/test/file2)
(file 2,line 2,hdfs://localhost/test/file2)
(file 2,line 3,hdfs://localhost/test/file2)
(file 3,line 1,hdfs://localhost/test/file3)
(file 3,line 2,hdfs://localhost/test/file3)
(file 3,line 3,hdfs://localhost/test/file3)



And the task attempt log:

2011-07-15 15:00:58,933 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2011-07-15 15:00:59,329 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2011-07-15 15:00:59,720 INFO com.test.PigStorageWithInputPath: path @prepareToRead=hdfs://localhost/test/file1
2011-07-15 15:00:59,769 INFO com.test.PigStorageWithInputPath: path @prepareToRead=hdfs://localhost/test/file1
2011-07-15 15:00:59,771 INFO com.test.PigStorageWithInputPath: path @prepareToRead=hdfs://localhost/test/file1
2011-07-15 15:00:59,830 INFO org.apache.hadoop.mapred.Task: Task:attempt_201107111628_0027_m_000000_0 is done. And is in the process of commiting
2011-07-15 15:01:00,851 INFO org.apache.hadoop.mapred.Task: Task attempt_201107111628_0027_m_000000_0 is allowed to commit now
2011-07-15 15:01:00,861 INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of task 'attempt_201107111628_0027_m_000000_0' to hdfs://localhost/tmp/temp-2143171191/tmp-286804436
2011-07-15 15:01:00,875 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201107111628_0027_m_000000_0' done.
2011-07-15 15:01:00,897 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-


For the record, I'm doing this test against a CDH3U0 installation (pig-0.8 +
hadoop-0.20.2) in pseudo-distributed mode.


Am I doing something wrong?

Re: PigStorageWithInputPath

Posted by CJ Niemira <cn...@gmail.com>.
Thanks, that's got it.

I thought I'd disabled this, but I was setting "splitCombination" instead
of "pig.splitCombination".

Cheers!


On Fri, Jul 15, 2011 at 4:53 PM, Xiaomeng Wan <sh...@gmail.com> wrote:

> try "set pig.splitCombination false" in your pig script
>
> Shawn
>

Re: PigStorageWithInputPath

Posted by Xiaomeng Wan <sh...@gmail.com>.
try "set pig.splitCombination false" in your pig script

Shawn
