Posted to user@pig.apache.org by Jameson Li <ho...@gmail.com> on 2011/06/13 13:07:09 UTC

How to get/operate the InputFileName in pig 0.8.1

Hi,

I have some files in hdfs://path/load/ like this:
file_29_00001
file_47_00001
file_16_00001
...
These files are generated by other M/R jobs. Each file contains only one
column, and the number in the file name between 'file_' and '_00001' is an
id.
I want to add the id to each input record, like this (I think I need to
write a LoadFunc to get the id):
a = load '/path/load/' using com.company.pig.GetIDFromFileName();
dump a;
-- here the relation 'a' will have two columns: one is the original column
-- and the other is the id.

My questions are:
1. Is there an existing function that can get the id from the file
name?
2. I think this method in Pig 0.6.0 could help me:
bindTo(String fileName, BufferedPositionedInputStream in, long offset, long end)
          Specifies a portion of an InputStream to read tuples.
(http://pig.apache.org/docs/r0.6.0/api/org/apache/pig/builtin/PigStorage.html)
but I can't find the same method in Pig 0.8.1.
Which method can I use to work with the input file in the Pig 0.8.1 API?

Thanks very much.
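
The id extraction described above can be sketched in plain Java, independent of any Pig API. This is only a minimal sketch: the class name FileNameId and the method extractId are hypothetical, and the pattern assumes every file name has the form file_<digits>_<digits>, as in the listing above. How the path is obtained inside a loader is not shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: pulls the id out of names like "file_29_00001".
final class FileNameId {
    // Assumption: names look like file_<id>_<digits>; group 1 captures the id.
    private static final Pattern NAME = Pattern.compile("file_(\\d+)_\\d+");

    /** Returns the id in the last path component, or null if the name does not match. */
    static String extractId(String path) {
        // Keep only the part after the last '/', so full HDFS paths also work.
        String name = path.substring(path.lastIndexOf('/') + 1);
        Matcher m = NAME.matcher(name);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractId("hdfs://path/load/file_29_00001")); // prints 29
    }
}
```

Returning null for a non-matching name leaves it to the caller to decide whether to skip the record or fail.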

Re: How to get/operate the InputFileName in pig 0.8.1

Posted by Jameson Li <ho...@gmail.com>.
Another question:

The class org.apache.pig.piggybank.storage.MultiStorage can help me store
the Pig output into different directories.
But I want the output files not to contain the split field (the field at
'splitFieldIndex').
For example:
Input file:
id name
--------
1 jack
1 tom
1 lily
2 cat
2 pig
2 bird

After using MultiStorage('/my/home/output','0', 'bz2', '\\t'), I would
normally get the following files and contents:
1/1-0
------
1 jack
1 tom
1 lily

2/2-0
------
2 cat
2 pig
2 bird

I want to get these files and contents instead:
1/1-0
------
jack
tom
lily

2/2-0
------
cat
pig
bird

Is there a switch that controls whether the stored files contain the field
at 'splitFieldIndex'?

I have looked at the code, and it seems that the answer is no.
Maybe I have to write another class like
MultiStorageSwithWriteKey that extends the class MultiStorageSwithKey.
Am I right?

Thanks very much.
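
The transformation being asked for, writing each record without its split field, can be sketched in plain Java. This is a minimal sketch and not MultiStorage code: the class DropSplitField and the method withoutField are hypothetical names, and a record is modeled as a list of string fields rather than a Pig tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of piggybank: removes the split field from one record.
final class DropSplitField {
    /** Returns a copy of the record's fields without the one at splitFieldIndex. */
    static List<String> withoutField(List<String> fields, int splitFieldIndex) {
        List<String> out = new ArrayList<>(fields);
        out.remove(splitFieldIndex); // List.remove(int) removes by position
        return out;
    }

    public static void main(String[] args) {
        // The record "1<TAB>jack" from the example above, with the key at index 0.
        List<String> record = Arrays.asList("1\tjack".split("\t"));
        System.out.println(String.join("\t", withoutField(record, 0))); // prints jack
    }
}
```

The directory and file name would still be chosen from the original key; only the written record would go through a step like this.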



Re: How to get/operate the InputFileName in pig 0.8.1

Posted by Jameson Li <ho...@gmail.com>.
I am sorry, that was my mistake.
My newest jar file is /home/user/project/lib/myUDF.jar, but there is an old
jar file in the Pig lib dir $PIG-HOME/lib (/opt/pig/lib).
Unfortunately, even after registering /home/user/project/lib/myUDF.jar, when
the Pig script is executed it first scans the UDF classes in the jar files
in the Pig lib dir.

2011/6/17 Daniel Dai <ji...@yahoo-inc.com>

> Should not be. Pig does not cache myUDF.jar. Every run will pick myUDF.jar
> again from /home/user/project/lib.
>
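
When an old copy of a jar shadows a new one like this, it can help to ask the JVM where a class was actually loaded from. The following is a minimal sketch using the standard ProtectionDomain and CodeSource classes. The class name WhichJar and the method locationOf are hypothetical, and the same call would have to be made from inside the UDF (for example in a log statement) to see what Pig itself loaded.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports the jar or directory a class came from.
final class WhichJar {
    /** Returns the class's code source location, or a placeholder when none is reported. */
    static String locationOf(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        return (src == null || src.getLocation() == null)
                ? "(bootstrap or unknown)"
                : src.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a class name such as com.company.pig.myUDF; defaults to this class.
        String name = args.length > 0 ? args[0] : WhichJar.class.getName();
        System.out.println(name + " loaded from " + locationOf(Class.forName(name)));
    }
}
```

If the reported location is under /opt/pig/lib rather than /home/user/project/lib, the old jar is the one in use.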

Re: How to get/operate the InputFileName in pig 0.8.1

Posted by Daniel Dai <ji...@yahoo-inc.com>.
Should not be. Pig does not cache myUDF.jar. Every run will pick 
myUDF.jar again from /home/user/project/lib.

Daniel

On 06/16/2011 06:09 AM, Jameson Li wrote:
> Great. Based on the wiki page
> http://wiki.apache.org/pig/PigStorageWithInputPath and the setting
> -Dpig.noSplitCombination=true, I can get the file name in Pig.
>
> But I have another problem.
> I modified the UDF code, built it with ant, and generated a new jar file
> (I am sure the jar file has been updated):
> pig -x local
> register /home/user/project/lib/myUDF.jar
> a = load 'aaa';
> b = foreach a generate com.company.pig.myUDF();
> dump b;
>
> I found that the result still comes from the old jar file and UDF class,
> and I think UDF classes have been cached somewhere.
>
> Am I right?
> And how can I use the newest jar file after recompiling?
>
> Thanks very much.


Re: How to get/operate the InputFileName in pig 0.8.1

Posted by Jameson Li <ho...@gmail.com>.
Great. Based on the wiki page
http://wiki.apache.org/pig/PigStorageWithInputPath and the setting
-Dpig.noSplitCombination=true, I can get the file name in Pig.

But I have another problem.
I modified the UDF code, built it with ant, and generated a new jar file
(I am sure the jar file has been updated):
pig -x local
register /home/user/project/lib/myUDF.jar
a = load 'aaa';
b = foreach a generate com.company.pig.myUDF();
dump b;

I found that the result still comes from the old jar file and UDF class, and
I think UDF classes have been cached somewhere.

Am I right?
And how can I use the newest jar file after recompiling?

Thanks very much.

2011/6/15 Daniel Dai <ji...@yahoo-inc.com>

>  Check http://wiki.apache.org/pig/PigStorageWithInputPath, also you will
> need to disable split combination: -Dpig.noSplitCombination=true
>
> Daniel

Re: How to get/operate the InputFileName in pig 0.8.1

Posted by Daniel Dai <ji...@yahoo-inc.com>.
Check http://wiki.apache.org/pig/PigStorageWithInputPath, also you will 
need to disable split combination: -Dpig.noSplitCombination=true

Daniel
