Posted to user@pig.apache.org by Jumping <qu...@gmail.com> on 2010/03/01 12:09:35 UTC

Could identify file name?

Hi,
Can Pig recognize the names of the files it is loading? If so, how? I want
to combine the files according to their names.

Example:
google_2009_12_21.csv, google_2010_01_21.csv, google_2010_02_21.csv,
baidu_2009_11_22.csv, baidu_2010_01_01.csv, baidu_2010_02_03.csv, ....

Sort and combine by name, then output two files, google_all.csv and
baidu_all.csv, from a single Pig script.


Best Regards,
Jumping Qu

------
Don't tell me how many enemies we have, but where they are!
(ADV:Perl -- It's like Java, only it lets you deliver on time and under
budget.)

Re: Could identify file name?

Posted by Romain Rigaux <ro...@gmail.com>.
Or you can just call the script twice with:

$INPUT= 'input/path/*baidu*'
$OUTPUT='output/path/baidu_all'

then

$INPUT= 'input/path/*google*'
$OUTPUT='output/path/google_all'

Thanks,

Romain
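
For anyone scripting the two runs, here is a minimal Java sketch that builds
one command line per source. It assumes the script refers to $INPUT and
$OUTPUT and is started through Pig's -param option; the script name
combine.pig and the class PigRuns are hypothetical, and the paths are the
placeholders used above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds one `pig` command line per source name.
class PigRuns {
    // Returns the argument list for a single run of the given script.
    static List<String> command(String script, String source) {
        List<String> args = new ArrayList<>();
        args.add("pig");
        args.add("-param");
        args.add("INPUT=input/path/*" + source + "*");
        args.add("-param");
        args.add("OUTPUT=output/path/" + source + "_all");
        args.add(script);
        return args;
    }

    public static void main(String[] argv) {
        for (String source : new String[] {"baidu", "google"}) {
            // Only prints the command; hand the list to ProcessBuilder to run it.
            System.out.println(String.join(" ", command("combine.pig", source)));
        }
    }
}
```

If the printed line is pasted into a shell instead of being passed to
ProcessBuilder, the glob probably needs quoting so that the shell does not
expand it before Pig sees it.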


Re: Could identify file name?

Posted by Zaki Rahaman <za...@gmail.com>.
Even if you're using Amazon Elastic MapReduce, you can specify
additional named parameters when running scripts. You can put
variable placeholders in your script and pass their values on the
console, or specify defaults. You can also run your scripts in
interactive mode so that you have complete control over execution. And
you can always hardcode the paths when all else fails.


Re: Could identify file name?

Posted by Jumping <qu...@gmail.com>.
I am using MapReduce on Amazon. There is another problem: how do I use
two "$INPUT" parameters in one Pig script?

Best Regards,
Jumping Qu




Re: Could identify file name?

Posted by Zaki Rahaman <za...@gmail.com>.
Just curious,

What solution did you use?


Re: Could identify file name?

Posted by Jumping <qu...@gmail.com>.
Thanks all of you guys.


Best Regards,
Jumping Qu




Re: Could identify file name?

Posted by zaki rahaman <za...@gmail.com>.
In this case, why wouldn't you simply use globbing in your load statements?
Something like

baidu = LOAD 'input/path/*baidu*' AS (schema);
google = LOAD 'input/path/*google*' AS (schema);

Store baidu INTO 'output/path/baidu_all';
Store google INTO 'output/path/google_all';

On Wed, Mar 3, 2010 at 1:21 PM, Romain Rigaux <ro...@gmail.com> wrote:

> Actually I was using another loader and I just tried with PigStorage (Pig
> 0.6) and it seems to work too.
>
> If your input file has two columns this will have the expected schema and
> data:
>
> A = load 'file' USING MyLoader() AS (f1:chararray,
> f2:chararray, fileName:chararray);
>
> A: {f1: chararray,f2: chararray,fileName: chararray}
>
> If you do "tuple.set(tuple.size() - 1, fileName)" your third column
> will be null.
>
> So in practice the loader loads the data "independently" and then "casts"
> it to the schema you provided. That said, I am not claiming that this is a
> very clean solution.
>
> Thanks,
>
> Romain
>
> 2010/3/2 Mridul Muralidharan <mr...@yahoo-inc.com>
>
> >
> > I am not sure if this will work as you expect.
> > Depending on which implementation of PigStorage you end up using, it
> > might exhibit different behavior.
> >
> > If I am not wrong, currently, for example, if you specify something like:
> >
> > A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray,
> > fileName:chararray);
> >
> > your code will end up generating a tuple of 4 fields, with fileName
> > always being 'null' and the actual filename you inserted through
> > MyLoader ending up as the 4th field (and so not 'seen' by Pig; I am not
> > sure what happens if you do a join, etc. with this tuple, though!
> > Essentially, the runtime is not consistent with the script schema).
> >
> > Note: this is implementation-specific behavior, which could probably
> > be fixed by an implementation-specific hack such as
> > "tuple.set(tuple.size() - 1, fileName)" [if you know fileName is
> > the last field expected].
> >
> > As expected, it is brittle code.
> >
> > From a while back, I remember facing issues with Pig's implicit
> > conversion to/from bytearray, the implicit projection that was introduced,
> > the insertion of nulls to extend tuples to the specified schema (the
> > behavior above), etc.
> > So you would become dependent on implementation changes.
> >
> > I don't think BinStorage and PigStorage were written with
> > inheritance in mind ...
> >
> >
> > Regards,
> > Mridul
> >
> >
> >
> >
> >
> > On Wednesday 03 March 2010 12:28 AM, Romain Rigaux wrote:
> > > Hi,
> > >
> > > In Pig 0.6 you can extend PigStorage and grab the name of the file with
> > > something like this:
> > >
> > >    @Override
> > >    public void bindTo(String fileName, BufferedPositionedInputStream is,
> > >        long offset, long end) throws IOException {
> > >      super.bindTo(fileName, is, offset, end);
> > >
> > >      // In your case match with a regexp and get the group with the
> > >      // name only (e.g. google, baidu)
> > >      this.fileName = fileName;
> > >    }
> > >
> > >    @Override
> > >    public Tuple getNext() throws IOException {
> > >      Tuple next = super.getNext();
> > >
> > >      if (next != null) {
> > >        next.append(fileName);
> > >      }
> > >
> > >      return next;
> > >    }
> > >
> > > Then you can group on the name and split on it.
> > >
> > > Thanks,
> > >
> > > Romain
> > >
> > > On Mon, Mar 1, 2010 at 3:09 AM, Jumping<qu...@gmail.com>  wrote:
> > >
> > >> Hi,
> > >> Can Pig recognize the names of the files it is loading? If so, how? I
> > >> want to combine the files according to their names.
> > >>
> > >> Example:
> > >> google_2009_12_21.csv, google_2010_01_21.csv, google_2010_02_21.csv,
> > >> baidu_2009_11_22.csv, baidu_2010_01_01.csv, baidu_2010_02_03.csv, ....
> > >>
> > >> Sort and combine by name, then output two files:  google_all.csv,
> > >> baidu_all.csv  in a pig script.
> > >>
> > >>
> > >> Best Regards,
> > >> Jumping Qu
> > >>
> > >> ------
> > >> Don't tell me how many enemies we have, but where they are!
> > >> (ADV:Perl -- It's like Java, only it lets you deliver on time and
> > >> under budget.)
> > >>
> >
> >
>



-- 
Zaki Rahaman

Re: Could identify file name?

Posted by Romain Rigaux <ro...@gmail.com>.
Actually I was using another loader and I just tried with PigStorage (Pig
0.6) and it seems to work too.

If your input file has two columns this will have the expected schema and
data:

A = load 'file' USING MyLoader() AS (f1:chararray,
f2:chararray, fileName:chararray);

A: {f1: chararray,f2: chararray,fileName: chararray}

If you do "tuple.set(tuple.getLength() - 1, fileName)" your third column
will be null.

So in practice the loader loads the data "independently" and then "casts" it
to the schema you provided. That said, I am not claiming this is a very clean
solution.
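
To make the difference between append and set concrete, here is a minimal sketch. It does not use Pig at all: a plain java.util.List stands in for the Tuple, and fitToSchema is a hypothetical helper that only models the "pad with nulls up to the declared schema" behavior described in this thread. It is an assumption about what happens, not Pig's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AppendVsSet {
    // Stand-in for the row a two-column input file yields before the
    // AS (...) schema is applied. Hypothetical data, not a Pig Tuple.
    static List<Object> row() {
        return new ArrayList<>(Arrays.asList("a", "b"));
    }

    // Analogous to next.append(fileName): the row grows to three fields.
    static List<Object> withAppend(String fileName) {
        List<Object> t = row();
        t.add(fileName);
        return t;
    }

    // Analogous to tuple.set(size - 1, fileName): the last existing field
    // (f2) is overwritten and the row keeps two fields.
    static List<Object> withSetLast(String fileName) {
        List<Object> t = row();
        t.set(t.size() - 1, fileName);
        return t;
    }

    // Assumed model of the schema "cast": pad with nulls up to the width.
    static List<Object> fitToSchema(List<Object> t, int width) {
        List<Object> out = new ArrayList<>(t);
        while (out.size() < width) {
            out.add(null);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fitToSchema(withAppend("google"), 3));
        System.out.println(fitToSchema(withSetLast("google"), 3));
    }
}
```

Under that assumption, append fills the third column with the file name, while set on the last index replaces f2 and leaves the third column to be padded with null.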

Thanks,

Romain

2010/3/2 Mridul Muralidharan <mr...@yahoo-inc.com>

>
> I am not sure if this will work as you expect.
> Depending on which implementation of PigStorage you end up using, it
> might exhibit different behavior.
>
> If I am not wrong, currently, for example, if you specify something like :
>
> A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray,
> fileName:chararray);
>
>
> your code will end up generating a tuple of 4 fields - the fileName
> always being 'null' and the actual filename you inserted through
> MyLoader ending up being the 4th field (and so not 'seen' by pig - not
> sure what happens if you do a join, etc with this tuple though !
> Essentially runtime is not consistent with script schema).
>
>
> Note - this is an implementation specific behavior, which could probably
> have been fixed by implementation specific hack
> "tuple.set(tuple.getLength() - 1, fileName)" [if you know fileName is
> the last field expected].
>
> As expected, it is brittle code.
>
>
> From a while back, I remember facing issues with pig's implicit
> conversion to/from bytearray, its implicit project which was introduced,
> insertion of null's to extend to schema specified (the above behavior),
> etc.
> So you would become dependent on the impl changes.
>
>
> I dont think BinStorage and PigStorage have been written with
> inheritance in mind ...
>
>
> Regards,
> Mridul
>
>
>
>
>
> On Wednesday 03 March 2010 12:28 AM, Romain Rigaux wrote:
> > Hi,
> >
> > In Pig 0.6 you can extend PigStorage and grab the name of the file with
> > something like this:
> >
> >    @Override
> >    public void bindTo(String fileName, BufferedPositionedInputStream is,
> >        long offset, long end) throws IOException {
> >      super.bindTo(fileName, is, offset, end);
> >
> >      // In your case match with a regexp and get the group with the
> >      // name only (e.g. google, baidu)
> >      this.fileName = fileName;
> >    }
> >
> >    @Override
> >    public Tuple getNext() throws IOException {
> >      Tuple next = super.getNext();
> >
> >      if (next != null) {
> >        next.append(fileName);
> >      }
> >
> >      return next;
> >    }
> >
> > Then you can group on the name and split on it.
> >
> > Thanks,
> >
> > Romain
> >
> > On Mon, Mar 1, 2010 at 3:09 AM, Jumping<qu...@gmail.com>  wrote:
> >
> >> Hi,
> >> Can Pig recognize the names of the files it is loading? If so, how? I
> >> want to combine the files according to their names.
> >>
> >> Example:
> >> google_2009_12_21.csv, google_2010_01_21.csv, google_2010_02_21.csv,
> >> baidu_2009_11_22.csv, baidu_2010_01_01.csv, baidu_2010_02_03.csv, ....
> >>
> >> Sort and combine by name, then output two files:  google_all.csv,
> >> baidu_all.csv  in a pig script.
> >>
> >>
> >> Best Regards,
> >> Jumping Qu
> >>
> >> ------
> >> Don't tell me how many enemies we have, but where they are!
> >> (ADV:Perl -- It's like Java, only it lets you deliver on time and under
> >> budget.)
> >>
>
>

Re: Could identify file name?

Posted by Romain Rigaux <ro...@gmail.com>.
Hi,

In Pig 0.6 you can extend PigStorage and grab the name of the file with
something like this:

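  import java.io.IOException;

  import org.apache.pig.builtin.PigStorage;
  import org.apache.pig.data.Tuple;
  import org.apache.pig.impl.io.BufferedPositionedInputStream;

  public class MyLoader extends PigStorage {
    // File currently being read; refreshed each time Pig binds a new input.
    private String fileName;

    @Override
    public void bindTo(String fileName, BufferedPositionedInputStream is,
        long offset, long end) throws IOException {
      super.bindTo(fileName, is, offset, end);

      // In your case match with a regexp and keep only the group with the
      // name (e.g. google, baidu).
      this.fileName = fileName;
    }

    @Override
    public Tuple getNext() throws IOException {
      Tuple next = super.getNext();

      // Append the file name as an extra trailing field on every row.
      if (next != null) {
        next.append(fileName);
      }

      return next;
    }
  }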

Then you can group on the name and split on it.
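
For the regexp step mentioned in the comment, a rough sketch follows. It is plain Java with no Pig dependency. The class name SourceName, the helpers sourceOf and groupBySource, and the pattern itself are all hypothetical; the pattern assumes every file is named <source>_<yyyy>_<mm>_<dd>.csv as in the example list, optionally preceded by a directory path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SourceName {
    // Assumed naming scheme: <source>_<yyyy>_<mm>_<dd>.csv at the end of the path.
    private static final Pattern NAME =
        Pattern.compile("([^/_]+)_\\d{4}_\\d{2}_\\d{2}\\.csv$");

    // Returns the source prefix (e.g. "google"), or null if the path
    // does not follow the assumed naming scheme.
    static String sourceOf(String path) {
        Matcher m = NAME.matcher(path);
        return m.find() ? m.group(1) : null;
    }

    // Groups paths by source prefix and sorts each group by name.
    // Paths that do not match the pattern are skipped.
    static Map<String, List<String>> groupBySource(List<String> paths) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String p : paths) {
            String source = sourceOf(p);
            if (source != null) {
                groups.computeIfAbsent(source, k -> new ArrayList<>()).add(p);
            }
        }
        for (List<String> files : groups.values()) {
            Collections.sort(files);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
            "input/path/google_2010_01_21.csv",
            "input/path/baidu_2010_01_01.csv",
            "input/path/google_2009_12_21.csv");
        groupBySource(files).forEach((source, group) ->
            System.out.println(source + "_all.csv <- " + group));
    }
}
```

In the loader, the assignment in bindTo could then become something like this.fileName = SourceName.sourceOf(fileName), so that the appended field holds only "google" or "baidu" and is ready to group on.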

Thanks,

Romain

On Mon, Mar 1, 2010 at 3:09 AM, Jumping <qu...@gmail.com> wrote:

> Hi,
> Can Pig recognize the names of the files it is loading? If so, how? I want
> to combine the files according to their names.
>
> Example:
> google_2009_12_21.csv, google_2010_01_21.csv, google_2010_02_21.csv,
> baidu_2009_11_22.csv, baidu_2010_01_01.csv, baidu_2010_02_03.csv, ....
>
> Sort and combine by name, then output two files:  google_all.csv,
> baidu_all.csv  in a pig script.
>
>
> Best Regards,
> Jumping Qu
>
> ------
> Don't tell me how many enemies we have, but where they are!
> (ADV:Perl -- It's like Java, only it lets you deliver on time and under
> budget.)
>