Posted to user@pig.apache.org by Mohit Anchlia <mo...@gmail.com> on 2012/09/11 01:11:08 UTC

Input and output path

Our input path is something like YYYY/MM/DD/HH/input and we would like to write
to YYYY/MM/DD/HH/output. Is it possible to get the input path as a String
and convert it to YYYY/MM/DD/HH/output so that I can use it in the "store into"
clause?

Re: Input and output path

Posted by Aniket Mokashi <an...@gmail.com>.
You can do something similar to this FAQ entry:
https://cwiki.apache.org/PIG/faq.html#FAQ-Q%253AIloaddatafromadirectorywhichcontainsdifferentfile.HowdoIfindoutwherethedatacomesfrom%253F

Get the input path from Pig and then substitute the values for date, hour, etc.
You also have to override the getSchema method so that Pig can see these
fields.
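
The loader itself would be written in Java, but the path-parsing step can be sketched on its own. The following is only an illustration in Python, assuming paths of the form .../YYYY/MM/DD/HH/input/...; the function name fields_from_path and the field names are made up for this example and are not part of Pig.

```python
import re

# Matches the YYYY/MM/DD/HH/input portion anywhere in a path (assumed layout).
_HOURLY = re.compile(r'(\d{4})/(\d{2})/(\d{2})/(\d{2})/input(?:/|$)')

def fields_from_path(path):
    """Return the date and hour components found in an hourly input path."""
    match = _HOURLY.search(path)
    if match is None:
        raise ValueError('not an hourly input path: %r' % path)
    year, month, day, hour = match.groups()
    return {'year': year, 'month': month, 'day': day, 'hour': hour}

fields = fields_from_path('hdfs://nn/logs/2012/09/11/08/input/part-00000')
```

A loader would append values like these to every tuple it returns and report the extra columns from getSchema.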

Just beware of https://issues.apache.org/jira/browse/PIG-2462

Thanks,
Aniket


Re: Input and output path

Posted by MiaoMiao <li...@gmail.com>.
Ah, sorry, I missed your earlier reply. I used Python because it's more
flexible, and it can generate the Pig script from XML files that describe
all the fields in my input and output files. The same XML files can also
be used for Hive.


Re: Input and output path

Posted by Ruslan Al-Fakikh <me...@gmail.com>.
MiaoMiao, Mohit,

If we are talking about embedding Pig into Python, I'd like to add
that we can also embed Pig into Java using PigServer:
http://wiki.apache.org/pig/EmbeddedPig

MiaoMiao, what's the purpose of embedding here, given that we already
have the parameter substitution feature? I guess Pig embedding is mostly
suitable when we want to add IF/ELSE or LOOP functionality.
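
As a rough illustration of the LOOP case, the sketch below builds one set of parameter bindings per hour of a day in plain Python. The function name hourly_bindings is hypothetical, and the YYYY/MM/DD/HH layout is taken from the original question.

```python
def hourly_bindings(yyyymmdd, hours=range(24)):
    """One {'input': ..., 'output': ...} dict per hour of the given day."""
    base = '%s/%s/%s' % (yyyymmdd[0:4], yyyymmdd[4:6], yyyymmdd[6:8])
    bindings = []
    for hour in hours:
        # IF/ELSE-style logic could go here, e.g. skipping hours already done.
        hour_dir = '%s/%02d' % (base, hour)
        bindings.append({'input': hour_dir + '/input',
                         'output': hour_dir + '/output'})
    return bindings

bindings = hourly_bindings('20120911')
```

In an embedded script, a list like this could presumably be bound to a compiled Pig script so that each hour runs with its own paths; the embedding documentation for your Pig version is the place to check the exact calls.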

Thanks


Re: Input and output path

Posted by MiaoMiao <li...@gmail.com>.
I wrote a Python script to do this:

# Pig is provided by Pig's embedded Jython runtime.
from org.apache.pig.scripting import Pig
import sys

# getInputPath and getOutputPath are user-defined helpers (not shown here).
yyyymmddhh = sys.argv[1]
inputPath = getInputPath(yyyymmddhh)    # yyyymmddhh to "YYYY/MM/DD/HH/input"
outputPath = getOutputPath(yyyymmddhh)  # yyyymmddhh to "YYYY/MM/DD/HH/output"
pigScript = '''
some = load '$input' using PigStorage(',')
    as(
        id:INT,
        value:INT
    );
final = ..... ;
STORE final INTO '$output' using PigStorage(',');
'''
P = Pig.compile(pigScript)
result = P.bind({'input':inputPath, 'output':outputPath}).runSingle()
if result.isSuccessful():
    print('Pig job succeeded')
else:
    # Raising a plain string is not a valid exception; use an exception class.
    raise RuntimeError('Pig job failed')

Then you can run it with Pig:
pig -x local pig.py 2012091108
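
The script calls getInputPath and getOutputPath without showing them. One possible shape for these helpers is sketched below, assuming the argument is always a ten-digit yyyymmddhh string; the private helper _hour_dir is made up for this example.

```python
from datetime import datetime

def _hour_dir(yyyymmddhh):
    """Turn '2012091108' into '2012/09/11/08', validating the timestamp."""
    if len(yyyymmddhh) != 10 or not yyyymmddhh.isdigit():
        raise ValueError('expected ten digits (yyyymmddhh): %r' % yyyymmddhh)
    # strptime raises ValueError for impossible values such as month 13.
    stamp = datetime.strptime(yyyymmddhh, '%Y%m%d%H')
    return stamp.strftime('%Y/%m/%d/%H')

def getInputPath(yyyymmddhh):
    return _hour_dir(yyyymmddhh) + '/input'

def getOutputPath(yyyymmddhh):
    return _hour_dir(yyyymmddhh) + '/output'
```

Validating the argument up front means a mistyped date fails before any Pig job is launched.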


Re: Input and output path

Posted by Ruslan Al-Fakikh <me...@gmail.com>.
Mohit,

I am suggesting setting up a whole Hive warehouse. This way your
folders will look like
/user/hive/warehouse/yourdataset/date=2012-09-11
/user/hive/warehouse/yourdataset/date=2012-09-12
...
All the partition metadata will be kept in an RDBMS, so when you
query the data with Hive it will look like
select * from yourdataset where date = '2012-09-11'
and it will be fast, because Hive reads only the matching partition.

HCatalog is a layer that exposes this Hive functionality to Pig and
MapReduce, so in Pig you can FILTER by those dates.
http://incubator.apache.org/hcatalog/docs/r0.4.0/loadstore.html#Load+Examples
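
To make the folder layout concrete, here is a small Python sketch that maps dates to partition directories of the form shown above. The warehouse root and the dataset name come from the example paths; the function names are hypothetical.

```python
from datetime import date, timedelta

WAREHOUSE = '/user/hive/warehouse'  # root taken from the example paths

def partition_dir(dataset, day):
    """Directory holding the partition for one calendar day."""
    return '%s/%s/date=%s' % (WAREHOUSE, dataset, day.isoformat())

def partition_dirs(dataset, first, last):
    """Directories for every day from first to last, inclusive."""
    days = (last - first).days
    return [partition_dir(dataset, first + timedelta(days=n))
            for n in range(days + 1)]

dirs = partition_dirs('yourdataset', date(2012, 9, 11), date(2012, 9, 12))
```

With Hive or HCatalog you would not normally build these paths yourself; the sketch only shows what a filter on the date column resolves to on disk.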

Best Regards


Re: Input and output path

Posted by Mohit Anchlia <mo...@gmail.com>.
On Mon, Sep 10, 2012 at 4:17 PM, Ruslan Al-Fakikh <me...@gmail.com> wrote:

> Mohit,
>
> I guess you could use parameters substitution here
> http://wiki.apache.org/pig/ParameterSubstitution
>
Thanks, this works.


> Also, a note about your architecture:
>

Are you suggesting a change to the path names, or is your suggestion to use
HCatalog with Pig?


> You can consider using Hive partitions to effectively select
> appropriate dates in the folder names. But as your tool is Pig, not
> Hive, you can use HCatalog as a layer
>
> Best Regards
>

Re: Input and output path

Posted by Ruslan Al-Fakikh <me...@gmail.com>.
Mohit,

I guess you could use parameter substitution here:
http://wiki.apache.org/pig/ParameterSubstitution
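
As a sketch of how this could look from a small wrapper, the Python below derives the output path from the input path and builds a pig command line that passes both as parameters. It assumes the Pig script refers to '$input' and '$output'; the script name job.pig and the function names are invented for the example.

```python
import posixpath

def output_path_for(input_path):
    """Swap the trailing 'input' component of .../HH/input for 'output'."""
    parent, leaf = posixpath.split(input_path.rstrip('/'))
    if leaf != 'input':
        raise ValueError('expected a path ending in /input: %r' % input_path)
    return posixpath.join(parent, 'output')

def pig_command(script, input_path):
    """Argument list for running a script with both paths as parameters."""
    return ['pig',
            '-param', 'input=' + input_path,
            '-param', 'output=' + output_path_for(input_path),
            script]

cmd = pig_command('job.pig', '2012/09/11/08/input')
```

The resulting list can be handed to subprocess.call, or the same two -param options can be typed directly on the command line.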

Also, a note about your architecture:
You could consider using Hive partitions to select the appropriate
dates efficiently through the folder names. Since your tool is Pig
rather than Hive, you can use HCatalog as a layer to access them.

Best Regards
