Posted to user@pig.apache.org by Mohit Anchlia <mo...@gmail.com> on 2012/09/11 01:11:08 UTC
Input and output path
Our input path is something like YYYY/MM/DD/HH/input and we would like to
write to YYYY/MM/DD/HH/output. Is it possible to get the input path as a
String and convert it to YYYY/MM/DD/HH/output so that I can use it in the
"store into" clause?
Re: Input and output path
Posted by Aniket Mokashi <an...@gmail.com>.
You can do something similar to the approach described in this FAQ entry:
https://cwiki.apache.org/PIG/faq.html#FAQ-Q%253AIloaddatafromadirectorywhichcontainsdifferentfile.HowdoIfindoutwherethedatacomesfrom%253F
Get the input path from Pig and then substitute the values for date, hour, etc.
You also have to override the getSchema method so that Pig can see these
fields.
Just beware of https://issues.apache.org/jira/browse/PIG-2462
Thanks,
Aniket
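As a rough illustration of the "substitute the values for date, hour etc." step, here is a minimal sketch in plain Python. It assumes the path string has already been obtained somehow (how it is read from Pig is not shown here), and the names INPUT_PATH_RE, split_input_path and to_output_path are made up for this example. It is not a Pig LoadFunc, just the string handling.

```python
import re

# Assumed layout, taken from the question: <optional prefix>YYYY/MM/DD/HH/input
INPUT_PATH_RE = re.compile(
    r'^(?P<prefix>.*?)(?P<year>\d{4})/(?P<month>\d{2})/'
    r'(?P<day>\d{2})/(?P<hour>\d{2})/input/?$')


def split_input_path(input_path):
    """Return the prefix and date fields encoded in an input path."""
    match = INPUT_PATH_RE.match(input_path)
    if match is None:
        raise ValueError('unexpected input path: %r' % input_path)
    return match.groupdict()


def to_output_path(input_path):
    """Swap the trailing 'input' component for 'output'."""
    fields = split_input_path(input_path)
    return '%(prefix)s%(year)s/%(month)s/%(day)s/%(hour)s/output' % fields
```

For example, to_output_path('2012/09/11/08/input') gives '2012/09/11/08/output', and split_input_path returns the year, month, day and hour as separate strings that could be exposed as extra fields.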
On Thu, Sep 13, 2012 at 2:04 PM, Ruslan Al-Fakikh <me...@gmail.com> wrote:
> MiaoMiao, Mohit,
>
> If we are talking about embedding Pig into Python, I'd like to add
> that we can also embed Pig into Java using PigServer
> http://wiki.apache.org/pig/EmbeddedPig
>
> MiaoMiao, what's the purpose of embedding here (if we already have
> parameter substitution feature)? I guess Pig embedding is mostly
> suitable in case we want to add IF/ELSE or LOOP functionality
>
> Thanks
>
> On Thu, Sep 13, 2012 at 6:31 AM, MiaoMiao <li...@gmail.com> wrote:
> > I wrote a python script to do this
> >
> > import sys
> > yyyymmddhh = sys.argv[1]
> > inputPath = getInputPath(yyyymmddhh) #yyyymmddhh to "YYYY/MM/DD/HH/input"
> > outputPath = getOutputPath(yyyymmddhh) #yyyymmddhh to "YYYY/MM/DD/HH/output"
> > pigScript = '''
> > some = load '$input' using PigStorage(',')
> > as(
> > id:INT,
> > value:INT
> > );
> > final = ..... ;
> > STORE final INTO '$output' using PigStorage(',');
> > '''
> > P = Pig.compile(pigScript)
> > result = P.bind({'input':inputPath, 'output':outputPath}).runSingle()
> > if result.isSuccessful() :
> > print 'Pig job succeeded'
> > else :
> > raise 'Pig job failed'
> >
> > Then you can run it with pig
> > pig -x local pig.py 2012091108
> >
> > On Tue, Sep 11, 2012 at 7:11 AM, Mohit Anchlia <mo...@gmail.com> wrote:
> >> Our input path is something like YYYY/MM/DD/HH/input and we like to write
> >> to YYYY/MM/DD/HH/output . Is it possible to get the input path as a String
> >> and convert it to YYYY/MM/DD/HH/output that I can use in "store into"
> >> clause?
>
--
"...:::Aniket:::... Quetzalco@tl"
Re: Input and output path
Posted by MiaoMiao <li...@gmail.com>.
Ah, sorry, I missed your earlier reply. I used Python because it's more
flexible: it lets me generate the Pig script from XML files that describe
all the fields in my input and output files. The same XML files can also
be used with Hive.
Re: Input and output path
Posted by Ruslan Al-Fakikh <me...@gmail.com>.
MiaoMiao, Mohit,
If we are talking about embedding Pig into Python, I'd like to add
that we can also embed Pig into Java using PigServer:
http://wiki.apache.org/pig/EmbeddedPig
MiaoMiao, what's the purpose of embedding here, given that we already have
the parameter substitution feature? I guess Pig embedding is mostly
suitable when we want to add IF/ELSE or LOOP functionality.
Thanks
Re: Input and output path
Posted by MiaoMiao <li...@gmail.com>.
I wrote a Python script to do this:

import sys
from org.apache.pig.scripting import Pig

# getInputPath and getOutputPath are my own helper functions (not shown here)
yyyymmddhh = sys.argv[1]
inputPath = getInputPath(yyyymmddhh)    # yyyymmddhh to "YYYY/MM/DD/HH/input"
outputPath = getOutputPath(yyyymmddhh)  # yyyymmddhh to "YYYY/MM/DD/HH/output"

pigScript = '''
some = load '$input' using PigStorage(',')
    as (
        id:INT,
        value:INT
    );
final = ..... ;
STORE final INTO '$output' using PigStorage(',');
'''

# Compile the script once, bind the two paths as parameters, run one job
P = Pig.compile(pigScript)
result = P.bind({'input': inputPath, 'output': outputPath}).runSingle()
if result.isSuccessful():
    print 'Pig job succeeded'
else:
    # String exceptions are not valid in current Python; raise a real exception
    raise RuntimeError('Pig job failed')

Then you can run it with pig:

pig -x local pig.py 2012091108
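The script calls getInputPath and getOutputPath without showing them. The following is only a guess at what they might look like, based on the comments next to the calls; split_timestamp is a hypothetical helper introduced for this sketch.

```python
def split_timestamp(yyyymmddhh):
    """Split a 10-digit 'YYYYMMDDHH' string into (YYYY, MM, DD, HH)."""
    if len(yyyymmddhh) != 10 or not yyyymmddhh.isdigit():
        raise ValueError('expected 10 digits (YYYYMMDDHH), got %r' % yyyymmddhh)
    return (yyyymmddhh[0:4], yyyymmddhh[4:6],
            yyyymmddhh[6:8], yyyymmddhh[8:10])


def getInputPath(yyyymmddhh):
    # yyyymmddhh to "YYYY/MM/DD/HH/input"
    return '/'.join(split_timestamp(yyyymmddhh) + ('input',))


def getOutputPath(yyyymmddhh):
    # yyyymmddhh to "YYYY/MM/DD/HH/output"
    return '/'.join(split_timestamp(yyyymmddhh) + ('output',))
```

With the argument from the command line above, getInputPath('2012091108') returns '2012/09/11/08/input' and getOutputPath('2012091108') returns '2012/09/11/08/output'. Validating the argument up front means a typo fails before any Pig job is launched.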
Re: Input and output path
Posted by Ruslan Al-Fakikh <me...@gmail.com>.
Mohit,
I am suggesting setting up a whole Hive warehouse. This way your
folders will look like
/user/hive/warehouse/yourdataset/date=2012-09-11
/user/hive/warehouse/yourdataset/date=2012-09-12
...
All the partitions' metadata will be kept in an RDBMS, so when you
query them with Hive it will look like
select * from yourdataset where date = '2012-09-11'
and it will be fast, because only the matching partition directory is read.
HCatalog is a layer that provides this Hive functionality to Pig and
MapReduce, so in Pig you can FILTER by those dates.
http://incubator.apache.org/hcatalog/docs/r0.4.0/loadstore.html#Load+Examples
Best Regards
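To make the folder naming concrete, here is a small Python sketch that maps a date to the partition directory and to the matching filter predicate. The warehouse root and dataset name are the ones from this message; the function names are hypothetical and nothing here talks to Hive or HCatalog.

```python
from datetime import date

# Warehouse root as written in the message above
WAREHOUSE_ROOT = '/user/hive/warehouse'


def partition_dir(dataset, day):
    """Directory that holds one date partition of a dataset."""
    return '%s/%s/date=%s' % (WAREHOUSE_ROOT, dataset, day.isoformat())


def partition_predicate(day):
    """Predicate on the partition column, with the date quoted as a string."""
    return "date = '%s'" % day.isoformat()
```

The value is quoted in the predicate so that it is compared as a string rather than evaluated as the arithmetic expression 2012 - 09 - 11.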
On Tue, Sep 11, 2012 at 3:29 AM, Mohit Anchlia <mo...@gmail.com> wrote:
> On Mon, Sep 10, 2012 at 4:17 PM, Ruslan Al-Fakikh <me...@gmail.com> wrote:
>
>> Mohit,
>>
>> I guess you could use parameters substitution here
>> http://wiki.apache.org/pig/ParameterSubstitution
>
> thanks this works.
>
>> Also, a note about your architecture:
>
> Are you suggesting change to the path names or your suggestion is to use
> HCatalog with pig?
>
>> You can consider using Hive partitions to effectively select
>> appropriate dates in the folder names. But as your tool is Pig, not
>> Hive, you can use HCatalog as a layer
>>
>> Best Regards
>>
>> On Tue, Sep 11, 2012 at 3:11 AM, Mohit Anchlia <mo...@gmail.com> wrote:
>>> Our input path is something like YYYY/MM/DD/HH/input and we like to write
>>> to YYYY/MM/DD/HH/output . Is it possible to get the input path as a String
>>> and convert it to YYYY/MM/DD/HH/output that I can use in "store into"
>>> clause?
Re: Input and output path
Posted by Mohit Anchlia <mo...@gmail.com>.
On Mon, Sep 10, 2012 at 4:17 PM, Ruslan Al-Fakikh <me...@gmail.com> wrote:
> Mohit,
>
> I guess you could use parameters substitution here
> http://wiki.apache.org/pig/ParameterSubstitution

Thanks, this works.

> Also, a note about your architecture:

Are you suggesting a change to the path names, or is your suggestion to use
HCatalog with Pig?

> You can consider using Hive partitions to effectively select
> appropriate dates in the folder names. But as your tool is Pig, not
> Hive, you can use HCatalog as a layer
>
> Best Regards
>
> On Tue, Sep 11, 2012 at 3:11 AM, Mohit Anchlia <mo...@gmail.com> wrote:
> > Our input path is something like YYYY/MM/DD/HH/input and we like to write
> > to YYYY/MM/DD/HH/output . Is it possible to get the input path as a String
> > and convert it to YYYY/MM/DD/HH/output that I can use in "store into"
> > clause?
Re: Input and output path
Posted by Ruslan Al-Fakikh <me...@gmail.com>.
Mohit,
I guess you could use parameter substitution here:
http://wiki.apache.org/pig/ParameterSubstitution
Also, a note about your architecture:
You can consider using Hive partitions to effectively select
appropriate dates in the folder names. But as your tool is Pig, not
Hive, you can use HCatalog as a layer.
Best Regards
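One way to drive parameter substitution from outside the script is to build the pig command line with one -param name=value pair per path. The sketch below assumes the Pig script refers to '$input' and '$output' and takes a YYYYMMDDHH string; build_pig_command and the script name job.pig are made up for this example.

```python
def build_pig_command(script, yyyymmddhh):
    """Return the argv list for running a Pig script with both paths bound."""
    base = '/'.join([yyyymmddhh[0:4], yyyymmddhh[4:6],
                     yyyymmddhh[6:8], yyyymmddhh[8:10]])
    return ['pig',
            '-param', 'input=%s/input' % base,
            '-param', 'output=%s/output' % base,
            script]
```

The resulting list can be handed to subprocess.call, or joined with spaces to see the equivalent shell command, for example pig -param input=2012/09/11/08/input -param output=2012/09/11/08/output job.pig.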