Posted to user@pig.apache.org by Prashant Kommireddi <pr...@gmail.com> on 2013/04/13 06:57:05 UTC

Re: Why does Pig not use default resources from the Configuration object?

+User group

Hi Bhooshan,

By default you should be running in MapReduce mode unless specified
otherwise. Are you creating a PigServer object to run your jobs? Can you
provide your code here?

Sent from my iPhone

On Apr 12, 2013, at 6:23 PM, Bhooshan Mogal <bh...@gmail.com>
wrote:

Apologies for the premature send. I may have some more information. After I
applied the patch and set "pig.use.overriden.hadoop.configs=true", I saw an
NPE (stacktrace below) and a message saying pig was running in exectype
local -

2013-04-13 07:37:13,758 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
to hadoop file system at: local
2013-04-13 07:37:13,760 [main] WARN  org.apache.hadoop.conf.Configuration -
mapred.used.genericoptionsparser is deprecated. Instead, use
mapreduce.client.genericoptionsparser.used
2013-04-13 07:37:14,162 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1200: Pig script failed to parse:
<file test.pig, line 1, column 4> pig script failed to validate:
java.lang.NullPointerException


Here is the stacktrace -

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error
during parsing. Pig script failed to parse:
<file test.pig, line 1, column 4> pig script failed to validate:
java.lang.NullPointerException
        at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1606)
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1549)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:549)
        at
org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:971)
        at
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
        at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
        at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
        at org.apache.pig.Main.run(Main.java:555)
        at org.apache.pig.Main.main(Main.java:111)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: Failed to parse: Pig script failed to parse:
<file test.pig, line 1, column 4> pig script failed to validate:
java.lang.NullPointerException
        at
org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
        at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1598)
        ... 14 more
Caused by:
<file test.pig, line 1, column 4> pig script failed to validate:
java.lang.NullPointerException
        at
org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:438)
        at
org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3168)
        at
org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1291)
        at
org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:789)
        at
org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:507)
        at
org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:382)
        at
org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
        ... 15 more




On Fri, Apr 12, 2013 at 6:16 PM, Bhooshan Mogal <bh...@gmail.com> wrote:

> Yes; however, I did not add core-site.xml, hdfs-site.xml, or yarn-site.xml,
> only my-filesystem-site.xml, using both Configuration.addDefaultResource and
> Configuration.addResource.
>
> I see what you are saying though. The patch might require users to take
> care of adding the default config resources as well, apart from their own
> resources?
>
>
> On Fri, Apr 12, 2013 at 6:06 PM, Prashant Kommireddi <pr...@gmail.com> wrote:
>
>> Did you set "pig.use.overriden.hadoop.configs=true" and then add your
>> configuration resources?
>>
>>
>> On Fri, Apr 12, 2013 at 5:32 PM, Bhooshan Mogal <bhooshan.mogal@gmail.com
>> > wrote:
>>
>>> Hi Prashant,
>>>
>>> Thanks for your response to my question, and sorry for the delayed
>>> reply. I was not subscribed to the dev mailing list and hence did not get a
>>> notification about your reply. I have copied our thread below so you can
>>> get some context.
>>>
>>> I tried the patch that you pointed to; however, with that patch it looks
>>> like pig is unable to find core-site.xml. It indicates that it is running
>>> the script in local mode in spite of having fs.default.name defined as
>>> the location of the HDFS namenode.
>>>
>>> Here is what I am trying to do - I have developed my own
>>> org.apache.hadoop.fs.FileSystem implementation and am trying to use it in
>>> my pig script. This implementation requires its own *-default and
>>> *-site.xml files. I have added the path to these files in PIG_CLASSPATH as
>>> well as HADOOP_CLASSPATH and can confirm that hadoop can find these files,
>>> as I am able to read these configurations in my code. However, pig code
>>> cannot find these configuration parameters. Upon doing some debugging in
>>> the pig code, it seems to me that pig does not use all the resources added
>>> to the Configuration object, but only certain specific ones such as
>>> hadoop-site.xml, core-site.xml, pig-cluster-hadoop-site.xml, yarn-site.xml,
>>> and hdfs-site.xml (I am looking at HExecutionEngine.java). Is it possible to
>>> have pig load user-defined resources, say foo-default.xml and
>>> foo-site.xml, while creating the JobConf object? I am narrowing in on this as
>>> the problem, because pig can find my config parameters if I define them in
>>> core-site.xml instead of my-filesystem-site.xml.
>>>
>>> Let me know if you need more details about the issue.
>>>
>>>
>>> Here is our previous conversation -
>>>
>>> Hi Bhooshan,
>>>
>>> There is a patch that addresses what you need, and is part of 0.12
>>> (unreleased). Take a look and see if you can apply the patch to the version
>>> you are using: https://issues.apache.org/jira/browse/PIG-3135.
>>>
>>> With this patch, the following property will allow you to override the
>>> default and pass in your own configuration.
>>> pig.use.overriden.hadoop.configs=true
>>>
>>>
>>> On Thu, Mar 28, 2013 at 6:10 PM, Bhooshan Mogal <bh...@gmail.com> wrote:
>>>
>>> > Hi Folks,
>>> >
>>> > I had implemented the Hadoop FileSystem abstract class for a storage system
>>> > at work. This implementation uses some config files that are similar in
>>> > structure to hadoop config files. They have a *-default.xml and a
>>> > *-site.xml for users to override default properties. In the class that
>>> > implemented the Hadoop FileSystem, I had added these configuration files as
>>> > default resources in a static block using
>>> > Configuration.addDefaultResource("my-default.xml") and
>>> > Configuration.addDefaultResource("my-site.xml"). This was working fine and
>>> > we were able to run the Hadoop Filesystem CLI and map-reduce jobs just fine
>>> > for our storage system. However, when we tried using this storage system in
>>> > pig scripts, we saw errors indicating that our configuration parameters
>>> > were not available. Upon further debugging, we saw that the config files
>>> > were added to the Configuration object as resources, but were part of
>>> > defaultResources. However, in Main.java in the pig source, we saw that the
>>> > Configuration object was created as Configuration conf = new
>>> > Configuration(false);, thereby setting loadDefaults to false in the conf
>>> > object. As a result, properties from the default resources (including my
>>> > config files) were not loaded and hence, unavailable.
>>> >
>>> > We solved the problem by using Configuration.addResource instead of
>>> > Configuration.addDefaultResource, but still could not figure out why Pig
>>> > does not use default resources.
>>> >
>>> > Could someone on the list explain why this is the case?
>>> >
>>> > Thanks,
>>> > --
>>> > Bhooshan
>>> >
>>>
>>>
>>>
>>> --
>>> Bhooshan
>>>
>>
>>
>
>
> --
> Bhooshan
>



-- 
Bhooshan
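The root cause described in the original question can be illustrated with a self-contained sketch. MiniConfiguration below is a hypothetical, simplified stand-in for Hadoop's org.apache.hadoop.conf.Configuration (not the real class); it models only the two pieces this thread discusses: the static defaultResources list that MyFileSystem's static block appends to, and the loadDefaults constructor flag that Pig's Main.java sets to false. The property and class names are placeholders taken from the thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for org.apache.hadoop.conf.Configuration.
public class MiniConfiguration {
    // Shared across all instances, like Hadoop's static defaultResources list.
    private static final List<String> defaultResources = new ArrayList<>();
    private final Map<String, String> props = new HashMap<>();

    // What MyFileSystem's static block calls: registers a resource name that
    // is loaded by every instance built with loadDefaults=true.
    public static void addDefaultResource(String name) {
        defaultResources.add(name);
    }

    // 'files' stands in for resource lookup on the classpath.
    public MiniConfiguration(boolean loadDefaults,
                             Map<String, Map<String, String>> files) {
        if (loadDefaults) {
            for (String r : defaultResources) {
                Map<String, String> m = files.get(r);
                if (m != null) props.putAll(m);
            }
        }
        // new Configuration(false), as in Pig's Main.java, skips the loop
        // above entirely, so statically registered resources never load.
    }

    public String get(String key) {
        return props.get(key);
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> files = new HashMap<>();
        Map<String, String> myfsSite = new HashMap<>();
        myfsSite.put("fs.myfs.impl", "com.example.MyFileSystem"); // placeholder
        files.put("myfs-site.xml", myfsSite);

        MiniConfiguration.addDefaultResource("myfs-site.xml");

        // loadDefaults=true: the property is visible.
        System.out.println(new MiniConfiguration(true, files).get("fs.myfs.impl"));
        // loadDefaults=false: the property is silently absent.
        System.out.println(new MiniConfiguration(false, files).get("fs.myfs.impl"));
    }
}
```

Run as-is, this prints the property value once and then null, which mirrors the symptom in the thread: properties registered via Configuration.addDefaultResource are invisible to a Configuration built with loadDefaults=false, while conf.addResource (modeled here only in the comment) would still work.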

Re: Why does Pig not use default resources from the Configuration object?

Posted by Bhooshan Mogal <bh...@gmail.com>.
Hi Prashant,

Any update regarding this?

-
Bhooshan.


On Wed, May 29, 2013 at 4:55 PM, Bhooshan Mogal <bh...@gmail.com> wrote:

> Hi Prashant,
>
> Apologies for the delay regarding this. I took some more time over the
> past couple of weeks to investigate this issue. It does turn out that Pig
> does not eliminate parameters from non-standard configuration resources. It
> seems that the reason why parameters from my config file were unavailable
> to pig was because I was adding these files to the configuration object
> using the Configuration.addDefaultResource() method. So the problem is the
> same one that I originally described. If I use
> conf.addResource("my-conf-site.xml") as opposed to
> Configuration.addDefaultResource("my-conf-site.xml"), the problem does not
> occur. Pig does not seem to use parameters from the defaultResources list
> in the Configuration class. In Main.java at
> https://github.com/apache/pig/blob/trunk/src/org/apache/pig/Main.java, I
> can see that the Configuration object is created as Configuration conf =
> new Configuration(false); (line 168). Since false is passed to the
> constructor, it does not include defaultResources while building the
> Configuration object.
>
> Could you (or anyone else on the list) explain why pig does not use
> defaultResources from the Configuration object? If my findings are correct,
> is there a case for not passing false to the Configuration constructor,
> based on whether a pig property is set? I would be more than happy to
> provide a patch for this if required.
>
>
> Thanks,
> Bhooshan.
>
>
> On Mon, Apr 15, 2013 at 5:57 PM, Prashant Kommireddi <pr...@gmail.com> wrote:
>
>> Pig actually does not add core-site.xml or hadoop-site.xml itself;
>> it merely expects these resources to be present on the classpath.
>>
>> JobConf is the interface that describes MR specifics to hadoop, and pig uses
>> it to define jobs for execution. It loads up mapred*.xml. It extends
>> Configuration and uses the properties loaded by it.
>>
>>
>> On Mon, Apr 15, 2013 at 5:34 PM, Bhooshan Mogal <bhooshan.mogal@gmail.com
>> > wrote:
>>
>>> Thanks! Quick question before starting this though. Since resources are
>>> added to the Configuration object in various classes in hadoop
>>> (Configuration.java adds core-*.xml, HdfsConfiguration.java adds
>>> hdfs-*.xml), why does Pig create a new JobConf object with selected
>>> resources before submitting a job and not reuse the Configuration object
>>> that may have been created earlier? Trying to understand why Pig adds
>>> core-site.xml, hdfs-site.xml, yarn-site.xml again.
>>>
>>>
>>> On Mon, Apr 15, 2013 at 4:43 PM, Prashant Kommireddi <
>>> prash1784@gmail.com> wrote:
>>>
>>>> Sounds good. Here is a doc on contributing a patch (for some pointers):
>>>> https://cwiki.apache.org/confluence/display/PIG/HowToContribute
>>>>
>>>>
>>>> On Mon, Apr 15, 2013 at 4:37 PM, Bhooshan Mogal <
>>>> bhooshan.mogal@gmail.com> wrote:
>>>>
>>>>> Hey Prashant,
>>>>>
>>>>> Yup, I can take a stab at it. This is the first time I am looking at
>>>>> Pig code, so I might take some time to get started. Will get back to you if
>>>>> I have questions in the meantime. And yes, I will write it so it reads a
>>>>> pig property.
>>>>>
>>>>> -
>>>>> Bhooshan.
>>>>>
>>>>>
>>>>> On Mon, Apr 15, 2013 at 11:58 AM, Prashant Kommireddi <
>>>>> prash1784@gmail.com> wrote:
>>>>>
>>>>>> Hi Bhooshan,
>>>>>>
>>>>>> This makes more sense now. I think overriding fs implementation
>>>>>> should go into core-site.xml, but it would be useful to be able to add
>>>>>> resources if you have a bunch of other properties.
>>>>>>
>>>>>> Would you like to submit a patch? It should be based on a pig
>>>>>> property that specifies the additional resource names (myfs-site.xml in
>>>>>> your case).
>>>>>>
>>>>>> -Prashant
>>>>>>
>>>>>>
>>>>>> On Mon, Apr 15, 2013 at 10:35 AM, Bhooshan Mogal <
>>>>>> bhooshan.mogal@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Prashant,
>>>>>>>
>>>>>>>
>>>>>>> Yes, I am running in MapReduce mode. Let me give you the steps in
>>>>>>> the scenario that I am trying to test -
>>>>>>>
>>>>>>> 1. I have my own implementation of org.apache.hadoop.fs.FileSystem
>>>>>>> for a filesystem I am trying to implement - Let's call it
>>>>>>> MyFileSystem.class. This filesystem uses the scheme myfs:// for its URIs.
>>>>>>> 2. I have set fs.myfs.impl to MyFileSystem.class in core-site.xml
>>>>>>> and made the class available through a jar file that is part of
>>>>>>> HADOOP_CLASSPATH (or PIG_CLASSPATH).
>>>>>>> 3. In MyFileSystem.class, I have a static block as -
>>>>>>> static {
>>>>>>>     Configuration.addDefaultResource("myfs-default.xml");
>>>>>>>     Configuration.addDefaultResource("myfs-site.xml");
>>>>>>> }
>>>>>>> Both these files are in the classpath. To be safe, I have also added
>>>>>>> myfs-site.xml in the constructor of MyFileSystem as
>>>>>>> conf.addResource("myfs-site.xml"), so that it is part of both the default
>>>>>>> resources and the non-default resources in the Configuration object.
>>>>>>> 4. I am trying to access the filesystem in my pig script as -
>>>>>>> A = LOAD 'myfs://myhost.com:8999/testdata' USING PigStorage(':') AS
>>>>>>> (name:chararray, age:int); -- loading data
>>>>>>> B = FOREACH A GENERATE name;
>>>>>>> store B into 'myfs://myhost.com:8999/testoutput';
>>>>>>> 5. The execution seems to start correctly, and MyFileSystem.class is
>>>>>>> invoked correctly. In MyFileSystem.class, I can also see that myfs-site.xml
>>>>>>> is loaded and the properties defined in it are available.
>>>>>>> 6. However, when Pig tries to submit the job, it cannot find these
>>>>>>> properties and the job submission fails.
>>>>>>> 7. If I move all the properties defined in myfs-site.xml to
>>>>>>> core-site.xml, the job gets submitted successfully, and it even succeeds.
>>>>>>> However, this is not ideal, as I do not want to clutter core-site.xml
>>>>>>> with all of the properties for a separate filesystem.
>>>>>>> 8. As I said earlier, upon taking a closer look at the pig code, I
>>>>>>> saw that while creating the JobConf object for a job, pig adds very
>>>>>>> specific resources to the job object, and ignores the resources that may
>>>>>>> have been added already (e.g. myfs-site.xml) to the Configuration object.
>>>>>>> 9. I have tested this with native map-reduce code as well as hive,
>>>>>>> and this approach of having a separate config file for MyFileSystem works
>>>>>>> fine in both those cases.
>>>>>>>
>>>>>>> So, to summarize, I am looking for a way to ask Pig to load
>>>>>>> parameters from my own config file before submitting a job.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -
>>>>>>> Bhooshan.
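Steps 2 and 3 in the message above imply two configuration files. The fragments below are only a sketch using the thread's placeholder names (myfs, fs.myfs.impl, com.example.MyFileSystem, and the sample property are assumptions, not real classes or keys): the scheme-to-class binding lives in core-site.xml, while filesystem-specific settings go in the separate myfs-site.xml that MyFileSystem registers in its static block.

```xml
<!-- core-site.xml entry (step 2): bind the myfs:// scheme to the custom
     FileSystem implementation. The class name is a placeholder. -->
<property>
  <name>fs.myfs.impl</name>
  <value>com.example.MyFileSystem</value>
</property>
```

```xml
<!-- myfs-site.xml (step 3): filesystem-specific overrides, registered via
     Configuration.addDefaultResource("myfs-site.xml") in MyFileSystem's
     static block. The property below is purely illustrative. -->
<configuration>
  <property>
    <name>myfs.example.setting</name>
    <value>some-value</value>
  </property>
</configuration>
```

Keeping the binding and the overrides in separate files is exactly what the thread is after: core-site.xml stays minimal, and all filesystem-specific properties ride along in the resource the FileSystem itself registers.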
>>>>>>>



-- 
Bhooshan

>>>>>>>         at org.apache.pig.PigServer.registerQuery(PigServer.java:549)
>>>>>>>         at
>>>>>>> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:971)
>>>>>>>         at
>>>>>>> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
>>>>>>>         at
>>>>>>> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
>>>>>>>         at
>>>>>>> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>>>>>>>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
>>>>>>>         at org.apache.pig.Main.run(Main.java:555)
>>>>>>>         at org.apache.pig.Main.main(Main.java:111)
>>>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
>>>>>>> Method)
>>>>>>>         at
>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>         at
>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>         at java.lang.reflect.Method.invoke(Method.java:616)
>>>>>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>>>>>>> Caused by: Failed to parse: Pig script failed to parse:
>>>>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>>>>> java.lang.NullPointerException
>>>>>>>         at
>>>>>>> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
>>>>>>>         at
>>>>>>> org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1598)
>>>>>>>         ... 14 more
>>>>>>> Caused by:
>>>>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>>>>> java.lang.NullPointerException
>>>>>>>         at
>>>>>>> org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:438)
>>>>>>>         at
>>>>>>> org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3168)
>>>>>>>         at
>>>>>>> org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1291)
>>>>>>>         at
>>>>>>> org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:789)
>>>>>>>         at
>>>>>>> org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:507)
>>>>>>>         at
>>>>>>> org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:382)
>>>>>>>         at
>>>>>>> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
>>>>>>>         ... 15 more
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 12, 2013 at 6:16 PM, Bhooshan Mogal <
>>>>>>> bhooshan.mogal@gmail.com> wrote:
>>>>>>>
>>>>>>>> Yes, however I did not add core-site.xml, hdfs-site.xml,
>>>>>>>> yarn-site.xml. Only my-filesystem-site.xml using both
>>>>>>>> Configuration.addDefaultResource and Configuration.addResource.
>>>>>>>>
>>>>>>>> I see what you are saying though. The patch might require users to
>>>>>>>> take care of adding the default config resources as well apart from their
>>>>>>>> own resources?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Apr 12, 2013 at 6:06 PM, Prashant Kommireddi <
>>>>>>>> prash1784@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Did you set "pig.use.overriden.hadoop.configs=true" and then add
>>>>>>>>> your configuration resources?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Apr 12, 2013 at 5:32 PM, Bhooshan Mogal <
>>>>>>>>> bhooshan.mogal@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Prashant,
>>>>>>>>>>
>>>>>>>>>> Thanks for your response to my question, and sorry for the
>>>>>>>>>> delayed reply. I was not subscribed to the dev mailing list and hence did
>>>>>>>>>> not get a notification about your reply. I have copied our thread below so
>>>>>>>>>> you can get some context.
>>>>>>>>>>
>>>>>>>>>> I tried the patch that you pointed to, however with that patch
>>>>>>>>>> looks like pig is unable to find core-site.xml. It indicates that it is
>>>>>>>>>> running the script in local mode in spite of having
>>>>>>>>>> fs.default.name defined as the location of the HDFS namenode.
>>>>>>>>>>
>>>>>>>>>> Here is what I am trying to do - I have developed my own
>>>>>>>>>> org.apache.hadoop.fs.FileSystem implementation and am trying to use it in
>>>>>>>>>> my pig script. This implementation requires its own *-default and
>>>>>>>>>> *-site.xml files. I have added the path to these files in PIG_CLASSPATH as
>>>>>>>>>> well as HADOOP_CLASSPATH and can confirm that hadoop can find these files,
>>>>>>>>>> as I am able to read these configurations in my code. However, pig code
>>>>>>>>>> cannot find these configuration parameters. Upon doing some debugging in
>>>>>>>>>> the pig code, it seems to me that pig does not use all the resources added
>>>>>>>>>> in the Configuration object, but only seems to use certain specific ones
>>>>>>>>>> like hadoop-site, core-site, pig-cluster-hadoop-site.xml,yarn-site.xml,
>>>>>>>>>> hdfs-site.xml (I am looking at HExecutionEngine.java). Is it possible to
>>>>>>>>>> have pig load user-defined resources like say foo-default.xml and
>>>>>>>>>> foo-site.xml while creating the JobConf object? I am narrowing on this as
>>>>>>>>>> the problem, because pig can find my config parameters if I define them in
>>>>>>>>>> core-site.xml instead of my-filesystem-site.xml.
>>>>>>>>>>
>>>>>>>>>> Let me know if you need more details about the issue.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Here is our previous conversation -
>>>>>>>>>>
>>>>>>>>>> Hi Bhooshan,
>>>>>>>>>>
>>>>>>>>>> There is a patch that addresses what you need, and is part of 0.12
>>>>>>>>>> (unreleased). Take a look and see if you can apply the patch to the version
>>>>>>>>>> you are using: https://issues.apache.org/jira/browse/PIG-3135.
>>>>>>>>>>
>>>>>>>>>> With this patch, the following property will allow you to override the
>>>>>>>>>> default and pass in your own configuration.
>>>>>>>>>> pig.use.overriden.hadoop.configs=true
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 28, 2013 at 6:10 PM, Bhooshan Mogal <bh...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> > Hi Folks,
>>>>>>>>>> >
>>>>>>>>>> > I had implemented the Hadoop FileSystem abstract class for a storage system
>>>>>>>>>> > at work. This implementation uses some config files that are similar in
>>>>>>>>>> > structure to hadoop config files. They have a *-default.xml and a
>>>>>>>>>> > *-site.xml for users to override default properties. In the class that
>>>>>>>>>> > implemented the Hadoop FileSystem, I had added these configuration files as
>>>>>>>>>> > default resources in a static block using
>>>>>>>>>> > Configuration.addDefaultResource("my-default.xml") and
>>>>>>>>>> > Configuration.addDefaultResource("my-site.xml"). This was working fine and
>>>>>>>>>> > we were able to run the Hadoop Filesystem CLI and map-reduce jobs just fine
>>>>>>>>>> > for our storage system. However, when we tried using this storage system in
>>>>>>>>>> > pig scripts, we saw errors indicating that our configuration parameters
>>>>>>>>>> > were not available. Upon further debugging, we saw that the config files
>>>>>>>>>> > were added to the Configuration object as resources, but were part of
>>>>>>>>>> > defaultResources. However, in Main.java in the pig source, we saw that the
>>>>>>>>>> > Configuration object was created as Configuration conf = new
>>>>>>>>>> > Configuration(false), thereby setting loadDefaults to false in the conf
>>>>>>>>>> > object. As a result, properties from the default resources (including my
>>>>>>>>>> > config files) were not loaded and hence, unavailable.
>>>>>>>>>> >
>>>>>>>>>> > We solved the problem by using Configuration.addResource instead of
>>>>>>>>>> > Configuration.addDefaultResource, but still could not figure out why Pig
>>>>>>>>>> > does not use default resources?
>>>>>>>>>> >
>>>>>>>>>> > Could someone on the list explain why this is the case?
>>>>>>>>>> >
>>>>>>>>>> > Thanks,
>>>>>>>>>> > --
>>>>>>>>>> > Bhooshan
>>>>>>>>>> >
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Bhooshan
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Bhooshan
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Bhooshan
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Bhooshan
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Bhooshan
>>>>
>>>
>>>
>>
>>
>> --
>> Bhooshan
>>
>
>


-- 
Bhooshan
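The root cause described in the quoted question above (defaultResources skipped because Pig's Main creates the conf as new Configuration(false)) can be sketched in isolation. The class below is an invented toy stand-in for org.apache.hadoop.conf.Configuration, not the real Hadoop implementation; it only models how the static default-resource list is ignored when an instance is constructed with loadDefaults = false:

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in (invented for illustration; NOT org.apache.hadoop.conf.Configuration)
// modeling why resources registered via addDefaultResource are invisible to an
// instance created with loadDefaults == false.
public class ConfModel {
    // Shared across all instances, like Configuration's static defaultResources list
    // that a FileSystem's static block appends to.
    private static final List<String> defaultResources = new ArrayList<>();

    private final List<String> resources = new ArrayList<>();
    private final boolean loadDefaults;

    public ConfModel(boolean loadDefaults) { this.loadDefaults = loadDefaults; }

    public static void addDefaultResource(String name) { defaultResources.add(name); }

    public void addResource(String name) { resources.add(name); }

    // The resource files that would actually be parsed for properties.
    public List<String> effectiveResources() {
        List<String> out = new ArrayList<>();
        if (loadDefaults) {
            out.addAll(defaultResources); // skipped entirely when loadDefaults == false
        }
        out.addAll(resources);
        return out;
    }

    public static void main(String[] args) {
        // What MyFileSystem's static block does:
        ConfModel.addDefaultResource("myfs-default.xml");
        ConfModel.addDefaultResource("myfs-site.xml");

        ConfModel pigStyle = new ConfModel(false);   // like new Configuration(false)
        System.out.println(pigStyle.effectiveResources());   // [] -- myfs files lost

        ConfModel workaround = new ConfModel(false);
        workaround.addResource("myfs-site.xml");     // the workaround from this thread
        System.out.println(workaround.effectiveResources()); // [myfs-site.xml]
    }
}
```

Per-instance resources added with addResource survive regardless of the flag, which is why switching from Configuration.addDefaultResource to Configuration.addResource worked around the problem.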

Re: Why does Pig not use default resources from the Configuration object?

Posted by Prashant Kommireddi <pr...@gmail.com>.
Pig actually does not add core-site.xml or hadoop-site.xml explicitly;
it merely looks for these resources to be present on the classpath.

JobConf is the interface that describes MapReduce specifics to Hadoop, and
Pig uses it to define jobs for execution. It loads mapred*.xml, extends
Configuration, and uses the properties loaded by it.
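If the job configuration is rebuilt from a fixed whitelist of well-known resource names, any property defined only in a custom file such as myfs-site.xml silently disappears, while the same property defined in core-site.xml survives. The sketch below is a simplified model of that recreation step, not Pig's actual HExecutionEngine code; the class name and property names are invented for illustration:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model (NOT Pig's actual HExecutionEngine code) of building a job
// configuration from a fixed whitelist of resource files on the classpath.
public class JobConfRebuild {
    // Pretend classpath: resource file name -> properties it defines.
    static final Map<String, Map<String, String>> CLASSPATH = new LinkedHashMap<>();

    static Map<String, String> buildJobConf(List<String> whitelist) {
        Map<String, String> conf = new HashMap<>();
        for (String resource : whitelist) {
            Map<String, String> props = CLASSPATH.get(resource);
            if (props != null) conf.putAll(props); // later files override earlier ones
        }
        return conf;
    }

    public static void main(String[] args) {
        CLASSPATH.put("core-site.xml", Map.of("fs.myfs.impl", "MyFileSystem"));
        CLASSPATH.put("myfs-site.xml", Map.of("myfs.server.token", "secret"));

        // Only well-known files are consulted, so anything defined solely in
        // myfs-site.xml never reaches the submitted job:
        Map<String, String> conf =
            buildJobConf(List.of("core-site.xml", "hdfs-site.xml", "yarn-site.xml"));
        System.out.println(conf.containsKey("fs.myfs.impl"));      // true
        System.out.println(conf.containsKey("myfs.server.token")); // false
    }
}
```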


On Mon, Apr 15, 2013 at 5:34 PM, Bhooshan Mogal <bh...@gmail.com> wrote:

> Thanks! Quick question before starting this though. Since resources are
> added to the Configuration object in various classes in hadoop
> (Configuration.java adds core-*.xml, HDFSConfiguration.java adds
> hdfs-*.xml), why does Pig create a new JobConf object with selected
> resources before submitting a job and not reuse the Configuration object
> that may have been created earlier? Trying to understand why Pig adds
> core-site.xml, hdfs-site.xml, yarn-site.xml again.
>
>
> On Mon, Apr 15, 2013 at 4:43 PM, Prashant Kommireddi <pr...@gmail.com>wrote:
>
>> Sounds good. Here is a doc on contributing patch (for some pointers)
>> https://cwiki.apache.org/confluence/display/PIG/HowToContribute
>>
>>
>> On Mon, Apr 15, 2013 at 4:37 PM, Bhooshan Mogal <bhooshan.mogal@gmail.com
>> > wrote:
>>
>>> Hey Prashant,
>>>
>>> Yup, I can take a stab at it. This is the first time I am looking at Pig
>>> code, so I might take some time to get started. Will get back to you if I
>>> have questions in the meantime. And yes, I will write it so it reads a pig
>>> property.
>>>
>>> -
>>> Bhooshan.
>>>
>>>
>>> On Mon, Apr 15, 2013 at 11:58 AM, Prashant Kommireddi <
>>> prash1784@gmail.com> wrote:
>>>
>>>> Hi Bhooshan,
>>>>
>>>> This makes more sense now. I think overriding fs implementation should
>>>> go into core-site.xml, but it would be useful to be able to add
>>>> resources if you have a bunch of other properties.
>>>>
>>>> Would you like to submit a patch? It should be based on a pig property
>>>> that suggests the additional resource names (myfs-site.xml) in your case.
>>>>
>>>> -Prashant
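Such a patch might parse a comma-separated list of resource names from a Pig property and hand each one to JobConf.addResource() before job submission. The sketch below is purely illustrative: the property name "pig.additional.hadoop.resources" and the helper class are invented here, not part of any released Pig version:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Sketch only: "pig.additional.hadoop.resources" is a hypothetical property
// name invented for this example, not an actual Pig configuration key.
public class AdditionalResources {
    static List<String> parse(Properties pigProps) {
        List<String> names = new ArrayList<>();
        String raw = pigProps.getProperty("pig.additional.hadoop.resources", "");
        for (String name : raw.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) names.add(trimmed);
        }
        return names;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("pig.additional.hadoop.resources",
                          "myfs-default.xml, myfs-site.xml");
        // In the real patch, each of these names would be passed to
        // jobConf.addResource(...) before the job is submitted.
        System.out.println(parse(props)); // [myfs-default.xml, myfs-site.xml]
    }
}
```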
>>>>
>>>>
>>>> On Mon, Apr 15, 2013 at 10:35 AM, Bhooshan Mogal <
>>>> bhooshan.mogal@gmail.com> wrote:
>>>>
>>>>> Hi Prashant,
>>>>>
>>>>>
>>>>> Yes, I am running in MapReduce mode. Let me give you the steps in the
>>>>> scenario that I am trying to test -
>>>>>
>>>>> 1. I have my own implementation of org.apache.hadoop.fs.FileSystem for
>>>>> a filesystem I am trying to implement - Let's call it MyFileSystem.class.
>>>>> This filesystem uses the scheme myfs:// for its URIs
>>>>> 2. I have set fs.myfs.impl to MyFileSystem.class in core-site.xml and
>>>>> made the class available through a jar file that is part of
>>>>> HADOOP_CLASSPATH (or PIG_CLASSPATH).
>>>>> 3. In MyFileSystem.class, I have a static block as -
>>>>> static {
>>>>>     Configuration.addDefaultResource("myfs-default.xml");
>>>>>     Configuration.addDefaultResource("myfs-site.xml");
>>>>> }
>>>>> Both these files are in the classpath. To be safe, I have also added
>>>>> myfs-site.xml in the constructor of MyFileSystem as
>>>>> conf.addResource("myfs-site.xml"), so that it is part of both the default
>>>>> resources as well as the non-default resources in the Configuration object.
>>>>> 4. I am trying to access the filesystem in my pig script as -
>>>>> A = LOAD 'myfs://myhost.com:8999/testdata' USING PigStorage(':') AS
>>>>> (name:chararray, age:int); -- loading data
>>>>> B = FOREACH A GENERATE name;
>>>>> STORE B INTO 'myfs://myhost.com:8999/testoutput';
>>>>> 5. The execution seems to start correctly, and MyFileSystem.class is
>>>>> invoked correctly. In MyFileSystem.class, I can also see that myfs-site.xml
>>>>> is loaded and the properties defined in it are available.
>>>>> 6. However, when Pig tries to submit the job, it cannot find these
>>>>> properties and the job fails to submit successfully.
>>>>> 7. If I move all the properties defined in myfs-site.xml to
>>>>> core-site.xml, the job gets submitted successfully, and it even succeeds.
>>>>> However, this is not ideal as I do not want to proliferate core-site.xml
>>>>> with all of the properties for a separate filesystem.
>>>>> 8. As I said earlier, upon taking a closer look at the pig code, I saw
>>>>> that while creating the JobConf object for a job, pig adds very specific
>>>>> resources to the job object, and ignores the resources that may have been
>>>>> added already (eg myfs-site.xml) in the Configuration object.
>>>>> 9. I have tested this with native map-reduce code as well as hive, and
>>>>> this approach of having a separate config file for MyFileSystem works fine
>>>>> in both those cases.
>>>>>
>>>>> So, to summarize, I am looking for a way to ask Pig to load parameters
>>>>> from my own config file before submitting a job.
>>>>>
>>>>> Thanks,
>>>>> -
>>>>> Bhooshan.
>>>>>
>
>
> --
> Bhooshan
>

Re: Why does Pig not use default resources from the Configuration object?

Posted by Bhooshan Mogal <bh...@gmail.com>.
Thanks! Quick question before starting this though. Since resources are
added to the Configuration object in various classes in hadoop
(Configuration.java adds core-*.xml, HDFSConfiguration.java adds
hdfs-*.xml), why does Pig create a new JobConf object with selected
resources before submitting a job and not reuse the Configuration object
that may have been created earlier? Trying to understand why Pig adds
core-site.xml, hdfs-site.xml, yarn-site.xml again.


On Mon, Apr 15, 2013 at 4:43 PM, Prashant Kommireddi <pr...@gmail.com>wrote:

> Sounds good. Here is a doc on contributing patch (for some pointers)
> https://cwiki.apache.org/confluence/display/PIG/HowToContribute
>
>
> On Mon, Apr 15, 2013 at 4:37 PM, Bhooshan Mogal <bh...@gmail.com>wrote:
>
>> Hey Prashant,
>>
>> Yup, I can take a stab at it. This is the first time I am looking at Pig
>> code, so I might take some time to get started. Will get back to you if I
>> have questions in the meantime. And yes, I will write it so it reads a pig
>> property.
>>
>> -
>> Bhooshan.
>>
>>
>> On Mon, Apr 15, 2013 at 11:58 AM, Prashant Kommireddi <
>> prash1784@gmail.com> wrote:
>>
>>> Hi Bhooshan,
>>>
>>> This makes more sense now. I think overriding fs implementation should
>>> go into core-site.xml, but it would be useful to be able to add
>>> resources if you have a bunch of other properties.
>>>
>>> Would you like to submit a patch? It should be based on a pig property
>>> that suggests the additional resource names (myfs-site.xml) in your case.
>>>
>>> -Prashant
>>>
>>>
>>> On Mon, Apr 15, 2013 at 10:35 AM, Bhooshan Mogal <
>>> bhooshan.mogal@gmail.com> wrote:
>>>
>>>> Hi Prashant,
>>>>
>>>>
>>>> Yes, I am running in MapReduce mode. Let me give you the steps in the
>>>> scenario that I am trying to test -
>>>>
>>>> 1. I have my own implementation of org.apache.hadoop.fs.FileSystem for
>>>> a filesystem I am trying to implement - Let's call it MyFileSystem.class.
>>>> This filesystem uses the scheme myfs:// for its URIs
>>>> 2. I have set fs.myfs.impl to MyFileSystem.class in core-site.xml and
>>>> made the class available through a jar file that is part of
>>>> HADOOP_CLASSPATH (or PIG_CLASSPATH).
>>>> 3. In MyFileSystem.class, I have a static block as -
>>>> static {
>>>>     Configuration.addDefaultResource("myfs-default.xml");
>>>>     Configuration.addDefaultResource("myfs-site.xml");
>>>> }
>>>> Both these files are in the classpath. To be safe, I have also added
>>>> the my-fs-site.xml in the constructor of MyFileSystem as
>>>> conf.addResource("myfs-site.xml"), so that it is part of both the default
>>>> resources as well as the non-default resources in the Configuration object.
>>>> 4. I am trying to access the filesystem in my pig script as -
>>>> A = LOAD 'myfs://myhost.com:8999/testdata' USING PigStorage(':') AS
>>>> (name:chararray, age:int); -- loading data
>>>> B = FOREACH A GENERATE name;
>>>> store B into 'myfs://myhost.com:8999/testoutput';
>>>> 5. The execution seems to start correctly, and MyFileSystem.class is
>>>> invoked correctly. In MyFileSystem.class, I can also see that myfs-site.xml
>>>> is loaded and the properties defined in it are available.
>>>> 6. However, when Pig tries to submit the job, it cannot find these
>>>> properties and the job fails to submit successfully.
>>>> 7. If I move all the properties defined in myfs-site.xml to
>>>> core-site.xml, the job gets submitted successfully, and it even succeeds.
>>>> However, this is not ideal as I do not want to proliferate core-site.xml
>>>> with all of the properties for a separate filesystem.
>>>> 8. As I said earlier, upon taking a closer look at the pig code, I saw
>>>> that while creating the JobConf object for a job, pig adds very specific
>>>> resources to the job object, and ignores the resources that may have been
>>>> added already (eg myfs-site.xml) in the Configuration object.
>>>> 9. I have tested this with native map-reduce code as well as hive, and
>>>> this approach of having a separate config file for MyFileSystem works fine
>>>> in both those cases.
>>>>
>>>> So, to summarize, I am looking for a way to ask Pig to load parameters
>>>> from my own config file before submitting a job.
>>>>
>>>> Thanks,
>>>> -
>>>> Bhooshan.
>>>>
>>>>
>>>>
>>>> On Fri, Apr 12, 2013 at 9:57 PM, Prashant Kommireddi <
>>>> prash1784@gmail.com> wrote:
>>>>
>>>>> +User group
>>>>>
>>>>> Hi Bhooshan,
>>>>>
>>>>> By default you should be running in MapReduce mode unless specified
>>>>> otherwise. Are you creating a PigServer object to run your jobs? Can you
>>>>> provide your code here?
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Apr 12, 2013, at 6:23 PM, Bhooshan Mogal <bh...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>  Apologies for the premature send. I may have some more information.
>>>>> After I applied the patch and set "pig.use.overriden.hadoop.configs=true",
>>>>> I saw an NPE (stacktrace below) and a message saying pig was running in
>>>>> exectype local -
>>>>>
>>>>> 2013-04-13 07:37:13,758 [main] INFO
>>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
>>>>> to hadoop file system at: local
>>>>> 2013-04-13 07:37:13,760 [main] WARN
>>>>> org.apache.hadoop.conf.Configuration - mapred.used.genericoptionsparser is
>>>>> deprecated. Instead, use mapreduce.client.genericoptionsparser.used
>>>>> 2013-04-13 07:37:14,162 [main] ERROR org.apache.pig.tools.grunt.Grunt
>>>>> - ERROR 1200: Pig script failed to parse:
>>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>>> java.lang.NullPointerException
>>>>>
>>>>>
>>>>> Here is the stacktrace =
>>>>>
>>>>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error
>>>>> during parsing. Pig script failed to parse:
>>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>>> java.lang.NullPointerException
>>>>>         at
>>>>> org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1606)
>>>>>         at
>>>>> org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1549)
>>>>>         at org.apache.pig.PigServer.registerQuery(PigServer.java:549)
>>>>>         at
>>>>> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:971)
>>>>>         at
>>>>> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
>>>>>         at
>>>>> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
>>>>>         at
>>>>> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>>>>>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
>>>>>         at org.apache.pig.Main.run(Main.java:555)
>>>>>         at org.apache.pig.Main.main(Main.java:111)
>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>         at
>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>         at
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>         at java.lang.reflect.Method.invoke(Method.java:616)
>>>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>>>>> Caused by: Failed to parse: Pig script failed to parse:
>>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>>> java.lang.NullPointerException
>>>>>         at
>>>>> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
>>>>>         at
>>>>> org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1598)
>>>>>         ... 14 more
>>>>> Caused by:
>>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>>> java.lang.NullPointerException
>>>>>         at
>>>>> org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:438)
>>>>>         at
>>>>> org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3168)
>>>>>         at
>>>>> org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1291)
>>>>>         at
>>>>> org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:789)
>>>>>         at
>>>>> org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:507)
>>>>>         at
>>>>> org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:382)
>>>>>         at
>>>>> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
>>>>>         ... 15 more
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Apr 12, 2013 at 6:16 PM, Bhooshan Mogal <
>>>>> bhooshan.mogal@gmail.com> wrote:
>>>>>
>>>>>> Yes, however I did not add core-site.xml, hdfs-site.xml, or
>>>>>> yarn-site.xml; only my-filesystem-site.xml, using both
>>>>>> Configuration.addDefaultResource and Configuration.addResource.
>>>>>>
>>>>>> I see what you are saying though. The patch might require users to
>>>>>> take care of adding the default config resources in addition to
>>>>>> their own resources?
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 12, 2013 at 6:06 PM, Prashant Kommireddi <
>>>>>> prash1784@gmail.com> wrote:
>>>>>>
>>>>>>> Did you set "pig.use.overriden.hadoop.configs=true" and then add
>>>>>>> your configuration resources?
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 12, 2013 at 5:32 PM, Bhooshan Mogal <
>>>>>>> bhooshan.mogal@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Prashant,
>>>>>>>>
>>>>>>>> Thanks for your response to my question, and sorry for the delayed
>>>>>>>> reply. I was not subscribed to the dev mailing list and hence did not get a
>>>>>>>> notification about your reply. I have copied our thread below so you can
>>>>>>>> get some context.
>>>>>>>>
>>>>>>>> I tried the patch that you pointed to; however, with that patch it
>>>>>>>> looks like pig is unable to find core-site.xml. It indicates that it is
>>>>>>>> running the script in local mode in spite of having fs.default.name
>>>>>>>> defined as the location of the HDFS namenode.
>>>>>>>>
>>>>>>>> Here is what I am trying to do - I have developed my own
>>>>>>>> org.apache.hadoop.fs.FileSystem implementation and am trying to use it in
>>>>>>>> my pig script. This implementation requires its own *-default and
>>>>>>>> *-site.xml files. I have added the path to these files in PIG_CLASSPATH as
>>>>>>>> well as HADOOP_CLASSPATH and can confirm that hadoop can find these files,
>>>>>>>> as I am able to read these configurations in my code. However, pig code
>>>>>>>> cannot find these configuration parameters. Upon doing some debugging in
>>>>>>>> the pig code, it seems to me that pig does not use all the resources added
>>>>>>>> in the Configuration object, but only seems to use certain specific ones
>>>>>>>> like hadoop-site, core-site, pig-cluster-hadoop-site.xml, yarn-site.xml,
>>>>>>>> hdfs-site.xml (I am looking at HExecutionEngine.java). Is it possible to
>>>>>>>> have pig load user-defined resources like say foo-default.xml and
>>>>>>>> foo-site.xml while creating the JobConf object? I am narrowing in on this as
>>>>>>>> the problem, because pig can find my config parameters if I define them in
>>>>>>>> core-site.xml instead of my-filesystem-site.xml.
>>>>>>>>
>>>>>>>> Let me know if you need more details about the issue.
>>>>>>>>
>>>>>>>>
>>>>>>>> Here is our previous conversation -
>>>>>>>>
>>>>>>>> Hi Bhooshan,
>>>>>>>>
>>>>>>>> There is a patch that addresses what you need, and is part of 0.12
>>>>>>>> (unreleased). Take a look and see if you can apply the patch to the version
>>>>>>>> you are using. https://issues.apache.org/jira/browse/PIG-3135.
>>>>>>>>
>>>>>>>> With this patch, the following property will allow you to override the
>>>>>>>> default and pass in your own configuration.
>>>>>>>> pig.use.overriden.hadoop.configs=true
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Mar 28, 2013 at 6:10 PM, Bhooshan Mogal <bh...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> > Hi Folks,
>>>>>>>> >
>>>>>>>> > I had implemented the Hadoop FileSystem abstract class for a storage system
>>>>>>>> > at work. This implementation uses some config files that are similar in
>>>>>>>> > structure to hadoop config files. They have a *-default.xml and a
>>>>>>>> > *-site.xml for users to override default properties. In the class that
>>>>>>>> > implemented the Hadoop FileSystem, I had added these configuration files as
>>>>>>>> > default resources in a static block using
>>>>>>>> > Configuration.addDefaultResource("my-default.xml") and
>>>>>>>> > Configuration.addDefaultResource("my-site.xml"). This was working fine and
>>>>>>>> > we were able to run the Hadoop Filesystem CLI and map-reduce jobs just fine
>>>>>>>> > for our storage system. However, when we tried using this storage system in
>>>>>>>> > pig scripts, we saw errors indicating that our configuration parameters
>>>>>>>> > were not available. Upon further debugging, we saw that the config files
>>>>>>>> > were added to the Configuration object as resources, but were part of
>>>>>>>> > defaultResources. However, in Main.java in the pig source, we saw that the
>>>>>>>> > Configuration object was created as Configuration conf = new
>>>>>>>> > Configuration(false);, thereby setting loadDefaults to false in the conf
>>>>>>>> > object. As a result, properties from the default resources (including my
>>>>>>>> > config files) were not loaded and hence, unavailable.
>>>>>>>> >
>>>>>>>> > We solved the problem by using Configuration.addResource instead of
>>>>>>>> > Configuration.addDefaultResource, but still could not figure out why Pig
>>>>>>>> > does not use default resources.
>>>>>>>> >
>>>>>>>> > Could someone on the list explain why this is the case?
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > --
>>>>>>>> > Bhooshan
>>>>>>>> >
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Bhooshan
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Bhooshan
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Bhooshan
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Bhooshan
>>>>
>>>
>>>
>>
>>
>> --
>> Bhooshan
>>
>
>


-- 
Bhooshan
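[Editorial note] The root cause discussed in this post — Pig's Main.java constructing `new Configuration(false)`, which skips everything registered via `Configuration.addDefaultResource` — can be illustrated with a small self-contained model. The class below is a toy stand-in, not Hadoop's real `org.apache.hadoop.conf.Configuration` (which reads XML resources from the classpath); it mimics only the `loadDefaults` flag and the `addDefaultResource`/`addResource` distinction, and the `resource` helper is invented for convenience:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of org.apache.hadoop.conf.Configuration, illustrating only the
// loadDefaults behavior discussed in this thread. "Resources" here are
// in-memory maps instead of XML files on the classpath.
class ToyConfiguration {
    // Mirrors Configuration's static list of registered default resources.
    private static final List<Map<String, String>> DEFAULT_RESOURCES = new ArrayList<>();

    private final Map<String, String> props = new HashMap<>();

    // What MyFileSystem's static block calls: registers a resource globally.
    static void addDefaultResource(Map<String, String> resource) {
        DEFAULT_RESOURCES.add(resource);
    }

    // Pig's Main.java does `new Configuration(false)`: with loadDefaults
    // false, everything registered via addDefaultResource is skipped.
    ToyConfiguration(boolean loadDefaults) {
        if (loadDefaults) {
            for (Map<String, String> r : DEFAULT_RESOURCES) {
                props.putAll(r);
            }
        }
    }

    // Non-default resources always apply, which is why switching the
    // filesystem to conf.addResource("myfs-site.xml") worked around the issue.
    void addResource(Map<String, String> resource) {
        props.putAll(resource);
    }

    String get(String key) {
        return props.get(key);
    }

    // Convenience helper (not part of the real API) for a one-entry resource.
    static Map<String, String> resource(String key, String value) {
        Map<String, String> m = new HashMap<>();
        m.put(key, value);
        return m;
    }
}
```

In this model, a configuration built with `loadDefaults == false` never sees properties registered through `addDefaultResource` but still sees anything pushed in through `addResource` — exactly the asymmetry hit in this thread.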

Re: Why does Pig not use default resources from the Configuration object?

Posted by Prashant Kommireddi <pr...@gmail.com>.
Sounds good. Here is a doc on contributing patch (for some pointers)
https://cwiki.apache.org/confluence/display/PIG/HowToContribute


On Mon, Apr 15, 2013 at 4:37 PM, Bhooshan Mogal <bh...@gmail.com> wrote:

> Hey Prashant,
>
> Yup, I can take a stab at it. This is the first time I am looking at Pig
> code, so I might take some time to get started. Will get back to you if I
> have questions in the meantime. And yes, I will write it so it reads a pig
> property.
>
> -
> Bhooshan.
>
>
> On Mon, Apr 15, 2013 at 11:58 AM, Prashant Kommireddi <prash1784@gmail.com
> > wrote:
>
>> Hi Bhooshan,
>>
>> This makes more sense now. I think overriding fs implementation should go
>> into core-site.xml, but it would be useful to be able to add resources if
>> you have a bunch of other properties.
>>
>> Would you like to submit a patch? It should be based on a pig property
>> that suggests the additional resource names (myfs-site.xml) in your case.
>>
>> -Prashant
>>
>>
>> On Mon, Apr 15, 2013 at 10:35 AM, Bhooshan Mogal <
>> bhooshan.mogal@gmail.com> wrote:
>>
>>> Hi Prashant,
>>>
>>>
>>> Yes, I am running in MapReduce mode. Let me give you the steps in the
>>> scenario that I am trying to test -
>>>
>>> 1. I have my own implementation of org.apache.hadoop.fs.FileSystem for a
>>> filesystem I am trying to implement - Let's call it MyFileSystem.class.
>>> This filesystem uses the scheme myfs:// for its URIs
>>> 2. I have set fs.myfs.impl to MyFileSystem.class in core-site.xml and
>>> made the class available through a jar file that is part of
>>> HADOOP_CLASSPATH (or PIG_CLASSPATH).
>>> 3. In MyFileSystem.class, I have a static block as -
>>> static {
>>>     Configuration.addDefaultResource("myfs-default.xml");
>>>     Configuration.addDefaultResource("myfs-site.xml");
>>> }
>>> Both these files are in the classpath. To be safe, I have also added the
>>> myfs-site.xml in the constructor of MyFileSystem as
>>> conf.addResource("myfs-site.xml"), so that it is part of both the default
>>> resources as well as the non-default resources in the Configuration object.
>>> 4. I am trying to access the filesystem in my pig script as -
>>> A = LOAD 'myfs://myhost.com:8999/testdata' USING PigStorage(':') AS
>>> (name:chararray, age:int); -- loading data
>>> B = FOREACH A GENERATE name;
>>> store B into 'myfs://myhost.com:8999/testoutput';
>>> 5. The execution seems to start correctly, and MyFileSystem.class is
>>> invoked correctly. In MyFileSystem.class, I can also see that myfs-site.xml
>>> is loaded and the properties defined in it are available.
>>> 6. However, when Pig tries to submit the job, it cannot find these
>>> properties and the job fails to submit successfully.
>>> 7. If I move all the properties defined in myfs-site.xml to
>>> core-site.xml, the job gets submitted successfully, and it even succeeds.
>>> However, this is not ideal as I do not want to clutter core-site.xml
>>> with all of the properties for a separate filesystem.
>>> 8. As I said earlier, upon taking a closer look at the pig code, I saw
>>> that while creating the JobConf object for a job, pig adds very specific
>>> resources to the job object, and ignores the resources that may have been
>>> added already (e.g. myfs-site.xml) in the Configuration object.
>>> 9. I have tested this with native map-reduce code as well as hive, and
>>> this approach of having a separate config file for MyFileSystem works fine
>>> in both those cases.
>>>
>>> So, to summarize, I am looking for a way to ask Pig to load parameters
>>> from my own config file before submitting a job.
>>>
>>> Thanks,
>>> -
>>> Bhooshan.
>>>
>>>
>>>
>>> On Fri, Apr 12, 2013 at 9:57 PM, Prashant Kommireddi <
>>> prash1784@gmail.com> wrote:
>>>
>>>> +User group
>>>>
>>>> Hi Bhooshan,
>>>>
>>>> By default you should be running in MapReduce mode unless specified
>>>> otherwise. Are you creating a PigServer object to run your jobs? Can you
>>>> provide your code here?
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Apr 12, 2013, at 6:23 PM, Bhooshan Mogal <bh...@gmail.com>
>>>> wrote:
>>>>
>>>>  Apologies for the premature send. I may have some more information.
>>>> After I applied the patch and set "pig.use.overriden.hadoop.configs=true",
>>>> I saw an NPE (stacktrace below) and a message saying pig was running in
>>>> exectype local -
>>>>
>>>> 2013-04-13 07:37:13,758 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
>>>> to hadoop file system at: local
>>>> 2013-04-13 07:37:13,760 [main] WARN
>>>> org.apache.hadoop.conf.Configuration - mapred.used.genericoptionsparser is
>>>> deprecated. Instead, use mapreduce.client.genericoptionsparser.used
>>>> 2013-04-13 07:37:14,162 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>>>> ERROR 1200: Pig script failed to parse:
>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>> java.lang.NullPointerException
>>>>
>>>>
>>>> Here is the stacktrace =
>>>>
>>>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error
>>>> during parsing. Pig script failed to parse:
>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>> java.lang.NullPointerException
>>>>         at
>>>> org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1606)
>>>>         at
>>>> org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1549)
>>>>         at org.apache.pig.PigServer.registerQuery(PigServer.java:549)
>>>>         at
>>>> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:971)
>>>>         at
>>>> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
>>>>         at
>>>> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
>>>>         at
>>>> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>>>>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
>>>>         at org.apache.pig.Main.run(Main.java:555)
>>>>         at org.apache.pig.Main.main(Main.java:111)
>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>         at
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>         at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>         at java.lang.reflect.Method.invoke(Method.java:616)
>>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>>>> Caused by: Failed to parse: Pig script failed to parse:
>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>> java.lang.NullPointerException
>>>>         at
>>>> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
>>>>         at
>>>> org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1598)
>>>>         ... 14 more
>>>> Caused by:
>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>> java.lang.NullPointerException
>>>>         at
>>>> org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:438)
>>>>         at
>>>> org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3168)
>>>>         at
>>>> org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1291)
>>>>         at
>>>> org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:789)
>>>>         at
>>>> org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:507)
>>>>         at
>>>> org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:382)
>>>>         at
>>>> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
>>>>         ... 15 more
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Apr 12, 2013 at 6:16 PM, Bhooshan Mogal <
>>>> bhooshan.mogal@gmail.com> wrote:
>>>>
>>>>> Yes, however I did not add core-site.xml, hdfs-site.xml, or
>>>>> yarn-site.xml; only my-filesystem-site.xml, using both
>>>>> Configuration.addDefaultResource and Configuration.addResource.
>>>>>
>>>>> I see what you are saying though. The patch might require users to
>>>>> take care of adding the default config resources in addition to
>>>>> their own resources?
>>>>>
>>>>>
>>>>> On Fri, Apr 12, 2013 at 6:06 PM, Prashant Kommireddi <
>>>>> prash1784@gmail.com> wrote:
>>>>>
>>>>>> Did you set "pig.use.overriden.hadoop.configs=true" and then add your
>>>>>> configuration resources?
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 12, 2013 at 5:32 PM, Bhooshan Mogal <
>>>>>> bhooshan.mogal@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Prashant,
>>>>>>>
>>>>>>> Thanks for your response to my question, and sorry for the delayed
>>>>>>> reply. I was not subscribed to the dev mailing list and hence did not get a
>>>>>>> notification about your reply. I have copied our thread below so you can
>>>>>>> get some context.
>>>>>>>
>>>>>>> I tried the patch that you pointed to; however, with that patch it
>>>>>>> looks like pig is unable to find core-site.xml. It indicates that it is
>>>>>>> running the script in local mode in spite of having fs.default.name
>>>>>>> defined as the location of the HDFS namenode.
>>>>>>>
>>>>>>> Here is what I am trying to do - I have developed my own
>>>>>>> org.apache.hadoop.fs.FileSystem implementation and am trying to use it in
>>>>>>> my pig script. This implementation requires its own *-default and
>>>>>>> *-site.xml files. I have added the path to these files in PIG_CLASSPATH as
>>>>>>> well as HADOOP_CLASSPATH and can confirm that hadoop can find these files,
>>>>>>> as I am able to read these configurations in my code. However, pig code
>>>>>>> cannot find these configuration parameters. Upon doing some debugging in
>>>>>>> the pig code, it seems to me that pig does not use all the resources added
>>>>>>> in the Configuration object, but only seems to use certain specific ones
>>>>>>> like hadoop-site, core-site, pig-cluster-hadoop-site.xml, yarn-site.xml,
>>>>>>> hdfs-site.xml (I am looking at HExecutionEngine.java). Is it possible to
>>>>>>> have pig load user-defined resources like say foo-default.xml and
>>>>>>> foo-site.xml while creating the JobConf object? I am narrowing in on this as
>>>>>>> the problem, because pig can find my config parameters if I define them in
>>>>>>> core-site.xml instead of my-filesystem-site.xml.
>>>>>>>
>>>>>>> Let me know if you need more details about the issue.
>>>>>>>
>>>>>>>
>>>>>>> Here is our previous conversation -
>>>>>>>
>>>>>>> Hi Bhooshan,
>>>>>>>
>>>>>>> There is a patch that addresses what you need, and is part of 0.12
>>>>>>> (unreleased). Take a look and see if you can apply the patch to the version
>>>>>>> you are using. https://issues.apache.org/jira/browse/PIG-3135.
>>>>>>>
>>>>>>> With this patch, the following property will allow you to override the
>>>>>>> default and pass in your own configuration.
>>>>>>> pig.use.overriden.hadoop.configs=true
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Mar 28, 2013 at 6:10 PM, Bhooshan Mogal <bh...@gmail.com> wrote:
>>>>>>>
>>>>>>> > Hi Folks,
>>>>>>> >
>>>>>>> > I had implemented the Hadoop FileSystem abstract class for a storage system
>>>>>>> > at work. This implementation uses some config files that are similar in
>>>>>>> > structure to hadoop config files. They have a *-default.xml and a
>>>>>>> > *-site.xml for users to override default properties. In the class that
>>>>>>> > implemented the Hadoop FileSystem, I had added these configuration files as
>>>>>>> > default resources in a static block using
>>>>>>> > Configuration.addDefaultResource("my-default.xml") and
>>>>>>> > Configuration.addDefaultResource("my-site.xml"). This was working fine and
>>>>>>> > we were able to run the Hadoop Filesystem CLI and map-reduce jobs just fine
>>>>>>> > for our storage system. However, when we tried using this storage system in
>>>>>>> > pig scripts, we saw errors indicating that our configuration parameters
>>>>>>> > were not available. Upon further debugging, we saw that the config files
>>>>>>> > were added to the Configuration object as resources, but were part of
>>>>>>> > defaultResources. However, in Main.java in the pig source, we saw that the
>>>>>>> > Configuration object was created as Configuration conf = new
>>>>>>> > Configuration(false);, thereby setting loadDefaults to false in the conf
>>>>>>> > object. As a result, properties from the default resources (including my
>>>>>>> > config files) were not loaded and hence, unavailable.
>>>>>>> >
>>>>>>> > We solved the problem by using Configuration.addResource instead of
>>>>>>> > Configuration.addDefaultResource, but still could not figure out why Pig
>>>>>>> > does not use default resources.
>>>>>>> >
>>>>>>> > Could someone on the list explain why this is the case?
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > --
>>>>>>> > Bhooshan
>>>>>>> >
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Bhooshan
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Bhooshan
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Bhooshan
>>>>
>>>>
>>>
>>>
>>> --
>>> Bhooshan
>>>
>>
>>
>
>
> --
> Bhooshan
>

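[Editorial note] The patch direction agreed on above — driving the extra resources from a Pig property — could start with something as simple as parsing a comma-separated list of resource names. The property name `pig.additional.config.resources` below is invented for illustration (the eventual patch would pick its own name); in real code, each parsed name would then be handed to `jobConf.addResource(name)` when the JobConf is built:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how HExecutionEngine-style code might consume a hypothetical
// property such as:
//   pig.additional.config.resources=myfs-default.xml, myfs-site.xml
// Each returned name would be passed to jobConf.addResource(name) in order,
// so *-site.xml entries override *-default.xml entries.
class AdditionalResources {
    static List<String> parse(String propertyValue) {
        List<String> names = new ArrayList<>();
        if (propertyValue == null) {
            return names;
        }
        for (String part : propertyValue.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }
}
```

Keeping the property a plain ordered list preserves Hadoop's usual last-resource-wins override semantics without touching core-site.xml.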
Re: Why does Pig not use default resources from the Configuration object?

Posted by Bhooshan Mogal <bh...@gmail.com>.
Hey Prashant,

Yup, I can take a stab at it. This is the first time I am looking at Pig
code, so I might take some time to get started. Will get back to you if I
have questions in the meantime. And yes, I will write it so it reads a pig
property.

-
Bhooshan.


On Mon, Apr 15, 2013 at 11:58 AM, Prashant Kommireddi
<pr...@gmail.com> wrote:

> Hi Bhooshan,
>
> This makes more sense now. I think overriding fs implementation should go
> into core-site.xml, but it would be useful to be able to add resources if
> you have a bunch of other properties.
>
> Would you like to submit a patch? It should be based on a pig property
> that suggests the additional resource names (myfs-site.xml) in your case.
>
> -Prashant
>
>
> On Mon, Apr 15, 2013 at 10:35 AM, Bhooshan Mogal <bhooshan.mogal@gmail.com
> > wrote:
>
>> Hi Prashant,
>>
>>
>> Yes, I am running in MapReduce mode. Let me give you the steps in the
>> scenario that I am trying to test -
>>
>> 1. I have my own implementation of org.apache.hadoop.fs.FileSystem for a
>> filesystem I am trying to implement - Let's call it MyFileSystem.class.
>> This filesystem uses the scheme myfs:// for its URIs
>> 2. I have set fs.myfs.impl to MyFileSystem.class in core-site.xml and
>> made the class available through a jar file that is part of
>> HADOOP_CLASSPATH (or PIG_CLASSPATH).
>> 3. In MyFileSystem.class, I have a static block as -
>> static {
>>     Configuration.addDefaultResource("myfs-default.xml");
>>     Configuration.addDefaultResource("myfs-site.xml");
>> }
>> Both these files are in the classpath. To be safe, I have also added the
>> myfs-site.xml in the constructor of MyFileSystem as
>> conf.addResource("myfs-site.xml"), so that it is part of both the default
>> resources as well as the non-default resources in the Configuration object.
>> 4. I am trying to access the filesystem in my pig script as -
>> A = LOAD 'myfs://myhost.com:8999/testdata' USING PigStorage(':') AS
>> (name:chararray, age:int); -- loading data
>> B = FOREACH A GENERATE name;
>> store B into 'myfs://myhost.com:8999/testoutput';
>> 5. The execution seems to start correctly, and MyFileSystem.class is
>> invoked correctly. In MyFileSystem.class, I can also see that myfs-site.xml
>> is loaded and the properties defined in it are available.
>> 6. However, when Pig tries to submit the job, it cannot find these
>> properties and the job fails to submit successfully.
>> 7. If I move all the properties defined in myfs-site.xml to
>> core-site.xml, the job gets submitted successfully, and it even succeeds.
>> However, this is not ideal as I do not want to clutter core-site.xml
>> with all of the properties for a separate filesystem.
>> 8. As I said earlier, upon taking a closer look at the pig code, I saw
>> that while creating the JobConf object for a job, pig adds very specific
>> resources to the job object, and ignores the resources that may have been
>> added already (e.g. myfs-site.xml) in the Configuration object.
>> 9. I have tested this with native map-reduce code as well as hive, and
>> this approach of having a separate config file for MyFileSystem works fine
>> in both those cases.
>>
>> So, to summarize, I am looking for a way to ask Pig to load parameters
>> from my own config file before submitting a job.
>>
>> Thanks,
>> -
>> Bhooshan.
>>
>>
>>
>> On Fri, Apr 12, 2013 at 9:57 PM, Prashant Kommireddi <prash1784@gmail.com
>> > wrote:
>>
>>> +User group
>>>
>>> Hi Bhooshan,
>>>
>>> By default you should be running in MapReduce mode unless specified
>>> otherwise. Are you creating a PigServer object to run your jobs? Can you
>>> provide your code here?
>>>
>>> Sent from my iPhone
>>>
>>> On Apr 12, 2013, at 6:23 PM, Bhooshan Mogal <bh...@gmail.com>
>>> wrote:
>>>
>>>  Apologies for the premature send. I may have some more information.
>>> After I applied the patch and set "pig.use.overriden.hadoop.configs=true",
>>> I saw an NPE (stacktrace below) and a message saying pig was running in
>>> exectype local -
>>>
>>> 2013-04-13 07:37:13,758 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
>>> to hadoop file system at: local
>>> 2013-04-13 07:37:13,760 [main] WARN
>>> org.apache.hadoop.conf.Configuration - mapred.used.genericoptionsparser is
>>> deprecated. Instead, use mapreduce.client.genericoptionsparser.used
>>> 2013-04-13 07:37:14,162 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>>> ERROR 1200: Pig script failed to parse:
>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>> java.lang.NullPointerException
>>>
>>>
>>> Here is the stacktrace =
>>>
>>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error
>>> during parsing. Pig script failed to parse:
>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>> java.lang.NullPointerException
>>>         at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1606)
>>>         at
>>> org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1549)
>>>         at org.apache.pig.PigServer.registerQuery(PigServer.java:549)
>>>         at
>>> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:971)
>>>         at
>>> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
>>>         at
>>> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
>>>         at
>>> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>>>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
>>>         at org.apache.pig.Main.run(Main.java:555)
>>>         at org.apache.pig.Main.main(Main.java:111)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>         at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>         at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>         at java.lang.reflect.Method.invoke(Method.java:616)
>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>>> Caused by: Failed to parse: Pig script failed to parse:
>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>> java.lang.NullPointerException
>>>         at
>>> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
>>>         at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1598)
>>>         ... 14 more
>>> Caused by:
>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>> java.lang.NullPointerException
>>>         at
>>> org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:438)
>>>         at
>>> org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3168)
>>>         at
>>> org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1291)
>>>         at
>>> org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:789)
>>>         at
>>> org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:507)
>>>         at
>>> org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:382)
>>>         at
>>> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
>>>         ... 15 more
>>>
>>>
>>>
>>>
>>> On Fri, Apr 12, 2013 at 6:16 PM, Bhooshan Mogal <
>>> bhooshan.mogal@gmail.com> wrote:
>>>
>>>> Yes, however I did not add core-site.xml, hdfs-site.xml, or yarn-site.xml;
>>>> only my-filesystem-site.xml, using both Configuration.addDefaultResource and
>>>> Configuration.addResource.
>>>>
>>>> I see what you are saying though. The patch might require users to take
>>>> care of adding the default config resources in addition to their own
>>>> resources?
>>>>
>>>>
>>>> On Fri, Apr 12, 2013 at 6:06 PM, Prashant Kommireddi <
>>>> prash1784@gmail.com> wrote:
>>>>
>>>>> Did you set "pig.use.overriden.hadoop.configs=true" and then add your
>>>>> configuration resources?
>>>>>
>>>>>
>>>>> On Fri, Apr 12, 2013 at 5:32 PM, Bhooshan Mogal <
>>>>> bhooshan.mogal@gmail.com> wrote:
>>>>>
>>>>>> Hi Prashant,
>>>>>>
>>>>>> Thanks for your response to my question, and sorry for the delayed
>>>>>> reply. I was not subscribed to the dev mailing list and hence did not get a
>>>>>> notification about your reply. I have copied our thread below so you can
>>>>>> get some context.
>>>>>>
>>>>>> I tried the patch that you pointed to; however, with that patch it looks
>>>>>> like pig is unable to find core-site.xml. It indicates that it is running
>>>>>> the script in local mode in spite of having fs.default.name defined
>>>>>> as the location of the HDFS namenode.
>>>>>>
>>>>>> Here is what I am trying to do - I have developed my own
>>>>>> org.apache.hadoop.fs.FileSystem implementation and am trying to use it in
>>>>>> my pig script. This implementation requires its own *-default and
>>>>>> *-site.xml files. I have added the path to these files in PIG_CLASSPATH as
>>>>>> well as HADOOP_CLASSPATH and can confirm that hadoop can find these files,
>>>>>> as I am able to read these configurations in my code. However, pig code
>>>>>> cannot find these configuration parameters. Upon doing some debugging in
>>>>>> the pig code, it seems to me that pig does not use all the resources added
>>>>>> in the Configuration object, but only seems to use certain specific ones
>>>>>> like hadoop-site, core-site, pig-cluster-hadoop-site.xml, yarn-site.xml,
>>>>>> hdfs-site.xml (I am looking at HExecutionEngine.java). Is it possible to
>>>>>> have pig load user-defined resources like say foo-default.xml and
>>>>>> foo-site.xml while creating the JobConf object? I am narrowing in on this as
>>>>>> the problem, because pig can find my config parameters if I define them in
>>>>>> core-site.xml instead of my-filesystem-site.xml.
>>>>>>
>>>>>> Let me know if you need more details about the issue.
>>>>>>
>>>>>>
>>>>>> Here is our previous conversation -
>>>>>>
>>>>>> Hi Bhooshan,
>>>>>>
>>>>>> There is a patch that addresses what you need, and is part of 0.12
>>>>>> (unreleased). Take a look and see if you can apply the patch to the version
>>>>>> you are using. https://issues.apache.org/jira/browse/PIG-3135.
>>>>>>
>>>>>> With this patch, the following property will allow you to override the
>>>>>> default and pass in your own configuration.
>>>>>> pig.use.overriden.hadoop.configs=true
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 28, 2013 at 6:10 PM, Bhooshan Mogal <bh...@gmail.com> wrote:
>>>>>>
>>>>>> > Hi Folks,
>>>>>> >
>>>>>> > I had implemented the Hadoop FileSystem abstract class for a storage system
>>>>>> > at work. This implementation uses some config files that are similar in
>>>>>> > structure to hadoop config files. They have a *-default.xml and a
>>>>>> > *-site.xml for users to override default properties. In the class that
>>>>>> > implemented the Hadoop FileSystem, I had added these configuration files as
>>>>>> > default resources in a static block using
>>>>>> > Configuration.addDefaultResource("my-default.xml") and
>>>>>> > Configuration.addDefaultResource("my-site.xml"). This was working fine and
>>>>>> > we were able to run the Hadoop Filesystem CLI and map-reduce jobs just fine
>>>>>> > for our storage system. However, when we tried using this storage system in
>>>>>> > pig scripts, we saw errors indicating that our configuration parameters
>>>>>> > were not available. Upon further debugging, we saw that the config files
>>>>>> > were added to the Configuration object as resources, but were part of
>>>>>> > defaultResources. However, in Main.java in the pig source, we saw that the
>>>>>> > Configuration object was created as Configuration conf = new
>>>>>> > Configuration(false);, thereby setting loadDefaults to false in the conf
>>>>>> > object. As a result, properties from the default resources (including my
>>>>>> > config files) were not loaded and hence, unavailable.
>>>>>> >
>>>>>> > We solved the problem by using Configuration.addResource instead of
>>>>>> > Configuration.addDefaultResource, but we still could not figure out why
>>>>>> > Pig does not use default resources.
>>>>>> >
>>>>>> > Could someone on the list explain why this is the case?
>>>>>> >
>>>>>> > Thanks,
>>>>>> > --
>>>>>> > Bhooshan
>>>>>> >
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Bhooshan
>>>>>>


-- 
Bhooshan
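The behavior described in the quoted message above, where default resources are skipped because Pig builds its Configuration with loadDefaults set to false, can be sketched with a simplified, self-contained model. This is not Hadoop's actual Configuration code; the class below only mirrors the shape of the real API (addDefaultResource, addResource, the loadDefaults constructor flag) for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of Hadoop's Configuration resource handling.
// Illustrative only; not Hadoop's real implementation.
class MiniConfiguration {
    // Shared across all instances, like Configuration.addDefaultResource
    private static final List<String> defaultResources = new ArrayList<>();
    private final List<String> resources = new ArrayList<>();
    private final boolean loadDefaults;

    MiniConfiguration(boolean loadDefaults) {
        this.loadDefaults = loadDefaults;
    }

    static void addDefaultResource(String name) {
        defaultResources.add(name);
    }

    void addResource(String name) {
        resources.add(name);
    }

    // Resources that would actually be parsed for properties
    List<String> effectiveResources() {
        List<String> out = new ArrayList<>();
        if (loadDefaults) {
            out.addAll(defaultResources); // skipped by new Configuration(false)
        }
        out.addAll(resources);
        return out;
    }
}
```

Under this model, a file registered via addDefaultResource never reaches an instance created with loadDefaults=false, while a file registered via addResource always does, which matches both the missing-properties symptom and the workaround described in the thread.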

Re: Why does Pig not use default resources from the Configuration object?

Posted by Prashant Kommireddi <pr...@gmail.com>.
Hi Bhooshan,

This makes more sense now. I think overriding fs implementation should go
into core-site.xml, but it would be useful to be able to add resources if
you have a bunch of other properties.

Would you like to submit a patch? It should be based on a Pig property that
specifies the additional resource names (myfs-site.xml, in your case).

-Prashant
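A minimal sketch of what such a patch might do: parse a comma-separated Pig property into resource names, each of which would then be handed to Configuration.addResource when the JobConf is built. The property name "pig.additional.hadoop.resources" is invented here for illustration; PIG-3135 and any eventual patch may use a different name and mechanism.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: turn a comma-separated property value into resource names.
// The property name is hypothetical, not a real Pig property.
class ExtraResources {
    static List<String> parse(String propValue) {
        List<String> names = new ArrayList<>();
        if (propValue == null) {
            return names;
        }
        for (String name : propValue.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                names.add(trimmed); // each entry would go to conf.addResource(trimmed)
            }
        }
        return names;
    }
}
```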


On Mon, Apr 15, 2013 at 10:35 AM, Bhooshan Mogal
<bh...@gmail.com> wrote:

> Hi Prashant,
>
>
> Yes, I am running in MapReduce mode. Let me give you the steps in the
> scenario that I am trying to test -
>
> 1. I have my own implementation of org.apache.hadoop.fs.FileSystem for a
> filesystem I am trying to implement - Let's call it MyFileSystem.class.
> This filesystem uses the scheme myfs:// for its URIs
> 2. I have set fs.myfs.impl to MyFileSystem.class in core-site.xml and made
> the class available through a jar file that is part of HADOOP_CLASSPATH (or
> PIG_CLASSPATH).
> 3. In MyFileSystem.class, I have a static block as -
> static {
>     Configuration.addDefaultResource("myfs-default.xml");
>     Configuration.addDefaultResource("myfs-site.xml");
> }
> Both these files are in the classpath. To be safe, I have also added the
> myfs-site.xml in the constructor of MyFileSystem as
> conf.addResource("myfs-site.xml"), so that it is part of both the default
> resources as well as the non-default resources in the Configuration object.
> 4. I am trying to access the filesystem in my pig script as -
> A = LOAD 'myfs://myhost.com:8999/testdata' USING PigStorage(':') AS
> (name:chararray, age:int); -- loading data
> B = FOREACH A GENERATE name;
> store B into 'myfs://myhost.com:8999/testoutput';
> 5. The execution seems to start correctly, and MyFileSystem.class is
> invoked correctly. In MyFileSystem.class, I can also see that myfs-site.xml
> is loaded and the properties defined in it are available.
> 6. However, when Pig tries to submit the job, it cannot find these
> properties and the job fails to submit successfully.
> 7. If I move all the properties defined in myfs-site.xml to core-site.xml,
> the job gets submitted successfully, and it even succeeds. However, this is
> not ideal as I do not want to proliferate core-site.xml with all of the
> properties for a separate filesystem.
> 8. As I said earlier, upon taking a closer look at the pig code, I saw
> that while creating the JobConf object for a job, pig adds very specific
> resources to the job object, and ignores the resources that may have been
> added already (eg myfs-site.xml) in the Configuration object.
> 9. I have tested this with native MapReduce code as well as Hive, and
> this approach of having a separate config file for MyFileSystem works fine
> in both those cases.
>
> So, to summarize, I am looking for a way to ask Pig to load parameters
> from my own config file before submitting a job.
>
> Thanks,
> -
> Bhooshan.
>
>
>
>> On Fri, Apr 12, 2013 at 9:57 PM, Prashant Kommireddi <pr...@gmail.com> wrote:
>
>> +User group
>>
>> Hi Bhooshan,
>>
>> By default you should be running in MapReduce mode unless specified
>> otherwise. Are you creating a PigServer object to run your jobs? Can you
>> provide your code here?
>>
>> Sent from my iPhone
>>
>> On Apr 12, 2013, at 6:23 PM, Bhooshan Mogal <bh...@gmail.com>
>> wrote:
>>
>>  Apologies for the premature send. I may have some more information.
>> After I applied the patch and set "pig.use.overriden.hadoop.configs=true",
>> I saw an NPE (stacktrace below) and a message saying pig was running in
>> exectype local -
>>
>> 2013-04-13 07:37:13,758 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
>> to hadoop file system at: local
>> 2013-04-13 07:37:13,760 [main] WARN  org.apache.hadoop.conf.Configuration
>> - mapred.used.genericoptionsparser is deprecated. Instead, use
>> mapreduce.client.genericoptionsparser.used
>> 2013-04-13 07:37:14,162 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>> ERROR 1200: Pig script failed to parse:
>> <file test.pig, line 1, column 4> pig script failed to validate:
>> java.lang.NullPointerException
>>
>>
>> Here is the stacktrace =
>>
>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error
>> during parsing. Pig script failed to parse:
>> <file test.pig, line 1, column 4> pig script failed to validate:
>> java.lang.NullPointerException
>>         at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1606)
>>         at
>> org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1549)
>>         at org.apache.pig.PigServer.registerQuery(PigServer.java:549)
>>         at
>> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:971)
>>         at
>> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
>>         at
>> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
>>         at
>> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
>>         at org.apache.pig.Main.run(Main.java:555)
>>         at org.apache.pig.Main.main(Main.java:111)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>         at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>         at java.lang.reflect.Method.invoke(Method.java:616)
>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>> Caused by: Failed to parse: Pig script failed to parse:
>> <file test.pig, line 1, column 4> pig script failed to validate:
>> java.lang.NullPointerException
>>         at
>> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
>>         at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1598)
>>         ... 14 more
>> Caused by:
>> <file test.pig, line 1, column 4> pig script failed to validate:
>> java.lang.NullPointerException
>>         at
>> org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:438)
>>         at
>> org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3168)
>>         at
>> org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1291)
>>         at
>> org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:789)
>>         at
>> org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:507)
>>         at
>> org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:382)
>>         at
>> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
>>         ... 15 more
>>
>>
>>
>>
>> On Fri, Apr 12, 2013 at 6:16 PM, Bhooshan Mogal <bhooshan.mogal@gmail.com
>> > wrote:
>>
>>> Yes, however I did not add core-site.xml, hdfs-site.xml, yarn-site.xml.
>>> Only my-filesystem-site.xml using both Configuration.addDefaultResource and
>>> Configuration.addResource.
>>>
>>> I see what you are saying though. The patch might require users to take
>>> care of adding the default config resources as well apart from their own
>>> resources?
>>>
>>>
>>> On Fri, Apr 12, 2013 at 6:06 PM, Prashant Kommireddi <
>>> prash1784@gmail.com> wrote:
>>>
>>>> Did you set "pig.use.overriden.hadoop.configs=true" and then add your
>>>> configuration resources?
>>>>
>>>>
>>>> On Fri, Apr 12, 2013 at 5:32 PM, Bhooshan Mogal <
>>>> bhooshan.mogal@gmail.com> wrote:
>>>>
>>>>> Hi Prashant,
>>>>>
>>>>> Thanks for your response to my question, and sorry for the delayed
>>>>> reply. I was not subscribed to the dev mailing list and hence did not get a
>>>>> notification about your reply. I have copied our thread below so you can
>>>>> get some context.
>>>>>
>>>>> I tried the patch that you pointed to, however with that patch looks
>>>>> like pig is unable to find core-site.xml. It indicates that it is running
>>>>> the script in local mode despite having fs.default.name defined as
>>>>> the location of the HDFS namenode.
>>>>>
>>>>> Here is what I am trying to do - I have developed my own
>>>>> org.apache.hadoop.fs.FileSystem implementation and am trying to use it in
>>>>> my pig script. This implementation requires its own *-default and
>>>>> *-site.xml files. I have added the path to these files in PIG_CLASSPATH as
>>>>> well as HADOOP_CLASSPATH and can confirm that hadoop can find these files,
>>>>> as I am able to read these configurations in my code. However, pig code
>>>>> cannot find these configuration parameters. Upon doing some debugging in
>>>>> the pig code, it seems to me that pig does not use all the resources added
>>>>> in the Configuration object, but only seems to use certain specific ones
>>>>> like hadoop-site, core-site, pig-cluster-hadoop-site.xml, yarn-site.xml,
>>>>> hdfs-site.xml (I am looking at HExecutionEngine.java). Is it possible to
>>>>> have pig load user-defined resources like say foo-default.xml and
>>>>> foo-site.xml while creating the JobConf object? I am narrowing on this as
>>>>> the problem, because pig can find my config parameters if I define them in
>>>>> core-site.xml instead of my-filesystem-site.xml.
>>>>>
>>>>> Let me know if you need more details about the issue.
>>>>>
>>>>>
>
>
> --
> Bhooshan
>
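Bhooshan's observation in the quoted thread above, that Pig builds the JobConf from a hard-coded set of resource file names and ignores anything else already added to the Configuration, can be sketched as follows. This is an illustration of the described behavior, not Pig's actual HExecutionEngine code; the file list comes from the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model of the fixed-resource-list behavior described in the thread.
class JobConfResources {
    // Hard-coded names, mirroring the list observed in HExecutionEngine
    static final Set<String> KNOWN = new HashSet<>(Arrays.asList(
            "hadoop-site.xml", "core-site.xml", "pig-cluster-hadoop-site.xml",
            "yarn-site.xml", "hdfs-site.xml"));

    // Only known resources survive; user-added files like myfs-site.xml are dropped
    static List<String> loaded(List<String> candidates) {
        List<String> out = new ArrayList<>();
        for (String c : candidates) {
            if (KNOWN.contains(c)) {
                out.add(c);
            }
        }
        return out;
    }
}
```

This is why moving the properties into core-site.xml works in the thread: core-site.xml is on the fixed list, while myfs-site.xml is not.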

Re: Why does Pig not use default resources from the Configuration object?

Posted by Bhooshan Mogal <bh...@gmail.com>.
Hi Prashant,


Yes, I am running in MapReduce mode. Let me give you the steps in the
scenario that I am trying to test -

1. I have my own implementation of org.apache.hadoop.fs.FileSystem for a
filesystem I am trying to implement - Let's call it MyFileSystem.class.
This filesystem uses the scheme myfs:// for its URIs
2. I have set fs.myfs.impl to MyFileSystem.class in core-site.xml and made
the class available through a jar file that is part of HADOOP_CLASSPATH (or
PIG_CLASSPATH).
3. In MyFileSystem.class, I have a static block as -
static {
    Configuration.addDefaultResource("myfs-default.xml");
    Configuration.addDefaultResource("myfs-site.xml");
}
Both these files are in the classpath. To be safe, I have also added the
myfs-site.xml in the constructor of MyFileSystem as
conf.addResource("myfs-site.xml"), so that it is part of both the default
resources as well as the non-default resources in the Configuration object.
4. I am trying to access the filesystem in my pig script as -
A = LOAD 'myfs://myhost.com:8999/testdata' USING PigStorage(':') AS
(name:chararray, age:int); -- loading data
B = FOREACH A GENERATE name;
store B into 'myfs://myhost.com:8999/testoutput';
5. The execution seems to start correctly, and MyFileSystem.class is
invoked correctly. In MyFileSystem.class, I can also see that myfs-site.xml
is loaded and the properties defined in it are available.
6. However, when Pig tries to submit the job, it cannot find these
properties and the job fails to submit successfully.
7. If I move all the properties defined in myfs-site.xml to core-site.xml,
the job gets submitted successfully, and it even succeeds. However, this is
not ideal, as I do not want to clutter core-site.xml with all of the
properties for a separate filesystem.
8. As I said earlier, upon taking a closer look at the pig code, I saw that
while creating the JobConf object for a job, pig adds very specific
resources to the job object, and ignores the resources that may have been
added already (eg myfs-site.xml) in the Configuration object.
9. I have tested this with native MapReduce code as well as Hive, and this
approach of having a separate config file for MyFileSystem works fine in
both those cases.

So, to summarize, I am looking for a way to ask Pig to load parameters from
my own config file before submitting a job.

Thanks,
-
Bhooshan.
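Step 2 above relies on Hadoop mapping a URI scheme to the config key fs.&lt;scheme&gt;.impl, so that a myfs:// URI resolves to whatever class fs.myfs.impl names. A simplified plain-Java sketch of that lookup (not Hadoop's actual FileSystem.get logic):

```java
import java.net.URI;
import java.util.Map;

// Illustrative sketch of scheme-to-implementation resolution.
class FsResolver {
    // "myfs://host/path" -> "fs.myfs.impl"
    static String implKeyFor(URI uri) {
        return "fs." + uri.getScheme() + ".impl";
    }

    // Look up the implementation class name from configuration properties,
    // e.g. the fs.myfs.impl entry set in core-site.xml
    static String resolve(URI uri, Map<String, String> conf) {
        return conf.get(implKeyFor(uri));
    }
}
```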





-- 
Bhooshan