You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Santhosh Srinivasan (JIRA)" <ji...@apache.org> on 2008/05/21 01:24:55 UTC

[jira] Issue Comment Edited: (PIG-143) Proposal for refactoring of parsing logic in Pig

    [ https://issues.apache.org/jira/browse/PIG-143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598525#action_12598525 ] 

sms edited comment on PIG-143 at 5/20/08 4:24 PM:
------------------------------------------------------------------

I have a few comments:

   - LOProject.java:

   1. In getType() It would be safe to call getFieldSchema() before the check on mFieldSchema. This will ensure that you have the right type all the time.
{code}
+    public byte getType() {
+        if (mFieldSchema != null){
+            return mFieldSchema.type ;
+        }
+        else {
+            return DataType.UNKNOWN ;
+        }
+    }
{code}

   2. In getSchema(), the same comment holds. Call getFieldSchema() before the check on mFieldSchema. Also, I am curious to know the use of getSchema() for an expression operator. Relational Operators support getSchema() and Expression Operators support getFieldSchema() respectively

   3. toDetailString: A minor comment, isStart should be isStar (note the additional t in the patch)

{code}
+        sb.append(" isStart=") ;
}
{code}

   - LOGenerate.java

   1. I could not figure out the use of dataSrc. I think this goes back to the discussion about a sentinel in generate. I do not have use cases for introducing a sentinel for LOGenerate.

{code}
+    private LogicalOperator dataSrc = null ;
{code}


      was (Author: sms):
    I have a few comments:

   - LOProject.java:

   1. In getType() It would be safe to call getFieldSchema() before the check on mFieldSchema. This will ensure that you have the right type all the time.
{code}
+    public byte getType() {
+        if (mFieldSchema != null){
+            return mFieldSchema.type ;
+        }
+        else {
+            return DataType.UNKNOWN ;
+        }
+    }
{code}

   2. In getSchema(), the same comment holds. Call getFieldSchema() before the check on mFieldSchema. Also, I am curious to know the use of getSchema() for an expression operator. Relational Operators support getSchema() and Expression Operators support getFieldSchema() respectively

   3. toDetailString: A minor comment, isStart should be isStar (note the additional t in the patch)

{code}
+        sb.append(" isStart=") ;
}
   - LOGenerate.java

   1. I could not figure out the use of dataSrc. I think this goes back to the discussion about a sentinel in generate. I do not have use cases for introducing a sentinel for LOGenerate.

{code}
+    private LogicalOperator dataSrc = null ;
{code}

  
> Proposal for refactoring of parsing logic in Pig
> ------------------------------------------------
>
>                 Key: PIG-143
>                 URL: https://issues.apache.org/jira/browse/PIG-143
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>            Reporter: Pi Song
>            Assignee: Pi Song
>         Attachments: ParserDrawing.png, pigtype_cycle_check.patch, PostParseValidation_Set2.patch, PostParseValidation_Set2_v2.patch, PostParseValidation_Set3.patch, PostParseValidation_WorkingVersion_1.patch, PostParseValidation_WorkingVersion_2.patch, SchemaMerge.patch
>
>
> h2. Pig Script Parser Refactor Proposal 
> This is my initial proposal on pig script parser refactor work. Please note that I need your opinions for improvements.
> *Problem*
> The basic concept is around the fact that currently we do validation logics in parsing stage (for example, file existence checking) which I think is not clean and difficult to add new validation rules. In the future, we will need to add more and more validation logics to improve usability.
> *My proposal:-*  (see [^ParserDrawing.png])
> - Only keep parsing logic in the parser and leave output of parsing logic being unchecked logical plans. (Therefore the parser only does syntactic checking)
> - Create a new class called LogicalPlanValidationManager which is responsible for validations of the AST from the parser.
> - A new validation logic will be subclassing LogicalPlanValidator 
> - We can chain a set of LogicalPlanValidators inside LogicalPlanValidationManager to do validation work. This allows a new LogicalPlanValidator to be added easily like a plug-in. 
> - This greatly promotes modularity of the validation logics which  is +particularly good when we have a lot of people working on different things+ (eg. streaming may require a special validation logic)
> - We can set the execution order of validators
> - There might be some backend specific validations needed when we implement new execution engines (For example a logical operation that one backend can do but others can't).  We can plug-in this kind of validations on-the-fly based on the backend in use.
> *List of LogicalPlanValidators extracted from the current parser logic:-*
> - File existence validator
> - Alias existence validator
> *Logics possibly be added in the very near future:-*
> - Streaming script test execution
> - Type checking + casting promotion + type inference
> - Untyped plan test execution
> - Logic to prevent reading and writing from/to the same file
> The common way to implement a LogicalPlanValidator will be based on Visitor pattern. 
> *Cons:-*
>  - By having every validation logic traversing AST from the root node every time, there is a performance hit. However I think this is neglectable due to the fact that Pig is very expressive and normally queries aren't too big (99% of queries contain no more than 1000 AST nodes).
> *Next Step:-*
> LogicalPlanFinalizer which is also a pipeline except that each stage can modify the input AST. This component will generally do a kind of global optimizations.
> *Further ideas:-*
> - Composite visitor can make validations more efficient in some cases but I don't think we need
> - ASTs within the pipeline never change (read-only) so validations can be done in parallel to improve responsiveness. But again I don't think we need this unless we have so many I/O bound logics.
> - The same pipeline concept can also be applied in physical plan validation/optimization.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.