You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@nifi.apache.org by bdesert <gi...@git.apache.org> on 2018/05/13 05:32:35 UTC

[GitHub] nifi pull request #2695: NIFI-5044 SelectHiveQL accept only one statement

GitHub user bdesert opened a pull request:

    https://github.com/apache/nifi/pull/2695

    NIFI-5044 SelectHiveQL accept only one statement

    SelectHiveQL support only single SELECT statement.
    This change adds support for pre- and post- select statements.
    It will be useful for configuration queries, i.e. "set tez.queue.name=default", and others.
    
    Thank you for submitting a contribution to Apache NiFi.
    
    In order to streamline the review of the contribution we ask you
    to ensure the following steps have been taken:
    
    ### For all changes:
    - [X] Is there a JIRA ticket associated with this PR? Is it referenced 
         in the commit message?
    
    - [X] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
    
    - [X] Has your PR been rebased against the latest commit within the target branch (typically master)?
    
    - [X] Is your initial contribution a single, squashed commit?
    
    ### For code changes:
    - [X] Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
    - [X] Have you written or updated unit tests to verify your changes?
    - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? 
    - [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
    - [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
    - [X] If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?
    
    ### For documentation related changes:
    - [ ] Have you ensured that format looks appropriate for the output in which it is rendered?
    
    ### Note:
    Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bdesert/nifi NIFI-5044_SelectHiveQL_

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nifi/pull/2695.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2695
    
----
commit 5e4c2b00405418fc4673e851740129e94f59caab
Author: Ed B <eb...@...>
Date:   2018-05-13T05:24:09Z

    NIFI-5044 SelectHiveQL accept only one statement
    
    SelectHiveQL support only single SELECT statement.
    This change adds support for pre- and post- select statements.
    It will be useful for configuration queries, i.e. "set tez.queue.name=default", and others.

----


---

[GitHub] nifi pull request #2695: NIFI-5044 SelectHiveQL accept only one statement

Posted by mattyb149 <gi...@git.apache.org>.
Github user mattyb149 commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2695#discussion_r194183320
  
    --- Diff: nifi-nar-bundles/nifi-hive-bundle/nifi-hive-processors/src/test/java/org/apache/nifi/processors/hive/TestSelectHiveQL.java ---
    @@ -198,6 +200,51 @@ public void testWithSqlException() throws SQLException {
             runner.assertAllFlowFilesTransferred(SelectHiveQL.REL_FAILURE, 1);
         }
     
    +    @Test
    +    public void invokeOnTriggerExceptionInPreQieriesNoIncomingFlows()
    +            throws InitializationException, ClassNotFoundException, SQLException, IOException {
    +
    +        doOnTrigger(QUERY_WITHOUT_EL, false, CSV,
    +                "select 'no exception' from persons; select exception from persons",
    +                null);
    +
    +        runner.assertAllFlowFilesTransferred(SelectHiveQL.REL_FAILURE, 1);
    --- End diff --
    
    I must have been thinking of some other processor or perhaps the behavior had changed at some point, sorry about that. As long as it behaves the same way it used to with respect to where/if FFs get transferred, then I'm good :)


---

[GitHub] nifi pull request #2695: NIFI-5044 SelectHiveQL accept only one statement

Posted by pvillard31 <gi...@git.apache.org>.
Github user pvillard31 commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2695#discussion_r188990055
  
    --- Diff: nifi-nar-bundles/nifi-hive-bundle/nifi-hive-processors/src/main/java/org/apache/nifi/processors/hive/SelectHiveQL.java ---
    @@ -311,7 +340,15 @@ private void onTrigger(final ProcessContext context, final ProcessSession sessio
             try (final Connection con = dbcpService.getConnection(fileToProcess == null ? Collections.emptyMap() : fileToProcess.getAttributes());
                  final Statement st = (flowbased ? con.prepareStatement(selectQuery) : con.createStatement())
             ) {
    -
    +            Pair<String,SQLException> failure = executeConfigStatements(con, preQueries);
    +            if (failure != null) {
    +                // In case of failure, assigning config query to "selectQuery" var will allow to avoid major changes in error handling (catch block),
    --- End diff --
    
    I understand the intent here, just wondering if it's a good idea on the long term for code maintainability/clarity. @mattyb149 what are your thoughts?


---

[GitHub] nifi pull request #2695: NIFI-5044 SelectHiveQL accept only one statement

Posted by bdesert <gi...@git.apache.org>.
Github user bdesert commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2695#discussion_r190653106
  
    --- Diff: nifi-nar-bundles/nifi-hive-bundle/nifi-hive-processors/src/test/java/org/apache/nifi/processors/hive/TestSelectHiveQL.java ---
    @@ -198,6 +200,51 @@ public void testWithSqlException() throws SQLException {
             runner.assertAllFlowFilesTransferred(SelectHiveQL.REL_FAILURE, 1);
         }
     
    +    @Test
    +    public void invokeOnTriggerExceptionInPreQieriesNoIncomingFlows()
    +            throws InitializationException, ClassNotFoundException, SQLException, IOException {
    +
    +        doOnTrigger(QUERY_WITHOUT_EL, false, CSV,
    +                "select 'no exception' from persons; select exception from persons",
    +                null);
    +
    +        runner.assertAllFlowFilesTransferred(SelectHiveQL.REL_FAILURE, 1);
    --- End diff --
    
    >In general the behavior should remain the same whenever possible
    
    Currently, if there is an error in SQL Query - it will go to "failure" relationship (even if there are no incoming connections)
    ![image](https://user-images.githubusercontent.com/19496093/40499024-ae42cbec-5f4e-11e8-9ff0-348dd6793b2a.png)
    ![image](https://user-images.githubusercontent.com/19496093/40498559-68c064c2-5f4d-11e8-9a18-88fcb082b18e.png)
    
    So, I follow current error handling strategy. It's just wasn't accurate about:
    
    > since we weren't issuing a flow file on failure before
    
    because we do issue FF on failure (on establishing connection it is different though, and not impacted by this change).
    
    @mattyb149 , Any word on this?


---

[GitHub] nifi pull request #2695: NIFI-5044 SelectHiveQL accept only one statement

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/nifi/pull/2695


---

[GitHub] nifi pull request #2695: NIFI-5044 SelectHiveQL accept only one statement

Posted by pvillard31 <gi...@git.apache.org>.
Github user pvillard31 commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2695#discussion_r188985931
  
    --- Diff: nifi-nar-bundles/nifi-hive-bundle/nifi-hive-processors/src/main/java/org/apache/nifi/processors/hive/SelectHiveQL.java ---
    @@ -113,6 +126,17 @@
                 .expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES)
                 .build();
     
    +    public static final PropertyDescriptor HIVEQL_POST_QUERY = new PropertyDescriptor.Builder()
    +            .name("hive-post-query")
    +            .displayName("HiveQL Post-Query")
    +            .description("HiveQL post-query to execute. Semicolon-delimited list of queries. "
    +                    + "Example: 'set tez.queue.name=default; set hive.exec.orc.split.strategy=HYBRID; set hive.exec.reducers.bytes.per.reducer=258998272'. "
    --- End diff --
    
    I'm clearly not a Hive expert but does it make sense to give this kind of examples for post-queries?


---

[GitHub] nifi pull request #2695: NIFI-5044 SelectHiveQL accept only one statement

Posted by bdesert <gi...@git.apache.org>.
Github user bdesert commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2695#discussion_r189075604
  
    --- Diff: nifi-nar-bundles/nifi-hive-bundle/nifi-hive-processors/src/main/java/org/apache/nifi/processors/hive/SelectHiveQL.java ---
    @@ -311,7 +340,15 @@ private void onTrigger(final ProcessContext context, final ProcessSession sessio
             try (final Connection con = dbcpService.getConnection(fileToProcess == null ? Collections.emptyMap() : fileToProcess.getAttributes());
                  final Statement st = (flowbased ? con.prepareStatement(selectQuery) : con.createStatement())
             ) {
    -
    +            Pair<String,SQLException> failure = executeConfigStatements(con, preQueries);
    +            if (failure != null) {
    +                // In case of failure, assigning config query to "selectQuery" var will allow to avoid major changes in error handling (catch block),
    --- End diff --
    
    @pvillard31 , I think if I just rename this var into "hqlStatement" - that will remove confusions and make reusing it for pre- and post- queries natural. Going to push updates for this.


---

[GitHub] nifi pull request #2695: NIFI-5044 SelectHiveQL accept only one statement

Posted by bdesert <gi...@git.apache.org>.
Github user bdesert commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2695#discussion_r189005203
  
    --- Diff: nifi-nar-bundles/nifi-hive-bundle/nifi-hive-processors/src/main/java/org/apache/nifi/processors/hive/SelectHiveQL.java ---
    @@ -113,6 +126,17 @@
                 .expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES)
                 .build();
     
    +    public static final PropertyDescriptor HIVEQL_POST_QUERY = new PropertyDescriptor.Builder()
    +            .name("hive-post-query")
    +            .displayName("HiveQL Post-Query")
    +            .description("HiveQL post-query to execute. Semicolon-delimited list of queries. "
    +                    + "Example: 'set tez.queue.name=default; set hive.exec.orc.split.strategy=HYBRID; set hive.exec.reducers.bytes.per.reducer=258998272'. "
    --- End diff --
    
    In my opinion - there shouldn't be "post" queries at all, as after "select" it doesn't make sense to run anything. But based on discussion, there is a demand for "post-queries", that's why we implement.  I'll be happy to get any suggestion on what could be a good example for post-query :)


---

[GitHub] nifi pull request #2695: NIFI-5044 SelectHiveQL accept only one statement

Posted by pvillard31 <gi...@git.apache.org>.
Github user pvillard31 commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2695#discussion_r190637767
  
    --- Diff: nifi-nar-bundles/nifi-hive-bundle/nifi-hive-processors/src/test/java/org/apache/nifi/processors/hive/TestSelectHiveQL.java ---
    @@ -198,6 +200,51 @@ public void testWithSqlException() throws SQLException {
             runner.assertAllFlowFilesTransferred(SelectHiveQL.REL_FAILURE, 1);
         }
     
    +    @Test
    +    public void invokeOnTriggerExceptionInPreQieriesNoIncomingFlows()
    +            throws InitializationException, ClassNotFoundException, SQLException, IOException {
    +
    +        doOnTrigger(QUERY_WITHOUT_EL, false, CSV,
    +                "select 'no exception' from persons; select exception from persons",
    +                null);
    +
    +        runner.assertAllFlowFilesTransferred(SelectHiveQL.REL_FAILURE, 1);
    +    }
    +
    +    @Test
    +    public void invokeOnTriggerExceptionInPreQieriesWithIncomingFlows()
    +            throws InitializationException, ClassNotFoundException, SQLException, IOException {
    +
    +        doOnTrigger(QUERY_WITHOUT_EL, true, CSV,
    +                "select 'no exception' from persons; select exception from persons",
    +                null);
    +
    +        runner.assertAllFlowFilesTransferred(SelectHiveQL.REL_FAILURE, 1);
    +    }
    +
    +    @Test
    +    public void invokeOnTriggerExceptionInPostQieriesNoIncomingFlows()
    +            throws InitializationException, ClassNotFoundException, SQLException, IOException {
    +
    +        doOnTrigger(QUERY_WITHOUT_EL, false, CSV,
    +                null,
    +                "select 'no exception' from persons; select exception from persons");
    +
    +        runner.assertAllFlowFilesTransferred(SelectHiveQL.REL_FAILURE, 1);
    --- End diff --
    
    Same here


---

[GitHub] nifi issue #2695: NIFI-5044 SelectHiveQL accept only one statement

Posted by mattyb149 <gi...@git.apache.org>.
Github user mattyb149 commented on the issue:

    https://github.com/apache/nifi/pull/2695
  
    Decided to merge this to make sure it gets in, I'll put up an accompanying PR shortly. Thanks for this improvement! Merging to master


---

[GitHub] nifi pull request #2695: NIFI-5044 SelectHiveQL accept only one statement

Posted by pvillard31 <gi...@git.apache.org>.
Github user pvillard31 commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2695#discussion_r188988886
  
    --- Diff: nifi-nar-bundles/nifi-hive-bundle/nifi-hive-processors/src/main/java/org/apache/nifi/processors/hive/SelectHiveQL.java ---
    @@ -113,6 +126,17 @@
                 .expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES)
                 .build();
     
    +    public static final PropertyDescriptor HIVEQL_POST_QUERY = new PropertyDescriptor.Builder()
    +            .name("hive-post-query")
    +            .displayName("HiveQL Post-Query")
    +            .description("HiveQL post-query to execute. Semicolon-delimited list of queries. "
    +                    + "Example: 'set tez.queue.name=default; set hive.exec.orc.split.strategy=HYBRID; set hive.exec.reducers.bytes.per.reducer=258998272'. "
    +                    + "Note, the results/outputs of these queries will be suppressed if successful executed.")
    +            .required(false)
    +            .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
    --- End diff --
    
    Could it make sense to have a custom validator here just to be sure the statements will be accepted by the driver? (would need to check if EL is used or not first). Not sure it's worth the trouble though...


---

[GitHub] nifi pull request #2695: NIFI-5044 SelectHiveQL accept only one statement

Posted by pvillard31 <gi...@git.apache.org>.
Github user pvillard31 commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2695#discussion_r190637825
  
    --- Diff: nifi-nar-bundles/nifi-hive-bundle/nifi-hive-processors/src/test/java/org/apache/nifi/processors/hive/TestSelectHiveQL.java ---
    @@ -198,6 +200,51 @@ public void testWithSqlException() throws SQLException {
             runner.assertAllFlowFilesTransferred(SelectHiveQL.REL_FAILURE, 1);
         }
     
    +    @Test
    +    public void invokeOnTriggerExceptionInPreQieriesNoIncomingFlows()
    +            throws InitializationException, ClassNotFoundException, SQLException, IOException {
    +
    +        doOnTrigger(QUERY_WITHOUT_EL, false, CSV,
    +                "select 'no exception' from persons; select exception from persons",
    +                null);
    +
    +        runner.assertAllFlowFilesTransferred(SelectHiveQL.REL_FAILURE, 1);
    --- End diff --
    
    Based on the comment from @mattyb149 in the JIRA:
    > In general the behavior should remain the same whenever possible, so if you have a SelectHiveQL that doesn't have incoming (non-loop) connections, then the query must be supplied, and whether it (or the pre-post queries) have EL in them, then since we weren't issuing a flow file on failure before, we shouldn't now either. So when I said "Route to Failure with penalization for everything else", that's only when there is a flow file to route. If there isn't, then we should yield (and remove any created flow files and/or rollback our session anyway).
    
    Not sure to understand the assertion here since there is no incoming ff, right?


---

[GitHub] nifi issue #2695: NIFI-5044 SelectHiveQL accept only one statement

Posted by mattyb149 <gi...@git.apache.org>.
Github user mattyb149 commented on the issue:

    https://github.com/apache/nifi/pull/2695
  
    As of #2755, we also have a SelectHive3QL processor that should have these improvements. I took the liberty of porting them to SelectHive3QL: https://github.com/mattyb149/nifi/commit/ce70910530cee447d7179e7dfb73f75291bd6030
    
    Would you like to cherry-pick this commit in and submit the whole thing as one PR? If not I am +1 on this PR as it stands, I could merge without closing the Jira and issue my own PR for applying the changes to SelectHive3QL.


---

[GitHub] nifi pull request #2695: NIFI-5044 SelectHiveQL accept only one statement

Posted by bdesert <gi...@git.apache.org>.
Github user bdesert commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2695#discussion_r189008316
  
    --- Diff: nifi-nar-bundles/nifi-hive-bundle/nifi-hive-processors/src/main/java/org/apache/nifi/processors/hive/SelectHiveQL.java ---
    @@ -113,6 +126,17 @@
                 .expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES)
                 .build();
     
    +    public static final PropertyDescriptor HIVEQL_POST_QUERY = new PropertyDescriptor.Builder()
    +            .name("hive-post-query")
    +            .displayName("HiveQL Post-Query")
    +            .description("HiveQL post-query to execute. Semicolon-delimited list of queries. "
    +                    + "Example: 'set tez.queue.name=default; set hive.exec.orc.split.strategy=HYBRID; set hive.exec.reducers.bytes.per.reducer=258998272'. "
    +                    + "Note, the results/outputs of these queries will be suppressed if successful executed.")
    +            .required(false)
    +            .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
    --- End diff --
    
    Well, would be interesting to have a parser here, but then it should be global, and include also main query ("select" statement). I believe it can be done, but out of scope for this change. We can have another Improvement request in Jira to have new HiveQL validator for all Hive related processors.


---

[GitHub] nifi pull request #2695: NIFI-5044 SelectHiveQL accept only one statement

Posted by pvillard31 <gi...@git.apache.org>.
Github user pvillard31 commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2695#discussion_r190633475
  
    --- Diff: nifi-nar-bundles/nifi-hive-bundle/nifi-hive-processors/src/main/java/org/apache/nifi/processors/hive/SelectHiveQL.java ---
    @@ -113,6 +126,17 @@
                 .expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES)
                 .build();
     
    +    public static final PropertyDescriptor HIVEQL_POST_QUERY = new PropertyDescriptor.Builder()
    +            .name("hive-post-query")
    +            .displayName("HiveQL Post-Query")
    +            .description("HiveQL post-query to execute. Semicolon-delimited list of queries. "
    +                    + "Example: 'set tez.queue.name=default; set hive.exec.orc.split.strategy=HYBRID; set hive.exec.reducers.bytes.per.reducer=258998272'. "
    --- End diff --
    
    don't have anything in mind :) maybe just removing the examples to be sure it's not misleading somehow?


---

[GitHub] nifi pull request #2695: NIFI-5044 SelectHiveQL accept only one statement

Posted by mattyb149 <gi...@git.apache.org>.
Github user mattyb149 commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2695#discussion_r194283856
  
    --- Diff: nifi-nar-bundles/nifi-hive-bundle/nifi-hive-processors/src/main/java/org/apache/nifi/processors/hive/SelectHiveQL.java ---
    @@ -443,31 +481,72 @@ private void onTrigger(final ProcessContext context, final ProcessSession sessio
                     throw e;
                 }
     
    -            session.transfer(resultSetFlowFiles, REL_SUCCESS);
    +            failure = executeConfigStatements(con, postQueries);
    +            if (failure != null) {
    +                hqlStatement = failure.getLeft();
    +                if (resultSetFlowFiles != null) {
    +                    resultSetFlowFiles.forEach(ff -> session.remove(ff));
    +                }
    +                flowfile = (fileToProcess == null) ? session.create() : fileToProcess;
    +                fileToProcess = null;
    +                throw failure.getRight();
    +            }
     
    +            session.transfer(resultSetFlowFiles, REL_SUCCESS);
    +            if (fileToProcess != null) {
    +                session.remove(fileToProcess);
    +            }
             } catch (final ProcessException | SQLException e) {
    -            logger.error("Issue processing SQL {} due to {}.", new Object[]{selectQuery, e});
    +            logger.error("Issue processing SQL {} due to {}.", new Object[]{hqlStatement, e});
                 if (flowfile == null) {
                     // This can happen if any exceptions occur while setting up the connection, statement, etc.
                     logger.error("Unable to execute HiveQL select query {} due to {}. No FlowFile to route to failure",
    -                        new Object[]{selectQuery, e});
    +                        new Object[]{hqlStatement, e});
                     context.yield();
                 } else {
                     if (context.hasIncomingConnection()) {
                         logger.error("Unable to execute HiveQL select query {} for {} due to {}; routing to failure",
    -                            new Object[]{selectQuery, flowfile, e});
    +                            new Object[]{hqlStatement, flowfile, e});
                         flowfile = session.penalize(flowfile);
                     } else {
                         logger.error("Unable to execute HiveQL select query {} due to {}; routing to failure",
    -                            new Object[]{selectQuery, e});
    +                            new Object[]{hqlStatement, e});
                         context.yield();
                     }
                     session.transfer(flowfile, REL_FAILURE);
                 }
    -        } finally {
    -            if (fileToProcess != null) {
    -                session.remove(fileToProcess);
    +        }
    +    }
    +
    +    /*
    +     * Executes given queries using pre-defined connection.
    +     * Returns null on success, or a query string if failed.
    +     */
    +    protected Pair<String,SQLException> executeConfigStatements(final Connection con, final List<String> configQueries){
    +        if (configQueries == null || configQueries.isEmpty()) {
    +            return null;
    +        }
    +
    +        for (String confSQL : configQueries) {
    +            try(final Statement st = con.createStatement()){
    +                st.executeQuery(confSQL);
    --- End diff --
    
    Shouldn't this be st.execute()? When I run with a single "set" statement I get a SQLException that the query did not return a result set...


---

[GitHub] nifi pull request #2695: NIFI-5044 SelectHiveQL accept only one statement

Posted by pvillard31 <gi...@git.apache.org>.
Github user pvillard31 commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2695#discussion_r190633734
  
    --- Diff: nifi-nar-bundles/nifi-hive-bundle/nifi-hive-processors/src/main/java/org/apache/nifi/processors/hive/SelectHiveQL.java ---
    @@ -113,6 +126,17 @@
                 .expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES)
                 .build();
     
    +    public static final PropertyDescriptor HIVEQL_POST_QUERY = new PropertyDescriptor.Builder()
    +            .name("hive-post-query")
    +            .displayName("HiveQL Post-Query")
    +            .description("HiveQL post-query to execute. Semicolon-delimited list of queries. "
    +                    + "Example: 'set tez.queue.name=default; set hive.exec.orc.split.strategy=HYBRID; set hive.exec.reducers.bytes.per.reducer=258998272'. "
    +                    + "Note, the results/outputs of these queries will be suppressed if successful executed.")
    +            .required(false)
    +            .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
    --- End diff --
    
    Yeah, tbh, I'm really not sure it's worth the effort. I'm fine with current approach.


---