You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@nifi.apache.org by Boris Tyukin <bo...@boristyukin.com> on 2018/08/15 15:09:54 UTC

ExecuteSQL to support multiple statements

Hi guys,

I need to issue a query like below on Impala. it works fine from
impala-shell but NiFi seems not to like multiple statements like that.

set max_row_size=7mb;

create table blabla as
select blabla
from blablabla;


I thought it was addressed in 1.7 but I got it confused with Hive
processors and this Jira
https://issues.apache.org/jira/browse/NIFI-5044

Is there something in work already to address it or should I open a new
Jira?

For now, I am going to use Groovy script but if anyone knows a better
workaround, please let me know.

Boris

Re: ExecuteSQL to support multiple statements

Posted by Boris Tyukin <bo...@boristyukin.com>.

thanks for the explanation, Matt.

On Mon, Aug 20, 2018 at 9:19 AM Matt Burgess <ma...@apache.org> wrote:

> Boris,
>
> A lot of those tools, especially if they are geared for simply
> querying the DB, only call executeQuery() or execute() or something
> like that, and even the younger NoSQL drivers will at least implement
> those (for this very reason). However our SQL processors are much more
> powerful and complex, plus we are aiming for high performance, so we
> use the JDBC API calls that are designed for higher throughput, with
> better control of fetch size, batch size, etc.  We could trade off
> performance for more flexibility, but in a flow-based system it didn't
> seem prudent.
>
> Having said that, it would be kind of cool to have a built-in UI tool
> that could leverage a DBCPConnectionPool and allow you to run various
> one-off SQL statements, like a stripped-down built-in
> SquirrelSQL/DBeaver :)
>
> Regards,
> Matt
>
> On Thu, Aug 16, 2018 at 2:19 PM Boris Tyukin <bo...@boristyukin.com>
> wrote:
> >
> > Interesting, thanks for the explanation, Matt. I did not realize it was
> so complicated. When we just started with Big Data and were testing our BI
> tools (Qlik and Oracle BI) and also ETL tools (Informatica and Alteryx), we
> were just able to use Hive and Impala ODBC drivers to connect and run
> simple SELECT statements. I also used Oracle SQL Developer and DBeaver to
> connect using jdbc drivers. I assumed it would be similar with NiFi.
> >
> > On Thu, Aug 16, 2018 at 12:38 PM Matt Burgess <ma...@apache.org>
> wrote:
> >>
> >> Boris,
> >>
> >> Historically the Hive JDBC drivers (at least around 1.2.x) did not
> >> support some of the JDBC API methods called from ExecuteSQL/PutSQL,
> >> namely setQueryTimeout(), executeBatch(), etc. Also the column names
> >> are returned from the metadata with the table name prepended, so we'd
> >> need special logic for the Hive case, and currently we don't have to
> >> specify the database type in those processors. There have been some
> >> Jiras/PRs around adding Hive support for the SQL processors, but they
> >> haven't yet been incorporated. The Apache Hive driver still doesn't
> >> support executeBatch(), so PutSQL and PutDatabaseRecord wouldn't be
> >> able to call that. Again we'd need a Hive database adapter and would
> >> ask the specified adapter if batching is supported, then go along some
> >> other logic route if not. It should be doable, but just hasn't been
> >> finished yet.
> >>
> >> Regards,
> >> Matt
> >>
> >> On Thu, Aug 16, 2018 at 12:14 PM Boris Tyukin <bo...@boristyukin.com>
> wrote:
> >> >
> >> > awesome, passing property to impala connection string did the trick!
> thanks Ed!
> >> >
> >> > Does it still make sense to open Jira to support multiple statements
> for non-Hive processors?
> >> >
> >> > Totally separate subject, since you mentioned Hive - is there a
> reason why I cannot connect to Hive and run basic SQL using
> ExecuteSQL/PutSQL? Not sure why dedicated processors exist just for Hive.
> Does it not work over generic DBCP connection pool? We've done some
> research yesterday when we were looking for options on how to run multiple
> queries but because we are on CDH, Hive jars included with NiFi do not work
> for us. Looks like we can compile CDH specific NiFi but I am wondering why
> bother if we can just access Hive using generic pool
> >> >
> >> > On Thu, Aug 16, 2018 at 11:42 AM Ed B <bd...@gmail.com> wrote:
> >> >>
> >> >> Hi Boris,
> >> >> True, multi-statements support has been added to SelectHiveQL, but
> not to ExecuteSQL/PutSQL.
> >> >> As a workaround you could try to specify config params on JDBC
> connection string when you define controller service:
> >> >>
> >> >> jdbc:impala://Host:Port[/Schema];Property1=Value;Property2=Value;...
> >> >>
> >> >> If you can use Hive driver to access Impala:
> >> >>
> jdbc:hive2://host:port/;principal=principal_name;Property1=Value;Property2=Value;...
> >> >>
> >> >> Some params will be working after "?" in URL:
> >> >>
> jdbc:hive2://host:port/;principal=principal_name?Property1=Value;Property2=Value;...
> >> >>
> >> >> But, if you really can use Hive drivers, then you can use PutHiveQL
> and ExecuteHiveQL. Both support multi-statements.
> >> >>
> >> >> Regards.
> >> >> Ed.
> >> >>
> >> >> On Wed, Aug 15, 2018 at 11:11 AM Boris Tyukin <bo...@boristyukin.com>
> wrote:
> >> >>>
> >> >>> Hi guys,
> >> >>>
> >> >>> I need to issue a query like below on Impala. it works fine from
> impala-shell but NiFi seems not to like multiple statements like that.
> >> >>>
> >> >>> set max_row_size=7mb;
> >> >>>
> >> >>> create table blabla as
> >> >>> select blabla
> >> >>> from blablabla;
> >> >>>
> >> >>>
> >> >>> I thought it was addressed in 1.7 but I got it confused with Hive
> processors and this Jira
> >> >>> https://issues.apache.org/jira/browse/NIFI-5044
> >> >>>
> >> >>> Is there something in work already to address it or should I open a
> new Jira?
> >> >>>
> >> >>> For now, I am going to use Groovy script but if anyone knows a
> better workaround, please let me know.
> >> >>>
> >> >>> Boris
>

Re: ExecuteSQL to support multiple statements

Posted by Matt Burgess <ma...@apache.org>.

Boris,

A lot of those tools, especially if they are geared for simply
querying the DB, only call executeQuery() or execute() or something
like that, and even the younger NoSQL drivers will at least implement
those (for this very reason). However our SQL processors are much more
powerful and complex, plus we are aiming for high performance, so we
use the JDBC API calls that are designed for higher throughput, with
better control of fetch size, batch size, etc.  We could trade off
performance for more flexibility, but in a flow-based system it didn't
seem prudent.

Having said that, it would be kind of cool to have a built-in UI tool
that could leverage a DBCPConnectionPool and allow you to run various
one-off SQL statements, like a stripped-down built-in
SquirrelSQL/DBeaver :)

Regards,
Matt

On Thu, Aug 16, 2018 at 2:19 PM Boris Tyukin <bo...@boristyukin.com> wrote:
>
> Interesting, thanks for the explanation, Matt. I did not realize it was so complicated. When we just started with Big Data and were testing our BI tools (Qlik and Oracle BI) and also ETL tools (Informatica and Alteryx), we were just able to use Hive and Impala ODBC drivers to connect and run simple SELECT statements. I also used Oracle SQL Developer and DBeaver to connect using jdbc drivers. I assumed it would be similar with NiFi.
>
> On Thu, Aug 16, 2018 at 12:38 PM Matt Burgess <ma...@apache.org> wrote:
>>
>> Boris,
>>
>> Historically the Hive JDBC drivers (at least around 1.2.x) did not
>> support some of the JDBC API methods called from ExecuteSQL/PutSQL,
>> namely setQueryTimeout(), executeBatch(), etc. Also the column names
>> are returned from the metadata with the table name prepended, so we'd
>> need special logic for the Hive case, and currently we don't have to
>> specify the database type in those processors. There have been some
>> Jiras/PRs around adding Hive support for the SQL processors, but they
>> haven't yet been incorporated. The Apache Hive driver still doesn't
>> support executeBatch(), so PutSQL and PutDatabaseRecord wouldn't be
>> able to call that. Again we'd need a Hive database adapter and would
>> ask the specified adapter if batching is supported, then go along some
>> other logic route if not. It should be doable, but just hasn't been
>> finished yet.
>>
>> Regards,
>> Matt
>>
>> On Thu, Aug 16, 2018 at 12:14 PM Boris Tyukin <bo...@boristyukin.com> wrote:
>> >
>> > awesome, passing property to impala connection string did the trick! thanks Ed!
>> >
>> > Does it still make sense to open Jira to support multiple statements for non-Hive processors?
>> >
>> > Totally separate subject, since you mentioned Hive - is there a reason why I cannot connect to Hive and run basic SQL using ExecuteSQL/PutSQL? Not sure why dedicated processors exist just for Hive. Does it not work over generic DBCP connection pool? We've done some research yesterday when we were looking for options on how to run multiple queries but because we are on CDH, Hive jars included with NiFi do not work for us. Looks like we can compile CDH specific NiFi but I am wondering why bother if we can just access Hive using generic pool
>> >
>> > On Thu, Aug 16, 2018 at 11:42 AM Ed B <bd...@gmail.com> wrote:
>> >>
>> >> Hi Boris,
>> >> True, multi-statements support has been added to SelectHiveQL, but not to ExecuteSQL/PutSQL.
>> >> As a workaround you could try to specify config params on JDBC connection string when you define controller service:
>> >>
>> >> jdbc:impala://Host:Port[/Schema];Property1=Value;Property2=Value;...
>> >>
>> >> If you can use Hive driver to access Impala:
>> >>  jdbc:hive2://host:port/;principal=principal_name;Property1=Value;Property2=Value;...
>> >>
>> >> Some params will be working after "?" in URL:
>> >>  jdbc:hive2://host:port/;principal=principal_name?Property1=Value;Property2=Value;...
>> >>
>> >> But, if you really can use Hive drivers, then you can use PutHiveQL and ExecuteHiveQL. Both support multi-statements.
>> >>
>> >> Regards.
>> >> Ed.
>> >>
>> >> On Wed, Aug 15, 2018 at 11:11 AM Boris Tyukin <bo...@boristyukin.com> wrote:
>> >>>
>> >>> Hi guys,
>> >>>
>> >>> I need to issue a query like below on Impala. it works fine from impala-shell but NiFi seems not to like multiple statements like that.
>> >>>
>> >>> set max_row_size=7mb;
>> >>>
>> >>> create table blabla as
>> >>> select blabla
>> >>> from blablabla;
>> >>>
>> >>>
>> >>> I thought it was addressed in 1.7 but I got it confused with Hive processors and this Jira
>> >>> https://issues.apache.org/jira/browse/NIFI-5044
>> >>>
>> >>> Is there something in work already to address it or should I open a new Jira?
>> >>>
>> >>> For now, I am going to use Groovy script but if anyone knows a better workaround, please let me know.
>> >>>
>> >>> Boris

Re: ExecuteSQL to support multiple statements

Posted by Boris Tyukin <bo...@boristyukin.com>.

Interesting, thanks for the explanation, Matt. I did not realize it was so
complicated. When we just started with Big Data and were testing our BI
tools (Qlik and Oracle BI) and also ETL tools (Informatica and Alteryx), we
were just able to use Hive and Impala ODBC drivers to connect and run
simple SELECT statements. I also used Oracle SQL Developer and DBeaver to
connect using jdbc drivers. I assumed it would be similar with NiFi.

On Thu, Aug 16, 2018 at 12:38 PM Matt Burgess <ma...@apache.org> wrote:

> Boris,
>
> Historically the Hive JDBC drivers (at least around 1.2.x) did not
> support some of the JDBC API methods called from ExecuteSQL/PutSQL,
> namely setQueryTimeout(), executeBatch(), etc. Also the column names
> are returned from the metadata with the table name prepended, so we'd
> need special logic for the Hive case, and currently we don't have to
> specify the database type in those processors. There have been some
> Jiras/PRs around adding Hive support for the SQL processors, but they
> haven't yet been incorporated. The Apache Hive driver still doesn't
> support executeBatch(), so PutSQL and PutDatabaseRecord wouldn't be
> able to call that. Again we'd need a Hive database adapter and would
> ask the specified adapter if batching is supported, then go along some
> other logic route if not. It should be doable, but just hasn't been
> finished yet.
>
> Regards,
> Matt
>
> On Thu, Aug 16, 2018 at 12:14 PM Boris Tyukin <bo...@boristyukin.com>
> wrote:
> >
> > awesome, passing property to impala connection string did the trick!
> thanks Ed!
> >
> > Does it still make sense to open Jira to support multiple statements for
> non-Hive processors?
> >
> > Totally separate subject, since you mentioned Hive - is there a reason
> why I cannot connect to Hive and run basic SQL using ExecuteSQL/PutSQL? Not
> sure why dedicated processors exist just for Hive. Does it not work over
> generic DBCP connection pool? We've done some research yesterday when we
> were looking for options on how to run multiple queries but because we are
> on CDH, Hive jars included with NiFi do not work for us. Looks like we can
> compile CDH specific NiFi but I am wondering why bother if we can just
> access Hive using generic pool
> >
> > On Thu, Aug 16, 2018 at 11:42 AM Ed B <bd...@gmail.com> wrote:
> >>
> >> Hi Boris,
> >> True, multi-statements support has been added to SelectHiveQL, but not
> to ExecuteSQL/PutSQL.
> >> As a workaround you could try to specify config params on JDBC
> connection string when you define controller service:
> >>
> >> jdbc:impala://Host:Port[/Schema];Property1=Value;Property2=Value;...
> >>
> >> If you can use Hive driver to access Impala:
> >>
> jdbc:hive2://host:port/;principal=principal_name;Property1=Value;Property2=Value;...
> >>
> >> Some params will be working after "?" in URL:
> >>
> jdbc:hive2://host:port/;principal=principal_name?Property1=Value;Property2=Value;...
> >>
> >> But, if you really can use Hive drivers, then you can use PutHiveQL and
> ExecuteHiveQL. Both support multi-statements.
> >>
> >> Regards.
> >> Ed.
> >>
> >> On Wed, Aug 15, 2018 at 11:11 AM Boris Tyukin <bo...@boristyukin.com>
> wrote:
> >>>
> >>> Hi guys,
> >>>
> >>> I need to issue a query like below on Impala. it works fine from
> impala-shell but NiFi seems not to like multiple statements like that.
> >>>
> >>> set max_row_size=7mb;
> >>>
> >>> create table blabla as
> >>> select blabla
> >>> from blablabla;
> >>>
> >>>
> >>> I thought it was addressed in 1.7 but I got it confused with Hive
> processors and this Jira
> >>> https://issues.apache.org/jira/browse/NIFI-5044
> >>>
> >>> Is there something in work already to address it or should I open a
> new Jira?
> >>>
> >>> For now, I am going to use Groovy script but if anyone knows a better
> workaround, please let me know.
> >>>
> >>> Boris
>

Re: ExecuteSQL to support multiple statements

Posted by Matt Burgess <ma...@apache.org>.

Boris,

Historically the Hive JDBC drivers (at least around 1.2.x) did not
support some of the JDBC API methods called from ExecuteSQL/PutSQL,
namely setQueryTimeout(), executeBatch(), etc. Also the column names
are returned from the metadata with the table name prepended, so we'd
need special logic for the Hive case, and currently we don't have to
specify the database type in those processors. There have been some
Jiras/PRs around adding Hive support for the SQL processors, but they
haven't yet been incorporated. The Apache Hive driver still doesn't
support executeBatch(), so PutSQL and PutDatabaseRecord wouldn't be
able to call that. Again we'd need a Hive database adapter and would
ask the specified adapter if batching is supported, then go along some
other logic route if not. It should be doable, but just hasn't been
finished yet.

Regards,
Matt

On Thu, Aug 16, 2018 at 12:14 PM Boris Tyukin <bo...@boristyukin.com> wrote:
>
> awesome, passing property to impala connection string did the trick! thanks Ed!
>
> Does it still make sense to open Jira to support multiple statements for non-Hive processors?
>
> Totally separate subject, since you mentioned Hive - is there a reason why I cannot connect to Hive and run basic SQL using ExecuteSQL/PutSQL? Not sure why dedicated processors exist just for Hive. Does it not work over generic DBCP connection pool? We've done some research yesterday when we were looking for options on how to run multiple queries but because we are on CDH, Hive jars included with NiFi do not work for us. Looks like we can compile CDH specific NiFi but I am wondering why bother if we can just access Hive using generic pool
>
> On Thu, Aug 16, 2018 at 11:42 AM Ed B <bd...@gmail.com> wrote:
>>
>> Hi Boris,
>> True, multi-statements support has been added to SelectHiveQL, but not to ExecuteSQL/PutSQL.
>> As a workaround you could try to specify config params on JDBC connection string when you define controller service:
>>
>> jdbc:impala://Host:Port[/Schema];Property1=Value;Property2=Value;...
>>
>> If you can use Hive driver to access Impala:
>>  jdbc:hive2://host:port/;principal=principal_name;Property1=Value;Property2=Value;...
>>
>> Some params will be working after "?" in URL:
>>  jdbc:hive2://host:port/;principal=principal_name?Property1=Value;Property2=Value;...
>>
>> But, if you really can use Hive drivers, then you can use PutHiveQL and ExecuteHiveQL. Both support multi-statements.
>>
>> Regards.
>> Ed.
>>
>> On Wed, Aug 15, 2018 at 11:11 AM Boris Tyukin <bo...@boristyukin.com> wrote:
>>>
>>> Hi guys,
>>>
>>> I need to issue a query like below on Impala. it works fine from impala-shell but NiFi seems not to like multiple statements like that.
>>>
>>> set max_row_size=7mb;
>>>
>>> create table blabla as
>>> select blabla
>>> from blablabla;
>>>
>>>
>>> I thought it was addressed in 1.7 but I got it confused with Hive processors and this Jira
>>> https://issues.apache.org/jira/browse/NIFI-5044
>>>
>>> Is there something in work already to address it or should I open a new Jira?
>>>
>>> For now, I am going to use Groovy script but if anyone knows a better workaround, please let me know.
>>>
>>> Boris

Re: ExecuteSQL to support multiple statements

Posted by Boris Tyukin <bo...@boristyukin.com>.

awesome, passing property to impala connection string did the trick! thanks
Ed!

Does it still make sense to open Jira to support multiple statements for
non-Hive processors?

Totally separate subject, since you mentioned Hive - is there a reason why
I cannot connect to Hive and run basic SQL using ExecuteSQL/PutSQL? Not
sure why dedicated processors exist just for Hive. Does it not work over
generic DBCP connection pool? We've done some research yesterday when we
were looking for options on how to run multiple queries but because we are
on CDH, Hive jars included with NiFi do not work for us. Looks like we can
compile CDH specific NiFi but I am wondering why bother if we can just
access Hive using generic pool

On Thu, Aug 16, 2018 at 11:42 AM Ed B <bd...@gmail.com> wrote:

> Hi Boris,
> True, multi-statements support has been added to SelectHiveQL, but not to
> ExecuteSQL/PutSQL.
> As a workaround you could try to specify config params on JDBC connection
> string when you define controller service:
>
> jdbc:impala://Host:Port[/Schema];Property1=Value;Property2=Value;...
>
> If you can use Hive driver to access Impala:
>  jdbc:hive2://host:port/;principal=principal_name;Property1=Value;
> Property2=Value;...
>
> Some params will be working after "?" in URL:
>  jdbc:hive2://host:port/;principal=principal_name?Property1=Value;
> Property2=Value;...
>
> But, if you really can use Hive drivers, then you can use PutHiveQL and
> ExecuteHiveQL. Both support multi-statements.
>
> Regards.
> Ed.
>
> On Wed, Aug 15, 2018 at 11:11 AM Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
>> Hi guys,
>>
>> I need to issue a query like below on Impala. it works fine from
>> impala-shell but NiFi seems not to like multiple statements like that.
>>
>> set max_row_size=7mb;
>>
>> create table blabla as
>> select blabla
>> from blablabla;
>>
>>
>> I thought it was addressed in 1.7 but I got it confused with Hive
>> processors and this Jira
>> https://issues.apache.org/jira/browse/NIFI-5044
>>
>> Is there something in work already to address it or should I open a new
>> Jira?
>>
>> For now, I am going to use Groovy script but if anyone knows a better
>> workaround, please let me know.
>>
>> Boris
>>
>

Re: ExecuteSQL to support multiple statements

Posted by Ed B <bd...@gmail.com>.

Hi Boris,
True, multi-statements support has been added to SelectHiveQL, but not to
ExecuteSQL/PutSQL.
As a workaround you could try to specify config params on JDBC connection
string when you define controller service:

jdbc:impala://Host:Port[/Schema];Property1=Value;Property2=Value;...

If you can use Hive driver to access Impala:
 jdbc:hive2://host:port/;principal=principal_name;Property1=Value;Property2=
Value;...

Some params will be working after "?" in URL:
 jdbc:hive2://host:port/;principal=principal_name?Property1=Value;Property2=
Value;...

But, if you really can use Hive drivers, then you can use PutHiveQL and
ExecuteHiveQL. Both support multi-statements.

Regards.
Ed.

On Wed, Aug 15, 2018 at 11:11 AM Boris Tyukin <bo...@boristyukin.com> wrote:

> Hi guys,
>
> I need to issue a query like below on Impala. it works fine from
> impala-shell but NiFi seems not to like multiple statements like that.
>
> set max_row_size=7mb;
>
> create table blabla as
> select blabla
> from blablabla;
>
>
> I thought it was addressed in 1.7 but I got it confused with Hive
> processors and this Jira
> https://issues.apache.org/jira/browse/NIFI-5044
>
> Is there something in work already to address it or should I open a new
> Jira?
>
> For now, I am going to use Groovy script but if anyone knows a better
> workaround, please let me know.
>
> Boris
>