Posted to user@flink.apache.org by srimugunthan dhandapani <sr...@gmail.com> on 2018/08/06 18:35:54 UTC

Accessing source table data from hive/Presto

Hi all,
I read the Flink documentation and came across the supported connectors:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/connectors/index.html#bundled-connectors

We have some data residing in Hive/Presto that needs to be made available
to a Flink job. The data in Hive or Presto is updated about once a day or
less often.

Ideally, we would connect to Hive or Presto, run a query, get the results
back, and use them in a Flink job.
What are the options to achieve something like that?

Thanks,
mugunthan

Re: Accessing source table data from hive/Presto

Posted by Fabian Hueske <fh...@gmail.com>.
Do you want to read the data once or monitor a directory and process new
files as they appear?

Reading from S3 with Flink's current MonitoringFileSource implementation
does not work reliably due to S3's eventually consistent list operation
(see FLINK-9940 [1]).
Reading a directory also has some issues, as it won't work with
checkpointing enabled.

These limitations could be worked around with custom source implementations.
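
For illustration, a minimal sketch of such a custom source: a
RichSourceFunction that re-runs a query over JDBC once a day and emits the
result rows. The Presto URL, query, and two-column schema below are made-up
placeholders, and the source is deliberately simple (it keeps no
checkpointed state; on recovery it just re-runs the query):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.types.Row;

public class DailyQuerySource extends RichSourceFunction<Row> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Row> ctx) throws Exception {
        while (running) {
            // Hypothetical Presto JDBC URL and query; adjust to your setup.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:presto://presto-host:8080/hive/default");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT id, name FROM dim_table")) {
                while (rs.next()) {
                    Row row = new Row(2);
                    row.setField(0, rs.getLong("id"));
                    row.setField(1, rs.getString("name"));
                    // Emit under the checkpoint lock, as sources should.
                    synchronized (ctx.getCheckpointLock()) {
                        ctx.collect(row);
                    }
                }
            }
            Thread.sleep(24 * 60 * 60 * 1000L); // refresh once a day
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}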

Best, Fabian

[1] https://issues.apache.org/jira/browse/FLINK-9940


Re: Accessing source table data from hive/Presto

Posted by srimugunthan dhandapani <sr...@gmail.com>.
Thanks for the reply. I was mainly thinking of the streaming use case.
With the approach of porting to Flink's SQL API, is it possible to read
Parquet data from S3 and register it as a table in Flink?
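
A rough sketch of one way this could look with the batch Table API (method
names as of Flink 1.5), assuming the flink-hadoop-compatibility and
parquet-avro dependencies are available; the S3 path, field names, and
types are illustrative assumptions, not a confirmed recipe:

import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.hadoopcompatibility.HadoopInputs;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.types.Row;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.avro.AvroParquetInputFormat;

public class ParquetFromS3 {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        BatchTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);

        // Read Parquet files from S3 via the Hadoop compatibility layer.
        DataSet<Tuple2<Void, GenericRecord>> parquet = env.createInput(
            HadoopInputs.readHadoopFile(
                new AvroParquetInputFormat<GenericRecord>(),
                Void.class, GenericRecord.class,
                "s3://my-bucket/hive-export/",   // hypothetical path
                Job.getInstance()));

        // Project the Avro records into typed fields.
        DataSet<Tuple2<Long, String>> rows = parquet.map(
            new MapFunction<Tuple2<Void, GenericRecord>, Tuple2<Long, String>>() {
                @Override
                public Tuple2<Long, String> map(Tuple2<Void, GenericRecord> t) {
                    GenericRecord r = t.f1;
                    return Tuple2.of((Long) r.get("id"), r.get("name").toString());
                }
            });

        // Register the DataSet as a table and query it with Flink SQL.
        tEnv.registerDataSet("source_table", rows, "id, name");
        Table result = tEnv.sqlQuery("SELECT id, name FROM source_table");
        tEnv.toDataSet(result, Row.class).print();
    }
}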



Re: Accessing source table data from hive/Presto

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Mugunthan,

This depends on the type of your job. Is it a batch or a streaming job?
Some queries could be ported to Flink's SQL API as suggested by the link
that Hequn posted. In that case, the query would be executed in Flink.

Other options are to use a JDBC InputFormat, or to persist the result to
files and ingest it with a monitoring file source.
These options would mean running the query in Hive/Presto and just
ingesting the result (via JDBC or files).
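
For the JDBC route, a minimal sketch using the JDBCInputFormat from the
flink-jdbc module; the Presto driver class, URL, query, and result schema
are placeholders to adapt to your setup:

import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.jdbc.JDBCInputFormat;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.types.Row;

public class JdbcIngest {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Expected schema of the query result: (id BIGINT, name VARCHAR).
        RowTypeInfo rowType = new RowTypeInfo(
            BasicTypeInfo.LONG_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO);

        // The query runs in Presto; Flink only ingests the result set.
        JDBCInputFormat jdbcInput = JDBCInputFormat.buildJDBCInputFormat()
            .setDrivername("com.facebook.presto.jdbc.PrestoDriver")
            .setDBUrl("jdbc:presto://presto-host:8080/hive/default")
            .setQuery("SELECT id, name FROM my_table")
            .setRowTypeInfo(rowType)
            .finish();

        DataSet<Row> result = env.createInput(jdbcInput);
        result.print();
    }
}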

Which solution works best for you depends on the details.
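
And for the file route, a sketch of ingesting an export directory with the
monitoring file source; the path and scan interval are assumptions, and
note the S3 caveat (FLINK-9940) raised elsewhere in this thread:

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class MonitorExportDir {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical directory that the daily Hive/Presto export writes to.
        String path = "s3://my-bucket/presto-export/";
        TextInputFormat format = new TextInputFormat(new Path(path));

        // Re-scan the directory every 60s and process new files as they appear.
        DataStream<String> lines = env.readFile(
            format, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 60_000L);

        lines.print();
        env.execute("monitor export directory");
    }
}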

Best, Fabian


Re: Accessing source table data from hive/Presto

Posted by Hequn Cheng <ch...@gmail.com>.
Hi srimugunthan,

I found a related link[1]. Hope it helps.

[1] https://stackoverflow.com/questions/41683108/flink-1-1-3-interact-with-hive-2-1-0
