Posted to users@nifi.apache.org by Paul Parker <ni...@gmail.com> on 2020/03/27 18:13:08 UTC

How to use delta storage format

We read data via JDBC from a database and want to save the results as a
Delta table and then read them again. How can I realize this with NiFi
and a Hive or Glue metastore?

Re: How to use delta storage format

Posted by Mike Thomsen <mi...@gmail.com>.
Paul,

> What it does is it basically runs a local-mode (in-process) Spark to
> read the log.

That's unfortunately not particularly scalable, unless I'm missing
something. I think the easiest path to accomplish this would be to build
a NiFi flow that generates Parquet files and uploads them to an S3
bucket, and then periodically run a Spark job that reads the entire
bucket and merges the files into the table. I've done something simple
like this in a personal experiment. It's particularly easy now that we
have Record API components that will directly generate Parquet output
files from an Avro schema and a record set.
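
For what it's worth, here is a rough sketch of what that periodic Spark
job could look like. This is only my assumption of the shape of it: the
bucket and table paths are placeholders, and delta-core is assumed to be
on the classpath. A true upsert would use DeltaTable.merge instead of a
plain append.

    import org.apache.spark.sql.SparkSession

    object MergeStagedParquet {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("merge-staged-parquet-into-delta")
          .getOrCreate()

        // Read everything NiFi has staged as Parquet in the bucket ...
        val staged = spark.read.parquet("s3a://example-bucket/staging/")

        // ... and append it to the Delta table in one atomic commit.
        staged.write
          .format("delta")
          .mode("append")
          .save("s3a://example-bucket/tables/my_table")

        spark.stop()
      }
    }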

Re: How to use delta storage format

Posted by Mike Thomsen <mi...@gmail.com>.
Interesting. You've given me a lot to think about in terms of designing
this.

Re: How to use delta storage format

Posted by Paul Parker <ni...@gmail.com>.
@Mike I'd appreciate some feedback.

Re: How to use delta storage format

Posted by Paul Parker <ni...@gmail.com>.
Let me share answers from the Delta community:

Answer to Q1:
Structured streaming queries can commit every minute, or even every
20-30 seconds. That definitely creates small files, but it is okay,
because it is expected that people will periodically compact the files.
The same timing should work fine for NiFi and any other streaming
engine. It does create roughly 1,000 versions per day, but that is okay.
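
(For context, the periodic compaction mentioned above can be done with a
small Spark job along these lines. This is only a sketch: the table path
and target file count are placeholders, and delta-core is assumed to be
on the classpath.)

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("compact-delta-table").getOrCreate()
    val path = "s3a://example-bucket/tables/my_table"

    // Rewrite many small files into a few larger ones without changing the data.
    spark.read
      .format("delta")
      .load(path)
      .repartition(16)                 // target number of output files
      .write
      .format("delta")
      .option("dataChange", "false")   // mark the commit as compaction, not new data
      .mode("overwrite")
      .save(path)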

Answer to Q2:
That is up to the sink implementation. Both are okay. In fact, it can
probably be a combination of both, as long as we don't commit every
second. That may not scale well.

Answer to Q3:
You need a primary node that is responsible for managing the Delta
table. That node would be responsible for reading the log, parsing it,
updating it, etc. Unfortunately, we have no good non-Spark way to read
the log, and much less to write to it.
There is an experimental uber jar that tries to package Delta + Spark
together into a single jar, which you could use to read the log. It is
available here: https://github.com/delta-io/connectors/

What it does is basically run a local-mode (in-process) Spark to read
the log. This is what we are using to build a Hive connector that will
allow Hive to read Delta files. The goal of that was only to read; your
goal is to write, which is definitely more complicated, because you have
to do much more. That uber jar has all the necessary code to do the
writing, which you could use, but there has to be a driver node that
collects all the Parquet files written by other nodes and atomically
commits them to the Delta log to make them visible to all readers.
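
(To make that concrete, reading a Delta table through an in-process,
local-mode Spark session looks roughly like the sketch below. The path
is a placeholder, and the uber jar or delta-core plus Spark is assumed
to be on the classpath.)

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("read-delta-locally")
      .master("local[*]")              // in-process Spark, no cluster required
      .getOrCreate()

    // Loading the path resolves the current snapshot from the _delta_log directory.
    val snapshot = spark.read.format("delta").load("s3a://example-bucket/tables/my_table")
    snapshot.printSchema()
    println(s"rows in current version: ${snapshot.count()}")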

Does this help as a first orientation?


Re: How to use delta storage format

Posted by Mike Thomsen <mi...@gmail.com>.
It looks like a lot of their connectors rely on external management by
Spark. That is true of Hive, and also of Athena/Presto unless I misread the
documentation. Read some of the fine print near the bottom of this to get
an idea of what I mean:

https://github.com/delta-io/connectors

Hypothetically, we could build a connector for NiFi, but there are some
things about the design of Delta Lake that I am not sure about, based on
my own research. Roughly, they are the following:

1. What is a good timing strategy for doing commits to Delta Lake?
2. What would trigger a commit in the first place?
3. Is there a good way to trigger commits that would work within the
masterless cluster design of clustered NiFi instead of requiring a special
"primary node only" processor for executing commits?

Based on my experimentation, one of the biggest questions around the
first point is whether you really want potentially thousands or tens of
thousands of time-travel versions to be created throughout the day. A
record processor that reads in a ton of small record sets and injects
them into Delta Lake would create a ton of these versions, and they'd be
largely meaningless to people trying to make sense of them for the
purpose of going back and forth in time between versions.

Do we trigger a commit per record set or set a timer?

Most of us on the NiFi dev side have no real experience here. It would
be helpful for us to get some ideas from the community to form use
cases, because there are some big gaps in how we'd even start to shape
the requirements.

Re: How to use delta storage format

Posted by Paul Parker <ni...@gmail.com>.
Hi Mike,
Your alternative suggestion sounds good. But how does it work if I want
to keep this running continuously? In other words, the Delta table
should be continuously updated. After all, this is one of the biggest
advantages of Delta: you can ingest batch and streaming data into one
table.

I am also thinking about workarounds (using Athena, Presto or Redshift
with NiFi):
"Here is the list of integrations that enable you to access Delta tables
from external data processing engines.

   - Presto and Athena to Delta Lake Integration
   <https://docs.delta.io/latest/presto-integration.html>
   - Redshift Spectrum to Delta Lake Integration
   <https://docs.delta.io/latest/redshift-spectrum-integration.html>
   - Snowflake to Delta Lake Integration
   <https://docs.delta.io/latest/snowflake-integration.html>
   - Apache Hive to Delta Lake Integration
   <https://docs.delta.io/latest/hive-integration.html>"

Source:
https://docs.delta.io/latest/integrations.html

I look forward to further ideas from the community.

Re: How to use delta storage format

Posted by Mike Thomsen <mi...@gmail.com>.
I think there is a connector for Hive that works with Delta. You could
try setting up Hive to work with Delta and then using NiFi to feed Hive.
Alternatively, you can simply convert the results of the JDBC query into
a Parquet file, push it to a desired location and run a Spark job to
convert from Parquet to Delta (that should be pretty fast, since Delta
stores its data as Parquet files under the hood).
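
That conversion step is small in Spark; here is a sketch (the paths are
placeholders, and delta-core is assumed to be on the classpath):

    import io.delta.tables.DeltaTable
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parquet-to-delta").getOrCreate()

    // Option 1: convert the existing Parquet directory in place
    // (this writes a _delta_log next to the data files).
    DeltaTable.convertToDelta(spark, "parquet.`s3a://example-bucket/staging/my_table`")

    // Option 2: rewrite the data into a separate Delta location.
    spark.read
      .parquet("s3a://example-bucket/staging/my_table")
      .write
      .format("delta")
      .mode("overwrite")
      .save("s3a://example-bucket/tables/my_table")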
