Posted to user@spark.apache.org by Data Guy <da...@gmail.com> on 2022/03/08 18:41:22 UTC

Decompress Gzip files from EventHub with Structured Streaming

Hi everyone,

*<first time writing to this mailing list>*

Context: I have events coming into Databricks from an Azure Event Hub in a
Gzip-compressed format. Currently, I decompress the payloads with a UDF and
write the unzipped data into the silver layer of my Delta Lake with .write.
Note that even though the data arrives continuously, I do not use .writeStream
at the moment.
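
Simplified, my current approach looks roughly like this (I read the raw
compressed payloads from a bronze Delta table here; paths and column names are
just placeholders):

import gzip

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# UDF that gunzips the raw event body delivered by Event Hubs and decodes it.
gunzip = F.udf(
    lambda b: gzip.decompress(b).decode("utf-8") if b is not None else None,
    StringType(),
)

# Placeholder bronze table holding the compressed payloads.
raw = spark.read.format("delta").load("/mnt/bronze/events")

(raw
 .withColumn("payload", gunzip(F.col("body")))
 .write
 .format("delta")
 .mode("append")
 .save("/mnt/silver/events"))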

I have a few design-related questions that I hope someone with experience
could help me with!

   1. Is there a better way to extract Gzip files than a UDF?
   2. Is Spark Structured Streaming or Batch with Databricks Jobs better?
   (The pipeline runs once every 3 hours, but the data arrives continuously
   from Event Hub.)
   3. Should I use Auto Loader or simply stream data into Databricks using
   Event Hubs?

I am especially curious about the trade-offs and the best way forward. I
don't have massive amounts of data.

Thank you very much in advance!

Best wishes,
Maurizio Vancho Argall

Re: Decompress Gzip files from EventHub with Structured Streaming

Posted by ayan guha <gu...@gmail.com>.
Hi

IMHO this is not the best use of Spark. I would suggest using a simple Azure
Function to unzip.
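
Roughly something like this: an Event Hub-triggered function that gunzips each
event body and drops it into blob storage. Just a sketch; the eventHubTrigger
input and blob output bindings live in function.json, and all names here are
placeholders.

import gzip

import azure.functions as func


def main(event: func.EventHubEvent, outputblob: func.Out[bytes]) -> None:
    # event.get_body() returns the raw gzip-compressed payload bytes;
    # the decompressed bytes are written to the blob output binding.
    outputblob.set(gzip.decompress(event.get_body()))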

Is there any specific reason to use gzip over Event Hub?

If you can wait 10-20 seconds before processing, you can use Event Hubs Capture
to write the data to storage and then process it.
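
Something along these lines once Capture lands Avro files in storage (again
only a sketch; the storage path is a placeholder, and the Body /
EnqueuedTimeUtc column names follow the Capture Avro schema, so check them
against your files):

import gzip

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

gunzip = F.udf(
    lambda b: gzip.decompress(b).decode("utf-8") if b is not None else None,
    StringType(),
)

# Placeholder path to the container where Capture writes its Avro files.
captured = (spark.read
            .format("avro")
            .load("abfss://capture@youraccount.dfs.core.windows.net/your-eventhub/"))

(captured
 .select(gunzip(F.col("Body")).alias("payload"), F.col("EnqueuedTimeUtc"))
 .write
 .format("delta")
 .mode("append")
 .save("/mnt/silver/events"))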

It all depends on the compute you are willing to pay for; a job scheduled every
3 hours should not give you any benefit over streaming.
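
If you do want a scheduled job on the Spark side, one option is to point Auto
Loader at the Capture output and run it on your 3 hour schedule with a one-shot
trigger, so you keep the streaming checkpointing without a 24/7 cluster. Again
just a sketch; paths are placeholders, and the decompression of the Body column
would go in between, as in the snippet above.

stream = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "avro")
          .option("cloudFiles.schemaLocation", "/mnt/checkpoints/events/schema")
          .load("abfss://capture@youraccount.dfs.core.windows.net/your-eventhub/"))

(stream
 .writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/events")
 .trigger(once=True)  # or availableNow=True on newer runtimes
 .start("/mnt/silver/events"))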

Best
Ayan

--
Best Regards,
Ayan Guha