Posted to user@hadoop.apache.org by Aaron Grubb <aa...@kaden.ai> on 2023/12/01 10:37:49 UTC

JSON in Kafka -> ORC in HDFS - Thoughts on different tools?

Hi all,

Posting this here to avoid biases from the individual mailing lists, where the product being discussed naturally gets presented as the
best. I'm evaluating tools to replace a section of our pipeline with something more efficient. Currently we use Kafka Connect to take data
from Kafka and land it in S3 (not HDFS, because the HDFS connector is paid) in JSON format; Hive then reads the JSON from S3 and creates
ORC files in HDFS after a group by. I would like to replace this with something that reads from Kafka, applies aggregations and windowing
in-place, and writes ORC to HDFS directly. I know the upcoming Hive 4 release will support this, but Hive LLAP is *very* slow when
processing JSON. So far I have a working PySpark application that accomplishes this using Structured Streaming with windowing; however,
the decision to evaluate Spark was based on there being a use for Spark in other areas, so I'm interested in opinions on other tools that
may be better for this use case in terms of resource usage, ease of use, scalability, resilience, etc.
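
For concreteness, here is a stripped-down sketch of roughly what that PySpark job looks like (topic name, JSON schema and HDFS paths are
placeholders, and the real job does more than a simple count/sum):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.appName("kafka-json-to-orc").getOrCreate()

    # Placeholder schema for the incoming JSON messages
    schema = StructType([
        StructField("event_time", StringType()),
        StructField("user_id", StringType()),
        StructField("value", LongType()),
    ])

    parsed = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "events")
              .load()
              .selectExpr("CAST(value AS STRING) AS json")
              .select(F.from_json("json", schema).alias("e"))
              .select("e.*")
              .withColumn("ts", F.to_timestamp("event_time")))

    # 5-minute windowed aggregation; the watermark bounds state and lets
    # append mode emit each window once it can no longer receive late data
    agg = (parsed
           .withWatermark("ts", "10 minutes")
           .groupBy(F.window("ts", "5 minutes"), "user_id")
           .agg(F.count("*").alias("events"), F.sum("value").alias("total")))

    (agg.writeStream
     .format("orc")
     .option("path", "hdfs:///warehouse/events_agg")
     .option("checkpointLocation", "hdfs:///checkpoints/events_agg")
     .outputMode("append")
     .trigger(processingTime="1 minute")
     .start()
     .awaitTermination())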

In terms of absolute must-haves:
- Read JSON from Kafka
- Ability to summarize data over a window
- Write ORC to HDFS
- Fault tolerant (can tolerate anywhere from a single machine to an entire cluster going offline unexpectedly while maintaining exactly-once
guarantees)

Nice-to-haves:
- Automatically scalable, both up and down (doesn't matter if standalone, YARN, Kubernetes, etc)
- Coding in Java not required

So far I've found that Gobblin, Flink and NiFi seem capable of accomplishing what I'm looking for, but neither I nor anyone at my company has
any experience with those products, so I was hoping to get some opinions on what the users here would choose and why. I'm also open to other
tools that I'm not yet aware of.

Thanks for your time,
Aaron



---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
For additional commands, e-mail: user-help@hadoop.apache.org


Re: JSON in Kafka -> ORC in HDFS - Thoughts on different tools?

Posted by Aaron Grubb <aa...@kaden.ai>.
Hi Michal,

Thanks for your detailed reply; it was very helpful. The motivation for replacing Kafka Connect is mostly related to having to run backfills from time to time. We store all the raw data from Kafka Connect, extract the fields we're currently using, then drop the extracted data and keep the raw JSON. That's fine as long as backfilling is never needed, but when it becomes necessary, processing 90+ days of JSON at 12+ billion rows per day using Hive LLAP is excruciatingly slow. So we wanted to have the data in ORC format as early as possible rather than adding an intermediate job to transform the JSON to ORC in the current pipeline. Switching this part of the pipeline over should also reduce overall resource usage - nothing dramatic for this first topic, but if it goes well we have a few Kafka Connect clusters we would be interested in converting, and that would also free up a ton of CPU time in Hive.

Thanks,
Aaron




Re: JSON in Kafka -> ORC in HDFS - Thoughts on different tools?

Posted by Michal Klempa <mi...@gmail.com>.
Hi Aaron,
I do not know Gobblin, so no advice there.

You write that Kafka Connect currently dumps to files; as you probably
already know, Kafka Connect can't do the aggregation.
To my knowledge, NiFi is also an ETL tool with only local transformations;
there is no state maintenance on a global scale. You can write custom
processors to do stateful transformation, but for this task that would be
tedious in my opinion. I would put NiFi out of the game.

Now to the requirements. I assume you have:
- large volume (~1k events per second)
- small messages (JSON payloads under ~1 KB)
- a need to have data available in near real-time (seconds at most) after a
window (aggregation) is triggered, for data stakeholders to query

Then it makes sense to think of doing the aggregation on-the-fly in a
real-time framework, i.e. a non-stop running real-time ETL job.
If your use case does not satisfy those criteria, e.g. low volume or no
real-time need (1 to 5 minutes of lateness is fine), I would strongly
encourage you to avoid real-time stateful streaming, as it is complicated
to set up, scale, maintain, run and, above all, code bug-free. It is a
non-stop running application; any memory leak means you are restarting on
OOM every couple of hours. It is hard.

You may have:
- high volume + no real-time need (5 minutes of lateness is fine)
In that case, running a PySpark batch job every 5 minutes on an ad-hoc AWS
spot-instance cluster is just fine.
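
For illustration only (broker, topic, schema and paths are made up): one
convenient way to run such a periodic job with Spark is Structured
Streaming with a one-shot trigger (availableNow in Spark 3.3+, or
trigger(once=True) on older versions), so the checkpoint remembers the
Kafka offsets between cron runs and each run only processes new data:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.appName("kafka-to-orc-every-5min").getOrCreate()

    # Placeholder schema for the JSON payload
    schema = StructType([StructField("user_id", StringType()),
                         StructField("value", LongType())])

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "events")
              .load()
              .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    # Processes whatever is currently in the topic, then stops; the checkpoint
    # keeps the offsets so the next cron run resumes where this one left off
    (events.writeStream
     .format("orc")
     .option("path", "hdfs:///warehouse/events_raw")
     .option("checkpointLocation", "hdfs:///checkpoints/events_raw")
     .trigger(availableNow=True)
     .start()
     .awaitTermination())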

You may have:
- low volume + no real-time need (5 minutes of lateness is fine)
In that case, just run a plain single-instance Python script doing the job:
at 1k to 100k events you can consume from Kafka directly, pack the records
into ORC, and dump the file on S3/HDFS on a single CPU. Use any cron to run
it every 5 minutes. Done.
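
Something along these lines, purely as a sketch (broker, topic, group id
and output path are made up; the Kafka client and ORC library here are just
one possible choice):

    import json
    import pyarrow as pa
    import pyarrow.orc as orc
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "broker1:9092",   # placeholder broker
        "group.id": "orc-batcher",             # placeholder consumer group
        "auto.offset.reset": "earliest",
        "enable.auto.commit": False,           # commit only after the file is written
    })
    consumer.subscribe(["events"])             # placeholder topic

    rows = []
    while True:
        msg = consumer.poll(timeout=5.0)
        if msg is None:
            break                              # caught up for this run
        if msg.error():
            continue
        rows.append(json.loads(msg.value()))

    if rows:
        table = pa.Table.from_pylist(rows)
        orc.write_table(table, "/tmp/batch.orc")   # then push to HDFS, e.g. hdfs dfs -put
        consumer.commit()                      # offsets committed only after a successful write

    consumer.close()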

In case your use case is:
- large volume + real-time
then Flink and Spark Structured Streaming are both a good fit, but there is
also Kafka Streams, which I would suggest adding as a competitor. There is
also Beam (Google Dataflow) if you are on GCP already. All of them do the
same job.

Flink vs. Spark Structured Streaming vs. Kafka Streams:
Deployment: Kafka Streams is just one fat jar; with Flink and Spark you
need to maintain clusters. Both frameworks are working on being Kubernetes
native, but that is not easy to set up either.
Coding: everything is JVM-based; Spark has Python support and Flink added
Python too. There seem to be some Python attempts at the Kafka Streams
approach, but I have no experience with them.
Fault tolerance: I have real-world experience with Flink and Spark
Structured Streaming; both can restart from checkpoints. Flink also has
savepoints, which is a good feature for starting a new job after
modifications (but also not easy to set up).
Automatic scaling: I think none of the open-source options has this feature
out of the box (correct me if wrong). You may want to pay for the Ververica
platform (from the Flink authors) or Databricks (from the Spark authors),
and there must be something from Confluent or its competitors too. Google
of course has Dataflow (the Beam API). All auto-scaling is painful,
however, as each rescale means reshuffling the data.
Exactly once: To my knowledge, only Flink nowadays offers end-to-end
exactly-once, and I am not sure whether that can be achieved with ORC on
HDFS as the destination. Maybe an idempotent ORC writer can be used, or
some other form of "transaction" on the destination must exist.
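
To make that last point concrete, one common trick (only a sketch of the
idea, not tied to any particular framework) is to derive the output file
name deterministically from the window and the Kafka partition/offset range
it covers, write to a temporary file, and rename it into place; re-running
the same window then replaces the same file instead of adding a duplicate:

    import os
    import pyarrow as pa
    import pyarrow.orc as orc

    def write_window_idempotently(table: pa.Table, out_dir: str,
                                  window_start: str, partition: int,
                                  first_offset: int, last_offset: int) -> str:
        """Write an ORC file whose name is fully determined by what it
        contains, so re-running the same window/offset range replaces the
        file rather than adding a duplicate. (Illustrative only; on HDFS you
        would use an equivalent create-temp-then-rename sequence via the
        HDFS client.)"""
        final_name = f"w={window_start}_p={partition}_o={first_offset}-{last_offset}.orc"
        final_path = os.path.join(out_dir, final_name)
        tmp_path = final_path + ".tmp"
        orc.write_table(table, tmp_path)
        os.replace(tmp_path, final_path)   # atomic rename on a local filesystem
        return final_path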

All in all, if I were solving your problem, I would first attack the
requirements list and ask whether it can't be done more simply. If not,
Flink would be my choice, as I have had good experience with it and you can
really hack anything inside it. But be prepared: the requirements list is a
hard one, and even if you get the pipeline up and running in 2 weeks, you
will surely revisit the decision after some incidents in the next 6 months.
If you loosen the requirements a bit, it becomes easier and easier. Your
current solution sounds very reasonable to me. You picked something that
works out of the box (Kafka Connect) and built an ELT flow where something
that can aggregate out of the box (Hive) does the aggregation. Why exactly
do you need to replace it?

Good luck, M.
