Posted to issues@spark.apache.org by "Cody Koeninger (JIRA)" <ji...@apache.org> on 2016/10/07 03:21:20 UTC

[jira] [Comment Edited] (SPARK-15406) Structured streaming support for consuming from Kafka

    [ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553955#comment-15553955 ] 

Cody Koeninger edited comment on SPARK-15406 at 10/7/16 3:20 AM:
-----------------------------------------------------------------

As soon as you say "checkpoint", a whole class of people who were burned by the problems with DStream checkpoints are going to tune out.  I don't trust anything unless I can get my hands on the offsets.
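
For reference, the kind of access I mean is what the 0-10 direct DStream already gives you. Rough sketch only; the broker address, topic, group id, and the surrounding StreamingContext {{ssc}} are placeholders, not anything from this ticket:

{code:scala}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

// Placeholder config; assumes an existing StreamingContext `ssc` and a local broker.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",
  "enable.auto.commit" -> (false: java.lang.Boolean))

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

stream.foreachRDD { rdd =>
  // Offsets for exactly this batch, per topic-partition, known before processing.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd and store results (and ideally the offsets) yourself ...
  // Commit back to Kafka only after your own store has succeeded.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
{code}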

Regarding exactly once: it has two components, producing idempotently to Kafka, and storing results idempotently/transactionally downstream.

On producing to Kafka, I was talking with Jay Kreps about this yesterday. There have been a lot of ideas over the years (e.g. put-with-expected-offset). The top contender at this point is apparently something along the lines of a TCP/IP-style sequence number on the producer side, allowing repeated writes to be ignored on the broker side. The KIP doc for it should be coming "soon".

On downstream writes, with no actual implementation for real downstream systems and no way to get offsets except from individual messages, what exists currently couldn't even be tested in production at my current gig. I could probably write a JDBC sink and a Citus sink pretty quickly, but those seem like they would require significant knowledge of and/or assumptions about the job. Having a schema helps a lot, but do you still want to assume things like a particular table format for outputs and offsets? That those sinks only work with Kafka?
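
To make those assumptions concrete, the sketch below is roughly the JDBC sink I have in mind. The table names, schema, helper name, and Postgres-style upserts are all made up for illustration; the point is just that results and offsets commit in the same transaction:

{code:scala}
import java.sql.DriverManager
import org.apache.spark.streaming.kafka010.OffsetRange

// Hypothetical tables: results(k primary key, v) and
// kafka_offsets(topic, part, until_offset, primary key (topic, part)).
def storeBatch(jdbcUrl: String,
               results: Iterator[(String, String)],
               offsetRanges: Array[OffsetRange]): Unit = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    conn.setAutoCommit(false)

    val upsert = conn.prepareStatement(
      "insert into results (k, v) values (?, ?) " +
        "on conflict (k) do update set v = excluded.v")
    results.foreach { case (k, v) =>
      upsert.setString(1, k)
      upsert.setString(2, v)
      upsert.addBatch()
    }
    upsert.executeBatch()

    val saveOffsets = conn.prepareStatement(
      "insert into kafka_offsets (topic, part, until_offset) values (?, ?, ?) " +
        "on conflict (topic, part) do update set until_offset = excluded.until_offset")
    offsetRanges.foreach { o =>
      saveOffsets.setString(1, o.topic)
      saveOffsets.setInt(2, o.partition)
      saveOffsets.setLong(3, o.untilOffset)
      saveOffsets.addBatch()
    }
    saveOffsets.executeBatch()

    // Results and offsets commit or roll back together, so replaying a batch is safe.
    conn.commit()
  } catch {
    case e: Exception => conn.rollback(); throw e
  } finally {
    conn.close()
  }
}
{code}

On restart you'd read kafka_offsets to decide where to resume, which is exactly the part that's hard to do if the framework never hands you the offsets.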


> Structured streaming support for consuming from Kafka
> -----------------------------------------------------
>
>                 Key: SPARK-15406
>                 URL: https://issues.apache.org/jira/browse/SPARK-15406
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for building a Kafka source for Structured Streaming. Here is the design doc for an initial version of the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> ================== Old description =========================
> Structured Streaming doesn't have support for Kafka yet.  I personally feel like time-based indexing would make for a much better interface, but it's been pushed back to Kafka 0.10.1 (there's a sketch of what that would enable just below the KIP link).
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
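>
> As a rough sketch of what time-based indexing would enable once it lands (offsetsForTimes needs a 0.10.1+ client and broker; the broker address and topic here are placeholders):
>
> {code:scala}
> import java.util.{Collections, Properties}
> import org.apache.kafka.clients.consumer.KafkaConsumer
> import org.apache.kafka.common.TopicPartition
>
> // Placeholder consumer config.
> val props = new Properties()
> props.put("bootstrap.servers", "localhost:9092")
> props.put("group.id", "example")
> props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
> props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
>
> val consumer = new KafkaConsumer[String, String](props)
> val tp = new TopicPartition("events", 0)
> val oneHourAgo = java.lang.Long.valueOf(System.currentTimeMillis() - 3600 * 1000L)
>
> // Ask the broker for the first offset whose timestamp is >= the given time,
> // instead of guessing starting offsets by hand.
> val found = consumer.offsetsForTimes(Collections.singletonMap(tp, oneHourAgo))
> val start = found.get(tp)
> if (start != null) {
>   consumer.assign(Collections.singletonList(tp))
>   consumer.seek(tp, start.offset())
> }
> consumer.close()
> {code}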



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org