Posted to dev@spark.apache.org by Jacek Laskowski <ja...@japila.pl> on 2017/05/01 13:52:23 UTC

[KafkaSourceProvider] Why topic option and column without reverting to path as the least priority?

Hi,

I've just found out that KafkaSourceProvider supports a topic option
that sets the Kafka topic to save a DataFrame to.

You can also use a topic column to assign rows to topics.

Given these features, I've been wondering why the "path" option is not
supported, even at the lowest precedence, so that when neither a topic
column nor a topic option is defined, the path passed to
save(path: String) would be used as the topic.

WDYT?

It looks pretty trivial to support --> see KafkaSourceProvider at
lines [1] and [2] if I'm not mistaken.

[1] https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L145
[2] https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L163
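
A minimal sketch of the proposed precedence, in plain Scala with no Spark
dependency. The object and method names are hypothetical and are not part
of KafkaSourceProvider. It also assumes that the topic option takes
precedence over the topic column, which is how the Kafka sink is
documented; only the final fallback to "path" is the change being
proposed here.

```scala
// Hypothetical helper for illustration only; not Spark code.
object TopicResolution {
  val TopicOptionKey = "topic" // existing sink option discussed in this thread
  val PathOptionKey = "path"   // key that save(path) / start(path) fill in

  /** Picks the Kafka topic for one row.
    *
    * Order: "topic" option, then the row's topic column value,
    * then "path" as the proposed lowest-priority fallback.
    */
  def resolveTopic(
      options: Map[String, String],
      topicColumnValue: Option[String]): Either[String, String] = {
    // Treat blank option values as if they were not set.
    def nonBlank(key: String): Option[String] =
      options.get(key).map(_.trim).filter(_.nonEmpty)

    nonBlank(TopicOptionKey)
      .orElse(topicColumnValue)
      .orElse(nonBlank(PathOptionKey))
      .toRight("No topic: set the 'topic' option, a topic column, or a path")
  }
}
```

With this ordering, existing queries that already set the topic option or
a topic column behave exactly as before; "path" is consulted only when
both are missing.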

Regards,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: [KafkaSourceProvider] Why topic option and column without reverting to path as the least priority?

Posted by Jacek Laskowski <ja...@japila.pl>.

https://issues.apache.org/jira/browse/SPARK-20597

I'm going to send a PR soon.

Regards,
Jacek Laskowski

On Mon, May 1, 2017 at 8:26 PM, Cody Koeninger <co...@koeninger.org> wrote:
> Yeah, seems reasonable.

Re: [KafkaSourceProvider] Why topic option and column without reverting to path as the least priority?

Posted by Cody Koeninger <co...@koeninger.org>.

Yeah, seems reasonable.

On Mon, May 1, 2017 at 12:40 PM, Jacek Laskowski <ja...@japila.pl> wrote:
> Hi,
>
> Thanks Cody and Michael! I didn't expect to get two answers so quickly, and
> from THE brains behind the Spark-Kafka integration. #impressed
>
> Yes, Michael has nailed it. Using save's path felt so natural to me after
> months with Spark that I was surprised not to see it supported instead of
> the custom and surely not very obvious topic option.
>
> Imagine my day today when I discovered that I could use KafkaSource in
> batch queries and then suddenly found out that save has no support for
> path. I'm not faint-hearted so I survived :-)
>
> I think that change would make KafkaSource even cooler. Please add support
> if possible (and make it part of the upcoming 2.2.0, too!)
>
> Thanks.
>
> Jacek

Re: [KafkaSourceProvider] Why topic option and column without reverting to path as the least priority?

Posted by Jacek Laskowski <ja...@japila.pl>.

Hi,

Thanks Cody and Michael! I didn't expect to get two answers so quickly, and
from THE brains behind the Spark-Kafka integration. #impressed

Yes, Michael has nailed it. Using save's path felt so natural to me after
months with Spark that I was surprised not to see it supported instead of
the custom and surely not very obvious topic option.

Imagine my day today when I discovered that I could use KafkaSource in
batch queries and then suddenly found out that save has no support for
path. I'm not faint-hearted so I survived :-)

I think that change would make KafkaSource even cooler. Please add support
if possible (and make it part of the upcoming 2.2.0, too!)

Thanks.

Jacek

On 1 May 2017 7:26 p.m., "Michael Armbrust" <mi...@databricks.com> wrote:

> He's just suggesting that since the DataStreamWriter start() method can
> fill in an option named "path", we should make that a synonym for "topic".
> Then you could do something like:
>
> df.writeStream.format("kafka").start("topic")
>
> Seems reasonable if people don't think that is confusing.

Re: [KafkaSourceProvider] Why topic option and column without reverting to path as the least priority?

Posted by Michael Armbrust <mi...@databricks.com>.

He's just suggesting that since the DataStreamWriter start() method can
fill in an option named "path", we should make that a synonym for "topic".
Then you could do something like:

df.writeStream.format("kafka").start("topic")

Seems reasonable if people don't think that is confusing.
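
To make the mechanics concrete, here is a toy model of the idea in plain
Scala. WriterSketch is a hypothetical stand-in and is not Spark's
DataStreamWriter; it only mirrors the one behavior relevant here, namely
that start(path) records its argument as an option named "path". The
kafkaTopic method shows the suggested synonym rule, with an explicit
"topic" option still taking precedence.

```scala
// Toy stand-in for a stream writer; for illustration only, not Spark code.
final case class WriterSketch(
    source: String = "",
    options: Map[String, String] = Map.empty) {

  // Selects the sink format, e.g. "kafka".
  def format(name: String): WriterSketch = copy(source = name)

  // Records an arbitrary key/value option.
  def option(key: String, value: String): WriterSketch =
    copy(options = options + (key -> value))

  // start(path) does nothing more than store its argument under "path".
  def start(path: String): WriterSketch = option("path", path)

  // Suggested rule: use "topic" if present, otherwise fall back to "path".
  def kafkaTopic: Option[String] =
    options.get("topic").orElse(options.get("path"))
}
```

Under this model, WriterSketch().format("kafka").start("events") and
WriterSketch().format("kafka").option("topic", "events") resolve to the
same topic, which is the equivalence being suggested.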

On Mon, May 1, 2017 at 8:43 AM, Cody Koeninger <co...@koeninger.org> wrote:

> I'm confused about what you're suggesting.  Are you saying that a
> Kafka sink should take a filesystem path as an option?

Re: [KafkaSourceProvider] Why topic option and column without reverting to path as the least priority?

Posted by Cody Koeninger <co...@koeninger.org>.

I'm confused about what you're suggesting.  Are you saying that a
Kafka sink should take a filesystem path as an option?
