Posted to user@spark.apache.org by Priyank Shrivastava <pr...@asperasoft.com> on 2017/07/26 01:05:58 UTC

[SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

I am trying to write key-values to Redis using a DataStreamWriter object
with the PySpark Structured Streaming APIs. I am using Spark 2.2.

Since the Foreach sink is not supported for Python (see here
<http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach>),
I am trying to find some alternatives.

One alternative is to write a separate Scala module solely to push data into
Redis using foreach; ForeachWriter
<http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.ForeachWriter>
is supported in Scala. But this doesn't seem like an efficient approach, and
it adds deployment overhead because I would then have to support Scala in my app.

Another approach is obviously to use Scala instead of Python, which is fine,
but I want to make sure that I absolutely cannot use Python for this
problem before I take that path.

I would appreciate some feedback and alternative design approaches for this
problem.

Thanks.
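
For concreteness, the Scala-module alternative would look roughly like the
minimal sketch below. It assumes the Jedis Redis client; the KV case class,
the RedisWriter name, and the host/port values are illustrative placeholders,
not code from this thread.

import org.apache.spark.sql.ForeachWriter
import redis.clients.jedis.Jedis

// Placeholder row type for the key-value pairs being written.
case class KV(key: String, value: String)

// Hypothetical sink: one Redis connection per partition per trigger.
class RedisWriter(host: String, port: Int) extends ForeachWriter[KV] {
  private var jedis: Jedis = _

  // Called once per partition per trigger; return true to process it.
  override def open(partitionId: Long, version: Long): Boolean = {
    jedis = new Jedis(host, port)
    true
  }

  // Called for every row in the partition.
  override def process(record: KV): Unit =
    jedis.set(record.key, record.value)

  // Called when the partition finishes; errorOrNull is non-null on failure.
  override def close(errorOrNull: Throwable): Unit =
    if (jedis != null) jedis.close()
}

It would be wired up along the lines of
df.as[KV].writeStream.foreach(new RedisWriter("localhost", 6379)).start(),
assuming df has string key and value columns.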

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

Posted by Priyank Shrivastava <pr...@asperasoft.com>.
Also, in your example, doesn't the temp view need to be accessed using the
same SparkSession on the Scala side? Since I am not using a notebook, how
can I get access to the same SparkSession in Scala?

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

Posted by Priyank Shrivastava <pr...@asperasoft.com>.
Thanks Burak.

In a streaming context, would I need to do any state management for the temp
views? For example, across sliding windows.

Priyank

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

Posted by Burak Yavuz <br...@gmail.com>.
Hi Priyank,

You may register them as temporary tables to use across language boundaries.

Python:
df = spark.readStream...
# Python logic
df.createOrReplaceTempView("tmp1")

Scala:
val df = spark.table("tmp1")
df.writeStream
  .foreach(...)
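
A slightly fuller sketch of this Scala side, assuming both snippets run in
the same application and therefore share one SparkSession (so that
getOrCreate() returns the session the Python driver already created), and
reusing the hypothetical RedisWriter and KV definitions from the sketch at
the top of this thread:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// "tmp1" is the streaming view registered by the Python side above.
val query = spark.table("tmp1")
  .as[KV]                                       // assumes string key/value columns
  .writeStream
  .foreach(new RedisWriter("localhost", 6379))  // hypothetical writer from earlier
  .start()

query.awaitTermination()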


Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

Posted by Priyank Shrivastava <pr...@asperasoft.com>.
TD,

For a hybrid Python-Scala approach, what's the recommended way of handing
off a DataFrame from Python to Scala? I would like to know especially in a
streaming context.

I am not using notebooks/Databricks. We are running this on our own Spark
2.1 cluster.

Priyank

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

Posted by Tathagata Das <ta...@gmail.com>.
We see that all the time. For example, in SQL, people can write their
user-defined functions in Scala/Java and use them from SQL/Python/anywhere.
That is the recommended way to get the best combination of performance and
ease of use from non-JVM languages.
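
As an illustration of that pattern (a sketch, not code from this thread): a
UDF written in Scala and registered against the shared SparkSession becomes
callable from SQL, and hence from any language that issues SQL through that
session. The function name and logic here are made up:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Register a JVM-side function under a SQL-visible name.
spark.udf.register("normalizeKey", (s: String) => s.trim.toLowerCase)

// Callable from SQL in any language bound to this session, e.g.:
// spark.sql("SELECT normalizeKey(key) FROM tmp1").show()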

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

Posted by Tathagata Das <ta...@gmail.com>.
For built-in SQL functions, it does not matter which language you use, as
the engine will execute the most optimized JVM code. However, in your case,
you are asking for foreach in Python. My interpretation was that you want to
specify your own Python function to process the rows. That is different from
the built-in functions, as the engine would have to invoke your function
inside a Python VM.

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

Posted by ayan guha <gu...@gmail.com>.
Hi TD

I thought Structured Streaming provides the same DataFrame concept, where it
does not matter which language I use to invoke the APIs, with the exception
of UDFs.

So, when I think of supporting the foreach sink in Python, I think of it as
just a wrapper API, with the data remaining in the JVM only. Similar to, for
example, a Hive writer or HDFS writer in the DataFrame API.

Am I oversimplifying? Or is it just early days for Structured Streaming?
Happy to learn of any mistakes in my thinking and understanding.

Best
Ayan

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

Posted by Priyank Shrivastava <pr...@asperasoft.com>.
Thanks TD.  I am going to try the Python-Scala hybrid approach, using Scala
only for the custom Redis sink and Python for the rest of the app.  I
understand it might not be as efficient as writing the app purely in Scala,
but unfortunately I am constrained on Scala resources.  Have you come across
other use cases where people have resorted to such a Python-Scala hybrid
approach?

Regards,
Priyank

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

Posted by Tathagata Das <ta...@gmail.com>.
Hello Priyank,

Writing the sink purely in Scala/Java would be the most efficient. Even if
we exposed Python APIs that allowed writing custom sinks in pure Python, it
would not be as efficient as a Scala/Java foreach, as the data would have to
cross the JVM/PVM boundary, which has significant overhead. So a Scala/Java
foreach is always going to be the best option.

TD
