Posted to user@spark.apache.org by xmehaut <xa...@gmail.com> on 2018/05/14 04:53:21 UTC

[Arrow][Dremio]

Hello,
I have some questions about Spark and Apache Arrow. Up to now, Arrow has only
been used for sharing data between Python and the Spark executors instead of
transmitting it through sockets. I'm currently studying Dremio as an
interesting way to access multiple sources of data, and as a potential
replacement for ETL tools, including Spark SQL.
It seems, if the promises actually hold, that Arrow and Dremio may be game
changers for these two purposes (data source abstraction, ETL tasks), leaving
Spark with the two remaining goals, i.e. ML/DL and graph processing, which
could be a danger for Spark in the medium term given the rise of multiple
frameworks in these areas.
My questions are then:
- is there a means to use Arrow more broadly in Spark itself and not only
for sharing data?
- what are the strengths and weaknesses of Spark with respect to Arrow and
consequently Dremio?
- what is the difference, finally, between Databricks DBIO and Dremio/Arrow?
- how do you see the future of Spark given these assumptions?
regards



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: [Arrow][Dremio]

Posted by Xavier Mehaut <xa...@gmail.com>.
Thanks Bryan for the answer.

Sent from my iPhone


Re: [Arrow][Dremio]

Posted by Bryan Cutler <cu...@gmail.com>.
Hi Xavier,

Regarding Arrow usage in Spark, using the Arrow format to transfer data between
Python and Java has been the focus so far because this area stood to benefit
the most. It's possible that the scope of Arrow could broaden in the future,
but there still need to be discussions about this.

Bryan
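
As an illustration of the Python/Java path described above, here is a minimal
PySpark sketch (assuming Spark 2.3+ with pyarrow installed; app and column
names are made up for the example). Enabling one configuration flag switches
toPandas() and pandas_udf over to Arrow record batches instead of row-at-a-time
pickle serialization:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

    # Enable Arrow-based transfer between the JVM and Python workers
    # (Spark 2.3 property name; later versions rename it to
    # spark.sql.execution.arrow.pyspark.enabled).
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    df = spark.range(0, 1000000).withColumnRenamed("id", "x")

    # toPandas() now ships the data to the driver as Arrow batches.
    pdf = df.toPandas()

    # A pandas_udf runs on the executors; its input and output are exchanged
    # with the JVM as Arrow batches rather than pickled rows.
    @pandas_udf("double", PandasUDFType.SCALAR)
    def plus_one(x):
        return x + 1

    df.select(plus_one(df["x"])).show(5)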


Re: [Arrow][Dremio]

Posted by Pierce Lamb <ri...@gmail.com>.
Hi Xavier,

Along the lines of connecting to multiple sources of data and replacing ETL
tools, you may want to check out Confluent's blog on building a real-time
streaming ETL pipeline on Kafka
<https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/>
as well as SnappyData's blog on Real-Time Streaming ETL with SnappyData
<http://www.snappydata.io/blog/real-time-streaming-etl-with-snappydata>, where
Spark is central to connecting to multiple data sources, executing SQL on
streams, etc. These should provide nice comparisons to your ideas about
Dremio + Spark as ETL tools.

Disclaimer: I am a SnappyData employee

Hope this helps,

Pierce
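
For a rough picture of the Spark-centric approach those posts describe, here is
a minimal Structured Streaming sketch that reads a Kafka topic and runs plain
SQL on the stream. The broker address, topic name and schema are placeholders,
and the spark-sql-kafka-0-10 package is assumed to be on the classpath:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("streaming-etl-sketch").getOrCreate()

    schema = StructType([
        StructField("item", StringType()),
        StructField("price", DoubleType()),
    ])

    # Read the Kafka topic as an unbounded DataFrame and parse the JSON payload.
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
              .option("subscribe", "orders")                     # placeholder topic
              .load()
              .select(from_json(col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    # Ordinary Spark SQL over the stream, with results written out continuously.
    events.createOrReplaceTempView("orders")
    agg = spark.sql("SELECT item, avg(price) AS avg_price FROM orders GROUP BY item")

    (agg.writeStream
        .outputMode("complete")
        .format("console")
        .start()
        .awaitTermination())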


Re: [Arrow][Dremio]

Posted by xmehaut <xa...@gmail.com>.
Hi Michaël,

I'm not an expert on Dremio; I'm just trying to evaluate the potential of this
technology, what impact it could have on Spark, how the two could work
together, and how Spark could use Arrow even further internally alongside its
existing algorithms.

Dremio already has a fairly rich API set enabling access to, for instance,
metadata and SQL queries, or even the programmatic creation of virtual
datasets. It also has a lot of predefined functions, and I imagine there will
be more and more functions in the future, e.g. machine learning functions like
the ones found in Azure SQL Server, which lets you mix SQL and ML functions.
Access to Dremio is made through JDBC, and we can imagine accessing virtual
datasets through Spark and dynamically creating new datasets from the API,
connected to Parquet files written dynamically by Spark on HDFS, Azure Data
Lake or S3... Of course, a tighter integration between the two would be better,
with a Spark read/write connector to Dremio :)

regards
xavier
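
As a sketch of the JDBC route described above: reading a Dremio virtual dataset
into a Spark DataFrame, which Spark can then join with other sources or write
back to Parquet on HDFS/S3. The host, credentials and dataset name are
placeholders, and the Dremio JDBC driver jar is assumed to be on Spark's
classpath (e.g. passed via --jars):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dremio-jdbc-sketch").getOrCreate()

    # Placeholder connection details; the URL format and driver class follow
    # Dremio's JDBC documentation.
    dremio = (spark.read
              .format("jdbc")
              .option("url", "jdbc:dremio:direct=dremio-host:31010")
              .option("driver", "com.dremio.jdbc.Driver")
              .option("dbtable", "myspace.my_virtual_dataset")
              .option("user", "user")
              .option("password", "password")
              .load())

    dremio.printSchema()

    # From here the virtual dataset behaves like any other DataFrame, e.g.
    # written back as Parquet for Dremio (or anything else) to pick up.
    dremio.write.mode("overwrite").parquet("hdfs:///data/output/my_dataset")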





Re: [Arrow][Dremio]

Posted by Michael Shtelma <ms...@gmail.com>.
Hi Xavier,

Dremio looks really interesting and has a nice UI. I think the idea of
replacing SSIS or similar tools with Dremio is not so bad, but what about
complex scenarios with a lot of code and transformations?
Is it possible to use Dremio via its API and define your own transformations
and transformation workflows in Java or Scala?
I am not sure if that is supported at all.
I think the Dremio folks intend to give users access to the Sabot API so that
Dremio can be used in the same way as Spark, but I am not sure if that is
possible yet.
Have you also tried comparing performance with Spark? Are there any
benchmarks?

Best,
Michael
