Posted to user@spark.apache.org by Ajay Chander <it...@gmail.com> on 2016/06/07 14:09:21 UTC

Spark_Usecase

Hi Spark users,

Right now we are using Spark for everything (loading the data from
SQL Server, applying transformations, and saving the results as permanent
tables in Hive) in our environment. Everything is done in one Spark
application.

The only thing we do before we launch our Spark application through
Oozie is load the data from the edge node to HDFS (this is triggered
through an Oozie SSH action that runs a shell script on the edge node).

My question is: is there any way we can accomplish the edge-to-HDFS copy
through Spark, so that everything is done in one Spark DAG and lineage
graph?

Any pointers are highly appreciated. Thanks

Regards,
Aj

Re: Spark_Usecase

Posted by Ajay Chander <it...@gmail.com>.
Marco, Ted, thanks for your time. I am sorry if I wasn't clear enough. We
have two sources:

1) SQL Server
2) files pushed onto the edge node by upstream systems on a daily basis

Point 1 has been achieved by using the JDBC format in Spark SQL.
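
For reference, a minimal sketch of that JDBC read, assuming a spark-shell
style sqlContext in scope (the URL, table, and credentials below are
placeholders, not from this thread):

    // Read a SQL Server table through the Spark SQL JDBC source; the
    // connection details are placeholder assumptions.
    val df = sqlContext.read.format("jdbc")
      .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb")
      .option("dbtable", "dbo.my_table")
      .option("user", "etl_user")
      .option("password", "...")
      .load()

    df.write.saveAsTable("my_table")   // persist as a permanent Hive table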

Point 2 has been achieved by using a shell script.

My only concern is point 2: I want to see if there is any way I can do it in
my Spark app instead of a shell script.
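
For illustration, a minimal sketch of one possibility (not confirmed in this
thread): calling Hadoop's FileSystem API from the Spark driver, assuming the
driver runs on the edge node in client mode:

    // Sketch: copy a local edge-node file into HDFS from the Spark driver.
    // Paths are placeholders; assumes client mode so the driver can see the
    // edge node's local disk.
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    fs.copyFromLocalFile(new Path("file:///data/incoming/daily.csv"),
                         new Path("hdfs:///user/etl/staging/daily.csv"))

Note this is driver-side code that runs before any distributed work, so it
removes the separate shell-script step rather than putting the copy inside
the Spark DAG itself.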

Thanks.


Re: Spark_Usecase

Posted by Ajay Chander <it...@gmail.com>.
Hi Deepak, thanks for the info. I was thinking of reading both the source and
destination tables into separate RDDs/DataFrames, applying some specific
transformations to find the updated records, removing the updated keys' rows
from the destination, and appending the updated records to it. Any pointers
on this kind of usage?
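
For reference, a minimal sketch of that flow, assuming both tables share a
schema with a key column id (all names here are hypothetical):

    import org.apache.spark.sql.functions.col

    val source = sqlContext.table("staging.src_updates")
    val dest   = sqlContext.table("warehouse.dest_table")

    // Keep the destination rows whose key was NOT updated...
    val untouched = dest
      .join(source.select(col("id").as("src_id")),
            dest("id") === col("src_id"), "left_outer")
      .filter(col("src_id").isNull)
      .drop("src_id")

    // ...and append the new/updated rows from the source.
    val merged = untouched.unionAll(source)

    // Write to a new table; overwriting a table while reading from it in
    // the same job is problematic.
    merged.write.saveAsTable("warehouse.dest_table_merged")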

It would be great if you could provide an example of what you mentioned.
Thanks much.

Regards,
Aj


Re: Spark_Usecase

Posted by vaquar khan <va...@gmail.com>.
Deepak, Spark does provide support for incremental loads, if users want to
schedule their batch jobs frequently and load their data incrementally from
databases.

You will not get good performance updating Spark SQL tables backed by
files. Instead, you can use message queues with Spark Streaming, or do an
incremental select to make sure your Spark SQL tables stay up to date with
your production databases.
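
For illustration, a sketch of the incremental select, assuming the source
table carries a modified_ts change-tracking column (an assumption about the
schema; connection details are placeholders):

    // Push the watermark predicate into the JDBC subquery so only changed
    // rows are pulled. How the watermark is stored is elided here.
    val lastRun = "2016-06-06 00:00:00"
    val delta = sqlContext.read.format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/mydb")
      .option("dbtable",
        s"(SELECT * FROM orders WHERE modified_ts > '$lastRun') AS delta")
      .load()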

Regards,
Vaquar khan

Re: Spark_Usecase

Posted by Deepak Sharma <de...@gmail.com>.
I am not sure if Spark provides any support for incremental extracts
inherently.
But you can maintain a file, e.g. extractRange.conf, in HDFS: read the end
of the last extracted range from it, and have the Spark job update it with
the new end range before it finishes, so the relevant new ranges are used
next time.
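
For illustration, a minimal sketch of that bookkeeping, assuming a numeric
key column id and placeholder paths and connection details:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.functions.max

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val confPath = new Path("/etl/extractRange.conf")

    // Read the end of the previously extracted range.
    val lastEnd = scala.io.Source.fromInputStream(fs.open(confPath))
      .mkString.trim.toLong

    // Extract only the rows beyond it.
    val delta = sqlContext.read.format("jdbc")
      .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb")
      .option("dbtable", s"(SELECT * FROM src WHERE id > $lastEnd) AS delta")
      .load()

    // ... transform and save delta ...

    // Persist the new end range before the job finishes (assumes delta is
    // non-empty).
    val newEnd = delta.agg(max("id")).first().getLong(0)
    val out = fs.create(confPath, true)   // overwrite
    out.write(newEnd.toString.getBytes("UTF-8"))
    out.close()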



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net

Re: Spark_Usecase

Posted by Ajay Chander <it...@gmail.com>.
Hi Mich, thanks for your inputs. I used Sqoop to get the data from MySQL.
Now I am using Spark to do the same. Right now, I am trying
to implement incremental updates while loading from MySQL through Spark.
Can you suggest any best practices for this? Thank you.



Re: Spark_Usecase

Posted by Mich Talebzadeh <mi...@gmail.com>.
I use Spark rather than Sqoop to import data from an Oracle table into a
Hive ORC table.

It uses JDBC for this purpose, all inclusive in Scala itself.

Also, Hive runs on the Spark engine here, an order of magnitude faster than
Hive on map-reduce.

Pretty simple.
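
For illustration, a minimal sketch of that pattern (connection details and
table names are placeholders):

    // JDBC read from Oracle, saved as an ORC-backed Hive table.
    val ora = sqlContext.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//orahost:1521/ORCL")
      .option("dbtable", "SCOTT.SALES")
      .option("user", "scott")
      .option("password", "...")
      .load()

    ora.write.format("orc").mode("overwrite").saveAsTable("sales_orc")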

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com




Re: Spark_Usecase

Posted by Ted Yu <yu...@gmail.com>.
bq. load the data from edge node to hdfs

Does the loading involve accessing SQL Server?

Please take a look at
https://spark.apache.org/docs/latest/sql-programming-guide.html


Re: Spark_Usecase

Posted by Marco Mistroni <mm...@gmail.com>.
Hi,
how about:

1. have a process that reads the data from your SQL Server and dumps it as a
file into a directory on your hard drive
2. use Spark Streaming to read the data from that directory and store it into
HDFS (see the sketch below)
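
A rough sketch of step 2, assuming a 60-second batch interval and
placeholder paths (the watched directory must be visible to the cluster):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Watch the dump directory and persist each new batch of files to HDFS.
    val ssc = new StreamingContext(sc, Seconds(60))
    ssc.textFileStream("/data/landing").foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty())
        rdd.saveAsTextFile(s"hdfs:///user/etl/in/batch-${time.milliseconds}")
    }
    ssc.start()
    ssc.awaitTermination()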

Perhaps there is some sort of Spark 'connector' that allows you to read
data from a DB directly, so you don't need to go via Spark Streaming?


hth