Posted to user@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2017/06/15 16:03:21 UTC

fetching and joining data from two different clusters

Hi,

With Spark, how easy is it to fetch data from two different clusters and do a
join?

I can use two JDBC connections to join two tables from two different Oracle
instances in Spark, by creating two DataFrames and joining them together.
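
For illustration, a minimal sketch of that JDBC approach (assuming an existing
SparkSession named spark, e.g. in spark-shell, and the Oracle JDBC driver on the
classpath; hosts, service names, tables, credentials and the join column are
placeholders):

// Read one table from each Oracle instance as a DataFrame
val salesDF = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//oracle-host1:1521/ORCL1")   // placeholder instance 1
  .option("dbtable", "SCOTT.SALES")
  .option("user", "scott")
  .option("password", "changeme")
  .load()

val customersDF = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//oracle-host2:1521/ORCL2")   // placeholder instance 2
  .option("dbtable", "SCOTT.CUSTOMERS")
  .option("user", "scott")
  .option("password", "changeme")
  .load()

// Join the two DataFrames on a common key column (placeholder name)
val joinedDF = salesDF.join(customersDF, Seq("CUSTOMER_ID"))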

Would that be possible for data residing on two different HDFS clusters?

thanks


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

Re: fetching and joining data from two different clusters

Posted by Mich Talebzadeh <mi...@gmail.com>.
It is a proprietary solution to an open-source problem.

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: fetching and joining data from two different clusters

Posted by Jörn Franke <jo...@gmail.com>.
Sorry, cannot help you there - I do not know the cost for Isilon. I also cannot predict what the majority will do ...


Re: fetching and joining data from two different clusters

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks Jörn.

I have been told that Hadoop 3 (in alpha testing now) will support Docker
containers and virtualised Hadoop clusters.

Also, if we decided to use something like Isilon and BlueData to create
zoning (meaning two different Hadoop clusters migrated to Isilon storage,
each residing in its own zone/compartment) and virtualised clusters, we would
have to migrate two separate physical Hadoop clusters to Isilon and then
create the structure.

My point is that if we went that way, we would have to weigh up the cost and
effort of migrating two Hadoop clusters to Isilon against adding one Hadoop
cluster to the other to make one cluster out of two, while still keeping the
underlying HDFS file system. And then of course, how many companies are going
this way, and what is the overriding reason to use such an approach? What will
happen if we have performance issues - where do we pinpoint the bottleneck,
Isilon or the third-party Hadoop vendor? There is really no community to rely
on either.

Your thoughts?

Thanks


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: fetching and joining data from two different clusters

Posted by Jörn Franke <jo...@gmail.com>.
On HDFS you have storage policies where you can define SSD tiers etc.: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html

Not sure if this is a similar offering to what you refer to.

OpenStack Swift is similar to S3, but for your own data center: https://docs.openstack.org/developer/swift/associated_projects.html


Re: fetching and joining data from two different clusters

Posted by Mich Talebzadeh <mi...@gmail.com>.
In Isilon etc. you have an SSD layer, a middle layer and an archive layer where
data is moved. Can that be implemented in HDFS itself, Jörn? What is Swift? Is
that a low-level archive disk?

thanks

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: fetching and joining data from two different clusters

Posted by Jörn Franke <jo...@gmail.com>.
Well, this also happens if you use Amazon EMR - most data will be stored on S3, and there you also have no data locality. You can move it temporarily to HDFS or in-memory (Ignite), and you can use sampling etc. to avoid the need to process all the data. In fact, that is done in Spark machine learning algorithms (stochastic gradient descent etc.). This avoids having to move all the data through the network, and you lose only a little precision (and you can reason about that statistically).
For a lot of data I also see the trend that companies move it to cheap object storage (Swift etc.) anyway to reduce cost - particularly because it is not used often.
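
As a rough sketch of the sampling idea (assuming a SparkSession named spark; the
bucket, path, column name and 10% fraction below are placeholders):

// Read the remote, S3-resident data and work on a sample of it,
// caching the sample so it is not re-fetched over the network.
val events = spark.read.parquet("s3a://my-bucket/events")
val sampled = events.sample(withReplacement = false, fraction = 0.1, seed = 42L).cache()
sampled.groupBy("event_type").count().show()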



Re: fetching and joining data from two different clusters

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks Jörn.

If the idea is to separate compute from data using Isilon etc., then one is
going to lose data locality.

Also, the argument is that we would like to run queries/reports against two
independent clusters simultaneously, so the plan would be to:


   1. Use Isilon OneFS (https://en.wikipedia.org/wiki/OneFS_distributed_file_system)
      for Big Data, migrating two independent Hadoop clusters into Isilon OneFS
   2. Locate data from each cluster in its own zone in Isilon
   3. Run queries to combine data from each zone
   4. Use BlueData (https://www.bluedata.com/blog/2016/10/next-generation-big-data-with-dell-and-emc/)
      to create virtual Hadoop clusters on top of Isilon, so one isolates the
      performance impact of analytics/Data Science from other users


Now that is easier said than done, as usual. First you have to migrate the two
existing clusters' data into zones in Isilon. Then you are effectively
separating compute from data, so data locality is lost. This is no different
from your Spark cluster accessing data from each cluster. There are a lot of
tangential arguments here, like the claim that Isilon uses RAID so you don't
need to replicate your data three times (R3), and that even including the
Isilon licensing cost, the total cost goes down!

The side effect is the network, now that you have lost data locality: how fast
is your network going to be to handle the throughput? Networks are shared
across, say, a bank, unless you spend $$$ creating private InfiniBand networks.
Standard 10 Gbit/s is not going to be good enough.

Also, in reality BlueData does not need Isilon; it runs on HP and other
hardware as well. In Apache Hadoop 3.0 the Docker engine on YARN is available -
alpha currently, to be released at the end of this year. As we have not started
on Isilon, it may be worth looking at this as well?

Cheers




Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: fetching and joining data from two different clusters

Posted by Jörn Franke <jo...@gmail.com>.
It does not matter to Spark; you just put the full HDFS URL of the relevant namenode there. Of course the issue is that you lose data locality, but this would also be the case for Oracle.
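
A minimal sketch of that cross-cluster join (assuming a SparkSession named
spark, that both clusters are reachable from the Spark job, and that the
namenode hosts, ports, paths and join column below are placeholders):

// Each read names a different cluster's namenode explicitly in the URI.
val tradesDF = spark.read.parquet("hdfs://namenode-a:8020/data/trades")
val refDF    = spark.read.parquet("hdfs://namenode-b:8020/data/reference")

// Join the two DataFrames as usual; Spark pulls the blocks over the network.
val joinedDF = tradesDF.join(refDF, Seq("ticker"))
joinedDF.show(10, false)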
