
Posted to user@spark.apache.org by Jörn Franke <jo...@gmail.com> on 2017/06/01 07:21:08 UTC

Re: An Architecture question on the use of virtualised clusters

Hi,

I have done this (not Isilon, but another storage system). It can be efficient for small clusters, depending on how you design the network.

What I have also seen is the microservice approach with object stores (e.g. S3 in the cloud, Swift on premises), which is somewhat similar.

If you want additional performance you could fetch the data from the object stores and store it temporarily in a local HDFS. Not sure to what extent this affects regulatory requirements though.
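
A rough sketch of the "fetch, use, then discard" staging idea above. It is only an illustration under stated assumptions: an ordinary local file stands in for an object in the object store, a temporary directory stands in for the local HDFS staging area, and the names `staged_copy` and `count_lines_via_staging` are hypothetical. The point it tries to show is that the extra copy exists only for the duration of the job.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def staged_copy(source: Path):
    """Copy `source` into a throwaway directory and delete it on exit.

    Assumption: `source` stands in for an object fetched from the object
    store, and the temporary directory stands in for local HDFS staging.
    """
    staging_dir = Path(tempfile.mkdtemp(prefix="staging-"))
    try:
        local_path = staging_dir / source.name
        shutil.copy2(source, local_path)
        yield local_path
    finally:
        # Remove the temporary copy so that only the original remains.
        shutil.rmtree(staging_dir, ignore_errors=True)


def count_lines_via_staging(source: Path) -> int:
    """Example job: work on the staged copy instead of the original."""
    with staged_copy(source) as local_path:
        with local_path.open() as handle:
            return sum(1 for _ in handle)
```

Whether a short-lived copy like this is acceptable from a regulatory point of view is exactly the open question raised above; the sketch only makes the lifetime of the copy explicit.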

Best regards

> On 31. May 2017, at 18:07, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> Hi,
> 
> I realize this may not have direct relevance to Spark, but has anyone tried to create virtualized HDFS clusters using tools like Isilon or similar?
> 
> The prime motive behind this approach is to minimize the propagation or copying of data, which has regulatory implications. In short, you want your data to be in one place regardless of the artefacts used against it, such as Spark.
> 
> Thanks,
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>

Re: An Architecture question on the use of virtualised clusters

Posted by Mich Talebzadeh <mi...@gmail.com>.

My main concern is that the choice of Isilon is not for one use case. It
will be a strategic decision for the client, and if we decide to go that way
we are effectively moving away from HDFS principles (3x replication etc.) as
well.
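
To put a number on the replication point, here is a small back-of-the-envelope calculation. The 100 TB dataset size is only illustrative (it matches the POC size John mentions in this thread), the factor of 3 is the HDFS default replication, and `ASSUMED_ARRAY_OVERHEAD` is a purely hypothetical placeholder, not a vendor figure; it should be replaced with whatever protection overhead the storage vendor actually quotes.

```python
def raw_capacity_tb(logical_tb: float, overhead_factor: float) -> float:
    """Raw disk (TB) needed to hold `logical_tb` at a given overhead factor."""
    if logical_tb < 0 or overhead_factor < 1:
        raise ValueError("logical_tb must be >= 0 and overhead_factor >= 1")
    return logical_tb * overhead_factor


LOGICAL_TB = 100.0             # illustrative dataset size
HDFS_REPLICATION = 3.0         # HDFS default: three full copies of each block
ASSUMED_ARRAY_OVERHEAD = 1.25  # hypothetical placeholder, not a vendor figure

hdfs_raw = raw_capacity_tb(LOGICAL_TB, HDFS_REPLICATION)
array_raw = raw_capacity_tb(LOGICAL_TB, ASSUMED_ARRAY_OVERHEAD)

print(f"HDFS with 3x replication: {hdfs_raw:.0f} TB raw")
print(f"External array (assumed): {array_raw:.0f} TB raw")
print(f"Difference:               {hdfs_raw - array_raw:.0f} TB raw")
```

The raw-capacity difference is only one side of the cost question; it says nothing about licence cost, network design, or the loss of data locality discussed elsewhere in the thread.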

Granted one can argue this may be OK but of course we have to look at our
future needs. From my experience of these tools, you cannot simply roll it
back without incurring considerable work and considerable cost.

And after all will the cost justify the whole of this setup? How about
performance and other bottlenecks?

Thanks



Dr Mich Talebzadeh






On 5 June 2017 at 15:46, John Leach <jl...@splicemachine.com> wrote:

> Mich,
>
> Yes, Isilon is in production...
>
> Isilon is a serious product and has been around for quite a while.  For
> on-premise external storage, we see it quite a bit.  Separating the compute
> from the storage actually helps.  It is also a nice transition to the cloud
> providers.
>
> Have you looked at MapR?  Usually the system guys target snapshots,
> volumes, and posix compliance if they are bought into Isilon.
>
> Good luck Mich.
>
> Regards,
> John Leach
>
>
>
>
> On Jun 5, 2017, at 9:27 AM, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
> Hi John,
>
> Thanks. Did you end up in production? In other words, besides the PoC, did
> you use it in anger?
>
> The intention is to build Isilon on top of the whole HDFS cluster! If we
> go that way, we need to adopt it for DR as well.
>
> Cheers
>
>
>
> Dr Mich Talebzadeh
>
>
> On 5 June 2017 at 15:19, John Leach <jl...@splicemachine.com> wrote:
>
>> Mich,
>>
>> We used Isilon for a POC of Splice Machine (Spark for Analytics, HBase
>> for real-time).  We were concerned initially and the initial setup took a
>> bit longer than expected, but it performed well on both low latency and
>> high throughput use cases at scale (our POC ~ 100 TB).
>>
>> Just a data point.
>>
>> Regards,
>> John Leach
>>
>> On Jun 5, 2017, at 9:11 AM, Mich Talebzadeh <mi...@gmail.com>
>> wrote:
>>
>> I am concerned about the use case of tools like Isilon or Panasas to
>> create a layer on top of HDFS, essentially an HCFS on top of HDFS with the
>> usual 3x replication moved into the tool itself.
>>
>> There is interest in pushing Isilon forward as the solution, but my
>> caution is about the scalability and future-proofing of such tools. So I was
>> wondering if anyone else has tried such a solution.
>>
>> Thanks
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> On 2 June 2017 at 19:09, Gene Pang <ge...@gmail.com> wrote:
>>
>>> As Vincent mentioned earlier, I think Alluxio can work for this. You can mount
>>> your (potentially remote) storage systems to Alluxio
>>> <http://www.alluxio.org/docs/master/en/Unified-and-Transparent-Namespace.html>,
>>> and deploy Alluxio co-located to the compute cluster. The computation
>>> framework will still achieve data locality since Alluxio workers are
>>> co-located, even though the existing storage systems may be remote. You can
>>> also use tiered storage
>>> <http://www.alluxio.org/docs/master/en/Tiered-Storage-on-Alluxio.html>
>>> to deploy using only memory, and/or other physical media.
>>>
>>> Here are some blogs (Alluxio with Minio
>>> <https://www.alluxio.com/blog/scalable-genomics-data-processing-pipeline-with-alluxio-mesos-and-minio>,
>>> Alluxio with HDFS
>>> <https://www.alluxio.com/blog/qunar-performs-real-time-data-analytics-up-to-300x-faster-with-alluxio>,
>>> Alluxio with S3
>>> <https://www.alluxio.com/blog/accelerating-on-demand-data-analytics-with-alluxio>)
>>> which use similar architecture.
>>>
>>> Hope that helps,
>>> Gene
>>>
>>> On Thu, Jun 1, 2017 at 1:45 AM, Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> As a matter of interest what is the best way of creating virtualised
>>>> clusters all pointing to the same physical data?
>>>>
>>>> thanks
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> On 1 June 2017 at 09:27, vincent gromakowski <
>>>> vincent.gromakowski@gmail.com> wrote:
>>>>
>>>>> If mandatory, you can use a local cache like Alluxio
>>>>>
>>>>> On 1 June 2017 at 10:23 AM, "Mich Talebzadeh" <mi...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Vincent. I assume by physical data locality you mean you are
>>>>>> going through Isilon and HCFS and not through direct HDFS.
>>>>>>
>>>>>> Also I agree with you that the shared network could be an issue as well.
>>>>>> However, it allows you to reduce data redundancy (you do not need R3 in
>>>>>> HDFS anymore) and you can also build virtual clusters on the same data. One
>>>>>> cluster for reads/writes and another for reads? That is what has been
>>>>>> suggested!
>>>>>>
>>>>>> regards
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>> On 1 June 2017 at 08:55, vincent gromakowski <
>>>>>> vincent.gromakowski@gmail.com> wrote:
>>>>>>
>>>>>>> I don't recommend this kind of design because you lose physical
>>>>>>> data locality and you will be affected by "bad neighbours" that are also
>>>>>>> using the network storage... We have one similar design but restricted to
>>>>>>> small clusters (more for experiments than production)
>>>>>>>
>>>>>>> 2017-06-01 9:47 GMT+02:00 Mich Talebzadeh <mich.talebzadeh@gmail.com
>>>>>>> >:
>>>>>>>
>>>>>>>> Thanks Jorn,
>>>>>>>>
>>>>>>>> This was a proposal made by someone, as the firm is already using
>>>>>>>> this tool on other SAN-based storage and wants to extend it to Big Data.
>>>>>>>>
>>>>>>>> On paper it seems like a good idea; in practice it may be a
>>>>>>>> Wandisco scenario again. Of course, as ever, one needs to ask EMC for
>>>>>>>> reference calls and whether anyone is using this product in anger.
>>>>>>>>
>>>>>>>>
>>>>>>>> At the end of the day it's not HDFS.  It is OneFS with an HCFS API.
>>>>>>>> However, that may suit our needs.  But we would need to PoC it and test it
>>>>>>>> thoroughly!
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>>
>>>>>>>> Dr Mich Talebzadeh

Re: An Architecture question on the use of virtualised clusters

Posted by John Leach <jl...@splicemachine.com>.

Mich,

Yes, Isilon is in production...

Isilon is a serious product and has been around for quite a while.  For on-premise external storage, we see it quite a bit.  Separating the compute from the storage actually helps.  It is also a nice transition to the cloud providers.  

Have you looked at MapR?  Usually the system guys target snapshots, volumes, and POSIX compliance if they are bought into Isilon.

Good luck Mich.

Regards,
John Leach





Re: An Architecture question on the use of virtualised clusters

Posted by Mich Talebzadeh <mi...@gmail.com>.

Hi John,

Thanks. Did you end up in production? In other words, besides the PoC, did
you use it in anger?

The intention is to build Isilon on top of the whole HDFS cluster! If we
go that way, we need to adopt it for DR as well.

Cheers



Dr Mich Talebzadeh





On 5 June 2017 at 15:19, John Leach <jl...@splicemachine.com> wrote:

> Mich,
>
> We used Isilon for a POC of Splice Machine (Spark for Analytics, HBase for
> real-time).  We were concerned initially and the initial setup took a bit
> longer than excepted, but it performed well on both low latency and high
> throughput use cases at scale (our POC ~ 100 TB).
>
> Just a data point.
>
> Regards,
> John Leach
>
> On Jun 5, 2017, at 9:11 AM, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
> I am concerned about the use case of tools like Isilon or Panasas to
> create a layer on top of HDFS, essentially a HCFS on top of HDFS with the
> usual 3x replication gone into the tool itself.
>
> There is interest to push Isilon  as a the solution forward but my caution
> is about scalability and future proof of such tools. So I was wondering if
> anyone else has tried such solution.
>
> Thanks
>
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 June 2017 at 19:09, Gene Pang <ge...@gmail.com> wrote:
>
>> As Vincent mentioned earlier, I think Alluxio can work for this. You can mount
>> your (potentially remote) storage systems to Alluxio
>> <http://www.alluxio.org/docs/master/en/Unified-and-Transparent-Namespace.html>,
>> and deploy Alluxio co-located to the compute cluster. The computation
>> framework will still achieve data locality since Alluxio workers are
>> co-located, even though the existing storage systems may be remote. You can
>> also use tiered storage
>> <http://www.alluxio.org/docs/master/en/Tiered-Storage-on-Alluxio.html>
>> to deploy using only memory, and/or other physical media.
>>
>> Here are some blogs (Alluxio with Minio
>> <https://www.alluxio.com/blog/scalable-genomics-data-processing-pipeline-with-alluxio-mesos-and-minio>,
>> Alluxio with HDFS
>> <https://www.alluxio.com/blog/qunar-performs-real-time-data-analytics-up-to-300x-faster-with-alluxio>,
>> Alluxio with S3
>> <https://www.alluxio.com/blog/accelerating-on-demand-data-analytics-with-alluxio>)
>> which use similar architecture.
>>
>> Hope that helps,
>> Gene
>>
>> On Thu, Jun 1, 2017 at 1:45 AM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> As a matter of interest what is the best way of creating virtualised
>>> clusters all pointing to the same physical data?
>>>
>>> thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 1 June 2017 at 09:27, vincent gromakowski <
>>> vincent.gromakowski@gmail.com> wrote:
>>>
>>>> If mandatory, you can use a local cache like alluxio
>>>>
>>>> Le 1 juin 2017 10:23 AM, "Mich Talebzadeh" <mi...@gmail.com>
>>>> a écrit :
>>>>
>>>>> Thanks Vincent. I assume by physical data locality you mean you are
>>>>> going through Isilon and HCFS and not through direct HDFS.
>>>>>
>>>>> Also I agree with you that shared network could be an issue as well.
>>>>> However, it allows you to reduce data redundancy (you do not need R3 in
>>>>> HDFS anymore) and also you can build virtual clusters on the same data. One
>>>>> cluster for read/writes and another for Reads? That is what has been
>>>>> suggestes!.
>>>>>
>>>>> regards
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>> On 1 June 2017 at 08:55, vincent gromakowski <
>>>>> vincent.gromakowski@gmail.com> wrote:
>>>>>
>>>>>> I don't recommend this kind of design because you loose physical data
>>>>>> locality and you will be affected by "bad neighboors" that are also using
>>>>>> the network storage... We have one similar design but restricted to small
>>>>>> clusters (more for experiments than production)
>>>>>>
>>>>>> 2017-06-01 9:47 GMT+02:00 Mich Talebzadeh <mi...@gmail.com>
>>>>>> :
>>>>>>
>>>>>>> Thanks Jorn,
>>>>>>>
>>>>>>> This was a proposal made by someone as the firm is already using
>>>>>>> this tool on other SAN based storage and extend it to Big Data
>>>>>>>
>>>>>>> On paper it seems like a good idea, in practice it may be a Wandisco
>>>>>>> scenario again..  Of course as ever one needs to EMC for reference calls
>>>>>>> ans whether anyone is using this product in anger.
>>>>>>>
>>>>>>>
>>>>>>> At the end of the day it's not HDFS.  It is OneFS with a HCFS API.
>>>>>>>  However that may suit our needs.  But  would need to PoC it and test it
>>>>>>> thoroughly!
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>>
>>>>>>>
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 1 June 2017 at 08:21, Jörn Franke <jo...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have done this (not Isilon, but another storage system). It can
>>>>>>>> be efficient for small clusters and depending on how you design the network.
>>>>>>>>
>>>>>>>> What I have also seen is the microservice approach with object
>>>>>>>> stores (e.g. In the cloud s3, on premise swift) which is somehow also
>>>>>>>> similar.
>>>>>>>>
>>>>>>>> If you want additional performance you could fetch the data from
>>>>>>>> the object stores and store it temporarily in a local HDFS. Not sure to
>>>>>>>> what extent this affects regulatory requirements though.
>>>>>>>>
>>>>>>>> Best regards
>>>>>>>>
>>>>>>>> On 31. May 2017, at 18:07, Mich Talebzadeh <
>>>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I realize this may not have direct relevance to Spark but has
>>>>>>>> anyone tried to create virtualized HDFS clusters using tools like ISILON or
>>>>>>>> similar?
>>>>>>>>
>>>>>>>> The prime motive behind this approach is to minimize the
>>>>>>>> propagation or copying of data, which has regulatory implications. In
>>>>>>>> short, you want your data to be in one place regardless of the artefacts
>>>>>>>> used against it, such as Spark.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>

Re: An Architecture question on the use of virtualised clusters

Posted by John Leach <jl...@splicemachine.com>.

Mich,

We used Isilon for a POC of Splice Machine (Spark for analytics, HBase for real-time). We were concerned initially, and the initial setup took a bit longer than expected, but it performed well on both low-latency and high-throughput use cases at scale (our POC was ~100 TB).

Just a data point.

Regards,
John Leach


Re: An Architecture question on the use of virtualised clusters

Posted by Mich Talebzadeh <mi...@gmail.com>.

I am concerned about using tools like Isilon or Panasas to create a layer
on top of HDFS, essentially an HCFS on top of HDFS, with the usual 3x
replication moved into the tool itself.

There is interest in pushing Isilon forward as the solution, but my caution
is about the scalability and future-proofing of such tools. So I was wondering
if anyone else has tried such a solution.
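
To put the replication point in numbers, here is a rough sketch comparing the raw capacity needed under HDFS-style 3x replication with an N+M erasure-coded layout of the kind storage appliances typically use. The 8+2 layout and the function names are assumptions made purely for illustration, not a statement of how any particular Isilon cluster is configured; the 100 TB figure is borrowed from the POC size reported elsewhere in this thread.

```python
def raw_capacity_replicated(logical_tb: float, replicas: int = 3) -> float:
    """Raw storage consumed when every block is stored `replicas` times."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return logical_tb * replicas


def raw_capacity_erasure_coded(logical_tb: float, data_units: int, parity_units: int) -> float:
    """Raw storage for an N+M layout: each stripe holds N data units plus M parity units."""
    if data_units < 1 or parity_units < 0:
        raise ValueError("need data_units >= 1 and parity_units >= 0")
    return logical_tb * (data_units + parity_units) / data_units


if __name__ == "__main__":
    logical = 100.0  # TB of logical data (assumed, matching the POC size in the thread)
    # Hypothetical 8+2 layout, chosen only to make the comparison concrete.
    print("3x replication:", raw_capacity_replicated(logical), "TB raw")
    print("8+2 erasure coding:", raw_capacity_erasure_coded(logical, 8, 2), "TB raw")
```

The saving in raw capacity is real, but it says nothing about throughput or locality, which is where the scalability question above would have to be answered by a PoC.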

Thanks



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: An Architecture question on the use of virtualised clusters

Posted by Gene Pang <ge...@gmail.com>.

As Vincent mentioned earlier, I think Alluxio can work for this. You can mount
your (potentially remote) storage systems to Alluxio
<http://www.alluxio.org/docs/master/en/Unified-and-Transparent-Namespace.html>,
and deploy Alluxio co-located with the compute cluster. The computation
framework will still achieve data locality since Alluxio workers are
co-located, even though the existing storage systems may be remote. You can
also use tiered storage
<http://www.alluxio.org/docs/master/en/Tiered-Storage-on-Alluxio.html> to
deploy using only memory, and/or other physical media.

Here are some blogs (Alluxio with Minio
<https://www.alluxio.com/blog/scalable-genomics-data-processing-pipeline-with-alluxio-mesos-and-minio>,
Alluxio with HDFS
<https://www.alluxio.com/blog/qunar-performs-real-time-data-analytics-up-to-300x-faster-with-alluxio>,
Alluxio with S3
<https://www.alluxio.com/blog/accelerating-on-demand-data-analytics-with-alluxio>)
which use a similar architecture.
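
As a minimal sketch of what this looks like from the Spark side: once a storage system is mounted into the Alluxio namespace, a job addresses it through an alluxio:// URI instead of the underlying store's own scheme. The host name, mount path and helper names below are hypothetical, the port is Alluxio's usual default master port, and the Alluxio client jar is assumed to be on the Spark classpath already.

```python
def alluxio_uri(master_host: str, path: str, port: int = 19998) -> str:
    """Build an alluxio:// URI for a path inside the Alluxio namespace."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute within the Alluxio namespace")
    return f"alluxio://{master_host}:{port}{path}"


def load_parquet(spark, master_host: str, path: str):
    """Read a Parquet dataset through Alluxio.

    `spark` is an existing pyspark.sql.SparkSession; it is passed in so that
    this sketch does not need Spark installed just to build the URI.
    """
    return spark.read.parquet(alluxio_uri(master_host, path))


if __name__ == "__main__":
    # Hypothetical master host and mount point, for illustration only.
    print(alluxio_uri("alluxio-master.example.com", "/mnt/isilon/events"))
```

The same URI could be handed to more than one Spark cluster. Note that Alluxio workers cache blocks locally, so whether that counts as a copy for regulatory purposes would still need checking.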

Hope that helps,
Gene


Re: An Architecture question on the use of virtualised clusters

Posted by Mich Talebzadeh <mi...@gmail.com>.

As a matter of interest, what is the best way of creating virtualised
clusters that all point to the same physical data?

thanks

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: An Architecture question on the use of virtualised clusters

Posted by vincent gromakowski <vi...@gmail.com>.

If this is mandatory, you can use a local cache like Alluxio.

On 1 June 2017 at 10:23 AM, "Mich Talebzadeh" <mi...@gmail.com>
wrote:

> Thanks Vincent. I assume by physical data locality you mean you are going
> through Isilon and HCFS and not through direct HDFS.
>
> Also I agree with you that the shared network could be an issue as well.
> However, it allows you to reduce data redundancy (you do not need R3 in
> HDFS anymore) and you can also build virtual clusters on the same data: one
> cluster for reads/writes and another for reads. That is what has been
> suggested.
>
> regards
>
> Dr Mich Talebzadeh
>
> On 1 June 2017 at 08:55, vincent gromakowski <
> vincent.gromakowski@gmail.com> wrote:
>
>> I don't recommend this kind of design because you lose physical data
>> locality and you will be affected by "bad neighbours" that are also using
>> the network storage... We have one similar design but restricted to small
>> clusters (more for experiments than production)
>>
>> 2017-06-01 9:47 GMT+02:00 Mich Talebzadeh <mi...@gmail.com>:
>>
>>> Thanks Jorn,
>>>
>>> This was a proposal made by someone, as the firm is already using this
>>> tool on other SAN-based storage and would extend it to Big Data.
>>>
>>> On paper it seems like a good idea; in practice it may be a Wandisco
>>> scenario again. Of course, as ever, one needs to ask EMC for reference calls
>>> and whether anyone is using this product in anger.
>>>
>>> At the end of the day it's not HDFS. It is OneFS with an HCFS API.
>>> However, that may suit our needs, but we would need to PoC it and test
>>> it thoroughly!
>>>
>>>
>>> Cheers
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 1 June 2017 at 08:21, Jörn Franke <jo...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have done this (not Isilon, but another storage system). It can be
>>>> efficient for small clusters and depending on how you design the network.
>>>>
>>>> What I have also seen is the microservice approach with object stores
>>>> (e.g. In the cloud s3, on premise swift) which is somehow also similar.
>>>>
>>>> If you want additional performance you could fetch the data from the
>>>> object stores and store it temporarily in a local HDFS. Not sure to what
>>>> extent this affects regulatory requirements though.
>>>>
>>>> Best regards
>>>>
>>>> On 31. May 2017, at 18:07, Mich Talebzadeh <mi...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I realize this may not have direct relevance to Spark but has anyone
>>>> tried to create virtualized HDFS clusters using tools like ISILON or
>>>> similar?
>>>>
>>>> The prime motive behind this approach is to minimize the propagation or
>>>> copy of data which has regulatory implication. In shoret you want your data
>>>> to be in one place regardless of artefacts used against it such as Spark?
>>>>
>>>> Thanks,
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Re: An Architecture question on the use of virtualised clusters

Posted by Mich Talebzadeh <mi...@gmail.com>.

Thanks Vincent. I assume by physical data locality you mean you are going
through Isilon and HCFS and not through direct HDFS.

I also agree that the shared network could be an issue. However, this
design lets you reduce data redundancy (you no longer need 3x replication
in HDFS), and you can build virtual clusters on the same data, say one
cluster for reads and writes and another for reads only. That is what has
been suggested.
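
To make the redundancy argument concrete, here is a rough sketch of the raw capacity needed for the same logical data under HDFS 3x replication versus a shared store with a lower protection overhead. The 100 TB dataset size and the 1.25x overhead are made-up figures for illustration only, not Isilon specifications.

```python
def raw_capacity_tb(logical_tb, overhead_factor):
    """Raw disk needed to hold `logical_tb` of data at a given redundancy overhead."""
    return logical_tb * overhead_factor


logical_tb = 100.0             # hypothetical dataset size
hdfs_replication = 3.0         # HDFS default replication factor
shared_store_overhead = 1.25   # assumed protection overhead, illustration only

hdfs_raw = raw_capacity_tb(logical_tb, hdfs_replication)
shared_raw = raw_capacity_tb(logical_tb, shared_store_overhead)
saved = hdfs_raw - shared_raw

print(f"HDFS 3x: {hdfs_raw} TB, shared store: {shared_raw} TB, saved: {saved} TB")
```

The saving is in disk only; it says nothing about the network cost of reading that single copy from every compute node.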

regards

Dr Mich Talebzadeh






On 1 June 2017 at 08:55, vincent gromakowski <vi...@gmail.com>
wrote:

> I don't recommend this kind of design because you lose physical data
> locality and you will be affected by "bad neighbours" that are also using
> the network storage. We have one similar design, but it is restricted to
> small clusters (more for experiments than production).

Re: An Architecture question on the use of virtualised clusters

Posted by vincent gromakowski <vi...@gmail.com>.

I don't recommend this kind of design because you lose physical data
locality and you will be affected by "bad neighbours" that are also using
the network storage. We have one similar design, but it is restricted to
small clusters (more for experiments than production).

2017-06-01 9:47 GMT+02:00 Mich Talebzadeh <mi...@gmail.com>:

> Thanks Jorn,
>
> This was a proposal made by someone, as the firm is already using this tool
> on other SAN-based storage and wants to extend it to Big Data.
>
> On paper it seems like a good idea; in practice it may be a Wandisco
> scenario again. Of course, as ever, one needs to ask EMC for reference calls
> and find out whether anyone is using this product in anger.
>
> At the end of the day it's not HDFS. It is OneFS with an HCFS API. However,
> that may suit our needs, but we would need to PoC it and test it
> thoroughly!
>
>
> Cheers
>
>
>
> Dr Mich Talebzadeh

Re: An Architecture question on the use of virtualised clusters

Posted by Mich Talebzadeh <mi...@gmail.com>.

Thanks Jorn,

This was a proposal made by someone, as the firm is already using this tool
on other SAN-based storage and wants to extend it to Big Data.

On paper it seems like a good idea; in practice it may be a Wandisco
scenario again. Of course, as ever, one needs to ask EMC for reference calls
and find out whether anyone is using this product in anger.

At the end of the day it's not HDFS. It is OneFS with an HCFS API. However,
that may suit our needs, but we would need to PoC it and test it thoroughly!


Cheers



Dr Mich Talebzadeh






On 1 June 2017 at 08:21, Jörn Franke <jo...@gmail.com> wrote:

> Hi,
>
> I have done this (not Isilon, but another storage system). It can be
> efficient for small clusters and depending on how you design the network.
>
> What I have also seen is the microservice approach with object stores
> (e.g. In the cloud s3, on premise swift) which is somehow also similar.
>
> If you want additional performance you could fetch the data from the
> object stores and store it temporarily in a local HDFS. Not sure to what
> extent this affects regulatory requirements though.
>
> Best regards