Posted to dev@mesos.apache.org by Adam Bordelon <ad...@mesosphere.io> on 2014/06/26 00:53:24 UTC

Re: HDFS on Mesos

+dev@

I think it makes a lot of sense to run Distributed File Systems on top of
Mesos, whether that be HDFS, MapRFS, Lustre, BitTorrent, or whatever. HDFS
is very popular with Mesos users, and is currently supported as an executor
fetching protocol/source. I would love to see an HDFS framework for Mesos.
Below are my thoughts.

** Advantages:
+ Fault-tolerance/HA: Automatically restart failed NameNodes, always have
enough standbys.
+ Shared resources: Mesos can allocate/isolate resources for HDFS NN/DN
processes alongside other frameworks
+ Scaling: Easily scale up/down the number of DataNodes as the cluster grows

** Challenges:
Launching NameNode (NN) tasks:
- Need multiple NNs, for HA-standbys and/or federated NNs.
- Do we have to manually configure federated namespaces?
- Could we use the Mesos replicated log for NN-HA's edit log, instead of
NFS or the JournalNodes?

Launching DataNode (DN) tasks:
- DNs must be started with names of all(?) NNs, register/update with each.
Need a svc-discovery tool, or just start all NNs first, then start DNs with
known NNs? How to update when NNs move? Use ZK to track?

The Bootstrap problem:
- Where to fetch the NN/DN executors/tasks? Could use another HDFS cluster,
S3/HTTP/FTP, or pre-install the binaries on each slave.

Migrating an existing HDFS cluster:
- Is it possible to do a migration from raw HDFS to HDFS-on-Mesos without
moving the data?

Data Residency:
- Should we destroy the sandbox/hdfs-data when shutting down a DN?
- If starting DN on node that was previously running a DN, can/should we
try to revive the existing data?

Topology constraints:
- Must guarantee only one DN (per fwk) per slave, only one NN (per fwk) per
slave.
- Wouldn't want NNs (or replicated blocks?) to live on the same physical
node/rack. Could use attributes to express topology (rough sketch below).

Kerberos integration:
- How to ensure that NN has access to the KDC and/or required
keytabs/credentials?
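
To make the topology constraints above a bit more concrete, here is a rough
sketch of the bookkeeping a scheduler could do against slave attributes. It
is illustrative only: the Offer stand-in, the "rack" attribute name, and the
per-rack limit are assumptions, not an existing Mesos API.

from collections import namedtuple

# Minimal stand-in for the fields of a Mesos offer used here (not the real
# protobuf).
Offer = namedtuple("Offer", ["slave_id", "hostname", "attributes"])

class DataNodePlacement(object):
    """Track where DataNodes run so we never double up on a slave and we
    spread DNs across racks (rack taken from a hypothetical slave attribute)."""

    def __init__(self, max_datanodes_per_rack=2):
        self.max_per_rack = max_datanodes_per_rack
        self.datanode_slaves = set()   # slave ids already running a DN
        self.rack_counts = {}          # rack attribute value -> DN count

    def should_launch(self, offer):
        if offer.slave_id in self.datanode_slaves:
            return False               # only one DN per slave for this framework
        rack = offer.attributes.get("rack", "unknown")
        if self.rack_counts.get(rack, 0) >= self.max_per_rack:
            return False               # don't pile DNs (and replicas) onto one rack
        return True

    def record_launch(self, offer):
        self.datanode_slaves.add(offer.slave_id)
        rack = offer.attributes.get("rack", "unknown")
        self.rack_counts[rack] = self.rack_counts.get(rack, 0) + 1

if __name__ == "__main__":
    placement = DataNodePlacement()
    offer = Offer("slave-1", "dn1.example.com", {"rack": "r1"})
    if placement.should_launch(offer):
        placement.record_launch(offer)  # a real scheduler would launchTasks() here
    print(placement.rack_counts)        # {'r1': 1}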



On Wed, Jun 25, 2014 at 6:17 AM, Vladimir Vivien <vl...@gmail.com>
wrote:

> +1 wondered about this.  Would love to hear pros/cons.
>
>
> On Wed, Jun 25, 2014 at 8:00 AM, Maxime Brugidou <
> maxime.brugidou@gmail.com> wrote:
>
>> Hi Mesos Community,
>>
>> I am a bit surprised to see that no one has done a framework to run HDFS
>> on top of Mesos directly since a lot of people seem to use HDFS in the
>> community. HDFS seems to be managed separately from Mesos (but will
>> probably run on the same machines). Is there any reason for that?
>>
>> My understanding is that using Mesos to manage all resources and having
>> HDFS on top of it makes much more sense (just like a FS runs inside an OS,
>> not on the side).
>>
>> Is it technical complexity? (we run HDFS and YARN with HA, journalnodes
>> and Kerberos Security and it is definitely a beast) Is it because no one
>> really feels the need for this since they are already running HDFS on the
>> side close to the hardware and don't want to waste time having it in Mesos?
>>
>> Best
>> Maxime
>>
>
>
>
> --
> Vladimir Vivien
>

Re: HDFS on Mesos

Posted by Vetoshkin Nikita <ni...@gmail.com>.
Here we go: https://issues.apache.org/jira/browse/MESOS-1554. Though I
didn't know how to name an epic, so I set the ticket type to "Story".


On Fri, Jun 27, 2014 at 1:22 AM, Benjamin Hindman <be...@eecs.berkeley.edu>
wrote:

> Wanted to jump in here and provide some context on 'persistent resources'.
> As Vinod mentioned, this is how we're thinking about enabling storage-like
> frameworks on Mesos.
>
> The idea originally came about because, even today, if we allocate some
> file system space to a task/executor, and then that task/executor
> terminates, we haven't officially "freed" those file system resources until
> after we garbage collect the task/executor sandbox! (We keep the sandbox
> around so a user/operator can get the stdout/stderr or anything else left
> around from their task/executor.)
>
> To solve this problem we wanted to be able to let a task/executor terminate
> but not *give up* all of its resources, hence: persistent resources.
>
> Pushing this concept even further you could imagine always reallocating
> resources to a framework that had already been allocated those resources
> for a previous task/executor. Looked at from another perspective, these are
> "late-binding", or "lazy", resource reservations.
>
> At one point in time we had considered just doing 'right-of-first-refusal'
> for allocations after a task/executor terminates. But this is really
> insufficient for supporting storage-like frameworks well (and likely even
> harder to reliably implement than 'persistent resources' IMHO).
>
> There are a ton of things that need to get worked out in this model,
> including (but not limited to), how should a file system (or disk) be
> exposed in order to be made persistent? How should persistent resources be
> returned to a master? How many persistent resources can a framework get
> allocated?
>
> The right place to capture this all is in an "Epic" ticket on JIRA. Nikita,
> do you want to create a ticket? If not, no worries, I'm happy to create the
> ticket. Really looking forward to seeing this develop!
>
> Ben.
>
>
>
>
> On Thu, Jun 26, 2014 at 11:33 AM, Vinod Kone <vi...@gmail.com> wrote:
>
> > SGTM. Feel free to create the ticket!
> >
> >
> > On Thu, Jun 26, 2014 at 11:20 AM, Vetoshkin Nikita <
> > nikita.vetoshkin@gmail.com> wrote:
> >
> > > Thanks, Vinod! I really like the "persistent resources" idea. Maybe
> there
> > > should be a ticket for discussion and brainstorming?
> > > On Jun 26, 2014 11:06 PM, "Vinod Kone" <vi...@gmail.com> wrote:
> > >
> > > > As Maxime mentioned, the long term solution is for Mesos to support
> the
> > > > notion of "persistent resources" i.e., resources that stay (and
> > accounted
> > > > for) after the life cycle of task/executor. The idea still needs
> > fleshing
> > > > out.
> > > >
> > > >
> > > > On Thu, Jun 26, 2014 at 8:23 AM, Vetoshkin Nikita <
> > > > nikita.vetoshkin@gmail.com> wrote:
> > > >
> > > > > What about long term solution? Any ideas? Twitter's Manhattan
> > database
> > > > > claims to use Mesos for scaling up and down. Can you shed some
> light
> > > how
> > > > do
> > > > > they deal with the situation like this?
> > > > > On Jun 26, 2014 5:01 AM, "Vinod Kone" <vi...@gmail.com> wrote:
> > > > >
> > > > > > Thanks for listing this out Adam.
> > > > > >
> > > > > > Data Residency:
> > > > > > > - Should we destroy the sandbox/hdfs-data when shutting down a
> > DN?
> > > > > > > - If starting DN on node that was previously running a DN,
> > > can/should
> > > > > we
> > > > > > > try to revive the existing data?
> > > > > > >
> > > > > >
> > > > > > I think this is one of the key challenges for a production
> quality
> > > HDFS
> > > > > on
> > > > > > Mesos. Currently, since sandbox is deleted after a task exits, if
> > all
> > > > the
> > > > > > data nodes that hold a block (and its replicas) get lost/killed
> for
> > > > > > whatever reason there would be data loss. A short term solution
> > > would
> > > > be
> > > > > > to write outside sandbox and use slave attributes to track where
> to
> > > > > > re-launch data node tasks.
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: HDFS on Mesos

Posted by Benjamin Hindman <be...@eecs.berkeley.edu>.
Wanted to jump in here and provide some context on 'persistent resources'.
As Vinod mentioned, this is how we're thinking about enabling storage-like
frameworks on Mesos.

The idea originally came about because, even today, if we allocate some
file system space to a task/executor, and then that task/executor
terminates, we haven't officially "freed" those file system resources until
after we garbage collect the task/executor sandbox! (We keep the sandbox
around so a user/operator can get the stdout/stderr or anything else left
around from their task/executor.)

To solve this problem we wanted to be able to let a task/executor terminate
but not *give up* all of its resources, hence: persistent resources.

Pushing this concept even further you could imagine always reallocating
resources to a framework that had already been allocated those resources
for a previous task/executor. Looked at from another perspective, these are
"late-binding", or "lazy", resource reservations.

At one point in time we had considered just doing 'right-of-first-refusal'
for allocations after a task/executor terminates. But this is really
insufficient for supporting storage-like frameworks well (and likely even
harder to reliably implement than 'persistent resources' IMHO).

There are a ton of things that need to get worked out in this model,
including (but not limited to), how should a file system (or disk) be
exposed in order to be made persistent? How should persistent resources be
returned to a master? How many persistent resources can a framework get
allocated?
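
Purely as a strawman for those questions, here is one hypothetical shape a
persistent disk resource could take if it carried a persistence id and a
volume mapping. Every field name below is invented for discussion; none of
this exists in Mesos today.

# Strawman only: field names are invented, not part of the Mesos protobufs.
persistent_disk = {
    "name": "disk",
    "type": "SCALAR",
    "scalar": {"value": 10240},         # MB of disk
    "persistence": {
        "id": "hdfs-dn-volume-1",       # survives task/executor termination
        "framework_id": "hdfs-on-mesos",
    },
    "volume": {
        "container_path": "hdfs/data",  # where the task would see its data
        "mode": "RW",
    },
}

def is_my_volume(resource, framework_id):
    """A framework re-offered this resource could recognise its own volume by
    the persistence/framework ids and re-launch a DataNode on top of the
    existing blocks instead of an empty sandbox."""
    persistence = resource.get("persistence")
    return bool(persistence) and persistence.get("framework_id") == framework_id

print(is_my_volume(persistent_disk, "hdfs-on-mesos"))  # True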

The right place to capture this all is in an "Epic" ticket on JIRA. Nikita,
do you want to create a ticket? If not, no worries, I'm happy to create the
ticket. Really looking forward to seeing this develop!

Ben.




On Thu, Jun 26, 2014 at 11:33 AM, Vinod Kone <vi...@gmail.com> wrote:

> SGTM. Feel free to create the ticket!
>
>
> On Thu, Jun 26, 2014 at 11:20 AM, Vetoshkin Nikita <
> nikita.vetoshkin@gmail.com> wrote:
>
> > Thanks, Vinod! I really like the "persistent resources" idea. Maybe there
> > should be a ticket for discussion and brainstorming?
> > On Jun 26, 2014 11:06 PM, "Vinod Kone" <vi...@gmail.com> wrote:
> >
> > > As Maxime mentioned, the long term solution is for Mesos to support the
> > > notion of "persistent resources" i.e., resources that stay (and
> accounted
> > > for) after the life cycle of task/executor. The idea still needs
> fleshing
> > > out.
> > >
> > >
> > > On Thu, Jun 26, 2014 at 8:23 AM, Vetoshkin Nikita <
> > > nikita.vetoshkin@gmail.com> wrote:
> > >
> > > > What about long term solution? Any ideas? Twitter's Manhattan
> database
> > > > claims to use Mesos for scaling up and down. Can you shed some light
> > how
> > > do
> > > > they deal with the situation like this?
> > > > On Jun 26, 2014 5:01 AM, "Vinod Kone" <vi...@gmail.com> wrote:
> > > >
> > > > > Thanks for listing this out Adam.
> > > > >
> > > > > Data Residency:
> > > > > > - Should we destroy the sandbox/hdfs-data when shutting down a
> DN?
> > > > > > - If starting DN on node that was previously running a DN,
> > can/should
> > > > we
> > > > > > try to revive the existing data?
> > > > > >
> > > > >
> > > > > I think this is one of the key challenges for a production quality
> > HDFS
> > > > on
> > > > > Mesos. Currently, since sandbox is deleted after a task exits, if
> all
> > > the
> > > > > data nodes that hold a block (and its replicas) get lost/killed for
> > > > > whatever reason there would be data loss. A short term solution
> > would
> > > be
> > > > > to write outside sandbox and use slave attributes to track where to
> > > > > re-launch data node tasks.
> > > > >
> > > >
> > >
> >
>

Re: HDFS on Mesos

Posted by Vinod Kone <vi...@gmail.com>.
SGTM. Feel free to create the ticket!


On Thu, Jun 26, 2014 at 11:20 AM, Vetoshkin Nikita <
nikita.vetoshkin@gmail.com> wrote:

> Thanks, Vinod! I really like the "persistent resources" idea. Maybe there
> should be a ticket for discussion and brainstorming?
> On Jun 26, 2014 11:06 PM, "Vinod Kone" <vi...@gmail.com> wrote:
>
> > As Maxime mentioned, the long term solution is for Mesos to support the
> > notion of "persistent resources" i.e., resources that stay (and accounted
> > for) after the life cycle of task/executor. The idea still needs fleshing
> > out.
> >
> >
> > On Thu, Jun 26, 2014 at 8:23 AM, Vetoshkin Nikita <
> > nikita.vetoshkin@gmail.com> wrote:
> >
> > > What about long term solution? Any ideas? Twitter's Manhattan database
> > > claims to use Mesos for scaling up and down. Can you shed some light
> how
> > do
> > > they deal with the situation like this?
> > > On Jun 26, 2014 5:01 AM, "Vinod Kone" <vi...@gmail.com> wrote:
> > >
> > > > Thanks for listing this out Adam.
> > > >
> > > > Data Residency:
> > > > > - Should we destroy the sandbox/hdfs-data when shutting down a DN?
> > > > > - If starting DN on node that was previously running a DN,
> can/should
> > > we
> > > > > try to revive the existing data?
> > > > >
> > > >
> > > > I think this is one of the key challenges for a production quality
> HDFS
> > > on
> > > > Mesos. Currently, since sandbox is deleted after a task exits, if all
> > the
> > > > data nodes that hold a block (and its replicas) get lost/killed for
> > > > whatever reason there would be data loss. A short term solution
> would
> > be
> > > > to write outside sandbox and use slave attributes to track where to
> > > > re-launch data node tasks.
> > > >
> > >
> >
>

Re: HDFS on Mesos

Posted by Vetoshkin Nikita <ni...@gmail.com>.
Thanks, Vinod! I really like the "persistent resources" idea. Maybe there
should be a ticket for discussion and brainstorming?
On Jun 26, 2014 11:06 PM, "Vinod Kone" <vi...@gmail.com> wrote:

> As Maxime mentioned, the long term solution is for Mesos to support the
> notion of "persistent resources" i.e., resources that stay (and accounted
> for) after the life cycle of task/executor. The idea still needs fleshing
> out.
>
>
> On Thu, Jun 26, 2014 at 8:23 AM, Vetoshkin Nikita <
> nikita.vetoshkin@gmail.com> wrote:
>
> > What about long term solution? Any ideas? Twitter's Manhattan database
> > claims to use Mesos for scaling up and down. Can you shed some light how
> do
> > they deal with the situation like this?
> > On Jun 26, 2014 5:01 AM, "Vinod Kone" <vi...@gmail.com> wrote:
> >
> > > Thanks for listing this out Adam.
> > >
> > > Data Residency:
> > > > - Should we destroy the sandbox/hdfs-data when shutting down a DN?
> > > > - If starting DN on node that was previously running a DN, can/should
> > we
> > > > try to revive the existing data?
> > > >
> > >
> > > I think this is one of the key challenges for a production quality HDFS
> > on
> > > Mesos. Currently, since sandbox is deleted after a task exits, if all
> the
> > > data nodes that hold a block (and its replicas) get lost/killed for
> > > whatever reason there would be data loss. A short term solution would
> be
> > > to write outside sandbox and use slave attributes to track where to
> > > re-launch data node tasks.
> > >
> >
>

Re: HDFS on Mesos

Posted by Vinod Kone <vi...@gmail.com>.
As Maxime mentioned, the long-term solution is for Mesos to support the
notion of "persistent resources", i.e., resources that stay (and are
accounted for) beyond the life cycle of a task/executor. The idea still
needs fleshing out.


On Thu, Jun 26, 2014 at 8:23 AM, Vetoshkin Nikita <
nikita.vetoshkin@gmail.com> wrote:

> What about long term solution? Any ideas? Twitter's Manhattan database
> claims to use Mesos for scaling up and down. Can you shed some light how do
> they deal with the situation like this?
> On Jun 26, 2014 5:01 AM, "Vinod Kone" <vi...@gmail.com> wrote:
>
> > Thanks for listing this out Adam.
> >
> > Data Residency:
> > > - Should we destroy the sandbox/hdfs-data when shutting down a DN?
> > > - If starting DN on node that was previously running a DN, can/should
> we
> > > try to revive the existing data?
> > >
> >
> > I think this is one of the key challenges for a production quality HDFS
> on
> > Mesos. Currently, since sandbox is deleted after a task exits, if all the
> > data nodes that hold a block (and its replicas) get lost/killed for
> > whatever reason there would be data loss. A short term solution would be
> > to write outside sandbox and use slave attributes to track where to
> > re-launch data node tasks.
> >
>

Re: HDFS on Mesos

Posted by Vetoshkin Nikita <ni...@gmail.com>.
What about a long-term solution? Any ideas? Twitter's Manhattan database
claims to use Mesos for scaling up and down. Can you shed some light on how
they deal with situations like this?
On Jun 26, 2014 5:01 AM, "Vinod Kone" <vi...@gmail.com> wrote:

> Thanks for listing this out Adam.
>
> Data Residency:
> > - Should we destroy the sandbox/hdfs-data when shutting down a DN?
> > - If starting DN on node that was previously running a DN, can/should we
> > try to revive the existing data?
> >
>
> I think this is one of the key challenges for a production quality HDFS on
> Mesos. Currently, since sandbox is deleted after a task exits, if all the
> data nodes that hold a block (and its replicas) get lost/killed for
> whatever reason there would be data loss. A short term solution would be
> to write outside sandbox and use slave attributes to track where to
> re-launch data node tasks.
>

Re: HDFS on Mesos

Posted by Maxime Brugidou <ma...@gmail.com>.
Since I wanted to learn the scheduler API, I did a quick and dirty proof of
concept to run HDFS over Mesos: https://github.com/brugidou/hdfs-mesos

I actually run it with deimos/docker and it's very ugly: not much exception
checking, the logging is trash, a lot of stuff is hard-coded, etc. It only
runs one namenode for the entire cluster and one datanode on each slave.
You can theoretically run multiple clusters using different cluster names
and data directories. There is no HA, security, JournalNodes, etc. It
runs hadoop-2.4.1 (the latest release).

The hard part is knowing where the namenode will run; to simplify this I
chose to pre-select (using configuration) where the namenode will run.

Other than that, I doubt I chose the easiest framework to begin with...
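
In case it helps to see what "pre-select (using configuration)" boils down
to: the datanodes and clients just need the namenode's address in their
Hadoop config before anything launches. A minimal illustration (the hostname
and the helper are made up, not taken from the repo):

# Illustration only: once the namenode host is fixed ahead of time, every
# datanode/client can be handed the same fs.defaultFS before launch.
CORE_SITE_TEMPLATE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://{namenode_host}:8020</value>
  </property>
</configuration>
"""

def render_core_site(namenode_host):
    return CORE_SITE_TEMPLATE.format(namenode_host=namenode_host)

if __name__ == "__main__":
    # The scheduler would write this into each task's sandbox as core-site.xml.
    print(render_core_site("namenode-01.example.com"))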


On Thu, Jun 26, 2014 at 4:21 PM, Rick Richardson <ri...@gmail.com>
wrote:

> It might be that some tighter integration beyond a framework is needed.  A
> killer docker/chroot feature would simply be providing a standard Port to
> all containers which is an open socket to a namenode.
>
> As this is more about general purpose storage, it would probably be nice
> to use something with fewer sharp edges, like  CephFS or Lustre.  HDFS
> requires the data author to think about the size and shape of their data.
>
>
>
> On Thu, Jun 26, 2014 at 7:03 AM, Maxime Brugidou <
> maxime.brugidou@gmail.com> wrote:
>
>> This has been discussed before apparently
>> http://mail-archives.apache.org/mod_mbox/mesos-user/201401.mbox/%3CCAAoDHHF4CvcsjFJ5zuSUAhbLw+0iie5ARHpmHJVKUCVjMqTNsg@mail.gmail.com%3E
>>
>> I think that this topic will become more important now that external
>> containerization is out. The write-outside-sandbox pattern won't work in a
>> chroot or docker AFAIK.
>>
>> In addition the docker pattern for persistent data storage is to use a
>> data-only docker image. Not sure if this is appropriate here.
>> On Jun 26, 2014 12:42 PM, "Maxime Brugidou" <ma...@gmail.com>
>> wrote:
>>
>>> There is clearly a need for persistent storage management in Mesos from
>>> what I can observe.
>>>
>>> The current sandbox is what I consider ephemeral storage since it gets
>>> lost when task exits. It can recover after a slave failure using the
>>> recovery mechanism but for example it won't survive a slave reboot.
>>>
>>> Other frameworks I know of that seem to use or need persistent storage
>>> are Cassandra and Kafka. I wonder what has been done in the framework to
>>> survive a DC power outage for example. Is all data lost?
>>>
>>> As Vinod said if we want to implement persistent storage by ourselves we
>>> need to track the resource "manually" using attributes or zk. This "trick"
>>> will be reimplemented over and over by frameworks and will be outside
>>> Mesos' control (I don't even know if this trick is feasible with docker
>>> containerization).
>>>
>>> The proper way would be to have a persistent disk resource type (or
>>> something else equivalent) that let you keep data on disk. The resource
>>> will belong to a user/framework and we can have quotas. I have no idea how
>>> to implement that since I'm not familiar with the details but it could be
>>> using simple FS quotas and directories in the mesos directory itself (so we
>>> mutualize ephemeral and persistent storage), it could also be on the form
>>> of raw storage using LVM volumes to enable other sort of applications... Or
>>> it could be both actually, mesos could have a raw volume group to use for
>>> any sort of temporary/ephemeral and persistent volumes.
>>>
>>> This is probably very complex since you will need tools to report the
>>> storage usage and do some cleanup (or have a TTL/expiry mechanism). But I
>>> believe that every storage framework will reinvent this every time outside
>>> Mesos.
>>> On Jun 26, 2014 1:01 AM, "Vinod Kone" <vi...@gmail.com> wrote:
>>>
>>>> Thanks for listing this out Adam.
>>>>
>>>> Data Residency:
>>>>> - Should we destroy the sandbox/hdfs-data when shutting down a DN?
>>>>> - If starting DN on node that was previously running a DN, can/should
>>>>> we try to revive the existing data?
>>>>>
>>>>
>>>> I think this is one of the key challenges for a production quality HDFS
>>>> on Mesos. Currently, since sandbox is deleted after a task exits, if all
>>>> the data nodes that hold a block (and its replicas) get lost/killed for
>>>> whatever reason there would be data loss. A short term solution would be
>>>> to write outside sandbox and use slave attributes to track where to
>>>> re-launch data node tasks.
>>>>
>>>>
>>>>
>
>
> --
>
> "Historically, the most terrible things - war, genocide, and slavery -
> have resulted not from disobedience, but from obedience"
>                                                                --  Howard
> Zinn
>

Re: HDFS on Mesos

Posted by Rick Richardson <ri...@gmail.com>.
It might be that some tighter integration beyond a framework is needed. A
killer docker/chroot feature would simply be providing a standard port to
all containers that is an open socket to a namenode.

As this is more about general-purpose storage, it would probably be nice to
use something with fewer sharp edges, like CephFS or Lustre. HDFS
requires the data author to think about the size and shape of their data.



On Thu, Jun 26, 2014 at 7:03 AM, Maxime Brugidou <ma...@gmail.com>
wrote:

> This has been discussed before apparently
> http://mail-archives.apache.org/mod_mbox/mesos-user/201401.mbox/%3CCAAoDHHF4CvcsjFJ5zuSUAhbLw+0iie5ARHpmHJVKUCVjMqTNsg@mail.gmail.com%3E
>
> I think that this topic will become more important now that external
> containerization is out. The write-outside-sandbox pattern won't work in a
> chroot or docker AFAIK.
>
> In addition the docker pattern for persistent data storage is to use a
> data-only docker image. Not sure if this is appropriate here.
> On Jun 26, 2014 12:42 PM, "Maxime Brugidou" <ma...@gmail.com>
> wrote:
>
>> There is clearly a need for persistent storage management in Mesos from
>> what I can observe.
>>
>> The current sandbox is what I consider ephemeral storage since it gets
>> lost when task exits. It can recover after a slave failure using the
>> recovery mechanism but for example it won't survive a slave reboot.
>>
>> Other frameworks I know of that seem to use or need persistent storage
>> are Cassandra and Kafka. I wonder what has been done in the framework to
>> survive a DC power outage for example. Is all data lost?
>>
>> As Vinod said if we want to implement persistent storage by ourselves we
>> need to track the resource "manually" using attributes or zk. This "trick"
>> will be reimplemented over and over by frameworks and will be outside
>> Mesos' control (I don't even know if this trick is feasible with docker
>> containerization).
>>
>> The proper way would be to have a persistent disk resource type (or
>> something else equivalent) that let you keep data on disk. The resource
>> will belong to a user/framework and we can have quotas. I have no idea how
>> to implement that since I'm not familiar with the details but it could be
>> using simple FS quotas and directories in the mesos directory itself (so we
>> mutualize ephemeral and persistent storage), it could also be on the form
>> of raw storage using LVM volumes to enable other sort of applications... Or
>> it could be both actually, mesos could have a raw volume group to use for
>> any sort of temporary/ephemeral and persistent volumes.
>>
>> This is probably very complex since you will need tools to report the
>> storage usage and do some cleanup (or have a TTL/expiry mechanism). But I
>> believe that every storage framework will reinvent this every time outside
>> Mesos.
>> On Jun 26, 2014 1:01 AM, "Vinod Kone" <vi...@gmail.com> wrote:
>>
>>> Thanks for listing this out Adam.
>>>
>>> Data Residency:
>>>> - Should we destroy the sandbox/hdfs-data when shutting down a DN?
>>>> - If starting DN on node that was previously running a DN, can/should
>>>> we try to revive the existing data?
>>>>
>>>
>>> I think this is one of the key challenges for a production quality HDFS
>>> on Mesos. Currently, since sandbox is deleted after a task exits, if all
>>> the data nodes that hold a block (and its replicas) get lost/killed for
>>> whatever reason there would be data loss. A short term solution would be
>>> to write outside sandbox and use slave attributes to track where to
>>> re-launch data node tasks.
>>>
>>>
>>>


-- 

"Historically, the most terrible things - war, genocide, and slavery - have
resulted not from disobedience, but from obedience"
                                                               --  Howard
Zinn

Re: HDFS on Mesos

Posted by Maxime Brugidou <ma...@gmail.com>.
This has been discussed before apparently
http://mail-archives.apache.org/mod_mbox/mesos-user/201401.mbox/%3CCAAoDHHF4CvcsjFJ5zuSUAhbLw+0iie5ARHpmHJVKUCVjMqTNsg@mail.gmail.com%3E

I think that this topic will become more important now that external
containerization is out. The write-outside-sandbox pattern won't work in a
chroot or docker AFAIK.

In addition the docker pattern for persistent data storage is to use a
data-only docker image. Not sure if this is appropriate here.
On Jun 26, 2014 12:42 PM, "Maxime Brugidou" <ma...@gmail.com>
wrote:

> There is clearly a need for persistent storage management in Mesos from
> what I can observe.
>
> The current sandbox is what I consider ephemeral storage since it gets
> lost when task exits. It can recover after a slave failure using the
> recovery mechanism but for example it won't survive a slave reboot.
>
> Other frameworks I know of that seem to use or need persistent storage are
> Cassandra and Kafka. I wonder what has been done in the framework to
> survive a DC power outage for example. Is all data lost?
>
> As Vinod said if we want to implement persistent storage by ourselves we
> need to track the resource "manually" using attributes or zk. This "trick"
> will be reimplemented over and over by frameworks and will be outside
> Mesos' control (I don't even know if this trick is feasible with docker
> containerization).
>
> The proper way would be to have a persistent disk resource type (or
> something else equivalent) that let you keep data on disk. The resource
> will belong to a user/framework and we can have quotas. I have no idea how
> to implement that since I'm not familiar with the details but it could be
> using simple FS quotas and directories in the mesos directory itself (so we
> mutualize ephemeral and persistent storage), it could also be on the form
> of raw storage using LVM volumes to enable other sort of applications... Or
> it could be both actually, mesos could have a raw volume group to use for
> any sort of temporary/ephemeral and persistent volumes.
>
> This is probably very complex since you will need tools to report the
> storage usage and do some cleanup (or have a TTL/expiry mechanism). But I
> believe that every storage framework will reinvent this every time outside
> Mesos.
> On Jun 26, 2014 1:01 AM, "Vinod Kone" <vi...@gmail.com> wrote:
>
>> Thanks for listing this out Adam.
>>
>> Data Residency:
>>> - Should we destroy the sandbox/hdfs-data when shutting down a DN?
>>> - If starting DN on node that was previously running a DN, can/should we
>>> try to revive the existing data?
>>>
>>
>> I think this is one of the key challenges for a production quality HDFS
>> on Mesos. Currently, since sandbox is deleted after a task exits, if all
>> the data nodes that hold a block (and its replicas) get lost/killed for
>> whatever reason there would be data loss. A short term solution would be
>> to write outside sandbox and use slave attributes to track where to
>> re-launch data node tasks.
>>
>>
>>

Re: HDFS on Mesos

Posted by Maxime Brugidou <ma...@gmail.com>.
There is clearly a need for persistent storage management in Mesos from
what I can observe.

The current sandbox is what I consider ephemeral storage since it gets lost
when the task exits. It can recover after a slave failure using the recovery
mechanism, but it won't survive a slave reboot, for example.

Other frameworks I know of that seem to use or need persistent storage are
Cassandra and Kafka. I wonder what has been done in the framework to
survive a DC power outage for example. Is all data lost?

As Vinod said if we want to implement persistent storage by ourselves we
need to track the resource "manually" using attributes or zk. This "trick"
will be reimplemented over and over by frameworks and will be outside
Mesos' control (I don't even know if this trick is feasible with docker
containerization).

The proper way would be to have a persistent disk resource type (or
something equivalent) that lets you keep data on disk. The resource would
belong to a user/framework and we could have quotas. I have no idea how to
implement that since I'm not familiar with the details, but it could be done
using simple FS quotas and directories in the mesos directory itself (so we
pool ephemeral and persistent storage), or it could be in the form of raw
storage using LVM volumes to enable other sorts of applications... Or it
could be both, actually: mesos could have a raw volume group to use for any
sort of temporary/ephemeral and persistent volumes.

This is probably very complex since you will need tools to report the
storage usage and do some cleanup (or have a TTL/expiry mechanism). But I
believe that every storage framework will reinvent this every time outside
Mesos.
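
To make the directory-based variant slightly more concrete, here is a rough
sketch of a slave-side helper that carves per-framework volume directories
out of a work directory and reports their usage. The paths and naming scheme
are assumptions, and there is no real quota enforcement here; that would need
filesystem support (FS quotas, LVM, etc.).

import os
import tempfile

def create_volume(base_dir, framework_id, volume_id):
    """Create (or reuse) a persistent volume directory for a framework."""
    path = os.path.join(base_dir, framework_id, volume_id)
    if not os.path.isdir(path):
        os.makedirs(path)
    return path

def volume_usage_bytes(path):
    """Walk the volume and sum file sizes (reporting only, no enforcement)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass                    # file disappeared between walk and stat
    return total

if __name__ == "__main__":
    # Demo against a temp dir; a slave would use something like
    # /var/lib/mesos/persistent instead (an assumed path, not a Mesos default).
    base = tempfile.mkdtemp()
    vol = create_volume(base, "hdfs-on-mesos", "dn-volume-1")
    print(vol, volume_usage_bytes(vol))
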
On Jun 26, 2014 1:01 AM, "Vinod Kone" <vi...@gmail.com> wrote:

> Thanks for listing this out Adam.
>
> Data Residency:
>> - Should we destroy the sandbox/hdfs-data when shutting down a DN?
>> - If starting DN on node that was previously running a DN, can/should we
>> try to revive the existing data?
>>
>
> I think this is one of the key challenges for a production quality HDFS on
> Mesos. Currently, since sandbox is deleted after a task exits, if all the
> data nodes that hold a block (and its replicas) get lost/killed for
> whatever reason there would be data loss. A short term solution would be
> to write outside sandbox and use slave attributes to track where to
> re-launch data node tasks.
>
>
>

Re: HDFS on Mesos

Posted by Vinod Kone <vi...@gmail.com>.
Thanks for listing this out Adam.

Data Residency:
> - Should we destroy the sandbox/hdfs-data when shutting down a DN?
> - If starting DN on node that was previously running a DN, can/should we
> try to revive the existing data?
>

I think this is one of the key challenges for a production-quality HDFS on
Mesos. Currently, since the sandbox is deleted after a task exits, if all
the data nodes that hold a block (and its replicas) get lost/killed for
whatever reason, there would be data loss. A short-term solution would be
to write outside the sandbox and use slave attributes to track where to
re-launch data node tasks.
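
As a rough illustration of that short-term workaround (the hostnames, data
path, and in-memory registry below are assumptions; a real framework would
persist this state in ZooKeeper or encode it in slave attributes):

# Datanodes write to a fixed path outside the sandbox, and the scheduler
# remembers which hosts hold data so lost datanodes are re-launched on the
# same machines. DATA_DIR is an assumed path, not a Mesos or HDFS default.
DATA_DIR = "/data/hdfs/dn"

class DataNodeRegistry(object):
    def __init__(self):
        self.hosts_with_data = set()    # hostnames with HDFS blocks on disk
        self.lost_hosts = set()         # hosts whose DN task needs re-launching

    def datanode_started(self, hostname):
        self.hosts_with_data.add(hostname)

    def datanode_lost(self, hostname):
        if hostname in self.hosts_with_data:
            self.lost_hosts.add(hostname)

    def should_relaunch_on(self, offer_hostname):
        """Only take offers from hosts whose on-disk blocks we want back."""
        return offer_hostname in self.lost_hosts

if __name__ == "__main__":
    registry = DataNodeRegistry()
    registry.datanode_started("node7.example.com")
    registry.datanode_lost("node7.example.com")
    print(registry.should_relaunch_on("node7.example.com"))  # True
    print(registry.should_relaunch_on("node8.example.com"))  # False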
