Posted to dev@spark.apache.org by Haoyuan Li <ha...@gmail.com> on 2013/09/01 00:33:17 UTC

Re: off-heap RDDs

Evan,

If I understand you correctly, you want to avoid network I/O as much as
possible by caching the data on the node that already has the data on disk.
Actually, that is what I meant: client-side caching would automatically do
this. For example, suppose you have a cluster of machines with nothing cached
in memory yet. Then a Spark application runs on it. Spark asks Tachyon where
data X is. Since nothing is in memory yet, Tachyon would return disk locations
the first time. The Spark program will then try to take advantage of disk data
locality and load the data X on HDFS node N into the off-heap memory of node
N. In the future, when Spark asks Tachyon for the location of X, Tachyon will
return node N. There is no network I/O involved in the whole process. Let me
know if I misunderstood something.
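
To make that flow concrete, here is a minimal, purely hypothetical sketch of
the bookkeeping described above (not the actual Tachyon client API): the first
lookup for a block returns only its HDFS disk hosts, the reading node then
caches the block off-heap, and later lookups report that node as an in-memory
location.

    case class BlockLocations(inMemoryOn: Seq[String], onDiskOn: Seq[String])

    class LocalityTracker {
      private val cachedOn = scala.collection.mutable.Map.empty[String, String]

      // First access: nothing is cached yet, so only the HDFS disk hosts come back.
      def locate(blockId: String, hdfsHosts: Seq[String]): BlockLocations =
        BlockLocations(cachedOn.get(blockId).toSeq, hdfsHosts)

      // Called by the client on `host` after it has read the block there, so
      // that later lookups report `host` as an in-memory location.
      def recordCachedOn(blockId: String, host: String): Unit =
        cachedOn(blockId) = host
    }

A scheduler that prefers inMemoryOn and falls back to onDiskOn then gets disk
locality on the first pass and memory locality afterwards, which is the
behavior described above.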

Haoyuan


On Fri, Aug 30, 2013 at 10:00 AM, Evan Chan <ev...@ooyala.com> wrote:

> Hey guys,
>
> I would also prefer to strengthen and get behind Tachyon, rather than
> implement a separate solution (though I guess if it's not officially
> supported, then nobody will ask questions).  But it's more that off-heap
> memory is difficult, so my feeling is that it's better to focus efforts on
> one project.
>
> Haoyuan,
>
> Tachyon brings cached HDFS data to the local client.  Have we thought about
> the opposite approach, which might be more efficient?
>  - Load the data in HDFS node N into the off-heap memory of node N
>  - in Spark, inform the framework (maybe via RDD partition/location info)
> of where the data is, that it is located in node N
>  - bring the computation to node N
>
> This avoids network IO and may be much more efficient for many types of
> applications.   I know this would be a big win for us.
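
For reference, the "inform the framework" step maps onto Spark's existing
preferred-locations hook for RDDs. The sketch below only illustrates the idea
under that assumption; NodeLocalRDD and the hosts array are hypothetical, and
compute() is left as a placeholder.

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical: hosts(i) is the node whose off-heap memory already holds
    // the data for partition i.
    class NodeLocalRDD(sc: SparkContext, hosts: Array[String]) extends RDD[Int](sc, Nil) {

      override def getPartitions: Array[Partition] =
        hosts.indices.map(i => new Partition { override def index: Int = i }).toArray

      // Tell the scheduler where each partition's data lives ...
      override def getPreferredLocations(split: Partition): Seq[String] =
        Seq(hosts(split.index))

      // ... so that the task is brought to that node automatically.
      override def compute(split: Partition, context: TaskContext): Iterator[Int] =
        Iterator.empty // placeholder: would read the node-local off-heap data here
    }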
>
> -Evan
>
>
> On Wed, Aug 28, 2013 at 1:37 AM, Haoyuan Li <ha...@gmail.com> wrote:
>
> > No problem. As with reading/writing data from/to an off-heap ByteBuffer,
> > when a program reads/writes data from/to Tachyon, Spark/Shark needs to do
> > ser/de. Efficient ser/de will help performance a lot, as people have
> > pointed out. One solution is for the application to do primitive operations
> > directly on the ByteBuffer, as Shark does now. Most of the related code is
> > located at
> > https://github.com/amplab/shark/tree/master/src/main/scala/shark/memstore2
> > and
> > https://github.com/amplab/shark/tree/master/src/tachyon_enabled/scala/shark/tachyon
> >
> > Haoyuan
> >
> >
> > On Wed, Aug 28, 2013 at 1:21 AM, Imran Rashid <im...@therashids.com>
> > wrote:
> >
> > > Thanks, Haoyuan.  It seems like we should try out Tachyon; it sounds like
> > > it is what we are looking for.
> > >
> > > On Wed, Aug 28, 2013 at 8:18 AM, Haoyuan Li <ha...@gmail.com>
> > wrote:
> > > > Response inline.
> > > >
> > > >
> > > > On Tue, Aug 27, 2013 at 1:37 AM, Imran Rashid <im...@therashids.com>
> > > wrote:
> > > >
> > > >> Thanks for all the great comments & discussion.  Let me expand a bit
> > > >> on our use case, and then I'm gonna combine responses to various
> > > >> questions.
> > > >>
> > > >> In general, when we use spark, we have some really big RDDs that use
> > > >> up a lot of memory (10s of GB per node) that are really our "core"
> > > >> data sets.  We tend to start up a spark application, immediately
> load
> > > >> all those data sets, and just leave them loaded for the lifetime of
> > > >> that process.  We definitely create a lot of other RDDs along the way,
> > > >> and lots of intermediate objects that we'd like to be handled by normal
> > > >> garbage collection.  But those all require much less memory, maybe
> > > >> 1/10th of the big RDDs that we just keep around.  I know this is a
> bit
> > > >> of a special case, but it seems like it probably isn't that
> different
> > > >> from a lot of use cases.
> > > >>
> > > >> Reynold Xin wrote:
> > > >> > This is especially attractive if the application can read directly
> > > from
> > > >> a byte
> > > >> > buffer without generic serialization (like Shark).
> > > >>
> > > >> interesting -- can you explain how this works in Shark?  do you have
> > > >> some general way of storing data in byte buffers that avoids
> > > >> serialization?  Or do you mean that if the user is effectively creating
> > > >> an RDD of ints, you create an RDD[ByteBuffer] and then read/write the
> > > >> ints into the byte buffer yourself?
> > > >> Sorry, I'm familiar with the basic idea of Shark but not the code at
> > > >> all -- even a pointer to the code would be helpful.
> > > >>
> > > >> Haoyuan Li wrote:
> > > >> > One possible solution is that you can use
> > > >> > Tachyon<https://github.com/amplab/tachyon>.
> > > >>
> > > >> This is a good idea that I had probably overlooked.  There are two
> > > >> potential issues that I can think of with this approach, though:
> > > >> 1) I was under the impression that Tachyon is still not really
> tested
> > > >> in production systems, and I need something a bit more mature.  Of
> > > >> course, my changes wouldn't be thoroughly tested either, but
> somehow I
> > > >> feel better about deploying my 5-line patch to a codebase I
> understand
> > > >> than adding another entire system.  (This isn't a good reason to add
> > > >> this to Spark in general, though; it might just be a temporary patch we
> > > >> deploy locally.)
> > > >>
> > > >
> > > > This is a legitimate concern. The good news is that several companies
> > > > have been testing it for a while, and some are close to taking it to
> > > > production. For example, as Yahoo mentioned in today's meetup, we are
> > > > working to integrate Shark and Tachyon closely, and the results are very
> > > > promising. It will be in production soon.
> > > >
> > > >
> > > >> 2) I may have misunderstood Tachyon, but it seems there is a big
> > > >> difference in the data locality in these two approaches.  On a large
> > > >> cluster, HDFS will spread the data all over the cluster, and so any
> > > >> particular piece of on-disk data will only live on a few machines.
> > > >> When you start a spark application, which only uses a small subset
> of
> > > >> the nodes, odds are the data you want is *not* on those nodes.  So
> > > >> even if tachyon caches data from HDFS into memory, it won't be on
> the
> > > >> same nodes as the spark application.  Which means that when the
> spark
> > > >> application reads data from the RDD, even though the data is in
> memory
> > > >> on some node in the cluster, it will need to be read over the
> network
> > > >> by the actual spark worker assigned to the application.
> > > >
> > > >> Is my understanding correct?  I haven't done any measurements at all
> > > >> of a difference in performance, but it seems this would be much
> > > >> slower.
> > > >>
> > > >
> > > > This is a great question. Actually, from a data locality perspective,
> > > > the two approaches have no difference. Tachyon does client-side caching,
> > > > which means that if a client on a node reads data that is not on its
> > > > local machine, the first read will cache the data on that node.
> > > > Therefore, all future accesses on that node will read the data from its
> > > > local memory. For example, suppose you have a cluster with 100 nodes,
> > > > all running HDFS and Tachyon. Then you launch a Spark job running on
> > > > only 20 nodes. The first time it reads or caches the data, all the data
> > > > will be cached on those 20 nodes. In the future, when the Spark master
> > > > tries to schedule tasks, it will query Tachyon about data locations and
> > > > take advantage of data locality automatically.
> > > >
> > > > Best,
> > > >
> > > > Haoyuan
> > > >
> > > >
> > > >>
> > > >>
> > > >> Mark Hamstra wrote:
> > > >> > What worries me is the
> > > >> > combinatoric explosion of different caching and persistence
> > > mechanisms.
> > > >>
> > > >> great points, and I have no idea of the real advantages yet.  I agree
> > > >> we'd need to actually observe an improvement to add yet another option.
> > > >> (I would really like some alternative to what I'm doing now, but maybe
> > > >> Tachyon is all I need ...)
> > > >>
> > > >> Reynold Xin wrote:
> > > >> > Mark - you don't necessarily need to construct a separate storage
> > > level.
> > > >> > One simple way to accomplish this is for the user application to
> > pass
> > > >> Spark
> > > >> > a DirectByteBuffer.
> > > >>
> > > >> hmm, that's true I suppose, but I had originally thought of making
> it
> > > >> another storage level, just for convenience & consistency.  Couldn't
> > > >> you get rid of all the storage levels and just have the user apply
> > > >> various transformations to an RDD?  eg.
> > > >>
> > > >> rdd.cache(MEMORY_ONLY_SER)
> > > >>
> > > >> could be
> > > >>
> > > >> rdd.map{x => serializer.serialize(x)}.cache(MEMORY_ONLY)
> > > >>
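
A minimal sketch of that idea against the actual persist/StorageLevel API; the
local SparkContext and the JDK-serialization helper are stand-ins for whatever
serializer the application would really use:

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("ser-cache").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 100000).map(i => (i, i.toString))

    // Stand-in for "serializer.serialize(x)" in the sketch above.
    def serialize(x: AnyRef): Array[Byte] = {
      val bos = new ByteArrayOutputStream()
      val oos = new ObjectOutputStream(bos)
      oos.writeObject(x)
      oos.close()
      bos.toByteArray
    }

    // Roughly equivalent in spirit to rdd.persist(StorageLevel.MEMORY_ONLY_SER):
    val asBytes = rdd.map(x => serialize(x)).persist(StorageLevel.MEMORY_ONLY)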
> > > >>
> > > >> And I agree with all of Lijie's comments that using off-heap memory
> is
> > > >> unsafe & difficult.  But I feel that isn't a reason to completely
> > > >> disallow it, if there is a significant performance improvement.  It
> > > >> would need to be clearly documented as an advanced feature with some
> > > >> risks involved.
> > > >>
> > > >> thanks,
> > > >> imran
> > > >>
> > > >> On Mon, Aug 26, 2013 at 4:38 AM, Lijie Xu <cs...@gmail.com>
> > wrote:
> > > >> > I remember that I talked about this off-heap approach with Reynold in
> > > >> > person several months ago. I think this approach is attractive to
> > > >> > Spark/Shark, since there are many large objects in the JVM. But the
> > > >> > main problem in the original Spark (without Tachyon support) is that
> > > >> > it uses the same memory space both for storing critical data and for
> > > >> > processing temporary data. Separating storage and processing is more
> > > >> > important than looking for a memory-efficient storage technique. So I
> > > >> > think this separation is the main contribution of Tachyon.
> > > >> >
> > > >> >
> > > >> > As for the off-heap approach, we are not the first to recognize this
> > > >> > problem. Apache DirectMemory is promising, though not yet mature.
> > > >> > However, I think there are some problems with using direct memory.
> > > >> >
> > > >> > 1) Unsafe. As in C++, there may be memory leaks. Users will also be
> > > >> > confused about setting the right memory-related configurations, such
> > > >> > as -Xmx and -XX:MaxDirectMemorySize.
> > > >> >
> > > >> > 2) Difficult. Designing an effective and efficient memory management
> > > >> > system is not an easy job. How to allocate, replace, and reclaim
> > > >> > objects at the right time and in the right place is challenging. It's
> > > >> > a bit similar to GC algorithms.
> > > >> >
> > > >> > 3) Limited usage. It's useful for write-once-read-many-times large
> > > >> > objects but not for others.
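
For reference on point 1), the two limits are separate JVM flags, set for
example as: java -Xmx8g -XX:MaxDirectMemorySize=16g (sizes arbitrary). A tiny
sketch of inspecting the heap ceiling from inside the process:

    import java.lang.management.ManagementFactory

    // -Xmx caps the Java heap; -XX:MaxDirectMemorySize separately caps direct
    // buffers (on HotSpot it roughly defaults to the -Xmx value when unset).
    val heapMax = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getMax
    println(s"heap ceiling: ${heapMax / (1024 * 1024)} MB")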
> > > >> >
> > > >> >
> > > >> >
> > > >> > I also have two related questions:
> > > >> >
> > > >> > 1) Can the JVM's heap use virtual memory, or just physical memory?
> > > >> >
> > > >> > 2) Can direct memory use virtual memory, or just physical memory?
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Mon, Aug 26, 2013 at 8:06 AM, Haoyuan Li <haoyuan.li@gmail.com
> >
> > > >> wrote:
> > > >> >
> > > >> >> Hi Imran,
> > > >> >>
> > > >> >> One possible solution is that you can use
> > > >> >> Tachyon<https://github.com/amplab/tachyon>.
> > > >> >> When data is in Tachyon, Spark jobs will read it from off-heap
> > > memory.
> > > >> >> Internally, it uses direct byte buffers to store
> memory-serialized
> > > RDDs
> > > >> as
> > > >> >> you mentioned. Also, different Spark jobs can share the same data
> > in
> > > >> >> Tachyon's memory. Here is a presentation (slides) we did in May:
> > > >> >> https://docs.google.com/viewer?url=http%3A%2F%2Ffiles.meetup.com%2F3138542%2FTachyon_2013-05-09_Spark_Meetup.pdf
> > > >> >>
> > > >> >> Haoyuan
> > > >> >>
> > > >> >>
> > > >> >> On Sun, Aug 25, 2013 at 3:26 PM, Imran Rashid <
> > imran@therashids.com>
> > > >> >> wrote:
> > > >> >>
> > > >> >> > Hi,
> > > >> >> >
> > > >> >> > I was wondering if anyone has thought about putting cached data in
> > > >> >> > an RDD into off-heap memory, e.g. with direct byte buffers.  For
> > > >> >> > really long-lived RDDs that use a lot of memory, this seems like a
> > > >> >> > huge improvement, since all that memory is then totally ignored
> > > >> >> > during GC.  (And reading data from direct byte buffers is
> > > >> >> > potentially faster as well, but that's just a nice bonus.)
> > > >> >> >
> > > >> >> > The easiest thing to do is to store memory-serialized RDDs in
> > > >> >> > direct byte buffers, but I guess we could also store the serialized
> > > >> >> > RDD on disk and use a memory-mapped file.  Serializing into
> > > >> >> > off-heap buffers is a really simple patch; I just changed a few
> > > >> >> > lines (I haven't done any real tests with it yet, though).  But I
> > > >> >> > don't really have a ton of experience with off-heap memory, so I
> > > >> >> > thought I would ask what others think of the idea, if it makes
> > > >> >> > sense or if there are any gotchas I should be aware of, etc.
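
For concreteness, a minimal sketch of the two variants mentioned above, with a
stand-in byte array in place of a real serialized partition:

    import java.io.RandomAccessFile
    import java.nio.ByteBuffer
    import java.nio.channels.FileChannel.MapMode

    val bytes: Array[Byte] = Array.fill(16)(1.toByte) // stand-in for serialized data

    // Option 1: keep the serialized bytes in an off-heap direct buffer
    // (its contents are not scanned by the garbage collector).
    val direct = ByteBuffer.allocateDirect(bytes.length)
    direct.put(bytes)
    direct.flip()

    // Option 2: spill the serialized bytes to disk and memory-map the file.
    val raf = new RandomAccessFile("/tmp/partition-0.bin", "rw")
    raf.write(bytes)
    val mapped = raf.getChannel.map(MapMode.READ_ONLY, 0, bytes.length)
    raf.close()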
> > > >> >> >
> > > >> >> > thanks,
> > > >> >> > Imran
> > > >> >> >
> > > >> >>
> > > >>
> > >
> >
>
>
>
> --
> --
> Evan Chan
> Staff Engineer
> ev@ooyala.com  |
>
> <http://www.ooyala.com/>
> <http://www.facebook.com/ooyala><http://www.linkedin.com/company/ooyala><
> http://www.twitter.com/ooyala>
>

Re: off-heap RDDs

Posted by Haoyuan Li <ha...@gmail.com>.
That will be great!

Haoyuan


On Thu, Sep 5, 2013 at 9:28 PM, Evan Chan <ev...@ooyala.com> wrote:

> Haoyuan,
>
> Thanks, that sounds great, exactly what we are looking for.
>
> We might be interested in integrating Tachyon with CFS (Cassandra File
> System, the Cassandra-based implementation of HDFS).
>
> -Evan

Re: off-heap RDDs

Posted by Evan Chan <ev...@ooyala.com>.
Haoyuan,

Thanks, that sounds great, exactly what we are looking for.

We might be interested in integrating Tachyon with CFS (Cassandra File
System, the Cassandra-based implementation of HDFS).

-Evan






-- 
--
Evan Chan
Staff Engineer
ev@ooyala.com  |

<http://www.ooyala.com/>
<http://www.facebook.com/ooyala><http://www.linkedin.com/company/ooyala><http://www.twitter.com/ooyala>