Posted to dev@commons.apache.org by Aaron Smuts <AS...@therealm.com> on 2002/02/01 22:19:08 UTC

RE: [collections][PATCH] SequencedHashMap violates Map contract, has poor performance -- BufferCache, SequencedHastable, MRUMap and LRUMap .

I was talking about the one in cvs.  I was trying to point out that there
are other classes that suffer from the same problem that you are fixing and
that there are other problems with one of the LRU implementations.  See the
other emails.

Cheers,

Aaron

> -----Original Message-----
> From: Michael A. Smith [mailto:iammichael@iammichael.org]
> Sent: Friday, February 01, 2002 10:46 AM
> To: Jakarta Commons Developers List
> Subject: RE: [collections][PATCH] SequencedHashMap violates Map contract,
> has poor performance -- BufferCache, SequencedHastable, MRUMap and LRUMap
> .
> 
> On Fri, 1 Feb 2002, Aaron Smuts wrote:
> > The SequencedHashMap has the same problem with the LinkedList remove
> > operation which executes in O(N) time as BufferCache, SequencedHastable,
> > and MRUMap.
> 
> Umm...  I'm a bit confused.  Are you referring to the SequencedHashMap
> that exists in current CVS, or the version I posted (I'm not sure because
> the subject starts with the same subject as my recent mail containing a
> patch for SequencedHashMap and quotes one of my mails).
> 
> If you're referring to the current SequencedHashMap in CVS, then yes, it
> has O(n) performance.  I mentioned that fact in my email, and it is one of
> the things fixed in my implementation.
> 
> If you're referring to the version in my patch, then I don't see how it
> executes in O(n).
> 
> > The LRUStore I put in stratum looks similar to what you are working on.
> 
> ah, maybe you were just reiterating the problems with the existing CVS
> version.  Stratum is some component in turbine, right?  I'm not that
> familiar with it.  I did see your post here about it though, and plan on
> commenting on it this evening or tomorrow when I have some more time.
> 
> > LRUMap in Collections looks pretty good.  I'm going to test it.
> 
> I haven't looked over that yet.  It's on my list.  I'll comment on the
> rest of your mail after I review it either this evening or tomorrow.
> 
> regards,
> michael
> 
> 
> 

RE: [collections][PATCH] SequencedHashMap violates Map contract, has poor performance -- BufferCache, SequencedHastable, MRUMap and LRUMap .

Posted by Aaron Smuts <aa...@verizon.net>.

> -----Original Message-----
> From: Michael Smith [mailto:michael@iammichael.org]
> Sent: Monday, February 04, 2002 12:22 AM
> To: Jakarta Commons Developers List
> Subject: RE: [collections][PATCH] SequencedHashMap violates Map contract,
> has poor performance -- BufferCache, SequencedHastable, MRUMap and LRUMap
> .
> 
> Aaron Smuts [mailto:aaron.smuts3@verizon.net] wrote:
> > I'm looking for new memory management algorithms for JCS.  The LRU I'm
> > using, which is basically the LRUStore I added, is good, but I'd like
> > something more efficient for huge collections of objects.  Basically,
> > every get results in a reshuffling of some retrieved item to
> > the front,
> > unless the item is already the front runner.  Since the likelihood of
> > the retrieved item already being the first is small, I'd like to find
> > something even more efficient.
> 
> But that "move to front" operation is essentially six lines of code in
> addition to the hash table lookup.  Might have trouble getting more
> efficient than that.
> 

Ya.

> > The cache keeps pushing items to the front.  When new items
> > arrive they
> > are pushed off to disk in a group of say 5.  The memory cache
> > plugs into
> > the cache hub.  The hub has a bunch of auxiliary caches plugged in
and
> > manages expiration and the transmission of items to the auxiliaries
> > based on the element and regional attributes of the cache.  If one
of
> > the auxiliaries is of the disk type, the overflow from the
> > memory cache
> > (the LRU) is put to disk.  These puts are queued in a very efficient
> > manner, first going to a purgatory and then to disk (take a look).
> 
> sounds like the real problem is figuring out which items are "safe" to
> move to disk, 

Yes.  I want a LeastFrequentlyUsed and not an LRU store.

> and keeping track of those items once they've been written.
> 

I have a few stable, efficient solutions.  The indexed disk cache works
very well and is far faster than any B-tree.

> > I'd like to avoid the amount of shuffling of memory elements somehow.
> > Perhaps by keeping an integer tally of the access and having a
> > background thread handling the overflow, so the retrievals
> > are almost at
> > map time.  An algorithm based on the current access number
> > and the total
> > accesses vs the total number of items could be used.  This
> > may result in
> > a few less operations in the synchronous get operation.
> 
> when you say "map" time, I assume you mean "hash" map time

Ya.  

> -- that is,
> O(1).  An interesting idea.  I'm not sure how much it would buy you
> though if you have to amortize the cost out.  Probably depends greatly
> on the frequency of scans looking for items to offload, and the
> frequency of retrievals.  I probably wouldn't use a background thread
> though...  just perform the scan every n retrievals or every 2n
> retrievals.  That probably helps things be a little more deterministic
> and helps amortize the cost of the operations.
> 

Ya.  Except that items will hang around till they are used or spooled.
When we consider expiration, you might want to have some cleanup on a
stable sized cache, to proactively get some memory back.  Right now the
cache hub handles this on get.  There is some motivation for a
background process here . . . 

> > Since this isn't a bottleneck, I'm not too concerned with it,
> > but it is
> > an interesting problem that someone might want to undertake.
> 
> I seem to remember seeing someone implementing a scalable hashtable of
> sorts that offloaded infrequently used items to disk as part of a PhD
> project or something like that.  It was a few years ago though, and I
> can't find a reference to it now.  Too bad. 

Hmmn.

> It certainly is an
> interesting problem, and one I'd probably enjoy working on.  I'm just
> not sure when I'll have the time to work on such a thing.
> 

I'd like to implement an LFU, rather than an LRU, cache but the algorithms
can get messy.  If you know of any clean systems, point me that way.
I'll look around.

Something simple could be done where the total number of accesses at
insertion time, combined with the number of element accesses versus the
total number of cache accesses, could be evaluated as a means for
background removal.  This might be easier to implement.  The least
accessed item could be determined, but you have to deal with constantly
incoming items and changing totals.

  score = ((current access total - access total at insertion time)
            / total number of items) / item access total
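
In code, that might look vaguely like this (back-of-the-envelope only; the
names are made up and this is not anything from JCS):

// Sketch of the scoring idea above.  Invented names, illustration only.
class LfuScoreSketch {

    static class ElementStats {
        long cacheAccessTotalAtInsertion; // global counter value when inserted
        long elementAccessTotal = 1;      // bumped whenever this element is hit
    }

    long cacheAccessTotal = 0;            // bumped on every get -- the only
                                          // bookkeeping a get has to do

    // the formula above: traffic per item seen since this element was
    // inserted, divided by the element's own access total
    double score(ElementStats e, int numberOfItems) {
        double trafficPerItem =
            (double) (cacheAccessTotal - e.cacheAccessTotalAtInsertion)
                / numberOfItems;
        return trafficPerItem / e.elementAccessTotal;
    }
}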

A background thread could compare scores over time to see if it was
getting worse.  Anything below a certain score and/or with 2 drops could
be removed.  For gets you only have to update the access number.  All
you'd need would be the wrapped objects in a hash.  You'd just have to
deal with the messy business of iterating through the hashmap without
hurting performance.  Coming up with a decent scoring equation is tricky
in cases where all the gets are evenly distributed.

Or you could just record the total accesses.  The element with the
smallest possible given the total number of items gets removed, if you
can keep that stable enough.  But this is just LRU . . .

LRU is cheap, easy and effective, but not as sophisticated as you might
want.  LFU is a different animal.

I'm just thinking out loud.

Aaron

> regards,
> michael
> 
> 
> 




RE: [collections][PATCH] SequencedHashMap violates Map contract, has poor performance -- BufferCache, SequencedHastable, MRUMap and LRUMap .

Posted by Michael Smith <mi...@iammichael.org>.
Aaron Smuts [mailto:aaron.smuts3@verizon.net] wrote:
> I'm looking for new memory management algorithms for JCS.  The LRU I'm
> using, which is basically the LRUStore I added, is good, but I'd like
> something more efficient for huge collections of objects.  Basically,
> every get results in a reshuffling of some retrieved item to
> the front,
> unless the item is already the front runner.  Since the likelihood of
> the retrieved item already being the first is small, I'd like to find
> something even more efficient.

But that "move to front" operation is essentially six lines of code in
addition to the hash table lookup.  Might have trouble getting more
efficient than that.
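
For concreteness, the sort of thing I mean looks roughly like this.  This
is just an illustrative sketch typed up for this mail -- it isn't the code
from my patch, and the class and field names are made up:

import java.util.HashMap;
import java.util.Map;

// Sketch of the O(1) "move to front" idea: a HashMap for lookup plus a
// doubly-linked list of entries for ordering.  Illustrative only.
public class MoveToFrontSketch {

    private static final class Entry {
        Object key;     // kept so a bounded cache could evict from the map
        Object value;
        Entry prev;
        Entry next;
    }

    // sentinel node; header.next is always the most recently used entry
    private final Entry header = new Entry();
    private final Map map = new HashMap();

    public MoveToFrontSketch() {
        header.next = header;
        header.prev = header;
    }

    public Object get(Object key) {
        Entry e = (Entry) map.get(key);   // the normal hash table lookup
        if (e == null) {
            return null;
        }
        if (header.next != e) {           // the handful of pointer updates:
            e.prev.next = e.next;         // unlink from its current spot...
            e.next.prev = e.prev;
            e.next = header.next;         // ...and relink right after the header
            e.prev = header;
            header.next.prev = e;
            header.next = e;
        }
        return e.value;
    }

    public void put(Object key, Object value) {
        Entry e = (Entry) map.get(key);
        if (e == null) {
            e = new Entry();
            e.key = key;
            map.put(key, e);
            e.next = header.next;         // new entries go straight to the front
            e.prev = header;
            header.next.prev = e;
            header.next = e;
        }
        e.value = value;
        get(key);                         // an existing entry gets moved to the front
    }
}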

> The cache keeps pushing items to the front.  When new items
> arrive they
> are pushed off to disk in a group of say 5.  The memory cache
> plugs into
> the cache hub.  The hub has a bunch of auxiliary caches plugged in and
> manages expiration and the transmission of items to the auxiliaries
> based on the element and regional attributes of the cache.  If one of
> the auxiliaries is of the disk type, the overflow from the
> memory cache
> (the LRU) is put to disk.  These puts are queued in a very efficient
> manner, first going to a purgatory and then to disk (take a look).

sounds like the real problem is figuring out which items are "safe" to
move to disk, and keeping track of those items once they've been written.

> I'd like to avoid the amount of shuffling of memory elements somehow.
> Perhaps by keeping an integer tally of the access and having a
> background thread handling the overflow, so the retrievals
> are almost at
> map time.  An algorithm based on the current access number
> and the total
> accesses vs the total number of items could be used.  This
> may result in
> a few less operations in the synchronous get operation.

when you say "map" time, I assume you mean "hash" map time -- that is,
O(1).  An interesting idea.  I'm not sure how much it would buy you
though if you have to amortize the cost out.  Probably depends greatly
on the frequency of scans looking for items to offload, and the
frequency of retrievals.  I probably wouldn't use a background thread
though...  just perform the scan every n retrievals or every 2n
retrievals.  That probably helps things be a little more deterministic
and helps amortize the cost of the operations.
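
i.e., something along these lines (illustrative only; the method names are
made up):

// Sketch of folding the offload scan into the get path instead of using a
// background thread.  Nothing here corresponds to a real class.
public class AmortizedScanSketch {

    private static final int SCAN_EVERY = 1000;  // the "n" above; tune to taste
    private int getsSinceLastScan = 0;

    public Object get(Object key) {
        Object value = lookup(key);              // the normal O(1) hash lookup
        if (++getsSinceLastScan >= SCAN_EVERY) {
            getsSinceLastScan = 0;
            scanForItemsToOffload();             // cost spread across many gets
        }
        return value;
    }

    private Object lookup(Object key) {
        return null;                             // stand-in for the real map lookup
    }

    private void scanForItemsToOffload() {
        // stand-in: pick cold items and queue them for the disk auxiliary
    }
}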

> Since this isn't a bottleneck, I'm not too concerned with it,
> but it is
> an interesting problem that someone might want to undertake.

I seem to remember seeing someone implementing a scalable hashtable of
sorts that offloaded infrequently used items to disk as part of a PhD
project or something like that.  It was a few years ago though, and I
can't find a reference to it now.  Too bad.  It certainly is an
interesting problem, and one I'd probably enjoy working on.  I'm just
not sure when I'll have the time to work on such a thing.

regards,
michael





RE: [collections][PATCH] SequencedHashMap violates Map contract, has poor performance -- BufferCache, SequencedHastable, MRUMap and LRUMap .

Posted by Aaron Smuts <aa...@verizon.net>.

> -----Original Message-----
> From: Michael Smith [mailto:michael@iammichael.org]
> Sent: Sunday, February 03, 2002 5:17 PM
> To: Jakarta Commons Developers List; aaron.smuts3@verizon.net
> Subject: RE: [collections][PATCH] SequencedHashMap violates Map contract,
> has poor performance -- BufferCache, SequencedHastable, MRUMap and LRUMap
> .
> 
> > > ok.  I've looked over some of the classes you mentioned.  LRUMap in
> > > commons.collections definitely suffers the same problems.  I'm about
> > > to post a patch for that class that enumerates some of the
> > > problems that I
> > > saw (I didn't fix them yet -- the patch just changes the license to
> > > the proper long form and documents the problems).
> > >
> >
> > The main problem here is that the second most recently used
> > item can get
> > dropped.  It is not proper LRU.
> 
> yup.  That's one of the things I mentioned in the documentation (in my
> patch).  I even mentioned your example.  :)
> 

I just saw the patch.  Sounds good.

> > > Assuming you mean org.apache.turbine.util.BufferCache when you
> > > referred
> > > to BufferCache, I looked it over as well.  It inherits from
> > > org.apache.turbine.util.SequencedHashtable which looks to be a near
> > > exact duplicate of the SequencedHashtable that's in commons.util and
> > > nearly the same as the SequencedHashMap that exists in
> > > commons.collections (which is the one I submitted my patch against).
> > >
> > > I can't find any reference to MRUMap.  Can you give me a pointer?
> > >
> >
> > It's in the commons sandbox(?) simplestore.
> 
> doh.  I don't know how I missed that.
> 
> Yes, MRUMap does the same thing.  Linked list; O(n) retrieval, O(n)
> removal.  Kind of defeats the purpose of using a map at all, doesn't it?
> :)
> 

Sure does.  It wasn't noticed since the default size was set at 100.  

I'm looking for new memory management algorithms for JCS.  The LRU I'm
using, which is basically the LRUStore I added, is good, but I'd like
something more efficient for huge collections of objects.  Basically,
every get results in a reshuffling of some retrieved item to the front,
unless the item is already the front runner.  Since the likelihood of
the retrieved item already being the first is small, I'd like to find
something even more efficient.  

The cache keeps pushing items to the front.  When new items arrive they
are pushed off to disk in a group of say 5.  The memory cache plugs into
the cache hub.  The hub has a bunch of auxiliary caches plugged in and
manages expiration and the transmission of items to the auxiliaries
based on the element and regional attributes of the cache.  If one of
the auxiliaries is of the disk type, the overflow from the memory cache
(the LRU) is put to disk.  These puts are queued in a very efficient
manner, first going to a purgatory and then to disk (take a look).
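
The shape of it is roughly like this (a hypothetical sketch -- the real JCS
interfaces and method names are different, so don't read anything into the
names):

// Hypothetical sketch of the hub/auxiliary arrangement; not the real JCS API.
interface AuxiliaryCacheSketch {
    void spool(Object key, Object value);   // a disk auxiliary would queue the write
}

class CacheHubSketch {

    private final AuxiliaryCacheSketch[] auxiliaries;

    CacheHubSketch(AuxiliaryCacheSketch[] auxiliaries) {
        this.auxiliaries = auxiliaries;
    }

    // called by the memory cache (the LRU) when it overflows; decisions based
    // on element and regional attributes would be made here
    void overflow(Object key, Object value) {
        for (int i = 0; i < auxiliaries.length; i++) {
            auxiliaries[i].spool(key, value);   // purgatory/queueing sits behind this call
        }
    }
}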

I'd like to avoid the amount of shuffling of memory elements somehow.
Perhaps by keeping an integer tally of the access and having a
background thread handling the overflow, so the retrievals are almost at
map time.  An algorithm based on the current access number and the total
accesses vs the total number of items could be used.  This may result in
a few less operations in the synchronous get operation.  

Since this isn't a bottleneck, I'm not too concerned with it, but it is
an interesting problem that someone might want to undertake. 

Good luck with your patches.

Aaron

> regards,
> michael
> 
> 
> 





RE: [collections][PATCH] SequencedHashMap violates Map contract, has poor performance -- BufferCache, SequencedHastable, MRUMap and LRUMap .

Posted by Michael Smith <mi...@iammichael.org>.
> > ok.  I've looked over some of the classes you mentioned.  LRUMap in
> > commons.collections definitely suffers the same problems.  I'm about
> > to post a patch for that class that enumerates some of the
> > problems that I
> > saw (I didn't fix them yet -- the patch just changes the license to
> > the proper long form and documents the problems).
> >
>
> The main problem here is that the second most recently used
> item can get
> dropped.  It is not proper LRU.

yup.  That's one of the things I mentioned in the documentation (in my
patch).  I even mentioned your example.  :)

> > Assuming you mean org.apache.turbine.util.BufferCache when you
> > referred
> > to BufferCache, I looked it over as well.  It inherits from
> > org.apache.turbine.util.SequencedHashtable which looks to be a near
> > exact duplicate of the SequencedHashtable that's in commons.util and
> > nearly the same as the SequencedHashMap that exists in
> > commons.collections (which is the one I submitted my patch against).
> >
> > I can't find any reference to MRUMap.  Can you give me a pointer?
> >
>
> It's in the commons sandbox(?) simplestore.

doh.  I don't know how I missed that.

Yes, MRUMap does the same thing.  Linked list; O(n) retrieval, O(n)
removal.  Kind of defeats the purpose of using a map at all, doesn't it?
:)

regards,
michael





RE: [collections][PATCH] SequencedHashMap violates Map contract, has poor performance -- BufferCache, SequencedHastable, MRUMap and LRUMap .

Posted by Aaron Smuts <aa...@verizon.net>.

> -----Original Message-----
> From: Michael Smith [mailto:michael@iammichael.org]
> Sent: Sunday, February 03, 2002 11:30 AM
> To: Jakarta Commons Developers List
> Subject: RE: [collections][PATCH] SequencedHashMap violates Map contract,
> has poor performance -- BufferCache, SequencedHastable, MRUMap and LRUMap
> .
> 
> > I was talking about the one in cvs.  I was trying to point
> > out that there
> > are other classes that suffer from the same problem that you
> > are fixing and
> > that there are other problems with one of the LRU
> > implementations.  See the
> > other emails.
> 
> ok.  I've looked over some of the classes you mentioned.  LRUMap in
> commons.collections definitely suffers the same problems.  I'm about to
> post a patch for that class that enumerates some of the problems that I
> saw (I didn't fix them yet -- the patch just changes the license to the
> proper long form and documents the problems).
> 

The main problem here is that the second most recently used item can get
dropped.  It is not proper LRU.

> Assuming you mean org.apache.turbine.util.BufferCache when you referred
> to BufferCache, I looked it over as well.  It inherits from
> org.apache.turbine.util.SequencedHashtable which looks to be a near
> exact duplicate of the SequencedHashtable that's in commons.util and
> nearly the same as the SequencedHashMap that exists in
> commons.collections (which is the one I submitted my patch against).
> 
> I can't find any reference to MRUMap.  Can you give me a pointer?
> 

It's in the commons sandbox(?) simplestore.

> The LRUStore you added to stratum looks good to me; however I think it's
> a bit too restrictive by requiring contents to be instances of
> ILRUElement and for the keys and values to be Serializable.  Seems like
> it would work for your needs though.  :)

You can put any object into it.  It just wraps it.  You can override the
wrapper and create something more sophisticated, since you can get the
wrapper out.  You can easily add an expiration and expire on get or
whatever.  
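
For example, something like this (a hypothetical wrapper, not the actual
ILRUElement or LRUStore code):

// Hypothetical wrapper showing the "expire on get" idea.
class ExpiringElementSketch {

    private final Object value;
    private final long maxLifeMillis;
    private final long createTime = System.currentTimeMillis();

    ExpiringElementSketch(Object value, long maxLifeMillis) {
        this.value = value;
        this.maxLifeMillis = maxLifeMillis;
    }

    // returns null once the element has outlived its max life
    Object getValueIfAlive() {
        if (System.currentTimeMillis() - createTime > maxLifeMillis) {
            return null;
        }
        return value;
    }
}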

I'll remove the serializable restriction.  That was from the cache.

Cheers,

Aaron

> 
> 
> Overall, yes, there are other classes that suffer from the problem I was
> fixing with commons.collections.SequencedHashMap.  I'm working on fixing
> the stuff in commons, since that's what has my interest.  LRUMap is on
> my list of things to fix.  Of course, this all assumes my patches will
> get applied eventually.  :)
> 
> regards,
> michael
> 
> 
> 





RE: [collections][PATCH] SequencedHashMap violates Map contract, has poor performance -- BufferCache, SequencedHastable, MRUMap and LRUMap .

Posted by Michael Smith <mi...@iammichael.org>.
> I was talking about the one in cvs.  I was trying to point
> out that there
> are other classes that suffer from the same problem that you
> are fixing and
> that there are other problems with one of the LRU
> implementations.  See the
> other emails.

ok.  I've looked over some of the classes you mentioned.  LRUMap in
commons.collections definitely suffers the same problems.  I'm about to
post a patch for that class that enumerates some of the problems that I
saw (I didn't fix them yet -- the patch just changes the license to the
proper long form and documents the problems).

Assuming you mean org.apache.turbine.util.BufferCache when you referred
to BufferCache, I looked it over as well.  It inherits from
org.apache.turbine.util.SequencedHashtable which looks to be a near
exact duplicate of the SequencedHashtable that's in commons.util and
nearly the same as the SequencedHashMap that exists in
commons.collections (which is the one I submitted my patch against).

I can't find any reference to MRUMap.  Can you give me a pointer?

The LRUStore you added to stratum looks good to me; however I think it's
a bit too restrictive by requiring contents to be instances of
ILRUElement and for the keys and values to be Serializable.  Seems like
it would work for your needs though.  :)


Overall, yes, there are other classes that suffer from the problem I was
fixing with commons.collections.SequencedHashMap.  I'm working on fixing
the stuff in commons, since that's what has my interest.  LRUMap is on
my list of things to fix.  Of course, this all assumes my patches will
get applied eventually.  :)

regards,
michael


