You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2009/09/06 13:16:45 UTC

More code style; common packages?

I'd like to ask we take a moment to agree on and then implement some
small code hygiene... should be things we always do to adhere to
project and industry norms:

- Make sure a copyright statement appears in each file
- Let's not do * imports
- No serialVersionUID
- No printStackTrace() or System.{out,err} -- logger instead (except
maybe in command-line program classes, or maybe test classes)
- No empty javadoc if you please
- Let's use canonical literals for floats and doubles -- 1.0, or 1.0f,
instead of 1f or 1d or 1.0d


also there are now two "cache" abstractions in this project. This may
be in vain but would be good to attempt to unify. I prefer how
org.apache.mahout.cf.taste.impl.common.Cache works -- don't think a
cache should have a set() operation for instance. The difference is in
eviction strategy. If it's actually necessary to have different
strategies, we could refactor this. I personally find it faster and as
fine to use the Bitset-based thing. Not going to push on it but
ideally we'd do a lot more unification like this.

There are also two Pair classes.

On that subject, also thought I'd ask about the
org.apache.mahout.utils package -- makes sense to have a place to
stuff miscellaneous methods, though some classes literally have one
three-line method. Also, what are the distance measure classes doing
there? And how does this package relate to org.apache.mahout.common?


The reason I am nit-picking here is I would like to clean up the
'common spaces' a bit to pave the way for more reuse across the code
base. That's good per se, but also acts against the trend for this to
end up being the mere concatenation of five developers' personal
projects.

Re: More code style; common packages?

Posted by Sean Owen <sr...@gmail.com>.
May I also ask we agree on a convention for asserts? One convention
is: don't use them.

If they are used, they should be used correctly. I see them being used
in a few places to check arguments to public methods, which doesn't
work out, because these checks don't happen when asserts are off, and
they are off by default.

asserts are for internal invariants that the class 'knows' to be true
when everything is functioning correctly and bug-free. They're turned
on in testing since, hey, mistakes are made. But once testing proves
they don't fire, they are turned off in production.

Re: More code style; common packages?

Posted by Sean Owen <sr...@gmail.com>.
Agree, that is an even better idea, just using this class. I agree,
very well done. my only 'objection' is it uses Objects, and for my
purposes I need to go farther to using only longs (and not even using
Map.Entry or intermediate Reference objects). But in general until a
piece of code hits that point, agree, use something already built. I
don't feel like I have the authority to do this replacing but think it
would prove a good change.

On Mon, Sep 7, 2009 at 6:02 PM, Ted Dunning<te...@gmail.com> wrote:
> yeah... but the difficulty of getting these implementations *really* right
> is substantial.  I would rather unify on the Google code if we need to do
> that.

Re: More code style; common packages?

Posted by Ted Dunning <te...@gmail.com>.
yeah... but the difficulty of getting these implementations *really* right
is substantial.  I would rather unify on the Google code if we need to do
that.

On Sun, Sep 6, 2009 at 11:55 PM, Sean Owen <sr...@gmail.com> wrote:

> Anyhow I might suggest this eviction strategy business could be a
> motivation to unify the caches, at which time it might become clearer
> that there's probably little barrier to doing so.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: More code style; common packages?

Posted by Ted Dunning <te...@gmail.com>.
My recent reading of Google collections makes me think that the right way to
do a cache is with their MapMaker and computableHash method.  Very nicely
done.

On Sun, Sep 6, 2009 at 11:55 PM, Sean Owen <sr...@gmail.com> wrote:

> My pet theory about caches is the caller should only need one
> operation -- get(K key). The cache is instantiated with an object that
> can load a value for a key. Given that it can do all its work.
> Otherwise every caller has to have knowledge about how to manage the
> cache and how to load values for keys. I am not sure, maybe this is
> partly what you are getting at, in which case this approach solves
> some of those issues?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: More code style; common packages?

Posted by Sean Owen <sr...@gmail.com>.
On Mon, Sep 7, 2009 at 2:15 AM, Robin Anil<ro...@gmail.com> wrote:
> The FastMap, BitVector and other classes in taste.common are being used(or
> should be used)  by other packages. We can start our own collections package
> say ...mahout.collections ?

They could be used, but could need some work. Those 'Map' and 'Set'
classes only work with longs, and don't fully implement the Map/Set
interface (they could). If someone has a use case, holler at me and
I'll move it out.

The possibly useful classes from taste.common are all the
Iterator/Iterable related items. FileLineIterator for iterating over a
file's lines, SkippingIterator interface for being able to skip
forward, PermutingIterator for iterating over a permutation of a
collection, SamplingIterator for iterating over only x% of something.

RandomUtils might be useful in that it provides a standard way to get
a PRNG, and I think there are three or four implementations used in
the code at the moment. It also lets one globally select 'test mode',
wherein all the PRNGs are seeded with the same seed every time, to
make unit tests reproducible.


> about Cache: the reason why both implementation differ is. At one place the
> datastore/retriever code is extended by the cache class. The other place,
> the datastore is independent of the cache and the application or the
> algorithm uses cache explicitly to get data or fetch+put data into the
> cache. Plus it didnt make sense to make Hbase retriever of the form <K,V>
> because, Hbase is used as getCell(row, columnfamily, column) or
> getFamily(row, columnfamily) or getRow(row)

I wouldn't push on it, but to continue the thread, I think it's
possible to do anything with either implementation.

My pet theory about caches is the caller should only need one
operation -- get(K key). The cache is instantiated with an object that
can load a value for a key. Given that it can do all its work.
Otherwise every caller has to have knowledge about how to manage the
cache and how to load values for keys. I am not sure, maybe this is
partly what you are getting at, in which case this approach solves
some of those issues?

... and since I'm on that topic now, it can get expensive to implement
strategies like LRU or LFU. It needs a whole new set of parallel data
structures to track access time, etc. In the end, this mattered so
much for my purposes that I went with something that makes slightly
dumber decisions about what to evict but is much leaner. And, I'd
suggest that in most cases, it is not important to find the absolute
oldest thing to evict, but just something kinda old.

I don't know what this is called, but what I do is hold a bitset with
a bit for each entry. When the entry is accessed, the bit is set to 1.
When the cache fills, a random item is evicted -- but if we choose an
item whose bit is 1, we just set it to 0 and pick something else. So
recently used items get a 'second chance' to escape eviction. Quite
lean, and works fine.

Anyhow I might suggest this eviction strategy business could be a
motivation to unify the caches, at which time it might become clearer
that there's probably little barrier to doing so.


> about Pair: I really didnt see that. Again, lets move all these helper
> classes out of taste and ensure its getting reused by other algorithms as
> well. And it will also ease adding more trove/colt like collections classes

As a warmup I'll put together a patch that unifies all this.

Re: More code style; common packages?

Posted by Robin Anil <ro...@gmail.com>.
+1 to all that.
Adding to that

The FastMap, BitVector and other classes in taste.common are being used(or
should be used)  by other packages. We can start our own collections package
say ...mahout.collections ?

about Cache: the reason why both implementation differ is. At one place the
datastore/retriever code is extended by the cache class. The other place,
the datastore is independent of the cache and the application or the
algorithm uses cache explicitly to get data or fetch+put data into the
cache. Plus it didnt make sense to make Hbase retriever of the form <K,V>
because, Hbase is used as getCell(row, columnfamily, column) or
getFamily(row, columnfamily) or getRow(row)

about Pair: I really didnt see that. Again, lets move all these helper
classes out of taste and ensure its getting reused by other algorithms as
well. And it will also ease adding more trove/colt like collections classes

Re: More code style; common packages?

Posted by Benson Margulies <bi...@gmail.com>.
I'm sorry that I haven't followed through on the automation here; I have so
far failed to find a way to reconcile the existing preferred style to
checkstyle as we discussed.

On Sun, Sep 6, 2009 at 6:35 PM, Grant Ingersoll <gs...@apache.org> wrote:

>
> On Sep 6, 2009, at 7:16 AM, Sean Owen wrote:
>
>  I'd like to ask we take a moment to agree on and then implement some
>> small code hygiene... should be things we always do to adhere to
>> project and industry norms:
>>
>> - Make sure a copyright statement appears in each file
>>
>
> RAT should help with this.
>
>  - Let's not do * imports
>>
>
> +1
>
>  - No serialVersionUID
>> - No printStackTrace() or System.{out,err} -- logger instead (except
>> maybe in command-line program classes, or maybe test classes)
>> - No empty javadoc if you please
>> - Let's use canonical literals for floats and doubles -- 1.0, or 1.0f,
>> instead of 1f or 1d or 1.0d
>>
>>
>> also there are now two "cache" abstractions in this project. This may
>> be in vain but would be good to attempt to unify. I prefer how
>> org.apache.mahout.cf.taste.impl.common.Cache works -- don't think a
>> cache should have a set() operation for instance. The difference is in
>> eviction strategy. If it's actually necessary to have different
>> strategies, we could refactor this. I personally find it faster and as
>> fine to use the Bitset-based thing. Not going to push on it but
>> ideally we'd do a lot more unification like this.
>>
>>
> Yeah, we should consolidate to common code where possible.  Not sure about
> set() or not
>
>  There are also two Pair classes.
>>
>> On that subject, also thought I'd ask about the
>> org.apache.mahout.utils package -- makes sense to have a place to
>> stuff miscellaneous methods, though some classes literally have one
>> three-line method. Also, what are the distance measure classes doing
>> there? And how does this package relate to org.apache.mahout.common?
>>
>
>
>> The reason I am nit-picking here is I would like to clean up the
>> 'common spaces' a bit to pave the way for more reuse across the code
>> base. That's good per se, but also acts against the trend for this to
>> end up being the mere concatenation of five developers' personal
>> projects.
>>
>
>
> Sounds good.

Re: More code style; common packages?

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 6, 2009, at 7:16 AM, Sean Owen wrote:

> I'd like to ask we take a moment to agree on and then implement some
> small code hygiene... should be things we always do to adhere to
> project and industry norms:
>
> - Make sure a copyright statement appears in each file

RAT should help with this.

> - Let's not do * imports

+1

> - No serialVersionUID
> - No printStackTrace() or System.{out,err} -- logger instead (except
> maybe in command-line program classes, or maybe test classes)
> - No empty javadoc if you please
> - Let's use canonical literals for floats and doubles -- 1.0, or 1.0f,
> instead of 1f or 1d or 1.0d
>
>
> also there are now two "cache" abstractions in this project. This may
> be in vain but would be good to attempt to unify. I prefer how
> org.apache.mahout.cf.taste.impl.common.Cache works -- don't think a
> cache should have a set() operation for instance. The difference is in
> eviction strategy. If it's actually necessary to have different
> strategies, we could refactor this. I personally find it faster and as
> fine to use the Bitset-based thing. Not going to push on it but
> ideally we'd do a lot more unification like this.
>

Yeah, we should consolidate to common code where possible.  Not sure  
about set() or not

> There are also two Pair classes.
>
> On that subject, also thought I'd ask about the
> org.apache.mahout.utils package -- makes sense to have a place to
> stuff miscellaneous methods, though some classes literally have one
> three-line method. Also, what are the distance measure classes doing
> there? And how does this package relate to org.apache.mahout.common?

>
> The reason I am nit-picking here is I would like to clean up the
> 'common spaces' a bit to pave the way for more reuse across the code
> base. That's good per se, but also acts against the trend for this to
> end up being the mere concatenation of five developers' personal
> projects.


Sounds good.