You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2010/09/05 20:59:12 UTC

Interesting iterator issue

Having an interesting issue fixing up a technical point in the class
SequenceFileVectorIterable. Its Iterator is wrong in that hasNext()
advances the iteration and next() doesn't. There's a way to fix this
easily: the Iterator just needs to always read one item ahead to know
whether a next one exists.

However doing this the straightforward way, the Iterator doesn't know
the current key/value it's on -- always the next one. This is an issue
since in the one usage of this class, in VectorDumper, the current key
is accessed for printing.

The Iterator could just store the last key/value it saw. However, the
key can be an arbitrary Writable. To do this correctly, the Writable
would have to be Cloneable and be clone()-ed, which is not guaranteed
and maybe undesirable.

1. We can remove the option in VectorDumper to print keys to fix this,
since that's the only thing that wants to read the current key. How
bad is that?
2. We can iterate directly over the SequenceFile in VectorDumper to
get desired behavior. OK? Then actually SequenceFileVectorIterable
goes away.

Sean

Re: Interesting iterator issue

Posted by Sean Owen <sr...@gmail.com>.
It's not used anywhere else right now, but yes I can also imagine it
being useful. The hard part is that it's hard to write correctly
(unless I miss something) on account of a few related wrinkles:

1. SequenceFile.Reader doesn't have a "hasNext()"-like method. The
only way to know if it has more is to read the next value and see if
it's there. That's OK -- you could just read ahead one entry every
time, storing the last entry as the real current 'next' value.

2. But, SequenceFile.Reader reads into a Writable container object of
some kind. There is no generic "get()" method on Writable that pops
out the underlying value to be saved off.

3. So you have to save off the whole Writable as the last value read,
and make a new Writable via reflection for every value read. This is
possible but sort of defeats the purpose of how this is supposed to
work in Hadoop.

4. (And as a corollary, you can't define the class to use generics in
a way that lets you operate in terms of the underlying key-value
classes, only Writables.)

Re: Interesting iterator issue

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 5, 2010, at 2:59 PM, Sean Owen wrote:

> Having an interesting issue fixing up a technical point in the class
> SequenceFileVectorIterable. Its Iterator is wrong in that hasNext()
> advances the iteration and next() doesn't. There's a way to fix this
> easily: the Iterator just needs to always read one item ahead to know
> whether a next one exists.
> 
> However doing this the straightforward way, the Iterator doesn't know
> the current key/value it's on -- always the next one. This is an issue
> since in the one usage of this class, in VectorDumper, the current key
> is accessed for printing.
> 
> The Iterator could just store the last key/value it saw. However, the
> key can be an arbitrary Writable. To do this correctly, the Writable
> would have to be Cloneable and be clone()-ed, which is not guaranteed
> and maybe undesirable.
> 
> 1. We can remove the option in VectorDumper to print keys to fix this,
> since that's the only thing that wants to read the current key. How
> bad is that?
> 2. We can iterate directly over the SequenceFile in VectorDumper to
> get desired behavior. OK? Then actually SequenceFileVectorIterable
> goes away.

For some reason, I recall the SFVI being useful for other things, but I'm failing to recall them at the moment.

Re: Interesting iterator issue

Posted by Sean Owen <sr...@gmail.com>.
I'm going for #2. My reasoning is that the current Iterator's point is
to be reusable, but is not reusable at the moment. It is only used in
one place, and "inlining" its logic results in a correct and more
efficient implementation than is possible by patching up the Iterator,
since it needs more particular behavior than a general Iterator can
provide here.

Should we want an Iterator over key-value pairs in a SequenceFile --
and that's interesting -- it can be rewritten (by me if you like) in a
more correct way later.

On Sun, Sep 5, 2010 at 7:59 PM, Sean Owen <sr...@gmail.com> wrote:
> Having an interesting issue fixing up a technical point in the class
> SequenceFileVectorIterable. Its Iterator is wrong in that hasNext()
> advances the iteration and next() doesn't. There's a way to fix this
> easily: the Iterator just needs to always read one item ahead to know
> whether a next one exists.
>
> However doing this the straightforward way, the Iterator doesn't know
> the current key/value it's on -- always the next one. This is an issue
> since in the one usage of this class, in VectorDumper, the current key
> is accessed for printing.
>
> The Iterator could just store the last key/value it saw. However, the
> key can be an arbitrary Writable. To do this correctly, the Writable
> would have to be Cloneable and be clone()-ed, which is not guaranteed
> and maybe undesirable.
>
> 1. We can remove the option in VectorDumper to print keys to fix this,
> since that's the only thing that wants to read the current key. How
> bad is that?
> 2. We can iterate directly over the SequenceFile in VectorDumper to
> get desired behavior. OK? Then actually SequenceFileVectorIterable
> goes away.
>
> Sean
>