Posted to dev@mahout.apache.org by Dmitriy Lyubimov <dl...@gmail.com> on 2010/12/13 18:49:28 UTC

Sequential access to VectorWritable content proposal.

Hi all,

I would like to submit a patch to VectorWritable that allows streaming
access to vector elements without having to prebuffer all of them first
(the current code supports only the prebuffered approach).

That patch would eliminate one of the memory usage issues in the
current Stochastic SVD implementation and effectively remove the memory bound
on n for the SVD work. (The value I see is not so much in removing the bound
as in being more efficient with memory, thus essentially speeding up the
computation.)

If it's OK, I would like to create a JIRA issue and provide a patch for it.

Another issue is to provide an SSVD patch that depends on that patch for
VectorWritable.

Thank you.
-Dmitriy

Re: Sequential access to VectorWritable content proposal.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Initial work on vector preprocessing is done in my git repository on the "ssvd-vw-hack"
branch of my SSVD work (the doc is here: https://github.com/dlyubimov/ssvd-doc).
The heavy lifting is done through a VectorPreprocessor interface and it seems to
work like a charm. I did not test it extensively, but when complete, it
should be able to cope with occasional spikes in data density without
crippling the SVD mapper's memory.

thanks. -d

On Mon, Dec 13, 2010 at 9:49 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Hi all,
>
> I would like to submit a patch to VectorWritable that allows for streaming
> access to vector elements without having to prebuffer all of them first.
> (current code allows for the latter only).
>
> That patch would allow to strike down one of the memory usage issues in
> current Stochastic SVD implementation and effectively open memory bound for
> n of the SVD work. (The value i see is not to open up the the bound though
> but just be more efficient in memory use, thus essentially speeding u p the
> computation. )
>
> If it's ok, i would like to create a JIRA issue and provide a patch for it.
>
>
> Another issue is to provide an SSVD patch that depends on that patch for
> VectorWritable.
>
> Thank you.
> -Dmitriy
>

Re: Sequential access to VectorWritable content proposal.

Posted by Ted Dunning <te...@gmail.com>.
Why not use the Iterable interface then?

On Mon, Dec 13, 2010 at 10:57 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> No, no new vector types. All that is needed is push-type element handler
> that the algorithm can set, something like
>
> void onNextVectorElement ( int index, double value ) throws IOException
>
> or something like that.
>
> IF the handler is set, then the writable uses it during read process, (and
> subequently get() is not producing anything) ,
>
> and if the handler is not set , then it works as it does today.
>
> On Mon, Dec 13, 2010 at 10:18 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Interesting idea.
> >
> > Would this introduce a new vector type that only allows iterating through
> > the elements once?
> >
> > On Mon, Dec 13, 2010 at 9:49 AM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > I would like to submit a patch to VectorWritable that allows for
> > streaming
> > > access to vector elements without having to prebuffer all of them
> first.
> > > (current code allows for the latter only).
> > >
> > > That patch would allow to strike down one of the memory usage issues in
> > > current Stochastic SVD implementation and effectively open memory bound
> > for
> > > n of the SVD work. (The value i see is not to open up the the bound
> > though
> > > but just be more efficient in memory use, thus essentially speeding u p
> > the
> > > computation. )
> > >
> > > If it's ok, i would like to create a JIRA issue and provide a patch for
> > it.
> > >
> > > Another issue is to provide an SSVD patch that depends on that patch
> for
> > > VectorWritable.
> > >
> > > Thank you.
> > > -Dmitriy
> > >
> >
>

Re: Sequential access to VectorWritable content proposal.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
No, no new vector types. All that is needed is a push-type element handler
that the algorithm can set, something like

void onNextVectorElement ( int index, double value ) throws IOException

or something like that.

If the handler is set, then the writable uses it during the read process (and
subsequently get() does not produce anything),

and if the handler is not set, then it works as it does today.
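A minimal sketch of how this push-style handler could work. The interface name and the reader class are hypothetical (only the `onNextVectorElement` signature comes from the proposal above); this is not Mahout's actual VectorWritable code:

```java
import java.io.IOException;

// The handler signature from the proposal above.
interface VectorElementHandler {
    void onNextVectorElement(int index, double value) throws IOException;
}

// Hypothetical reader illustrating the proposed two-mode behavior:
// push elements to a handler if one is set, otherwise buffer as today.
class SketchVectorReader {
    private VectorElementHandler handler; // null => buffer, as today

    void setHandler(VectorElementHandler h) { handler = h; }

    // Stand-in for readFields(): walks (index, value) pairs as if read
    // from a stream; returns a buffered dense array only when no handler is set.
    double[] read(int cardinality, int[] indices, double[] values) throws IOException {
        if (handler != null) {
            for (int i = 0; i < indices.length; i++) {
                handler.onNextVectorElement(indices[i], values[i]); // push, no buffering
            }
            return null; // get() would produce nothing in this mode
        }
        double[] dense = new double[cardinality];
        for (int i = 0; i < indices.length; i++) {
            dense[indices[i]] = values[i];
        }
        return dense;
    }
}

public class PushSketch {
    static double demo() {
        SketchVectorReader r = new SketchVectorReader();
        final double[] sum = {0.0};
        r.setHandler((index, value) -> sum[0] += value); // accumulate without buffering
        try {
            r.read(10, new int[]{1, 5, 9}, new double[]{2.0, 3.0, 4.0});
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return sum[0];
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 9.0
    }
}
```

An algorithm that only needs a running aggregate, as in the usage above, never holds the whole vector in memory.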

On Mon, Dec 13, 2010 at 10:18 AM, Ted Dunning <te...@gmail.com> wrote:

> Interesting idea.
>
> Would this introduce a new vector type that only allows iterating through
> the elements once?
>
> On Mon, Dec 13, 2010 at 9:49 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > I would like to submit a patch to VectorWritable that allows for
> streaming
> > access to vector elements without having to prebuffer all of them first.
> > (current code allows for the latter only).
> >
> > That patch would allow to strike down one of the memory usage issues in
> > current Stochastic SVD implementation and effectively open memory bound
> for
> > n of the SVD work. (The value i see is not to open up the the bound
> though
> > but just be more efficient in memory use, thus essentially speeding u p
> the
> > computation. )
> >
> > If it's ok, i would like to create a JIRA issue and provide a patch for
> it.
> >
> > Another issue is to provide an SSVD patch that depends on that patch for
> > VectorWritable.
> >
> > Thank you.
> > -Dmitriy
> >
>

Re: Sequential access to VectorWritable content proposal.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Jake,

Yes, that's the problem here. That's also why I am trying to make this a
proposal instead of making a private patch and being done with it: since I saw
your previous post, it would make sense to do something in a more
coordinated fashion through the OSS process.

Yes, it's the same problem you are having (except in my case it's more
about the memory balance between buffering and the algorithm, which also needs
some of that juice) rather than a problem of fitting one single vector into all
the memory I have.

-d



On Mon, Dec 13, 2010 at 10:24 AM, Jake Mannix <ja...@gmail.com> wrote:

> Hey Dmitriy,
>
>  I've also been playing around with a VectorWritable format which is backed
> by a
> SequenceFile, but I've been focussed on the case where it's essentially the
> entire
> matrix, and the rows don't fit into memory.  This seems different than your
> current
> use case, however - you just want (relatively) small vectors to load
> faster,
> right?
>
>  -jake
>
> On Mon, Dec 13, 2010 at 10:18 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Interesting idea.
> >
> > Would this introduce a new vector type that only allows iterating through
> > the elements once?
> >
> > On Mon, Dec 13, 2010 at 9:49 AM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > I would like to submit a patch to VectorWritable that allows for
> > streaming
> > > access to vector elements without having to prebuffer all of them
> first.
> > > (current code allows for the latter only).
> > >
> > > That patch would allow to strike down one of the memory usage issues in
> > > current Stochastic SVD implementation and effectively open memory bound
> > for
> > > n of the SVD work. (The value i see is not to open up the the bound
> > though
> > > but just be more efficient in memory use, thus essentially speeding u p
> > the
> > > computation. )
> > >
> > > If it's ok, i would like to create a JIRA issue and provide a patch for
> > it.
> > >
> > > Another issue is to provide an SSVD patch that depends on that patch
> for
> > > VectorWritable.
> > >
> > > Thank you.
> > > -Dmitriy
> > >
> >
>

Re: Sequential access to VectorWritable content proposal.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Ted,

What Iterable are you talking about? On a Vector? I don't think it helps,
because you have to have the Vector loaded (prebuffered) already, so you have
already paid the memory for it.
I don't think VectorWritable has an Iterable, though.

An Iterator is a pull technique, and in MR it would be possible to do something
like this, but Iterable suggests you could do multiple passes over the data,
which is not possible in the context of MR jobs without prebuffering.

A push technique explicitly signals that one would not be able to do multiple
passes (as with DocumentHandler in the SAX standard). So even if we implemented
an iterator without prebuffering vector data, in the context of an MR job it means
you can't have more than one iterator, which I think is conceptually
inconsistent (i.e. misleading) with respect to the Iterable contract.
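To make the tension concrete, here is a sketch of the "one-shot Iterable" idea under discussion: the first iterator() call streams the elements, and any later call fails, which is exactly the bending of the Iterable contract being objected to. All names here are illustrative, not Mahout API:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Illustrative one-shot Iterable over (index, value) element pairs.
public class OneShotElements implements Iterable<double[]> {
    private final int[] indices;
    private final double[] values;
    private boolean consumed = false;

    public OneShotElements(int[] indices, double[] values) {
        this.indices = indices;
        this.values = values;
    }

    @Override
    public Iterator<double[]> iterator() {
        if (consumed) {
            // The contested part: Iterable implies repeatable iteration,
            // but a stream can be read only once.
            throw new UnsupportedOperationException("elements already consumed");
        }
        consumed = true;
        return new Iterator<double[]>() {
            private int i = 0;
            public boolean hasNext() { return i < indices.length; }
            public double[] next() {
                if (!hasNext()) throw new NoSuchElementException();
                double[] pair = {indices[i], values[i]};
                i++;
                return pair;
            }
        };
    }

    // One streaming pass: sum the values without buffering the vector.
    public static double sumOnce(OneShotElements e) {
        double sum = 0;
        for (double[] pair : e) sum += pair[1];
        return sum;
    }

    public static void main(String[] args) {
        OneShotElements e = new OneShotElements(new int[]{0, 3}, new double[]{1.5, 2.5});
        System.out.println(sumOnce(e)); // 4.0
        try {
            e.iterator(); // a second pass over the stream is impossible
        } catch (UnsupportedOperationException ex) {
            System.out.println("second iterator refused");
        }
    }
}
```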


Thanks.
-Dmitriy

On Mon, Dec 13, 2010 at 11:05 AM, Ted Dunning <te...@gmail.com> wrote:

> How common is it that a row won't fit in memory?  My experience is that
> essentially all rows that
> I am interested will fit in very modest amounts of memory, but that row by
> row handling is imperative.
>
> Is this just gilding the lily?
>
> On Mon, Dec 13, 2010 at 10:24 AM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> > Hey Dmitriy,
> >
> >  I've also been playing around with a VectorWritable format which is
> backed
> > by a
> > SequenceFile, but I've been focussed on the case where it's essentially
> the
> > entire
> > matrix, and the rows don't fit into memory.  This seems different than
> your
> > current
> > use case, however - you just want (relatively) small vectors to load
> > faster,
> > right?
> >
> >  -jake
> >
> > On Mon, Dec 13, 2010 at 10:18 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > Interesting idea.
> > >
> > > Would this introduce a new vector type that only allows iterating
> through
> > > the elements once?
> > >
> > > On Mon, Dec 13, 2010 at 9:49 AM, Dmitriy Lyubimov <dl...@gmail.com>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I would like to submit a patch to VectorWritable that allows for
> > > streaming
> > > > access to vector elements without having to prebuffer all of them
> > first.
> > > > (current code allows for the latter only).
> > > >
> > > > That patch would allow to strike down one of the memory usage issues
> in
> > > > current Stochastic SVD implementation and effectively open memory
> bound
> > > for
> > > > n of the SVD work. (The value i see is not to open up the the bound
> > > though
> > > > but just be more efficient in memory use, thus essentially speeding u
> p
> > > the
> > > > computation. )
> > > >
> > > > If it's ok, i would like to create a JIRA issue and provide a patch
> for
> > > it.
> > > >
> > > > Another issue is to provide an SSVD patch that depends on that patch
> > for
> > > > VectorWritable.
> > > >
> > > > Thank you.
> > > > -Dmitriy
> > > >
> > >
> >
>

Re: Sequential access to VectorWritable content proposal.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Jake,

Are you saying vector element = sequence file record?

This is also a common problem, and there are huge benefits to be had in terms
of I/O in the stochastic SVD implementation (I reviewed that in the issues
section there). I was thinking in that direction too, but what I thought was
that this problem might be better served by a block-oriented record rather
than an element-oriented record format. I think block-wise formats are quite
common in scientific libraries like ScaLAPACK.

But it's quite clear that some algorithms would be averse to pre-blocking,
but not to row splicing (i.e. just splitting a long row into subfragments).
That's also an interesting initiative, although it's not going to work out of
the box with the stochastic SVD code. The fundamental issue with the
existing approach is that it has to create MR splits on vector boundaries,
not block boundaries (although a 100% block-wise algorithm seems quite
plausible and promising there too).
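For readers unfamiliar with block-wise formats, a toy sketch of the idea (purely illustrative, not Mahout or ScaLAPACK code): the matrix is cut into fixed-size tiles, each record carries one tile keyed by its offsets, and any element of the full matrix maps to a tile plus a cell within it:

```java
// Illustrative block-oriented record: a dense tile of the matrix
// keyed by its (row, column) offsets, ScaLAPACK-style.
public class MatrixBlock {
    final int rowOffset, colOffset; // position of the tile in the full matrix
    final double[][] data;          // blockRows x blockCols dense tile

    MatrixBlock(int rowOffset, int colOffset, double[][] data) {
        this.rowOffset = rowOffset;
        this.colOffset = colOffset;
        this.data = data;
    }

    // Which block, and which cell inside it, holds element (i, j)
    // of the full matrix, for a given tile size.
    static int[] locate(int i, int j, int blockRows, int blockCols) {
        return new int[] { i / blockRows, j / blockCols, i % blockRows, j % blockCols };
    }

    public static void main(String[] args) {
        // element (5, 7) with 4x4 tiles lives in block (1, 1), cell (1, 3)
        int[] loc = locate(5, 7, 4, 4);
        System.out.println(loc[0] + "," + loc[1] + "," + loc[2] + "," + loc[3]); // 1,1,1,3
    }
}
```

MR splits then naturally fall on tile boundaries rather than on (possibly huge) vector boundaries, which is the property the message above is after.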


On Mon, Dec 13, 2010 at 1:06 PM, Jake Mannix <ja...@gmail.com> wrote:

> I'm not sure that Dmitriy's use-case has an easy solution.  As you
> say, Writable loads into memory the whole thing, independently of
> whether you try / not try to do buffering on iteration.
>
> My situation (monstrous vectors) is easier, in some respects: if
> the matrices are essentially
> SequenceFile<IntWritable,Pair<IntWritable,DoubleWritable>>, then
> there are a lot bigger vectors which can be handled in MR jobs, but
> they no longer really look like "vectors" in the interface sense.
>
>  -jake
>
> On Mon, Dec 13, 2010 at 12:52 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > OK.
> >
> > Let's assume that this is needed.
> >
> > I think that an iterable interface on VectorWritable that throws
> > UnsupportedOperationException or similar if
> > you try to get the iterator twice is much more transparent than a watcher
> > structure and much easier for a user
> > to discover/re-invent.
> >
> > Another (evil) thought is a parallel class to VectorWritable which is
> > essentially SequentialAccessVectorWritable that supports reading and
> > writing.  It seems to me that the Writable isn't real compatible with
> this
> > interface in any case.  How will that be resolved?
> >
> >
> > On Mon, Dec 13, 2010 at 11:36 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
> > >wrote:
> >
> > > Absent of this solution, i realistically don't see how i can go without
> a
> > > push technique in accessing the vectors.
> > >
> >
>

Re: Sequential access to VectorWritable content proposal.

Posted by Jake Mannix <ja...@gmail.com>.
I'm not sure that Dmitriy's use case has an easy solution.  As you
say, a Writable loads the whole thing into memory, regardless of
whether you buffer on iteration.

My situation (monstrous vectors) is easier, in some respects: if
the matrices are essentially
SequenceFile<IntWritable,Pair<IntWritable,DoubleWritable>>, then
much bigger vectors can be handled in MR jobs, but
they no longer really look like "vectors" in the interface sense.

  -jake

On Mon, Dec 13, 2010 at 12:52 PM, Ted Dunning <te...@gmail.com> wrote:

> OK.
>
> Let's assume that this is needed.
>
> I think that an iterable interface on VectorWritable that throws
> UnsupportedOperationException or similar if
> you try to get the iterator twice is much more transparent than a watcher
> structure and much easier for a user
> to discover/re-invent.
>
> Another (evil) thought is a parallel class to VectorWritable which is
> essentially SequentialAccessVectorWritable that supports reading and
> writing.  It seems to me that the Writable isn't real compatible with this
> interface in any case.  How will that be resolved?
>
>
> On Mon, Dec 13, 2010 at 11:36 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >wrote:
>
> > Absent of this solution, i realistically don't see how i can go without a
> > push technique in accessing the vectors.
> >
>

Re: Sequential access to VectorWritable content proposal.

Posted by Ted Dunning <te...@gmail.com>.
I meant that algorithms will need to change to use the per-element capability.
 Getting algorithms to work with a chunk is the same kind of work.  Chunks
are probably going to be more efficient due to blocking effects,
although network overhead may dominate.

On Mon, Dec 13, 2010 at 5:09 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> I disagree.
>
> it a simple binary logic: Am i equipped with push handler? yes --> push
> values to handler. no? --> same as before. no change in anything that does
> not _opt in_ into the capability.
>
> On Mon, Dec 13, 2010 at 5:06 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > I am not sure about that since the algorithms will have to be massively
> > changed either way.
>

Re: Sequential access to VectorWritable content proposal.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I disagree.

It's simple binary logic: am I equipped with a push handler? Yes --> push
values to the handler. No --> same as before. Nothing changes for anything that
does not _opt in_ to the capability.

On Mon, Dec 13, 2010 at 5:06 PM, Ted Dunning <te...@gmail.com> wrote:

> I am not sure about that since the algorithms will have to be massively
> changed either way.
>
> On Mon, Dec 13, 2010 at 5:03 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > But that brings up an issue of data
> > prep utils, so that would require more coding than adding a capability to
> > the VW.
> >
>

Re: Sequential access to VectorWritable content proposal.

Posted by Ted Dunning <te...@gmail.com>.
I am not sure about that since the algorithms will have to be massively
changed either way.

On Mon, Dec 13, 2010 at 5:03 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> But that brings up an issue of data
> prep utils, so that would require more coding than adding a capability to
> the VW.
>

Re: Sequential access to VectorWritable content proposal.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
> We should talk about whether it isn't better to just serialize chunks of
> vectors instead.

That actually is a better option; I mentioned it as a preferable
solution avenue. Just splice the rows. But that brings up the issue of data
prep utils, so it would require more coding than adding the capability to
the VW.



On Mon, Dec 13, 2010 at 4:54 PM, Ted Dunning <te...@gmail.com> wrote:

> We should talk about whether it isn't better to just serialize chunks of
> vectors instead.
>
> This would not require fancy footwork and perversion of the contracts that
> WritableVector thinks it has now.  It would also be bound to be a better
> match for the problem domain.
>
> So what about just using VW as it stands for vectors that are at most 100K
> elements and are keyed by row and left-most column?  Why isn't that the
> better option?
>
> On Mon, Dec 13, 2010 at 4:41 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > Thank you, Jake.
> >
> > Yes i will do that tonight. I did do a brief assessment of that code
> before
> > but not very thoroughly though which was why i thought it must be a low
> > hanging fruit; but Ted and you certainly should know better so  i need to
> > look at it the second time to be sure. Thank you, sir.
> >
> > -Dima
> >
> > On Mon, Dec 13, 2010 at 4:37 PM, Jake Mannix <ja...@gmail.com>
> > wrote:
> >
> > > Check the source for VectorWritable, I'm pretty sure it serializes
> > > in the order of the nonDefaultIterator(), which for SASVectors is in
> > order,
> > > so while these are indeed non-optimal for random access and mutating
> > > operations, that is indeed the tradeoff you have to make when picking
> > > your vector impl.
> > >
> > >  -jake
> > >
> > > On Mon, Dec 13, 2010 at 4:30 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > > wrote:
> > >
> > > > Yes, it should be. I thought Ted implied VectorWritable does it only
> > this
> > > > way and non other.
> > > >
> > > > If we can differentiate I'd rather do it. Implying that if you save
> in
> > > one
> > > > format (non-sequential) we'd support it with caveat that it's subpar
> in
> > > > certain cases whereas where you want to format input sequentially,
> we'd
> > > > eliminate vector prebuffering stage. Yes, that will work. Thank you,
> > > Jake.
> > > >
> > > > -d
> > > >
> > > >
> > > > On Mon, Dec 13, 2010 at 4:26 PM, Jake Mannix <ja...@gmail.com>
> > > > wrote:
> > > >
> > > > > Dmitriy,
> > > > >
> > > > >  You should be able to specify that your matrices be stored in
> > > > > SequentialAccessSparseVector format if you need to.  This is
> > > > > almost always the right thing for HDFS-backed matrices, because
> > > > > HDFS is write-once, and SASVectors are optimized for read-only
> > > > > sequential access, which is your exact use case, right?
> > > > >
> > > > >  -jake
> > > > >
> > > > > On Mon, Dec 13, 2010 at 4:21 PM, Dmitriy Lyubimov <
> dlieu.7@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > I don't think sequentiality is a requirement in the case i am
> > working
> > > > on.
> > > > > > However, let me peek at the code first. I am guessing it is some
> > form
> > > > of
> > > > > a
> > > > > > near-perfect hash, in which case it may not be possible to read
> it
> > in
> > > > > parts
> > > > > > at all. Which would be bad, indeed. I would need to find a
> > completely
> > > > > > alternative input format then to overcome my case.
> > > > > >
> > > > > > On Mon, Dec 13, 2010 at 4:01 PM, Ted Dunning <
> > ted.dunning@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > I don't thikn that sequentiality part of the contract.
> > > > > > >  RandomAccessSparseVectors are likely to
> > > > > > > produce disordered values when serialized, I think.
> > > > > > >
> > > > > > > On Mon, Dec 13, 2010 at 1:48 PM, Dmitriy Lyubimov <
> > > dlieu.7@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I will have to look at details of VectorWritable to make sure
> > all
> > > > > cases
> > > > > > > are
> > > > > > > > covered (I only took a very brief look so far). But as long
> as
> > it
> > > > is
> > > > > > able
> > > > > > > > to
> > > > > > > > produce elements in order of index increase, push technique
> > will
> > > > > > > certainly
> > > > > > > > work for most algorithms (and in some cases, notably with
> SSVD,
> > > > even
> > > > > if
> > > > > > > it
> > > > > > > > produces the data in non-sequential way, it would work too )
> .
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Sequential access to VectorWritable content proposal.

Posted by Ted Dunning <te...@gmail.com>.
Sometimes sampling works.  Sometimes it doesn't.  Sampling helps in
recommendations even more than with space, because co-occurrence counting
requires time quadratic in the number of occurrences of an item.

In fact, it is possible to do the block decomposition we were talking about
using an interleaved sampling.  That would have an interesting and often
positive result in many algorithms.

But sampling is, sadly, not an option for many algorithms.

On Tue, Dec 14, 2010 at 12:26 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> I see. Thank you, Sean.
>
> On Tue, Dec 14, 2010 at 12:17 AM, Sean Owen <sr...@gmail.com> wrote:
>
> > Yah I have the same sort of problem with recommenders. Once in a while
> you
> > have a very, very popular item. In those cases I use some smart sampling
> to
> > simply throw out some data. For my use cases, that has virtually no
> effect
> > on the result so it's valid. Not sure if you can use the same approaches.
> >
> >
> > On Tue, Dec 14, 2010 at 1:21 AM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> >
> > > Sean,
> > >
> > > Absolutely agree. There's no such thing as 1Gb HDFS block. The max i
> ever
> > > saw is 128m but most commonly 64.
> > >
> > > Imagine for a second loading 1Gb records into a mapper. Assuming
> > > sufficiently large amount of nodes, the math expectancy of the
> collocated
> > > data in this case is approximately 32M, everything else comes from some
> > > other node.
> > >
> > > So we start our job with moving ~97% of the data around just to make it
> > > available to the mappers.
> > >
> > > There are two considerations in my case however that alleviated that
> > > concern
> > > after all at least in my particular case somewhat:
> > >
> > > 1) Not every vector is 1Gb. In fact, only 0.1% of vectors are perhaps
> > > 100-150mb at most. But i still have to yield 50% of mapper RAM to the
> VW
> > > just to make sure 0.1% of cases goes thru. Sp tje case i am making is
> not
> > > for 1Gb vectors, i am saying that the problem is already quite
> > detrimental
> > > to algorithm running time at much smaller sizes.
> > >
> > > 2) Stochastic SVD is CPU bound. I did not actually run it with 1Gb
> > vectors,
> > > nor do i plan to. But if i did, i strongly suspect it is not I/O that
> > would
> > > be a bottleneck.
> > >
> > > On Mon, Dec 13, 2010 at 5:01 PM, Sean Owen <sr...@gmail.com> wrote:
> > >
> > > > This may be a late or ill-informed comment but --
> > > >
> > > > I don't believe there's any issue with VectorWritable per se, no.
> > Hadoop
> > > > most certainly assumes that one Writable can fit into RAM. A 1GB
> > Writable
> > > > is
> > > > just completely incompatible with how Hadoop works. The algorithm
> would
> > > > have
> > > > to be parallelized more then.
> > > >
> > > > Yes that may mean re-writing the code to deal with small Vectors and
> > > such.
> > > > That probably doesn't imply a change to VectorWritable but to the
> jobs
> > > > using
> > > > it.
> > > >
> > > >
> > >
> >
>

Re: Sequential access to VectorWritable content proposal.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I see. Thank you, Sean.

On Tue, Dec 14, 2010 at 12:17 AM, Sean Owen <sr...@gmail.com> wrote:

> Yah I have the same sort of problem with recommenders. Once in a while you
> have a very, very popular item. In those cases I use some smart sampling to
> simply throw out some data. For my use cases, that has virtually no effect
> on the result so it's valid. Not sure if you can use the same approaches.
>
>
> On Tue, Dec 14, 2010 at 1:21 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > Sean,
> >
> > Absolutely agree. There's no such thing as 1Gb HDFS block. The max i ever
> > saw is 128m but most commonly 64.
> >
> > Imagine for a second loading 1Gb records into a mapper. Assuming
> > sufficiently large amount of nodes, the math expectancy of the collocated
> > data in this case is approximately 32M, everything else comes from some
> > other node.
> >
> > So we start our job with moving ~97% of the data around just to make it
> > available to the mappers.
> >
> > There are two considerations in my case however that alleviated that
> > concern
> > after all at least in my particular case somewhat:
> >
> > 1) Not every vector is 1Gb. In fact, only 0.1% of vectors are perhaps
> > 100-150mb at most. But i still have to yield 50% of mapper RAM to the VW
> > just to make sure 0.1% of cases goes thru. Sp tje case i am making is not
> > for 1Gb vectors, i am saying that the problem is already quite
> detrimental
> > to algorithm running time at much smaller sizes.
> >
> > 2) Stochastic SVD is CPU bound. I did not actually run it with 1Gb
> vectors,
> > nor do i plan to. But if i did, i strongly suspect it is not I/O that
> would
> > be a bottleneck.
> >
> > On Mon, Dec 13, 2010 at 5:01 PM, Sean Owen <sr...@gmail.com> wrote:
> >
> > > This may be a late or ill-informed comment but --
> > >
> > > I don't believe there's any issue with VectorWritable per se, no.
> Hadoop
> > > most certainly assumes that one Writable can fit into RAM. A 1GB
> Writable
> > > is
> > > just completely incompatible with how Hadoop works. The algorithm would
> > > have
> > > to be parallelized more then.
> > >
> > > Yes that may mean re-writing the code to deal with small Vectors and
> > such.
> > > That probably doesn't imply a change to VectorWritable but to the jobs
> > > using
> > > it.
> > >
> > >
> >
>

Re: Sequential access to VectorWritable content proposal.

Posted by Sean Owen <sr...@gmail.com>.
Yah I have the same sort of problem with recommenders. Once in a while you
have a very, very popular item. In those cases I use some smart sampling to
simply throw out some data. For my use cases, that has virtually no effect
on the result so it's valid. Not sure if you can use the same approaches.
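A sketch of what such smart sampling could look like. This is a hypothetical illustration, not the recommender's actual code: interactions with an item are all kept up to a cap, and beyond the cap each further interaction is kept with probability cap/count, so only the very popular items get thinned:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical per-item down-sampling of (user, item) interaction pairs.
public class ItemSampler {
    // userItemPairs: each element is {userId, itemId}
    static List<int[]> sample(List<int[]> userItemPairs, int capPerItem, long seed) {
        Random rnd = new Random(seed);
        Map<Integer, Integer> countsSoFar = new HashMap<>();
        List<int[]> kept = new ArrayList<>();
        for (int[] pair : userItemPairs) {
            int n = countsSoFar.merge(pair[1], 1, Integer::sum);
            // rare items always pass; a popular item's later interactions
            // are kept with diminishing probability capPerItem / n
            if (n <= capPerItem || rnd.nextDouble() < (double) capPerItem / n) {
                kept.add(pair);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<int[]> pairs = new ArrayList<>();
        for (int u = 0; u < 10000; u++) pairs.add(new int[] {u, 42}); // one hugely popular item
        pairs.add(new int[] {0, 7});                                  // and one rare item
        List<int[]> kept = sample(pairs, 100, 1L);
        System.out.println(kept.size() + " of " + pairs.size() + " kept");
    }
}
```

Whether throwing data away is acceptable depends on the algorithm, as the surrounding discussion notes.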


On Tue, Dec 14, 2010 at 1:21 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Sean,
>
> Absolutely agree. There's no such thing as 1Gb HDFS block. The max i ever
> saw is 128m but most commonly 64.
>
> Imagine for a second loading 1Gb records into a mapper. Assuming
> sufficiently large amount of nodes, the math expectancy of the collocated
> data in this case is approximately 32M, everything else comes from some
> other node.
>
> So we start our job with moving ~97% of the data around just to make it
> available to the mappers.
>
> There are two considerations in my case however that alleviated that
> concern
> after all at least in my particular case somewhat:
>
> 1) Not every vector is 1Gb. In fact, only 0.1% of vectors are perhaps
> 100-150mb at most. But i still have to yield 50% of mapper RAM to the VW
> just to make sure 0.1% of cases goes thru. Sp tje case i am making is not
> for 1Gb vectors, i am saying that the problem is already quite detrimental
> to algorithm running time at much smaller sizes.
>
> 2) Stochastic SVD is CPU bound. I did not actually run it with 1Gb vectors,
> nor do i plan to. But if i did, i strongly suspect it is not I/O that would
> be a bottleneck.
>
> On Mon, Dec 13, 2010 at 5:01 PM, Sean Owen <sr...@gmail.com> wrote:
>
> > This may be a late or ill-informed comment but --
> >
> > I don't believe there's any issue with VectorWritable per se, no. Hadoop
> > most certainly assumes that one Writable can fit into RAM. A 1GB Writable
> > is
> > just completely incompatible with how Hadoop works. The algorithm would
> > have
> > to be parallelized more then.
> >
> > Yes that may mean re-writing the code to deal with small Vectors and
> such.
> > That probably doesn't imply a change to VectorWritable but to the jobs
> > using
> > it.
> >
> >
>

Re: Sequential access to VectorWritable content proposal.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Sean,

Absolutely agree. There's no such thing as a 1GB HDFS block. The max I ever
saw is 128MB, but most commonly 64MB.

Imagine for a second loading 1GB records into a mapper. Assuming a
sufficiently large number of nodes, the expected amount of collocated
data in this case is approximately 32MB; everything else comes from some
other node.

So we start our job by moving ~97% of the data around just to make it
available to the mappers.
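A quick back-of-the-envelope check of those numbers, under the stated assumptions (64MB HDFS blocks, a 1GB record whose start falls uniformly within the mapper's local block, so about half a block of it is local in expectation):

```java
// Locality arithmetic for a record much larger than one HDFS block.
public class LocalityMath {
    static double remoteFraction(double recordMb, double blockMb) {
        // record start uniform within the local block => ~half the block is local
        double expectedLocalMb = blockMb / 2.0;
        return (recordMb - expectedLocalMb) / recordMb;
    }

    public static void main(String[] args) {
        // 1GB record, 64MB blocks: (1024 - 32) / 1024
        System.out.printf("%.1f%%%n", 100 * remoteFraction(1024, 64)); // 96.9%
    }
}
```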

There are two considerations in my case, however, that alleviated that concern
somewhat, at least in my particular case:

1) Not every vector is 1GB. In fact, only 0.1% of vectors are perhaps
100-150MB at most. But I still have to yield 50% of mapper RAM to the VW
just to make sure that 0.1% of cases goes through. So the case I am making is
not for 1GB vectors; I am saying that the problem is already quite detrimental
to algorithm running time at much smaller sizes.

2) Stochastic SVD is CPU bound. I did not actually run it with 1GB vectors,
nor do I plan to. But if I did, I strongly suspect it is not I/O that would
be the bottleneck.

On Mon, Dec 13, 2010 at 5:01 PM, Sean Owen <sr...@gmail.com> wrote:

> This may be a late or ill-informed comment but --
>
> I don't believe there's any issue with VectorWritable per se, no. Hadoop
> most certainly assumes that one Writable can fit into RAM. A 1GB Writable
> is
> just completely incompatible with how Hadoop works. The algorithm would
> have
> to be parallelized more then.
>
> Yes that may mean re-writing the code to deal with small Vectors and such.
> That probably doesn't imply a change to VectorWritable but to the jobs
> using
> it.
>
>

Re: Sequential access to VectorWritable content proposal.

Posted by Sean Owen <sr...@gmail.com>.
This may be a late or ill-informed comment, but --

I don't believe there's any issue with VectorWritable per se, no. Hadoop
most certainly assumes that one Writable can fit into RAM. A 1GB Writable is
just completely incompatible with how Hadoop works. The algorithm would have
to be parallelized further, then.

Yes, that may mean rewriting the code to deal with small Vectors and such.
That probably doesn't imply a change to VectorWritable but to the jobs using
it.

On Tue, Dec 14, 2010 at 12:54 AM, Ted Dunning <te...@gmail.com> wrote:

> We should talk about whether it isn't better to just serialize chunks of
> vectors instead.
>
> This would not require fancy footwork and perversion of the contracts that
> WritableVector thinks it has now.  It would also be bound to be a better
> match for the problem domain.
>
> So what about just using VW as it stands for vectors that are at most 100K
> elements and are keyed by row and left-most column?  Why isn't that the
> better option?
>
>

Re: Sequential access to VectorWritable content proposal.

Posted by Ted Dunning <te...@gmail.com>.
We should talk about whether it isn't better to just serialize chunks of
vectors instead.

This would not require fancy footwork and perversion of the contracts that
WritableVector thinks it has now.  It would also be bound to be a better
match for the problem domain.

So what about just using VW as it stands for vectors that are at most 100K
elements and are keyed by row and left-most column?  Why isn't that the
better option?

On Mon, Dec 13, 2010 at 4:41 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Thank you, Jake.
>
> Yes i will do that tonight. I did do a brief assessment of that code before
> but not very thoroughly though which was why i thought it must be a low
> hanging fruit; but Ted and you certainly should know better so  i need to
> look at it the second time to be sure. Thank you, sir.
>
> -Dima
>
> On Mon, Dec 13, 2010 at 4:37 PM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> > Check the source for VectorWritable, I'm pretty sure it serializes
> > in the order of the nonDefaultIterator(), which for SASVectors is in
> order,
> > so while these are indeed non-optimal for random access and mutating
> > operations, that is indeed the tradeoff you have to make when picking
> > your vector impl.
> >
> >  -jake
> >
> > On Mon, Dec 13, 2010 at 4:30 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> >
> > > Yes, it should be. I thought Ted implied VectorWritable does it only
> this
> > > way and non other.
> > >
> > > If we can differentiate I'd rather do it. Implying that if you save in
> > one
> > > format (non-sequential) we'd support it with caveat that it's subpar in
> > > certain cases whereas where you want to format input sequentially, we'd
> > > eliminate vector prebuffering stage. Yes, that will work. Thank you,
> > Jake.
> > >
> > > -d
> > >
> > >
> > > On Mon, Dec 13, 2010 at 4:26 PM, Jake Mannix <ja...@gmail.com>
> > > wrote:
> > >
> > > > Dmitriy,
> > > >
> > > >  You should be able to specify that your matrices be stored in
> > > > SequentialAccessSparseVector format if you need to.  This is
> > > > almost always the right thing for HDFS-backed matrices, because
> > > > HDFS is write-once, and SASVectors are optimized for read-only
> > > > sequential access, which is your exact use case, right?
> > > >
> > > >  -jake
> > > >
> > > > On Mon, Dec 13, 2010 at 4:21 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >
> > > > wrote:
> > > >
> > > > > I don't think sequentiality is a requirement in the case i am
> working
> > > on.
> > > > > However, let me peek at the code first. I am guessing it is some
> form
> > > of
> > > > a
> > > > > near-perfect hash, in which case it may not be possible to read it
> in
> > > > parts
> > > > > at all. Which would be bad, indeed. I would need to find a
> completely
> > > > > alternative input format then to overcome my case.
> > > > >
> > > > > On Mon, Dec 13, 2010 at 4:01 PM, Ted Dunning <
> ted.dunning@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > I don't thikn that sequentiality part of the contract.
> > > > > >  RandomAccessSparseVectors are likely to
> > > > > > produce disordered values when serialized, I think.
> > > > > >
> > > > > > On Mon, Dec 13, 2010 at 1:48 PM, Dmitriy Lyubimov <
> > dlieu.7@gmail.com
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > I will have to look at details of VectorWritable to make sure
> all
> > > > cases
> > > > > > are
> > > > > > > covered (I only took a very brief look so far). But as long as
> it
> > > is
> > > > > able
> > > > > > > to
> > > > > > > produce elements in order of index increase, push technique
> will
> > > > > > certainly
> > > > > > > work for most algorithms (and in some cases, notably with SSVD,
> > > even
> > > > if
> > > > > > it
> > > > > > > produces the data in non-sequential way, it would work too ) .
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Sequential access to VectorWritable content proposal.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Thank you, Jake.

Yes, I will do that tonight. I did a brief assessment of that code before,
though not very thoroughly, which is why I thought it must be low-hanging
fruit; but you and Ted certainly know it better, so I need to look at it a
second time to be sure. Thank you, sir.

-Dima

On Mon, Dec 13, 2010 at 4:37 PM, Jake Mannix <ja...@gmail.com> wrote:

> Check the source for VectorWritable, I'm pretty sure it serializes
> in the order of the nonDefaultIterator(), which for SASVectors is in order,
> so while these are indeed non-optimal for random access and mutating
> operations, that is indeed the tradeoff you have to make when picking
> your vector impl.
>
>  -jake
>
> On Mon, Dec 13, 2010 at 4:30 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > Yes, it should be. I thought Ted implied VectorWritable does it only this
> > way and non other.
> >
> > If we can differentiate I'd rather do it. Implying that if you save in
> one
> > format (non-sequential) we'd support it with caveat that it's subpar in
> > certain cases whereas where you want to format input sequentially, we'd
> > eliminate vector prebuffering stage. Yes, that will work. Thank you,
> Jake.
> >
> > -d
> >
> >
> > On Mon, Dec 13, 2010 at 4:26 PM, Jake Mannix <ja...@gmail.com>
> > wrote:
> >
> > > Dmitriy,
> > >
> > >  You should be able to specify that your matrices be stored in
> > > SequentialAccessSparseVector format if you need to.  This is
> > > almost always the right thing for HDFS-backed matrices, because
> > > HDFS is write-once, and SASVectors are optimized for read-only
> > > sequential access, which is your exact use case, right?
> > >
> > >  -jake
> > >
> > > On Mon, Dec 13, 2010 at 4:21 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > > wrote:
> > >
> > > > I don't think sequentiality is a requirement in the case i am working
> > on.
> > > > However, let me peek at the code first. I am guessing it is some form
> > of
> > > a
> > > > near-perfect hash, in which case it may not be possible to read it in
> > > parts
> > > > at all. Which would be bad, indeed. I would need to find a completely
> > > > alternative input format then to overcome my case.
> > > >
> > > > On Mon, Dec 13, 2010 at 4:01 PM, Ted Dunning <te...@gmail.com>
> > > > wrote:
> > > >
> > > > > I don't thikn that sequentiality part of the contract.
> > > > >  RandomAccessSparseVectors are likely to
> > > > > produce disordered values when serialized, I think.
> > > > >
> > > > > On Mon, Dec 13, 2010 at 1:48 PM, Dmitriy Lyubimov <
> dlieu.7@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > I will have to look at details of VectorWritable to make sure all
> > > cases
> > > > > are
> > > > > > covered (I only took a very brief look so far). But as long as it
> > is
> > > > able
> > > > > > to
> > > > > > produce elements in order of index increase, push technique will
> > > > > certainly
> > > > > > work for most algorithms (and in some cases, notably with SSVD,
> > even
> > > if
> > > > > it
> > > > > > produces the data in non-sequential way, it would work too ) .
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Sequential access to VectorWritable content proposal.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Jake, yes, I just looked at the VW code again. I don't see any problem
equipping it with an element handler, even when it's a sparse random-access
vector.



So, driven by practical delivery goals, I'll have to push on with hacking
VectorWritable for now. It's not clear whether this will significantly
reduce SSVD running time, but I think it will help me get through the
current RAM-starvation nuisances. I remain fairly unconvinced that there has
been a really strong argument against providing some sort of vector
preprocessing capability so far, or that the push-preprocessor technique is
somehow inelegant (or should I say ugly?) in itself. It's a pattern that is
well understood and widely used in the community, especially as long as the
javadoc is very explicit about it. I gather there's little interest in this
sort of memory-allocation improvement, so I'll keep this hack private until
some sort of spliced-record or blocked matrix format matures. The existing
patch is certainly very usable as is, especially if one can afford some
extra memory in child processes.
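To make the push ("watcher") technique under discussion concrete, here is a minimal sketch. The interface name VectorPreprocessor and its callbacks are assumptions based on this thread's description, not the actual patch; streamDot stands in for the per-element loop that would live inside readFields().

```java
// Minimal sketch of the push ("watcher") preprocessing idea: the
// deserialization loop feeds elements to a handler one at a time, so the
// consumer (e.g. a dot-product accumulator in SSVD) never needs the whole
// vector in memory. Interface and class names are illustrative only.
public class PushPreprocessing {

    // Hypothetical handler that deserialization code would call per element.
    interface VectorPreprocessor {
        void begin(int vectorSize);
        void onElement(int index, double value);
        void end();
    }

    // Example preprocessor: accumulates a dot product against a fixed
    // vector without ever materializing the streamed vector.
    static class DotAccumulator implements VectorPreprocessor {
        private final double[] other;
        double result;
        DotAccumulator(double[] other) { this.other = other; }
        public void begin(int vectorSize) { result = 0; }
        public void onElement(int index, double value) {
            result += value * other[index];
        }
        public void end() { }
    }

    // Stand-in for the deserialization loop inside readFields():
    // push each (index, value) pair to the preprocessor as it is read.
    static double streamDot(double[] streamed, double[] other) {
        DotAccumulator acc = new DotAccumulator(other);
        acc.begin(streamed.length);
        for (int i = 0; i < streamed.length; i++) {
            acc.onElement(i, streamed[i]);
        }
        acc.end();
        return acc.result;
    }
}
```

The point is that the accumulator's state is O(1) per streamed vector, regardless of how long the vector is.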


I fully support the notion that eventually some sort of blocking format is
most promising for 1GB vectors, but I think for my purposes I can get away
with just this and some MR optimizations for split sizes and associated IO.
But the more I think about it, the more convinced I become that SSVD on wide
matrices is preferable to SSVD on tall matrices, as it would reduce the
amount of data circulating through the shuffle-and-sort circuitry as well as
the number of blocks in the SSVD computations. So it makes wider matrices
somewhat more scalable. Although I don't think that a billion-row tall
matrix is a problem for anything but CPU in the context of SSVD either.


thanks.
-Dmitriy

On Mon, Dec 13, 2010 at 4:37 PM, Jake Mannix <ja...@gmail.com> wrote:

> Check the source for VectorWritable, I'm pretty sure it serializes
> in the order of the nonDefaultIterator(), which for SASVectors is in order,
> so while these are indeed non-optimal for random access and mutating
> operations, that is indeed the tradeoff you have to make when picking
> your vector impl.
>
>  -jake
>
> On Mon, Dec 13, 2010 at 4:30 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > Yes, it should be. I thought Ted implied VectorWritable does it only this
> > way and non other.
> >
> > If we can differentiate I'd rather do it. Implying that if you save in
> one
> > format (non-sequential) we'd support it with caveat that it's subpar in
> > certain cases whereas where you want to format input sequentially, we'd
> > eliminate vector prebuffering stage. Yes, that will work. Thank you,
> Jake.
> >
> > -d
> >
> >
> > On Mon, Dec 13, 2010 at 4:26 PM, Jake Mannix <ja...@gmail.com>
> > wrote:
> >
> > > Dmitriy,
> > >
> > >  You should be able to specify that your matrices be stored in
> > > SequentialAccessSparseVector format if you need to.  This is
> > > almost always the right thing for HDFS-backed matrices, because
> > > HDFS is write-once, and SASVectors are optimized for read-only
> > > sequential access, which is your exact use case, right?
> > >
> > >  -jake
> > >
> > > On Mon, Dec 13, 2010 at 4:21 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > > wrote:
> > >
> > > > I don't think sequentiality is a requirement in the case i am working
> > on.
> > > > However, let me peek at the code first. I am guessing it is some form
> > of
> > > a
> > > > near-perfect hash, in which case it may not be possible to read it in
> > > parts
> > > > at all. Which would be bad, indeed. I would need to find a completely
> > > > alternative input format then to overcome my case.
> > > >
> > > > On Mon, Dec 13, 2010 at 4:01 PM, Ted Dunning <te...@gmail.com>
> > > > wrote:
> > > >
> > > > > I don't thikn that sequentiality part of the contract.
> > > > >  RandomAccessSparseVectors are likely to
> > > > > produce disordered values when serialized, I think.
> > > > >
> > > > > On Mon, Dec 13, 2010 at 1:48 PM, Dmitriy Lyubimov <
> dlieu.7@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > I will have to look at details of VectorWritable to make sure all
> > > cases
> > > > > are
> > > > > > covered (I only took a very brief look so far). But as long as it
> > is
> > > > able
> > > > > > to
> > > > > > produce elements in order of index increase, push technique will
> > > > > certainly
> > > > > > work for most algorithms (and in some cases, notably with SSVD,
> > even
> > > if
> > > > > it
> > > > > > produces the data in non-sequential way, it would work too ) .
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Sequential access to VectorWritable content proposal.

Posted by Jake Mannix <ja...@gmail.com>.
Check the source for VectorWritable, I'm pretty sure it serializes
in the order of the nonDefaultIterator(), which for SASVectors is in order,
so while these are indeed non-optimal for random access and mutating
operations, that is indeed the tradeoff you have to make when picking
your vector impl.

  -jake
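A toy illustration of the ordering point above, using TreeMap and HashMap as stand-ins for the index-ordered storage of SequentialAccessSparseVector and the hash-based storage of RandomAccessSparseVector (the actual Mahout classes use parallel arrays and an open hash map, respectively; this is only an analogy):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy analogy for the ordering tradeoff: a TreeMap iterates its keys in
// sorted order (like SequentialAccessSparseVector's index-ordered storage),
// while a HashMap's iteration order makes no ordering promise (like a
// hash-backed RandomAccessSparseVector serialized element-by-element).
public class SparseOrderDemo {

    // Render the indexes in the order the map's iterator yields them.
    static String indexesInIterationOrder(Map<Integer, Double> m) {
        StringBuilder sb = new StringBuilder();
        for (Integer index : m.keySet()) {
            if (sb.length() > 0) sb.append(',');
            sb.append(index);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Double> ordered = new TreeMap<>();
        Map<Integer, Double> hashed = new HashMap<>();
        for (int i : new int[]{500, 3, 42}) {
            ordered.put(i, 1.0);
            hashed.put(i, 1.0);
        }
        // The TreeMap always iterates in index order: 3,42,500.
        System.out.println(indexesInIterationOrder(ordered));
        // The HashMap's order is an implementation detail.
        System.out.println(indexesInIterationOrder(hashed));
    }
}
```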

On Mon, Dec 13, 2010 at 4:30 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Yes, it should be. I thought Ted implied VectorWritable does it only this
> way and non other.
>
> If we can differentiate I'd rather do it. Implying that if you save in one
> format (non-sequential) we'd support it with caveat that it's subpar in
> certain cases whereas where you want to format input sequentially, we'd
> eliminate vector prebuffering stage. Yes, that will work. Thank you, Jake.
>
> -d
>
>
> On Mon, Dec 13, 2010 at 4:26 PM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> > Dmitriy,
> >
> >  You should be able to specify that your matrices be stored in
> > SequentialAccessSparseVector format if you need to.  This is
> > almost always the right thing for HDFS-backed matrices, because
> > HDFS is write-once, and SASVectors are optimized for read-only
> > sequential access, which is your exact use case, right?
> >
> >  -jake
> >
> > On Mon, Dec 13, 2010 at 4:21 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> >
> > > I don't think sequentiality is a requirement in the case i am working
> on.
> > > However, let me peek at the code first. I am guessing it is some form
> of
> > a
> > > near-perfect hash, in which case it may not be possible to read it in
> > parts
> > > at all. Which would be bad, indeed. I would need to find a completely
> > > alternative input format then to overcome my case.
> > >
> > > On Mon, Dec 13, 2010 at 4:01 PM, Ted Dunning <te...@gmail.com>
> > > wrote:
> > >
> > > > I don't thikn that sequentiality part of the contract.
> > > >  RandomAccessSparseVectors are likely to
> > > > produce disordered values when serialized, I think.
> > > >
> > > > On Mon, Dec 13, 2010 at 1:48 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >
> > > > wrote:
> > > >
> > > > > I will have to look at details of VectorWritable to make sure all
> > cases
> > > > are
> > > > > covered (I only took a very brief look so far). But as long as it
> is
> > > able
> > > > > to
> > > > > produce elements in order of index increase, push technique will
> > > > certainly
> > > > > work for most algorithms (and in some cases, notably with SSVD,
> even
> > if
> > > > it
> > > > > produces the data in non-sequential way, it would work too ) .
> > > > >
> > > >
> > >
> >
>

Re: Sequential access to VectorWritable content proposal.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Yes, it should be. I thought Ted implied VectorWritable does it only this
way and no other.

If we can differentiate, I'd rather do it: if you save in one format
(non-sequential), we'd support it with the caveat that it's subpar in
certain cases, whereas if you format the input sequentially, we'd eliminate
the vector prebuffering stage. Yes, that will work. Thank you, Jake.

-d


On Mon, Dec 13, 2010 at 4:26 PM, Jake Mannix <ja...@gmail.com> wrote:

> Dmitriy,
>
>  You should be able to specify that your matrices be stored in
> SequentialAccessSparseVector format if you need to.  This is
> almost always the right thing for HDFS-backed matrices, because
> HDFS is write-once, and SASVectors are optimized for read-only
> sequential access, which is your exact use case, right?
>
>  -jake
>
> On Mon, Dec 13, 2010 at 4:21 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > I don't think sequentiality is a requirement in the case i am working on.
> > However, let me peek at the code first. I am guessing it is some form of
> a
> > near-perfect hash, in which case it may not be possible to read it in
> parts
> > at all. Which would be bad, indeed. I would need to find a completely
> > alternative input format then to overcome my case.
> >
> > On Mon, Dec 13, 2010 at 4:01 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > I don't thikn that sequentiality part of the contract.
> > >  RandomAccessSparseVectors are likely to
> > > produce disordered values when serialized, I think.
> > >
> > > On Mon, Dec 13, 2010 at 1:48 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > > wrote:
> > >
> > > > I will have to look at details of VectorWritable to make sure all
> cases
> > > are
> > > > covered (I only took a very brief look so far). But as long as it is
> > able
> > > > to
> > > > produce elements in order of index increase, push technique will
> > > certainly
> > > > work for most algorithms (and in some cases, notably with SSVD, even
> if
> > > it
> > > > produces the data in non-sequential way, it would work too ) .
> > > >
> > >
> >
>

Re: Sequential access to VectorWritable content proposal.

Posted by Jake Mannix <ja...@gmail.com>.
Dmitriy,

  You should be able to specify that your matrices be stored in
SequentialAccessSparseVector format if you need to.  This is
almost always the right thing for HDFS-backed matrices, because
HDFS is write-once, and SASVectors are optimized for read-only
sequential access, which is your exact use case, right?

  -jake

On Mon, Dec 13, 2010 at 4:21 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> I don't think sequentiality is a requirement in the case i am working on.
> However, let me peek at the code first. I am guessing it is some form of a
> near-perfect hash, in which case it may not be possible to read it in parts
> at all. Which would be bad, indeed. I would need to find a completely
> alternative input format then to overcome my case.
>
> On Mon, Dec 13, 2010 at 4:01 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > I don't thikn that sequentiality part of the contract.
> >  RandomAccessSparseVectors are likely to
> > produce disordered values when serialized, I think.
> >
> > On Mon, Dec 13, 2010 at 1:48 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> >
> > > I will have to look at details of VectorWritable to make sure all cases
> > are
> > > covered (I only took a very brief look so far). But as long as it is
> able
> > > to
> > > produce elements in order of index increase, push technique will
> > certainly
> > > work for most algorithms (and in some cases, notably with SSVD, even if
> > it
> > > produces the data in non-sequential way, it would work too ) .
> > >
> >
>

Re: Sequential access to VectorWritable content proposal.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I don't think sequentiality is a requirement in the case I am working on.
However, let me peek at the code first. I am guessing it is some form of a
near-perfect hash, in which case it may not be possible to read it in parts
at all. Which would be bad, indeed. I would then need to find an entirely
different input format to handle my case.

On Mon, Dec 13, 2010 at 4:01 PM, Ted Dunning <te...@gmail.com> wrote:

> I don't thikn that sequentiality part of the contract.
>  RandomAccessSparseVectors are likely to
> produce disordered values when serialized, I think.
>
> On Mon, Dec 13, 2010 at 1:48 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > I will have to look at details of VectorWritable to make sure all cases
> are
> > covered (I only took a very brief look so far). But as long as it is able
> > to
> > produce elements in order of index increase, push technique will
> certainly
> > work for most algorithms (and in some cases, notably with SSVD, even if
> it
> > produces the data in non-sequential way, it would work too ) .
> >
>

Re: Sequential access to VectorWritable content proposal.

Posted by Ted Dunning <te...@gmail.com>.
I don't think that sequentiality is part of the contract.
RandomAccessSparseVectors are likely to produce disordered values when
serialized, I think.

On Mon, Dec 13, 2010 at 1:48 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> I will have to look at details of VectorWritable to make sure all cases are
> covered (I only took a very brief look so far). But as long as it is able
> to
> produce elements in order of index increase, push technique will certainly
> work for most algorithms (and in some cases, notably with SSVD, even if it
> produces the data in non-sequential way, it would work too ) .
>

Re: Sequential access to VectorWritable content proposal.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Ted,

I suspect there will still be problems if you have a pull ("Iterator")
technique but not a push ("Watcher") technique with MR.

I already thought a little bit about that.

The problem here is that one, fundamentally, wants to do some input
preprocessing, whatever that is, (in case of SSVD it's sequential matrix
multiplication or dot-product accumulation), during execution of
Writable.readFields(), thus rendering preallocation of memory for the entire
vector unnecessary.

The framework will not make the Writable available to the mapper until the
input format's record reader is done reading it. Hence I think one can't
request an iterator until after the record has been fully read. So, like you
said, Writable is indeed not really compatible with an Iterator, but it is
compatible with push-parsing. That is part of what I am saying: Writable
imposes fundamental constraints on how you can do any sequential data
preprocessing (either dense or sparse sequential data) in a map.

I will have to look at the details of VectorWritable to make sure all cases
are covered (I have only taken a very brief look so far). But as long as it
is able to produce elements in order of increasing index, the push technique
will certainly work for most algorithms (and in some cases, notably with
SSVD, it would work even if it produces the data in a non-sequential way).

BTW, the absence of a sequential data access requirement in the stochastic
SVD technique is one of the reasons why, e.g., a blocked format would also
work with SSVD.





On Mon, Dec 13, 2010 at 12:52 PM, Ted Dunning <te...@gmail.com> wrote:

> OK.
>
> Let's assume that this is needed.
>
> I think that an iterable interface on VectorWritable that throws
> UnsupportedOperationException or similar if
> you try to get the iterator twice is much more transparent than a watcher
> structure and much easier for a user
> to discover/re-invent.
>
> Another (evil) thought is a parallel class to VectorWritable which is
> essentially SequentialAccessVectorWritable that supports reading and
> writing.  It seems to me that the Writable isn't real compatible with this
> interface in any case.  How will that be resolved?
>
>
> On Mon, Dec 13, 2010 at 11:36 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >wrote:
>
> > Absent of this solution, i realistically don't see how i can go without a
> > push technique in accessing the vectors.
> >
>

Re: Sequential access to VectorWritable content proposal.

Posted by Ted Dunning <te...@gmail.com>.
OK.

Let's assume that this is needed.

I think that an iterable interface on VectorWritable that throws
UnsupportedOperationException or similar if
you try to get the iterator twice is much more transparent than a watcher
structure and much easier for a user
to discover/re-invent.

Another (evil) thought is a parallel class to VectorWritable which is
essentially SequentialAccessVectorWritable that supports reading and
writing.  It seems to me that the Writable isn't really compatible with this
interface in any case.  How will that be resolved?
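A sketch of the one-shot Iterable described here, which fails loudly on a second iterator() request. The class name OneShotIterable is hypothetical; this is not existing Mahout code.

```java
import java.util.Iterator;

// Sketch of the suggestion above: an Iterable that hands out its iterator
// exactly once, making the single-pass streaming contract explicit.
public class OneShotIterable<T> implements Iterable<T> {

    private Iterator<T> iterator; // consumed on first request

    public OneShotIterable(Iterator<T> iterator) {
        this.iterator = iterator;
    }

    @Override
    public Iterator<T> iterator() {
        if (iterator == null) {
            throw new UnsupportedOperationException(
                "Elements of this vector may only be iterated once");
        }
        Iterator<T> it = iterator;
        iterator = null; // subsequent calls fail loudly
        return it;
    }
}
```

A caller streams the elements in a single for-each pass; any second pass surfaces as an UnsupportedOperationException rather than silently returning nothing.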


On Mon, Dec 13, 2010 at 11:36 AM, Dmitriy Lyubimov <dl...@gmail.com>wrote:

> Absent of this solution, i realistically don't see how i can go without a
> push technique in accessing the vectors.
>

Re: Sequential access to VectorWritable content proposal.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
PS. Just another detail.

Most of the time I actually don't run into long vectors. But occasionally,
among millions of them, I may run into a 10-20 or even 50 million-long one.
Which means that not only do I have to adjust algorithm settings to allow
400M for prebuffering (which is all I have), I actually have to allow for
the worst case, i.e. what happens in only 0.1% of the cases.

I can't accept the answer "use -Xmx1G" for map processes, for reasons that I
cannot change.

Absent this solution, I realistically don't see how I can do without a push
technique for accessing the vectors. You are welcome to suggest an
alternative solution to this conundrum, but I see nothing in the current
code that would fix it without introducing some (very small amount of) new
code.



On Mon, Dec 13, 2010 at 11:19 AM, Dmitriy Lyubimov <dl...@gmail.com>wrote:

> Ted,
>
> like i commented in the issue, 100m dense data elements means ~800M to load
> a vector. If (suppose) i have 1G in a mapper (which is not terribly common,
> not in our case, we only run jobs with -500M), then require 80% of RAM for
> prebuffering (i.e. just being able to get access to the data) and leave 20%
> for algorithm? Setting aside other considerations, doesn't this strike you
> as being off-balance in requirements ? it does to me and unfortunately  in
> my practical application with the hardware constraints i got i won't be able
> to afford that luxury. So it is a problem for my task for the moment,
> believe it or not. At this moment even 100Mb for prebuffering the data is
> prohibitively costly in my app.
>
> -d
>
>
> On Mon, Dec 13, 2010 at 11:12 AM, Ted Dunning <te...@gmail.com>wrote:
>
>> I really don't see this as a big deal even with crazy big vectors.
>>
>> Looking at web scale, for instance, the most linked wikipedia article only
>> has 10 million in-links or so.  On the web, the most massive web site is
>> unlikely to have >100 million in-links.  Both of these fit in very modest
>> amounts of memory.
>>
>> Where's the rub?
>>
>> On Mon, Dec 13, 2010 at 11:05 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
>> >wrote:
>>
>> > Jake,
>> > No i was trying exactly what you were proposing some time ago on the
>> list.
>> > I
>> > am trying to make long vectors not to occupy a lot of memory.
>> >
>> > E.g. a 1m-long dense vector would require 8Mb just to load it. And i am
>> > saying, hey, there's a lot of sequential techniques that can provide a
>> > hander that would inspect vector element-by-element without having to
>> > preallocate 8Mb.
>> >
>> > for 1 million-long vectors it doesn't scary too much but starts being so
>> > for
>> > default hadoop memory settings at the area of 50-100Mb (or 5-10 million
>> > non-zero elements). Stochastic SVD will survive that, but it means less
>> > memory for blocking, and the more blocks you have, the more CPU it
>> requires
>> > (although CPU demand is only linear to the number of blocks and only in
>> > signficantly smaller part of computation, so that only insigificant part
>> of
>> > total CPU flops depends on # of blocks, but there is part that does,
>> still.
>> > )
>> >
>> > Like i said, it also would address the case when rows don't fit in the
>> > memory (hence no memory bound for n of A) but the most immediate benefit
>> is
>> > to speed/ scalability/memory req of SSVD in most practical LSI cases.
>> >
>> > -Dmitriy
>> >
>> > On Mon, Dec 13, 2010 at 10:24 AM, Jake Mannix <ja...@gmail.com>
>> > wrote:
>> >
>> > > Hey Dmitriy,
>> > >
>> > >  I've also been playing around with a VectorWritable format which is
>> > backed
>> > > by a
>> > > SequenceFile, but I've been focussed on the case where it's
>> essentially
>> > the
>> > > entire
>> > > matrix, and the rows don't fit into memory.  This seems different than
>> > your
>> > > current
>> > > use case, however - you just want (relatively) small vectors to load
>> > > faster,
>> > > right?
>> > >
>> > >  -jake
>> > >
>> > > On Mon, Dec 13, 2010 at 10:18 AM, Ted Dunning <te...@gmail.com>
>> > > wrote:
>> > >
>> > > > Interesting idea.
>> > > >
>> > > > Would this introduce a new vector type that only allows iterating
>> > through
>> > > > the elements once?
>> > > >
>> > > > On Mon, Dec 13, 2010 at 9:49 AM, Dmitriy Lyubimov <
>> dlieu.7@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > I would like to submit a patch to VectorWritable that allows for
>> > > > streaming
>> > > > > access to vector elements without having to prebuffer all of them
>> > > first.
>> > > > > (current code allows for the latter only).
>> > > > >
>> > > > > That patch would allow to strike down one of the memory usage
>> issues
>> > in
>> > > > > current Stochastic SVD implementation and effectively open memory
>> > bound
>> > > > for
>> > > > > n of the SVD work. (The value i see is not to open up the the
>> bound
>> > > > though
>> > > > > but just be more efficient in memory use, thus essentially
>> speeding u
>> > p
>> > > > the
>> > > > > computation. )
>> > > > >
>> > > > > If it's ok, i would like to create a JIRA issue and provide a
>> patch
>> > for
>> > > > it.
>> > > > >
>> > > > > Another issue is to provide an SSVD patch that depends on that
>> patch
>> > > for
>> > > > > VectorWritable.
>> > > > >
>> > > > > Thank you.
>> > > > > -Dmitriy
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: Sequential access to VectorWritable content proposal.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Ted,

Like I commented in the issue, 100M dense data elements means ~800MB just to
load a vector. Suppose I have 1GB in a mapper (which is not terribly common;
in our case we only run jobs with ~500MB): should I then reserve 80% of RAM
for prebuffering (i.e. just being able to access the data) and leave 20% for
the algorithm? Setting aside other considerations, doesn't this strike you
as off-balance? It does to me, and unfortunately, with the hardware
constraints I have in my practical application, I won't be able to afford
that luxury. So it is a problem for my task at the moment, believe it or
not. At this moment even 100MB for prebuffering the data is prohibitively
costly in my app.
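The arithmetic behind these figures, as a quick sanity check (8 bytes per double, ignoring JVM object overhead, which only makes matters worse):

```java
// Back-of-the-envelope memory cost of prebuffering a dense vector at
// 8 bytes per double element (ignoring JVM object overhead).
public class VectorMemory {

    static long bytesForDense(long numElements) {
        return numElements * 8L;
    }

    static long mebibytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // 100M dense elements comes out to roughly 800MB, as quoted above.
        System.out.println(mebibytes(bytesForDense(100_000_000L)) + " MiB");
    }
}
```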

-d

On Mon, Dec 13, 2010 at 11:12 AM, Ted Dunning <te...@gmail.com> wrote:

> I really don't see this as a big deal even with crazy big vectors.
>
> Looking at web scale, for instance, the most linked wikipedia article only
> has 10 million in-links or so.  On the web, the most massive web site is

Re: Sequential access to VectorWritable content proposal.

Posted by Ted Dunning <te...@gmail.com>.
OK.  Let's take that as read, then.

On Mon, Dec 13, 2010 at 11:27 AM, Jake Mannix <ja...@gmail.com> wrote:

> Ted, there are some vectors where it certainly matters: the eigenvectors of
> a
> really big matrix are dense, and thus take O(8*num_vertices) bytes to hold
> just
> one of them in memory.  Doing something sequential with these can certainly
> make sense, and in some cases is actually necessary, esp. if done in the
> mappers or reducers where there is less memory than you usually have...
>
>  -jake
>

Re: Sequential access to VectorWritable content proposal.

Posted by Jake Mannix <ja...@gmail.com>.
Ted, there are some vectors where it certainly matters: the eigenvectors of
a really big matrix are dense, and thus take O(8*num_vertices) bytes to hold
just one of them in memory. Doing something sequential with these can
certainly make sense, and in some cases is actually necessary, especially if
done in the mappers or reducers, where there is less memory than you usually
have...

  -jake
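
To make the memory arithmetic concrete, here is a toy sketch (plain java.io
standing in for the real Mahout/Hadoop classes, names hypothetical): holding a
dense vector of n doubles costs 8*n bytes, while one sequential pass can
compute an aggregate such as the 2-norm in constant memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class StreamingNorm {

  /**
   * Compute the 2-norm of a dense vector of n doubles read off a stream.
   * Memory use is O(1) no matter how large n is, versus 8*n bytes needed
   * to materialize the vector first.
   */
  static double norm2(DataInput in, int n) throws IOException {
    double sumSq = 0.0;
    for (int i = 0; i < n; i++) {
      double v = in.readDouble();  // one element at a time
      sumSq += v * v;
    }
    return Math.sqrt(sumSq);
  }

  public static void main(String[] args) throws IOException {
    int n = 1_000_000;  // a 1M-long dense vector: 8 MB if materialized

    // Serialize n doubles (all 1.0) to simulate a stored dense row; in a
    // real job the DataInput would come from a SequenceFile on HDFS, not
    // an in-memory byte array.
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    for (int i = 0; i < n; i++) {
      out.writeDouble(1.0);
    }

    double norm = norm2(
        new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())), n);
    System.out.println(norm);  // prints 1000.0, i.e. sqrt(n)
  }
}
```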

On Mon, Dec 13, 2010 at 11:12 AM, Ted Dunning <te...@gmail.com> wrote:

> I really don't see this as a big deal even with crazy big vectors.
>
> Looking at web scale, for instance, the most linked wikipedia article only
> has 10 million in-links or so.  On the web, the most massive web site is
> unlikely to have >100 million in-links.  Both of these fit in very modest
> amounts of memory.
>
> Where's the rub?

Re: Sequential access to VectorWritable content proposal.

Posted by Ted Dunning <te...@gmail.com>.
I really don't see this as a big deal even with crazy big vectors.

Looking at web scale, for instance, the most-linked Wikipedia article only
has about 10 million in-links. On the web, even the most massive site is
unlikely to have more than 100 million in-links. Both of these fit in very
modest amounts of memory.

Where's the rub?

On Mon, Dec 13, 2010 at 11:05 AM, Dmitriy Lyubimov <dl...@gmail.com>wrote:

> Jake,
> No i was trying exactly what you were proposing some time ago on the list.
> I
> am trying to make long vectors not to occupy a lot of memory.
>
> E.g. a 1m-long dense vector would require 8Mb just to load it. And i am
> saying, hey, there's a lot of sequential techniques that can provide a
> hander that would inspect vector element-by-element without having to
> preallocate 8Mb.
>
> for 1 million-long vectors it doesn't scary too much but starts being so
> for
> default hadoop memory settings at the area of 50-100Mb (or 5-10 million
> non-zero elements). Stochastic SVD will survive that, but it means less
> memory for blocking, and the more blocks you have, the more CPU it requires
> (although CPU demand is only linear to the number of blocks and only in
> signficantly smaller part of computation, so that only insigificant part of
> total CPU flops depends on # of blocks, but there is part that does, still.
> )
>
> Like i said, it also would address the case when rows don't fit in the
> memory (hence no memory bound for n of A) but the most immediate benefit is
> to speed/ scalability/memory req of SSVD in most practical LSI cases.
>
> -Dmitriy

Re: Sequential access to VectorWritable content proposal.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Jake,
No, I was trying exactly what you were proposing some time ago on the list.
I am trying to make long vectors not occupy a lot of memory.

E.g., a 1M-long dense vector requires 8 MB just to load. And I am saying,
hey, there are a lot of sequential techniques that could instead be given a
handler that inspects the vector element by element, without having to
preallocate those 8 MB.

For 1-million-long vectors that isn't too scary, but it starts to be at
around 50-100 MB (or 5-10 million non-zero elements) under default Hadoop
memory settings. Stochastic SVD will survive that, but it means less memory
for blocking, and the more blocks you have, the more CPU it requires.
(CPU demand is only linear in the number of blocks, and only in a
significantly smaller part of the computation, so only an insignificant
share of the total CPU flops depends on the number of blocks; but that part
is still there.)

Like I said, it would also address the case when rows don't fit in memory
(hence no memory bound on n of A), but the most immediate benefit is to the
speed, scalability, and memory requirements of SSVD in most practical LSI
cases.

-Dmitriy
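
For concreteness, a minimal sketch of the handler-based read described above;
the names are hypothetical, and plain java.io stands in for the actual
VectorWritable serialization format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class StreamingVectorDemo {

  /** Hypothetical callback, invoked once per element as it is read off the wire. */
  interface ElementHandler {
    void onElement(int index, double value);
  }

  /** Write a sparse vector as an element count followed by index/value pairs. */
  static void writeSparse(DataOutput out, int[] idx, double[] val) throws IOException {
    out.writeInt(idx.length);
    for (int i = 0; i < idx.length; i++) {
      out.writeInt(idx[i]);
      out.writeDouble(val[i]);
    }
  }

  /** Stream elements to the handler without ever materializing a Vector. */
  static void readSparse(DataInput in, ElementHandler h) throws IOException {
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      // O(1) memory: each element is handed off and can be discarded.
      h.onElement(in.readInt(), in.readDouble());
    }
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    writeSparse(new DataOutputStream(bytes),
        new int[] {0, 3, 7}, new double[] {1.5, 2.5, 4.0});

    final double[] sum = {0.0};
    readSparse(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())),
        new ElementHandler() {
          public void onElement(int index, double value) {
            sum[0] += value;  // accumulate while streaming
          }
        });
    System.out.println(sum[0]);  // prints 8.0
  }
}
```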

On Mon, Dec 13, 2010 at 10:24 AM, Jake Mannix <ja...@gmail.com> wrote:

> Hey Dmitriy,
>
>  I've also been playing around with a VectorWritable format which is backed
> by a
> SequenceFile, but I've been focussed on the case where it's essentially the
> entire
> matrix, and the rows don't fit into memory.  This seems different than your
> current
> use case, however - you just want (relatively) small vectors to load
> faster,
> right?
>
>  -jake

Re: Sequential access to VectorWritable content proposal.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Which is exactly why I am saying I don't see this as the primary problem I
am trying to solve.

On Mon, Dec 13, 2010 at 11:13 AM, Ted Dunning <te...@gmail.com> wrote:

> http://en.wikipedia.org/wiki/Program_optimization#When_to_optimize

Re: Sequential access to VectorWritable content proposal.

Posted by Ted Dunning <te...@gmail.com>.
http://en.wikipedia.org/wiki/Program_optimization#When_to_optimize

On Mon, Dec 13, 2010 at 11:06 AM, Dmitriy Lyubimov <dl...@gmail.com>wrote:

> I also think it is not very common, but it would be a collateral benefit
>

Re: Sequential access to VectorWritable content proposal.

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I also think it is not very common, but it would be a collateral benefit.

On Mon, Dec 13, 2010 at 11:05 AM, Ted Dunning <te...@gmail.com> wrote:

> How common is it that a row won't fit in memory?  My experience is that
> essentially all rows that
> I am interested will fit in very modest amounts of memory, but that row by
> row handling is imperative.
>
> Is this just gilding the lily?
>

Re: Sequential access to VectorWritable content proposal.

Posted by Ted Dunning <te...@gmail.com>.
How common is it that a row won't fit in memory?  My experience is that
essentially all the rows I am interested in will fit in very modest amounts
of memory, but that row-by-row handling is imperative.

Is this just gilding the lily?

On Mon, Dec 13, 2010 at 10:24 AM, Jake Mannix <ja...@gmail.com> wrote:

> Hey Dmitriy,
>
>  I've also been playing around with a VectorWritable format which is backed
> by a
> SequenceFile, but I've been focussed on the case where it's essentially the
> entire
> matrix, and the rows don't fit into memory.  This seems different than your
> current
> use case, however - you just want (relatively) small vectors to load
> faster,
> right?
>
>  -jake

Re: Sequential access to VectorWritable content proposal.

Posted by Jake Mannix <ja...@gmail.com>.
Hey Dmitriy,

  I've also been playing around with a VectorWritable format which is backed
by a SequenceFile, but I've been focused on the case where it's essentially
the entire matrix and the rows don't fit into memory. This seems different
from your current use case, however - you just want (relatively) small
vectors to load faster, right?

  -jake

On Mon, Dec 13, 2010 at 10:18 AM, Ted Dunning <te...@gmail.com> wrote:

> Interesting idea.
>
> Would this introduce a new vector type that only allows iterating through
> the elements once?

Re: Sequential access to VectorWritable content proposal.

Posted by Ted Dunning <te...@gmail.com>.
Interesting idea.

Would this introduce a new vector type that only allows iterating through
the elements once?
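
One possible shape for such a type (a hypothetical sketch, not an actual or
proposed Mahout class): an Iterable that hands out its iterator exactly once
and fails fast on any second traversal.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** A vector-element sequence that can be traversed exactly once. */
public class OnePassElements implements Iterable<Double> {

  private final double[] source;   // stands in for a decoding stream
  private boolean consumed = false;

  public OnePassElements(double[] source) {
    this.source = source;
  }

  @Override
  public Iterator<Double> iterator() {
    if (consumed) {
      // The backing stream has already been drained; there is nothing to replay.
      throw new IllegalStateException("elements may only be iterated once");
    }
    consumed = true;
    return new Iterator<Double>() {
      private int pos = 0;

      @Override public boolean hasNext() {
        return pos < source.length;
      }

      @Override public Double next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        return source[pos++];
      }

      @Override public void remove() {
        throw new UnsupportedOperationException();
      }
    };
  }
}
```

A first for-each loop over it works normally; a second call to iterator()
throws IllegalStateException, which makes the one-pass contract explicit
instead of silently returning stale or empty data.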

On Mon, Dec 13, 2010 at 9:49 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Hi all,
>
> I would like to submit a patch to VectorWritable that allows for streaming
> access to vector elements without having to prebuffer all of them first
> (the current code allows only the latter).
>
> That patch would strike down one of the memory usage issues in the
> current Stochastic SVD implementation and effectively remove the memory
> bound on n for the SVD work. (The value I see is not so much in removing
> the bound as in being more efficient in memory use, thus essentially
> speeding up the computation.)
>
> If it's OK, I would like to create a JIRA issue and provide a patch for it.
>
> Another issue is to provide an SSVD patch that depends on that patch for
> VectorWritable.
>
> Thank you.
> -Dmitriy
>