You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Lance Norskog (JIRA)" <ji...@apache.org> on 2011/02/10 02:06:59 UTC

[jira] Commented: (MAHOUT-322) DistributedRowMatrix should live in SequenceFile instead of SequenceFile

    [ https://issues.apache.org/jira/browse/MAHOUT-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992818#comment-12992818 ] 

Lance Norskog commented on MAHOUT-322:
--------------------------------------

Would I use this to do graph logic on boolean matrices? If so, I would want a separate Writable for a "block of bits", that is part of a row of bits.

> DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable> instead of SequenceFile<IntWritable,VectorWritable>
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-322
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-322
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.3
>            Reporter: Danny Leshem
>            Assignee: Jake Mannix
>            Priority: Minor
>
> Class documentation for org.apache.mahout.math.hadoop.DistributedRowMatrix states that the matrix lives in SequenceFile<WritableComparable, VectorWritable>. Implementation, however, assumes SequenceFile<IntWritable, VectorWritable> is passed.
> Currently, usage of this class inside Mahout is limited to Jake Mannix's SVD package, mainly to perform PCA on a massive document corpus. Given such corpus, it makes sense to not limit the user by forcing the document "key" to be integer. Instead, users should be able to use Text keys (document name or id) or keys made of any other arbitrary class. One may even argue that forcing a WritableComparable key is too limiting, and a simple Writable key should be assumed.
> In fact, it would be best if DistributedRowMatrix did not read the SequenceFile key at all, to allow user-specific classes (unknown to Mahout) to be used as opaque keys even when their libraries are not available in runtime. Currently DistributedRowMatrix calls "reader.next(i, v)"... but reader has methods to query just the value, avoiding key deserialization altogether.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Re: [jira] Commented: (MAHOUT-322) DistributedRowMatrix should live in SequenceFile instead of SequenceFile

Posted by Ted Dunning <te...@gmail.com>.
For that you already have simple sequence files.

A DRM is always going to be at least DRM<Key extends WritableComparable,
Vector>

The questions are what can Key be and what kinds of Vector exist.

On Thu, Feb 10, 2011 at 1:08 AM, Lance Norskog <go...@gmail.com> wrote:

> Yup, me too. For elements it would be DRM<Writable,Writable> instead
> of DRM<Writable, Vector>.
>
> On 2/9/11, Ted Dunning <te...@gmail.com> wrote:
> > Ahh... I read it as elements.  As in adjacency matrix for a graph.
> >
> > On Wed, Feb 9, 2011 at 5:54 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> >
> >> Ted, booleans as elements or booleans as row labels? i read the issue
> >> as an issue of row labels, not what's packed inside VectorWritable?
> >>
> >> Thank you.
> >>
> >> On Wed, Feb 9, 2011 at 5:47 PM, Ted Dunning <te...@gmail.com>
> wrote:
> >> > DRM's of booleans would be a useful special case for a number of
> things,
> >> > especially if you backed it with a sparse boolean matrix and vector
> >> > implementation.
> >> >
> >>
> >
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: [jira] Commented: (MAHOUT-322) DistributedRowMatrix should live in SequenceFile instead of SequenceFile

Posted by Lance Norskog <go...@gmail.com>.
Yup, me too. For elements it would be DRM<Writable,Writable> instead
of DRM<Writable, Vector>.

On 2/9/11, Ted Dunning <te...@gmail.com> wrote:
> Ahh... I read it as elements.  As in adjacency matrix for a graph.
>
> On Wed, Feb 9, 2011 at 5:54 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> Ted, booleans as elements or booleans as row labels? i read the issue
>> as an issue of row labels, not what's packed inside VectorWritable?
>>
>> Thank you.
>>
>> On Wed, Feb 9, 2011 at 5:47 PM, Ted Dunning <te...@gmail.com> wrote:
>> > DRM's of booleans would be a useful special case for a number of things,
>> > especially if you backed it with a sparse boolean matrix and vector
>> > implementation.
>> >
>>
>


-- 
Lance Norskog
goksron@gmail.com

Re: [jira] Commented: (MAHOUT-322) DistributedRowMatrix should live in SequenceFile instead of SequenceFile

Posted by Ted Dunning <te...@gmail.com>.
Ahh... I read it as elements.  As in adjacency matrix for a graph.

On Wed, Feb 9, 2011 at 5:54 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Ted, booleans as elements or booleans as row labels? i read the issue
> as an issue of row labels, not what's packed inside VectorWritable?
>
> Thank you.
>
> On Wed, Feb 9, 2011 at 5:47 PM, Ted Dunning <te...@gmail.com> wrote:
> > DRM's of booleans would be a useful special case for a number of things,
> > especially if you backed it with a sparse boolean matrix and vector
> > implementation.
> >
>

Re: [jira] Commented: (MAHOUT-322) DistributedRowMatrix should live in SequenceFile instead of SequenceFile

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Ted, booleans as elements or booleans as row labels? i read the issue
as an issue of row labels, not what's packed inside VectorWritable?

Thank you.

On Wed, Feb 9, 2011 at 5:47 PM, Ted Dunning <te...@gmail.com> wrote:
> DRM's of booleans would be a useful special case for a number of things,
> especially if you backed it with a sparse boolean matrix and vector
> implementation.
>

Re: [jira] Commented: (MAHOUT-322) DistributedRowMatrix should live in SequenceFile instead of SequenceFile

Posted by Ted Dunning <te...@gmail.com>.
DRM's of booleans would be a useful special case for a number of things,
especially if you backed it with a sparse boolean matrix and vector
implementation.

One caveat is that sparse boolean implementations need a bit more cleverness
than you might expect.  Typically you need to change implementation
depending on the statistics of the data:

- for very sparse vectors, run-length encoding, possibly with small-integer
compression on the run-lengths is an excellent way to go for a serialized
form.  In memory, a sorted array of integer indexes of non-zeros is more
useful because it allows more efficient binary search.  Each non-zero bit
costs 4 bits in this representation which can hurt if you have many bits
near each other.t

- for clumpy, but overall sparse vectors, composite run-length and bitmap
representations are useful.  With good compression of the run-lengths, this
reduces to the first case for storage, but you often need a different
in-memory storage format for these.  One candidate is an array of 32 bit
combined bitmap offset and int count that refers to an array of 32 bit
bitmap segments.  In this format, a single bit can cost 8 bytes, but 5
clumped bits could have an amortized cost of just over a bytes.

- when sparsity varies, composite formats might be best.  This is
essentially an overlay of a traditional bitset and representation 1.  This
is best for long-tail Zipfian sorts of data where indexes are roughly
ordered by descending probability.


On Wed, Feb 9, 2011 at 5:06 PM, Lance Norskog (JIRA) <ji...@apache.org>wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992818#comment-12992818]
>
> Lance Norskog commented on MAHOUT-322:
> --------------------------------------
>
> Would I use this to do graph logic on boolean matrices? If so, I would want
> a separate Writable for a "block of bits", that is part of a row of bits.
>
> > DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable>
> instead of SequenceFile<IntWritable,VectorWritable>
> >
> -----------------------------------------------------------------------------------------------------------------------------
> >
> >                 Key: MAHOUT-322
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-322
> >             Project: Mahout
> >          Issue Type: Improvement
> >          Components: Math
> >    Affects Versions: 0.3
> >            Reporter: Danny Leshem
> >            Assignee: Jake Mannix
> >            Priority: Minor
> >
> > Class documentation for
> org.apache.mahout.math.hadoop.DistributedRowMatrix states that the matrix
> lives in SequenceFile<WritableComparable, VectorWritable>. Implementation,
> however, assumes SequenceFile<IntWritable, VectorWritable> is passed.
> > Currently, usage of this class inside Mahout is limited to Jake Mannix's
> SVD package, mainly to perform PCA on a massive document corpus. Given such
> corpus, it makes sense to not limit the user by forcing the document "key"
> to be integer. Instead, users should be able to use Text keys (document name
> or id) or keys made of any other arbitrary class. One may even argue that
> forcing a WritableComparable key is too limiting, and a simple Writable key
> should be assumed.
> > In fact, it would be best if DistributedRowMatrix did not read the
> SequenceFile key at all, to allow user-specific classes (unknown to Mahout)
> to be used as opaque keys even when their libraries are not available in
> runtime. Currently DistributedRowMatrix calls "reader.next(i, v)"... but
> reader has methods to query just the value, avoiding key deserialization
> altogether.
>
> --
> This message is automatically generated by JIRA.
> -
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
>