Posted to dev@mahout.apache.org by "Ted Dunning (JIRA)" <ji...@apache.org> on 2009/08/31 20:29:32 UTC

[jira] Created: (MAHOUT-168) Need integer compression routines

Need integer compression routines
---------------------------------

                 Key: MAHOUT-168
                 URL: https://issues.apache.org/jira/browse/MAHOUT-168
             Project: Mahout
          Issue Type: Improvement
          Components: Matrix
            Reporter: Ted Dunning


A selection of these algorithms would be very nice to have:

www.cs.rmit.edu.au/~jz/fulltext/compjour99.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Updated: (MAHOUT-168) Need integer compression routines

Posted by Sean Owen <sr...@gmail.com>.
0) Best thing is not to get to this point in the first place! It was
necessary from the outset to just let 100 things proceed and see what
sticks. Now I think we can gently move towards more focus. So I'd hope
someone doesn't make up a big patch without it being clear there's a
path to commit it quickly. And that's why some nominal owners are
going to be useful.

1) I'd say don't mothball stuff for a long time. I'm not really
touching anything that seems to have had any activity in 6 months.
That mitigates this a lot.

2) Finding an old issue is still possible in JIRA of course, but might
not be obvious. Old patches are probably not applicable anymore, so
the use may be somewhat limited. So maybe this too means it's not such
a big deal in practice. That is, I doubt we're actually going to have
the same work happen once, and fail, twice, and fail, etc. with nobody
remembering.

On Fri, Dec 11, 2009 at 9:36 PM, Ted Dunning <te...@gmail.com> wrote:
> I am also unclear on how to do this.
>
> Anybody have good suggestions?
>
> On Fri, Dec 11, 2009 at 1:33 PM, Jake Mannix <ja...@gmail.com> wrote:
>
>> I'm not sure of the right way to avoid redoing work again and again, yet
>> still avoid cluttering our codebase with a bunch of unsupported, unfinished
>> code.
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: [jira] Updated: (MAHOUT-168) Need integer compression routines

Posted by Ted Dunning <te...@gmail.com>.
I am also unclear on how to do this.

Anybody have good suggestions?

On Fri, Dec 11, 2009 at 1:33 PM, Jake Mannix <ja...@gmail.com> wrote:

> I'm not sure of the right way to avoid redoing work again and again, yet
> still avoid cluttering our codebase with a bunch of unsupported, unfinished
> code.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: [jira] Updated: (MAHOUT-168) Need integer compression routines

Posted by Jake Mannix <ja...@gmail.com>.
So I have a question about the whole "mothballing" process.  We don't
have infinite time, and there's a limited number of us, so I understand
wanting to keep some focus and not have half-finished work all over the
place.  But when we archive something as "mothballed", how will we ever find
it again if we get some time?  I mean, if I'm working on doing perf work on
vectors, incorporating Colt primitives, and all that, won't I want to reopen
this?  How will I find it?  By searching through JIRA tickets which are
marked "won't fix" or fix version "later"?

What is the process by which we keep track of these half-finished pieces of
work which we don't want to just lose?  Ted did some work which will be
helpful at some point, but what help will it be if it disappears?

I'm not sure of the right way to avoid redoing work again and again, yet
still avoid cluttering our codebase with a bunch of unsupported, unfinished
code.

  -jake

[jira] Updated: (MAHOUT-168) Need integer compression routines

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-168:
-----------------------------

       Resolution: Won't Fix
    Fix Version/s:     (was: 0.3)
           Status: Resolved  (was: Patch Available)

> Need integer compression routines
> ---------------------------------
>
>                 Key: MAHOUT-168
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-168
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.1
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>            Priority: Minor
>         Attachments: MAHOUT-168.patch
>
>
> A selection of these algorithms would be very nice to have:
> www.cs.rmit.edu.au/~jz/fulltext/compjour99.pdf



[jira] Updated: (MAHOUT-168) Need integer compression routines

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-168:
-----------------------------

             Priority: Minor  (was: Major)
    Affects Version/s: 0.1
        Fix Version/s: 0.3

Question: is this committable? It looks so. My only question is whether we have a use case for this and whether it supports the project goals.

> Need integer compression routines
> ---------------------------------
>
>                 Key: MAHOUT-168
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-168
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.1
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>            Priority: Minor
>             Fix For: 0.3
>
>         Attachments: MAHOUT-168.patch
>
>
> A selection of these algorithms would be very nice to have:
> www.cs.rmit.edu.au/~jz/fulltext/compjour99.pdf



[jira] Commented: (MAHOUT-168) Need integer compression routines

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749703#action_12749703 ] 

Grant Ingersoll commented on MAHOUT-168:
----------------------------------------

Can we leverage some of Lucene's capabilities here?  It doesn't have all the algorithms mentioned, but does have delta encodings.

> Need integer compression routines
> ---------------------------------
>
>                 Key: MAHOUT-168
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-168
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>            Reporter: Ted Dunning
>
> A selection of these algorithms would be very nice to have:
> www.cs.rmit.edu.au/~jz/fulltext/compjour99.pdf



[jira] Updated: (MAHOUT-168) Need integer compression routines

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning updated MAHOUT-168:
-------------------------------

    Status: Patch Available  (was: Open)

First version.  100% test coverage except for BitInputStream at 95%.

Needs review for API and obvious speed improvements.



> Need integer compression routines
> ---------------------------------
>
>                 Key: MAHOUT-168
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-168
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>
> A selection of these algorithms would be very nice to have:
> www.cs.rmit.edu.au/~jz/fulltext/compjour99.pdf



[jira] Commented: (MAHOUT-168) Need integer compression routines

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789539#action_12789539 ] 

Ted Dunning commented on MAHOUT-168:
------------------------------------

The intent of this was to improve the storage of integer arrays,
especially in the context of sparse vectors.  This would have the largest
impact on binary immutable vectors.

There are a few impinging factors.  The first is that Lucene now has an
implementation of PFOR and PFOR-delta that should provide much better
performance and slightly better compression.  The second is the adoption of
Colt and the attendant digestion load.  Jake would be the most likely consumer
of these routines, but he won't be looking for them for some time.
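
To make the storage idea concrete, here is a minimal sketch of how the
sorted indices of a sparse binary vector could be stored as gaps in a
byte-variable code.  This is an illustration only: it is neither the
code in MAHOUT-168.patch nor Lucene's PFOR, and the class and method
names (GapVByteSketch, encode, decode) are hypothetical.  It assumes
non-negative indices sorted in ascending order.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration: gap-encode sorted indices, then write each gap
// as a byte-variable code (7 payload bits per byte, high bit = "more follows").
final class GapVByteSketch {

    static byte[] encode(int[] sortedIndices) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int index : sortedIndices) {
            int gap = index - previous;
            if (gap < 0) {
                throw new IllegalArgumentException("indices must be non-negative and ascending");
            }
            previous = index;
            // Emit the low 7 bits first; set the high bit while more bits remain.
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    static int[] decode(byte[] bytes, int count) {
        int[] result = new int[count];
        int pos = 0;
        int previous = 0;
        for (int i = 0; i < count; i++) {
            int gap = 0;
            int shift = 0;
            int b;
            do {
                b = bytes[pos++] & 0xFF;
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += gap;          // undo the gap encoding
            result[i] = previous;
        }
        return result;
    }
}
```

With the indices {3, 4, 10, 300, 100000} the gaps are 3, 1, 6, 290 and
99700, which take 1, 1, 1, 2 and 3 bytes respectively: 8 bytes in total
instead of 20 for five raw ints.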

My vote is to mothball it.




-- 
Ted Dunning, CTO
DeepDyve


> Need integer compression routines
> ---------------------------------
>
>                 Key: MAHOUT-168
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-168
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.1
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>            Priority: Minor
>             Fix For: 0.3
>
>         Attachments: MAHOUT-168.patch
>
>
> A selection of these algorithms would be very nice to have:
> www.cs.rmit.edu.au/~jz/fulltext/compjour99.pdf



[jira] Assigned: (MAHOUT-168) Need integer compression routines

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning reassigned MAHOUT-168:
----------------------------------

    Assignee: Ted Dunning

> Need integer compression routines
> ---------------------------------
>
>                 Key: MAHOUT-168
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-168
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>
> A selection of these algorithms would be very nice to have:
> www.cs.rmit.edu.au/~jz/fulltext/compjour99.pdf



[jira] Commented: (MAHOUT-168) Need integer compression routines

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751019#action_12751019 ] 

Grant Ingersoll commented on MAHOUT-168:
----------------------------------------

I think you forgot to attach the patch

> Need integer compression routines
> ---------------------------------
>
>                 Key: MAHOUT-168
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-168
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>
> A selection of these algorithms would be very nice to have:
> www.cs.rmit.edu.au/~jz/fulltext/compjour99.pdf



[jira] Updated: (MAHOUT-168) Need integer compression routines

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning updated MAHOUT-168:
-------------------------------

    Attachment: MAHOUT-168.patch

How very silly to overlook this.  I think that changing the status to "Patch Available" counted as an upload in my brain (or something).

> Need integer compression routines
> ---------------------------------
>
>                 Key: MAHOUT-168
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-168
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>         Attachments: MAHOUT-168.patch
>
>
> A selection of these algorithms would be very nice to have:
> www.cs.rmit.edu.au/~jz/fulltext/compjour99.pdf



[jira] Commented: (MAHOUT-168) Need integer compression routines

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749721#action_12749721 ] 

Ted Dunning commented on MAHOUT-168:
------------------------------------


I considered that, but it seemed easier to write a version from scratch.  Somebody should look at the Lucene code to see if they did anything earth-shakingly clever for speed.

A delta-code, btw, includes gamma and unary codes so that is complete enough to be interesting.  

It should be very little work to add Golomb and byte-variable codes to this.  Of the two, only the byte-variable code seems like a good candidate because it can be made faster than these other forms pretty easily.
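
For readers unfamiliar with how these codes nest, here is a small sketch of the Elias gamma and delta codes: gamma uses a unary prefix, and delta uses a gamma prefix.  It is not taken from MAHOUT-168.patch.  The class name EliasCodes, its methods, and the choice to represent the bit stream as a String of '0' and '1' characters are assumptions made for readability.  The unary convention assumed here is k zeros followed by a one.

```java
// Hypothetical illustration of how the delta code is layered on gamma and unary.
// Bits are kept in a String of '0'/'1' purely for readability, not speed.
final class EliasCodes {

    // Unary code for k >= 0: k zeros followed by a terminating one.
    static void writeUnary(StringBuilder out, int k) {
        for (int i = 0; i < k; i++) {
            out.append('0');
        }
        out.append('1');
    }

    // Writes the lowest `count` bits of value, most significant first.
    static void writeLowBits(StringBuilder out, int value, int count) {
        for (int i = count - 1; i >= 0; i--) {
            out.append(((value >>> i) & 1) == 0 ? '0' : '1');
        }
    }

    // Gamma code for n >= 1: floor(log2 n) in unary, then the bits of n below its leading one.
    static void writeGamma(StringBuilder out, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("gamma code needs n >= 1");
        }
        int log = 31 - Integer.numberOfLeadingZeros(n);
        writeUnary(out, log);   // the terminating one doubles as the leading bit of n
        writeLowBits(out, n, log);
    }

    // Delta code for n >= 1: floor(log2 n) + 1 in gamma, then the bits of n below its leading one.
    static void writeDelta(StringBuilder out, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("delta code needs n >= 1");
        }
        int log = 31 - Integer.numberOfLeadingZeros(n);
        writeGamma(out, log + 1);
        writeLowBits(out, n, log);
    }

    // Sequential reader over a String of '0'/'1' characters.
    static final class Reader {
        private final String bits;
        private int pos;

        Reader(String bits) {
            this.bits = bits;
        }

        int readUnary() {
            int k = 0;
            while (bits.charAt(pos++) == '0') {
                k++;
            }
            return k;
        }

        int readLowBits(int count) {
            int value = 0;
            for (int i = 0; i < count; i++) {
                value = (value << 1) | (bits.charAt(pos++) - '0');
            }
            return value;
        }

        int readGamma() {
            int log = readUnary();
            return (1 << log) | readLowBits(log);
        }

        int readDelta() {
            int log = readGamma() - 1;
            return (1 << log) | readLowBits(log);
        }
    }
}
```

Under these conventions, 5 is written as 00101 in the gamma code and as 01101 in the delta code.  A production version would pack the bits into bytes through a bit-level stream rather than build strings.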

 

> Need integer compression routines
> ---------------------------------
>
>                 Key: MAHOUT-168
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-168
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>
> A selection of these algorithms would be very nice to have:
> www.cs.rmit.edu.au/~jz/fulltext/compjour99.pdf
