Posted to dev@mahout.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2009/08/04 11:47:14 UTC

[jira] Created: (MAHOUT-158) Replace all ID values with long

Replace all ID values with long
-------------------------------

                 Key: MAHOUT-158
                 URL: https://issues.apache.org/jira/browse/MAHOUT-158
             Project: Mahout
          Issue Type: Improvement
          Components: Clustering
    Affects Versions: 0.2
            Reporter: Sean Owen
            Assignee: Sean Owen
             Fix For: 0.2


As mentioned on the mailing list, I am tracking this as a possible change for evaluation. The idea is to save memory and CPU by using primitive long IDs, which avoids the Object overhead of tens of millions of ID objects.
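
To illustrate the idea, here is a minimal sketch, not taken from the patch. The class and method names are hypothetical. It contrasts IDs held as boxed Long objects with IDs held as primitive long values in an array. The exact per-object overhead depends on the JVM, so no byte counts are claimed here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these names do not come from the Mahout code base.
final class LongIdSketch {

    // Boxed IDs: each element is a reference to a java.lang.Long object on the heap
    // (small values may be shared through the Long cache).
    static long sumBoxed(List<Long> ids) {
        long sum = 0L;
        for (Long id : ids) {
            sum += id; // unboxes on every access
        }
        return sum;
    }

    // Primitive IDs: the 64-bit values are stored directly in the array,
    // so the whole collection is a single object.
    static long sumPrimitive(long[] ids) {
        long sum = 0L;
        for (long id : ids) {
            sum += id;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1000000;
        List<Long> boxed = new ArrayList<Long>(n);
        long[] primitive = new long[n];
        for (int i = 0; i < n; i++) {
            boxed.add((long) i); // autoboxing: may allocate a Long per ID
            primitive[i] = i;
        }
        // Both representations carry the same ID values.
        System.out.println(sumBoxed(boxed) == sumPrimitive(primitive));
    }
}
```

The boxed list holds a reference per ID plus a separate object behind most of those references, while the primitive array holds only the values. That difference is the overhead the issue aims to remove.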

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (MAHOUT-158) Replace all ID values with long

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-158.
------------------------------

    Resolution: Fixed

Submitted. It was a massive change, but it really did a lot for performance and memory usage.



[jira] Updated: (MAHOUT-158) Replace all ID values with long

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-158:
-----------------------------

    Attachment: MAHOUT-158.patch

This is what I intend to submit. It does not include the promised support for long<->String ID translation; that will be separate.



[jira] Updated: (MAHOUT-158) Replace all ID values with long

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-158:
-----------------------------

    Attachment: MAHOUT-158.patch

Preliminary patch for review, for anyone who is curious. It is also epic, and it contains core changes only so far! In my realistic-ish test case, required heap size went down about 25% (less than expected...) and speed increased by about 30%.



[jira] Commented: (MAHOUT-158) Replace all ID values with long

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740070#action_12740070 ] 

Sean Owen commented on MAHOUT-158:
----------------------------------

For the interested, I was able to drive the memory requirements down to more like the expected value -- it now runs comfortably in a heap of 360M compared to needing 600M before (and more like 1GB before MAHOUT-151/154).

It was an interesting lesson in GC ergonomics. I found myself running into incredible GC overhead before the heap was full -- not even close. I learned the difference between the young generation and tenured generation in the GC: with the default memory organization, it will let "old" objects consume only about 75% of the heap. Now that this system is leaner, almost all objects in memory are long-lived, and a lot less garbage is generated, since long primitives are used instead of Longs and there is much less conversion between the two. So I had to set -XX:NewRatio=9 to ask it to allow more like 90% for "old" objects. Then I was able to bring down the heap size to a more reasonable value.
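
The arithmetic behind the 90% figure can be sketched as follows. The class and method names are hypothetical. The sketch assumes only the usual meaning of the HotSpot flag: -XX:NewRatio=N makes the tenured generation N times the size of the young generation.

```java
// Hypothetical helper; it only does the arithmetic implied by -XX:NewRatio.
final class NewRatioSketch {

    // With NewRatio=N, tenured is N parts and young is 1 part,
    // so tenured takes N / (N + 1) of the combined space.
    static double tenuredFraction(int newRatio) {
        if (newRatio < 1) {
            throw new IllegalArgumentException("newRatio must be >= 1");
        }
        return newRatio / (newRatio + 1.0);
    }

    public static void main(String[] args) {
        for (int ratio : new int[] {3, 9}) {
            System.out.printf("NewRatio=%d -> tenured is about %.0f%% of the heap%n",
                    ratio, 100 * tenuredFraction(ratio));
        }
    }
}
```

Under that reading, a ratio of 3 corresponds to the roughly 75% split described above, and a ratio of 9 gives 90%. On the command line this might look like `java -Xmx360m -XX:NewRatio=9 MyRecommenderApp`, where MyRecommenderApp is a placeholder main class and -Xmx360m is assumed from the 360M heap mentioned in this comment.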

I am proceeding to convert the tests now, as I review the changes. This is another big one.
