Posted to dev@mahout.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2009/08/04 11:47:14 UTC
[jira] Created: (MAHOUT-158) Replace all ID values with long
Replace all ID values with long
-------------------------------
Key: MAHOUT-158
URL: https://issues.apache.org/jira/browse/MAHOUT-158
Project: Mahout
Issue Type: Improvement
Components: Clustering
Affects Versions: 0.2
Reporter: Sean Owen
Assignee: Sean Owen
Fix For: 0.2
As mentioned on the mailing list, I am tracking this as a possible change for evaluation. The idea is to save memory and CPU by using primitive long IDs instead, avoiding the Object overhead of tens of millions of ID objects.
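To illustrate the idea in the description, here is a minimal sketch of a set that stores IDs as primitive longs in a flat array, so no per-ID object is allocated. The class name LongIdSet and its methods are hypothetical and are not taken from the patch; the sentinel choice (Long.MIN_VALUE cannot be stored) is an assumption made to keep the sketch short.

```java
import java.util.Arrays;

/** Hypothetical open-addressing set of primitive long IDs; not Mahout's actual class. */
final class LongIdSet {
    // Assumption: Long.MIN_VALUE marks an empty slot, so it cannot be used as an ID.
    private static final long EMPTY = Long.MIN_VALUE;

    private long[] slots = new long[16]; // length is always a power of two
    private int size;

    LongIdSet() {
        Arrays.fill(slots, EMPTY);
    }

    /** Adds an ID; returns false if it was already present. */
    boolean add(long id) {
        if (id == EMPTY) {
            throw new IllegalArgumentException("reserved sentinel value");
        }
        if ((size + 1) * 2 > slots.length) {
            grow(); // keep the table at most half full so probing always terminates
        }
        int i = find(slots, id);
        if (slots[i] == id) {
            return false;
        }
        slots[i] = id;
        size++;
        return true;
    }

    boolean contains(long id) {
        return id != EMPTY && slots[find(slots, id)] == id;
    }

    int size() {
        return size;
    }

    /** Linear probing: returns the slot holding id, or the first empty slot on its probe path. */
    private static int find(long[] table, long id) {
        int mask = table.length - 1;
        int i = (int) (id ^ (id >>> 32)) & mask;
        while (table[i] != EMPTY && table[i] != id) {
            i = (i + 1) & mask;
        }
        return i;
    }

    private void grow() {
        long[] bigger = new long[slots.length * 2];
        Arrays.fill(bigger, EMPTY);
        for (long id : slots) {
            if (id != EMPTY) {
                bigger[find(bigger, id)] = id;
            }
        }
        slots = bigger;
    }
}
```

Compared with a HashSet<Long>, a structure like this holds each ID in 8 bytes of array space with no boxed Long and no entry object, and lookups do not box the key.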
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-158) Replace all ID values with long
Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved MAHOUT-158.
------------------------------
Resolution: Fixed
Submitted. A massive change, but it really did a lot for performance and memory usage.
[jira] Updated: (MAHOUT-158) Replace all ID values with long
Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated MAHOUT-158:
-----------------------------
Attachment: MAHOUT-158.patch
This is what I intend to submit. It does not include the promised support for long<->String ID translation; that will be separate.
[jira] Updated: (MAHOUT-158) Replace all ID values with long
Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated MAHOUT-158:
-----------------------------
Attachment: MAHOUT-158.patch
Preliminary patch for review, for anyone who is curious. It is also epic -- core changes only in this one so far! In my realistic-ish test case, required heap size went down by about 25% (less than expected...) and speed increased by about 30%.
[jira] Commented: (MAHOUT-158) Replace all ID values with long
Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740070#action_12740070 ]
Sean Owen commented on MAHOUT-158:
----------------------------------
For the interested, I was able to drive the memory requirements down to more like the expected value -- now running comfortably in a heap of 360M compared to needing 600M before (and more like 1GB before MAHOUT-151/154).
It was an interesting lesson in GC ergonomics. I found myself running into incredible GC overhead before the heap was full -- not even close. I learned the difference between the young generation and the tenured generation in the GC: with the default memory layout, it will let "old" objects consume only about 75% of the heap. Now that this system is leaner, almost all objects in memory are long-lived, and a lot less garbage is generated, since long primitives are used instead of Longs and there is much less conversion between the two. So I had to set -XX:NewRatio=9 to ask it to allow more like 90% for "old" objects. Then I was able to bring the heap size down to a more reasonable value.
I am proceeding to convert the tests now, as I review the changes. This is another big one.
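The percentages in the comment above follow from how -XX:NewRatio is defined: it is the ratio of the tenured generation to the young generation, so the tenured share of the heap is NewRatio / (NewRatio + 1). A small sketch of that arithmetic follows; the class and method names are made up for illustration, and a ratio of 3 is used only because it reproduces the "about 75%" figure quoted above.

```java
/** Illustrative arithmetic only; the class and method names are hypothetical. */
final class NewRatioMath {

    /** Fraction of the heap given to the tenured generation for a given -XX:NewRatio value. */
    static double tenuredFraction(int newRatio) {
        return newRatio / (newRatio + 1.0);
    }

    public static void main(String[] args) {
        // 3 matches the "about 75%" figure; 9 is the value set in the comment.
        for (int ratio : new int[] {3, 9}) {
            System.out.printf("NewRatio=%d -> tenured share %.0f%%%n",
                    ratio, 100 * tenuredFraction(ratio));
        }
    }
}
```

On the command line, the setting described above would be passed alongside the heap limit, for example java -Xmx360m -XX:NewRatio=9 followed by the application's main class.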