
Posted to dev@uima.apache.org by "Thilo Goetz (JIRA)" <ui...@incubator.apache.org> on 2008/06/06 12:22:45 UTC

[jira] Created: (UIMA-1067) Remove char heap/ref heap in StringHeap of the CAS

Remove char heap/ref heap in StringHeap of the CAS
--------------------------------------------------

Key: UIMA-1067
URL: https://issues.apache.org/jira/browse/UIMA-1067
Project: UIMA
Issue Type: Improvement
Components: Core Java Framework
Affects Versions: 2.2.2
Reporter: Thilo Goetz
Assignee: Thilo Goetz
Fix For: 2.3

The StringHeap class provides two ways to store strings: either as Java strings, or by copying characters onto a character heap. The second option is only used for deserialization from a binary CAS. However, even when it is not used, this capability carries a significant memory overhead. To demonstrate this, I ran the following experiment. As the analysis engine, I used our sandbox POS tagger, which sets just one string feature on each token. As text, I used a 2.4MB input file (2x moby.txt). To run this in IBM Java 1.5.0_7 (which happens to be the JVM I'm interested in), you need to specify -Xmx135M; I checked in 5MB increments. Then I patched the StringHeap implementation to work without the additional bookkeeping overhead and ran the experiment again. I was then able to run with -Xmx115M. That is 20MB less, or about 15% of the required heap, which is a significant gain, particularly given that I ran so little analysis (only tokens and sentences are produced, and only a single string-valued feature is set). The new code also ran slightly faster, but not by much. One might see more improvement for analysis that is less compute-intensive than the tagger.

The challenge is to make sure that the serialization code still works after this change.
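
To make the idea concrete, here is a minimal sketch of a string heap that keeps only Java strings and has no separate char heap or ref heap. It is not the actual UIMA StringHeap code: the class name SimpleStringHeap and the methods addString, addCharBuffer, getString and size are hypothetical names chosen for illustration. It assumes that callers refer to strings by an int handle and that handle 0 stands for "no string". Characters arriving from a binary CAS are turned into a String immediately instead of being copied onto a shared character heap.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a strings-only heap; not the real UIMA class. */
final class SimpleStringHeap {

    // Slot 0 is reserved so that handle 0 can mean "no string" (null).
    private final List<String> strings = new ArrayList<String>();

    SimpleStringHeap() {
        strings.add(null);
    }

    /** Stores a Java string and returns its int handle (0 for null). */
    int addString(String s) {
        if (s == null) {
            return 0;
        }
        strings.add(s);
        return strings.size() - 1;
    }

    /**
     * Stores a slice of a char buffer, for example one read during binary
     * deserialization, by converting it to a String right away.
     */
    int addCharBuffer(char[] buffer, int start, int length) {
        strings.add(new String(buffer, start, length));
        return strings.size() - 1;
    }

    /** Returns the string for a handle, or null for handle 0. */
    String getString(int handle) {
        return strings.get(handle);
    }

    /** Number of strings stored, not counting the reserved null slot. */
    int size() {
        return strings.size() - 1;
    }
}
```

With a layout like this there is one list entry per string and no per-string offset and length bookkeeping for a character heap that is rarely used. Whether this matches the savings measured above depends on the real implementation.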

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: Fwd: [jira] Closed: (UIMA-1067) Remove char heap/ref heap in StringHeap of the CAS

Posted by Thilo Goetz <tw...@gmx.de>.

Thanks Eddie, that's great!

--Thilo

Eddie Epstein wrote:
> Thilo,
> 
> Just tested this change with the JNI interface to uimacpp and it works fine.
> 
> Eddie

Fwd: [jira] Closed: (UIMA-1067) Remove char heap/ref heap in StringHeap of the CAS

Posted by Eddie Epstein <ea...@gmail.com>.

Thilo,

Just tested this change with the JNI interface to uimacpp and it works fine.

Eddie


---------- Forwarded message ----------
From: Thilo Goetz (JIRA) <ui...@incubator.apache.org>
Date: Fri, Jun 6, 2008 at 10:21 AM
Subject: [jira] Closed: (UIMA-1067) Remove char heap/ref heap in StringHeap of the CAS
To: uima-dev@incubator.apache.org



     [ https://issues.apache.org/jira/browse/UIMA-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thilo Goetz closed UIMA-1067.
-----------------------------

   Resolution: Fixed

Fixed, all unit tests pass.  Please test this change if you use (binary)
serialization.  It should work the same as before; I haven't changed the
serialization format in any way.


[jira] Reopened: (UIMA-1067) Remove char heap/ref heap in StringHeap of the CAS

Posted by "Thilo Goetz (JIRA)" <ui...@incubator.apache.org>.

     [ https://issues.apache.org/jira/browse/UIMA-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thilo Goetz reopened UIMA-1067:
-------------------------------


Fix in 2.2.2 hotfix 1.


[jira] Closed: (UIMA-1067) Remove char heap/ref heap in StringHeap of the CAS

Posted by "Thilo Goetz (JIRA)" <ui...@incubator.apache.org>.

     [ https://issues.apache.org/jira/browse/UIMA-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thilo Goetz closed UIMA-1067.
-----------------------------

    Resolution: Fixed

Backported to 2.2.2-01.


[jira] Closed: (UIMA-1067) Remove char heap/ref heap in StringHeap of the CAS

Posted by "Thilo Goetz (JIRA)" <ui...@incubator.apache.org>.

     [ https://issues.apache.org/jira/browse/UIMA-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thilo Goetz closed UIMA-1067.
-----------------------------

    Resolution: Fixed

Fixed, all unit tests pass.  Please test this change if you use (binary) serialization.  It should work the same as before; I haven't changed the serialization format in any way.
