Posted to dev@uima.apache.org by "Thilo Goetz (JIRA)" <ui...@incubator.apache.org> on 2008/06/06 16:21:45 UTC
[jira] Closed: (UIMA-1067) Remove char heap/ref heap in StringHeap of the CAS
[ https://issues.apache.org/jira/browse/UIMA-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thilo Goetz closed UIMA-1067.
-----------------------------
Resolution: Fixed
Fixed; all unit tests pass. Please test this change if you use (binary) serialization. It should work the same as before, since I haven't changed the serialization format in any way.
> Remove char heap/ref heap in StringHeap of the CAS
> --------------------------------------------------
>
> Key: UIMA-1067
> URL: https://issues.apache.org/jira/browse/UIMA-1067
> Project: UIMA
> Issue Type: Improvement
> Components: Core Java Framework
> Affects Versions: 2.2.2
> Reporter: Thilo Goetz
> Assignee: Thilo Goetz
> Fix For: 2.3
>
>
> The StringHeap class provides two ways to store strings: either as Java strings, or by copying characters onto a character heap. The second option is only used for deserialization from a binary CAS. However, even when it is not used, this capability carries a very significant memory overhead.
>
> To demonstrate this, I ran the following experiment. As the analysis engine, I used our sandbox POS tagger, which sets just one string feature on each token. As text, I used a 2.4MB input file (2x moby.txt). To run this in IBM Java 1.5.0_7 (which happens to be the JVM I'm interested in), you need to specify -Xmx135M; I checked in 5MB increments. Then I patched the StringHeap implementation to work without the additional bookkeeping overhead and ran the experiment again. I was then able to run with -Xmx115M.
>
> This 20MB reduction (about 15%) represents a very significant gain, particularly given that I ran so little analysis (only tokens and sentences are produced, and only a single string-valued feature is set). The new code also ran a tiny bit faster, but not by much. One might see more improvement for analysis that is not as compute-intensive as the Tagger.
> The challenge is to make sure that the serialization code still works after this change.
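
To make the idea in the description concrete, here is a minimal sketch of a string heap that keeps only Java strings in memory and builds the char heap/ref heap pair on demand for serialization. This is not the actual UIMA code: the class SimpleStringHeap, its nested Flat type, and all method names are hypothetical, and the (offset, length) layout of the ref heap is an assumption made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/**
 * Hypothetical sketch, NOT the UIMA StringHeap. Strings live only as Java
 * String objects; the char heap / ref heap pair exists only while
 * serializing or deserializing.
 */
final class SimpleStringHeap {

    /** Flat form, assumed here to be what a binary format would carry. */
    static final class Flat {
        final char[] charHeap; // all characters, concatenated
        final int[] refHeap;   // one (offset, length) pair per string code

        Flat(char[] charHeap, int[] refHeap) {
            this.charHeap = charHeap;
            this.refHeap = refHeap;
        }
    }

    private final List<String> strings = new ArrayList<>();

    SimpleStringHeap() {
        strings.add(null); // code 0 is reserved for the null string
    }

    /** Stores a non-null string and returns its integer code. */
    int addString(String s) {
        strings.add(Objects.requireNonNull(s));
        return strings.size() - 1;
    }

    /** Returns the string for a code; code 0 yields null. */
    String getString(int code) {
        return strings.get(code);
    }

    /** Builds the flat form; needed only when serializing. */
    Flat toFlat() {
        StringBuilder chars = new StringBuilder();
        int[] refs = new int[2 * strings.size()]; // pair 0 stays (0, 0)
        for (int code = 1; code < strings.size(); code++) {
            String s = strings.get(code);
            refs[2 * code] = chars.length();  // offset into the char heap
            refs[2 * code + 1] = s.length();  // number of chars
            chars.append(s);
        }
        return new Flat(chars.toString().toCharArray(), refs);
    }

    /** Rebuilds a heap of Java strings from the flat form. */
    static SimpleStringHeap fromFlat(Flat flat) {
        SimpleStringHeap heap = new SimpleStringHeap();
        for (int i = 2; i < flat.refHeap.length; i += 2) {
            int offset = flat.refHeap[i];
            int length = flat.refHeap[i + 1];
            heap.addString(new String(flat.charHeap, offset, length));
        }
        return heap;
    }
}
```

In this sketch the flat arrays are garbage as soon as (de)serialization finishes, so normal processing pays only for the list of strings. The codes handed out by addString survive a toFlat/fromFlat round trip, which is the property the serialization code would depend on.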
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Re: Fwd: [jira] Closed: (UIMA-1067) Remove char heap/ref heap in StringHeap of the CAS
Posted by Thilo Goetz <tw...@gmx.de>.
Thanks Eddie, that's great!
--Thilo
Eddie Epstein wrote:
> Thilo,
>
> Just tested this change with the JNI interface to uimacpp and it works fine.
>
> Eddie
Fwd: [jira] Closed: (UIMA-1067) Remove char heap/ref heap in StringHeap of the CAS
Posted by Eddie Epstein <ea...@gmail.com>.
Thilo,
Just tested this change with the JNI interface to uimacpp and it works fine.
Eddie