Posted to dev@opennlp.apache.org by Katrin Tomanek <ka...@averbis.com> on 2012/02/09 12:38:41 UTC

making EOS character configurable for sentence splitter

Hi,

I am moving the discussion on making the EOS characters of the sentence 
splitter configurable to the dev list (it was previously on the user list).

I am currently trying to make the EOS characters a parameter of the 
SentenceDetectorME and store them as a model parameter.

So far this works fine (although it requires changes in quite a few 
places in the code).

I am putting a "char[] eosCharacters" into the artifactMap in SentenceModel.
When predicting with a model, I test whether the EOS parameter is set 
and, if so, use these EOS symbols; otherwise I use the language-dependent ones.
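
To make the fallback concrete, here is a minimal sketch of that check. The class EosSelection, the method effectiveEosChars and the default set are hypothetical names chosen for illustration, not existing OpenNLP code; the real language-dependent defaults live elsewhere.

```java
import java.util.Arrays;

/** Hypothetical helper (not an OpenNLP class): chooses the EOS characters used at prediction time. */
final class EosSelection {

    /** Stand-in for the language-dependent defaults; the real set depends on the language. */
    static final char[] DEFAULT_EOS = {'.', '!', '?'};

    /** Returns the EOS characters stored with the model if present, else the defaults. */
    static char[] effectiveEosChars(char[] modelEos) {
        char[] chosen = (modelEos != null) ? modelEos : DEFAULT_EOS;
        return chosen.clone(); // defensive copy so callers cannot modify shared state
    }

    public static void main(String[] args) {
        // No custom set stored with the model: falls back to the defaults.
        System.out.println(Arrays.toString(effectiveEosChars(null)));
        // Custom set stored with the model: used as-is.
        System.out.println(Arrays.toString(effectiveEosChars(new char[] {';', '.'})));
    }
}
```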

However, I am now running into trouble when serializing the model with 
the new "char[]" parameter:

Writing sentence detector model ... Exception in thread "main" 
java.lang.IllegalStateException: Missing serializer for eosCharacters

I know that I would have to write such a serializer, but I am a bit 
lost here. Any hints? Maybe there is already a serializer for char[] 
that I could easily use.

Best
Katrin

Re: making EOS character configurable for sentence splitter

Posted by "william.colen@gmail.com" <wi...@gmail.com>.
Nice that it works now.

I am reviewing the patch.


Re: making EOS character configurable for sentence splitter

Posted by Katrin Tomanek <ka...@averbis.com>.
Hi,

Yes, it definitely improved. The patch I submitted fixes the problem, 
as I can now define the EOS set flexibly.

In our setting this is quite important, as we sometimes have atypical 
EOS symbols that still need to be identified.

Best
Katrin



-- 
Dr. Katrin Tomanek
Averbis GmbH
Tennenbacher Strasse 11
D-79106 Freiburg

Fon: +49 (0) 761 - 203 97696
Fax: +49 (0) 761 - 203 97694
E-Mail: katrin.tomanek@averbis.com

Geschäftsführer: Dr. med. Philipp Daumke, Dr. Kornél Markó
Sitz der Gesellschaft: Freiburg i. Br.
AG Freiburg i. Br., HRB 701080

Re: making EOS character configurable for sentence splitter

Posted by "william.colen@gmail.com" <wi...@gmail.com>.
Katrin, did you try to evaluate your model again? Did it improve?



Re: making EOS character configurable for sentence splitter

Posted by Katrin Tomanek <ka...@averbis.com>.
Hi Jörn,

Thanks for the info. From our side, we would of course favour a 1.5.3 
release so that we can use the patch in an "official" version...

As for your question: we are using our own UIMA components, which wrap 
the OpenNLP tools. This is basically because our company has its own 
type system and we do some postprocessing steps after retrieving 
the OpenNLP tagger results.

Best
Katrin



Re: making EOS character configurable for sentence splitter

Posted by Joern Kottmann <ko...@gmail.com>.
On Thu, Feb 9, 2012 at 3:04 PM, Katrin Tomanek
<ka...@averbis.com> wrote:

> Hi Jörn,
>
> I have made all changes and added a patch to the JIRA-issue.
>
> What are the next steps?
>
> And btw: when do you plan to release 1.5.3?
>
>
There are no plans yet to release the next version. It isn't even decided
if it will be 1.5.3 or 1.6.0.

One of the committers will review your patch and give you feedback.

BTW, do you plan to use our UIMA integration for your processing?
We also started to work on UIMA based annotation tooling.

Jörn

Re: making EOS character configurable for sentence splitter

Posted by Katrin Tomanek <ka...@averbis.com>.
Hi Jörn,

I have made all the changes and added a patch to the JIRA issue.

What are the next steps?

And btw: when do you plan to release 1.5.3?

Best
Katrin


Re: making EOS character configurable for sentence splitter

Posted by Joern Kottmann <ko...@gmail.com>.
You need to fetch the manifest from the artifact map and then
put the chars into the manifest itself.

Please see TokenizerModel.useAlphaNumericOptimization on how
to do that.

Jörn


Re: making EOS character configurable for sentence splitter

Posted by Katrin Tomanek <ka...@averbis.com>.
Hi again,

OK, found it... now I understand what you meant by "manifest". I did 
this:

--------------------------------
if (eosCharacters != null)
    setManifestProperty(EOS_CHARACTERS_PROPERTY,
        eosCharArrayToString(eosCharacters));
--------------------------------

Now it works.
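
For readers wondering why this change makes the exception go away, here is a toy sketch of the difference between adding a separate artifact and setting a manifest property. ToyModel, its fields and the per-entry-name serializer lookup are simplifications assumed for illustration; this is not the real OpenNLP model code.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

/** Toy model container (hypothetical, much simpler than OpenNLP's model classes). */
final class ToyModel {

    static final String MANIFEST_ENTRY = "manifest.properties";

    private final Map<String, Object> artifactMap = new LinkedHashMap<>();
    // Entry names for which a serializer is registered (assumed lookup scheme).
    private final Set<String> serializableEntries = new HashSet<>();

    ToyModel() {
        // The manifest is itself an artifact, and it already has a serializer.
        artifactMap.put(MANIFEST_ENTRY, new Properties());
        serializableEntries.add(MANIFEST_ENTRY);
    }

    /** Adds a separate artifact; it needs its own serializer to be written out. */
    void putArtifact(String entryName, Object artifact) {
        artifactMap.put(entryName, artifact);
    }

    /** Stores a setting inside the manifest, which is already serializable. */
    void setManifestProperty(String key, String value) {
        ((Properties) artifactMap.get(MANIFEST_ENTRY)).setProperty(key, value);
    }

    String getManifestProperty(String key) {
        return ((Properties) artifactMap.get(MANIFEST_ENTRY)).getProperty(key);
    }

    /** Mimics the check done before writing: every entry must have a serializer. */
    void checkSerializable() {
        for (String entryName : artifactMap.keySet()) {
            if (!serializableEntries.contains(entryName)) {
                throw new IllegalStateException("Missing serializer for " + entryName);
            }
        }
    }

    public static void main(String[] args) {
        ToyModel viaManifest = new ToyModel();
        viaManifest.setManifestProperty("eosCharacters", ".?!");
        viaManifest.checkSerializable(); // passes: only the manifest entry exists

        ToyModel viaArtifact = new ToyModel();
        viaArtifact.putArtifact("eosCharacters", ".?!");
        try {
            viaArtifact.checkSerializable();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the string stored via putArtifact becomes a new entry that nothing knows how to write, while the same string stored via setManifestProperty simply travels inside the manifest.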

Best
Katrin


Re: making EOS character configurable for sentence splitter

Posted by Katrin Tomanek <ka...@averbis.com>.
Hi Jörn,

I did that:


   public SentenceModel(String languageCode, AbstractModel sentModel,
       boolean useTokenEnd, Dictionary abbreviations, char[] eosCharacters,
       Map<String, String> manifestInfoEntries) {

     super(COMPONENT_NAME, languageCode, manifestInfoEntries);

     artifactMap.put(MAXENT_MODEL_ENTRY_NAME, sentModel);

     setManifestProperty(TOKEN_END_PROPERTY, Boolean.toString(useTokenEnd));

     // Abbreviations are optional
     if (abbreviations != null)
         artifactMap.put(ABBREVIATIONS_ENTRY_NAME, abbreviations);

     // EOS characters are optional
     if (eosCharacters != null)
         artifactMap.put(EOS_CHARACTERS_ENTRY_NAME,
             eosCharArrayToString(eosCharacters));

     checkArtifactMap();
   }

The EOS char array is converted to a string, which is written to the 
manifest.

Still, when serializing the model, I get:

Exception in thread "main" java.lang.IllegalStateException: Missing serializer for eosCharacters


Best,
Katrin


Re: making EOS character configurable for sentence splitter

Posted by Joern Kottmann <ko...@gmail.com>.
The artifactMap contains a manifest (that is, a Properties object).
You should store the EOS chars in this manifest. We need a smart way to
convert them into a String.

The sentence detector should then retrieve the EOS chars from the model,
e.g. via a getEosChars method.

Have a look at the other model classes as well, e.g. the tokenizer model.
It stores some settings in the manifest. That is a good place to look for a
code sample.
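
A rough sketch of one possible conversion follows, assuming each EOS symbol is a single char so that plain concatenation is enough. The class EosManifest, the property key and the roundTrip helper are hypothetical; only java.util.Properties is a real API here, and the actual patch may do this differently.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical sketch (not OpenNLP code): EOS characters kept as one manifest property. */
final class EosManifest {

    /** Assumed property key; the name used by the actual patch may differ. */
    static final String EOS_CHARACTERS_PROPERTY = "eosCharacters";

    /** Each EOS symbol is a single char, so plain concatenation is a lossless encoding. */
    static String eosCharArrayToString(char[] eosChars) {
        return new String(eosChars);
    }

    /** Stores the EOS characters in the manifest; does nothing if none are given. */
    static void setEosChars(Properties manifest, char[] eosChars) {
        if (eosChars != null) {
            manifest.setProperty(EOS_CHARACTERS_PROPERTY, eosCharArrayToString(eosChars));
        }
    }

    /** Returns the stored EOS characters, or null if the manifest has none. */
    static char[] getEosChars(Properties manifest) {
        String value = manifest.getProperty(EOS_CHARACTERS_PROPERTY);
        return (value == null) ? null : value.toCharArray();
    }

    /** Writes and re-reads the manifest, similar to saving and loading a model. */
    static Properties roundTrip(Properties manifest) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            manifest.store(out, null); // escapes special and non-Latin-1 characters
            Properties loaded = new Properties();
            loaded.load(new ByteArrayInputStream(out.toByteArray()));
            return loaded;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Properties manifest = new Properties();
        setEosChars(manifest, new char[] {'.', '?', '!', '\u3002'});
        char[] restored = getEosChars(roundTrip(manifest));
        System.out.println(restored.length + " EOS characters restored");
    }
}
```

Returning null from getEosChars when the property is absent lets the caller fall back to the language-dependent defaults for older models.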

Jörn

