Posted to users@opennlp.apache.org by Andreas Niekler <an...@informatik.uni-leipzig.de> on 2013/03/12 15:22:43 UTC

TokenizerTrainer

Dear List,

I created a tokenizer model with 300k German sentences from a very clean
corpus. I see some words that are very strangely separated by a tokenizer
using this model, like:

stehenge - blieben
fre - undlicher

and so on. I can't find those in my training data and wonder why OpenNLP
splits those words without any evidence in the training data and without
any whitespace in my text files. I trained the model with 500
iterations, cutoff 5 and alphanumeric optimisation.

Can anyone suggest how I can prevent this?

Thank you

Andreas
-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniekler@informatik.uni-leipzig.de

Re: TokenizerTrainer

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.
Hello,

OK, I checked the sources now and I can see that the tokenizer skips
further tokenisation if a certain pattern is matched. I also looked into
the default values of the patterns. As far as I can see, within
Factory.java there is no pattern for the "de" language flag. For me this
means that there is no standard way of training a German model with the
TokenizerTrainer tool. I guess I have to write my own training tool
where I set the pattern within the TokenizerME myself. Am I right here?
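
For what it's worth, a minimal sketch of what such a training tool could look
like, assuming an OpenNLP release in which TokenizerFactory accepts a custom
alpha-numeric Pattern (newer releases do); the file name and the umlaut-aware
pattern below are only illustrative, and the stream constructors differ
slightly between versions:

    import java.io.File;
    import java.io.FileOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Pattern;

    import opennlp.tools.tokenize.TokenSample;
    import opennlp.tools.tokenize.TokenSampleStream;
    import opennlp.tools.tokenize.TokenizerFactory;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class GermanTokenizerTrainer {
        public static void main(String[] args) throws Exception {
            // <SPLIT>-annotated training data, one sentence per line (file name is hypothetical)
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new MarkableFileInputStreamFactory(new File("de-token.train")),
                    StandardCharsets.UTF_8);
            ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

            // Custom alpha-numeric pattern that also covers German umlauts and ß
            // (the default alphanumeric pattern does not cover umlauts)
            Pattern alphaNumeric = Pattern.compile("^[A-Za-z0-9ÄäÖöÜüß]+$");
            TokenizerFactory factory = new TokenizerFactory("de", null, true, alphaNumeric);

            TokenizerModel model = TokenizerME.train(samples, factory,
                    TrainingParameters.defaultParams());
            try (FileOutputStream out = new FileOutputStream("de-token.bin")) {
                model.serialize(out);
            }
        }
    }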

Finally, I wonder if the training class for the de-token.bin file on the
models page is public so that I can adapt it for my own data. If anyone
can point me to it, that would be very helpful.

Thank you

Andreas

Am 13.03.2013 12:15, schrieb James Kosin:
> Andreas,
> 
> Tokenizing is a very simple procedure; so, the default of 100 iterations
> should suffice as long as you have a large training set.  Greater than
> say about 1,000 lines.
> 
> James
> 
> On 3/13/2013 4:39 AM, Andreas Niekler wrote:
>> Hello,
>>
>> it was a clean set which i just annotated with the <SPLIT> tags.
>>
>> And the german root bases for those examples are not right in those
>> cases i posted.
>>
>> I used 500 iterations could it be an overfitting problem?
>>
>> Thanks for your help.
>>
>> Am 13.03.2013 02:38, schrieb James Kosin:
>>> On 3/12/2013 10:22 AM, Andreas Niekler wrote:
>>>> stehenge - blieben
>>>> fre - undlicher
>>> Andreas,
>>>
>>> I'm not an expert on German, but in English the models are also trained
>>> on splitting contractions and other words into their root bases.
>>>
>>> ie:  You'll -split-> You 'll -meaning-> You will
>>>        Can't -split-> Can 't -meaning-> Can not
>>>
>>> Other words may also get parsed and separated by the tokenizer.
>>>
>>> Did you create the training data yourself?  Or was this a clean set of
>>> data from another source?
>>>
>>> James
>>>
> 

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniekler@informatik.uni-leipzig.de

Re: TokenizerTrainer

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.
Hello,

Thank you for the information. Let me ask one last question regarding the
TokenizerTrainer: what exactly does the alphanumeric optimisation do?
What is the difference between using this optimisation and not using it?

All the best

Andreas

Am 13.03.2013 12:15, schrieb James Kosin:
> Andreas,
> 
> Tokenizing is a very simple procedure; so, the default of 100 iterations
> should suffice as long as you have a large training set.  Greater than
> say about 1,000 lines.
> 
> James
> 
> On 3/13/2013 4:39 AM, Andreas Niekler wrote:
>> Hello,
>>
>> it was a clean set which i just annotated with the <SPLIT> tags.
>>
>> And the german root bases for those examples are not right in those
>> cases i posted.
>>
>> I used 500 iterations could it be an overfitting problem?
>>
>> Thanks for your help.
>>
>> Am 13.03.2013 02:38, schrieb James Kosin:
>>> On 3/12/2013 10:22 AM, Andreas Niekler wrote:
>>>> stehenge - blieben
>>>> fre - undlicher
>>> Andreas,
>>>
>>> I'm not an expert on German, but in English the models are also trained
>>> on splitting contractions and other words into their root bases.
>>>
>>> ie:  You'll -split-> You 'll -meaning-> You will
>>>        Can't -split-> Can 't -meaning-> Can not
>>>
>>> Other words may also get parsed and separated by the tokenizer.
>>>
>>> Did you create the training data yourself?  Or was this a clean set of
>>> data from another source?
>>>
>>> James
>>>
> 

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniekler@informatik.uni-leipzig.de

Re: TokenizerTrainer

Posted by James Kosin <ja...@gmail.com>.
Andreas,

Tokenizing is a very simple procedure, so the default of 100 iterations
should suffice as long as you have a large training set, say more than
about 1,000 lines.
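
For illustration, a short sketch of setting those parameters explicitly through
the TrainingParameters API of opennlp-tools (the values shown are just the
defaults mentioned above):

    import opennlp.tools.util.TrainingParameters;

    public class TrainerParams {
        public static void main(String[] args) {
            // defaultParams() corresponds to 100 iterations and a cutoff of 5;
            // both can be overridden and the object is then passed to TokenizerME.train(...)
            TrainingParameters params = TrainingParameters.defaultParams();
            params.put(TrainingParameters.ITERATIONS_PARAM, "100");
            params.put(TrainingParameters.CUTOFF_PARAM, "5");
            System.out.println(params.getSettings());
        }
    }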

James

On 3/13/2013 4:39 AM, Andreas Niekler wrote:
> Hello,
>
> it was a clean set which i just annotated with the <SPLIT> tags.
>
> And the german root bases for those examples are not right in those
> cases i posted.
>
> I used 500 iterations could it be an overfitting problem?
>
> Thanks for your help.
>
> Am 13.03.2013 02:38, schrieb James Kosin:
>> On 3/12/2013 10:22 AM, Andreas Niekler wrote:
>>> stehenge - blieben
>>> fre - undlicher
>> Andreas,
>>
>> I'm not an expert on German, but in English the models are also trained
>> on splitting contractions and other words into their root bases.
>>
>> ie:  You'll -split-> You 'll -meaning-> You will
>>        Can't -split-> Can 't -meaning-> Can not
>>
>> Other words may also get parsed and separated by the tokenizer.
>>
>> Did you create the training data yourself?  Or was this a clean set of
>> data from another source?
>>
>> James
>>


Re: TokenizerTrainer

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.
Hello,

It was a clean set which I just annotated with the <SPLIT> tags.

And the German root bases for those examples are not right in the
cases I posted.

I used 500 iterations; could it be an overfitting problem?

Thanks for your help.

Am 13.03.2013 02:38, schrieb James Kosin:
> On 3/12/2013 10:22 AM, Andreas Niekler wrote:
>> stehenge - blieben
>> fre - undlicher
> Andreas,
> 
> I'm not an expert on German, but in English the models are also trained
> on splitting contractions and other words into their root bases.
> 
> ie:  You'll -split-> You 'll -meaning-> You will
>       Can't -split-> Can 't -meaning-> Can not
> 
> Other words may also get parsed and separated by the tokenizer.
> 
> Did you create the training data yourself?  Or was this a clean set of
> data from another source?
> 
> James
> 

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniekler@informatik.uni-leipzig.de

Re: TokenizerTrainer

Posted by James Kosin <ja...@gmail.com>.
On 3/12/2013 10:22 AM, Andreas Niekler wrote:
> stehenge - blieben
> fre - undlicher
Andreas,

I'm not an expert on German, but in English the models are also trained 
on splitting contractions and other words into their root bases.

i.e.:  You'll -split-> You 'll -meaning-> You will
       Can't -split-> Can 't -meaning-> Can not

Other words may also get parsed and separated by the tokenizer.

Did you create the training data yourself?  Or was this a clean set of 
data from another source?

James

Re: TokenizerTrainer

Posted by James Kosin <ja...@gmail.com>.
This would also explain why it is trying to split words that shouldn't 
be split as well.

On 3/13/2013 3:13 PM, Jörn Kottmann wrote:
> The tokenizer's defaults are for text which is mostly whitespace
> separated;
> did you lose all your whitespace in the text you want to process?
>
> Jörn
>
> On 03/13/2013 04:31 PM, Andreas Niekler wrote:
>> Hello,
>>
>> i give you some examples below this comment. But i already noticed in
>> the code, that the standard tokenizerTrainer call uses the standard
>> alphanumeric pattern which won't work for typical german examples. Most
>> of the data will be separated because of the improper pattern in the
>> standard Factory.java class. My belief is that the de-token.bin model
>> was trained with a proper pattern within another implementation of the
>> training procedure.
>>
>> Here are some training lines:
>>
>> Senden<SPLIT>Pfleiderer<SPLIT>verkaufen<SPLIT>Düsseldorf<SPLIT>(<SPLIT>aktiencheck.de<SPLIT>AG<SPLIT>)<SPLIT>-<SPLIT>Der<SPLIT>Analyst<SPLIT>vom<SPLIT>Bankhaus<SPLIT>Lampe<SPLIT>,<SPLIT>Marc<SPLIT>Gabriel<SPLIT>,<SPLIT>stuft<SPLIT>die<SPLIT>Pfleiderer-Aktie<SPLIT>(<SPLIT>ISIN<SPLIT>DE0006764749<SPLIT>/<SPLIT>WKN<SPLIT>676474<SPLIT>)<SPLIT>von<SPLIT>"<SPLIT>halten<SPLIT>"<SPLIT>auf<SPLIT>"<SPLIT>verkaufen<SPLIT>"<SPLIT>herab<SPLIT>. 
>>
>> Der<SPLIT>vollständige<SPLIT>Zwischenbericht<SPLIT>wird<SPLIT>am<SPLIT>8<SPLIT>.<SPLIT>November<SPLIT>2010<SPLIT>um<SPLIT>12.00<SPLIT>Uhr<SPLIT>veröffentlicht<SPLIT>. 
>>
>> Besonders<SPLIT>in<SPLIT>ländlichen<SPLIT>Gegenden<SPLIT>sind<SPLIT>Telegrafenmaste<SPLIT>auch<SPLIT>heute<SPLIT>noch<SPLIT>weit<SPLIT>verbreitet<SPLIT>-<SPLIT>größtenteils<SPLIT>für<SPLIT>die<SPLIT>Festnetztelefonie<SPLIT>. 
>>
>> Newsticker<SPLIT>RSS-Feed<SPLIT>Morgenweb<SPLIT>Sarah<SPLIT>Palin<SPLIT>als<SPLIT>Reality-Star<SPLIT>im<SPLIT>US-Fernsehen<SPLIT>auf<SPLIT>Sendung<SPLIT>15.11.10<SPLIT>4:58<SPLIT>:<SPLIT>Washington<SPLIT>(<SPLIT>dpa<SPLIT>)<SPLIT>-<SPLIT>Sarah<SPLIT>Palin<SPLIT>hat<SPLIT>jetzt<SPLIT>eine<SPLIT>eigene<SPLIT>Show<SPLIT>. 
>>
>> Fotos<SPLIT>Terrorwarnung<SPLIT>-<SPLIT>Was<SPLIT>man<SPLIT>jetzt<SPLIT>beachten<SPLIT>sollte<SPLIT>Die<SPLIT>Sicherheitslage<SPLIT>spitzt<SPLIT>sich<SPLIT>zu<SPLIT>. 
>>
>> Newsticker<SPLIT>RSS-Feed<SPLIT>Morgenweb<SPLIT>Tausende<SPLIT>Siedler<SPLIT>protestieren<SPLIT>gegen<SPLIT>neuen<SPLIT>Baustopp<SPLIT>21.11.10<SPLIT>11:51<SPLIT>:<SPLIT>Jerusalem<SPLIT>(<SPLIT>dpa<SPLIT>)<SPLIT>-<SPLIT>Die<SPLIT>israelischen<SPLIT>Siedler<SPLIT>haben<SPLIT>ihre<SPLIT>Proteste<SPLIT>gegen<SPLIT>einen<SPLIT>erwarteten<SPLIT>neuen<SPLIT>Baustopp<SPLIT>im<SPLIT>Westjordanland<SPLIT>verschärft<SPLIT>. 
>>
>> Jetzt<SPLIT>einloggen<SPLIT>SchwarzKater<SPLIT>(<SPLIT>vor<SPLIT>4<SPLIT>Stunden<SPLIT>)<SPLIT>WTF<SPLIT>? 
>>
>> Das<SPLIT>Bankhaus<SPLIT>hat<SPLIT>das<SPLIT>Kursziel<SPLIT>für<SPLIT>die<SPLIT>Salzgitter-Aktien<SPLIT>von<SPLIT>69,00<SPLIT>auf<SPLIT>58,00<SPLIT>Euro<SPLIT>gesenkt<SPLIT>,<SPLIT>aber<SPLIT>die<SPLIT>Einstufung<SPLIT>auf<SPLIT>´<SPLIT>Overweight<SPLIT>´<SPLIT>belassen<SPLIT>. 
>>
>> Bundeskanzlerin<SPLIT>Angela<SPLIT>Merkel<SPLIT>(<SPLIT>CDU<SPLIT>)<SPLIT>ist<SPLIT>am<SPLIT>Dienstag<SPLIT>zum<SPLIT>Gipfel<SPLIT>der<SPLIT>Organisation<SPLIT>für<SPLIT>Sicherheit<SPLIT>und<SPLIT>Zusammenarbeit<SPLIT>in<SPLIT>Europa<SPLIT>(<SPLIT>OSZE<SPLIT>)<SPLIT>in<SPLIT>Kasachstan<SPLIT>eingetroffen<SPLIT>. 
>>
>> Mann<SPLIT>totgeprügelt<SPLIT>:<SPLIT>Haftstrafen<SPLIT>im<SPLIT>«<SPLIT>20-Cent-Prozess<SPLIT>»<SPLIT>Die<SPLIT>beiden<SPLIT>Schläger<SPLIT>jugendlichen<SPLIT>Schläger<SPLIT>sind<SPLIT>wegen<SPLIT>Körperverletzung<SPLIT>mit<SPLIT>Todesfolge<SPLIT>zu<SPLIT>Haftstrafen<SPLIT>verurteilt<SPLIT>worden<SPLIT>. 
>>
>> Börsen-Ticker<SPLIT>RSS<SPLIT>›<SPLIT>News<SPLIT>AKTIEN<SPLIT>SCHWEIZ/Vorbörse<SPLIT>:<SPLIT>Leicht<SPLIT>höhere<SPLIT>Eröffnung<SPLIT>erwartet<SPLIT>-<SPLIT>Positive<SPLIT>US-Vorgaben<SPLIT>06.12.2010<SPLIT>08:45<SPLIT>Zürich<SPLIT>(<SPLIT>awp<SPLIT>)<SPLIT>-<SPLIT>Der<SPLIT>Schweizer<SPLIT>Aktienmarkt<SPLIT>dürfte<SPLIT>die<SPLIT>Sitzung<SPLIT>vom<SPLIT>Montag<SPLIT>mit<SPLIT>moderaten<SPLIT>Gewinnen<SPLIT>eröffnen<SPLIT>. 
>>
>> Werbung<SPLIT>'<SPLIT>)<SPLIT>;<SPLIT>AIG<SPLIT>hatte<SPLIT>sich<SPLIT>auf<SPLIT>dem<SPLIT>US-Häusermarkt<SPLIT>verspekuliert<SPLIT>. 
>>
>> Außerordentliche<SPLIT>Hauptversammlung<SPLIT>genehmigt<SPLIT>Aktiensplit<SPLIT>und<SPLIT>Vorratsbeschlüsse<SPLIT>für<SPLIT>Kapitalmaßnahmen<SPLIT>Ad-hoc-Mitteilung<SPLIT>übermittelt<SPLIT>durch<SPLIT>euro<SPLIT>adhoc<SPLIT>mit<SPLIT>dem<SPLIT>Ziel<SPLIT>einer<SPLIT>europaweiten<SPLIT>Verbreitung<SPLIT>. 
>>
>> Börsen-Ticker<SPLIT>RSS<SPLIT>›<SPLIT>News<SPLIT>Führungscrew<SPLIT>übernimmt<SPLIT>bei<SPLIT>Modekette<SPLIT>Schild<SPLIT>Stefan<SPLIT>Portmann<SPLIT>und<SPLIT>Thomas<SPLIT>Herbert<SPLIT>heissen<SPLIT>die<SPLIT>neuen<SPLIT>starken<SPLIT>Männer<SPLIT>bei<SPLIT>Schild<SPLIT>. 
>>
>> IrfanView<SPLIT>Lizenz<SPLIT>:<SPLIT>Freeware<SPLIT>—<SPLIT>Hersteller-Website<SPLIT>IrfanView<SPLIT>ist<SPLIT>ein<SPLIT>für<SPLIT>private<SPLIT>Zwecke<SPLIT>kostenloses<SPLIT>Bildbetrachtungs-<SPLIT>und<SPLIT>Bildbearbeitungsprogramm<SPLIT>,<SPLIT>das<SPLIT>für<SPLIT>kleinere<SPLIT>Belange<SPLIT>durchaus<SPLIT>ausreicht<SPLIT>. 
>>
>> Dragonica<SPLIT>So<SPLIT>testet<SPLIT>4Players<SPLIT>Bitte<SPLIT>einloggen<SPLIT>,<SPLIT>um<SPLIT>Spiel<SPLIT>in<SPLIT>die<SPLIT>Watchlist<SPLIT>aufzunehmen<SPLIT>. 
>>
>> Rating-Update<SPLIT>:<SPLIT>London<SPLIT>(<SPLIT>aktiencheck.de<SPLIT>AG<SPLIT>)<SPLIT>-<SPLIT>Robert<SPLIT>T<SPLIT>.<SPLIT>Cornell<SPLIT>,<SPLIT>Scott<SPLIT>L<SPLIT>.<SPLIT>Gaffner<SPLIT>und<SPLIT>Darren<SPLIT>Yip<SPLIT>,<SPLIT>Analysten<SPLIT>von<SPLIT>Barclays<SPLIT>Capital<SPLIT>,<SPLIT>stufen<SPLIT>die<SPLIT>Aktie<SPLIT>von<SPLIT>ITT<SPLIT>Industries<SPLIT>(<SPLIT>ISIN<SPLIT>US4509111021<SPLIT>/<SPLIT>WKN<SPLIT>860023<SPLIT>)<SPLIT>weiterhin<SPLIT>mit<SPLIT>dem<SPLIT>Rating<SPLIT>"<SPLIT>equal-weight<SPLIT>"<SPLIT>ein<SPLIT>. 
>>
>> Sollten<SPLIT>der<SPLIT>Branche<SPLIT>durch<SPLIT>"<SPLIT>populäre<SPLIT>Preiskürzungen<SPLIT>"<SPLIT>der<SPLIT>Regulierungsbehörden<SPLIT>weiter<SPLIT>Milliarden<SPLIT>entzogen<SPLIT>werden<SPLIT>,<SPLIT>sei<SPLIT>kaum<SPLIT>vorstellbar<SPLIT>,<SPLIT>wie<SPLIT>ein<SPLIT>flächendeckender<SPLIT>Breitbandausbau<SPLIT>noch<SPLIT>finanziert<SPLIT>werden<SPLIT>könnte<SPLIT>,<SPLIT>kritisierte<SPLIT>Obermann<SPLIT>. 
>>
>> Grichting/Von<SPLIT>Bergen<SPLIT>bewährten<SPLIT>sich<SPLIT>abstimmen<SPLIT>Online/Print<SPLIT>Täglich<SPLIT>stellt<SPLIT>der<SPLIT>BLICK<SPLIT>eine<SPLIT>Frage<SPLIT>des<SPLIT>Tages<SPLIT>. 
>>
>> Best<SPLIT>.<SPLIT>For<SPLIT>additional<SPLIT>information<SPLIT>,<SPLIT>please<SPLIT>visit<SPLIT>www.asih.bm<SPLIT>.<SPLIT>0<SPLIT>Bewertungen<SPLIT>dieses<SPLIT>Artikels<SPLIT>:<SPLIT>Kommentare<SPLIT>zu<SPLIT>diesem<SPLIT>Artikel<SPLIT>Geben<SPLIT>Sie<SPLIT>jetzt<SPLIT>einen<SPLIT>Kommentar<SPLIT>zu<SPLIT>diesem<SPLIT>Artikel<SPLIT>ab<SPLIT>. 
>>
>>
>> Am 13.03.2013 15:52, schrieb Jörn Kottmann:
>>> Hello,
>>>
>>> can you tell us a bit more about your training data. Did you manually
>>> annotate these 300k sentences?
>>> Is it possible to post 10 lines or so here?
>>>
>>> Jörn
>>>
>>> On 03/12/2013 03:22 PM, Andreas Niekler wrote:
>>>> Dear List,
>>>>
>>>> i created a Tokenizer Model with 300k german Sentences from a very 
>>>> clean
>>>> corpus. I see some words that are very strangly separated by a 
>>>> tokenizer
>>>> using this model like:
>>>>
>>>> stehenge - blieben
>>>> fre - undlicher
>>>>
>>>> and so on. I cant find those in my training data and wonder why 
>>>> openNLP
>>>> splits those words without any evidence in the training data and 
>>>> wihout
>>>> any whitespace in my text files. I trained the model with 500
>>>> Iterations, cutoff 5 and alphanumeric optimisation.
>>>>
>>>> Can anyone state some ideas how i can prevent this?
>>>>
>>>> thank you
>>>>
>>>> Andreas
>
>


Re: TokenizerTrainer

Posted by "Jim foo.bar" <ji...@gmail.com>.
On 14/03/13 12:00, Andreas Niekler wrote:
> you tokenized an example of my already tokenized training data for the
> maxent tokenizer of OpenNLP.

the sample you posted was a single string, not tokenised text, otherwise 
you would have posted a collection of strings (tokens).

> I asked about the transformation of those
> texts as input to the train method of the open nlp tokenizer

yes I know, I just thought that maybe tokenising was a more pressing 
matter than training a maxent model. If you absolutely *have to* train a 
model then my reply was in vain indeed...

basically, I just noticed that you're having problems converting the 
data and I thought that maybe you don't really have to (if there is a 
regex pattern that does a decent job)...anyway, I guess that wasn't of 
much help...

Jim


Re: TokenizerTrainer

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.
Hello,

You tokenized an example of my already tokenized training data for the
maxent tokenizer of OpenNLP. I asked about the transformation of those
texts as input to the train method of the OpenNLP tokenizer.

Thanks for your reply

Andreas

Am 14.03.2013 12:40, schrieb Jim foo.bar:
> ps: I don't speak German, but the output seems reasonable to
> me...depending on your use case, this could be enough (or not!)...

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniekler@informatik.uni-leipzig.de

Re: TokenizerTrainer

Posted by "Jim foo.bar" <ji...@gmail.com>.
Hi Andreas,

Here is what my simple tokenizing regex pattern 
("[\p{L}\w\d/]+|[\-\,\.\?\!\(\)]") gives me on your sample text:

("Börsen" "-" "Ticker" "RSS" "News" "AKTIEN" "SCHWEIZ/Verlauf" "Leicht" 
"fester" "-" "Gesuchte" "Finanz" "-" "und" "Pharmawerte" "18" "." "10" 
"." "2010" "13" "00" "Zürich" "(" "awp" ")" "-" "Die" "Schweizer" 
"Börse" "zeigt" "sich" "nach" "einem" "Start" "im" "Minus" "zur" 
"Mittagszeit" "leicht" "fester" "." "Aufruhr" "bei" "Bayern" "-" 
"Gegner" "AS" "Rom" "-" "Kritik" "an" "Coach" "Unter" "Druck" "(" "Foto" 
"dpa" ")" "Rom" "(" "dpa" ")" "-" "Bayern" "Münchens" "Champions" "-" 
"League" "-" "Gegner" "AS" "Rom" "ist" "in" "Aufruhr" "." "Weitere" 
"Nachrichten" "Piper" "Jaffray" "Co" "." "stuft" "Baidu" "Sp" "ADR" "-" 
"A" "auf" "overweight" "Minneapolis" "(" "aktiencheck" "." "de" "AG" ")" 
"-" "Gene" "Munster" "," "Analyst" "von" "Piper" "Jaffray" "," "stuft" 
"die" "Aktie" "von" "BAIDU" "." "COM" "(" "ISIN" "US0567521085" "/" 
"WKN" "A0F5DE" ")" "von" "neutral" "auf" "overweight" "hoch" "." 
"Wohnort" "erfurt" "Verfasst" "am" "25" "." "09" "." "2010" "," "02" 
"59" "Titel" "Datum" "des" "PageRank" "Nutzungsrechtest" "von" "Google" 
"Wer" "weiss" "," "wann" "genau" "das" "nutzungsrecht" "nächstes" "jahr" 
"ausläuft" "für" "die" "kostenfreie" "nutzung" "für" "google" "?" "Die" 
"deutsche" "Automobilindustrie" "fährt" "schneller" "aus" "der" "Krise" 
"als" "erwartet" "," "sagte" "VDA" "-" "Präsident" "Matthias" "Wissmann" 
"in" "Berlin" "." "Senden" "Pfleiderer" "verkaufen" "Düsseldorf" "(" 
"aktiencheck" "." "de" "AG" ")" "-" "Der" "Analyst" "vom" "Bankhaus" 
"Lampe" "," "Marc" "Gabriel" "," "stuft" "die" "Pfleiderer" "-" "Aktie" 
"(" "ISIN" "DE0006764749" "/" "WKN" "676474" ")" "von" "halten" "auf" 
"verkaufen" "herab" "." "Der" "vollständige" "Zwischenbericht" "wird" 
"am" "8" "." "November" "2010" "um" "12" "." "00" "Uhr" "veröffentlicht" 
"." "Besonders" "in" "ländlichen" "Gegenden" "sind" "Telegrafenmaste" 
"auch" "heute" "noch" "weit" "verbreitet" "-" "größtenteils" "für" "die" 
"Festnetztelefonie" "." "Newsticker" "RSS" "-" "Feed" "Morgenweb" 
"Sarah" "Palin" "als" "Reality" "-" "Star" "im" "US" "-" "Fernsehen" 
"auf" "Sendung" "15" "." "11" "." "10" "4" "58" "Washington" "(" "dpa" 
")" "-" "Sarah" "Palin" "hat" "jetzt" "eine" "eigene" "Show" "." "Fotos" 
"Terrorwarnung" "-" "Was" "man" "jetzt" "beachten" "sollte" "Die" 
"Sicherheitslage" "spitzt" "sich" "zu" "." "Newsticker" "RSS" "-" "Feed" 
"Morgenweb" "Tausende" "Siedler" "protestieren" "gegen" "neuen" 
"Baustopp" "21" "." "11" "." "10" "11" "51" "Jerusalem" "(" "dpa" ")" 
"-" "Die" "israelischen" "Siedler" "haben" "ihre" "Proteste" "gegen" 
"einen" "erwarteten" "neuen" "Baustopp" "im" "Westjordanland" 
"verschärft" "." "Jetzt" "einloggen" "SchwarzKater" "(" "vor" "4" 
"Stunden" ")" "WTF" "?" "Das" "Bankhaus" "hat" "das" "Kursziel" "für" 
"die" "Salzgitter" "-" "Aktien" "von" "69" "," "00" "auf" "58" "," "00" 
"Euro" "gesenkt" "," "aber" "die" "Einstufung" "auf" "Overweight" 
"belassen" "." "Bundeskanzlerin" "Angela" "Merkel" "(" "CDU" ")" "ist" 
"am" "Dienstag" "zum" "Gipfel" "der" "Organisation" "für" "Sicherheit" 
"und" "Zusammenarbeit" "in" "Europa" "(" "OSZE" ")" "in" "Kasachstan" 
"eingetroffen" "." "Mann" "totgeprügelt" "Haftstrafen" "im" "20" "-" 
"Cent" "-" "Prozess" "Die" "beiden" "Schläger" "jugendlichen" "Schläger" 
"sind" "wegen" "Körperverletzung" "mit" "Todesfolge" "zu" "Haftstrafen" 
"verurteilt" "worden" ".")


mind you, before deploying the regular expression, I had to escape all
double-quote occurrences within your text, as in Java you can't have
nested double quotes in a string literal (the expression simply won't compile).
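
For reference, a minimal Java sketch of applying that pattern with
java.util.regex (the wrapper class and the sample sentence are just for
illustration):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexTokenizerDemo {
        // The pattern quoted above: runs of letters, digits, '_' or '/',
        // or a single one of the listed punctuation characters
        private static final Pattern TOKEN =
                Pattern.compile("[\\p{L}\\w\\d/]+|[\\-\\,\\.\\?\\!\\(\\)]");

        public static List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            Matcher m = TOKEN.matcher(text);
            while (m.find()) {
                tokens.add(m.group());
            }
            return tokens;
        }

        public static void main(String[] args) {
            System.out.println(tokenize(
                    "Der vollständige Zwischenbericht wird am 8. November 2010 veröffentlicht."));
            // -> [Der, vollständige, Zwischenbericht, wird, am, 8, ., November, 2010, veröffentlicht, .]
        }
    }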

HTH,

Jim

ps: I don't speak German, but the output seems reasonable to 
me...depending on your use case, this could be enough (or not!)...

On 14/03/13 11:20, Andreas Niekler wrote:
> Yes all the tokens are separated by a whitespace.
>
> Example:
> Börsen-Ticker RSS › News AKTIEN SCHWEIZ/Verlauf : Leicht fester -
> Gesuchte Finanz- und Pharmawerte 18.10.2010 13:00 Zürich ( awp ) - Die
> Schweizer Börse zeigt sich nach einem Start im Minus zur Mittagszeit
> leicht fester .
> Aufruhr bei Bayern-Gegner AS Rom - Kritik an Coach Unter Druck(Foto :
> dpa ) Rom ( dpa ) - Bayern Münchens Champions-League-Gegner AS Rom ist
> in Aufruhr .
> Weitere Nachrichten Piper Jaffray & Co . stuft Baidu Sp ADR-A auf
> overweight Minneapolis ( aktiencheck.de AG ) - Gene Munster , Analyst
> von Piper Jaffray , stuft die Aktie von BAIDU.COM ( ISIN US0567521085 /
> WKN A0F5DE ) von " neutral " auf " overweight " hoch .
> Wohnort : erfurt Verfasst am : 25.09.2010 , 02:59 Titel : Datum des
> PageRank Nutzungsrechtest von Google Wer weiss , wann genau das
> nutzungsrecht nächstes jahr ausläuft für die kostenfreie nutzung für
> google ?
> " Die deutsche Automobilindustrie fährt schneller aus der Krise als
> erwartet " , sagte VDA-Präsident Matthias Wissmann in Berlin .
> Senden Pfleiderer verkaufen Düsseldorf ( aktiencheck.de AG ) - Der
> Analyst vom Bankhaus Lampe , Marc Gabriel , stuft die Pfleiderer-Aktie (
> ISIN DE0006764749 / WKN 676474 ) von " halten " auf " verkaufen " herab .
> Der vollständige Zwischenbericht wird am 8 . November 2010 um 12.00 Uhr
> veröffentlicht .
> Besonders in ländlichen Gegenden sind Telegrafenmaste auch heute noch
> weit verbreitet - größtenteils für die Festnetztelefonie .
> Newsticker RSS-Feed Morgenweb Sarah Palin als Reality-Star im
> US-Fernsehen auf Sendung 15.11.10 4:58 : Washington ( dpa ) - Sarah
> Palin hat jetzt eine eigene Show .
> Fotos Terrorwarnung - Was man jetzt beachten sollte Die Sicherheitslage
> spitzt sich zu .
> Newsticker RSS-Feed Morgenweb Tausende Siedler protestieren gegen neuen
> Baustopp 21.11.10 11:51 : Jerusalem ( dpa ) - Die israelischen Siedler
> haben ihre Proteste gegen einen erwarteten neuen Baustopp im
> Westjordanland verschärft .
> Jetzt einloggen SchwarzKater ( vor 4 Stunden ) WTF ?
> Das Bankhaus hat das Kursziel für die Salzgitter-Aktien von 69,00 auf
> 58,00 Euro gesenkt , aber die Einstufung auf ´ Overweight ´ belassen .
> Bundeskanzlerin Angela Merkel ( CDU ) ist am Dienstag zum Gipfel der
> Organisation für Sicherheit und Zusammenarbeit in Europa ( OSZE ) in
> Kasachstan eingetroffen .
> Mann totgeprügelt : Haftstrafen im « 20-Cent-Prozess » Die beiden
> Schläger jugendlichen Schläger sind wegen Körperverletzung mit
> Todesfolge zu Haftstrafen verurteilt worden .


Re: TokenizerTrainer

Posted by Rodrigo Agerri <ro...@ehu.es>.
Hi,

Perhaps you already know this, but just in case.
Tokenizing/detokenizing is heavily used in statistical machine
translation to prepare training/evaluation data. There are rule-based
detokenizers developed by the Moses developers, including a German
detokenizer (the Moses research leader is German). I cannot say how
well it works for German, perhaps the Apache OpenNLP detokenizer works
better, but I have used them heavily for English and Spanish, and they work
fine in those languages.

https://github.com/moses-smt/mosesdecoder/tree/master/scripts

Look for the tokenizer and share folders.

Cheers,

Rodrigo

On Fri, Mar 15, 2013 at 9:44 AM, Jörn Kottmann <ko...@gmail.com> wrote:
> On 03/15/2013 02:42 AM, James Kosin wrote:
>>
>>
>> Here, each token is separated by a space in the final output. What you
>> seem to have is data that is already tokenized and you are trying to
>> generate a training file on that data.  It isn't impossible but... nothing
>> you do can get a perfect output of the original without the original data.
>>
>> There are some rules that do work, but... not always.
>
>
> We historically always did it like that, because all the corpora we trained
> on only have tokenized text and therefore need
> to be detokenized somehow to produce training data for the tokenizer.
>
> Jörn

Re: TokenizerTrainer

Posted by Jörn Kottmann <ko...@gmail.com>.
On 03/15/2013 02:42 AM, James Kosin wrote:
>
> Here, each token is separated by a space in the final output. What you 
> seem to have is data that is already tokenized and you are trying to 
> generate a training file on that data.  It isn't impossible but... 
> nothing you do can get a perfect output of the original without the 
> original data.
>
> There are some rules that do work, but... not always.

We historically always did it like that, because all the corpora we 
trained on only have tokenized text and therefore need
to be detokenized somehow to produce training data for the tokenizer.

Jörn

Re: TokenizerTrainer

Posted by James Kosin <ja...@gmail.com>.
Do you have the original sources to the text you have?
Or is that it?

If you had the original text, you could start with building a sentence 
detector... with one sentence per line.  Then make the same sentences 
for the tokenizer training file just adding the <SPLIT> where it is 
needed to split into two tokens.

i.e.: a line in English may be:
     "Today is the day all good men say, a good thing."

Would be tokenized for the training file as
     "<SPLIT>Today is the day all good men say<SPLIT>, a good 
thing<SPLIT>.<SPLIT>"

This way, when new data is parsed through the model that was built, it can
generate tokenized files like this:
     " Today is the day all good men say , a good thing . "

Here, each token is separated by a space in the final output. What you 
seem to have is data that is already tokenized and you are trying to 
generate a training file on that data.  It isn't impossible but... 
nothing you do can get a perfect output of the original without the 
original data.

There are some rules that do work, but... not always.

James

On 3/14/2013 1:16 PM, Andreas Niekler wrote:
> Maybe just a stupid idea but is it not possible to just use my
> whitespace training data and just add one <SPLIT> tag somewhere where it
> makes sense. The tokenizer just needs the feature and all the
> separations are already made. Abbreviations are not separated in that
> file so that it should learn those examples without any further annotation.
>
> But i'm not sure
>
>
>
> Am 14.03.2013 14:50, schrieb Jörn Kottmann:
>> On 03/14/2013 02:15 PM, Andreas Niekler wrote:
>>> Hello,
>>>
>>> seems that this issue is already opened by you:
>>> https://issues.apache.org/jira/browse/OPENNLP-501
>>>
>>> Shoul i include that into 1.6.0 or just the trunk?
>> Leave the version open, it would probably be nice to pull that
>> fix into 1.5.3, but it depends on how quick we get it and what
>> the other committers think about it, so can't promise anything here.
>> If it will not go into 1.5.3 it will definitely go into the version after.
>>
>> Jörn
>>


Re: TokenizerTrainer

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.
Maybe it's a stupid idea, but is it not possible to just use my
whitespace training data and add one <SPLIT> tag somewhere where it
makes sense? The tokenizer just needs the feature, and all the
separations are already made. Abbreviations are not separated in that
file, so it should learn those examples without any further annotation.

But I'm not sure.



Am 14.03.2013 14:50, schrieb Jörn Kottmann:
> On 03/14/2013 02:15 PM, Andreas Niekler wrote:
>> Hello,
>>
>> seems that this issue is already opened by you:
>> https://issues.apache.org/jira/browse/OPENNLP-501
>>
>> Should I include that in 1.6.0 or just the trunk?
> 
> Leave the version open, it would probably be nice to pull that
> fix into 1.5.3, but it depends on how quick we get it and what
> the other committers think about it, so can't promise anything here.
> If it will not go into 1.5.3 it will definitely go into the version after.
> 
> Jörn
> 

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniekler@informatik.uni-leipzig.de

Re: Abbreviations for Tokenisation Training

Posted by William Colen <wi...@gmail.com>.
Andreas,

Include the punctuation marks, like in
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/sentdetect/abb.xml?view=markup

In my experiments I could improve accuracy by only 0.01% using the abbreviation
dictionary combined with a model trained on a Brazilian Portuguese
corpus, but for the final system the dictionary had a positive impact
because you can add abbreviations that are not common in the training data.
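
To make the "include the punctuation marks" advice concrete, here is a small
illustration (not from the original mail; the entries and the pattern are made
up) of building such an abbreviation dictionary in code and handing it to the
tokenizer factory; the same dictionary can also be loaded from an XML file
like the abb.xml linked above:

    import java.util.regex.Pattern;

    import opennlp.tools.dictionary.Dictionary;
    import opennlp.tools.tokenize.TokenizerFactory;
    import opennlp.tools.util.StringList;

    public class AbbreviationDictionaryExample {
        public static void main(String[] args) {
            // Abbreviation entries keep the trailing period, exactly as they appear in text
            Dictionary abbreviations = new Dictionary(false);
            abbreviations.put(new StringList("Dr."));
            abbreviations.put(new StringList("usw."));
            abbreviations.put(new StringList("z.B."));
            System.out.println(abbreviations.contains(new StringList("Dr."))); // true

            // The dictionary is passed to the factory that is used for training;
            // the alpha-numeric pattern here is only an example
            TokenizerFactory factory = new TokenizerFactory("de", abbreviations, true,
                    Pattern.compile("^[A-Za-z0-9ÄäÖöÜüß]+$"));
        }
    }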

William

On Thu, Mar 14, 2013 at 2:29 PM, Jörn Kottmann <ko...@gmail.com> wrote:

> The abbreviation list has almost no impact on the accuracy of the
> tokenizer,
> it might help if you have data with very rare abbreviations, but its not a
> feature
> you should use when you just get started with the training.
>
> My recommendation is to first get a good baseline tokenizer model, and
> then if
> you are not happy with it experiment with more advanced features or
> customization.
>
> I don't know how the dots are handled in the lookup code, maybe somebody
> else does here,
> otherwise I can have a look at the code.
>
> Jörn
>
>
> On 03/14/2013 05:24 PM, Andreas Niekler wrote:
>
>> Dear List,
>>
>> do the abbreviations for the token trainer include the appending . or do
>> they just come in form of the actual string
>>
>> like
>>
>> e.g. vs. e.g
>>
>> or
>>
>> usw. vs. usw
>>
>> or
>>
>> Dr. vs. Dr
>>
>> Thank you
>>
>> Andreas
>>
>> Am 14.03.2013 14:50, schrieb Jörn Kottmann:
>>
>>> On 03/14/2013 02:15 PM, Andreas Niekler wrote:
>>>
>>>> Hello,
>>>>
>>>> seems that this issue is already opened by you:
>>>> https://issues.apache.org/**jira/browse/OPENNLP-501<https://issues.apache.org/jira/browse/OPENNLP-501>
>>>>
>>>> Should I include that in 1.6.0 or just the trunk?
>>>>
>>> Leave the version open, it would probably be nice to pull that
>>> fix into 1.5.3, but it depends on how quick we get it and what
>>> the other committers think about it, so can't promise anything here.
>>> If it will not go into 1.5.3 it will definitely go into the version
>>> after.
>>>
>>> Jörn
>>>
>>>
>

Re: Abbreviations for Tokenisation Training

Posted by Jörn Kottmann <ko...@gmail.com>.
The abbreviation list has almost no impact on the accuracy of the tokenizer,
it might help if you have data with very rare abbreviations, but it's not
a feature
you should use when you are just getting started with the training.

My recommendation is to first get a good baseline tokenizer model, and 
then if
you are not happy with it experiment with more advanced features or 
customization.

I don't know how the dots are handled in the lookup code, maybe somebody 
else does here,
otherwise I can have a look at the code.

Jörn

On 03/14/2013 05:24 PM, Andreas Niekler wrote:
> Dear List,
>
> do the abbreviations for the token trainer include the appending . or do
> they just come in form of the actual string
>
> like
>
> e.g. vs. e.g
>
> or
>
> usw. vs. usw
>
> or
>
> Dr. vs. Dr
>
> Thank you
>
> Andreas
>
> Am 14.03.2013 14:50, schrieb Jörn Kottmann:
>> On 03/14/2013 02:15 PM, Andreas Niekler wrote:
>>> Hello,
>>>
>>> seems that this issue is already opened by you:
>>> https://issues.apache.org/jira/browse/OPENNLP-501
>>>
>>> Should I include that in 1.6.0 or just the trunk?
>> Leave the version open, it would probably be nice to pull that
>> fix into 1.5.3, but it depends on how quick we get it and what
>> the other committers think about it, so can't promise anything here.
>> If it will not go into 1.5.3 it will definitely go into the version after.
>>
>> Jörn
>>


Abbreviations for Tokenisation Training

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.
Dear List,

do the abbreviations for the token trainer include the trailing "." or do
they just come in the form of the actual string,

like

e.g. vs. e.g

or

usw. vs. usw

or

Dr. vs. Dr

Thank you

Andreas

Am 14.03.2013 14:50, schrieb Jörn Kottmann:
> On 03/14/2013 02:15 PM, Andreas Niekler wrote:
>> Hello,
>>
>> seems that this issue is already opened by you:
>> https://issues.apache.org/jira/browse/OPENNLP-501
>>
>> Should I include that in 1.6.0 or just the trunk?
> 
> Leave the version open, it would probably be nice to pull that
> fix into 1.5.3, but it depends on how quick we get it and what
> the other committers think about it, so can't promise anything here.
> If it will not go into 1.5.3 it will definitely go into the version after.
> 
> Jörn
> 

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniekler@informatik.uni-leipzig.de

Re: TokenizerTrainer

Posted by Jörn Kottmann <ko...@gmail.com>.
On 03/14/2013 02:15 PM, Andreas Niekler wrote:
> Hello,
>
> seems that this issue is already opened by you:
> https://issues.apache.org/jira/browse/OPENNLP-501
>
> Should I include that in 1.6.0 or just the trunk?

Leave the version open; it would probably be nice to pull that
fix into 1.5.3, but it depends on how quickly we get it and what
the other committers think about it, so I can't promise anything here.
If it will not go into 1.5.3 it will definitely go into the version after.

Jörn

Re: TokenizerTrainer

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.
Hello,

It seems that this issue was already opened by you:
https://issues.apache.org/jira/browse/OPENNLP-501

Should I include that in 1.6.0 or just the trunk?

Andreas

Am 14.03.2013 14:10, schrieb Jörn Kottmann:
> On 03/14/2013 02:06 PM, Andreas Niekler wrote:
>> i'll create a jira issue and implement a flag whether to set <SPLIT> or
>> just delete the whitespace. I hope this will do it then.
> 
> Thanks, it will be helpful, and if you want also create a jira to add a
> German detokenizer
> dictionary.
> 
> Jörn
> 

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniekler@informatik.uni-leipzig.de

Re: TokenizerTrainer

Posted by Jörn Kottmann <ko...@gmail.com>.
On 03/14/2013 02:06 PM, Andreas Niekler wrote:
> i'll create a jira issue and implement a flag whether to set <SPLIT> or
> just delete the whitespace. I hope this will do it then.

Thanks, that will be helpful, and if you want, also create a JIRA issue to add a
German detokenizer
dictionary.

Jörn

Re: TokenizerTrainer

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.
Hello,

I'll create a JIRA issue and implement a flag for whether to set <SPLIT> or
just delete the whitespace. I hope this will do it then.

Thank you for all the clarifications

Andreas

Am 14.03.2013 13:48, schrieb Jörn Kottmann:
> Have a look here:
> http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.tokenizer.detokenizing
> 
> 
> Here is the detokenizer tool:
> https://github.com/apache/opennlp/blob/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/tokenizer/DictionaryDetokenizerTool.java
> 
> 
> Looks like it doesn't output the <SPLIT> tag, we should change that. The
> main purpose of it is to generate training data
> for the tokenizer. Anyway, patches to improve the detokenizer are very
> welcome, looks like the documentation needs a few
> fixes too.
> 
> HTH,
> Jörn
> 
> On 03/14/2013 01:32 PM, Andreas Niekler wrote:
>> Hello,
>>
>> ok i will find out what the name of the tool is and i will create a
>> rules xml and a abbreviations list (not sure about the format as well
>> here - but i hope i find an example).
>>
>> Are you interested in hosting the model after i finally succeed?
>>
>> Thank you very much
>>
>> Andreas
>>
>> Am 14.03.2013 13:25, schrieb Jörn Kottmann:
>>> On 03/14/2013 12:20 PM, Andreas Niekler wrote:
>>>> So the detokenizer adds the <SPLIT> tag where it is needed?
>>> Exactly, you need to merge the tokens again which were previously not
>>> separated
>>> by a white space. e.g. "SCHWEIZ/Verlauf :" was in the original text
>>> "AKTIEN SCHWEIZ/Verlauf:"
>>> and in the training data you encode that as "AKTIEN
>>> SCHWEIZ/Verlauf<SPLIT>:".
>>>
>>> The detokenizer just figures out which tokens are merged together and
>>> which are not
>>> based on some rules. There is a util which can use that information to
>>> output the tokenizer
>>> training data, should be integrated into the CLI but its a while since I
>>> last used it.
>>>
>>> Don't hesitate to ask if you need more help,
>>> Jörn
>>>
> 

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniekler@informatik.uni-leipzig.de

Re: TokenizerTrainer

Posted by Jörn Kottmann <ko...@gmail.com>.
Have a look here:
http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.tokenizer.detokenizing

Here is the detokenizer tool:
https://github.com/apache/opennlp/blob/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/tokenizer/DictionaryDetokenizerTool.java

Looks like it doesn't output the <SPLIT> tag; we should change that. The
main purpose of it is to generate training data
for the tokenizer. Anyway, patches to improve the detokenizer are very
welcome, and it looks like the documentation needs a few
fixes too.

HTH,
Jörn

On 03/14/2013 01:32 PM, Andreas Niekler wrote:
> Hello,
>
> ok i will find out what the name of the tool is and i will create a
> rules xml and a abbreviations list (not sure about the format as well
> here - but i hope i find an example).
>
> Are you interested in hosting the model after i finally succeed?
>
> Thank you very much
>
> Andreas
>
> Am 14.03.2013 13:25, schrieb Jörn Kottmann:
>> On 03/14/2013 12:20 PM, Andreas Niekler wrote:
>>> So the detokenizer adds the <SPLIT> tag where it is needed?
>> Exactly, you need to merge the tokens again which were previously not
>> separated
>> by a white space. e.g. "SCHWEIZ/Verlauf :" was in the original text
>> "AKTIEN SCHWEIZ/Verlauf:"
>> and in the training data you encode that as "AKTIEN
>> SCHWEIZ/Verlauf<SPLIT>:".
>>
>> The detokenizer just figures out which tokens are merged together and
>> which are not
>> based on some rules. There is a util which can use that information to
>> output the tokenizer
>> training data, should be integrated into the CLI but its a while since I
>> last used it.
>>
>> Don't hesitate to ask if you need more help,
>> Jörn
>>


Re: TokenizerTrainer

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.
Hello,

OK, I will find out what the name of the tool is and I will create a
rules XML and an abbreviations list (not sure about the format here
either, but I hope I can find an example).

Are you interested in hosting the model after I finally succeed?

Thank you very much

Andreas

Am 14.03.2013 13:25, schrieb Jörn Kottmann:
> On 03/14/2013 12:20 PM, Andreas Niekler wrote:
>> So the detokenizer adds the <SPLIT> tag where it is needed?
> 
> Exactly, you need to merge the tokens again which were previously not
> separated
> by a white space. e.g. "SCHWEIZ/Verlauf :" was in the original text
> "AKTIEN SCHWEIZ/Verlauf:"
> and in the training data you encode that as "AKTIEN
> SCHWEIZ/Verlauf<SPLIT>:".
> 
> The detokenizer just figures out which tokens are merged together and
> which are not
> based on some rules. There is a util which can use that information to
> output the tokenizer
> training data, should be integrated into the CLI but its a while since I
> last used it.
> 
> Don't hesitate to ask if you need more help,
> Jörn
> 

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniekler@informatik.uni-leipzig.de

Re: TokenizerTrainer

Posted by Jörn Kottmann <ko...@gmail.com>.
On 03/14/2013 12:20 PM, Andreas Niekler wrote:
> So the detokenizer adds the <SPLIT> tag where it is needed?

Exactly, you need to merge the tokens again which were previously not 
separated
by a white space. e.g. "SCHWEIZ/Verlauf :" was in the original text 
"AKTIEN SCHWEIZ/Verlauf:"
and in the training data you encode that as "AKTIEN 
SCHWEIZ/Verlauf<SPLIT>:".

The detokenizer just figures out which tokens are merged together and
which are not,
based on some rules. There is a util which can use that information to
output the tokenizer
training data; it should be integrated into the CLI, but it's a while since I
last used it.
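
A rough sketch of that kind of util, assuming the dictionary-based detokenizer
API in opennlp-tools (the rule file name is hypothetical; a German rule
dictionary would have to be written first):

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.tokenize.DetokenizationDictionary;
    import opennlp.tools.tokenize.Detokenizer;
    import opennlp.tools.tokenize.DictionaryDetokenizer;

    public class DetokenizeToTrainingData {
        public static void main(String[] args) throws Exception {
            // Rule dictionary telling the detokenizer how punctuation attaches
            // to neighbouring tokens (hypothetical German rule file)
            try (InputStream in = new FileInputStream("de-detokenizer.xml")) {
                Detokenizer detokenizer =
                        new DictionaryDetokenizer(new DetokenizationDictionary(in));

                // One whitespace-tokenized sentence, as in the corpus quoted above
                String[] tokens =
                        "Jetzt einloggen SchwarzKater ( vor 4 Stunden ) WTF ?".split(" ");

                // Tokens that get merged back together are joined with <SPLIT>,
                // which is exactly the tokenizer training format
                String trainingLine = detokenizer.detokenize(tokens, "<SPLIT>");
                System.out.println(trainingLine);
            }
        }
    }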

Don't hesitate to ask if you need more help,
Jörn


Re: TokenizerTrainer

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.
Hello,

So the detokenizer adds the <SPLIT> tag where it is needed?

Am 14.03.2013 12:08, schrieb Jörn Kottmann:
> Can you give us more details about your training data? Is it white space
> tokenized?

Yes, all the tokens are separated by whitespace.

Example:
Börsen-Ticker RSS › News AKTIEN SCHWEIZ/Verlauf : Leicht fester -
Gesuchte Finanz- und Pharmawerte 18.10.2010 13:00 Zürich ( awp ) - Die
Schweizer Börse zeigt sich nach einem Start im Minus zur Mittagszeit
leicht fester .
Aufruhr bei Bayern-Gegner AS Rom - Kritik an Coach Unter Druck(Foto :
dpa ) Rom ( dpa ) - Bayern Münchens Champions-League-Gegner AS Rom ist
in Aufruhr .
Weitere Nachrichten Piper Jaffray & Co . stuft Baidu Sp ADR-A auf
overweight Minneapolis ( aktiencheck.de AG ) - Gene Munster , Analyst
von Piper Jaffray , stuft die Aktie von BAIDU.COM ( ISIN US0567521085 /
WKN A0F5DE ) von " neutral " auf " overweight " hoch .
Wohnort : erfurt Verfasst am : 25.09.2010 , 02:59 Titel : Datum des
PageRank Nutzungsrechtest von Google Wer weiss , wann genau das
nutzungsrecht nächstes jahr ausläuft für die kostenfreie nutzung für
google ?
" Die deutsche Automobilindustrie fährt schneller aus der Krise als
erwartet " , sagte VDA-Präsident Matthias Wissmann in Berlin .
Senden Pfleiderer verkaufen Düsseldorf ( aktiencheck.de AG ) - Der
Analyst vom Bankhaus Lampe , Marc Gabriel , stuft die Pfleiderer-Aktie (
ISIN DE0006764749 / WKN 676474 ) von " halten " auf " verkaufen " herab .
Der vollständige Zwischenbericht wird am 8 . November 2010 um 12.00 Uhr
veröffentlicht .
Besonders in ländlichen Gegenden sind Telegrafenmaste auch heute noch
weit verbreitet - größtenteils für die Festnetztelefonie .
Newsticker RSS-Feed Morgenweb Sarah Palin als Reality-Star im
US-Fernsehen auf Sendung 15.11.10 4:58 : Washington ( dpa ) - Sarah
Palin hat jetzt eine eigene Show .
Fotos Terrorwarnung - Was man jetzt beachten sollte Die Sicherheitslage
spitzt sich zu .
Newsticker RSS-Feed Morgenweb Tausende Siedler protestieren gegen neuen
Baustopp 21.11.10 11:51 : Jerusalem ( dpa ) - Die israelischen Siedler
haben ihre Proteste gegen einen erwarteten neuen Baustopp im
Westjordanland verschärft .
Jetzt einloggen SchwarzKater ( vor 4 Stunden ) WTF ?
Das Bankhaus hat das Kursziel für die Salzgitter-Aktien von 69,00 auf
58,00 Euro gesenkt , aber die Einstufung auf ´ Overweight ´ belassen .
Bundeskanzlerin Angela Merkel ( CDU ) ist am Dienstag zum Gipfel der
Organisation für Sicherheit und Zusammenarbeit in Europa ( OSZE ) in
Kasachstan eingetroffen .
Mann totgeprügelt : Haftstrafen im « 20-Cent-Prozess » Die beiden
Schläger jugendlichen Schläger sind wegen Körperverletzung mit
Todesfolge zu Haftstrafen verurteilt worden .

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniekler@informatik.uni-leipzig.de

Re: TokenizerTrainer

Posted by Jörn Kottmann <ko...@gmail.com>.
On 03/14/2013 11:44 AM, Andreas Niekler wrote:
> what i don't understand is how this is producing valid training data
> since i just delete whitespaces. You said that i need to include some
> <SPLIT> Tags to have proper training data. Can you please comment on the
> fact why we have proper training data after detokenizing. I hope that
>> it's ok to ask all these questions but i really want to understand
> openNLP tokenisation.

The training data needs to reflect the data you want to process.
In German (like in English) most tokens are separated by white spaces
already, and
punctuation and word tokens might be written together without a
separating white space;
to encode the latter case in the training data we use the <SPLIT> tag.

If you just replace all white spaces with <SPLIT> tags in your white
space tokenized data, the
input data probably no longer matches the training data. To make the
input data match
it again you would need to remove all white spaces from it.

Can you give us more details about your training data? Is it white space 
tokenized?

Jörn




Re: TokenizerTrainer

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.
Thank you,

what I don't understand is how this produces valid training data,
since I just delete whitespace. You said that I need to include some
<SPLIT> tags to have proper training data. Can you please comment on
why we have proper training data after detokenizing? I hope that
it's OK to ask all these questions, but I really want to understand
OpenNLP tokenisation.


Thank you very much. I will create a detokenizer dictionary based on all
relevant special characters contained in my 300k dataset.

Andreas

Am 14.03.2013 11:34, schrieb Jörn Kottmann:
> On 03/14/2013 11:27 AM, Andreas Niekler wrote:
>> Hello,
>>
>>> We probably need to fix the detokenizer rules used for the German models
>>> a bit to handle these cases correctly.
>> Are those rules public somewhere so that i can edit them myself? I can
>> provide them to the community afterwars. Mostly characters like „“ are
>> not recognized by the tokenizer. I don't want to convert them before
>> tokenizing because we analyze things like direct speech and those
>> characters are a good indicator for that.
> 
> No for the German models I wrote some code to do the detokenization
> which supported
> a specific corpus. Anyway, this work then lead me to contribute the
> detokenizer to OpenNLP.
> 
> There is one file for English:
> https://github.com/apache/opennlp/tree/trunk/opennlp-tools/lang/en/tokenizer
> 
> 
> We would be happy to receive a contribution for German.
> Have a look at the documentation, there is a section about the detokenizer.
> 
>>> I suggest to use our detokenizer to turn your tokenized text into
>>> training data.
>> Has the detokenizer a command line tool as well?
>>
> 
> Yes, there is one. Have a look at the CLI help.
> 
> Jörn
> 

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniekler@informatik.uni-leipzig.de

Re: TokenizerTrainer

Posted by Jörn Kottmann <ko...@gmail.com>.
On 03/14/2013 11:27 AM, Andreas Niekler wrote:
> Hello,
>
>> We probably need to fix the detokenizer rules used for the German models
>> a bit to handle these cases correctly.
> Are those rules public somewhere so that i can edit them myself? I can
> provide them to the community afterwards. Mostly characters like „“ are
> not recognized by the tokenizer. I don't want to convert them before
> tokenizing because we analyze things like direct speech and those
> characters are a good indicator for that.

No, for the German models I wrote some code to do the detokenization
which supported
a specific corpus. Anyway, this work then led me to contribute the
detokenizer to OpenNLP.

There is one file for English:
https://github.com/apache/opennlp/tree/trunk/opennlp-tools/lang/en/tokenizer

We would be happy to receive a contribution for German.
Have a look at the documentation, there is a section about the detokenizer.

>> I suggest to use our detokenizer to turn your tokenized text into
>> training data.
> Has the detokenizer a command line tool as well?
>

Yes, there is one. Have a look at the CLI help.

Jörn

Re: TokenizerTrainer

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.
Hello,

> We probably need to fix the detokenizer rules used for the German models
> a bit to handle these cases correctly.

Are those rules public somewhere so that I can edit them myself? I can
provide them to the community afterwards. Mostly, characters like „“ are
not recognized by the tokenizer. I don't want to convert them before
tokenizing because we analyze things like direct speech, and those
characters are a good indicator for that.


> I suggest to use our detokenizer to turn your tokenized text into
> training data.

Does the detokenizer have a command line tool as well?

Thank you all

Andreas

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniekler@informatik.uni-leipzig.de

Re: TokenizerTrainer

Posted by Jörn Kottmann <ko...@gmail.com>.
We probably need to fix the detokenizer rules used for the German models
a bit to handle these cases correctly.

To train the tokenizer you either need proper training data, with white
spaces
and split tags, or you have already tokenized data which you convert
with a
rule-based detokenizer into training data.

The current implementation can't train the tokenizer with only white-space
separated tokens because that does not generate proper training data for the
maxent trainer. Training with only <SPLIT> tags works, but apparently is not
really compatible with our feature generation, which was not designed for
that case.

I suggest to use our detokenizer to turn your tokenized text into 
training data.

Jörn

On 03/14/2013 10:49 AM, Andreas Niekler wrote:
> Hello,
>
>
>> If you want to tokenize based on white spaces I suggest to use our white
>> space tokenizer.
> No. I do not want to tokenize on whitespaces. I found out that the
> de-token.bin model isn't capable of separating things like direct speech
> in texts like Er sagte, dass "die neue. This end with a token "die. So i
> got a clean 300k sentences sample from our german reference corpus which
> is in the form of whitespace separated tokens in one sentence per line.
> I just added this one to the TokenizerTrainer tool and ended up with an
> exception because of only 1 Feature found. So i added all the <SPLIT>
> tags like in the documentation and the training terminated without an
> error. But with the undesired errors. So i surely need a model based
> tokenizer because i also want to tokenize punctuations and so on. The
> only thing i wasn't able to do is the training based on whitespace
> separated sentences.
>
> Thanks for your help
>
> Andreas
>


Re: TokenizerTrainer

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.
Hello,


> If you want to tokenize based on white spaces I suggest to use our white
> space tokenizer.

No, I do not want to tokenize on whitespace. I found out that the
de-token.bin model isn't capable of separating things like direct speech
in texts like 'Er sagte, dass "die neue'; this ends with a token '"die'. So I
got a clean 300k-sentence sample from our German reference corpus, which
is in the form of whitespace-separated tokens, one sentence per line.
I just added this one to the TokenizerTrainer tool and ended up with an
exception because only 1 feature was found. So I added all the <SPLIT>
tags as in the documentation, and the training terminated without an
error, but with the undesired errors. So I surely need a model-based
tokenizer because I also want to tokenize punctuation and so on. The
only thing I wasn't able to do is the training based on
whitespace-separated sentences.

Thanks for your help

Andreas

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniekler@informatik.uni-leipzig.de

Re: TokenizerTrainer

Posted by Jörn Kottmann <ko...@gmail.com>.
On 03/14/2013 09:59 AM, Andreas Niekler wrote:
> Hello,
>
> i just added the <SPLIT> Tag because all (only) whitespace files weren't
> able to be processed by the command line tool. It just found 1 Feature and
> the training ended with an exception like "Unable to create model due
> to" in the first iteration and all the likelihoods are 1.0. I just
> replaced all whitespaces with the split tag as described in the
> documentation.
>

If you want to tokenize based on white spaces I suggest using our white
space tokenizer.

In the training data the <SPLIT> tag is usually only used for whitespace 
separated strings where
more than one token occurs in one string, e.g. "... said: ..." and that 
is in the training data "... said<SPLIT>: ...".
You usually need to manually produce this data, or you use some corpus 
which already contains tokenized
text and use the de-tokenizer to produce the training data.

Another option is to use a Penn Treebank tokenizer; the cTAKES people
are doing that and have one for UIMA,
and they might contribute it to OpenNLP one day.

Jörn


Re: TokenizerTrainer

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.
Hello,

I just added the <SPLIT> tag because the whitespace-only files couldn't
be processed by the command line tool. It just found 1 feature and
the training ended with an exception like "Unable to create model due
to" in the first iteration, and all the likelihoods were 1.0. I just
replaced all whitespaces with the split tag as described in the
documentation.

Andreas

Am 13.03.2013 20:13, schrieb Jörn Kottmann:
> The tokenizer's defaults are for text which is mostly whitespace separated;
> did you lose all your whitespace in the text you want to process?
> 
> Jörn
> 
> On 03/13/2013 04:31 PM, Andreas Niekler wrote:
>> Hello,
>>
>> i give you some examples below this comment. But i already noticed in
>> the code, that the standard tokenizerTrainer call uses the standard
>> alphanumeric pattern which won't work for typical german examples. Most
>> of the data will be separated because of the improper pattern in the
>> standard Factory.java class. My belief is that the de-token.bin model
>> was trained with a proper pattern within another implementation of the
>> training procedure.
>>
>> Here are some training lines:
>>
>> Senden<SPLIT>Pfleiderer<SPLIT>verkaufen<SPLIT>Düsseldorf<SPLIT>(<SPLIT>aktiencheck.de<SPLIT>AG<SPLIT>)<SPLIT>-<SPLIT>Der<SPLIT>Analyst<SPLIT>vom<SPLIT>Bankhaus<SPLIT>Lampe<SPLIT>,<SPLIT>Marc<SPLIT>Gabriel<SPLIT>,<SPLIT>stuft<SPLIT>die<SPLIT>Pfleiderer-Aktie<SPLIT>(<SPLIT>ISIN<SPLIT>DE0006764749<SPLIT>/<SPLIT>WKN<SPLIT>676474<SPLIT>)<SPLIT>von<SPLIT>"<SPLIT>halten<SPLIT>"<SPLIT>auf<SPLIT>"<SPLIT>verkaufen<SPLIT>"<SPLIT>herab<SPLIT>.
>>
>> Der<SPLIT>vollständige<SPLIT>Zwischenbericht<SPLIT>wird<SPLIT>am<SPLIT>8<SPLIT>.<SPLIT>November<SPLIT>2010<SPLIT>um<SPLIT>12.00<SPLIT>Uhr<SPLIT>veröffentlicht<SPLIT>.
>>
>> Besonders<SPLIT>in<SPLIT>ländlichen<SPLIT>Gegenden<SPLIT>sind<SPLIT>Telegrafenmaste<SPLIT>auch<SPLIT>heute<SPLIT>noch<SPLIT>weit<SPLIT>verbreitet<SPLIT>-<SPLIT>größtenteils<SPLIT>für<SPLIT>die<SPLIT>Festnetztelefonie<SPLIT>.
>>
>> Newsticker<SPLIT>RSS-Feed<SPLIT>Morgenweb<SPLIT>Sarah<SPLIT>Palin<SPLIT>als<SPLIT>Reality-Star<SPLIT>im<SPLIT>US-Fernsehen<SPLIT>auf<SPLIT>Sendung<SPLIT>15.11.10<SPLIT>4:58<SPLIT>:<SPLIT>Washington<SPLIT>(<SPLIT>dpa<SPLIT>)<SPLIT>-<SPLIT>Sarah<SPLIT>Palin<SPLIT>hat<SPLIT>jetzt<SPLIT>eine<SPLIT>eigene<SPLIT>Show<SPLIT>.
>>
>> Fotos<SPLIT>Terrorwarnung<SPLIT>-<SPLIT>Was<SPLIT>man<SPLIT>jetzt<SPLIT>beachten<SPLIT>sollte<SPLIT>Die<SPLIT>Sicherheitslage<SPLIT>spitzt<SPLIT>sich<SPLIT>zu<SPLIT>.
>>
>> Newsticker<SPLIT>RSS-Feed<SPLIT>Morgenweb<SPLIT>Tausende<SPLIT>Siedler<SPLIT>protestieren<SPLIT>gegen<SPLIT>neuen<SPLIT>Baustopp<SPLIT>21.11.10<SPLIT>11:51<SPLIT>:<SPLIT>Jerusalem<SPLIT>(<SPLIT>dpa<SPLIT>)<SPLIT>-<SPLIT>Die<SPLIT>israelischen<SPLIT>Siedler<SPLIT>haben<SPLIT>ihre<SPLIT>Proteste<SPLIT>gegen<SPLIT>einen<SPLIT>erwarteten<SPLIT>neuen<SPLIT>Baustopp<SPLIT>im<SPLIT>Westjordanland<SPLIT>verschärft<SPLIT>.
>>
>> Jetzt<SPLIT>einloggen<SPLIT>SchwarzKater<SPLIT>(<SPLIT>vor<SPLIT>4<SPLIT>Stunden<SPLIT>)<SPLIT>WTF<SPLIT>?
>>
>> Das<SPLIT>Bankhaus<SPLIT>hat<SPLIT>das<SPLIT>Kursziel<SPLIT>für<SPLIT>die<SPLIT>Salzgitter-Aktien<SPLIT>von<SPLIT>69,00<SPLIT>auf<SPLIT>58,00<SPLIT>Euro<SPLIT>gesenkt<SPLIT>,<SPLIT>aber<SPLIT>die<SPLIT>Einstufung<SPLIT>auf<SPLIT>´<SPLIT>Overweight<SPLIT>´<SPLIT>belassen<SPLIT>.
>>
>> Bundeskanzlerin<SPLIT>Angela<SPLIT>Merkel<SPLIT>(<SPLIT>CDU<SPLIT>)<SPLIT>ist<SPLIT>am<SPLIT>Dienstag<SPLIT>zum<SPLIT>Gipfel<SPLIT>der<SPLIT>Organisation<SPLIT>für<SPLIT>Sicherheit<SPLIT>und<SPLIT>Zusammenarbeit<SPLIT>in<SPLIT>Europa<SPLIT>(<SPLIT>OSZE<SPLIT>)<SPLIT>in<SPLIT>Kasachstan<SPLIT>eingetroffen<SPLIT>.
>>
>> Mann<SPLIT>totgeprügelt<SPLIT>:<SPLIT>Haftstrafen<SPLIT>im<SPLIT>«<SPLIT>20-Cent-Prozess<SPLIT>»<SPLIT>Die<SPLIT>beiden<SPLIT>Schläger<SPLIT>jugendlichen<SPLIT>Schläger<SPLIT>sind<SPLIT>wegen<SPLIT>Körperverletzung<SPLIT>mit<SPLIT>Todesfolge<SPLIT>zu<SPLIT>Haftstrafen<SPLIT>verurteilt<SPLIT>worden<SPLIT>.
>>
>> Börsen-Ticker<SPLIT>RSS<SPLIT>›<SPLIT>News<SPLIT>AKTIEN<SPLIT>SCHWEIZ/Vorbörse<SPLIT>:<SPLIT>Leicht<SPLIT>höhere<SPLIT>Eröffnung<SPLIT>erwartet<SPLIT>-<SPLIT>Positive<SPLIT>US-Vorgaben<SPLIT>06.12.2010<SPLIT>08:45<SPLIT>Zürich<SPLIT>(<SPLIT>awp<SPLIT>)<SPLIT>-<SPLIT>Der<SPLIT>Schweizer<SPLIT>Aktienmarkt<SPLIT>dürfte<SPLIT>die<SPLIT>Sitzung<SPLIT>vom<SPLIT>Montag<SPLIT>mit<SPLIT>moderaten<SPLIT>Gewinnen<SPLIT>eröffnen<SPLIT>.
>>
>> Werbung<SPLIT>'<SPLIT>)<SPLIT>;<SPLIT>AIG<SPLIT>hatte<SPLIT>sich<SPLIT>auf<SPLIT>dem<SPLIT>US-Häusermarkt<SPLIT>verspekuliert<SPLIT>.
>>
>> Außerordentliche<SPLIT>Hauptversammlung<SPLIT>genehmigt<SPLIT>Aktiensplit<SPLIT>und<SPLIT>Vorratsbeschlüsse<SPLIT>für<SPLIT>Kapitalmaßnahmen<SPLIT>Ad-hoc-Mitteilung<SPLIT>übermittelt<SPLIT>durch<SPLIT>euro<SPLIT>adhoc<SPLIT>mit<SPLIT>dem<SPLIT>Ziel<SPLIT>einer<SPLIT>europaweiten<SPLIT>Verbreitung<SPLIT>.
>>
>> Börsen-Ticker<SPLIT>RSS<SPLIT>›<SPLIT>News<SPLIT>Führungscrew<SPLIT>übernimmt<SPLIT>bei<SPLIT>Modekette<SPLIT>Schild<SPLIT>Stefan<SPLIT>Portmann<SPLIT>und<SPLIT>Thomas<SPLIT>Herbert<SPLIT>heissen<SPLIT>die<SPLIT>neuen<SPLIT>starken<SPLIT>Männer<SPLIT>bei<SPLIT>Schild<SPLIT>.
>>
>> IrfanView<SPLIT>Lizenz<SPLIT>:<SPLIT>Freeware<SPLIT>—<SPLIT>Hersteller-Website<SPLIT>IrfanView<SPLIT>ist<SPLIT>ein<SPLIT>für<SPLIT>private<SPLIT>Zwecke<SPLIT>kostenloses<SPLIT>Bildbetrachtungs-<SPLIT>und<SPLIT>Bildbearbeitungsprogramm<SPLIT>,<SPLIT>das<SPLIT>für<SPLIT>kleinere<SPLIT>Belange<SPLIT>durchaus<SPLIT>ausreicht<SPLIT>.
>>
>> Dragonica<SPLIT>So<SPLIT>testet<SPLIT>4Players<SPLIT>Bitte<SPLIT>einloggen<SPLIT>,<SPLIT>um<SPLIT>Spiel<SPLIT>in<SPLIT>die<SPLIT>Watchlist<SPLIT>aufzunehmen<SPLIT>.
>>
>> Rating-Update<SPLIT>:<SPLIT>London<SPLIT>(<SPLIT>aktiencheck.de<SPLIT>AG<SPLIT>)<SPLIT>-<SPLIT>Robert<SPLIT>T<SPLIT>.<SPLIT>Cornell<SPLIT>,<SPLIT>Scott<SPLIT>L<SPLIT>.<SPLIT>Gaffner<SPLIT>und<SPLIT>Darren<SPLIT>Yip<SPLIT>,<SPLIT>Analysten<SPLIT>von<SPLIT>Barclays<SPLIT>Capital<SPLIT>,<SPLIT>stufen<SPLIT>die<SPLIT>Aktie<SPLIT>von<SPLIT>ITT<SPLIT>Industries<SPLIT>(<SPLIT>ISIN<SPLIT>US4509111021<SPLIT>/<SPLIT>WKN<SPLIT>860023<SPLIT>)<SPLIT>weiterhin<SPLIT>mit<SPLIT>dem<SPLIT>Rating<SPLIT>"<SPLIT>equal-weight<SPLIT>"<SPLIT>ein<SPLIT>.
>>
>> Sollten<SPLIT>der<SPLIT>Branche<SPLIT>durch<SPLIT>"<SPLIT>populäre<SPLIT>Preiskürzungen<SPLIT>"<SPLIT>der<SPLIT>Regulierungsbehörden<SPLIT>weiter<SPLIT>Milliarden<SPLIT>entzogen<SPLIT>werden<SPLIT>,<SPLIT>sei<SPLIT>kaum<SPLIT>vorstellbar<SPLIT>,<SPLIT>wie<SPLIT>ein<SPLIT>flächendeckender<SPLIT>Breitbandausbau<SPLIT>noch<SPLIT>finanziert<SPLIT>werden<SPLIT>könnte<SPLIT>,<SPLIT>kritisierte<SPLIT>Obermann<SPLIT>.
>>
>> Grichting/Von<SPLIT>Bergen<SPLIT>bewährten<SPLIT>sich<SPLIT>abstimmen<SPLIT>Online/Print<SPLIT>Täglich<SPLIT>stellt<SPLIT>der<SPLIT>BLICK<SPLIT>eine<SPLIT>Frage<SPLIT>des<SPLIT>Tages<SPLIT>.
>>
>> Best<SPLIT>.<SPLIT>For<SPLIT>additional<SPLIT>information<SPLIT>,<SPLIT>please<SPLIT>visit<SPLIT>www.asih.bm<SPLIT>.<SPLIT>0<SPLIT>Bewertungen<SPLIT>dieses<SPLIT>Artikels<SPLIT>:<SPLIT>Kommentare<SPLIT>zu<SPLIT>diesem<SPLIT>Artikel<SPLIT>Geben<SPLIT>Sie<SPLIT>jetzt<SPLIT>einen<SPLIT>Kommentar<SPLIT>zu<SPLIT>diesem<SPLIT>Artikel<SPLIT>ab<SPLIT>.
>>
>>
>> Am 13.03.2013 15:52, schrieb Jörn Kottmann:
>>> Hello,
>>>
>>> Can you tell us a bit more about your training data? Did you manually
>>> annotate these 300k sentences?
>>> Is it possible to post 10 lines or so here?
>>>
>>> Jörn
>>>
>>> On 03/12/2013 03:22 PM, Andreas Niekler wrote:
>>>> Dear List,
>>>>
>>>> i created a Tokenizer Model with 300k german Sentences from a very
>>>> clean
>>>> corpus. I see some words that are very strangly separated by a
>>>> tokenizer
>>>> using this model like:
>>>>
>>>> stehenge - blieben
>>>> fre - undlicher
>>>>
>>>> and so on. I cant find those in my training data and wonder why openNLP
>>>> splits those words without any evidence in the training data and wihout
>>>> any whitespace in my text files. I trained the model with 500
>>>> Iterations, cutoff 5 and alphanumeric optimisation.
>>>>
>>>> Can anyone state some ideas how i can prevent this?
>>>>
>>>> thank you
>>>>
>>>> Andreas
> 

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniekler@informatik.uni-leipzig.deg.de

Re: TokenizerTrainer

Posted by Jörn Kottmann <ko...@gmail.com>.
The tokenizer's defaults are for text that is mostly whitespace
separated. Did you lose all the whitespace in the text you want to
process?

Jörn

On 03/13/2013 04:31 PM, Andreas Niekler wrote:
> Hello,
>
> I give you some examples below this comment. But I already noticed in
> the code that the standard TokenizerTrainer call uses the default
> alphanumeric pattern, which won't work for typical German examples.
> Most of the data will be separated because of the improper pattern in
> the standard Factory.java class. My belief is that the de-token.bin
> model was trained with a proper pattern in another implementation of
> the training procedure.
>
> Here are some training lines:
>
> Senden<SPLIT>Pfleiderer<SPLIT>verkaufen<SPLIT>Düsseldorf<SPLIT>(<SPLIT>aktiencheck.de<SPLIT>AG<SPLIT>)<SPLIT>-<SPLIT>Der<SPLIT>Analyst<SPLIT>vom<SPLIT>Bankhaus<SPLIT>Lampe<SPLIT>,<SPLIT>Marc<SPLIT>Gabriel<SPLIT>,<SPLIT>stuft<SPLIT>die<SPLIT>Pfleiderer-Aktie<SPLIT>(<SPLIT>ISIN<SPLIT>DE0006764749<SPLIT>/<SPLIT>WKN<SPLIT>676474<SPLIT>)<SPLIT>von<SPLIT>"<SPLIT>halten<SPLIT>"<SPLIT>auf<SPLIT>"<SPLIT>verkaufen<SPLIT>"<SPLIT>herab<SPLIT>.
> Der<SPLIT>vollständige<SPLIT>Zwischenbericht<SPLIT>wird<SPLIT>am<SPLIT>8<SPLIT>.<SPLIT>November<SPLIT>2010<SPLIT>um<SPLIT>12.00<SPLIT>Uhr<SPLIT>veröffentlicht<SPLIT>.
> Besonders<SPLIT>in<SPLIT>ländlichen<SPLIT>Gegenden<SPLIT>sind<SPLIT>Telegrafenmaste<SPLIT>auch<SPLIT>heute<SPLIT>noch<SPLIT>weit<SPLIT>verbreitet<SPLIT>-<SPLIT>größtenteils<SPLIT>für<SPLIT>die<SPLIT>Festnetztelefonie<SPLIT>.
> Newsticker<SPLIT>RSS-Feed<SPLIT>Morgenweb<SPLIT>Sarah<SPLIT>Palin<SPLIT>als<SPLIT>Reality-Star<SPLIT>im<SPLIT>US-Fernsehen<SPLIT>auf<SPLIT>Sendung<SPLIT>15.11.10<SPLIT>4:58<SPLIT>:<SPLIT>Washington<SPLIT>(<SPLIT>dpa<SPLIT>)<SPLIT>-<SPLIT>Sarah<SPLIT>Palin<SPLIT>hat<SPLIT>jetzt<SPLIT>eine<SPLIT>eigene<SPLIT>Show<SPLIT>.
> Fotos<SPLIT>Terrorwarnung<SPLIT>-<SPLIT>Was<SPLIT>man<SPLIT>jetzt<SPLIT>beachten<SPLIT>sollte<SPLIT>Die<SPLIT>Sicherheitslage<SPLIT>spitzt<SPLIT>sich<SPLIT>zu<SPLIT>.
> Newsticker<SPLIT>RSS-Feed<SPLIT>Morgenweb<SPLIT>Tausende<SPLIT>Siedler<SPLIT>protestieren<SPLIT>gegen<SPLIT>neuen<SPLIT>Baustopp<SPLIT>21.11.10<SPLIT>11:51<SPLIT>:<SPLIT>Jerusalem<SPLIT>(<SPLIT>dpa<SPLIT>)<SPLIT>-<SPLIT>Die<SPLIT>israelischen<SPLIT>Siedler<SPLIT>haben<SPLIT>ihre<SPLIT>Proteste<SPLIT>gegen<SPLIT>einen<SPLIT>erwarteten<SPLIT>neuen<SPLIT>Baustopp<SPLIT>im<SPLIT>Westjordanland<SPLIT>verschärft<SPLIT>.
> Jetzt<SPLIT>einloggen<SPLIT>SchwarzKater<SPLIT>(<SPLIT>vor<SPLIT>4<SPLIT>Stunden<SPLIT>)<SPLIT>WTF<SPLIT>?
> Das<SPLIT>Bankhaus<SPLIT>hat<SPLIT>das<SPLIT>Kursziel<SPLIT>für<SPLIT>die<SPLIT>Salzgitter-Aktien<SPLIT>von<SPLIT>69,00<SPLIT>auf<SPLIT>58,00<SPLIT>Euro<SPLIT>gesenkt<SPLIT>,<SPLIT>aber<SPLIT>die<SPLIT>Einstufung<SPLIT>auf<SPLIT>´<SPLIT>Overweight<SPLIT>´<SPLIT>belassen<SPLIT>.
> Bundeskanzlerin<SPLIT>Angela<SPLIT>Merkel<SPLIT>(<SPLIT>CDU<SPLIT>)<SPLIT>ist<SPLIT>am<SPLIT>Dienstag<SPLIT>zum<SPLIT>Gipfel<SPLIT>der<SPLIT>Organisation<SPLIT>für<SPLIT>Sicherheit<SPLIT>und<SPLIT>Zusammenarbeit<SPLIT>in<SPLIT>Europa<SPLIT>(<SPLIT>OSZE<SPLIT>)<SPLIT>in<SPLIT>Kasachstan<SPLIT>eingetroffen<SPLIT>.
> Mann<SPLIT>totgeprügelt<SPLIT>:<SPLIT>Haftstrafen<SPLIT>im<SPLIT>«<SPLIT>20-Cent-Prozess<SPLIT>»<SPLIT>Die<SPLIT>beiden<SPLIT>Schläger<SPLIT>jugendlichen<SPLIT>Schläger<SPLIT>sind<SPLIT>wegen<SPLIT>Körperverletzung<SPLIT>mit<SPLIT>Todesfolge<SPLIT>zu<SPLIT>Haftstrafen<SPLIT>verurteilt<SPLIT>worden<SPLIT>.
> Börsen-Ticker<SPLIT>RSS<SPLIT>›<SPLIT>News<SPLIT>AKTIEN<SPLIT>SCHWEIZ/Vorbörse<SPLIT>:<SPLIT>Leicht<SPLIT>höhere<SPLIT>Eröffnung<SPLIT>erwartet<SPLIT>-<SPLIT>Positive<SPLIT>US-Vorgaben<SPLIT>06.12.2010<SPLIT>08:45<SPLIT>Zürich<SPLIT>(<SPLIT>awp<SPLIT>)<SPLIT>-<SPLIT>Der<SPLIT>Schweizer<SPLIT>Aktienmarkt<SPLIT>dürfte<SPLIT>die<SPLIT>Sitzung<SPLIT>vom<SPLIT>Montag<SPLIT>mit<SPLIT>moderaten<SPLIT>Gewinnen<SPLIT>eröffnen<SPLIT>.
> Werbung<SPLIT>'<SPLIT>)<SPLIT>;<SPLIT>AIG<SPLIT>hatte<SPLIT>sich<SPLIT>auf<SPLIT>dem<SPLIT>US-Häusermarkt<SPLIT>verspekuliert<SPLIT>.
> Außerordentliche<SPLIT>Hauptversammlung<SPLIT>genehmigt<SPLIT>Aktiensplit<SPLIT>und<SPLIT>Vorratsbeschlüsse<SPLIT>für<SPLIT>Kapitalmaßnahmen<SPLIT>Ad-hoc-Mitteilung<SPLIT>übermittelt<SPLIT>durch<SPLIT>euro<SPLIT>adhoc<SPLIT>mit<SPLIT>dem<SPLIT>Ziel<SPLIT>einer<SPLIT>europaweiten<SPLIT>Verbreitung<SPLIT>.
> Börsen-Ticker<SPLIT>RSS<SPLIT>›<SPLIT>News<SPLIT>Führungscrew<SPLIT>übernimmt<SPLIT>bei<SPLIT>Modekette<SPLIT>Schild<SPLIT>Stefan<SPLIT>Portmann<SPLIT>und<SPLIT>Thomas<SPLIT>Herbert<SPLIT>heissen<SPLIT>die<SPLIT>neuen<SPLIT>starken<SPLIT>Männer<SPLIT>bei<SPLIT>Schild<SPLIT>.
> IrfanView<SPLIT>Lizenz<SPLIT>:<SPLIT>Freeware<SPLIT>—<SPLIT>Hersteller-Website<SPLIT>IrfanView<SPLIT>ist<SPLIT>ein<SPLIT>für<SPLIT>private<SPLIT>Zwecke<SPLIT>kostenloses<SPLIT>Bildbetrachtungs-<SPLIT>und<SPLIT>Bildbearbeitungsprogramm<SPLIT>,<SPLIT>das<SPLIT>für<SPLIT>kleinere<SPLIT>Belange<SPLIT>durchaus<SPLIT>ausreicht<SPLIT>.
> Dragonica<SPLIT>So<SPLIT>testet<SPLIT>4Players<SPLIT>Bitte<SPLIT>einloggen<SPLIT>,<SPLIT>um<SPLIT>Spiel<SPLIT>in<SPLIT>die<SPLIT>Watchlist<SPLIT>aufzunehmen<SPLIT>.
> Rating-Update<SPLIT>:<SPLIT>London<SPLIT>(<SPLIT>aktiencheck.de<SPLIT>AG<SPLIT>)<SPLIT>-<SPLIT>Robert<SPLIT>T<SPLIT>.<SPLIT>Cornell<SPLIT>,<SPLIT>Scott<SPLIT>L<SPLIT>.<SPLIT>Gaffner<SPLIT>und<SPLIT>Darren<SPLIT>Yip<SPLIT>,<SPLIT>Analysten<SPLIT>von<SPLIT>Barclays<SPLIT>Capital<SPLIT>,<SPLIT>stufen<SPLIT>die<SPLIT>Aktie<SPLIT>von<SPLIT>ITT<SPLIT>Industries<SPLIT>(<SPLIT>ISIN<SPLIT>US4509111021<SPLIT>/<SPLIT>WKN<SPLIT>860023<SPLIT>)<SPLIT>weiterhin<SPLIT>mit<SPLIT>dem<SPLIT>Rating<SPLIT>"<SPLIT>equal-weight<SPLIT>"<SPLIT>ein<SPLIT>.
> Sollten<SPLIT>der<SPLIT>Branche<SPLIT>durch<SPLIT>"<SPLIT>populäre<SPLIT>Preiskürzungen<SPLIT>"<SPLIT>der<SPLIT>Regulierungsbehörden<SPLIT>weiter<SPLIT>Milliarden<SPLIT>entzogen<SPLIT>werden<SPLIT>,<SPLIT>sei<SPLIT>kaum<SPLIT>vorstellbar<SPLIT>,<SPLIT>wie<SPLIT>ein<SPLIT>flächendeckender<SPLIT>Breitbandausbau<SPLIT>noch<SPLIT>finanziert<SPLIT>werden<SPLIT>könnte<SPLIT>,<SPLIT>kritisierte<SPLIT>Obermann<SPLIT>.
> Grichting/Von<SPLIT>Bergen<SPLIT>bewährten<SPLIT>sich<SPLIT>abstimmen<SPLIT>Online/Print<SPLIT>Täglich<SPLIT>stellt<SPLIT>der<SPLIT>BLICK<SPLIT>eine<SPLIT>Frage<SPLIT>des<SPLIT>Tages<SPLIT>.
> Best<SPLIT>.<SPLIT>For<SPLIT>additional<SPLIT>information<SPLIT>,<SPLIT>please<SPLIT>visit<SPLIT>www.asih.bm<SPLIT>.<SPLIT>0<SPLIT>Bewertungen<SPLIT>dieses<SPLIT>Artikels<SPLIT>:<SPLIT>Kommentare<SPLIT>zu<SPLIT>diesem<SPLIT>Artikel<SPLIT>Geben<SPLIT>Sie<SPLIT>jetzt<SPLIT>einen<SPLIT>Kommentar<SPLIT>zu<SPLIT>diesem<SPLIT>Artikel<SPLIT>ab<SPLIT>.
>
> Am 13.03.2013 15:52, schrieb Jörn Kottmann:
>> Hello,
>>
>> Can you tell us a bit more about your training data? Did you manually
>> annotate these 300k sentences?
>> Is it possible to post 10 lines or so here?
>>
>> Jörn
>>
>> On 03/12/2013 03:22 PM, Andreas Niekler wrote:
>>> Dear List,
>>>
>>> i created a Tokenizer Model with 300k german Sentences from a very clean
>>> corpus. I see some words that are very strangly separated by a tokenizer
>>> using this model like:
>>>
>>> stehenge - blieben
>>> fre - undlicher
>>>
>>> and so on. I cant find those in my training data and wonder why openNLP
>>> splits those words without any evidence in the training data and wihout
>>> any whitespace in my text files. I trained the model with 500
>>> Iterations, cutoff 5 and alphanumeric optimisation.
>>>
>>> Can anyone state some ideas how i can prevent this?
>>>
>>> thank you
>>>
>>> Andreas


Re: TokenizerTrainer

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.
Hello,

I give you some examples below this comment. But I already noticed in
the code that the standard TokenizerTrainer call uses the default
alphanumeric pattern, which won't work for typical German examples.
Most of the data will be separated because of the improper pattern in
the standard Factory.java class. My belief is that the de-token.bin
model was trained with a proper pattern in another implementation of
the training procedure.
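
For what it's worth, a rough sketch of such a training call with a
custom pattern, assuming OpenNLP 1.5.3's TokenizerFactory API; the
umlaut-aware pattern and file names here are only guesses for
illustration, not the pattern actually used for de-token.bin:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class GermanTokenizerTrainer {

    public static void main(String[] args) throws Exception {
        // Guessed pattern: extend the usual alphanumeric character class
        // with German umlauts and sharp s, so tokens like "größtenteils"
        // fall under the alphanumeric optimisation.
        Pattern germanAlphaNumeric = Pattern.compile("^[A-Za-z0-9äöüÄÖÜß]+$");

        ObjectStream<String> lines = new PlainTextByLineStream(
            new InputStreamReader(new FileInputStream("de-token.train"),
                StandardCharsets.UTF_8));
        ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

        // "de" language code, no abbreviation dictionary, alphanumeric
        // optimisation enabled, custom pattern instead of the default one.
        TokenizerFactory factory =
            new TokenizerFactory("de", null, true, germanAlphaNumeric);

        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.ITERATIONS_PARAM, "100");
        params.put(TrainingParameters.CUTOFF_PARAM, "5");

        TokenizerModel model = TokenizerME.train(samples, factory, params);
        samples.close();

        try (FileOutputStream out = new FileOutputStream("de-token.bin")) {
            model.serialize(out);
        }
    }
}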

Here are some training lines:

Senden<SPLIT>Pfleiderer<SPLIT>verkaufen<SPLIT>Düsseldorf<SPLIT>(<SPLIT>aktiencheck.de<SPLIT>AG<SPLIT>)<SPLIT>-<SPLIT>Der<SPLIT>Analyst<SPLIT>vom<SPLIT>Bankhaus<SPLIT>Lampe<SPLIT>,<SPLIT>Marc<SPLIT>Gabriel<SPLIT>,<SPLIT>stuft<SPLIT>die<SPLIT>Pfleiderer-Aktie<SPLIT>(<SPLIT>ISIN<SPLIT>DE0006764749<SPLIT>/<SPLIT>WKN<SPLIT>676474<SPLIT>)<SPLIT>von<SPLIT>"<SPLIT>halten<SPLIT>"<SPLIT>auf<SPLIT>"<SPLIT>verkaufen<SPLIT>"<SPLIT>herab<SPLIT>.
Der<SPLIT>vollständige<SPLIT>Zwischenbericht<SPLIT>wird<SPLIT>am<SPLIT>8<SPLIT>.<SPLIT>November<SPLIT>2010<SPLIT>um<SPLIT>12.00<SPLIT>Uhr<SPLIT>veröffentlicht<SPLIT>.
Besonders<SPLIT>in<SPLIT>ländlichen<SPLIT>Gegenden<SPLIT>sind<SPLIT>Telegrafenmaste<SPLIT>auch<SPLIT>heute<SPLIT>noch<SPLIT>weit<SPLIT>verbreitet<SPLIT>-<SPLIT>größtenteils<SPLIT>für<SPLIT>die<SPLIT>Festnetztelefonie<SPLIT>.
Newsticker<SPLIT>RSS-Feed<SPLIT>Morgenweb<SPLIT>Sarah<SPLIT>Palin<SPLIT>als<SPLIT>Reality-Star<SPLIT>im<SPLIT>US-Fernsehen<SPLIT>auf<SPLIT>Sendung<SPLIT>15.11.10<SPLIT>4:58<SPLIT>:<SPLIT>Washington<SPLIT>(<SPLIT>dpa<SPLIT>)<SPLIT>-<SPLIT>Sarah<SPLIT>Palin<SPLIT>hat<SPLIT>jetzt<SPLIT>eine<SPLIT>eigene<SPLIT>Show<SPLIT>.
Fotos<SPLIT>Terrorwarnung<SPLIT>-<SPLIT>Was<SPLIT>man<SPLIT>jetzt<SPLIT>beachten<SPLIT>sollte<SPLIT>Die<SPLIT>Sicherheitslage<SPLIT>spitzt<SPLIT>sich<SPLIT>zu<SPLIT>.
Newsticker<SPLIT>RSS-Feed<SPLIT>Morgenweb<SPLIT>Tausende<SPLIT>Siedler<SPLIT>protestieren<SPLIT>gegen<SPLIT>neuen<SPLIT>Baustopp<SPLIT>21.11.10<SPLIT>11:51<SPLIT>:<SPLIT>Jerusalem<SPLIT>(<SPLIT>dpa<SPLIT>)<SPLIT>-<SPLIT>Die<SPLIT>israelischen<SPLIT>Siedler<SPLIT>haben<SPLIT>ihre<SPLIT>Proteste<SPLIT>gegen<SPLIT>einen<SPLIT>erwarteten<SPLIT>neuen<SPLIT>Baustopp<SPLIT>im<SPLIT>Westjordanland<SPLIT>verschärft<SPLIT>.
Jetzt<SPLIT>einloggen<SPLIT>SchwarzKater<SPLIT>(<SPLIT>vor<SPLIT>4<SPLIT>Stunden<SPLIT>)<SPLIT>WTF<SPLIT>?
Das<SPLIT>Bankhaus<SPLIT>hat<SPLIT>das<SPLIT>Kursziel<SPLIT>für<SPLIT>die<SPLIT>Salzgitter-Aktien<SPLIT>von<SPLIT>69,00<SPLIT>auf<SPLIT>58,00<SPLIT>Euro<SPLIT>gesenkt<SPLIT>,<SPLIT>aber<SPLIT>die<SPLIT>Einstufung<SPLIT>auf<SPLIT>´<SPLIT>Overweight<SPLIT>´<SPLIT>belassen<SPLIT>.
Bundeskanzlerin<SPLIT>Angela<SPLIT>Merkel<SPLIT>(<SPLIT>CDU<SPLIT>)<SPLIT>ist<SPLIT>am<SPLIT>Dienstag<SPLIT>zum<SPLIT>Gipfel<SPLIT>der<SPLIT>Organisation<SPLIT>für<SPLIT>Sicherheit<SPLIT>und<SPLIT>Zusammenarbeit<SPLIT>in<SPLIT>Europa<SPLIT>(<SPLIT>OSZE<SPLIT>)<SPLIT>in<SPLIT>Kasachstan<SPLIT>eingetroffen<SPLIT>.
Mann<SPLIT>totgeprügelt<SPLIT>:<SPLIT>Haftstrafen<SPLIT>im<SPLIT>«<SPLIT>20-Cent-Prozess<SPLIT>»<SPLIT>Die<SPLIT>beiden<SPLIT>Schläger<SPLIT>jugendlichen<SPLIT>Schläger<SPLIT>sind<SPLIT>wegen<SPLIT>Körperverletzung<SPLIT>mit<SPLIT>Todesfolge<SPLIT>zu<SPLIT>Haftstrafen<SPLIT>verurteilt<SPLIT>worden<SPLIT>.
Börsen-Ticker<SPLIT>RSS<SPLIT>›<SPLIT>News<SPLIT>AKTIEN<SPLIT>SCHWEIZ/Vorbörse<SPLIT>:<SPLIT>Leicht<SPLIT>höhere<SPLIT>Eröffnung<SPLIT>erwartet<SPLIT>-<SPLIT>Positive<SPLIT>US-Vorgaben<SPLIT>06.12.2010<SPLIT>08:45<SPLIT>Zürich<SPLIT>(<SPLIT>awp<SPLIT>)<SPLIT>-<SPLIT>Der<SPLIT>Schweizer<SPLIT>Aktienmarkt<SPLIT>dürfte<SPLIT>die<SPLIT>Sitzung<SPLIT>vom<SPLIT>Montag<SPLIT>mit<SPLIT>moderaten<SPLIT>Gewinnen<SPLIT>eröffnen<SPLIT>.
Werbung<SPLIT>'<SPLIT>)<SPLIT>;<SPLIT>AIG<SPLIT>hatte<SPLIT>sich<SPLIT>auf<SPLIT>dem<SPLIT>US-Häusermarkt<SPLIT>verspekuliert<SPLIT>.
Außerordentliche<SPLIT>Hauptversammlung<SPLIT>genehmigt<SPLIT>Aktiensplit<SPLIT>und<SPLIT>Vorratsbeschlüsse<SPLIT>für<SPLIT>Kapitalmaßnahmen<SPLIT>Ad-hoc-Mitteilung<SPLIT>übermittelt<SPLIT>durch<SPLIT>euro<SPLIT>adhoc<SPLIT>mit<SPLIT>dem<SPLIT>Ziel<SPLIT>einer<SPLIT>europaweiten<SPLIT>Verbreitung<SPLIT>.
Börsen-Ticker<SPLIT>RSS<SPLIT>›<SPLIT>News<SPLIT>Führungscrew<SPLIT>übernimmt<SPLIT>bei<SPLIT>Modekette<SPLIT>Schild<SPLIT>Stefan<SPLIT>Portmann<SPLIT>und<SPLIT>Thomas<SPLIT>Herbert<SPLIT>heissen<SPLIT>die<SPLIT>neuen<SPLIT>starken<SPLIT>Männer<SPLIT>bei<SPLIT>Schild<SPLIT>.
IrfanView<SPLIT>Lizenz<SPLIT>:<SPLIT>Freeware<SPLIT>—<SPLIT>Hersteller-Website<SPLIT>IrfanView<SPLIT>ist<SPLIT>ein<SPLIT>für<SPLIT>private<SPLIT>Zwecke<SPLIT>kostenloses<SPLIT>Bildbetrachtungs-<SPLIT>und<SPLIT>Bildbearbeitungsprogramm<SPLIT>,<SPLIT>das<SPLIT>für<SPLIT>kleinere<SPLIT>Belange<SPLIT>durchaus<SPLIT>ausreicht<SPLIT>.
Dragonica<SPLIT>So<SPLIT>testet<SPLIT>4Players<SPLIT>Bitte<SPLIT>einloggen<SPLIT>,<SPLIT>um<SPLIT>Spiel<SPLIT>in<SPLIT>die<SPLIT>Watchlist<SPLIT>aufzunehmen<SPLIT>.
Rating-Update<SPLIT>:<SPLIT>London<SPLIT>(<SPLIT>aktiencheck.de<SPLIT>AG<SPLIT>)<SPLIT>-<SPLIT>Robert<SPLIT>T<SPLIT>.<SPLIT>Cornell<SPLIT>,<SPLIT>Scott<SPLIT>L<SPLIT>.<SPLIT>Gaffner<SPLIT>und<SPLIT>Darren<SPLIT>Yip<SPLIT>,<SPLIT>Analysten<SPLIT>von<SPLIT>Barclays<SPLIT>Capital<SPLIT>,<SPLIT>stufen<SPLIT>die<SPLIT>Aktie<SPLIT>von<SPLIT>ITT<SPLIT>Industries<SPLIT>(<SPLIT>ISIN<SPLIT>US4509111021<SPLIT>/<SPLIT>WKN<SPLIT>860023<SPLIT>)<SPLIT>weiterhin<SPLIT>mit<SPLIT>dem<SPLIT>Rating<SPLIT>"<SPLIT>equal-weight<SPLIT>"<SPLIT>ein<SPLIT>.
Sollten<SPLIT>der<SPLIT>Branche<SPLIT>durch<SPLIT>"<SPLIT>populäre<SPLIT>Preiskürzungen<SPLIT>"<SPLIT>der<SPLIT>Regulierungsbehörden<SPLIT>weiter<SPLIT>Milliarden<SPLIT>entzogen<SPLIT>werden<SPLIT>,<SPLIT>sei<SPLIT>kaum<SPLIT>vorstellbar<SPLIT>,<SPLIT>wie<SPLIT>ein<SPLIT>flächendeckender<SPLIT>Breitbandausbau<SPLIT>noch<SPLIT>finanziert<SPLIT>werden<SPLIT>könnte<SPLIT>,<SPLIT>kritisierte<SPLIT>Obermann<SPLIT>.
Grichting/Von<SPLIT>Bergen<SPLIT>bewährten<SPLIT>sich<SPLIT>abstimmen<SPLIT>Online/Print<SPLIT>Täglich<SPLIT>stellt<SPLIT>der<SPLIT>BLICK<SPLIT>eine<SPLIT>Frage<SPLIT>des<SPLIT>Tages<SPLIT>.
Best<SPLIT>.<SPLIT>For<SPLIT>additional<SPLIT>information<SPLIT>,<SPLIT>please<SPLIT>visit<SPLIT>www.asih.bm<SPLIT>.<SPLIT>0<SPLIT>Bewertungen<SPLIT>dieses<SPLIT>Artikels<SPLIT>:<SPLIT>Kommentare<SPLIT>zu<SPLIT>diesem<SPLIT>Artikel<SPLIT>Geben<SPLIT>Sie<SPLIT>jetzt<SPLIT>einen<SPLIT>Kommentar<SPLIT>zu<SPLIT>diesem<SPLIT>Artikel<SPLIT>ab<SPLIT>.

Am 13.03.2013 15:52, schrieb Jörn Kottmann:
> Hello,
> 
> Can you tell us a bit more about your training data? Did you manually
> annotate these 300k sentences?
> Is it possible to post 10 lines or so here?
> 
> Jörn
> 
> On 03/12/2013 03:22 PM, Andreas Niekler wrote:
>> Dear List,
>>
>> i created a Tokenizer Model with 300k german Sentences from a very clean
>> corpus. I see some words that are very strangly separated by a tokenizer
>> using this model like:
>>
>> stehenge - blieben
>> fre - undlicher
>>
>> and so on. I cant find those in my training data and wonder why openNLP
>> splits those words without any evidence in the training data and wihout
>> any whitespace in my text files. I trained the model with 500
>> Iterations, cutoff 5 and alphanumeric optimisation.
>>
>> Can anyone state some ideas how i can prevent this?
>>
>> thank you
>>
>> Andreas
> 

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniekler@informatik.uni-leipzig.deg.de

Re: TokenizerTrainer

Posted by Jörn Kottmann <ko...@gmail.com>.
Hello,

Can you tell us a bit more about your training data? Did you manually
annotate these 300k sentences?
Is it possible to post 10 lines or so here?

Jörn

On 03/12/2013 03:22 PM, Andreas Niekler wrote:
> Dear List,
>
> i created a Tokenizer Model with 300k german Sentences from a very clean
> corpus. I see some words that are very strangly separated by a tokenizer
> using this model like:
>
> stehenge - blieben
> fre - undlicher
>
> and so on. I cant find those in my training data and wonder why openNLP
> splits those words without any evidence in the training data and wihout
> any whitespace in my text files. I trained the model with 500
> Iterations, cutoff 5 and alphanumeric optimisation.
>
> Can anyone state some ideas how i can prevent this?
>
> thank you
>
> Andreas