Posted to dev@opennlp.apache.org by Damiano Porta <da...@gmail.com> on 2016/08/29 13:11:07 UTC

How to train a Tokenizer for emails ?

Hello,
I am creating a custom tokenizer. It works pretty well, but I have problems
with emails.
Email addresses can contain _ - . characters, which get split in normal text,
so the question is: how can I train it better?
After tokenization I need to apply different regexes to extract
emails/dates/telephone numbers, so such patterns must not be split.

Thanks
Damiano
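
[Editor's note] One way to approach the training question is to add sentences
containing email addresses to the training data, with each address annotated as
a single token. The sketch below assumes the OpenNLP tokenizer training format
(one sentence per line, with <SPLIT> marking a token boundary that has no
whitespace). The class and method names are hypothetical helpers, not part of
OpenNLP, and the sample sentence is invented.

```java
import java.util.List;

public class EmailTrainingLine {

    // Builds one training line from the raw sentence and its gold tokens.
    // Tokens separated by whitespace in the raw text stay separated by a space;
    // tokens that touch each other get an explicit <SPLIT> marker between them.
    static String toTrainingLine(String sentence, List<String> tokens) {
        StringBuilder line = new StringBuilder();
        int pos = 0; // end offset of the previous token in the raw sentence
        for (String token : tokens) {
            int start = sentence.indexOf(token, pos);
            if (start < 0) {
                throw new IllegalArgumentException("Token not found in order: " + token);
            }
            if (line.length() > 0) {
                // A gap means whitespace in the raw text; no gap needs a marker.
                line.append(start > pos ? " " : "<SPLIT>");
            }
            line.append(token);
            pos = start + token.length();
        }
        return line.toString();
    }

    public static void main(String[] args) {
        String sentence = "Write to john_doe-1@example.com, please.";
        // The address is listed as ONE gold token, so no <SPLIT> lands inside it.
        List<String> tokens =
            List.of("Write", "to", "john_doe-1@example.com", ",", "please", ".");
        System.out.println(toTrainingLine(sentence, tokens));
    }
}
```

Generating many such lines with varied addresses (containing _ - . in different
positions) gives the model examples where those characters are not split points.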

Re: How to train a Tokenizer for emails ?

Posted by Damiano Porta <da...@gmail.com>.

OK, thanks!

2016-09-10 23:46 GMT+02:00 William Colen <wi...@gmail.com>:

> When I need to, I debug the code. I don't know if there is a better way.

Re: How to train a Tokenizer for emails ?

Posted by William Colen <wi...@gmail.com>.

When I need to, I debug the code. I don't know if there is a better way.


2016-09-10 18:24 GMT-03:00 Damiano Porta <da...@gmail.com>:

> Hi William!
> Yeah, I will go with a custom generator that adds specific features for these
> patterns (email, telephone, dates), etc.
> Out of curiosity, how can I get the list of features of a specific token?
> Thanks!
> Damiano

Re: How to train a Tokenizer for emails ?

Posted by Damiano Porta <da...@gmail.com>.

Hi William!
Yeah, I will go with a custom generator that adds specific features for these
patterns (email, telephone, dates), etc.
Out of curiosity, how can I get the list of features of a specific token?
Thanks!
Damiano


2016-09-08 1:46 GMT+02:00 William Colen <wi...@gmail.com>:

> Have you trained with enough examples of emails?
> Some tools have a sequence validator, but I think the tokenizer doesn't
> have one. If it did, you could create a validator that recognizes these
> patterns.
> Another option would be to customize the feature generator to add a special
> feature when the token looks like an email or telephone number.
>
>
> Regards
> William

Re: How to train a Tokenizer for emails ?

Posted by William Colen <wi...@gmail.com>.

Have you trained with enough examples of emails?
Some tools have a sequence validator, but I think the tokenizer doesn't
have one. If it did, you could create a validator that recognizes these
patterns.
Another option would be to customize the feature generator to add a special
feature when the token looks like an email or telephone number.


Regards
William
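
[Editor's note] The second suggestion above (a special feature when the text
looks like an email or telephone number) could look roughly like the following
sketch. It is not OpenNLP code: the class PatternSplitFeatures, the feature
names in_email and in_phone, and both regular expressions are assumptions made
for illustration. Wiring the result into the tokenizer's context generator is
left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternSplitFeatures {

    // Deliberately loose, illustrative patterns; real data will need tuning.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+");
    private static final Pattern PHONE =
        Pattern.compile("\\+?\\d[\\d -]{6,}\\d");

    // Returns extra features for the candidate split position `index`
    // (a split there would fall between characters index - 1 and index).
    static List<String> extraFeatures(String sentence, int index) {
        List<String> features = new ArrayList<>();
        if (inside(EMAIL, sentence, index)) features.add("in_email");
        if (inside(PHONE, sentence, index)) features.add("in_phone");
        return features;
    }

    // True if `index` lies strictly inside a match, so splitting there cuts it.
    private static boolean inside(Pattern pattern, String sentence, int index) {
        Matcher m = pattern.matcher(sentence);
        while (m.find()) {
            if (m.start() < index && index < m.end()) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String s = "Mail john_doe@example.com now.";
        System.out.println(extraFeatures(s, 9));  // position inside the address
        System.out.println(extraFeatures(s, 29)); // position before the final period
    }
}
```

The idea is that the model can learn a strong "do not split" weight for these
features, provided the training data contains addresses and numbers annotated
as single tokens.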


On Monday, August 29, 2016, Damiano Porta <
damianoporta@gmail.com> wrote:

> Hello,
> I am creating a custom tokenizer. It works pretty well, but I have problems
> with emails.
> Email addresses can contain _ - . characters, which get split in normal text,
> so the question is: how can I train it better?
> After tokenization I need to apply different regexes to extract
> emails/dates/telephone numbers, so such patterns must not be split.
>
> Thanks
> Damiano
>


-- 
William Colen