Posted to java-user@lucene.apache.org by Ravikumar Govindarajan <ra...@gmail.com> on 2015/02/17 09:51:09 UTC

URL/Email tokenizer

We have a requirement that e-mail addresses be added in tokenized form to
one field and in untokenized form to another field.

Ex:

"I have mailed abc@xyz.com". It should be tokenized as below:

body = {"I", "have", "mailed", "abc", "xyz", "com"};

I also have a body-addr field. The tokenizer needs to extract e-mail
addresses from the body field and add them as below:

body-addr = {"abc@xyz.com"}

How can I achieve this via a tokenizer chain?

--
Ravi

Re: URL/Email tokenizer

Posted by Ian Lea <ia...@gmail.com>.
Ah, you want to do it the hard way.  Sorry, can't help you there - I
prefer to do things the simple way - easier to write and to maintain
and, in my experience, usually more robust in the long run.


--
Ian.



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: URL/Email tokenizer

Posted by Ravikumar Govindarajan <ra...@gmail.com>.
Thanks, Ian.

What I am currently doing is duplicating the data into two different fields
and using my own PerFieldAnalyzerWrapper, just as you pointed out.

Is there a good way to do this in a single pass, the way bi-grams or
common-grams do?

--
Ravi
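
The single-pass idea asked about above can be illustrated in plain Java. This is only a sketch of the technique, not Lucene code: the class name SinglePassSketch, the simplified e-mail regex and the analyze helper are all assumptions made for illustration. The text is scanned once; each e-mail match is added whole to body-addr and split into parts for body, while every other word goes to body only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Lucene.
class SinglePassSketch {
    // Group 1: a deliberately simplified e-mail address. Group 2: a plain word.
    // The e-mail alternative is tried first at each position.
    static final Pattern TOKEN = Pattern.compile(
        "([A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+)|([A-Za-z0-9]+)");
    static final Pattern NON_WORD = Pattern.compile("[^A-Za-z0-9]+");

    // Scans the text once and fills both fields from the same pass.
    static Map<String, List<String>> analyze(String text) {
        List<String> body = new ArrayList<>();
        List<String> bodyAddr = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String email = m.group(1);
            if (email != null) {
                bodyAddr.add(email);                  // untokenized form
                for (String part : NON_WORD.split(email)) {
                    if (!part.isEmpty()) {
                        body.add(part);               // tokenized form
                    }
                }
            } else {
                body.add(m.group(2));                 // ordinary word
            }
        }
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("body", body);
        fields.put("body-addr", bodyAddr);
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(analyze("I have mailed abc@xyz.com"));
    }
}
```

In Lucene each field consumes its own token stream, so carrying this over would need some way to share one stream between two fields. TeeSinkTokenFilter may be worth a look for that, though it is not mentioned anywhere in this thread.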


Re: URL/Email tokenizer

Posted by Ian Lea <ia...@gmail.com>.
Sounds like a job for
org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper.


--
Ian.
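
The per-field approach suggested above can be modelled in plain Java as a rough sketch. Nothing here is Lucene code: the class name PerFieldSketch, the two regexes and the analyze helper are assumptions made for illustration. The map from field name to analysis function stands in for what PerFieldAnalyzerWrapper does with its map from field name to Analyzer, and the same text is analyzed once per field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Lucene.
class PerFieldSketch {
    // Deliberately simplified e-mail pattern, for illustration only.
    static final Pattern EMAIL = Pattern.compile(
        "[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+");
    static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    // Collects every match of the pattern, in order of appearance.
    static List<String> findAll(Pattern p, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Runs one analysis function per field over the same input text.
    static Map<String, List<String>> analyze(String text) {
        Map<String, Function<String, List<String>>> analyzers = new LinkedHashMap<>();
        analyzers.put("body", t -> findAll(WORD, t));        // tokenized form
        analyzers.put("body-addr", t -> findAll(EMAIL, t));  // whole addresses only
        Map<String, List<String>> fields = new LinkedHashMap<>();
        analyzers.forEach((field, fn) -> fields.put(field, fn.apply(text)));
        return fields;
    }

    public static void main(String[] args) {
        // prints {body=[I, have, mailed, abc, xyz, com], body-addr=[abc@xyz.com]}
        System.out.println(analyze("I have mailed abc@xyz.com"));
    }
}
```

The cost of this design is that the text is scanned once per field, which is the duplication the follow-up message in this thread asks about.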


