
Posted to java-user@lucene.apache.org by suriya prakash <su...@gmail.com> on 2016/12/20 14:15:06 UTC

Email id tokenizer (actual email id & multiple terms)

Hi,

I am using StandardAnalyzer and want to tokenize an email id such as
"lucene@gmail.com" into "lucene", "gmail", "com" and "lucene@gmail.com"
in a single pass.

I have already changed the JFlex grammar to split an email id into
separate words (lucene, gmail, com). But then matching a whole email id
needs a phrase search, which I expect will not be efficient. So I want
to index both the actual email id and the split words.

Can you please help me achieve this, or let me know whether phrase
search is efficient enough for this case?


Regards,
Suriya
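
The token set asked for above can be sketched in plain Java. This is only an illustration of the desired output, not Lucene code: the class EmailTermExpander and its expand method are hypothetical names, and splitting on '@' and '.' is an assumption about what the modified JFlex grammar does.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not a Lucene class: expands one email id into index terms. */
final class EmailTermExpander {

    /**
     * Returns the parts of the address (split on '@' and '.') followed by
     * the whole address. Empty parts, e.g. from "a..b@x.org", are skipped.
     */
    static List<String> expand(String address) {
        List<String> terms = new ArrayList<>();
        for (String part : address.split("[@.]")) {
            if (!part.isEmpty()) {
                terms.add(part);      // individual words, for partial matching
            }
        }
        terms.add(address);           // whole address, for exact matching
        return terms;
    }

    public static void main(String[] args) {
        // prints [lucene, gmail, com, lucene@gmail.com]
        System.out.println(expand("lucene@gmail.com"));
    }
}
```

In a real analysis chain the whole address would also need a position relative to its parts, which this sketch does not model.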

Re: Email id tokenizer (actual email id & multiple terms)

Posted by Trejkaz <tr...@trypticon.org>.

On Wed, Dec 21, 2016 at 1:21 AM, Ahmet Arslan <io...@yahoo.com.invalid> wrote:
> Hi,
>
> You can index the whole address in a separate field.
> Otherwise, how would you handle positions of the split tokens?
>
> By the way, the speed of phrase search may be just fine, so consider trying it first.

Speed aside, phrase search is problematic because you'll accidentally
match too much: a phrase query for user@company.com will also match
user@company.com.au, one for john@gmail.com will also match
little.john@gmail.com, and so on.

Using a separate field for non-tokenised addresses would be my
recommendation too.
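
The over-matching can be illustrated with a small plain-Java model. It is a sketch under assumptions, not Lucene code: a phrase query is modelled as a contiguous run of word tokens, the tokenizer simply splits on non-alphanumeric characters, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration: a phrase query modelled as a contiguous run of tokens. */
final class PhraseOverMatchDemo {

    /** Splits an address into lower-cased word tokens on any non-alphanumeric character. */
    static List<String> tokenize(String address) {
        return Arrays.asList(address.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"));
    }

    /** True if the query tokens occur as a contiguous run inside the indexed tokens. */
    static boolean phraseMatches(String indexed, String query) {
        return Collections.indexOfSubList(tokenize(indexed), tokenize(query)) >= 0;
    }

    /** True only if the whole address is identical, as with a non-tokenised field. */
    static boolean exactMatches(String indexed, String query) {
        return indexed.equals(query);
    }

    public static void main(String[] args) {
        // Phrase-style matching finds the shorter address inside the longer one.
        System.out.println(phraseMatches("user@company.com.au", "user@company.com"));   // true
        System.out.println(phraseMatches("little.john@gmail.com", "john@gmail.com"));   // true
        // Whole-address matching does not.
        System.out.println(exactMatches("user@company.com.au", "user@company.com"));    // false
        System.out.println(exactMatches("little.john@gmail.com", "john@gmail.com"));    // false
    }
}
```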

TX

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Email id tokenizer (actual email id & multiple terms)

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi,

You can index the whole address in a separate field.
Otherwise, how would you handle positions of the split tokens?

By the way, the speed of phrase search may be just fine, so consider trying it first.

Ahmet



Re: Email id tokenizer (actual email id & multiple terms)

Posted by Trejkaz <tr...@trypticon.org>.

On Wed, Dec 21, 2016 at 11:23 PM, suriya prakash <su...@gmail.com> wrote:
> Hi,
>
> Thanks for your reply.
>
> I might have one or more email ids in a single record.

Just so you know, you can add the same field more than once, with the
field analysed by KeywordAnalyzer, and each value will become its own
token. This is safer than something like WhitespaceAnalyzer, because
email addresses can actually contain spaces. (UAX29URLEmailAnalyzer
might do the right thing though.)
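
The difference between the two approaches can be sketched in plain Java. This only models the behaviour described above and does not call Lucene: the class and method names are hypothetical, and the sample address with a quoted local part is an assumed example of an address containing a space.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration: models keyword-style vs whitespace-style analysis. */
final class KeywordVsWhitespaceDemo {

    /** Keyword-style: each value added to the field becomes exactly one term. */
    static List<String> keywordTerms(List<String> fieldValues) {
        return new ArrayList<>(fieldValues);
    }

    /** Whitespace-style: every value is split wherever whitespace occurs. */
    static List<String> whitespaceTerms(List<String> fieldValues) {
        List<String> terms = new ArrayList<>();
        for (String value : fieldValues) {
            terms.addAll(Arrays.asList(value.trim().split("\\s+")));
        }
        return terms;
    }

    public static void main(String[] args) {
        // One record with two addresses; the first has a quoted local part with a space.
        List<String> addresses = Arrays.asList("\"john doe\"@example.com", "lucene@gmail.com");
        System.out.println(keywordTerms(addresses).size());     // 2 terms, one per address
        System.out.println(whitespaceTerms(addresses).size());  // 3 terms, first address is broken
    }
}
```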

But if you're doing this in the main text content field,
TeeSinkTokenFilter does seem like the right thing to use. (I have
never found a use for it myself.)

TX


Re: Email id tokenizer (actual email id & multiple terms)

Posted by suriya prakash <su...@gmail.com>.

Hi,

Thanks for your reply.

I might have one or more email ids in a single record. So I would have
to index them with a whitespace analyzer after filtering out the email
ids alone (maybe using an email id tokenizer).

Tokenization would then happen twice (once for normal indexing and once
for the special email id field), which is costly for the content field.

Is there any way to do this efficiently? Will TeeSinkTokenFilter help in
my case?
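
The single-pass idea behind the question can be sketched in plain Java. This is a conceptual model only and does not use the TeeSinkTokenFilter API: the class name, the whitespace splitting, and the "contains '@'" test for recognising an email id are all simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration: one pass over the text feeds two term lists. */
final class SinglePassTeeDemo {

    /** Terms for the main content field (email ids split into words). */
    final List<String> contentTerms = new ArrayList<>();

    /** Terms for the separate email id field (whole addresses only). */
    final List<String> emailTerms = new ArrayList<>();

    /** Scans the text once and routes each raw token to one or both lists. */
    void consume(String text) {
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            if (raw.indexOf('@') > 0) {               // assumed email id heuristic
                emailTerms.add(raw);                  // whole address goes to the sink
                for (String part : raw.split("[@.]")) {
                    if (!part.isEmpty()) {
                        contentTerms.add(part);       // split words go to the content field
                    }
                }
            } else {
                contentTerms.add(raw);
            }
        }
    }

    public static void main(String[] args) {
        SinglePassTeeDemo demo = new SinglePassTeeDemo();
        demo.consume("mail lucene@gmail.com or solr@apache.org today");
        System.out.println(demo.contentTerms);
        System.out.println(demo.emailTerms);
    }
}
```

The text is scanned once, and the whole addresses are collected on the side for the separate field instead of being found by a second tokenization.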
