
Posted to solr-user@lucene.apache.org by Mingfeng Yang <mf...@wisewindow.com> on 2013/04/12 01:48:38 UTC

tokenizer of solr

Dear Solr users and developers,

I am trying to index some documents, some of which are Twitter messages, and
we have a problem when indexing retweets.

Say a Twitter user named "jpc_108" posts a tweet, and then someone retweets
his message, so "@jpc_108" becomes part of the tweet text body.

It seems that before indexing, the tokenizer factory of Solr turns "@jpc_108"
into "jpc" and "108", so when we search for jpc_108, it's not there anymore.


Is there any way we can keep "jpc_108" when it appears as "@jpc_108"?

Thanks,
Ming-

Re: tokenizer of solr

Posted by Mingfeng Yang <mf...@wisewindow.com>.

Jack,

Thanks so much for this info.  It's awesome.

Ming


On Thu, Apr 11, 2013 at 7:32 PM, Jack Krupansky <ja...@basetechnology.com> wrote:

> In that case, use the types="wdfftypes.txt" attribute of WDF and map "@"
> and "_" to ALPHA as shown in:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
>
>
> -- Jack Krupansky

Re: tokenizer of solr

Posted by Jack Krupansky <ja...@basetechnology.com>.

In that case, use the types="wdfftypes.txt" attribute of WDF (the
WordDelimiterFilterFactory) and map "@" and "_" to ALPHA, as shown in:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

-- Jack Krupansky
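
A minimal sketch of what that suggestion might look like in practice. The file name wdfftypes.txt comes from the message above; the field type name "text_tweet", the choice of tokenizer, and the splitOnNumerics="0" setting are assumptions for illustration, not something stated in this thread. The types file maps each character to a type, one mapping per line:

```text
# wdfftypes.txt (placed in the core's conf directory)
# Treat "@" and "_" as letters instead of word delimiters.
@ => ALPHA
_ => ALPHA
```

```xml
<!-- schema.xml: hypothetical field type; the name "text_tweet" is an assumption -->
<fieldType name="text_tweet" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- types= points at the mapping file above.
         splitOnNumerics="0" is an assumption added here: by default the
         filter also splits at letter-to-digit transitions, which would
         likely still separate "_" from "108". -->
    <filter class="solr.WordDelimiterFilterFactory"
            types="wdfftypes.txt"
            splitOnNumerics="0"
            generateWordParts="1"
            generateNumberParts="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With "@" mapped to ALPHA, the indexed token would probably keep its leading "@", so a search for plain jpc_108 may still miss it. Mapping only "_" and leaving "@" as a delimiter might be closer to the original goal, but that is an untested guess; the Analysis screen in the Solr admin UI is the place to check what tokens actually come out.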

-----Original Message----- 
From: Mingfeng Yang
Sent: Thursday, April 11, 2013 8:50 PM
To: solr-user@lucene.apache.org
Subject: Re: tokenizer of solr

Looks like it's due to the word delimiter filter.  Does anyone know whether
the "protected" file supports regular expressions?

Ming



Re: tokenizer of solr

Posted by Mingfeng Yang <mf...@wisewindow.com>.

Looks like it's due to the word delimiter filter.  Does anyone know whether
the "protected" file supports regular expressions?

Ming


On Thu, Apr 11, 2013 at 4:58 PM, Jack Krupansky <ja...@basetechnology.com> wrote:

> Try the whitespace tokenizer.
>
> -- Jack Krupansky

Re: tokenizer of solr

Posted by Jack Krupansky <ja...@basetechnology.com>.

Try the whitespace tokenizer.

-- Jack Krupansky
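
For illustration, a field type built on the whitespace tokenizer could look roughly like the sketch below. The name "text_ws_tweet" and the lowercase filter are assumptions, not part of the advice above.

```xml
<!-- schema.xml: hypothetical field type; the name "text_ws_tweet" is an assumption -->
<fieldType name="text_ws_tweet" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Splits on whitespace only, so "@jpc_108" is kept as a single token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Two caveats with this sketch: the token keeps its leading "@", so a query for jpc_108 without the "@" would not match it unless the character is stripped somewhere in the chain; and if a WordDelimiterFilterFactory follows the tokenizer, it will still split the token on "_" with its default settings.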
