Posted to solr-user@lucene.apache.org by Mohammad Shariq <sh...@gmail.com> on 2011/08/04 17:22:41 UTC

Indexing tweet and searching "@keyword" OR "#keyword"

I have indexed around 1 million tweets (using the "text" field type).
When I search for tweets with "#" or "@", I don't get exact matches.
E.g., when I search for "#ipad" OR "@ipad", I get results where "ipad" is
mentioned, with the "#" and "@" ignored.
Please suggest how to tune this, or which filter factories to use, to get
the desired result.
I am indexing the tweets as "text"; below is the "text" field type from my
schema.xml.


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
            minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            protected="protwords.txt" language="English"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
            minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            protected="protwords.txt" language="English"/>
  </analyzer>
</fieldType>

-- 
Thanks and Regards
Mohammad Shariq

Re: Indexing tweet and searching "@keyword" OR "#keyword"

Posted by Erick Erickson <er...@gmail.com>.

I don't see an easy way to do that with the standard set of
filters. You'll probably need to write something custom (note,
this is actually pretty easy). I suspect you'll need to do
something like synonyms: when you get a token like #ipod, you
essentially make it a synonym for ipod and insert both in the
document...

This assumes you can't create a list of all the terms you want
treated this way, because if you could, you could just use synonyms.
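
The "insert both tokens at index time" idea above can be sketched without
custom code, assuming a Solr release that ships
PatternCaptureGroupFilterFactory (a stock filter that, as far as I know,
was added after this thread). This is only a sketch under that assumption:
the field type name "text_tweet" is hypothetical, and the regular
expression is one guess at what counts as a hashtag or mention. The
expansion happens only in the index analyzer, so q=ipad can match
documents containing either "ipad" or "#ipad", while q=#ipad stays a
single token and matches only documents containing "#ipad".

```xml
<!-- Sketch only: "text_tweet" is a hypothetical field type name. -->
<fieldType name="text_tweet" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Split on whitespace only, so "#ipad" and "@ipad" survive intact. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- For "#ipad" or "@ipad", keep the original token and also emit the
         captured bare word "ipad" at the same position. Tokens that do
         not match the pattern pass through unchanged. -->
    <filter class="solr.PatternCaptureGroupFilterFactory"
            pattern="[#@](\w+)" preserve_original="true"/>
  </analyzer>
  <analyzer type="query">
    <!-- No expansion at query time: "#ipad" is searched as "#ipad". -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Two caveats with this sketch: trailing punctuation (as in "#ipad!") stays
attached to the original token because nothing strips it, and a "#" sent
in a request URL has to be encoded as %23. The admin/analysis page is the
place to confirm what each chain actually emits.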


Best
Erick

On Thu, Aug 11, 2011 at 1:37 AM, Mohammad Shariq <sh...@gmail.com> wrote:
> > Do you really want a search on "ipad" to *fail* to match input of "#ipad"?
> > Or vice-versa?
>
> My requirement is: for q='ipad' I want to match both '#ipad' and 'ipad',
> but for q='#ipad' I want to match ONLY '#ipad', excluding 'ipad'.

Re: Indexing tweet and searching "@keyword" OR "#keyword"

Posted by Mohammad Shariq <sh...@gmail.com>.

> Do you really want a search on "ipad" to *fail* to match input of "#ipad"?
> Or vice-versa?

My requirement is: for q='ipad' I want to match both '#ipad' and 'ipad',
but for q='#ipad' I want to match ONLY '#ipad', excluding 'ipad'.


On 10 August 2011 19:49, Erick Erickson <er...@gmail.com> wrote:

> Please look more carefully at the documentation for
> WordDelimiterFilterFactory (WDDF), specifically:
>
> split on intra-word delimiters (all non alpha-numeric characters).
>
> WordDelimiterFilterFactory will always throw away non alpha-numeric
> characters; you can't tell it to do otherwise. Try some of the other
> tokenizers/analyzers to get what you want, and also look at the
> admin/analysis page to see the exact effects of your
> fieldType definitions.
>
> Here's a great place to start:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>
> You probably want something like WhitespaceTokenizerFactory
> followed by LowerCaseFilterFactory or some such...
>
> But I really question whether this is what you want either. Do you
> really want a search on "ipad" to *fail* to match input of "#ipad"? Or
> vice-versa?
>
> KeywordTokenizerFactory is probably not the place you want to start:
> the tokenization process doesn't break anything up. You happen to be
> getting separate tokens because of WDDF, which, as you see, can't
> process things the way you want.
>
>
> Best
> Erick



-- 
Thanks and Regards
Mohammad Shariq

Re: Indexing tweet and searching "@keyword" OR "#keyword"

Posted by Erick Erickson <er...@gmail.com>.

Please look more carefully at the documentation for
WordDelimiterFilterFactory (WDDF), specifically:

split on intra-word delimiters (all non alpha-numeric characters).

WordDelimiterFilterFactory will always throw away non alpha-numeric
characters; you can't tell it to do otherwise. Try some of the other
tokenizers/analyzers to get what you want, and also look at the
admin/analysis page to see the exact effects of your
fieldType definitions.

Here's a great place to start:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

You probably want something like WhitespaceTokenizerFactory
followed by LowerCaseFilterFactory or some such...

But I really question whether this is what you want either. Do you
really want a search on "ipad" to *fail* to match input of "#ipad"? Or
vice-versa?

KeywordTokenizerFactory is probably not the place you want to start:
the tokenization process doesn't break anything up. You happen to be
getting separate tokens because of WDDF, which, as you see, can't
process things the way you want.
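
The WhitespaceTokenizerFactory plus LowerCaseFilterFactory suggestion
above might look roughly like the following. It is a minimal sketch, not a
tuned configuration: the field type name "text_ws_lower" is hypothetical,
and the single analyzer element (no type attribute) applies to both
indexing and querying.

```xml
<!-- Sketch only: "text_ws_lower" is a hypothetical field type name. -->
<fieldType name="text_ws_lower" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only; "#" and "@" stay part of the token. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Make matching case-insensitive: "#iPad" is indexed as "#ipad". -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this chain "#ipad" and "ipad" are different terms, which is exactly
the trade-off questioned above: a search for "ipad" would not match a
tweet that only contains "#ipad", and vice-versa.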


Best
Erick

On Wed, Aug 10, 2011 at 3:09 AM, Mohammad Shariq <sh...@gmail.com> wrote:
> I tried tweaking WordDelimiterFilterFactory, but it won't accept the # or @
> symbols; it ignores them entirely.
> I need a solution, please suggest one.

Re: Indexing tweet and searching "@keyword" OR "#keyword"

Posted by Mohammad Shariq <sh...@gmail.com>.

I tried tweaking WordDelimiterFilterFactory, but it won't accept the # or @
symbols; it ignores them entirely.
I need a solution, please suggest one.

On 4 August 2011 21:08, Jonathan Rochkind <ro...@jhu.edu> wrote:

> It's the WordDelimiterFilterFactory in your filter chain that's removing the
> punctuation entirely from your index, I think.
>
> Read up on what the WordDelimiter filter does, and what its settings are;
> decide how you want things to be tokenized in your index to get the behavior
> you want; either get WordDelimiter to do it that way by passing it
> different arguments, or stop using WordDelimiter; come back with any
> questions after trying that!


-- 
Thanks and Regards
Mohammad Shariq

Re: Indexing tweet and searching "@keyword" OR "#keyword"

Posted by Jonathan Rochkind <ro...@jhu.edu>.

It's the WordDelimiterFilterFactory in your filter chain that's removing
the punctuation entirely from your index, I think.

Read up on what the WordDelimiter filter does, and what its settings
are; decide how you want things to be tokenized in your index to get the
behavior you want; either get WordDelimiter to do it that way by
passing it different arguments, or stop using WordDelimiter; come back
with any questions after trying that!
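
One way of "passing it different arguments" might be the types attribute
of WordDelimiterFilterFactory, which is available from Solr 3.1 onward as
far as I know, so whether it applies depends on the Solr version in use.
The sketch below assumes that attribute is supported; the file name
"wdfftypes.txt" and its two mappings are illustrative, not taken from the
original schema.

```xml
<!-- Sketch only: assumes a Solr version whose WordDelimiterFilterFactory
     supports the "types" attribute. "wdfftypes.txt" is a file you would
     create yourself in the core's conf directory. -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="0"
        splitOnCaseChange="1" types="wdfftypes.txt"/>
<!-- Contents of wdfftypes.txt, one mapping per line:
       \u0023 => ALPHA
       \u0040 => ALPHA
     \u0023 is "#" and \u0040 is "@". Mapping them to ALPHA asks the
     filter to treat them as letters instead of delimiters. -->
```

The same attribute would need to be set in both the index and query
analyzers to keep them consistent. Note that this keeps "#ipad" as one
term, so on its own it would not make a search for "ipad" match "#ipad";
checking the result on the admin/analysis page is advisable.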

