Posted to solr-user@lucene.apache.org by Utkarsh Sengar <ut...@gmail.com> on 2013/09/17 23:20:24 UTC

Some text not indexed in solr4.4

I have a copyField called allText with type text_general:
https://gist.github.com/utkarsh2012/6167128#file-schema-xml-L68

I have ~100 documents that contain the text "dyson" together with a model
number such as "dc44" or "dc41".

For example:
"title": "Dyson DC44 Animal Digital Slim Cordless Vacuum"
"description": "The DC44 Animal is the new Dyson Digital Slim vacuum
cleaner  the cordless machine that doesn’t lose suction. It has been
engineered for floor to ceiling cleaning. DC44 Animal has a detachable
long-reach wand  which is balanced for floor to ceiling cleaning.   The
motorized floor tool has twice the power of the DC35 floor tool  to drive
the bristles deeper into the carpet pile with more force. It attaches to
the wand or directly to the machine for cleaning awkward spaces. The brush
bar has carbon fiber filaments for removing fine dust from hard floors.
DC44 Animal has a run time of 20 minutes or 8 minutes on Boost mode.
Powered by the Dyson digital motor  DC44 Animal has a fade-free nickel
manganese cobalt battery and Root Cyclone technology for constant  powerful
suction.",
UPC: 0879957006362

The documents are indexed.

Analysis says it's indexed: http://i.imgur.com/O52ino1.png
But when I search for allText:"dyson dc44" I get no results, response:
http://pastie.org/8334220

Any suggestions about the problem? I am out of ideas about how to debug
this.

-- 
Thanks,
-Utkarsh

Re: Some text not indexed in solr4.4

Posted by Furkan KAMACI <fu...@gmail.com>.
Also, did you check what this page says about MultiPhraseQuery?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters


On Wednesday, September 18, 2013, Furkan KAMACI <fu...@gmail.com> wrote:
> Hi,
>
> Did you run the commit command?

Re: Some text not indexed in solr4.4

Posted by Furkan KAMACI <fu...@gmail.com>.
Hi,

Did you run the commit command?

On Wednesday, September 18, 2013, Utkarsh Sengar <ut...@gmail.com> wrote:
> To add to this, I see the exact same problem with the queries "nikon d7100",
> "nikon d5100", "samsung ps-we450" etc.
>
> Thanks,
> -Utkarsh

Re: Some text not indexed in solr4.4

Posted by Utkarsh Sengar <ut...@gmail.com>.
To add to this, I see the exact same problem with the queries "nikon d7100",
"nikon d5100", "samsung ps-we450" etc.

Thanks,
-Utkarsh



Re: Some text not indexed in solr4.4

Posted by Utkarsh Sengar <ut...@gmail.com>.
WordDelimiterFilterFactory was the culprit. Removing that fixed the problem.
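
One way such a filter can break a phrase query is sketched below in Python. It assumes the token was split on only one side of the analysis (here, at query time); the thread does not confirm which side was affected, so this is an illustration rather than a diagnosis. The analyze and phrase_match helpers are hypothetical and are not Solr APIs.

```python
import re


def analyze(text, split_numerics):
    """Lowercase, tokenize on whitespace, optionally split letter/digit runs."""
    tokens = []
    for word in text.lower().split():
        if split_numerics:
            # Crude stand-in for a word-delimiter step: "dc44" -> "dc", "44".
            tokens.extend(re.findall(r"[a-z]+|[0-9]+", word))
        else:
            tokens.append(word)
    return tokens


def phrase_match(doc_tokens, query_tokens):
    """True if query_tokens occur as a consecutive run inside doc_tokens."""
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))


title = "Dyson DC44 Animal Digital Slim Cordless Vacuum"
query = "dyson dc44"

# Same analysis on both sides: the phrase is found.
print(phrase_match(analyze(title, False), analyze(query, False)))  # True
# Split only at query time: "dc" and "44" are not in the indexed tokens.
print(phrase_match(analyze(title, False), analyze(query, True)))   # False
```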


Thanks,
-Utkarsh


On Tue, Sep 24, 2013 at 12:17 PM, Utkarsh Sengar <ut...@gmail.com> wrote:

> My guess is that this is caused by the WordDelimiterFilterFactory here:
> https://gist.github.com/utkarsh2012/6167128#file-schema-xml-L16
> What do you think? Is dc44 somehow being split at query time?
>
> I will test it out and update this thread with my findings.

Re: Some text not indexed in solr4.4

Posted by Utkarsh Sengar <ut...@gmail.com>.
@Furkan Yes, I have run a commit, and other text is searchable.
I'm not sure what you mean about MultiPhraseQuery. That page mentions it in
the context of SynonymFilterFactory, RemoveDuplicatesTokenFilterFactory and
PositionFilterFactory. Which part are you referring to?

@Jason I get this response (I have a multi-core setup) from this URL:
http://SOLR_SERVER/solr/prodinfo/terms?terms.fl=text&terms.prefix=dc

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="terms">
    <lst name="text"/>
  </lst>
</response>

I'm not sure how to interpret this response. I get the same response for any
prefix, e.g. a, b, iph.

My guess is that this is caused by the WordDelimiterFilterFactory here:
https://gist.github.com/utkarsh2012/6167128#file-schema-xml-L16
What do you think? Is dc44 somehow being split at query time?
The wiki entry
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
says: split on letter-number transitions (can be turned off, see the
splitOnNumerics parameter), e.g. "SD500" -> "SD", "500"
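
The splitting described there can be approximated in a few lines of Python. This is only a sketch of the letter-number and punctuation splitting, not the real filter: it ignores case transitions, catenation and the other WordDelimiterFilterFactory options, and split_like_word_delimiter is a hypothetical helper name.

```python
import re


def split_like_word_delimiter(token):
    """Split a token into runs of letters and runs of digits.

    Anything that is neither a letter nor a digit (such as "-") acts as a
    delimiter and is dropped.
    """
    return re.findall(r"[A-Za-z]+|[0-9]+", token)


for token in ["SD500", "dc44", "d7100", "ps-we450", "dyson"]:
    print(token, "->", split_like_word_delimiter(token))
```

Under this approximation every model number from the failing queries is broken into several parts, while plain words such as "dyson" pass through unchanged.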

I will test it out and update this thread with my findings.

Thanks,
-Utkarsh



On Tue, Sep 17, 2013 at 5:10 PM, Jason Hellman <
jhellman@innoventsolutions.com> wrote:

> Utkarsh,
>
> Check to see if the value is actually indexed into the field by using the
> Terms request handler:
>
> http://localhost:8983/solr/terms?terms.fl=text&terms.prefix=d
>
> (adjust the prefix to whatever you're looking for)
>
> This should get you going in the right direction.
>
> Jason

Re: Some text not indexed in solr4.4

Posted by Jason Hellman <jh...@innoventsolutions.com>.
Utkarsh,

Check to see if the value is actually indexed into the field by using the Terms request handler:

http://localhost:8983/solr/terms?terms.fl=text&terms.prefix=d

(adjust the prefix to whatever you're looking for)
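
As a rough sketch, the same check can be scripted in Python with only the standard library. The field name allText comes from the original question (the URL above uses text), both sample response bodies are invented for illustration, and the element layout is assumed to follow Solr's XML response format. terms_url and parse_terms are hypothetical helper names.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def terms_url(base, field, prefix):
    """Build a Terms request URL for one field and one prefix."""
    query = urlencode({"terms.fl": field, "terms.prefix": prefix})
    return f"{base}/terms?{query}"


def parse_terms(xml_text, field):
    """Return {term: count} for `field` from a Terms XML response body."""
    root = ET.fromstring(xml_text)
    node = root.find(f"./lst[@name='terms']/lst[@name='{field}']")
    if node is None:
        return {}
    return {child.get("name"): int(child.text) for child in node}


# Illustrative response bodies (assumed shape, not real server output).
EMPTY = '<response><lst name="terms"><lst name="text"/></lst></response>'
FOUND = ('<response><lst name="terms"><lst name="allText">'
         '<int name="dc44">3</int><int name="dc41">2</int>'
         '</lst></lst></response>')

print(terms_url("http://localhost:8983/solr", "allText", "dc"))
print(parse_terms(EMPTY, "text"))     # no terms listed for this field
print(parse_terms(FOUND, "allText"))  # terms mapped to their counts
```

An empty result for every prefix may simply mean that the field named in terms.fl holds no indexed terms, so it is worth checking the field name first.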

This should get you going in the right direction.

Jason

