Posted to solr-user@lucene.apache.org by "Khare, Kushal (MIND)" <Ku...@mind-infotech.com> on 2020/07/22 16:42:45 UTC

Can't search formatted text in solr

Hello guys,
I have been using Solr in my Java application to run content searches over saved docs.
I am facing a problem searching for the word 'load'.
There are 2 cases: in the 1st the search works fine, but in the 2nd, with the same doc and the same query ('load'), I am not getting the result.

CASE 1 :

"Doing load test for Solr"  - Simple text in doc format.
Works fine

CASE 2 :

"Doing load test for Solr"  - Simple text in doc format.
In this case, the Solr search fails. I don't get the result when I search for the term 'load'.


Please help me with this, as I am unable to find any help on it.


Thanks !
Regards,
Kushal Khare


Re: Can't search formatted text in solr

Posted by Erick Erickson <er...@gmail.com>.
There’s a space between “l” and “oad” in your second doc. Or perhaps it has
markup etc. If you do what I mentioned and use the /terms endpoint to examine
what’s actually in your index, I’m pretty sure you’ll see “l” and “oad” so not 
finding it is perfectly understandable.
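
To make the point concrete, here is a minimal sketch in Python of why a split word cannot match. The `simple_tokens` function is a hypothetical stand-in for an analyzer (lowercase, split on anything that is not a letter or digit); it is not Solr's actual analysis chain, and the "clean" sentence is my guess at what the extracted line was meant to say.

```python
import re

def simple_tokens(text):
    """Lowercase and split on non-alphanumeric runs (rough stand-in for an analyzer)."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

# Text as it came out of the extraction step, with the stray gaps inside words.
extracted = "Thi      s       docum      ent is being used for the QDMS l      oad testing      ."
# The sentence as it presumably reads in the document (assumption).
clean = "This document is being used for the QDMS load testing."

broken_tokens = simple_tokens(extracted)
clean_tokens = simple_tokens(clean)

print(broken_tokens)             # contains "l" and "oad" as separate tokens
print("load" in broken_tokens)   # the query term has nothing to match
print("load" in clean_tokens)
```

Whatever tokenizer the field really uses, the term "load" never reaches the index if the text handed to Solr already has a gap inside the word.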

What’s happening is that whatever turns the doc into your XML format breaks the text up
like this. I’ve seen this happen with other markup.

In other words, this has nothing to do with Solr and everything to do with whatever
extracts the text from the original document.
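
As an illustration of doing the extraction outside Solr, here is a rough Python sketch. It assumes the file is a .docx (a zip package containing word/document.xml, which is what the printed part names suggest) and that the gaps come from formatting runs: Word can split one word across several <w:r> runs when bold or color changes partway through. This is not what Tika or ExtractingRequestHandler does internally; `docx_paragraphs` and `make_sample_docx` are hypothetical helpers, and the sample package holds only the one part this parser reads, so it is not a complete Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def docx_paragraphs(docx_bytes):
    """Return one string per paragraph, joining run text with no separator."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # Concatenate the <w:t> pieces directly so a word split across
        # runs ("l" + "oad") comes back out as a single word.
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return paragraphs

def make_sample_docx():
    """Build a tiny in-memory package with 'load' split across two runs."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
        '<w:r><w:t xml:space="preserve">Doing l</w:t></w:r>'
        '<w:r><w:rPr><w:b/></w:rPr><w:t>oad</w:t></w:r>'
        '<w:r><w:t xml:space="preserve"> test for Solr</w:t></w:r>'
        '</w:p></w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", xml)
    return buf.getvalue()

print(docx_paragraphs(make_sample_docx()))
```

The design point is only that run boundaries must not be turned into whitespace; the resulting paragraph strings can then be sent to Solr as an ordinary text field.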

If you’re using ExtractingRequestHandler to process this, you’re getting the defaults
that Tika uses, which can be tweaked if you run Tika outside Solr; see the Tika
website.

And you’ll never get this 100% right. Every document format does weird things, and docs
produced by one version don’t necessarily match another version even in the same
format (say PDF). Extracting the plain text correctly for every version of every
format is nearly impossible unless you handle them one by one.

Best,
Erick



Re: Can't search formatted text in solr

Posted by "Khare, Kushal (MIND)" <Ku...@mind-infotech.com>.
I ran the debug query and everything looks fine, but I am still unable to get the desired doc in my results.

"debug":{
  "rawquerystring":"load",
  "querystring":"load",
  "parsedquery":"_text_:load",
  "parsedquery_toString":"_text_:load",

Actually, CASE 2 in my previous mail has the same text, "Doing load test for Solr", but the difference I forgot to mention is that there the text is formatted BOLD and the text color is RED.
In case 1, it was plain text.
What I observed is that while parsing, if I print the textHandler String, I get this:

[Content_Types].xml

_rels/.rels

word/document.xml
              Thi      s       docum      ent is being used for the QDMS l      oad testing      .


So I don't know what goes wrong when I have the same text but formatted.
Please help me with this, as it is critical and needs to be delivered very soon.

Thanks !


Re: Can't search formatted text in solr

Posted by Erick Erickson <er...@gmail.com>.
There’s not much info to go on here. Try attaching &debug=query to the queries and see if the parsed query returned is what you expect. If it is, the next thing I’d do is attach &debug=true&explainOther=id:id_of_doc_that_isnt_showing_up

This last will show you how scoring was done whether or not the doc is returned in the result set.

Finally, you can use the admin UI to look at the actual tokens indexed.

My bet is that your doc format isn’t being analyzed properly, perhaps due to markup, and the second case doesn’t get indexed the way you think it should. You can use the terms handler to examine exactly what’s in the index.
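
For reference, a small Python sketch that builds the request URLs for these checks. The host, port, collection name ("mycollection") and the doc id are placeholders, not values from this thread; the field name _text_ is taken from the parsed query. The parameters themselves (debug, explainOther, terms, terms.fl, terms.prefix, terms.limit) are standard Solr request parameters, and the sketch assumes the /terms handler is available on the collection.

```python
from urllib.parse import urlencode

# Hypothetical base URL: adjust host, port and collection to your setup.
SOLR_BASE = "http://localhost:8983/solr/mycollection"

def select_debug_url(query, doc_id=None):
    """Build a /select URL with debug=query, or full debug plus explainOther for one doc."""
    params = {"q": query, "debug": "query"}
    if doc_id is not None:
        params["debug"] = "true"
        params["explainOther"] = f"id:{doc_id}"
    return f"{SOLR_BASE}/select?{urlencode(params)}"

def terms_url(field, prefix, limit=20):
    """Build a /terms URL listing indexed terms in `field` that start with `prefix`."""
    params = {
        "terms": "true",
        "terms.fl": field,
        "terms.prefix": prefix,
        "terms.limit": limit,
    }
    return f"{SOLR_BASE}/terms?{urlencode(params)}"

print(select_debug_url("load"))
print(select_debug_url("load", doc_id="doc42"))  # "doc42" is a made-up id
print(terms_url("_text_", "l"))
```

If the terms listing for the prefix "l" shows a bare "l" (and "oad" appears under "o") but no "load", the text was already split before it reached the index.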

Best,
Erick
