Posted to solr-user@lucene.apache.org by Mark Fenbers <ma...@noaa.gov> on 2015/09/11 16:20:19 UTC

Bug or Operator Error?

Greetings!

So, I've created my first index and am able to search programmatically 
(through SolrJ) and through the Web interface. (Yay!)  I get non-empty 
results for my searches!

My index was built from database records using 
/dataimport?command=full-import.  I have 9936 records in the table to be 
indexed and the import status indicated it processed all 9936.  However, 
my searches only pull up a subset of the records that I know to contain 
a word.  For example, I know that there are hundreds of records 
containing the word "Friday", yet my results for my "Friday" query only 
contain 17 records (documents) in the Web interface, and only 10 records 
from the SolrJ query.

I figure I must be doing something wrong in my query, or have somehow 
indexed improperly.  This might be a clue: My main text field in the 
database table is URL-encoded.  I wouldn't think that would matter, though.

Another example... In one of the documents returned by the "Friday" 
query results, I noticed in the text the name of a co-worker "Drzal".  
So, I searched on "Drzal" and my results came up with 0 documents.   (!?)

Any ideas where I went wrong??
Mark

Re: Bug or Operator Error?

Posted by Erick Erickson <er...@gmail.com>.

Oh my. I'll leave it to the DIH guys to suggest whether there's
something that can be done with pure DIH, and offer a couple
of alternatives:

1> You could put a MappingCharFilterFactory in your analysis
chain. In the mapping file you can map things like
"%20" => " ". That would work with DIH as well.

2> You could use SolrJ rather than DIH and unescape the
data before writing it to Solr. Here's an example:
http://lucidworks.com/blog/indexing-with-solrj/
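
For option 2, the unescaping step itself needs nothing beyond the JDK.
Here is a minimal sketch, assuming the column holds ordinary
percent-encoding in UTF-8. The class and method names are made up for
illustration; the decoded string would be what you put into the SolrJ
document instead of the raw column value.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decode a URL-encoded database column before indexing.
class UrlDecodeSketch {

    // Returns the decoded text, or the input unchanged when it is not valid
    // percent-encoding (for example a '%' not followed by two hex digits).
    static String decodeField(String raw) {
        if (raw == null) {
            return null;
        }
        try {
            // Note: URLDecoder also turns '+' into a space (form encoding).
            return URLDecoder.decode(raw, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException e) {
            return raw;
        }
    }

    public static void main(String[] args) {
        // prints: See Drzal on Friday
        System.out.println(decodeField("See%20Drzal%20on%20Friday"));
    }
}
```

Falling back to the raw value on malformed input is just one choice; you
may prefer to log and skip such records.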

What's really happening here isn't that Solr fails to index
these words; rather, it's just not splitting your input up the
way you expect. Take a look at the Admin UI analysis page for one
of the fields in question and you'll see what I mean. The actual
tokens indexed may be things like 20Drzal or similar.
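
To get a feel for this outside Solr, here is a rough sketch. It is only
a stand-in for a real analysis chain: it splits on anything that is not
a letter or digit, which approximates (but is not) what a standard
tokenizer does, and the sample text is made up. The analysis page is
still the authoritative view of your own field type.

```java
import java.util.ArrayList;
import java.util.List;

// Rough illustration only: approximates tokenization by splitting on
// every run of characters that are neither letters nor digits.
class TokenSketch {

    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String raw = "Friday%20meeting%20with%20Drzal";
        // '%' separates tokens, but the "20" stays glued to the next word.
        System.out.println(tokens(raw));
        // Same text after "%20" has been mapped to " " (what a char filter would do).
        System.out.println(tokens(raw.replace("%20", " ")));
    }
}
```

The first line printed is [Friday, 20meeting, 20with, 20Drzal] and the
second is [Friday, meeting, with, Drzal]. Under this approximation a
word only matches when it is not preceded by a %xx sequence, which
would explain a handful of hits for "Friday" and none for "Drzal".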

Best,
Erick

On Fri, Sep 11, 2015 at 10:14 AM, Mark Fenbers <ma...@noaa.gov> wrote:
> Additional experimenting led me to the discovery that /dataimport does
> *not* index words with a preceding %20 (a URL-encoded space), or in fact
> *any* preceding %xx encoding.  I can probably replace each %20 with a '+' in
> each record of my database -- the dataimporter/indexer doesn't sneeze at
> those -- but using some sort of encoding is important for certain characters
> such as double and single quotes, because many non-alphanumeric characters
> have special meanings to the shell and/or PostgreSQL and need to be escaped.
>
> So now that I know what the issue is, I need to find a work-around. Does
> Solr have any baseline processors that will handle the URL-encoding?  Being
> new to Solr, I'm not sure I have the skill to write my own.  Or, is there
> another kind of encoding I can use that Solr doesn't adversely react to??
>
> Mark
>
> On 9/11/2015 12:11 PM, Erick Erickson wrote:
>>
>> Several ideas, all shots in the dark because to analyze this we
>> need the schema definitions and the result of your query with
>> &debug=true added. In particular you'll see the "parsed query"
>> section near the bottom, and often the parsed query isn't
>> quite what you think it is. This is often the issue:
>> you query q=Drzal, and this translates into q=default_search_field:Drzal,
>> where default_search_field is the "df" parameter in your search
>> handler ("query" or "select" in solrconfig.xml).
>>
>> Next most frequent thing: Your analysis chain does things you're
>> not expecting. Simple example is whether the analysis lower-cases
>> or not. For this kind of problem, the Admin UI>>core>>analysis page
>> is _really_ your friend.
>>
>> Best,
>> Erick
>>
>

Re: Bug or Operator Error?

Posted by Mark Fenbers <ma...@noaa.gov>.

Additional experimenting led me to the discovery that /dataimport does 
*not* index words with a preceding %20 (a URL-encoded space), or in fact 
*any* preceding %xx encoding.  I can probably replace each %20 with a 
'+' in each record of my database -- the dataimporter/indexer doesn't 
sneeze at those -- but using some sort of encoding is important for 
certain characters such as double and single quotes, because many 
non-alphanumeric characters have special meanings to the shell and/or 
PostgreSQL and need to be escaped.

So now that I know what the issue is, I need to find a work-around. Does 
Solr have any baseline processors that will handle the URL-encoding?  
Being new to Solr, I'm not sure I have the skill to write my own.  Or, 
is there another kind of encoding I can use that Solr doesn't adversely 
react to??

Mark

On 9/11/2015 12:11 PM, Erick Erickson wrote:
> Several ideas, all shots in the dark because to analyze this we
> need the schema definitions and the result of your query with
> &debug=true added. In particular you'll see the "parsed query"
> section near the bottom, and often the parsed query isn't
> quite what you think it is. This is often the issue:
> you query q=Drzal, and this translates into q=default_search_field:Drzal,
> where default_search_field is the "df" parameter in your search
> handler ("query" or "select" in solrconfig.xml).
>
> Next most frequent thing: Your analysis chain does things you're
> not expecting. Simple example is whether the analysis lower-cases
> or not. For this kind of problem, the Admin UI>>core>>analysis page
> is _really_ your friend.
>
> Best,
> Erick
>

Re: Bug or Operator Error?

Posted by Erick Erickson <er...@gmail.com>.

Several ideas, all shots in the dark because to analyze this we
need the schema definitions and the result of your query with
&debug=true added. In particular you'll see the "parsed query"
section near the bottom, and often the parsed query isn't
quite what you think it is. This is often the issue:
you query q=Drzal, and this translates into q=default_search_field:Drzal,
where default_search_field is the "df" parameter in your search
handler ("query" or "select" in solrconfig.xml).
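
A request of that kind can be put together roughly like this. It is a
sketch only: the host, the core name ("mycore") and the field name
("text") are placeholders, not values from your setup, so substitute
your own.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds a /select URL with debug output enabled and an explicit default field.
class DebugQuerySketch {

    static String debugUrl(String baseUrl, String query, String defaultField) {
        return baseUrl + "/select"
                + "?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&df=" + URLEncoder.encode(defaultField, StandardCharsets.UTF_8)
                + "&debug=true";
    }

    public static void main(String[] args) {
        // Placeholder host, core, and field names.
        System.out.println(debugUrl("http://localhost:8983/solr/mycore", "Drzal", "text"));
    }
}
```

Paste the resulting URL into a browser and look for the parsed query in
the debug section of the response.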

Next most frequent thing: Your analysis chain does things you're
not expecting. Simple example is whether the analysis lower-cases
or not. For this kind of problem, the Admin UI>>core>>analysis page
is _really_ your friend.

Best,
Erick

On Fri, Sep 11, 2015 at 7:20 AM, Mark Fenbers <ma...@noaa.gov> wrote:
> Greetings!
>
> So, I've created my first index and am able to search programmatically
> (through SolrJ) and through the Web interface. (Yay!)  I get non-empty
> results for my searches!
>
> My index was built from database records using
> /dataimport?command=full-import.  I have 9936 records in the table to be
> indexed and the import status indicated it processed all 9936.  However, my
> searches only pull up a subset of the records that I know to contain a word.
> For example, I know that there are hundreds of records containing the word
> "Friday", yet my results for my "Friday" query only contain 17 records
> (documents) in the Web interface, and only 10 records from the SolrJ query.
>
> I figure I must be doing something wrong in my query, or have somehow
> indexed improperly.  This might be a clue: My main text field in the
> database table is URL-encoded.  I wouldn't think that would matter, though.
>
> Another example... In one of the documents returned by the "Friday" query
> results, I noticed in the text the name of a co-worker "Drzal".  So, I
> searched on "Drzal" and my results came up with 0 documents.   (!?)
>
> Any ideas where I went wrong??
> Mark
>