Posted to solr-user@lucene.apache.org by Theodan <da...@theodan.com> on 2007/03/28 18:02:12 UTC

Solr finding doc by one field but not by another

Hi everyone.

Can anyone explain how this might happen?  I query by the "ID" field and get
the following result:

=========================================================
<?xml version="1.0" encoding="UTF-8" ?> 
<response>
	<lst name="responseHeader">
		<int name="status">0</int> 
		<int name="QTime">16</int> 
		<lst name="params">
			<str name="q">ID:ee483237-399c-4b17-ad73-000cc54fd3e1</str> 
		</lst>
	</lst>
	<result name="response" numFound="1" start="0">
		<doc>
			<str name="AllowedApplications">COSMEO US</str> 
			<str name="Audiences" /> 
			<str name="DefaultURL" /> 
			<str name="FileType" /> 
			<str name="HighGrade" /> 
			<str name="ID">ee483237-399c-4b17-ad73-000cc54fd3e1</str> 
			<str name="IsClosedCaptioned" /> 
			<str name="Language">en-US</str> 
			<str name="LargeIcon" /> 
			<str name="LaunchIcon" /> 
			<str name="LowGrade" /> 
			<str name="MediaGroups" /> 
			<str name="Producer" /> 
			<str name="Provider" /> 
			<str name="Publisher" /> 
			<str name="SmallIcon" /> 
			<str name="Taxonomy">Social Studies American History Historical Periods
Expansion and Reform 1801-1861 Territorial Expansion</str> 
			<str name="TitleEvent" /> 
			<str name="TitleLength" /> 
			<str name="TitleLocation" /> 
			<str name="TitleParticipant" /> 
			<str name="Type">EncyclopediaArticles</str> 
			<str name="concepts" /> 
			<str name="copyright">2005</str> 
			<str name="description">Pony Express was a mail service operating between
Saint Joseph, Mo., and Sacramento, Calif., inaugurated on April 3, 1860,
under the direction of the Central Overland California and Pike's Peak
Express Co.</str> 
			<str name="editable">True</str> 
			<str name="keywords" /> 
			<str name="spanish" /> 
			<str name="title">Pony Express</str> 
			<str name="vocabulary">pony express</str> 
		</doc>
	</result>
</response>
=========================================================

Then I query by the "title" field from the result above (so I know the
document is in the index and has been committed), and I get zero results:

=========================================================
<?xml version="1.0" encoding="UTF-8" ?> 
<response>
	<lst name="responseHeader">
		<int name="status">0</int> 
		<int name="QTime">0</int> 
		<lst name="params">
			<str name="q">title:"Pony Express"</str> 
		</lst>
	</lst>
	<result name="response" numFound="0" start="0" />
</response>
=========================================================

"ID" is not the only field that I can find the doc by.  Searching for
"Type:encyclopediaarticles" finds it too.  Also, "title" is not the only
field that misses the doc.  A search by "vocabulary" misses it too.  I
haven't tried all the fields yet to see exhaustively which ones find it and
which ones don't.  I can do that if it would help.

For what it's worth, I started with an existing Lucene index and modified
Solr's schema.xml so that I could just use the Lucene index in Solr.  That
Lucene index had about 230K docs.  I then used your "post.jar" to post
another 10K docs to the index after starting up the server.  Those 10K docs
only had 7 of the 30 fields that the original 230K docs had.  Could that be
the problem?  I am noticing that the docs that I'm having problems with are
from the original 230K-doc index, not from my subsequent 10K-doc post.  The
10K docs seem to be findable by any of their 7 fields.

Here are my config files:
http://www.nabble.com/file/7488/schema.xml schema.xml 
http://www.nabble.com/file/7489/solrconfig.xml solrconfig.xml 

Any help is greatly appreciated.

Thanks,
-Dan
-- 
View this message in context: http://www.nabble.com/Solr-finding-doc-by-one-field-but-not-by-another-tf3481287.html#a9716918
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr finding doc by one field but not by another

Posted by Theodan <da...@theodan.com>.

Mike Klaas wrote:
> 
> This is almost certainly due to a mismatch between the index- and
> query-time analysis of the fields.  For instance, your schema defines
> the title field to be "string" (unanalyzed), but it is likely that
> some tokenization (perhaps via StandardAnalyzer) occurred in the
> original index.
> 

Yep, that was exactly the problem.  I changed all of my field types from
"string" to "text", but queries still didn't return the right results.  So I
asked the guy who created the Lucene index which analyzer he used: it was
StandardAnalyzer, whereas the "text" type I had switched to uses the more
elaborate default analyzer setup that Solr ships with in schema.xml.  So I
changed my schema.xml to use just StandardAnalyzer, and the searches now
seem to be returning the expected results.
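
For anyone who finds this thread later, here is a minimal sketch of the kind
of change described above.  It is not copied from my actual schema.xml: the
type name "text_std" is made up for illustration, and the field attributes
are assumptions.  A single <analyzer> element with no "type" attribute is
applied at both index time and query time, which is what keeps the two sides
consistent with an index that was built with StandardAnalyzer.

```xml
<!-- Goes inside the <types> section of schema.xml. -->
<!-- "text_std" is a hypothetical name; the analyzer class is Lucene's StandardAnalyzer. -->
<fieldtype name="text_std" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
</fieldtype>

<!-- Goes inside the <fields> section. Attribute values are assumptions. -->
<field name="title" type="text_std" indexed="true" stored="true"/>
<field name="vocabulary" type="text_std" indexed="true" stored="true"/>
```

Only the fields that were tokenized in the original Lucene index should need
this type; fields that were indexed as single untokenized values can stay as
"string".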

-Dan


Re: Solr finding doc by one field but not by another

Posted by Mike Klaas <mi...@gmail.com>.
On 3/28/07, Theodan <da...@theodan.com> wrote:

> For what it's worth, I started with an existing Lucene index and modified
> Solr's schema.xml so that I could just use the Lucene index in Solr.  That
> Lucene index had about 230K docs.  I then used your "post.jar" to post
> another 10K docs to the index after starting up the server.  Those 10K docs
> only had 7 of the 30 fields that the original 230K docs had.  Could that be
> the problem?  I am noticing that the docs that I'm having problems with are
> from the original 230K-doc index, not from my subsequent 10K-doc post.  The
> 10K docs seem to be findable by any of their 7 fields.

This is almost certainly due to a mismatch between the index- and
query-time analysis of the fields.  For instance, your schema defines
the title field to be "string" (unanalyzed), but it is likely that
some tokenization (perhaps via StandardAnalyzer) occurred in the
original index.
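
As a rough illustration of the mismatch (a sketch modeled on the stock
example schema, not copied from the attached schema.xml, so the attribute
values are assumptions): with a definition like the one below, the query
title:"Pony Express" is looked up as the single exact term "Pony Express".
If the original index was built with something like StandardAnalyzer, the
terms actually stored for that field would be "pony" and "express", so the
lookup finds nothing.

```xml
<!-- "string" maps to StrField: the whole value is one term, -->
<!-- with no tokenization and no lowercasing at query time. -->
<fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

<!-- A title field declared this way is queried as a single exact term. -->
<field name="title" type="string" indexed="true" stored="true"/>
```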

-Mike