Posted to solr-user@lucene.apache.org by Phillip Farber <pf...@umich.edu> on 2007/11/30 01:29:47 UTC

Document field data not getting indexed

Hi,

I have 22 documents. I index these by posting them using LWP::UserAgent 
all with http status 200 OK.

One of my documents (id=44) contains the word "Campeau" in the "ocr" 
field.  But according to luke this term does not appear in the index. 
Yet when I delete the index (delete by query *:* or restart server after 
deleting /index) and index just document id=44 its ocr field data does 
appear in the index according to luke.

Also I notice that the numTerms for 22 documents is 5579 and for just 
the doc id=44 it's 2194.  Hard to believe that 22 documents only 
increase the number of terms by so little.

Why/how could this be happening?

Thanks,

Phil

---

My schema.xml:

    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="extern_id" type="string" indexed="true" stored="true" required="true"/>
    <field name="ocr" type="mytext" indexed="true" stored="false" required="true"/>

where "mytext" is

    <fieldtype name="mytext" class="solr.TextField">
        <analyzer>
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory"
                    splitOnCaseChange="0"
                    generateWordParts="1"
                    generateNumberParts="1"
                    catenateWords="0"
                    catenateNumbers="0"
                    catenateAll="0"
                    />
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldtype>
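
To make the effect of this analyzer chain concrete, here is a rough Python sketch of what it does to a piece of text. It is only an approximation written for illustration: the function name and the sample sentence are made up, and the real WordDelimiterFilterFactory has more rules than the simple "runs of letters or runs of digits" split used here. It does show why a word such as "Campeau" would be expected to end up in the index as the lowercase term "campeau".

```python
import re


def approx_mytext_analyze(text):
    """Rough stand-in for the "mytext" chain (an approximation, not Solr's code)."""
    terms = []
    # WhitespaceTokenizerFactory: split the input on whitespace only.
    for token in text.split():
        # WordDelimiterFilterFactory with generateWordParts=1 and
        # generateNumberParts=1 (no catenation, no split on case change):
        # approximated here as runs of ASCII letters or runs of digits.
        for part in re.findall(r"[A-Za-z]+|[0-9]+", token):
            # LowerCaseFilterFactory: lowercase every emitted part.
            terms.append(part.lower())
    return terms


# Made-up sample text, not taken from the actual document.
print(approx_mytext_analyze("The Campeau offer, 1988-89"))
```

Because the same chain is applied at query time, a search for ocr:Campeau and one for ocr:campeau should look up the same term.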

Indexing 22 docs:
-----------------

<lst name="index">
<int name="numDocs">22</int>
<int name="maxDoc">22</int>
<int name="numTerms">5579</int>
<long name="version">1196382086904</long>
<bool name="optimized">true</bool>
<bool name="current">true</bool>
<bool name="hasDeletions">false</bool>
<date name="lastModified">2007-11-30T00:22:06Z</date>
</lst>
<lst name="fields">
<lst name="ocr">
<str name="type">mytext</str>
<str name="schema">IT-----------</str>
<str name="index">(unstored field)</str>
<int name="docs">22</int>
<int name="distinct">5513</int>
<lst name="topTerms">
[...]
<int name="cally">22</int>
<int name="cam">22</int>
<int name="cammi">22</int>  ???<<<<<<<<<<<<<<<<
<int name="cams">22</int>
<int name="can">22</int>


Indexing just doc id=44:
------------------------

<lst name="index">
<int name="numDocs">1</int>
<int name="maxDoc">1</int>
<int name="numTerms">2194</int>
<long name="version">1196381821086</long>
<bool name="optimized">true</bool>
<bool name="current">true</bool>
<bool name="hasDeletions">false</bool>
<date name="lastModified">2007-11-30T00:17:21Z</date>
</lst>
<lst name="fields">
<lst name="ocr">
<str name="type">mytext</str>
<str name="schema">IT-----------</str>
<str name="index">(unstored field)</str>
<int name="docs">1</int>
<int name="distinct">2191</int>
<lst name="topTerms">
[...]
<int name="called">1</int>
<int name="came">1</int>
<int name="camerons">1</int>
<int name="campeau">1</int>  <<<<<<<<<<<<<<
<int name="can">1</int>
<int name="canadian">1</int>
<int name="canal">1</int>




Re: Document field data not getting indexed

Posted by Yonik Seeley <yo...@apache.org>.
On Nov 30, 2007 9:03 AM, Phillip Farber <pf...@umich.edu> wrote:
> I'm using numTerms=2000 in the luke url so I am seeing all the items in
> the index or at least up through c in the alphabet:

If Luke is sorting by high term, you wouldn't necessarily see it.
Regardless, the search you did below showed that it probably wasn't in
the index.

Perhaps you have more than one id:44 document and the other overwrites
the one with campeau in it?
On your big index, try a query for id:44 and see if the other stored
fields (like extern_id) match what you expect.
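
One way to check for this before posting is to look for repeated ids in the batch itself. The Python sketch below assumes that "id" is the schema's unique key (the uniqueKey setting is not shown in this thread) and that each document is available as a dict; the function names and the sample batch are hypothetical. The second function only mimics the "last document with a given key wins" behaviour being described, it does not talk to Solr.

```python
from collections import Counter


def find_duplicate_ids(docs):
    """Return the ids that occur more than once in a batch of documents."""
    counts = Counter(doc["id"] for doc in docs)
    return sorted(doc_id for doc_id, n in counts.items() if n > 1)


def apply_overwrites(docs):
    """Mimic overwrite-by-unique-key: a later doc replaces an earlier one."""
    index = {}
    for doc in docs:
        index[doc["id"]] = doc
    return index


# Hypothetical batch in which a processing bug assigns id 44 twice.
batch = [
    {"id": "43", "ocr": "some other page"},
    {"id": "44", "ocr": "text mentioning Campeau"},
    {"id": "44", "ocr": "text from a different document"},
]

print(find_duplicate_ids(batch))
print(apply_overwrites(batch)["44"]["ocr"])
```

If the first call reports any ids, the problem is on the indexing side rather than in the analyzer or in Luke.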

-Yonik

Re: Document field data not getting indexed

Posted by Phillip Farber <pf...@umich.edu>.
Well, this one falls into the category of bald-faced embarrassment. It's 
a bug in my process.  Thanks to all for taking the time to respond. 
Have I said how great solr support is?  :-)

Phil



Re: Document field data not getting indexed

Posted by Phillip Farber <pf...@umich.edu>.
Hi Yonik, Hoss, et al.

I'm using numTerms=2000 in the luke url so I am seeing all the items in 
the index or at least up through c in the alphabet:

http://localhost:8983/solr/admin/luke?fl=ocr&numTerms=2000

When I index all 22 of my documents including doc id=44 which contains 
the word "Campeau" it is not in the index:

Luke:

<int name="call">22</int>
<int name="called">22</int>
<int name="calls">22</int>
<int name="cally">22</int>
<int name="cam">22</int>
<int name="cammi">22</int> <<<<<???
<int name="cams">22</int>
<int name="can">22</int>

and my search ocr:campeau does not return it:

<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">82</int>
  <lst name="params">
   <str name="indent">on</str>
   <str name="start">0</str>
   <str name="q">ocr:campeau</str>
   <str name="version">2.2</str>
   <str name="rows">10</str>
  </lst>
</lst>
<result name="response" numFound="0" start="0"/>
</response>


When I delete data/index, restart solr, and index just doc id=44 
using the same process as for the 22 docs, Campeau *is* in the index and 
I can retrieve it:

<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">90</int>
  <lst name="params">
   <str name="indent">on</str>
   <str name="start">0</str>
   <str name="q">ocr:campeau</str>
   <str name="version">2.2</str>
   <str name="rows">10</str>
  </lst>
</lst>
<result name="response" numFound="1" start="0">
  <doc>
   <str name="extern_id">mdp.39015015394847</str>
   <str name="id">44</str>
   <date name="timestamp">2007-11-30T13:59:45.783Z</date>
  </doc>
</result>
</response>

Luke:

<int name="call">1</int>
<int name="called">1</int>
<int name="came">1</int>
<int name="camerons">1</int>
<int name="campeau">1</int>  <<<<<<<<<<<<<
<int name="can">1</int>
<int name="canadian">1</int>


Yonik Seeley wrote:
> On Nov 29, 2007 7:29 PM, Phillip Farber <pf...@umich.edu> wrote:
>> One of my documents (id=44) contains the word "Campeau" in the "ocr"
>> field.  But according to luke this term does not appear in the index.
> 
> AFAIK the Luke handler lists the top terms, not necessarily all of them.
> Do a search for ocr:Campeau and see if it returns anything.
> 
> -Yonik

Re: Document field data not getting indexed

Posted by Yonik Seeley <yo...@apache.org>.
On Nov 29, 2007 7:29 PM, Phillip Farber <pf...@umich.edu> wrote:
> One of my documents (id=44) contains the word "Campeau" in the "ocr"
> field.  But according to luke this term does not appear in the index.

AFAIK the Luke handler lists the top terms, not necessarily all of them.
Do a search for ocr:Campeau and see if it returns anything.
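
A small Python sketch of that check, for anyone who wants to script it. It assumes the standard /select handler under the same http://localhost:8983/solr base used for the Luke URL elsewhere in this thread, and the XML response shape shown there; the function names are made up for illustration. The network call itself is left out so the snippet runs without a server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_select_url(base, field, term, rows=10):
    """Build a query URL for field:term (the /select path is an assumption)."""
    params = {"q": f"{field}:{term}", "start": 0, "rows": rows, "indent": "on"}
    return f"{base}/select?{urlencode(params)}"


def num_found(response_xml):
    """Read numFound from the <result> element of an XML search response."""
    root = ET.fromstring(response_xml)
    return int(root.find("result").get("numFound"))


print(build_select_url("http://localhost:8983/solr", "ocr", "campeau"))

# Trimmed-down response of the kind a query with no hits returns.
empty = '<response><result name="response" numFound="0" start="0"/></response>'
print(num_found(empty))
```

A numFound of 0 for the term means it is not searchable in that field, whatever the top-terms listing shows.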

-Yonik

Re: Document field data not getting indexed

Posted by Chris Hostetter <ho...@fucit.org>.
see yonik's comments regarding Luke and whether or not your term is 
indexed, as for this point....

: Also I notice that the numTerms for 22 documents is 5579 and for just the doc
: id=44 it's 2194.  Hard to believe that 22 documents only increase the number
: of terms by so little.

this is not surprising.  numTerms is the number of *unique* terms, 
independent of how many documents each term appears in -- if the word 
"eclipse" appears in the ocr field of 17 documents a total of 457 times, 
it is still only counted once in numTerms.
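
A tiny Python sketch of the same idea, using made-up term lists rather than real index data: the count of unique terms is the size of the union of each document's vocabulary, so adding documents that mostly reuse the same words barely moves it.

```python
def count_unique_terms(docs):
    """Count distinct terms across documents, ignoring repeats and frequency."""
    vocabulary = set()
    for terms in docs:
        vocabulary.update(terms)
    return len(vocabulary)


# Hypothetical analyzed documents (lists of terms).
doc_a = ["eclipse", "of", "the", "sun"]
doc_b = ["the", "eclipse", "eclipse", "moon"]

print(count_unique_terms([doc_a]))          # doc_a alone
print(count_unique_terms([doc_a, doc_b]))   # both: only "moon" is new
```

Here the second document contributes four tokens but only one new unique term.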


-Hoss