Posted to solr-user@lucene.apache.org by cloax <jo...@joekondel.com> on 2009/06/20 01:20:14 UTC

ExtractRequestHandler - not properly indexing office docs?

Hi there, 

I've got a Solr instance running and am feeding it rich binary documents to
index from a Django application. The setup works just fine with PDFs, etc.,
but no matter what type of MS Word document (doc or docx) I feed it, I
can't get any results when searching for content-related queries.

I've curl'd with extract.only to verify that Solr (and Tika) could extract
the contents, and it happily enough spits back the extracted XHTML to me.
That content never seems to find its way into the ext.def.fl that I have
specified.

When I go and search for terms specific to content in those documents, I get
zero hits. However, I get hits on metadata-related queries (e.g. I store the
username of who uploaded it, etc.).

Is there some magical bit I forgot to flip?

cheers,
joe
-- 
View this message in context: http://www.nabble.com/ExtractRequestHandler---not-properly-indexing-office-docs--tp24120125p24120125.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: ExtractRequestHandler - not properly indexing office docs?

Posted by Grant Ingersoll <gs...@apache.org>.

Can you change the text field to be stored and then point the  
LukeRequestHandler at that field (/admin/luke) and report back?  Also,  
can you post your full schema and config?

Finally, can you get the example to work?
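
A minimal sketch of the Luke check suggested above, assuming the base URL
used by the curl commands in this thread (http://localhost:8983/solr) and the
LukeRequestHandler's `fl` and `numTerms` parameters. The helper name
`luke_url` is hypothetical. It only builds the URL; fetching it should show
whether any terms were indexed into the field.

```python
from urllib.parse import urlencode


def luke_url(base_url, field, num_terms=20):
    """Build a LukeRequestHandler URL that reports on a single field.

    base_url and field are illustrative values, not taken from a real setup.
    """
    params = {"fl": field, "numTerms": num_terms}
    return base_url.rstrip("/") + "/admin/luke?" + urlencode(params)


if __name__ == "__main__":
    # Inspect the 'text' field that the extracted content is supposed to land in.
    print(luke_url("http://localhost:8983/solr", "text"))
```

If the field shows no terms after indexing a Word document, the content is
being dropped before it reaches the index rather than at query time.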


On Jun 23, 2009, at 1:41 AM, cloax wrote:

>
> I've tried 'text' (taken from the example config) and then tried creating a
> new field called doc_content and using that. Neither has worked.
>
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

Re: ExtractRequestHandler - not properly indexing office docs?

Posted by cloax <jo...@joekondel.com>.

I've tried 'text' (taken from the example config) and then tried creating a
new field called doc_content and using that. Neither has worked.
 

Grant Ingersoll-6 wrote:
> 
> What's your default search field?
> 

Re: ExtractRequestHandler - not properly indexing office docs?

Posted by Grant Ingersoll <gs...@apache.org>.

What's your default search field?

On Jun 22, 2009, at 12:29 PM, cloax wrote:

>
> Yep, I've tried both of those and still no joy. Here's both my curl statement
> and the resulting Solr log output.
>
> curl "http://localhost:8983/solr/update/extract?ext.def.fl=text&ext.literal.id=1&ext.map.div=text&ext.capture=div" \
>   -F "myfile=@dj_character.doc"
>
> Curl's output:
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader"><int name="status">0</int><int
> name="QTime">317</int></lst>
> </response>
>
> Solr log:
> Jun 22, 2009 12:21:42 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update/extract
> params={ext.map.div=text&ext.def.fl=text&ext.capture=div&ext.literal.id=1}
> status=0 QTime=544
> Jun 22, 2009 12:22:26 PM org.apache.solr.update.processor.LogUpdateProcessor
> finish
> INFO: {add=[1]} 0 317
> Jun 22, 2009 12:22:26 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update/extract
> params={ext.map.div=text&ext.def.fl=text&ext.capture=div&ext.literal.id=1}
> status=0 QTime=317
> Jun 22, 2009 12:22:37 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/select
> params={wt=standard&rows=10&start=0&explainOther=&hl.fl=&indent=on&q=kondel&fl=*,score&qt=standard&version=2.2}
> hits=0 status=0 QTime=2
>
> The submitted document has "kondel" in it numerous times, so Solr should
> have a hit. Yet it returns nothing. I also made sure I committed, but that
> didn't seem to help either.
>
>

Re: ExtractRequestHandler - not properly indexing office docs?

Posted by cloax <jo...@joekondel.com>.

Yep, I've tried both of those and still no joy. Here's both my curl statement
and the resulting Solr log output. 

curl "http://localhost:8983/solr/update/extract?ext.def.fl=text&ext.literal.id=1&ext.map.div=text&ext.capture=div" \
  -F "myfile=@dj_character.doc"

Curl's output:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int
name="QTime">317</int></lst>
</response>

Solr log:
Jun 22, 2009 12:21:42 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/extract
params={ext.map.div=text&ext.def.fl=text&ext.capture=div&ext.literal.id=1}
status=0 QTime=544 
Jun 22, 2009 12:22:26 PM org.apache.solr.update.processor.LogUpdateProcessor
finish
INFO: {add=[1]} 0 317
Jun 22, 2009 12:22:26 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/extract
params={ext.map.div=text&ext.def.fl=text&ext.capture=div&ext.literal.id=1}
status=0 QTime=317 
Jun 22, 2009 12:22:37 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select
params={wt=standard&rows=10&start=0&explainOther=&hl.fl=&indent=on&q=kondel&fl=*,score&qt=standard&version=2.2}
hits=0 status=0 QTime=2

The submitted document has "kondel" in it numerous times, so Solr should
have a hit. Yet it returns nothing. I also made sure I committed, but that
didn't seem to help either.
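
A minimal sketch of how the extract URL above could be assembled from the
Django side, so that the `&` separators never need shell escaping. The
`ext.*` parameter names are the ones used in this thread; the helper name
`extract_url` and the optional `commit=true` parameter are assumptions for
illustration, not something confirmed for this Solr build.

```python
from urllib.parse import urlencode


def extract_url(base_url, doc_id, content_field="text", commit=True):
    """Build an /update/extract URL using the ext.* parameters from this thread."""
    params = [
        ("ext.def.fl", content_field),    # default field for extracted content
        ("ext.literal.id", doc_id),       # literal value for the unique key
        ("ext.map.div", content_field),   # map captured <div> content to the field
        ("ext.capture", "div"),           # capture <div> elements
    ]
    if commit:
        # Assumption: the handler honours commit=true like other update handlers.
        params.append(("commit", "true"))
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    print(extract_url("http://localhost:8983/solr", "1"))
```

Passing the printed URL to curl inside double quotes keeps the shell from
treating each `&` as a background operator.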


Grant Ingersoll-6 wrote:
> 
> Do you have a default field declared?  &ext.default.fl=<FIELD NAME>    
> Either that, or you need to explicitly capture the fields you are  
> interested in using &ext.capture=<FIELD NAME>
> 
> You could add this to your curl statement to try out.
> 
> -Grant
> 



Re: ExtractRequestHandler - not properly indexing office docs?

Posted by Grant Ingersoll <gs...@apache.org>.

Do you have a default field declared?  &ext.default.fl=<FIELD NAME>    
Either that, or you need to explicitly capture the fields you are  
interested in using &ext.capture=<FIELD NAME>

You could add this to your curl statement to try out.

-Grant

On Jun 20, 2009, at 8:41 AM, cloax wrote:

>
> Thanks for the quick response.
>
> Here are the fields from the schema:
>
> <field name="id" type="string" indexed="true" stored="true" required="true"/>
> <field name="original_name" type="text" indexed="true" stored="true"/>
> <field name="current" type="boolean" indexed="true" stored="true"/>
> <field name="file_association" type="sint" indexed="true" stored="true"/>
> <field name="uploaded_by_user" type="text" indexed="true" stored="true"/>
> <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
>
>
> I use text as the content field, i.e. the default field for the ERH.
>
> Here's the config of the ERH:
>
> <requestHandler name="/update/extract"
> class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
>  <lst name="defaults">
>    <str name="ext.map.Last-Modified">last_modified</str>
>    <bool name="ext.ignore.und.fl">true</bool>
>  </lst>
> </requestHandler>

Re: ExtractRequestHandler - not properly indexing office docs?

Posted by cloax <jo...@joekondel.com>.

Thanks for the quick response.

Here are the fields from the schema:

 <field name="id" type="string" indexed="true" stored="true" required="true"/>
 <field name="original_name" type="text" indexed="true" stored="true"/>
 <field name="current" type="boolean" indexed="true" stored="true"/>
 <field name="file_association" type="sint" indexed="true" stored="true"/>
 <field name="uploaded_by_user" type="text" indexed="true" stored="true"/>
 <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>


I use text as the content field, i.e. the default field for the ERH.

Here's the config of the ERH:

<requestHandler name="/update/extract"
class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="ext.map.Last-Modified">last_modified</str>
    <bool name="ext.ignore.und.fl">true</bool>
  </lst>
</requestHandler>
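
A small sketch, assuming the field definitions above are wrapped in a
`<fields>` element as in a stock schema.xml, that lists which fields are not
stored. The helper name `unstored_fields` is hypothetical. An unstored field
can still be searched if it is indexed, but its value never comes back in
results, so its contents cannot be checked by looking at a returned document.

```python
import xml.etree.ElementTree as ET

# The field definitions from this message, wrapped in <fields> (an assumption).
SCHEMA_FIELDS = """
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="original_name" type="text" indexed="true" stored="true"/>
  <field name="current" type="boolean" indexed="true" stored="true"/>
  <field name="file_association" type="sint" indexed="true" stored="true"/>
  <field name="uploaded_by_user" type="text" indexed="true" stored="true"/>
  <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
</fields>
"""


def unstored_fields(fields_xml):
    """Return the names of fields whose stored attribute is not "true"."""
    root = ET.fromstring(fields_xml.strip())
    return [f.get("name") for f in root.iter("field") if f.get("stored") != "true"]


if __name__ == "__main__":
    print(unstored_fields(SCHEMA_FIELDS))
```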

Here's the output of a curl request w/ the file:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int
name="QTime">650</int></lst><str name="afetest.docx">&lt;?xml version="1.0"
encoding="UTF-8"?&gt;
&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
  &lt;head&gt;
      &lt;title/&gt;
  &lt;/head&gt;
  &lt;body&gt;
      &lt;div class="package-entry"&gt;
&lt;h1&gt;[Content_Types].xml&lt;/h1&gt;
&lt;p
          xmlns="http://www.w3.org/1999/xhtml"/&gt;

&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;_rels/.rels&lt;/h1&gt;
&lt;p
          xmlns="http://www.w3.org/1999/xhtml"&gt;&amp;lt;?xml version="1.0"
encoding="UTF-8" standalone="yes"?&amp;gt;&amp;#xd;
&amp;lt;Relationships
xmlns="http://schemas.openxmlformats.org/package/2006/relationships"&amp;gt;&amp;lt;Relationship
Id="rId4"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties"
Target="docProps/app.xml"/&amp;gt;&amp;lt;Relationship Id="rId1"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
Target="word/document.xml"/&amp;gt;&amp;lt;Relationship Id="rId2"
Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/thumbnail"
Target="docProps/thumbnail.jpeg"/&amp;gt;&amp;lt;Relationship Id="rId3"
Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties"
Target="docProps/core.xml"/&amp;gt;&amp;lt;/Relationships&amp;gt;&lt;/p&gt;

&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;word/_rels/document.xml.rels&lt;/h1&gt;
&lt;p
          xmlns="http://www.w3.org/1999/xhtml"&gt;&amp;lt;?xml version="1.0"
encoding="UTF-8" standalone="yes"?&amp;gt;&amp;#xd;
&amp;lt;Relationships
xmlns="http://schemas.openxmlformats.org/package/2006/relationships"&amp;gt;&amp;lt;Relationship
Id="rId4"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable"
Target="fontTable.xml"/&amp;gt;&amp;lt;Relationship Id="rId1"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles"
Target="styles.xml"/&amp;gt;&amp;lt;Relationship Id="rId2"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings"
Target="settings.xml"/&amp;gt;&amp;lt;Relationship Id="rId3"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings"
Target="webSettings.xml"/&amp;gt;&amp;lt;Relationship Id="rId5"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme"
Target="theme/theme1.xml"/&amp;gt;&amp;lt;/Relationships&amp;gt;&lt;/p&gt;

&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;word/document.xml&lt;/h1&gt;
&lt;p
          xmlns="http://www.w3.org/1999/xhtml"&gt;Lorem ipsum dolor sit
amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut
labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud
exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis
aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu
fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt
in culpa qui officia deserunt mollit anim id est laborum&lt;/p&gt;

&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;word/theme/theme1.xml&lt;/h1&gt;
&lt;p
          xmlns="http://www.w3.org/1999/xhtml"/&gt;

&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;docProps/thumbnail.jpeg&lt;/h1&gt;
&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;word/settings.xml&lt;/h1&gt;
&lt;p
          xmlns="http://www.w3.org/1999/xhtml"/&gt;

&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;word/fontTable.xml&lt;/h1&gt;
&lt;p
          xmlns="http://www.w3.org/1999/xhtml"/&gt;

&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;word/webSettings.xml&lt;/h1&gt;
&lt;p
          xmlns="http://www.w3.org/1999/xhtml"/&gt;

&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;docProps/core.xml&lt;/h1&gt;
&lt;p
          xmlns="http://www.w3.org/1999/xhtml"&gt;Joe
Doe12009-06-17T20:29:00Z2009-06-17T20:41:00Z&lt;/p&gt;

&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;word/styles.xml&lt;/h1&gt;
&lt;p
          xmlns="http://www.w3.org/1999/xhtml"/&gt;

&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;docProps/app.xml&lt;/h1&gt;
&lt;p xmlns="http://www.w3.org/1999/xhtml"&gt;Normal.dotm1100Microsoft
Macintosh Word011false10genfalse0falsefalse12.0000&lt;/p&gt;

&lt;/div&gt;
&lt;/body&gt;
&lt;/html&gt;
</str><lst name="afetest.docx_metadata"><arr
name="stream_source_info"><str>myfile</str></arr><arr
name="stream_name"><str>afetest.docx</str></arr><arr
name="stream_content_type"><str>application/octet-stream</str></arr><arr
name="Content-Type"><str>application/zip</str></arr><arr
name="stream_size"><str>38200</str></arr></lst>
</response>
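
A rough sketch for reading output like the above: it groups the extracted
text by package entry, which makes it easier to see that the body text sits
under `word/document.xml`. It assumes the XHTML has already been unescaped
from the Solr response; the sample string is a shortened stand-in for the
real output, and the helper name `package_entries` is hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Shortened, hand-written stand-in for the extracted XHTML shown above.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="package-entry">
      <h1>word/document.xml</h1>
      <p>Lorem ipsum dolor sit amet</p>
    </div>
    <div class="package-entry">
      <h1>docProps/core.xml</h1>
      <p>Joe Doe</p>
    </div>
  </body>
</html>"""


def package_entries(xhtml):
    """Map each package entry's <h1> name to the joined text of its <p> elements."""
    root = ET.fromstring(xhtml)
    entries = {}
    for div in root.iter(XHTML_NS + "div"):
        if div.get("class") != "package-entry":
            continue
        name = div.findtext(XHTML_NS + "h1", default="").strip()
        paragraphs = div.findall(XHTML_NS + "p")
        entries[name] = " ".join("".join(p.itertext()).strip() for p in paragraphs)
    return entries


if __name__ == "__main__":
    for name, text in package_entries(SAMPLE_XHTML).items():
        print(name, "->", text)
```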

Query looks like:

INFO: [] webapp=/solr path=/select
params={wt=standard&rows=10&start=0&explainOther=&hl.fl=&indent=on&q=text:laborum+AND+uploaded_by_user:joe&fl=*,score&qt=standard&version=2.2}
hits=0 status=0 QTime=3

Please note that searching solely by "uploaded_by_user:joe" will properly
return the document.
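
A minimal sketch of running each clause of that query on its own, which is
how the difference between the content clause and the metadata clause shows
up. The base URL is the one from the curl commands in this thread; the helper
names `select_url` and `query_of` are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def select_url(base_url, query, rows=10):
    """Build a /select URL for a query, asking for all fields plus the score."""
    params = {"q": query, "fl": "*,score", "rows": rows}
    return base_url.rstrip("/") + "/select?" + urlencode(params)


def query_of(url):
    """Recover the decoded q parameter from a select URL (to check the encoding)."""
    return parse_qs(urlsplit(url).query)["q"][0]


if __name__ == "__main__":
    base = "http://localhost:8983/solr"
    clauses = (
        "text:laborum",
        "uploaded_by_user:joe",
        "text:laborum AND uploaded_by_user:joe",
    )
    for clause in clauses:
        print(select_url(base, clause))
```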

Thanks again.

-joe


Grant Ingersoll-6 wrote:
> 
> Can you share your schema for the fields you are indexing, the  
> configuration of the ExtractingRequestHandler and what your requests  
> look like?  Also, can you share what the output of the extract only  
> stuff looks like?
> 
> Also, can you post .doc files to the example per
> http://wiki.apache.org/solr/ExtractingRequestHandler ?  I was able to do
> that and search for the doc that I entered, and it was able to handle both
> .doc and .docx.
> 
> -Grant
> 
> 

Re: ExtractRequestHandler - not properly indexing office docs?

Posted by Grant Ingersoll <gs...@apache.org>.

Can you share your schema for the fields you are indexing, the  
configuration of the ExtractingRequestHandler and what your requests  
look like?  Also, can you share what the output of the extract only  
stuff looks like?

Also, can you post .doc files to the example per
http://wiki.apache.org/solr/ExtractingRequestHandler ?  I was able to do
that and search for the doc that I entered, and it was able to handle both
.doc and .docx.

-Grant

On Jun 19, 2009, at 7:20 PM, cloax wrote:

>
> Hi there,
>
> I've got a Solr instance running and am feeding it rich binary documents to
> index from a Django application. The setup works just fine with PDFs, etc.,
> but no matter what type of MS Word document (doc or docx) I feed it, I
> can't get any results when searching for content-related queries.
>
> I've curl'd with extract.only to verify that Solr (and Tika) could extract
> the contents, and it happily enough spits back the extracted XHTML to me.
> That content never seems to find its way into the ext.def.fl that I have
> specified.
>
> When I go and search for terms specific to content in those documents, I get
> zero hits. However, I get hits on metadata-related queries (e.g. I store the
> username of who uploaded it, etc.).
>
> Is there some magical bit I forgot to flip?
>
> cheers,
> joe