Posted to solr-user@lucene.apache.org by Roman K <ro...@gmail.com> on 2012/04/16 14:14:32 UTC

Issue with Solr 3.5 while using TikaEntityProcessor on .docx files

Hello,
I am running some tests to see whether we can use Solr in our organization.
I have to be able to process MS Word .docx files and then search them as 
if they were plain text.

The problem is that when processing the .docx files, the result I get 
when running the *:* query is:

<arr name="text"><str>_rels/.rels

word/fontTable.xml

word/_rels/document.xml.rels

word/document.xml

word/styles.xml

docProps/app.xml

docProps/core.xml

[Content_Types].xml

</str></arr>

which are the names of the XML files zipped inside the .docx archive.
For regular .doc/.odt files everything works great and I get the text 
from inside the document.
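(For context: a .docx file is just a ZIP archive, and the strings above are exactly its entry names. A small stdlib-only Python sketch, using an in-memory archive with the same layout as a minimal .docx, shows that listing the ZIP entries reproduces the indexed output, i.e. Tika 0.10 apparently fell back to treating the file as a plain ZIP instead of parsing it as OOXML:)

```python
import io
import zipfile

# Entry layout of a minimal .docx, matching the strings that ended
# up in the indexed "text" field (real files contain more markup).
entries = [
    "_rels/.rels",
    "word/fontTable.xml",
    "word/_rels/document.xml.rels",
    "word/document.xml",
    "word/styles.xml",
    "docProps/app.xml",
    "docProps/core.xml",
    "[Content_Types].xml",
]

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for name in entries:
        zf.writestr(name, "<xml/>")

# Listing the archive yields exactly the entry names above.
with zipfile.ZipFile(buf) as zf:
    print(zf.namelist())
```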

I am using a slightly modified version of the example that comes with 
the Solr 3.5 download.
My tika-data-config file is:

<dataConfig>
<dataSource type="BinFileDataSource" name="bin"/>
<document>
<entity name="f" processor="FileListEntityProcessor" recursive="true"
                 rootEntity="false"
                 dataSource="null" baseDir="/myDir/Documents"
                 fileName=".*\.(docx)|(DOCX)" onError="skip">
<entity name="tika-test" processor="TikaEntityProcessor" 
url="${f.fileAbsolutePath}" dataSource="bin" format="text">
<field column="text" name="text"/>
</entity>
</entity>
</document>
</dataConfig>
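(One thing worth double-checking in the config above, independent of the Tika problem: in a regex like `.*\.(docx)|(DOCX)` the alternation binds loosely, so the pattern means "ends in .docx" OR "the literal name DOCX", not "ends in .docx or .DOCX". Python's `re` behaves the same way as Java here; a quick sanity check with hypothetical file names, assuming the processor does full-string matching like Java's `String.matches()`:)

```python
import re

# Pattern as posted: the | splits the whole regex in two.
posted = re.compile(r".*\.(docx)|(DOCX)")
# Pattern with the alternation scoped inside the group.
grouped = re.compile(r".*\.(docx|DOCX)")

# fullmatch approximates Java's matches(); how FileListEntityProcessor
# applies the pattern is an assumption here, but the scoping issue is
# the same either way.
for name in ["report.docx", "REPORT.DOCX", "DOCX"]:
    print(name, bool(posted.fullmatch(name)), bool(grouped.fullmatch(name)))
```

With the posted pattern, `REPORT.DOCX` does not match at all, while the bare name `DOCX` does; the grouped variant matches both extensions and rejects `DOCX`.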

The "text" fieldType and field from schema.xml look like:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
generateNumberParts="1" catenateWords="1" catenateNumbers="1" 
catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>

<fields>
<field name="text" type="text" indexed="true" stored="true" 
multiValued="true"/>
</fields>
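(As a rough illustration of what that index-time chain does, here is a plain-Python approximation; it covers only the whitespace tokenizer and lowercase filter, since WordDelimiterFilter and the Porter stemmer have far more rules than fit in a sketch:)

```python
def analyze(text: str) -> list[str]:
    # WhitespaceTokenizerFactory: split on runs of whitespace.
    tokens = text.split()
    # LowerCaseFilterFactory: lowercase every token.
    tokens = [t.lower() for t in tokens]
    # RemoveDuplicatesTokenFilter only drops duplicate tokens at the
    # same position, which cannot arise in this simplified chain.
    return tokens

print(analyze("Solr Indexes DOCX Files"))
# → ['solr', 'indexes', 'docx', 'files']
```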

The Tika version used is 0.10 (the default that came with Solr 3.5); 
downgrading to 0.9 didn't help.
The same issue occurs with .docx files saved from MS Word 2007/2010 and 
from LibreOffice Writer, on both Windows and Ubuntu.
Regular .doc/.odt files work perfectly.


Thanks in advance for your help,
Roman.


Re: Issue with Solr 3.5 while using TikaEntityProcessor on .docx files

Posted by Roman K <ro...@gmail.com>.
On 04/16/2012 06:45 PM, Roman K wrote:
> On 04/16/2012 04:31 PM, Jan Høydahl wrote:
>> Hi,
>>
>> Solr 3.6 is just out with Tika 1.0. Can you try that? Also, Solr TRUNK 
>> now has Tika 1.1...
>> I recommend downloading Tika-App and testing your offending files 
>> directly with it: http://tika.apache.org/1.1/gettingstarted.html
>>
>> -- 
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Solr Training - www.solrtraining.com
>>
> Sorry, I forgot to mention it: I tested with Tika-App 1.1 and it 
> worked just fine.
> I will try Solr 3.6 as you advised and report the result as soon as 
> possible.
Tried with Solr 3.6 and it worked out of the box!
The issue is solved; thank you for the great and really fast help.

Re: Issue with Solr 3.5 while using TikaEntityProcessor on .docx files

Posted by Roman K <ro...@gmail.com>.
On 04/16/2012 04:31 PM, Jan Høydahl wrote:
> Hi,
>
> Solr 3.6 is just out with Tika 1.0. Can you try that? Also, Solr TRUNK now has Tika 1.1...
> I recommend downloading Tika-App and testing your offending files directly with it: http://tika.apache.org/1.1/gettingstarted.html
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
Sorry, I forgot to mention it: I tested with Tika-App 1.1 and it 
worked just fine.
I will try Solr 3.6 as you advised and report the result as soon as possible.

Re: Issue with Solr 3.5 while using TikaEntityProcessor on .docx files

Posted by Jan Høydahl <ja...@cominvent.com>.
Hi,

Solr 3.6 is just out with Tika 1.0. Can you try that? Also, Solr TRUNK now has Tika 1.1...
I recommend downloading Tika-App and testing your offending files directly with it: http://tika.apache.org/1.1/gettingstarted.html

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
