Posted to solr-user@lucene.apache.org by Brett Melbourne <bm...@halogensoftware.com> on 2012/11/24 00:25:45 UTC

Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler

Hi all,

I am encountering a problem where Solr 3.6.1 is not able to extract the text content from ODT (OpenDocument Text) files submitted to the ExtractingRequestHandler. I can reproduce this issue against the example schema running with Jetty.

Executing a simple index request (based on the example in the wiki):
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@testfile.odt"
returns no errors, and does not generate any exceptions in the log/console.

A query for doc1 returns an empty attr_content field:
<arr name="attr_content"> <str></str> </arr>

Oddly enough, executing an "extractOnly=true" request against the ExtractingRequestHandler with the same ODT file correctly returns the text of the file.

I am wondering:

* Is this a known issue? (I couldn't find any mention of this particular issue anywhere.)

* Are there any workarounds, or does anyone have any suggestions?

Thanks,

Brett.
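
The two requests described in this message (the indexing call and the "extractOnly=true" probe) can be put side by side for comparison. The following is a minimal Python sketch, not part of the original post. The helper name extract_url and the SOLR_BASE constant are hypothetical; the parameters are the ones from the curl command above, and the file itself would still be uploaded as multipart form data, as curl does with -F.

```python
from urllib.parse import urlencode

# Base URL of the example Jetty instance used in the thread (assumed unchanged).
SOLR_BASE = "http://localhost:8983/solr"


def extract_url(doc_id, extract_only=False):
    """Build an /update/extract URL (hypothetical helper for illustration)."""
    if extract_only:
        # extractOnly=true returns the extracted text without indexing it.
        params = [("extractOnly", "true")]
    else:
        # Same parameters as the curl request in the message.
        params = [
            ("literal.id", doc_id),
            ("uprefix", "attr_"),
            ("fmap.content", "attr_content"),
            ("commit", "true"),
        ]
    return SOLR_BASE + "/update/extract?" + urlencode(params)


index_url = extract_url("doc1")
probe_url = extract_url("doc1", extract_only=True)
print(index_url)
print(probe_url)
```

Posting the same ODT file to both URLs and comparing the results is one way to confirm that the text is produced during extraction but does not reach the indexed document.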

RE: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler

Posted by Brett Melbourne <bm...@halogensoftware.com>.

Hi Erick,

Thanks for the reply!

I don't think there is a problem with my schema, because I can successfully extract text from other file types.

For example, Tika is able to extract the content from a docx:

FINEST: Trying class name org.apache.solr.handler.extraction.ExtractingRequestHandler
Dec 7, 2012 3:18:35 PM org.apache.solr.handler.extraction.SolrContentHandler newDocument
FINE: Doc: SolrInputDocument[{attr_meta=attr_meta(1.0)={[stream_content_type, application/xml, stream_size, 9935, Content-Type, application/vnd.openxmlformats-officedocument.wordprocessingml.document]}, attr_revision_number=attr_revision_number(1.0)={1}, attr_template=attr_template(1.0)={Normal.dotm}, attr_last_author=attr_last_author(1.0)={Brett Melbourne}, attr_page_count=attr_page_count(1.0)={1}, attr_application_name=attr_application_name(1.0)={Microsoft Office Word}, author=author(1.0)={Brett Melbourne}, last_modified=last_modified(1.0)={2012-12-07T19:18:00.000Z}, attr_application_version=attr_application_version(1.0)={12.0000}, attr_character_count_with_spaces=attr_character_count_with_spaces(1.0)={60}, attr_date=attr_date(1.0)={2012-12-07T19:17:00Z}, attr_total_time=attr_total_time(1.0)={1}, attr_publisher=attr_publisher(1.0)={}, attr_creator=attr_creator(1.0)={Brett Melbourne}, attr_word_count=attr_word_count(1.0)={9}, attr_xmptpg_npages=attr_xmptpg_npages(1.0)={1}, attr_creation_date=attr_creation_date(1.0)={2012-12-07T19:17:00Z}, attr_stream_content_type=attr_stream_content_type(1.0)={application/xml}, attr_line_count=attr_line_count(1.0)={1}, attr_character_count=attr_character_count(1.0)={52}, attr_stream_size=attr_stream_size(1.0)={9935}, content_type=content_type(1.0)={application/vnd.openxmlformats-officedocument.wordprocessingml.document}, attr_paragraph_count=attr_paragraph_count(1.0)={1}, id=id(1.0)={doc3}, text=text(1.0)={             This is some text content that Solr should be able to parse.   }}]

The docx content in Solr is:

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">id:doc3</str>
<str name="version">2.2</str>
<str name="rows">10</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<arr name="attr_application_name">
<str>Microsoft Office Word</str>
</arr>
<arr name="attr_application_version">
<str>12.0000</str>
</arr>
<arr name="attr_character_count">
<str>52</str>
</arr>
<arr name="attr_character_count_with_spaces">
<str>60</str>
</arr>
<arr name="attr_creation_date">
<str>2012-12-07T19:17:00Z</str>
</arr>
<arr name="attr_creator">
<str>Brett Melbourne</str>
</arr>
<arr name="attr_date">
<str>2012-12-07T19:17:00Z</str>
</arr>
<arr name="attr_last_author">
<str>Brett Melbourne</str>
</arr>
<arr name="attr_line_count">
<str>1</str>
</arr>
<arr name="attr_meta">
<str>stream_content_type</str>
<str>application/xml</str>
<str>stream_size</str>
<str>9935</str>
<str>Content-Type</str>
<str>
application/vnd.openxmlformats-officedocument.wordprocessingml.document
</str>
</arr>
<arr name="attr_page_count">
<str>1</str>
</arr>
<arr name="attr_paragraph_count">
<str>1</str>
</arr>
<arr name="attr_publisher">
<str/>
</arr>
<arr name="attr_revision_number">
<str>1</str>
</arr>
<arr name="attr_stream_content_type">
<str>application/xml</str>
</arr>
<arr name="attr_stream_size">
<str>9935</str>
</arr>
<arr name="attr_template">
<str>Normal.dotm</str>
</arr>
<arr name="attr_total_time">
<str>1</str>
</arr>
<arr name="attr_word_count">
<str>9</str>
</arr>
<arr name="attr_xmptpg_npages">
<str>1</str>
</arr>
<str name="author">Brett Melbourne</str>
<arr name="content_type">
<str>
application/vnd.openxmlformats-officedocument.wordprocessingml.document
</str>
</arr>
<str name="id">doc3</str>
<date name="last_modified">2012-12-07T19:18:00Z</date>
<arr name="text">
<str>
This is some text content that Solr should be able to parse.
</str>
</arr>
</doc>
</result>
</response>

When I attempt to index an ODT, it apparently works fine; however, notice that the text field is empty:

Dec 7, 2012 4:18:43 PM org.apache.solr.handler.extraction.SolrContentHandler newDocument
FINE: Doc: SolrInputDocument[{attr_editing_cycles=attr_editing_cycles(1.0)={1}, attr_page_count=attr_page_count(1.0)={2}, attr_date=attr_date(1.0)={2010-09-16T15:51:00Z}, attr_creator=attr_creator(1.0)={droy}, attr_word_count=attr_word_count(1.0)={475}, attr_xmptpg_npages=attr_xmptpg_npages(1.0)={2}, attr_edit_time=attr_edit_time(1.0)={PT60S}, attr_creation_date=attr_creation_date(1.0)={2010-09-16T15:50:00Z}, attr_nbpara=attr_nbpara(1.0)={6}, attr_stream_content_type=attr_stream_content_type(1.0)={application/xml}, attr_initial_creator=attr_initial_creator(1.0)={droy}, attr_character_count=attr_character_count(1.0)={3177}, attr_stream_size=attr_stream_size(1.0)={9130}, attr_generator=attr_generator(1.0)={MicrosoftOffice/12.0 MicrosoftWord}, attr_nbword=attr_nbword(1.0)={475}, attr_nbpage=attr_nbpage(1.0)={2}, content_type=content_type(1.0)={application/vnd.oasis.opendocument.text}, attr_nbcharacter=attr_nbcharacter(1.0)={3177}, attr_paragraph_count=attr_paragraph_count(1.0)={6}, id=id(1.0)={doc4}, text=text(1.0)={  }}]

The corresponding document in Solr looks like this:

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">id:doc4</str>
<str name="version">2.2</str>
<str name="rows">10</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<arr name="attr_character_count">
<str>3177</str>
</arr>
<arr name="attr_creation_date">
<str>2010-09-16T15:50:00Z</str>
</arr>
<arr name="attr_creator">
<str>droy</str>
</arr>
<arr name="attr_date">
<str>2010-09-16T15:51:00Z</str>
</arr>
<arr name="attr_edit_time">
<str>PT60S</str>
</arr>
<arr name="attr_editing_cycles">
<str>1</str>
</arr>
<arr name="attr_generator">
<str>MicrosoftOffice/12.0 MicrosoftWord</str>
</arr>
<arr name="attr_initial_creator">
<str>droy</str>
</arr>
<arr name="attr_nbcharacter">
<str>3177</str>
</arr>
<arr name="attr_nbpage">
<str>2</str>
</arr>
<arr name="attr_nbpara">
<str>6</str>
</arr>
<arr name="attr_nbword">
<str>475</str>
</arr>
<arr name="attr_page_count">
<str>2</str>
</arr>
<arr name="attr_paragraph_count">
<str>6</str>
</arr>
<arr name="attr_stream_content_type">
<str>application/xml</str>
</arr>
<arr name="attr_stream_size">
<str>9130</str>
</arr>
<arr name="attr_word_count">
<str>475</str>
</arr>
<arr name="attr_xmptpg_npages">
<str>2</str>
</arr>
<arr name="content_type">
<str>application/vnd.oasis.opendocument.text</str>
</arr>
<str name="id">doc4</str>
<arr name="text">
<str></str>
</arr>
</doc>
</result>
</response>

Brett.
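
The difference between the two responses above (a populated text field for doc3, a blank one for doc4) can also be checked programmatically. This is a minimal Python sketch, not something from the original message. The function name empty_fields and the trimmed sample XML are assumptions made for illustration; the element layout follows the responses shown above.

```python
import xml.etree.ElementTree as ET


def empty_fields(response_xml, field_names=("text", "attr_content")):
    """Return {doc id: [watched multi-valued fields that are present but blank]}."""
    root = ET.fromstring(response_xml)
    report = {}
    for doc in root.iter("doc"):
        doc_id = doc.findtext("str[@name='id']")
        blank = []
        for arr in doc.findall("arr"):
            if arr.get("name") in field_names:
                # A field counts as blank when every <str> value is empty or whitespace.
                values = [(s.text or "").strip() for s in arr.findall("str")]
                if not any(values):
                    blank.append(arr.get("name"))
        if blank:
            report[doc_id] = blank
    return report


# Trimmed-down stand-in for the two responses shown in the message.
sample = """<response><result name="response" numFound="2" start="0">
<doc><str name="id">doc3</str><arr name="text"><str>This is some text content that Solr should be able to parse.</str></arr></doc>
<doc><str name="id">doc4</str><arr name="text"><str></str></arr></doc>
</result></response>"""

print(empty_fields(sample))  # prints {'doc4': ['text']}
```

Run over a larger result set, a check like this would show whether the blank content field is specific to ODT input or affects other content types as well.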

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Tuesday, November 27, 2012 7:38 AM
To: solr-user@lucene.apache.org
Subject: Re: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler

Not an issue that I know of. I expect you've got some obscure problem in your definitions, but I'm guessing. Try modifying your schema so the glob pattern maps to a stored field, something like:
<dynamicField name="*" type="string" multiValued="true" stored="true" />
Then remove all other fields except id, remove your mapping, and try it again.
If you query with fl=* you should see everything that was extracted. That'll tell you whether it is a problem with Solr/Tika or something in how you're using them.

Best
Erick


Re: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler

Posted by Erick Erickson <er...@gmail.com>.

Not an issue that I know of. I expect you've got some obscure problem in
your definitions, but I'm guessing. Try modifying your schema so the glob
pattern maps to a stored field, something like:
<dynamicField name="*" type="string" multiValued="true" stored="true" />
Then remove all other fields except id, remove your mapping, and try it again.
If you query with fl=* you should see everything that was extracted. That'll
tell you whether it is a problem with Solr/Tika or something in how you're
using them.

Best
Erick
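
The "query with fl=*" step suggested above can be expressed as a request URL. This is a minimal Python sketch, not part of the original reply. The helper name select_url is hypothetical, and the /select path and localhost base are assumed to match the example Jetty setup mentioned earlier in the thread.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def select_url(doc_id, base="http://localhost:8983/solr"):
    """Build a query URL that asks for every stored field of one document."""
    # fl=* returns all stored fields, which shows everything that was extracted.
    params = [("q", "id:" + doc_id), ("fl", "*")]
    return base + "/select?" + urlencode(params)


url = select_url("doc1")
print(url)

# Decode the query string again to confirm the parameters survive URL encoding.
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```

With the catch-all dynamicField in place, the response to this query should list every field produced during extraction, which helps separate a schema mapping problem from an extraction problem.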


On Mon, Nov 26, 2012 at 10:19 AM, Brett Melbourne <
bmelbourne@halogensoftware.com> wrote:

> Hi Erick,
>
> The document is committed successfully... it is just missing all the
> extracted content from Tika when I query for that document.
>
> i.e. the mapped content field attr_content is empty
> (fmap.content=attr_content)
>
> <result name="response" numFound="1" start="0" maxScore="1.9162908">
> <doc>
> <float name="score">1.9162908</float>
> <arr name="attr_character_count">
> <str>24</str>
> </arr>
> <arr name="attr_content">
> <str></str>
> </arr>
> <arr name="attr_creation_date">
> <str>2009-04-16T11:32:00</str>
> </arr>
> <arr name="attr_date">
> <str>2012-11-23T00:29:39.73</str>
> </arr>
>
> ...
>
> </result>
>
>
> Brett.
>

RE: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler

Posted by Brett Melbourne <bm...@halogensoftware.com>.

Hi Erick,

The document is committed successfully... it is just missing all the extracted content from Tika when I query for that document.

i.e. the mapped content field attr_content is empty (fmap.content=attr_content)

<result name="response" numFound="1" start="0" maxScore="1.9162908">
<doc>
<float name="score">1.9162908</float>
<arr name="attr_character_count">
<str>24</str>
</arr>
<arr name="attr_content">
<str></str>
</arr>
<arr name="attr_creation_date">
<str>2009-04-16T11:32:00</str>
</arr>
<arr name="attr_date">
<str>2012-11-23T00:29:39.73</str>
</arr>

...

</result>


Brett.

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Sunday, November 25, 2012 9:27 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler

Did you commit after you added the document but before you tried the search?

Best
Erick



Re: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler

Posted by Erick Erickson <er...@gmail.com>.

Did you commit after you added the document but before you tried the search?

Best
Erick

