Posted to solr-user@lucene.apache.org by Andy Lam Yin Cong <al...@yahoo.com> on 2009/10/17 16:59:43 UTC

Store tika extracted result as xhtml

Dear All,

I have a field defined in schema.xml as below,
<fieldtype name="string"  class="solr.StrField" sortMissingLast="true" indexed="true" stored="true" multiValued="false" omitNorms="true"/>
<field name="original"     type="string" indexed="false"  />

and in the solrconfig.xml
<str name="fmap.content">original</str>

Basically, when I upload the document via the command below
curl 'http://localhost:8983/solr/info/update/extract?map.content=text_shingle&literal.url=test&commit=true' -F "file=@mccm.pdf"

and try to display the field via a query, it shows

Take A Chance On Me      
Take A Chance On Me
Monte Carlo Condensed Matter
A very brief guide to Monte Carlo simulation.
An explanation of what I do.
A chance for far too many ABBA puns
.......
The above is not XHTML(!)

However, if I run the command below with extractOnly=true
curl 'http://localhost:8983/solr/info/update/extract?map.content=text_shingle&literal.url=test&extractOnly=true' -F "file=@mccm.pdf"

I get the result
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
    &lt;head&gt;
        &lt;title&gt;Take A Chance On Me&lt;/title&gt;
    &lt;/head&gt;
    &lt;body&gt;
        &lt;div&gt;
.........
which is XHTML output.

My objective is to be able to store it as XHTML in the field and retrieve it as cached output.
Since Tika is already giving XHTML output, I wonder why Solr saves it as plain text. (Maybe I missed something in the configuration?)

Also, I will be using SolrJ as the application layer, so as a workaround, if there is any way to get the XHTML result through it, maybe I can store it somewhere else outside of Solr.
Any advice on this will be highly appreciated.

Many Thanks & Kind Regards
Andy

Re: Store tika extracted result as xhtml

Posted by Chris Hostetter <ho...@fucit.org>.

: My objective is to be able to stored it as xhtml in the field and be 
: able to retrieve it as cached output. Since tika is already giving xhtml 
: output, I wonder why when Solr save it as a plain text. (Maybe I missed 
: out something in the configuration??)

I'm not very familiar with Tika or Solr CELL, but I think what you are 
seeing is that Solr only asks Tika for the *content* of the DOM Nodes 
matched by the xpath and/or capture params (i.e., node.getTextContent()).

I suspect it wouldn't be too hard to add an option to allow the capture of 
the serialized DOM Nodes.
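
To illustrate the distinction between a node's text content and its serialized form, here is a minimal sketch using only the standard JDK DOM and transformer APIs. It is not Solr's or Tika's actual code; the class name TextVsMarkup, its helper methods, and the sample markup are hypothetical and exist only for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of Solr or Tika.
public class TextVsMarkup {

    // Parse a small XHTML-like string and return its <body> element.
    static Node parseBody(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        return doc.getElementsByTagName("body").item(0);
    }

    // Serialize a DOM node back to markup, keeping its tags.
    static String serialize(Node node) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input, loosely modelled on the extractOnly output.
        String xhtml = "<html><head><title>Take A Chance On Me</title></head>"
                + "<body><div><p>Monte Carlo Condensed Matter</p></div></body></html>";
        Node body = parseBody(xhtml);

        // Text content only: the tags are dropped.
        System.out.println(body.getTextContent());

        // Serialized node: the markup is preserved.
        System.out.println(serialize(body));
    }
}
```

The first call corresponds to the plain text that ends up in the stored field; the second is roughly what an option to capture serialized DOM Nodes would have to produce.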



-Hoss