Posted to solr-user@lucene.apache.org by Andy Lam Yin Cong <al...@yahoo.com> on 2009/10/17 16:59:43 UTC
Store tika extracted result as xhtml
Dear All,
I have a field defined in schema.xml as below,
<fieldtype name="string" class="solr.StrField" sortMissingLast="true" indexed="true" stored="true" multiValued="false" omitNorms="true"/>
<field name="original" type="string" indexed="false" />
and in solrconfig.xml:
<str name="fmap.content">original</str>
Basically, when I upload the document with the command below
curl 'http://localhost:8983/solr/info/update/extract?map.content=text_shingle&literal.url=test&commit=true' -F "file=@mccm.pdf"
and then try to display the field via a query, it shows
Take A Chance On Me
Take A Chance On Me
Monte Carlo Condensed Matter
A very brief guide to Monte Carlo simulation.
An explanation of what I do.
A chance for far too many ABBA puns
.......
The above is not XHTML(!)
However, if I run the command below with extractOnly=true
curl 'http://localhost:8983/solr/info/update/extract?map.content=text_shingle&literal.url=test&extractOnly=true' -F "file=@mccm.pdf"
I get the result
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Take A Chance On Me</title>
</head>
<body>
<div>
.........
which is XHTML output.
My objective is to store it as XHTML in the field and to retrieve it as cached output.
Since Tika is already producing XHTML output, I wonder why Solr saves it as plain text. (Maybe I missed something in the configuration?)
Also, I will be using SolrJ as the application layer, so as a workaround, if there is any way I can get the XHTML result, maybe I can store it somewhere else outside of Solr.
Any advice on this will be highly appreciated.
Many Thanks & Kind Regards
Andy
Re: Store tika extracted result as xhtml
Posted by Chris Hostetter <ho...@fucit.org>.
: My objective is to be able to stored it as xhtml in the field and be
: able to retrieve it as cached output. Since tika is already giving xhtml
: output, I wonder why when Solr save it as a plain text. (Maybe I missed
: out something in the configuration??)
I'm not very familiar with Tika or Solr CELL, but I think what you are
seeing is that Solr only asks Tika for the *content* of the DOM Nodes
matched by the xpath and/or capture params (i.e., node.getTextContent()).
I suspect it wouldn't be too hard to add an option to allow the capture of
the serialized DOM Nodes.
-Hoss
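To illustrate the distinction described in the reply above, here is a minimal sketch using only the JDK's built-in XML APIs. It is not Solr Cell's actual code; the class name TextVsMarkup, its helper methods, and the sample markup are hypothetical and exist only for this example. It contrasts node.getTextContent(), which drops all tags, with serializing the same DOM node, which keeps them.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of Solr or Tika.
public class TextVsMarkup {

    // Parse an XHTML string and return its <body> element.
    // The default factory is not namespace-aware, so element names stay plain.
    static Node parseBody(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return doc.getElementsByTagName("body").item(0);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XHTML", e);
        }
    }

    // Text-only view of the node: tags are discarded.
    static String textOnly(Node node) {
        return node.getTextContent();
    }

    // Serialized view of the node: tags are preserved.
    static String serialized(Node node) {
        try {
            // An identity transformer copies the DOM subtree to the output as markup.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize node", e);
        }
    }

    public static void main(String[] args) {
        // Sample markup shaped like the extractOnly output quoted in the thread.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Take A Chance On Me</title></head>"
                + "<body><div><p>Monte Carlo Condensed Matter</p></div></body></html>";
        Node body = parseBody(xhtml);
        System.out.println("text only:  " + textOnly(body));
        System.out.println("serialized: " + serialized(body));
    }
}
```

The first line printed contains only the words of the paragraph, which matches what ends up in the indexed field; the second still carries the div and p elements, which is roughly what an option to capture serialized DOM nodes would need to store.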