Posted to solr-user@lucene.apache.org by Jan Høydahl <ja...@cominvent.com> on 2011/06/09 11:26:21 UTC

ExtractingRequestHandler - renaming tika generated fields

Hi,

I post a PDF from a CMS client that has metadata about the document, including the title. I trust the CMS title more than the title extracted from the PDF, but I cannot find a way to both send &literal.title=CMS-Title and rename the title field generated by Tika/SolrCell. If I set fmap.title=tika_title, my literal.title is renamed as well. Any ideas?
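
For illustration, the query string for such a request could be assembled as in the minimal sketch below. The helper name build_extract_query is hypothetical; only the literal.title and fmap.title parameters and their values come from the message above.

```python
from urllib.parse import urlencode


def build_extract_query(literals, field_maps):
    """Build a query string from literal.* and fmap.* parameters.

    Hypothetical helper for illustration only; it just formats parameters
    and does not talk to Solr.
    """
    params = []
    for name, value in literals.items():
        # literal.<field>=<value> supplies a field value directly
        params.append((f"literal.{name}", value))
    for source, target in field_maps.items():
        # fmap.<source>=<target> renames a field
        params.append((f"fmap.{source}", target))
    return urlencode(params)


query = build_extract_query({"title": "CMS-Title"}, {"title": "tika_title"})
print(query)  # literal.title=CMS-Title&fmap.title=tika_title
```

Both parameters refer to the same field name, title, which is where the conflict described above comes from.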

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com


Re: ExtractingRequestHandler - renaming tika generated fields

Posted by Jan Høydahl <ja...@cominvent.com>.
One solution to this problem is to change the order of field operations (http://wiki.apache.org/solr/ExtractingRequestHandler#Order_of_field_operations) so that fmap.*= processing runs first and the fields from literal.*= are added afterwards. Why would anyone want to rename a field they have just explicitly named anyway?
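
The difference between the two orderings can be shown with a toy model. This is a sketch under assumptions, not Solr's actual code: fields are modelled as a dict of value lists, all function names are hypothetical, and the value PDF-Title is made up for the example.

```python
def merge(*field_sets):
    """Collect single-valued field dicts into one dict of value lists."""
    merged = {}
    for fields in field_sets:
        for name, value in fields.items():
            merged.setdefault(name, []).append(value)
    return merged


def rename(fields, fmap):
    """Rename fields according to fmap, keeping all of their values."""
    renamed = {}
    for name, values in fields.items():
        renamed.setdefault(fmap.get(name, name), []).extend(values)
    return renamed


def literals_then_fmap(tika, literals, fmap):
    """Behaviour described in the question: literals are renamed too."""
    return rename(merge(tika, literals), fmap)


def fmap_then_literals(tika, literals, fmap):
    """Proposed order: rename extracted fields first, then add literals."""
    result = rename(merge(tika), fmap)
    for name, value in literals.items():
        result.setdefault(name, []).append(value)
    return result


tika = {"title": "PDF-Title"}      # assumed value extracted from the PDF
literals = {"title": "CMS-Title"}  # literal.title=CMS-Title
fmap = {"title": "tika_title"}     # fmap.title=tika_title

print(literals_then_fmap(tika, literals, fmap))
print(fmap_then_literals(tika, literals, fmap))
```

In this model the first ordering leaves both values under tika_title, while the second keeps the CMS title under title and only the extracted title under tika_title.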

Another solution that would work for me is an option to prefix ALL Tika-generated fields, e.g. tprefix=tika_. But I need the ExtractingRequestHandler to output to fields which do not exist in schema.xml. This is because later in the UpdateChain I select and rename fields in another UpdateProcessor, so the field names coming from the ExtractingRequestHandler are only temporary and will not be sent to Solr. Thus, an option to skip the schema check would be useful, perhaps in the form of a whitelist for uprefix, e.g. &uprefix.whitelist=fielda,other-non-existing-field, which would cause uprefix not to rename those fields.
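
A rough sketch of how the proposed whitelist might behave follows. Note that uprefix.whitelist is only the suggestion made above, not an existing Solr parameter, and apply_uprefix is a hypothetical function that models prefixing of fields unknown to the schema; the prefix ignored_ and the sample field names are assumptions as well.

```python
def apply_uprefix(fields, schema_fields, uprefix, whitelist=()):
    """Prefix fields missing from the schema unless they are whitelisted.

    Toy model of the proposal: whitelisted names pass through unchanged
    even though the schema does not define them.
    """
    result = {}
    for name, value in fields.items():
        if name in schema_fields or name in whitelist:
            result[name] = value
        else:
            result[uprefix + name] = value
    return result


fields = {"title": "CMS-Title", "fielda": "x", "producer": "y"}
schema_fields = {"title"}  # assumed schema.xml content

print(apply_uprefix(fields, schema_fields, "ignored_", whitelist=("fielda",)))
```

Here fielda keeps its temporary name so a later UpdateProcessor can pick it up, while producer is still prefixed.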

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
