Posted to solr-user@lucene.apache.org by Juan Grande <ju...@gmail.com> on 2011/07/07 22:27:17 UTC

Re: bug in ExtractingRequestHandler with PDFs and metadata field Category

Hi Andras,

> I added <str name="uprefix">metadata_</str> so all PDF metadata fields
> should be saved in solr as "metadata_something" fields.
> The problem is that the "Category" metadata field from the PDF for some
> reason is not prefixed with "metadata_" and
> solr will merge the "Category" field I have in the schema with the Category
> metadata from PDF

This is the expected behavior, as described in
http://wiki.apache.org/solr/ExtractingRequestHandler:

> uprefix=<prefix> - Prefix all fields that are not defined in the schema with
> the given prefix.

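Because your schema defines a Category field, the PDF's Category metadata is
not treated as an unknown field, so it is not prefixed. You can use the fmap
parameter to redirect the Category metadata to another field.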
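
For example, something along these lines in solrconfig.xml might work. This is
only a sketch: the handler name and class follow the stock example config, the
source name "Category" is assumed to match what Tika reports for your PDF, and
"metadata_category" is a hypothetical target that would need to be covered by
your schema (for instance by a metadata_* dynamic field).

```xml
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Prefix extracted fields that are not defined in the schema. -->
    <str name="uprefix">metadata_</str>
    <!-- Redirect the PDF's Category metadata away from the schema's own
         Category field. "metadata_category" is a hypothetical target name. -->
    <str name="fmap.Category">metadata_category</str>
  </lst>
</requestHandler>
```

The same mapping can also be passed per request as fmap.Category=metadata_category
instead of being set as a default.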

Regards,

*Juan*



On Thu, Jul 7, 2011 at 10:44 AM, Andras Balogh <an...@reea.net> wrote:

> Hi,
>
> I think this is a bug, but before reporting it to the issue tracker I
> thought I would ask here first.
> The problem is that I have a PDF file which, among other metadata fields
> such as Author and CreatedDate, has a metadata field named Category (I can
> see all the metadata fields with tika-app.jar started in GUI mode).
> My Solr schema also has a "Category" field among its other fields, plus a
> field called "text" that holds the text extracted from the PDF.
> I added <str name="uprefix">metadata_</str> so all PDF metadata fields
> should be saved in Solr as "metadata_something" fields.
> The problem is that the "Category" metadata field from the PDF is for some
> reason not prefixed with "metadata_", so Solr merges the "Category" field I
> have in the schema with the Category metadata from the PDF, and I get an
> error like:
> "multiple values encountered for non multiValued field Category"
> I worked around this by patching tika-parsers.jar to ignore the Category
> metadata in
> org.apache.tika.parser.pdf.PDFParser
> but this is not a good solution (I don't need that Category metadata, so
> it works for me).
>
> So please let me know whether this should be reported as a bug.
>
> Regards,
> Andras.