
Posted to solr-user@lucene.apache.org by Andras Balogh <an...@reea.net> on 2011/07/07 15:44:04 UTC

bug in ExtractingRequestHandler with PDFs and metadata field Category

Hi,

     I think this is a bug, but before reporting it to the issue tracker I 
thought I would ask here first.
The problem is that I have a PDF file which, among other metadata fields 
like Author, CreatedDate etc., has a metadata
field Category (I can see all metadata fields with tika-app.jar started 
in GUI mode).
In my Solr schema I also have a "Category" field, among other fields, and 
a field called "text"
that holds the extracted text from the PDF.
I added <str name="uprefix">metadata_</str> so that all PDF metadata fields 
would be saved in Solr as "metadata_something" fields.
The problem is that the "Category" metadata field from the PDF is, for some 
reason, not prefixed with "metadata_". Solr merges the Category metadata 
from the PDF into the "Category" field I have in the schema, and I get an 
error like:
"multiple values encountered for non multiValued field Category"
I worked around this by patching tika-parsers.jar so that 
org.apache.tika.parser.pdf.PDFParser
ignores the Category metadata, but this is not a good solution (I don't 
need that Category metadata, so it works for me).

So let me know whether this should be reported as a bug or not.

Regards,
Andras.

Re: bug in ExtractingRequestHandler with PDFs and metadata field Category

Posted by Juan Grande <ju...@gmail.com>.

Hi Andras,

> I added <str name="uprefix">metadata_</str> so all PDF metadata fields
> should be saved in solr as "metadata_something" fields.
> The problem is that the "Category" metadata field from the PDF for some
> reason is not prefixed with "metadata_" and
> solr will merge the "Category" field I have in the schema with the Category
> metadata from PDF

This is the expected behavior, as described in
http://wiki.apache.org/solr/ExtractingRequestHandler:

> uprefix=<prefix> - Prefix all fields that are not defined in the schema with
> the given prefix.

You can use the fmap parameter to redirect the category metadata to another
field.
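
As a rough sketch of what that could look like in solrconfig.xml, assuming
the handler is registered at /update/extract and that the schema has a
field or dynamicField matching the target name (the name
"metadata_category" is only an illustration, not something from your
setup):

```xml
<!-- Sketch only: handler path and target field name are assumptions. -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Prefix metadata fields that are NOT defined in the schema. -->
    <str name="uprefix">metadata_</str>
    <!-- Redirect the PDF's Category metadata away from the schema's
         own "Category" field. -->
    <str name="fmap.Category">metadata_category</str>
  </lst>
</requestHandler>
```

If you also use lowernames=true, the source name in the fmap parameter
may need to be the lowercased form (fmap.category), so it is worth
checking against your own configuration.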

Regards,

*Juan*


