Posted to solr-user@lucene.apache.org by Phillip Rhodes <mo...@gmail.com> on 2017/12/22 00:31:05 UTC

Issue with Solr Cell mixing metadata and content together

Hi all, I have been having an issue with Solr, using the
ExtractingRequestHandler.  Basically, when indexing a PDF (for
example) I get all the metadata mixed into the "content" field along
with the content.  See:
<https://stackoverflow.com/questions/47934257/importing-files-with-solr-cell-tika-is-mixing-metadata-fields-with-content>
for the gory details.

I'm guessing this is the same basic issue as
<https://issues.apache.org/jira/browse/SOLR-9178> which is still
unresolved.  But I thought I'd ping the list just to see if anyone had
a workaround or any more information on this.

Is there any way to get reasonable behavior using the
ExtractingRequestHandler, or should I just dump that approach and plan
to run Tika outside of Solr, and then send Solr the exact content I
want?


Thanks,



This message optimized for indexing by NSA PRISM

Re: Issue with Solr Cell mixing metadata and content together

Posted by Phillip Rhodes <mo...@gmail.com>.

Fair enough.  I'm actually using ManifoldCF to manage the indexing,
and I see that they have a Tika Content Extraction transformer
available, so I'll look into wiring that into my pipeline and see if
that gets me the results I'm looking for.


Thanks,


Phil



On Thu, Dec 21, 2017 at 7:43 PM, Erick Erickson <er...@gmail.com> wrote:
> bq: Is there any way to get reasonable behavior using the
> ExtractingRequestHandler, or should I just dump that approach and plan
> to run Tika outside of Solr, and then send Solr the exact content I
> want?
>
> Actually, running Tika outside of Solr is recommended for a bunch of
> reasons, so I'd just go there straightaway. Tika has all sorts of
> "interesting" things to cope with, and since the underlying file
> formats are only more or less followed by this vendor or that, there's
> always the possibility that a bad file will crash Tika and take your
> Solr down with it.
>
> Here's a place to start:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick

Re: Issue with Solr Cell mixing metadata and content together

Posted by Erick Erickson <er...@gmail.com>.

bq: Is there any way to get reasonable behavior using the
ExtractingRequestHandler, or should I just dump that approach and plan
to run Tika outside of Solr, and then send Solr the exact content I
want?

Actually, running Tika outside of Solr is recommended for a bunch of
reasons, so I'd just go there straightaway. Tika has all sorts of
"interesting" things to cope with, and since the underlying file
formats are only more or less followed by this vendor or that, there's
always the possibility that a bad file will crash Tika and take your
Solr down with it.

Here's a place to start:
https://lucidworks.com/2012/02/14/indexing-with-solrj/
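
For anyone who wants a feel for the "run Tika outside of Solr" approach
before reading the SolrJ article, here is a minimal sketch in Python
(standard library only). It is not the method from the linked post. It
assumes a standalone Tika Server on localhost:9998 and a Solr core named
"mycore" on localhost:8983; the field names and the FIELD_MAP mapping are
hypothetical and would need to match your own schema. The point is only
that the body text and the metadata arrive separately, so you decide
exactly what lands in "content".

```python
import json
import urllib.request

# Assumed endpoints: adjust host, port, and core name for your setup.
TIKA_URL = "http://localhost:9998/rmeta/text"
SOLR_URL = "http://localhost:8983/solr/mycore/update?commit=true"

# Tika Server's /rmeta output carries the extracted body text under this key.
CONTENT_KEY = "X-TIKA:content"

# Hypothetical mapping from Tika metadata keys to Solr field names.
FIELD_MAP = {
    "dc:title": "title",
    "dc:creator": "author",
    "Content-Type": "content_type",
}


def build_solr_doc(doc_id, tika_entry, field_map=FIELD_MAP):
    """Build a Solr document with body text and metadata kept apart."""
    doc = {"id": doc_id, "content": (tika_entry.get(CONTENT_KEY) or "").strip()}
    # Copy only the metadata we explicitly asked for; everything else is dropped.
    for tika_key, solr_field in field_map.items():
        if tika_key in tika_entry:
            doc[solr_field] = tika_entry[tika_key]
    return doc


def extract_with_tika(path, tika_url=TIKA_URL):
    """Send a file to Tika Server and return the first metadata entry."""
    with open(path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(
        tika_url, data=data, method="PUT",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # /rmeta returns a JSON list; the first entry is the container document.
        return json.load(resp)[0]


def send_to_solr(doc, solr_url=SOLR_URL):
    """Post a single document to Solr's JSON update handler."""
    body = json.dumps([doc]).encode("utf-8")
    req = urllib.request.Request(
        solr_url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Usage would be along the lines of
send_to_solr(build_solr_doc("doc-1", extract_with_tika("report.pdf"))),
where the id and file name are placeholders. Because extraction happens
in a separate process, a file that makes Tika fall over does not take
Solr with it.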

Best,
Erick