Posted to user@manifoldcf.apache.org by Phillip Rhodes <mo...@gmail.com> on 2017/12/20 09:25:33 UTC

MCF not indexing documents due to mime-type

MCF folks:

I'm about to tear my hair out over this one... I just realized that
I've been running MCF with the "Use the Extract Update Handler:"
option checked.  Suspecting this might be related to another issue I
was having (content was not being stored in the field named in the
"Content field name:" option in MCF), I turned this option off.

Now, MCF happily rejects nearly every document in my repository with this:

Result Code: EXCLUDEDMIMETYPE
Result Description: Excluding document because of mime type (application/pdf)
(and so on for many other mime types)

So... this is *not* what I would expect to happen as I have nothing at
all listed in the "excluded mime types" setting for this output
connector.  With nothing explicitly excluded, I would (perhaps
naively) expect all mime types to be sent to Solr.

But what makes it even worse is this: even when I explicitly add types
(for example, application/pdf) to the "included mime types" setting
and re-index, I *still* get the same message and no PDF files are
indexed.

Any ideas?  Is this a bug, or is there something else I need to do?



Thanks,


Phil
~~~
This message optimized for indexing by NSA PRISM

Re: MCF not indexing documents due to mime-type

Posted by Phillip Rhodes <mo...@gmail.com>.

As far as I can tell, the wonkiness I'm seeing in the data actually
reflects an underlying problem with digital images.  Apparently some
or all of the date-typed fields mandated by EXIF and XMP don't require
time-zone information, so an image can legitimately have a date/time
field like "created date" with no time zone at all.  But since Solr
requires date-typed fields to be in UTC, if you want to store that
value in a date field, you have to impute the correct time zone (or a
reasonable approximation).

In my case, I doubt anybody is ever going to care to search images in
a way where a difference of a few hours is going to matter, so I think
I'm just going to force everything to a time value of midnight UTC on
the date in question.
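
A minimal sketch of that "force it to midnight UTC" idea, in Python
rather than as an actual MCF transformer.  The function name
to_midnight_utc is hypothetical, and the sketch assumes the incoming
values are ISO 8601 strings like the ones Tika emits in this thread:

```python
from datetime import datetime, timezone


def to_midnight_utc(value: str) -> str:
    """Keep only the calendar date and emit midnight UTC in Solr's date format.

    Assumption: the calendar date is taken as written; no time-zone
    conversion is attempted, since the source value may not carry a zone.
    """
    text = value.strip()
    # Drop a trailing 'Z' so older Python versions of fromisoformat accept it.
    if text.endswith("Z"):
        text = text[:-1]
    parsed = datetime.fromisoformat(text)
    midnight = datetime(parsed.year, parsed.month, parsed.day, tzinfo=timezone.utc)
    return midnight.strftime("%Y-%m-%dT%H:%M:%SZ")


normalized = to_midnight_utc("2011-03-02T08:44:45")
```

Discarding the time of day loses information, but it avoids guessing a
zone offset that the image never recorded.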

Right now I'm exploring writing my own custom transformer to do the
data munging.   It might be overkill, but I wanted to do it just to
learn that side of MCF if nothing else.  So far the transformer I
threw together seems to be working.


Thanks,


Phil

This message optimized for indexing by NSA PRISM


On Fri, Dec 22, 2017 at 7:27 PM, Karl Wright <da...@gmail.com> wrote:
> Hi Phil,
>
> Are these fields extracted by Tika from your document?  Just curious,
> because if it's in MCF itself we could do something about it.
>
> Anyhow, what you want is the metadata adjuster:
>
> https://manifoldcf.apache.org/release/release-1.10/en_US/end-user-documentation.html#metadataadjuster
>
>
> Karl

Re: MCF not indexing documents due to mime-type

Posted by Karl Wright <da...@gmail.com>.

Hi Phil,

Are these fields extracted by Tika from your document?  Just curious,
because if it's in MCF itself we could do something about it.

Anyhow, what you want is the metadata adjuster:

https://manifoldcf.apache.org/release/release-1.10/en_US/end-user-documentation.html#metadataadjuster


Karl


On Fri, Dec 22, 2017 at 1:47 AM, Phillip Rhodes <mo...@gmail.com>
wrote:

> On Thu, Dec 21, 2017 at 8:35 PM, Karl Wright <da...@gmail.com> wrote:
> > Well, there are some differences; "Solr Cell" (as they used to call it)
> > generates a couple of fields that the standard Tika extractor in MCF
> > won't.  But other than that it should work.
>
> By and large I don't think I care about those fields, so that part
> shouldn't be an issue.
>
> > Note that you can still use the extracting update handler in the solr
> > connector; since the input will always be text/plain Tika shouldn't do
> > anything to the document on the Solr side.  If that doesn't happen to be
> > true, you can use the standard Solr input handler,
>
> FWIW, it appears that even when using the Tika connector in MCF, what
> gets sent to Solr still triggers some Tika behavior if you have the
> "use extract handler" option turned on.  When I did this I got all
> sorts of weird Tika parse exceptions and what-not from Solr.
>
> Fortunately, just sending everything to Solr using the standard
> handler worked, and I'm at a point now where *almost* everything works.
>
> The one issue I'm still seeing is this: when using the Tika connector,
> it seems that some date-oriented fields are being generated with a
> value that does not have the trailing 'Z' timezone flag.  This causes
> a Solr error if the corresponding field is date-typed, as Solr
> requires dates to be in UTC.
>
> Ex:
>
> dcterms:created: 2011-03-02T08:44:45
> found field: dcterms:modified: 2011-03-02T08:44:45
> Last-Save-Date: 2011-03-02T08:44:45
> meta:save-date: 2011-03-02T08:44:45
>
> Solr wants all of these to look like
>
>
> 2011-03-02T08:44:45Z
>
>
> Is there any way, using any built in MCF functionality, to forcibly
> munge the field values to correct this?  If not, could I accomplish
> that by writing a custom Transform connector?
>
>
> Thanks,
>
>
> Phil
>

Re: MCF not indexing documents due to mime-type

Posted by Phillip Rhodes <mo...@gmail.com>.

On Thu, Dec 21, 2017 at 8:35 PM, Karl Wright <da...@gmail.com> wrote:
> Well, there are some differences; "Solr Cell" (as they used to call it)
> generates a couple of fields that the standard Tika extractor in MCF won't.
> But other than that it should work.

By and large I don't think I care about those fields, so that part
shouldn't be an issue.

> Note that you can still use the extracting update handler in the solr
> connector; since the input will always be text/plain Tika shouldn't do
> anything to the document on the Solr side.  If that doesn't happen to be
> true, you can use the standard Solr input handler,

FWIW, it appears that even when using the Tika connector in MCF, what
gets sent to Solr still triggers some Tika behavior if you have the
"use extract handler" option turned on.  When I did this I got all
sorts of weird Tika parse exceptions and what-not from Solr.

Fortunately, just sending everything to Solr using the standard
handler worked, and I'm at a point now where *almost* everything works.

The one issue I'm still seeing is this: when using the Tika connector,
it seems that some date-oriented fields are being generated with a
value that does not have the trailing 'Z' timezone flag.  This causes
a Solr error if the corresponding field is date-typed, as Solr
requires dates to be in UTC.

Ex:

dcterms:created: 2011-03-02T08:44:45
found field: dcterms:modified: 2011-03-02T08:44:45
Last-Save-Date: 2011-03-02T08:44:45
meta:save-date: 2011-03-02T08:44:45

Solr wants all of these to look like

2011-03-02T08:44:45Z

Is there any way, using any built in MCF functionality, to forcibly
munge the field values to correct this?  If not, could I accomplish
that by writing a custom Transform connector?
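
To make the desired munging concrete, here is a rough Python sketch of
the string fix itself.  It is not MCF code: the helper name
to_solr_date and the sample field dictionary are made up for
illustration, and it assumes a value with no zone can be treated as if
it were already UTC:

```python
import re

# Matches an ISO 8601 date-time with no zone designator,
# e.g. 2011-03-02T08:44:45 (optionally with fractional seconds).
_NAIVE_ISO = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?$")


def to_solr_date(value: str) -> str:
    """Append 'Z' to a zone-less timestamp; leave anything else untouched.

    Assumption: a value without a zone is treated as already being UTC.
    """
    value = value.strip()
    if _NAIVE_ISO.match(value):
        return value + "Z"
    return value


# Hypothetical metadata as it might arrive from the extractor.
fields = {
    "dcterms:created": "2011-03-02T08:44:45",
    "Last-Save-Date": "2011-03-02T08:44:45Z",
}
fixed = {name: to_solr_date(v) for name, v in fields.items()}
```

Values that already end in 'Z', or that are not timestamps at all,
pass through unchanged, so the same function can be applied to every
metadata field without checking its type first.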

Thanks,

Phil

Re: MCF not indexing documents due to mime-type

Posted by Karl Wright <da...@gmail.com>.

Well, there are some differences; "Solr Cell" (as they used to call it)
generates a couple of fields that the standard Tika extractor in MCF
won't.  But other than that it should work.

Note that you can still use the extracting update handler in the Solr
connector; since the input will always be text/plain, Tika shouldn't do
anything to the document on the Solr side.  If that doesn't happen to be
true, you can use the standard Solr input handler, but bear in mind that
this handler requires memory buffering on the MCF side, so we insist you
give a limit on the size of the content sent to Solr for indexing in that
mode.

Thanks,
Karl


On Thu, Dec 21, 2017 at 7:21 PM, Phillip Rhodes <mo...@gmail.com>
wrote:

> OK, it looks like the root of the problem I was seeing, metadata
> winding up mixed in with the content, is ultimately a bug in Solr.
> <https://issues.apache.org/jira/browse/SOLR-9178>
>
> It seems that if you use the "Tika built into Solr" approach this is
> just what you get.  The answer seems to be "do the Tika processing
> outside of Solr".
>
> So now my question vis-a-vis ManifoldCF is this: can I achieve the
> scenario of having MCF index everything, and send it all to Solr,
> while *not* using the ExtractingRequestHandler if I run Tika in MCF
> directly?  My naive understanding is that the "Tika Content Extractor"
> should let me accomplish this.  Can anyone confirm if that is correct?
>
>
> Thanks
>
>
> Phil
>
> This message optimized for indexing by NSA PRISM

Re: MCF not indexing documents due to mime-type

Posted by Phillip Rhodes <mo...@gmail.com>.

OK, it looks like the root of the problem I was seeing, metadata
winding up mixed in with the content, is ultimately a bug in Solr.
<https://issues.apache.org/jira/browse/SOLR-9178>

It seems that if you use the "Tika built into Solr" approach this is
just what you get.  The answer seems to be "do the Tika processing
outside of Solr".

So now my question vis-a-vis ManifoldCF is this: can I achieve the
scenario of having MCF index everything, and send it all to Solr,
while *not* using the ExtractingRequestHandler if I run Tika in MCF
directly?  My naive understanding is that the "Tika Content Extractor"
should let me accomplish this.  Can anyone confirm if that is correct?


Thanks


Phil

This message optimized for indexing by NSA PRISM


On Wed, Dec 20, 2017 at 7:53 AM, Karl Wright <da...@gmail.com> wrote:
> Hi Phil,
>
> Some output connectors accept *only* text documents.  That's why
> you need to run your documents through Tika first.  So your original setup
> was right.
>
> If you are still using ElasticSearch, you can make it accept non-text
> documents only by specifying the mapper attachment in the output connection
> configuration.
>
>
>
> Karl

Re: MCF not indexing documents due to mime-type

Posted by Karl Wright <da...@gmail.com>.

Hi Phil,

Some output connectors accept *only* text documents.  That's why
you need to run your documents through Tika first.  So your original setup
was right.

If you are still using ElasticSearch, you can make it accept non-text
documents only by specifying the mapper attachment in the output connection
configuration.



Karl


On Wed, Dec 20, 2017 at 4:25 AM, Phillip Rhodes <mo...@gmail.com>
wrote:

> MCF folks:
>
> I'm about to tear my hair out over this one... I just realized that
> I've been running MCF with the "Use the Extract Update Handler:"
> option checked.  Suspecting this might be related to another issue I
> was having (content was not being stored in the field named in the
> "Content field name:" option in MCF), I turned this option off.
>
> Now, MCF happily rejects nearly every document in my repository with this:
>
> Result Code: EXCLUDEDMIMETYPE
> Result Description: Excluding document because of mime type (application/pdf)
> (and so on for many other mime types)
>
> So... this is *not* what I would expect to happen as I have nothing at
> all listed in the "excluded mime types" setting for this output
> connector.  With nothing explicitly excluded, I would (perhaps
> naively) expect all mime types to be sent to Solr.
>
> But what makes it even worse is this: even when I explicitly add types
> (for example, application/pdf) to the "included mime types" setting
> and re-index, I *still* get the same message and no PDF files are
> indexed.
>
> Any ideas?  Is this a bug, or is there something else I need to do?
>
>
>
> Thanks,
>
>
> Phil
> ~~~
> This message optimized for indexing by NSA PRISM
>