
Posted to user@manifoldcf.apache.org by "Farrenkopf, Sven" <Sv...@dreso.com> on 2018/08/14 09:40:11 UTC

Using ManifoldCF as a web crawler with Tika and Solr

I'm using ManifoldCF with Solr and trying to set it up as a web crawler. Crawling the websites (HTML, text) works fine; the problem is that linked binary documents (pdf, xlsx, docx, ...) are not indexed, even when I add a Tika transformation to the job. I haven't found written confirmation that the web crawler connector supports binary documents, although some posts on the mailing lists indicate that it is possible.

The documents are apparently recognized: I put a direct link to a PDF document in the seeds, and it is processed when I run the job.

But there is no error (Tika errors are not ignored!) and the document is not transferred to Solr. With no error message I have nothing to work with ...

Any ideas or hints on what to do? Does anybody know of a tutorial for setting up a web crawler with Solr and Tika? I haven't found any on the web, which made me wonder whether I'm trying something impossible here.

Thanks in advance.

Sven

Re: Using ManifoldCF as a web crawler with Tika and Solr

Posted by Karl Wright <da...@gmail.com>.

Hi Sven,

Please have a look at the Simple History report to see what happened to the
documents you are interested in.
The Web Connector will fetch binary documents no problem, but it sounds
like something else in your configuration is causing them to be rejected.
Both the web connector's configuration and the configurations of the
downstream pipeline connectors can reject documents based on mime type.
The Simple History should give you a reason for that rejection.  If it
does not, you can turn on connector debugging to see the decisions that
go into whether or not a document is indexed.

Karl
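
To illustrate the point above about rejection by mime type, here is a minimal conceptual sketch in Python. It is not ManifoldCF code and uses no ManifoldCF API: the Stage class, the trace_document function, the stage names and the allowed mime type lists are all hypothetical, chosen only to show how a document can be fetched successfully and still never reach the index because one stage silently filters it out.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Stage:
    """One hypothetical pipeline stage (NOT a ManifoldCF class)."""
    name: str
    # None means "accept every mime type"; otherwise only these are allowed.
    accepted_mime_types: Optional[Set[str]]

    def accepts(self, mime_type: str) -> bool:
        return (self.accepted_mime_types is None
                or mime_type in self.accepted_mime_types)


def trace_document(mime_type: str, stages: List[Stage]) -> List[Tuple[str, str]]:
    """Return a history-like list of (stage name, outcome).

    Processing stops at the first stage that rejects the mime type,
    which mirrors the idea of looking up the rejection reason in a report.
    """
    history: List[Tuple[str, str]] = []
    for stage in stages:
        if not stage.accepts(mime_type):
            history.append((stage.name, f"REJECTED: mime type {mime_type} not allowed"))
            break
        history.append((stage.name, "OK"))
    return history


# Assumed example configuration: the last stage only allows text formats.
pipeline = [
    Stage("web connector", None),
    Stage("tika transformation", None),
    Stage("solr output", {"text/html", "text/plain"}),
]

for entry in trace_document("application/pdf", pipeline):
    print(entry)
```

In this made-up configuration the PDF passes the first two stages and is dropped at the last one without raising an error, which is the kind of situation where a per-document history entry is the only place the reason shows up.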
