Posted to user@manifoldcf.apache.org by Andrea Piemontese <ze...@gmail.com> on 2014/07/14 13:16:10 UTC

Mapping Webcrawler metadata

Hi All,

I'm trying to work out which metadata will be extracted by the
WebcrawlerConnector so that it can be imported and indexed by the SolrConnector.

Executing a Job with WebcrawlerConnector as input and SolrConnector as
output, the metadata fields I get in Solr are the following:

- links
- id
- author
- authors
- title
- content_type
- resourcename
- content
- _version_

Is there a way to know which metadata are extracted by the WebcrawlerConnector?
In other words, which metadata can I use in the "Solr Field Mapping"
tab of the job configuration?

Thanks a lot in advance.

Re: Mapping Webcrawler metadata

Posted by Karl Wright <da...@gmail.com>.
Hi Andrea,

The web crawler connector sends along all HTTP header values as metadata,
EXCEPT for certain explicitly excluded ones.  The excluded headers are
those which are involved in authorization or which would change on every
fetch.
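
As a rough illustration of that rule (not the connector's actual code), here
is a minimal Python sketch that turns a list of HTTP response headers into
metadata while skipping excluded names. The EXCLUDED_HEADERS set and the
headers_to_metadata function are hypothetical; the connector's real exclusion
list is not given in this thread.

```python
# Hypothetical exclusion list, for illustration only: headers tied to
# authorization or likely to change on every fetch. This is NOT the
# connector's real list.
EXCLUDED_HEADERS = {"set-cookie", "www-authenticate", "authorization", "date", "age"}


def headers_to_metadata(headers, excluded=EXCLUDED_HEADERS):
    """Map header name -> list of values, skipping excluded header names.

    `headers` is an iterable of (name, value) pairs. Header names are
    compared case-insensitively, as HTTP header names are case-insensitive.
    """
    metadata = {}
    for name, value in headers:
        if name.lower() in excluded:
            continue
        metadata.setdefault(name, []).append(value)
    return metadata


sample_headers = [
    ("Content-Type", "text/html; charset=UTF-8"),
    ("Last-Modified", "Mon, 14 Jul 2014 10:00:00 GMT"),
    ("Set-Cookie", "session=abc123"),
    ("Date", "Mon, 14 Jul 2014 11:16:10 GMT"),
]
sample_metadata = headers_to_metadata(sample_headers)
print(sample_metadata)
```

With the sample above, only Content-Type and Last-Modified survive, and those
header names are the kind of thing that would be available as source metadata
for mapping.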

The metadata fields you list above seem to come not from the web
connector, but rather from Solr Cell (Tika), which is the extracting update
handler in Solr.  I don't know the full set of fields Tika can generate.  The
Tika-generated metadata fields cannot be mapped using the Solr Field Mapping
tab because that extraction takes place in Solr, not in ManifoldCF.
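
One way to see which fields actually ended up in the index, whatever produced
them, is to ask Solr's Luke request handler. The Python sketch below assumes a
core reachable at a URL such as http://localhost:8983/solr/collection1 (an
example value, not taken from this thread) with the /admin/luke handler
enabled; the function names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def luke_url(core_url):
    """Build the Luke handler URL for a Solr core (core_url is an assumption)."""
    query = urlencode({"numTerms": 0, "wt": "json"})
    return core_url.rstrip("/") + "/admin/luke?" + query


def field_names(luke_response):
    """Return the sorted field names from a parsed Luke JSON response."""
    return sorted(luke_response.get("fields", {}))


def fetch_field_names(core_url):
    """Query a running Solr core and list the fields present in its index."""
    with urlopen(luke_url(core_url)) as response:
        return field_names(json.load(response))
```

Calling fetch_field_names("http://localhost:8983/solr/collection1") against a
running instance should return the field names in the index, which can then be
compared against the HTTP headers the crawler passes along.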

MCF 1.7 will have the option of running Tika locally in MCF, as a
transformation connector, and not using Solr's extracting update handler,
so you should have better control when 1.7 is released.

Thanks,
Karl


