Posted to solr-user@lucene.apache.org by Jean-Luc Thiebaut <je...@gmail.com> on 2010/11/10 23:08:40 UTC

Crawling with nutch and mapping fields to solr

Hi

I'm fairly new to Solr, but I have it configured, along with Nutch, as per
this tutorial: http://ubuntuforums.org/showthread.php?p=9596257

Nutch is crawling and injecting documents into Solr as expected. However, I
want to break the data down further so that what ends up in Solr is more
granular.

Can anyone explain in simple terms how I might go about parsing the data I
get from Nutch and mapping it to custom fields? Ideally I'd like to be able
to pull out metadata from the source HTML and map it to specific fields in
Solr.

I hope I'm in the right place to ask this question. Any help would be much
appreciated.

Jean-Luc

Re: Crawling with nutch and mapping fields to solr

Posted by Ramavtar Meena <ra...@gmail.com>.

Hi,

This question is better suited to the Nutch mailing list, but let me give
you a couple of pointers.

If it's only metadata you need, you can use the patch mentioned below. If
you want more flexibility with your data, you can look at writing your own
parser plugin. Here is a good place to start:

http://wiki.apache.org/nutch/WritingPluginExample-0.9

xpath+htmlcleaner+beanshell would be a good set of tools for your custom parser.
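
To make the XPath idea concrete, here is a minimal standalone sketch that
pulls meta name/content pairs out of a page. It is not a Nutch plugin and
does not use the Nutch API; it relies only on the JDK's XML and XPath
classes and assumes the input is already well-formed XHTML (real-world
HTML would first need a cleanup step, for example with HtmlCleaner). The
class name MetaTagExtractor and the method extractMeta are hypothetical,
chosen for illustration only.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch or Solr.
public class MetaTagExtractor {

    /** Returns meta name -> content pairs from a well-formed XHTML string. */
    public static Map<String, String> extractMeta(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Namespace-unaware parsing keeps the XPath simple (//meta).
        factory.setNamespaceAware(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // Select only <meta> elements that carry both attributes.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "//meta[@name and @content]", doc, XPathConstants.NODESET);

        // Preserve document order; lower-case the names so they can serve as field keys.
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element meta = (Element) nodes.item(i);
            String name = meta.getAttribute("name").toLowerCase(Locale.ROOT);
            fields.put(name, meta.getAttribute("content"));
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input: no DOCTYPE, so the parser fetches no external DTD.
        String page = "<html><head><title>Example</title>"
                + "<meta name=\"keywords\" content=\"nutch, solr\"/>"
                + "<meta name=\"Author\" content=\"Jean-Luc\"/>"
                + "</head><body><p>Hello</p></body></html>";
        System.out.println(extractMeta(page));
    }
}
```

The resulting name/content pairs are the kind of values you would then map
to custom Solr fields; how they are handed to the indexer depends on the
Nutch version and plugin setup, which this sketch does not cover.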

regards,
Ram

On Thu, Nov 11, 2010 at 9:21 PM, Jean-Luc <je...@gmail.com> wrote:
>
> I'm going down the route of patching nutch so I can use this ParseMetaTags
> plugin:
> https://issues.apache.org/jira/browse/NUTCH-809
>
> I'm also wondering whether I will be able to use the XMLParser to parse
> well-formed XHTML; being able to use XPath would be a bonus:
> https://issues.apache.org/jira/browse/NUTCH-185
>
> Any thoughts appreciated...
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Crawling-with-nutch-and-mapping-fields-to-solr-tp1879060p1883295.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Crawling with nutch and mapping fields to solr

Posted by Jean-Luc <je...@gmail.com>.

I'm going down the route of patching nutch so I can use this ParseMetaTags
plugin:
https://issues.apache.org/jira/browse/NUTCH-809

I'm also wondering whether I will be able to use the XMLParser to parse
well-formed XHTML; being able to use XPath would be a bonus:
https://issues.apache.org/jira/browse/NUTCH-185

Any thoughts appreciated...