Posted to user@nutch.apache.org by Johan Svensson <jo...@euroling.se> on 2011/08/12 14:07:42 UTC

Working with facets

Hello,

This is my first question to this list, and I am equally new to Nutch.
Sorry if this question might be too general. I'd be happy with good
links to documentation, if there are any. Google won't help me find
them.

I need to understand how facets can be extracted from a web site crawled
by Nutch then indexed by Solr. On the web site, pages have meta tags,
like <meta name="price" content="123.45"/> or <meta name="categories"
content="category1, category2"/>. Can I tell Nutch to extract those and
Solr to treat them as facets?

In the example above, I want to specify manually that the meta name
"categories" is to be treated as a facet, but the content should be
dynamically used as categories.

Does it make sense? Is it possible to do with Nutch and Solr, or should
I rethink my way of using it?

Best Regards,

Johan



Re: Working with facets

Posted by Markus Jelsma <ma...@openindex.io>.

On Friday 12 August 2011 14:07:42 Johan Svensson wrote:
> Hello,
> 
> This is my first question to this list, and I am equally new to Nutch.
> Sorry if this question might be too general. I'd be happy with good
> links to documentation, if there are any. Google won't help me find
> them.
> 
> I need to understand how facets can be extracted from a web site crawled
> by Nutch then indexed by Solr. On the web site, pages have meta tags,
> like <meta name="price" content="123.45"/> or <meta name="categories"
> content="category1, category2"/>. Can I tell Nutch to extract those and
> Solr to treat them as facets?

You first need to extract the metadata from your documents in Nutch and add it
as fields to your Nutch documents. I have never tried it myself, but there are
some discussions about `extracting meta data using nutch` on the internet.
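
To make the extraction step concrete, here is a minimal standalone sketch in
Python. It is not Nutch plugin code and does not use any Nutch API; it only
illustrates the idea of picking out selected meta tags and splitting a
comma-separated content attribute into separate values. The names
MetaTagExtractor and extract_meta_fields are hypothetical, and splitting on
commas is an assumption based on the "category1, category2" example.

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collects <meta name="..." content="..."> values for selected names."""

    def __init__(self, wanted_names):
        super().__init__()
        self.wanted_names = set(wanted_names)
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name in self.wanted_names and content is not None:
            # Assumption: several values are separated by commas.
            values = [v.strip() for v in content.split(",") if v.strip()]
            self.fields.setdefault(name, []).extend(values)


def extract_meta_fields(html, wanted_names):
    """Return a dict mapping each wanted meta name to a list of values."""
    parser = MetaTagExtractor(wanted_names)
    parser.feed(html)
    parser.close()
    return parser.fields


if __name__ == "__main__":
    page = (
        '<html><head>'
        '<meta name="price" content="123.45"/>'
        '<meta name="categories" content="category1, category2"/>'
        '</head><body></body></html>'
    )
    print(extract_meta_fields(page, ["price", "categories"]))
```

Each key in the resulting dict would correspond to one field on the indexed
document, with "categories" carrying several values.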

Once the fields are in Solr you can use them as facets with ease.
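
As a rough sketch of what that looks like from a client, the Python snippet
below builds a Solr facet request and reads the counts from a JSON response.
The parameters q, rows, facet, facet.field, facet.mincount and wt are standard
Solr request parameters. The URL http://localhost:8983/solr/select and the
field name "categories" are assumptions, as are the helper names
build_facet_query and parse_facet_counts. The field has to be indexed in the
Solr schema, and multi-valued if a page can have several categories.

```python
from urllib.parse import urlencode


def build_facet_query(select_url, facet_field, query="*:*"):
    """Build a Solr select URL that returns only facet counts for one field."""
    params = [
        ("q", query),
        ("rows", 0),               # facet counts only, no documents
        ("facet", "true"),
        ("facet.field", facet_field),
        ("facet.mincount", 1),     # skip values with no matching documents
        ("wt", "json"),
    ]
    return select_url + "?" + urlencode(params)


def parse_facet_counts(response, facet_field):
    """Turn Solr's flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][facet_field]
    return dict(zip(flat[::2], flat[1::2]))


if __name__ == "__main__":
    # Assumed local Solr URL; adjust to your own installation.
    print(build_facet_query("http://localhost:8983/solr/select", "categories"))

    # Hand-written sample in the shape of Solr's default JSON facet output.
    sample = {
        "facet_counts": {
            "facet_fields": {"categories": ["category1", 2, "category2", 1]}
        }
    }
    print(parse_facet_counts(sample, "categories"))
```

Because the facet values come from whatever is indexed in the field, new
categories found on crawled pages show up in the counts without any change to
the query.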

> 
> In the example above, I want to specify manually that the meta name
> "categories" is to be treated as a facet, but the content should be
> dynamically used as categories.
> 
> Does it make sense? Is it possible to do with Nutch and Solr, or should
> I rethink my way of using it?
> 
> Best Regards,
> 
> Johan

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350