Posted to user@nutch.apache.org by Peyman Mohajerian <mo...@gmail.com> on 2011/11/27 18:12:21 UTC

Subcategorizing Page Content

Hi,

I have used the Nutch and Solr integration to crawl and index some
content successfully. However, I now need to categorize the content into
a more refined list, e.g. imagine a page has sports and news sections
(in one URL) and I'd like to have each indexed separately in Solr.
Presumably I have to customize the HTMLParser and look for some CSS
tags to find the main labels and the items below those labels. Is there
any parser that reads CSS tags? I also need to modify schema.xml so
that, instead of just 'content', it has other fields such as 'sport',
'news', etc. Can these fields be nested in a hierarchy, e.g. under
'content', or do they have to be separate fields?
Other than changing the parser, what else do I have to worry about? I
don't think this is a very uncommon use case, so are there any more
clues or examples? I hope I don't have to touch the SolrIndexer.
Another alternative, I think, is to have Solr store the full 'content'
and do all of the above within Solr. I don't have enough experience to
know which approach is better.

Thanks,
Peyman

Re: Subcategorizing Page Content

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Peyman,

There are a couple of questions here, some of which I must admit are
purely Solr related.

1) You seem to have a pretty good idea of what needs to be customised and
where. With regards to a CSS parser, I would assume that Tika would handle
this for you. I would be extremely surprised if it didn't. Having had a
quick look through the Tika archives for the keyword CSS [1], there is
plenty there, so hopefully you can implement something from the libraries.
2) With regards to the various fields, have a look at the new Solr 4.x
schema support that Andrzej added; this will give you a flavour of more
complex/expressive configurations. As for nesting fields within some sort
of hierarchy, I am not entirely sure; maybe someone else can advise.
However, even if this is not possible, you can still create individual
fields as we do for numerous other elements.
3) I would imagine that an indexing filter to handle all of this will
leave you free from having to hack the SolrIndexer.

[1] http://tika.markmail.org/search/?q=css
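
To make points 1 and 2 above a bit more concrete, here is a minimal
sketch of the splitting logic only. It is not Nutch or Tika code (a real
Nutch plugin would be written in Java); it is plain Python using the
standard library's html.parser, and it assumes the page marks its
sections with class attributes named "sports" and "news" and that the
markup is properly nested. The CLASS_TO_FIELD mapping, the
SectionSplitter class and the split_sections function are hypothetical
names made up for illustration. The result is a flat set of separate
fields ('sport', 'news') alongside the full 'content', i.e. the
"individual fields" fallback rather than a nested hierarchy.

```python
from html.parser import HTMLParser

# Hypothetical mapping from CSS class names on the page to index field names.
CLASS_TO_FIELD = {"sports": "sport", "news": "news"}

# Void elements never get a closing tag, so they are ignored for nesting.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SectionSplitter(HTMLParser):
    """Collects text per labelled section, plus the text of the whole page."""

    def __init__(self):
        super().__init__()
        self.fields = {"content": []}
        self._stack = []  # one entry per open element: a field name or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        field = next((CLASS_TO_FIELD[c] for c in classes
                      if c in CLASS_TO_FIELD), None)
        self._stack.append(field)

    def handle_endtag(self, tag):
        # Assumes well-nested markup: every non-void start tag is closed.
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.fields["content"].append(text)
        # Attribute the text to the innermost enclosing labelled section.
        for field in reversed(self._stack):
            if field is not None:
                self.fields.setdefault(field, []).append(text)
                break


def split_sections(html):
    """Return a flat dict of field name -> text, one entry per section."""
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.fields.items()}


if __name__ == "__main__":
    page = ('<div class="sports"><h2>Sports</h2><p>Team wins final.</p></div>'
            '<div class="news"><h2>News</h2><p>Election results.</p></div>')
    for name, text in split_sections(page).items():
        print(name, "=>", text)
```

In a Nutch setup the same idea would presumably live in a parse filter
that stores the per-section text as parse metadata, with an indexing
filter copying each entry into its own field, and schema.xml declaring
one field per section name. Treat that mapping as an assumption to check
against the plugin documentation rather than as a tested recipe.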

-- 
*Lewis*