Posted to solr-user@lucene.apache.org by Rajesh Nikam <ra...@gmail.com> on 2013/05/27 14:57:47 UTC

using solr for web page classification

Hello,

I am working on implementing a system to categorize URLs/web pages.

I would have categories such as:

Adult          Health         Business
Arts           Home           Science

I am looking at how Lucene/Solr could help me achieve this.
I came across links that mention that MoreLikeThis could help.

I found LucidWorks Search helpful, as it installs Jetty and Solr in a
few clicks.

Importing data and querying were also straightforward.

 My question is:

 - I have a predefined list of categories. For each one I would have web
pages and documents that could be stored in the Solr index, tagged with
their category.

 - Each page would go through input processors such as:

         Text extractor (from HTML, PDF, Office formats)
         Text language detection
         Standard text processors - stemming, stopword removal,
         lowercasing, etc.
         Title extractor
         Summary extractor
         Field mapping
         Header and footer remover

 - All these documents could be processed and stored in the Solr index
with a known category.

 - When a new request comes in, I need to run an MLT or Solr query based
on the content of the web page and get similar documents.
 Based on the results, I could reply with the top 3 categories.
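
The steps above can be sketched roughly as follows. This is only an
illustration under stated assumptions, not a tested setup: the field names
(id, title, content, category) are hypothetical, a MoreLikeThisHandler is
assumed to be configured in solrconfig.xml and to accept the page text via
stream.body, and summing scores per category is just one possible way to
pick the top 3.

```python
from collections import defaultdict
from urllib.parse import urlencode


def build_index_doc(doc_id, title, text, category):
    """Return one document for a Solr JSON update.

    The field names are hypothetical and must match the schema in use.
    """
    return {"id": doc_id, "title": title, "content": text, "category": category}


def build_mlt_query(page_text, rows=20):
    """Build the query string for a MoreLikeThis request.

    Assumes a MoreLikeThisHandler that takes the new page's text
    through the stream.body parameter.
    """
    params = {
        "stream.body": page_text,    # text of the page to classify
        "mlt.fl": "content",         # field used to pick similarity terms
        "mlt.mintf": 1,              # minimum term frequency in the input
        "mlt.mindf": 1,              # minimum document frequency in the index
        "fl": "id,category,score",   # only the fields needed for voting
        "rows": rows,
        "wt": "json",
    }
    return urlencode(params)


def top_categories(hits, n=3):
    """Sum similarity scores per category and return the n best categories."""
    totals = defaultdict(float)
    for hit in hits:
        totals[hit["category"]] += hit["score"]
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [category for category, _ in ranked[:n]]
```

Here hits would be the list of similar documents parsed from the JSON
response, each carrying its stored category and its score. Summing scores
lets several moderately similar pages in one category outweigh a single
close match; counting hits per category would be a simpler alternative.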


 Please let me know whether using Solr this way is the right approach to
 this problem.
 If so, how should I form the query based on the web page contents?

Thanks
Rajesh