Posted to solr-user@lucene.apache.org by Rajesh Nikam <ra...@gmail.com> on 2013/05/27 14:57:47 UTC
using solr for web page classification
Hello,
I am working on implementing a system to categorize URLs/web pages.
I would have categories such as:
Adult, Health, Business, Arts, Home, Science
I am looking at how Lucene/Solr could help me achieve this.
I came across links that mention MoreLikeThis (MLT) could help.
I found LucidWorks Search helpful, as it installed Jetty and Solr in a
few clicks.
Importing data and querying were also straightforward.
My question is:
- I have a predefined list of categories, and for each one I would have
web pages and documents that could be stored in the Solr index with
their category assigned
- Run input processors on each page, such as:
Text extractor (from HTML, PDF, Office format)
Text language detection
Standard text processors - stemming, stopword removal, lowercasing,
etc.
Title extractor
Summary extractor
Field mapping
Header and footer remover
- All these documents could be processed and stored in the Solr index
with a known category
- When a new request comes in, I need to run an MLT or Solr query based
on the content of the web page and get similar documents.
Based on the results I could reply with the top 3 categories.
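
To make the last step concrete, here is a minimal sketch of how the top 3
categories might be picked from the similar documents. It assumes each hit
comes back with a "category" field and a relevance "score"; both field
names and the sample hits are hypothetical, and summing scores per category
is only one possible voting scheme.

```python
from collections import defaultdict


def top_categories(similar_docs, k=3):
    """Rank categories by the summed score of the similar documents."""
    totals = defaultdict(float)
    for doc in similar_docs:
        # "category" and "score" are assumed field names, not a real schema.
        totals[doc["category"]] += doc["score"]
    # Highest total first; ties are broken alphabetically for stable output.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [category for category, _ in ranked[:k]]


# Hypothetical hits, shaped like documents returned by a similarity query.
hits = [
    {"category": "Health", "score": 2.0},
    {"category": "Science", "score": 1.5},
    {"category": "Health", "score": 1.0},
    {"category": "Business", "score": 0.5},
    {"category": "Arts", "score": 0.25},
]
print(top_categories(hits))
```

Summing scores rather than counting hits lets one strong match outweigh
several weak ones; a plain count per category would be the simpler variant.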
Please let me know whether using Solr this way is the right approach to
this problem.
If yes, how should I form the query based on the web page contents?
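
As a rough illustration of what I mean by forming the query, the sketch
below builds a request URL for Solr's MoreLikeThis handler from the page
text. It assumes the handler is registered at /mlt in solrconfig.xml, that
the text may be passed through the stream.body parameter (this depends on
the Solr version and configuration), and that the index has "title",
"content", and "category" fields. The base URL and field names are
hypothetical. Nothing is sent over the network here.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, page_text, fields=("content",), rows=10):
    """Build a MoreLikeThis request URL for the extracted page text."""
    params = {
        "stream.body": page_text,     # text to find similar documents for
        "mlt.fl": ",".join(fields),   # fields used for similarity
        "mlt.mintf": 1,               # minimum term frequency in the input
        "mlt.mindf": 1,               # minimum document frequency in the index
        "fl": "id,category,score",    # assumed stored fields to return
        "rows": rows,
        "wt": "json",
    }
    # The "/mlt" path is an assumption about how the handler is configured.
    return base_url.rstrip("/") + "/mlt?" + urlencode(params)


# Hypothetical core URL and page text.
url = build_mlt_url(
    "http://localhost:8983/solr/collection1",
    "vitamin supplements and a healthy diet",
    fields=("title", "content"),
)
print(url)
```

A full web page would probably be too long for a GET URL, so in practice
the same parameters would more likely go in a POST body.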
Thanks
Rajesh