Posted to user@nutch.apache.org by Vishal Tomar <vi...@gmail.com> on 2014/06/18 14:27:01 UTC

Help in developing a vertical search using nutch

Hi,

I am new to Apache Nutch and to web crawlers in general, and I am trying to
build a vertical search engine for real estate.

How do I implement the crawler? My plan is to use Nutch for the crawling
and modify it to only extract links from a page if the page contents are
relevant to real estate. I'd probably need to write some kind of relevancy
scoring function which uses a mixture of keywords, an ontology and some kind
of similarity detection based on sites I know to be relevant.
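
A minimal sketch of what such a scoring function could look like in Python,
assuming a hand-picked keyword set and a few reference pages already known to
be relevant. The keyword list, the function names and the 50/50 weighting are
all hypothetical, and the ontology part is left out. It is standalone Python,
not a Nutch plugin; how it would be wired into Nutch is left open.

```python
import math
import re
from collections import Counter

# Hypothetical keyword list; a real one would come from a domain vocabulary.
KEYWORDS = {"property", "estate", "apartment", "mortgage",
            "realtor", "listing", "rent"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_score(tokens):
    """Fraction of tokens that are domain keywords (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in KEYWORDS) / len(tokens)


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def relevance(page_text, reference_texts, keyword_weight=0.5):
    """Blend keyword density with the best similarity to any reference page."""
    tokens = tokenize(page_text)
    page_tf = Counter(tokens)
    best_sim = max(
        (cosine_similarity(page_tf, Counter(tokenize(ref)))
         for ref in reference_texts),
        default=0.0,
    )
    return keyword_weight * keyword_score(tokens) + (1 - keyword_weight) * best_sim
```

A page would then be kept, and its links followed, only if relevance() is
above some threshold chosen by experiment.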

Is there any way to configure Nutch to use my relevancy scoring function,
or do I need to change the source code? Also, I would prefer working in
Python over Java as I am much more familiar with it, so is there a Python
library for Nutch?

Apart from this, I would really appreciate any other pointers regarding Nutch
in general.

Thanks
Vishal

Re: Help in developing a vertical search using nutch

Posted by Nicholas Roberts <ni...@gmail.com>.

You might be interested in my www.bigdatadrupal.com

Re: Help in developing a vertical search using nutch

Posted by Guy McDowell <gu...@gmail.com>.

Hey Vishal,

I'm attempting to do a very similar thing, but not with real estate. I'm
only about one step ahead of you in this process though, so I can't offer
much help.

I think you are on the right path as far as having Nutch crawl only
websites related to real estate. A whole web crawl starting with seed URLs
outside of that vertical would probably be a waste of your time. Might as
well start with seeds in the vertical.
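
As a rough sketch of keeping the crawl inside the vertical: the Python below
turns a list of seed URLs into allow rules in the "+regex" / "-regex" line
format that, as I understand it, Nutch's conf/regex-urlfilter.txt uses, ending
with a rule that rejects everything else. The function name and the example
hosts are made up for illustration.

```python
from urllib.parse import urlparse


def url_filter_rules(seed_urls):
    """Build one allow rule per seed host, plus a final reject-all rule."""
    hosts = sorted({urlparse(u).hostname for u in seed_urls
                    if urlparse(u).hostname})
    rules = []
    for host in hosts:
        # Hostnames only contain letters, digits, hyphens and dots,
        # so the dot is the only character that needs escaping.
        escaped = host.replace(".", r"\.")
        # Allow the host itself and any of its subdomains.
        rules.append(r"+^https?://([a-z0-9-]+\.)*" + escaped + "/")
    rules.append("-.")  # reject every URL not matched above
    return rules


if __name__ == "__main__":
    seeds = ["http://www.example.com/", "https://homes.example.org/listings"]
    print("\n".join(url_filter_rules(seeds)))
```

The seed URLs themselves would go, one per line, in the seed file that is
injected into the crawl.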

I think if you're using Nutch with Solr as the front-end search for your
users, Solr will rank your results based on relevancy of the keywords
entered in the search. I'm focusing on learning Nutch right now, so I'm not
certain of everything Solr does.

From the research I've done, using Nutch 1.x is better than 2.x as it is
more stable and has more features. I could be wrong, but I think that's
worth double checking on.

I look forward to following your progress and learning from you. Hopefully
my progress will be able to help you as well.

Cheers!

Guy McDowell
guymcdowell@gmail.com
http://www.GuyMcDowell.com


Re: Help in developing a vertical search using nutch

Posted by Guy McDowell <gu...@gmail.com>.

Makes perfect sense and articulates very well what I was planning to do for
my vertical Nutch/Solr implementation.

Guy McDowell
guymcdowell@gmail.com
http://www.GuyMcDowell.com






Re: Help in developing a vertical search using nutch

Posted by John McCormac <jm...@hackwatch.com>.

On 18/06/2014 13:27, Vishal Tomar wrote:
> Hi,
>
> I am new to apache nutch and web crawlers in general, I am trying to build
> a vertical search engine for real estate.
>
> Now, How do I implement the crawler? Probably use Nutch for the crawling
> and modify it to only extract links from a page if the page contents are
> relevant to real estate. I'd probably need to write some kind of relevancy
> scoring function which uses a mixture of keywords, ontology and some kind
> of similarity detection based on sites I know to be relevant.

I think that you might be jumping ahead a few steps. Building a vertical
search engine is quite different from building an ordinary crawl-based
search engine. With a vertical, new sites are not so much detected as
added. It is much the same as building a web directory.
You need to identify the relevant websites and then add them to the
crawl schedule. Otherwise you will end up having to clean the index
after it has included a lot of junk websites. By controlling the
websites that you add, you also make it a lot easier to deal with
compromised websites.

Though Nutch is impressive, I am not exactly up to speed on using it for 
crawling and search as my main work is with domain names and 
website/IP/country mapping.

A better strategy (rather than running a full crawl on all sites)
would be to fetch only each site's index page and then analyse that for
real estate keywords and phrases. That could be a faster way of building
a list of candidate sites for crawling. (Effectively you break your site
acquisition process into three parts: Collection, Detection, Selection.)
It might sound like a convoluted way of doing things but for vertical
search, it is a lot simpler than cleaning an index. :)
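
The three parts could be sketched in Python roughly as follows. This is only
an illustration under assumptions: the phrase list, the function names and the
min_hits threshold are hypothetical, tag stripping is deliberately crude, and
the result is a list of candidates to review before adding them to the crawl,
not a final decision.

```python
import re
from urllib.request import urlopen

# Hypothetical phrase list; tune it for the vertical.
PHRASES = ("for sale", "real estate", "bedroom", "mortgage", "to let")


def fetch_index_page(domain, timeout=10):
    """Download only the index page of a domain and return it as text."""
    with urlopen("http://" + domain + "/", timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def detect(html):
    """Count phrase occurrences in the page text (tags stripped crudely)."""
    text = re.sub(r"<[^>]+>", " ", html).lower()
    text = re.sub(r"\s+", " ", text)
    return sum(text.count(phrase) for phrase in PHRASES)


def select_candidates(domains, fetch=fetch_index_page, min_hits=2):
    """Return (domain, hits) pairs that reach min_hits, best first."""
    candidates = []
    for domain in domains:            # Collection: the list of domains
        try:
            html = fetch(domain)
        except OSError:               # unreachable or failing sites are skipped
            continue
        hits = detect(html)           # Detection: look for vertical phrases
        if hits >= min_hits:          # Selection: keep likely candidates
            candidates.append((domain, hits))
    return sorted(candidates, key=lambda c: c[1], reverse=True)
```

Passing the fetch function as a parameter makes it easy to try the detection
step on saved pages before touching the network.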

Regards...jmcc
-- 
**********************************************************
John McCormac  *  e-mail: jmcc@hosterstats.com
MC2            *  web: http://www.hosterstats.com/
22 Viewmount   *  Domain Registrations Statistics
Waterford      *  And Historical DNS Database.
Ireland        *  Over 392 Million Domains Tracked.
IE             *  http://www.hosterstats.com/blog
**********************************************************