Posted to user@nutch.apache.org by Berlin Brown <be...@gmail.com> on 2007/10/14 02:25:15 UTC

Possible public applications with nutch and hadoop

I really like the concept of Nutch and Hadoop, but I haven't been able
to build an application with them.  Most of the apps I like building
are targeted at the public, anyone on the internet.  I built a
crawler of top sites like the NYTimes and Slate, but I couldn't filter
out the sites that were off-topic.  E.g., links to advertising sites
made up the majority of the search content.

My question: have you built a general site to crawl the internet, and
how did you find links that people would be interested in, as opposed
to capturing a lot of the junk out there?

I guess stop words and other filters are important, but I was never
successful in building them.  It is almost as if, for these types of
apps, Nutch needs a custom spam system.

-- 
Berlin Brown
http://botspiritcompany.com/botlist/spring/help/about.html
newspirit technologies

Re: Possible public applications with nutch and hadoop

Posted by xu xiong <xi...@gmail.com>.
> I can only conclude that the way to succeed as a search startup is to
> CRAWL DIFFERENTLY. Focus on websites in specific regions, specific
> topics, specific data types. Crawl into the corners of websites that
> contain interesting nuggets of data (listings, calendars, etc) that
> won't ever have a high PageRank. Find a data-niche with an audience
> you understand, and hammer away.

Yes, I totally agree that we need more control over the crawl process.

-xiong

Re: Possible public applications with nutch and hadoop

Posted by Matt Kangas <ka...@gmail.com>.
Andrzej, before I dive into your specific questions... I want to step  
back to the original topic: what applications are possible with Nutch?

The specialization that I focused on was a _listings_ crawler. There  
are any number of listings types that one could potentially crawl:

- events (what I focused on)
- job postings
- news
- any category that you might find on either craigslist or ebay

One caveat about listings as a data type: most traditional listing
aggregators (newspapers, etc.) employ editors to acquire their
listings, usually via costly methods. They will not be happy if your
business model is based on scraping their content. (Hence Google
News' ongoing fights with the Associated Press.) If you build a
listings search startup, it's a good idea to get your listings
directly from the listing-creator, not a middleman.

Since I wanted to crawl HTML pages and index _listings_ (0..n per  
HTML page), my system outline looks like this:

- customized Nutch 0.7 crawler
   - custom segment reader, running feature-detector
   - feedback into crawl_db
   - crawl segments not used beyond this point.
- HTML+feature markup fed into an extraction pipeline
- individual event listings written to a listing_db (disk)
- synthetic "crawler" to traverse listing_db, creates new "segments"  
with one record/listing
- Nutch indexer (w. custom plugins) creates Lucene index
- custom servlets to return XML; PHP front-end turns XML results into  
HTML
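
For concreteness, each record in that listing_db was essentially a
flattened event. A minimal sketch of what such a record might hold -
the field names here are illustrative, not the actual schema:

import java.util.Date;

/** One extracted listing as stored in listing_db (illustrative fields only). */
public class EventListing {
    private final String title;      // e.g. "Jazz night at the Blue Note"
    private final Date   startDate;  // parsed from a date pattern on the page
    private final String venue;      // free-text location, if extraction found one
    private final String sourceUrl;  // the page this listing was extracted from

    public EventListing(String title, Date startDate, String venue, String sourceUrl) {
        this.title = title;
        this.startDate = startDate;
        this.venue = venue;
        this.sourceUrl = sourceUrl;
    }

    public String getTitle()     { return title; }
    public Date   getStartDate() { return startDate; }
    public String getVenue()     { return venue; }
    public String getSourceUrl() { return sourceUrl; }
}

The synthetic "crawler" simply walks these records and emits one
Nutch-style record per listing.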

That's a lot of custom stuff. Shame that any listing-oriented startup  
would have to go through that whole process.

If Solr ever displaces the NutchBean/SearchServlet, that will  
eliminate one step. The bean+servlet is nearly useless for any  
startup, because you'll have to hack them beyond recognition to  
implement a distinctive UI. And, as I mentioned before, creating a  
distinctive product is essential to your startup's survival.

I'm using Solr now on a different project and I love it. Wish I had  
it two years ago. Now, about the crawler...

----( back to Andrzej's question )----

On Oct 16, 2007, at 1:10 PM, Andrzej Bialecki wrote:
> Matt Kangas wrote:
>> In this regard, I always found Nutch a bit painful to use. The  
>> Nutch crawler is highly streamlined for straight-ahead Google- 
>> scale crawling, but it's not modular enough to be considered a  
>> "crawler construction toolkit". This is sad, because what you need  
>> to "crawl differently" is just such a toolkit. Every search  
>> startup must pick some unique crawling+ranking strategy, something  
>> they think will dig up their audience's desired data as cheaply as  
>> possible, and then implement it quickly.
> [...]
>
> In your opinion, what is missing in Nutch to support such use?  
> Single-node operation, management UI, modularity, simplified  
> ranking, xyz ?

Modularity, most definitely.

Consider this scenario: you want to crawl into CGIs, but not all CGIs  
will have content of interest. Example: a site has an events calendar  
and a message board. Crawling the message board is a huge waste of  
bandwidth and cpu. Obviously, you'd like to avoid it. (Same holds if  
you're crawling an auto-dealer's site for car listings, etc.)

If you are feature-detecting pages, one solution is to have a shallow  
default depth-limit, then increase this limit when "interesting"  
content is found. To cut off the crawl on dead-ends, you want  
"updatedb" to pay attention to a (new) parse-status.

We implemented this. If I'm remembering this correctly... the problem  
is, in Nutch 0.7, parse-status isn't something that can force the  
termination of a crawl-path. Only fetch-status is considered.

Looking at Nutch 0.7.2 to refresh my memory, I see in  
Fetcher.FetcherThread:
- "public run()" calls "private handleFetch()" once the content is  
acquired
- if "Fetcher.parsing" is true, and parse-status isSuccess, then:
- "private outputPage()" is called with (FetcherOutput, Content,  
ParseText, ParseData)

"handleFetch()" is what I want to tweak, but it's private. So I have  
to skip parsing here, and implement a custom segment-processor (and  
updatedb) step.

This is a decision-making step in the crawler that can't be easily  
overridden by the user. It's an obstacle to "crawling differently".  
What a startup needs, IMO, is a "crawler construction toolkit":  
something that ships in a sane default configuration, but where  
nearly any operation can be swapped out or overridden.

-------

Some ideas for making the crawler more flexible:

1) Rewrite Fetcher as a "framework", a la Spring, WebObjects, or Rails
(I think). Every conceivable step has its own, non-private method. To
customize its behavior, subclass and override the steps of interest.
- This is my strawman solution. :) I know Doug is strongly against
using OO in this manner. (A rough sketch of what I mean follows after
this list.)

2) Make lots and lots of extension points. Fetcher.java thus becomes
nothing more than a sequence of extension-point calls.
- It could work, but... it seems like a mess. Diagnosing config errors
is bad enough already; this would make it worse.

3) Mimic Tomcat or Jetty's design, both of which are "toolkits for
web server construction".
- Config errors are hard to diagnose here, too. We'd need a tool to
sanity-check a crawler config at startup, or... ?

All things considered, I'd rather bake a crawler configuration into
a .java file than an .xml file. This way the compiler can (hopefully)
validate my crawl configuration, instead of sifting through
ExtensionPoint/Plugin exceptions at runtime.
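
To make the strawman in 1) concrete: a toy, deliberately non-Nutch
sketch in which every step of the fetch loop is an overridable method
and the "configuration" is just a subclass the compiler can check.
None of these names exist in Nutch.

/** Toy fetcher-as-framework; every step is a hook a subclass may override. */
public abstract class FetchPipeline {

    /** Template method: the fixed sequence of steps for one URL. */
    public final void process(String url) {
        byte[] content = fetch(url);
        if (content == null) {
            return;                          // fetch failed; nothing to do
        }
        ParseResult parse = parse(url, content);
        if (shouldOutput(url, parse)) {
            output(url, parse);
        }
    }

    protected abstract byte[] fetch(String url);

    protected abstract ParseResult parse(String url, byte[] content);

    /** Default decision; override to consult a feature-detector, depth limit, etc. */
    protected boolean shouldOutput(String url, ParseResult parse) {
        return parse != null && parse.success;
    }

    protected abstract void output(String url, ParseResult parse);

    /** Minimal parse-status carrier for this sketch. */
    public static class ParseResult {
        public final boolean success;
        public final String text;

        public ParseResult(boolean success, String text) {
            this.success = success;
            this.text = text;
        }
    }
}

A crawl configuration then becomes a .java file along the lines of
"class ListingsPipeline extends FetchPipeline { ... }", and a typo in
it is a compile error rather than a plugin exception at runtime.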


Ok, I've been typing for too long now. Time to pass the thread to the  
next person. :)

--Matt

--
Matt Kangas / kangas@gmail.com



Re: Possible public applications with nutch and hadoop

Posted by Andrzej Bialecki <ab...@getopt.org>.
Matt Kangas wrote:
> Hi Andrzej (and everyone else),
> 
> A few weeks ago, I intended to chime in on your "Scoring API issues" 
> thread, but this new thread is perhaps an even better place to speak up. 
> Time to stop lurking and contribute. :)

Thanks a lot for sharing your thoughts. Your post touches on a few
important issues ... I hope other lurkers on the lists will chime in with
their feedback!

> 
> First, I want to echo Stefan Groschupf's comment several months ago that
> the Nutch community is really lucky to have someone like you still
> working on critical issues like scoring. Without your knowledge and hard
> work, Andrzej, Nutch development would grind to a halt, or at least be
> limited to superficial changes (a new plugin every now and then, etc).

That's very kind of you, but a lot of the code has been either
contributed or co-developed with others, or based on input from the
community. Thankfully, Nutch is still a community effort .. ;)

I haven't been able to contribute as much recently as in the past, for
various reasons (next week I'm moving with my wife and two kids to
another city, and this has involved a lot of preparation ...) - the
situation should improve around November, and I should be able to propose
and implement some serious changes in Nutch that I've been mulling over.

> 
> I started following this list in the Nutch 0.6 era. For one month in 
> 2005, I considered jumping in to help with anything Doug wanted done, 
> but I quickly realized that Doug's goals and mine were at odds with each 
> other. Doug always has said he wanted to build an open-source competitor 
> to Google, and everything in Nutch has always been aligned with that 
> principle. I, on the other hand, wanted to build a search startup. A 
> head-on assault on a successful, established competitor is probably the 
> fastest way to kill any startup. The path to success is instead to zig 
> when they zag, innovate where they are not.
> 
> Crawling in the same manner as Google is probably a disaster for any 
> startup.

This is _very_ true. All wannabe SE operators should mark your words
well. I've personally participated in two such attempts, and both failed
miserably - mainly for business- and quality-related reasons. That's how
I know what kind of content you get from running an unconstrained crawl
.. ;)

Every successful venture in this area (that I know of) had some kind of
strong focus - either on specific search functionality or on an
information domain, or it combined search with other content in a novel
way.


> In this regard, I always found Nutch a bit painful to use. The Nutch 
> crawler is highly streamlined for straight-ahead Google-scale crawling, 
> but it's not modular enough to be considered a "crawler construction 
> toolkit". This is sad, because what you need to "crawl differently" is 
> just such a toolkit. Every search startup must pick some unique 
> crawling+ranking strategy, something they think will dig up their 
> audience's desired data as cheaply as possible, and then implement it 
> quickly.
[...]

In your opinion, what is missing in Nutch to support such use? 
Single-node operation, management UI, modularity, simplified ranking, xyz ?

BTW, regarding the ranking / scoring issues - I re-implemented the
scoring algorithm that we used in 0.7, based on in-degree and
out-degree. There are quite a few research papers that claim it's
roughly equivalent to PageRank (in the absence of link spamming ;)),
and it has one huge advantage over the current OPIC: it's easy to ensure
that scores are stable for a given link graph, which is not the case with
our OPIC-like scoring (and our implementation of OPIC is not easy to fix).
I'll submit a JIRA issue with the patch shortly.
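
For anyone who hasn't seen degree-based scoring: the score of a page is
a simple function of the in-degrees and out-degrees in the link graph,
so it can be computed in one pass and stays stable as long as the graph
doesn't change. A toy, in-memory sketch where each inlink contributes
1/out-degree of its source - the actual patch may well use a different
formula or normalization:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy degree-based scorer over an in-memory link graph (illustrative only). */
public class DegreeScorer {
    // target URL -> distinct source URLs linking to it
    private final Map<String, Set<String>> inLinks = new HashMap<String, Set<String>>();
    // source URL -> distinct outlink targets
    private final Map<String, Set<String>> outLinks = new HashMap<String, Set<String>>();

    public void addLink(String fromUrl, String toUrl) {
        get(inLinks, toUrl).add(fromUrl);
        get(outLinks, fromUrl).add(toUrl);
    }

    /** Each inlink contributes 1/out-degree of its source; stable for a fixed graph. */
    public double score(String url) {
        double score = 0.0;
        for (String source : get(inLinks, url)) {
            score += 1.0 / get(outLinks, source).size();
        }
        return score;
    }

    private static Set<String> get(Map<String, Set<String>> map, String key) {
        Set<String> set = map.get(key);
        if (set == null) {
            set = new HashSet<String>();
            map.put(key, set);
        }
        return set;
    }
}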

> But there are still so many things missing, like a simple place to hang 
> a feature-detector, or a way to influence the direction of the crawl 
> based on features found. Or a depth-filter so you can crawl into 
> listings w/o infinite crawls. Etc.

Incidentally, I have developed both of these, for different customers.

The feature detector required adding an extension point (ParseFilter),
which takes Content and ParseData as arguments and puts some additional
metadata in them - it is a generalization of the HtmlParseFilter concept,
only it works for any type of content. We could also add a
ContentFilter, to pre-process raw content before it's passed to parsers.
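
In outline it is just this (a sketch of the idea only - the exact
signature in the actual code may differ; Content and ParseData are the
existing Nutch classes):

import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;

/**
 * Sketch of a ParseFilter extension point: runs after parsing, for any
 * content type, and may add metadata to the parse result (e.g. a flag
 * saying "this page contains event-like dates").
 */
public interface ParseFilter {
    ParseData filter(Content content, ParseData parseData);
}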

The depth filter is easy to implement - perhaps we could even add it to
out-of-the-box Nutch ... In my case it consisted of a ScoringFilter
that incremented a counter in CrawlDatum counting the number of hops
from the initial seed. For pages discovered via multiple paths or from
more than one seed, the minimum value was taken. All outlinks were
still processed and stored in crawldb/linkdb, but the generator would
skip pages where the counter was too high.
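
Stripped of the ScoringFilter plumbing, the bookkeeping boils down to
something like this standalone sketch (illustrative only):

/** Sketch of the crawl-depth bookkeeping kept per URL in the crawl db. */
public class DepthTracker {
    private final int maxDepth;

    public DepthTracker(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    /** Depth to record for an outlink discovered from a page at parentDepth. */
    public int depthForOutlink(int parentDepth) {
        return parentDepth + 1;
    }

    /** When a URL is reached via several paths, keep the shortest one. */
    public int merge(int existingDepth, int newlyDiscoveredDepth) {
        return Math.min(existingDepth, newlyDiscoveredDepth);
    }

    /** The generator skips URLs whose recorded depth exceeds the limit. */
    public boolean shouldGenerate(int depth) {
        return depth <= maxDepth;
    }
}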



> What is good for search startups is also good for the Nutch community. 
> And what is good for search startups, IMO, is a flexible crawling 
> toolbox. +1 to any patch that helps turn the Nutch crawler into a more 
> flexible crawling toolkit.

Let's continue the discussion - it's important to decide upon a
strategic direction for Nutch, and feedback such as yours helps to set it
so that the project answers the common needs of the community, instead of
being a purely academic exercise.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Possible public applications with nutch and hadoop

Posted by Matt Kangas <ka...@gmail.com>.
Hi Andrzej (and everyone else),

A few weeks ago, I intended to chime in on your "Scoring API issues"  
thread, but this new thread is perhaps an even better place to speak  
up. Time to stop lurking and contribute. :)

First, I want to echo Stefan Groschupf's comment several months ago that
the Nutch community is really lucky to have someone like you still
working on critical issues like scoring. Without your knowledge and
hard work, Andrzej, Nutch development would grind to a halt, or at
least be limited to superficial changes (a new plugin every now and
then, etc).

I started following this list in the Nutch 0.6 era. For one month in  
2005, I considered jumping in to help with anything Doug wanted done,  
but I quickly realized that Doug's goals and mine were at odds with  
each other. Doug always has said he wanted to build an open-source  
competitor to Google, and everything in Nutch has always been aligned  
with that principle. I, on the other hand, wanted to build a search  
startup. A head-on assault on a successful, established competitor is  
probably the fastest way to kill any startup. The path to success is  
instead to zig when they zag, innovate where they are not.

Crawling in the same manner as Google is probably a disaster for any  
startup. Whole-web crawling is quite tricky & expensive, and Google  
has done such a good job already here that, once your crawl succeeds,  
how do you provide results that are noticeably better than Google's?  
Failure to differentiate your product is also a quick path to death  
for a startup.

I can only conclude that the way to succeed as a search startup is to  
CRAWL DIFFERENTLY. Focus on websites in specific regions, specific  
topics, specific data types. Crawl into the corners of websites that  
contain interesting nuggets of data (listings, calendars, etc) that  
won't ever have a high PageRank. Find a data-niche with an audience  
you understand, and hammer away.

Personally, I spent the last two years pursuing this strategy at  
busytonight.com. We built an event-search engine using Nutch 0.7 that  
crawled 30k websites in the USA, automatically discovered & extracted  
~2.5M listings, and indexed ~1M unique listings. These were
real-world events that people could go to. Sadly, I cannot show you
this site, because we ran out of funds and were forced to shut the
search-driven site down.

I say this only to point out that I care about this space and think
there are fascinating opportunities in it. But if you are a startup,
you have a finite time until death if you don't get a usable product
fully assembled.

In this regard, I always found Nutch a bit painful to use. The Nutch  
crawler is highly streamlined for straight-ahead Google-scale  
crawling, but it's not modular enough to be considered a "crawler  
construction toolkit". This is sad, because what you need to "crawl  
differently" is just such a toolkit. Every search startup must pick  
some unique crawling+ranking strategy, something they think will dig  
up their audience's desired data as cheaply as possible, and then  
implement it quickly.

At BusyTonight, we integrated a feature-detector into the crawler  
(date patterns, in our case), then added a site-whitelist filter and  
a crawl-depth tracker so we could crawl into calendar CGIs but not  
have an infinite crawl.
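
The detector half of that can be as simple as a handful of regexes over
the parsed page text. A minimal standalone sketch - the patterns and
names are illustrative, not our actual detector:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Minimal date-pattern detector: flags pages that look like they carry event dates. */
public class DatePatternDetector {
    // e.g. "10/14/2007", "Oct 14, 2007", "October 14" -- deliberately loose
    private static final Pattern[] DATE_PATTERNS = {
        Pattern.compile("\\b\\d{1,2}/\\d{1,2}/\\d{2,4}\\b"),
        Pattern.compile("\\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\\.?\\s+\\d{1,2}(,\\s*\\d{4})?\\b",
                        Pattern.CASE_INSENSITIVE)
    };

    /** Returns true if the page text contains at least minHits date-like strings. */
    public boolean looksLikeListingPage(String pageText, int minHits) {
        int hits = 0;
        for (Pattern p : DATE_PATTERNS) {
            Matcher m = p.matcher(pageText);
            while (hits < minHits && m.find()) {
                hits++;
            }
        }
        return hits >= minHits;
    }
}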

These are the kinds of things that I think any content-focused search
startup would have to add to Nutch themselves. My particular
implementation wouldn't be much help to the average startup, but just
having some hooks available to plug this stuff in would make a world
of difference. (We had to patch Nutch 0.7 a lot more than I had hoped.)

Since I started with Nutch 0.7, several things have been added that  
would have made my life easier, such as:
* crawl metadata (thank you Stefan)
* the scoring API (thank you Andrzej)
* the concept of multiple Parses per HTML page, introduced with the
RSS parsing plugin (thank you Dennis, I think?).

But there are still so many things missing, like a simple place to  
hang a feature-detector, or a way to influence the direction of the  
crawl based on features found. Or a depth-filter so you can crawl  
into listings w/o infinite crawls. Etc.

Ultimately, I believe that what is good for startups is good for the  
Nutch community overall. There isn't as much activity on this list as  
I recall from the Nutch 0.7/0.8 era, and I think that's because some  
people participating then were trying to build startups (like Stefan  
and myself) and needed to get things done on a deadline.

If you bet on Nutch as your foundation but cannot build a  
differentiated product quickly, you'll be screwed, and you will drop  
out of the Nutch community and move on. Nutch will lose a possibly
valuable contributor.

What is good for search startups is also good for the Nutch  
community. And what is good for search startups, IMO, is a flexible  
crawling toolbox. +1 to any patch that helps turn the Nutch crawler  
into a more flexible crawling toolkit.

Sincerely,
--Matt Kangas


On Oct 15, 2007, at 6:00 AM, Andrzej Bialecki wrote:

> Berlin Brown wrote:
>> Yea, you are right.  You have to have a constrained set of domains to
>> search and, to be honest, that works pretty well.  The only thing is, I
>> still get a lot of junk links.  I would say that 30% are valid or
>> interesting links while the rest are kind of worthless.  I guess it is
>> a matter of studying spam filters and removing that, but I have been
>> kind of lazy in doing so.
>> http://botspiritcompany.com/botlist/spring/search/global_search.html?query=bush&querymode=enabled
>> I have already built a site that I am describing, based on a short
>> list of popular domains, using the very basic aspects of Nutch.  You
>> can search above and see what you think.  I had about 100k links with
>> my last crawl.
>
> There are quite a few companies (that I know of) that maintain
> indexes between 50-300 million pages. All of them implemented their own
> strategy (specific to their needs) to solve this issue.
>
> It's true that if you start crawling without any constraints, very
> quickly (~20-30 full cycles) your crawldb will contain 90% junk,
> porn and spam. Some strategies to fight this are based on content
> analysis (detection of porn-related content), URL analysis
> (presence of certain patterns in URLs), and link analysis (analysis
> of link neighborhoods). There are a lot of research papers on these
> subjects, and many strategies can be implemented as Nutch plugins.
>
>
> -- 
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>

--
Matt Kangas / kangas@gmail.com



Re: Possible public applications with nutch and hadoop

Posted by Andrzej Bialecki <ab...@getopt.org>.
Berlin Brown wrote:
> Yea, you are right.  You have to have a constrained set of domains to
> search and, to be honest, that works pretty well.  The only thing is, I
> still get a lot of junk links.  I would say that 30% are valid or
> interesting links while the rest are kind of worthless.  I guess it is
> a matter of studying spam filters and removing that, but I have been
> kind of lazy in doing so.
> 
> http://botspiritcompany.com/botlist/spring/search/global_search.html?query=bush&querymode=enabled
> 
> I have already built a site that I am describing, based on a short
> list of popular domains, using the very basic aspects of Nutch.  You
> can search above and see what you think.  I had about 100k links with
> my last crawl.

There are quite a few companies (that I know of) that maintain indexes
between 50-300 million pages. All of them implemented their own strategy
(specific to their needs) to solve this issue.

It's true that if you start crawling without any constraints, very
quickly (~20-30 full cycles) your crawldb will contain 90% junk, porn
and spam. Some strategies to fight this are based on content analysis
(detection of porn-related content), URL analysis (presence of certain
patterns in URLs), and link analysis (analysis of link neighborhoods).
There are a lot of research papers on these subjects, and many strategies
can be implemented as Nutch plugins.
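
As a trivial taste of the URL-analysis flavour: a filter written in the
style of Nutch's URL-filter plugins (return the URL to keep it, null to
drop it). The patterns are illustrative only:

import java.util.regex.Pattern;

/**
 * Junk-URL heuristic in the spirit of a Nutch URL-filter plugin:
 * return the URL unchanged to keep it, or null to drop it from the crawl.
 */
public class JunkUrlFilter {
    private static final Pattern[] JUNK_PATTERNS = {
        Pattern.compile("doubleclick\\.net|adserver|/ads?/", Pattern.CASE_INSENSITIVE),
        Pattern.compile("[?&](sid|sessionid|phpsessid)=", Pattern.CASE_INSENSITIVE),
        Pattern.compile("\\b(viagra|casino|porn)\\b", Pattern.CASE_INSENSITIVE)
    };

    public String filter(String url) {
        for (Pattern p : JUNK_PATTERNS) {
            if (p.matcher(url).find()) {
                return null;   // reject: looks like ad/session/spam junk
            }
        }
        return url;            // keep
    }
}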


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Possible public applications with nutch and hadoop

Posted by Berlin Brown <be...@gmail.com>.
Yea, you are right.  You have to have a constrained set of domains to
search and, to be honest, that works pretty well.  The only thing is, I
still get a lot of junk links.  I would say that 30% are valid or
interesting links while the rest are kind of worthless.  I guess it is
a matter of studying spam filters and removing that, but I have been
kind of lazy in doing so.

http://botspiritcompany.com/botlist/spring/search/global_search.html?query=bush&querymode=enabled

I have already built a site that I am describing, based on a short
list of popular domains, using the very basic aspects of Nutch.  You
can search above and see what you think.  I had about 100k links with
my last crawl.

On 10/13/07, Pike <pi...@kw.nl> wrote:
> Hi
>
> > My question: have you built a general site to crawl the internet, and
> > how did you find links that people would be interested in, as opposed
> > to capturing a lot of the junk out there?
>
> interesting question. are you planning to build a new google?
> if you are planning to crawl without any limit on, for example, a few
> domains, your indexes will go wild very quickly :-)
>
> we are using nutch now with an extensive list of
> 'interesting domains' - this list is an editorial effort.
> search results are limited to those domains.
> http://www.labforculture.org/opensearch/custom
>
> another application would be to use nutch to crawl
> certain pages, like 'interesting' search results from
> other sites, with a limited depth. this would yield
> 'interesting' indexes.
>
> yet another application would be to crawl 'interesting'
> rss feeds with a depth of 1. I haven't got that working
> yet (see the parse-rss discussion these days).
>
> nevertheless, I am interested in the question:
> anyone else having examples of "possible public
> applications with nutch" ?
>
> $2c,
> *pike
>
>
>
>
>
>


-- 
Berlin Brown
http://botspiritcompany.com/botlist/spring/help/about.html
newspirit technologies

about rdf crawling

Posted by baixi2 <ba...@163.com>.
Hello, I am a little confused about how to use Nutch to crawl and index RDF files. Should I use the URL filter? Thanks. Best, Xi

Re: Possible public applications with nutch and hadoop

Posted by Pike <pi...@kw.nl>.
Hi

> My question: have you built a general site to crawl the internet, and
> how did you find links that people would be interested in, as opposed
> to capturing a lot of the junk out there?

interesting question. are you planning to build a new google?
if you are planning to crawl without any limit on, for example, a few
domains, your indexes will go wild very quickly :-)

we are using nutch now with an extensive list of
'interesting domains' - this list is an editorial effort.
search results are limited to those domains.
http://www.labforculture.org/opensearch/custom
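
one way to enforce such a whitelist at crawl time is a url filter that
only lets the listed domains through - a minimal standalone sketch
(illustrative only, not necessarily how we actually do it):

import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Whitelist filter: keep only URLs whose host ends with an allowed domain. */
public class DomainWhitelistFilter {
    private final Set<String> allowedDomains;

    public DomainWhitelistFilter(String... domains) {
        this.allowedDomains = new HashSet<String>(Arrays.asList(domains));
    }

    /** Returns the URL to keep it, or null to drop it (Nutch URL-filter convention). */
    public String filter(String urlString) {
        try {
            String host = new URL(urlString).getHost().toLowerCase();
            for (String domain : allowedDomains) {
                if (host.equals(domain) || host.endsWith("." + domain)) {
                    return urlString;
                }
            }
            return null;
        } catch (MalformedURLException e) {
            return null;   // unparseable URLs are dropped
        }
    }
}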

another application would be to use nutch to crawl
certain pages, like 'interesting' search results from
other sites, with a limited depth. this would yield
'interesting' indexes.

yet another application would be to crawl 'interesting'
rss feeds with a depth of 1. I haven't got that working
yet (see the parse-rss discussion these days).

nevertheless, I am interested in the question:
anyone else having examples of "possible public
applications with nutch" ?

$2c,
*pike