Posted to dev@nutch.apache.org by Gonzalo Aguilar Delgado <ga...@aguilardelgado.com> on 2010/08/19 19:25:51 UTC

To nutch or not to nutch?

Hi there!

I'm building a crawler that will "understand" certain kinds of pages. I want
to be able to process a restricted group of websites.

For example, I want to search for reviews of my company's products on
some blogs I know well.

I don't know if Nutch can help me here.

What I'm currently doing is writing a crawler that fetches pages,
transforms them with XSLT using a template designed for each site,
and then parses the content.

The question is: can this be done well with Nutch, or would it add a
lot of overhead?

What plugins would need to be developed?

Thank you!

Re: To nutch or not to nutch?

Posted by Gonzalo Aguilar Delgado <ga...@aguilardelgado.com>.
Hi Alex, 

I'll answer inline so the comments are easy to follow...


On Thu, 2010-08-19 at 19:21 +0100, Alex McLintock wrote:
> Hello Gonzalo,
>
> Did you mean to post to the dev list?

Yes! Users normally don't know what to implement when features are
missing...


> Further comments inline
> 
> On 19 August 2010 18:25, Gonzalo Aguilar Delgado
> <ga...@aguilardelgado.com> wrote:
> > Hi there!
> >
> > I'm building a crawler that will "understand" certain kinds of pages. I want
> > to be able to process a restricted group of websites.
> 
> Nutch has the capability to configure a URL filter which can limit the
> hosts to a specific set of regular expressions.
> 
That's standard. It won't be much of a problem.


> > For example, I want to search for reviews of my company's products on
> > some blogs I know well.
>
> That sounds like a standard data mining requirement.

Exactly!


> > I don't know if Nutch can help me here.
>
> Well, it can, but not out of the box. It depends on what sort of
> automation you want.
> Nutch can crawl all those sites and build up a Solr/Lucene index for
> you to search through, but I am guessing that won't help you very much.

No. What I need is to extract some fields from the pages... Then maybe
Solr/Lucene can help, but not with the whole text, since blogs, for
example, tend to include a lot of garbage...


> > What I'm currently doing is writing a crawler that fetches pages,
> > transforms them with XSLT using a template designed for each site,
>
> Eh? You are using XSLT to transform random web pages? Doesn't the XSLT
> fall over whenever it finds non-well-formed XML?

What I really do is normalize the input with TagSoup and then process
each website with its custom template.
This way everything works...
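
The second half of that pipeline (one XSLT template per site, applied to already-normalized markup) can be sketched with the XSLT support that ships with the JDK. This is only an illustration under stated assumptions: the class name `SiteTemplateExtractor`, the sample stylesheet, and the host `blog.example.com` are hypothetical, and the TagSoup normalization step is assumed to have run upstream, so the input here is already well-formed. The sample input is also un-namespaced for brevity; if the normalized markup carries the XHTML namespace, the XPath expressions would need a matching prefix.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper: keeps one XSLT template per site and applies it to a page. */
final class SiteTemplateExtractor {

    /** Illustrative template: emits "title|review text" and ignores everything else. */
    static final String SAMPLE_TEMPLATE =
        "<xsl:stylesheet version=\"1.0\" "
        + "xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"/\">"
        + "<xsl:value-of select=\"//h1[@class='title']\"/>"
        + "<xsl:text>|</xsl:text>"
        + "<xsl:value-of select=\"//div[@class='review']\"/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // host name -> XSLT source written by hand for that site
    private final Map<String, String> templatesByHost = new HashMap<>();

    void register(String host, String xslt) {
        templatesByHost.put(host, xslt);
    }

    /** Applies the template registered for the host to well-formed markup. */
    String extract(String host, String wellFormedMarkup) {
        String xslt = templatesByHost.get(host);
        if (xslt == null) {
            throw new IllegalArgumentException("no template for host: " + host);
        }
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(wellFormedMarkup)),
                new StreamResult(out));
            return out.toString().trim();
        } catch (TransformerException e) {
            // Wrap the checked exception so callers are not forced to handle it.
            throw new IllegalStateException("transform failed for host: " + host, e);
        }
    }

    public static void main(String[] args) {
        SiteTemplateExtractor extractor = new SiteTemplateExtractor();
        extractor.register("blog.example.com", SAMPLE_TEMPLATE);
        String page = "<html><body><h1 class=\"title\">Great widget</h1>"
            + "<div class=\"sidebar\">ads</div>"
            + "<div class=\"review\">Works well.</div></body></html>";
        System.out.println(extractor.extract("blog.example.com", page));
    }
}
```

Because each template selects only the fields it names, sidebar and advertising markup never reaches the output, which is the point of extracting fields rather than indexing the whole page text.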

The problem I have is that MySpace, for example, is too big for my
crawler. I have never actually tried to parse it, but it would surely
take a long time... So I will need to scale in the future.



> > and then parses the content.
>
> Parses it for what? What do you do with it?

What I want to build is a kind of buzz engine. It will tell me how much
buzz a product is gaining on the web. It must parse blogs, ranking pages,
osCommerce pages, RSS, etc...



> > The question is: can this be done well with Nutch, or would it add a
> > lot of overhead?
>
> I don't think this is *easy* with Nutch. The overhead may be worth it
> if you want to do the web crawling on a small cluster rather than one
> machine.

Maybe in the future, but not now... So I think it's better to build my
own custom one...


> There may be other better data mining tools, but I'm not sure I can
> recommend anything right now.
> 
This is very specific, so I'm not sure anything will help me.


> > What plugins would need to be developed?
> 
> Well that depends on what you want. Presumably you want something that
> identifies the web page as a review of your product so that it can be
> highlighted in the index. How do you want to do that?
> 
Puff! I'm lost on this... Can I send you a private email to explain
what I want to do and how it will work?


> 
> > Thank you!
> 
> I've been thinking about this for some time - but to search for book
> reviews instead of product reviews. I can't say that I have a working
> system, but maybe others do.
> 
I can already parse some websites... I'm trying to do it better:
multi-site and social.

Let me contact you so I can explain it better...

Thanks, Alex!


> Alex
> 


Re: To nutch or not to nutch?

Posted by Alex McLintock <al...@gmail.com>.
Hello Gonzalo,

Did you mean to post to the dev list?

Further comments inline

On 19 August 2010 18:25, Gonzalo Aguilar Delgado
<ga...@aguilardelgado.com> wrote:
> Hi there!
>
> I'm building a crawler that will "understand" certain kinds of pages. I want
> to be able to process a restricted group of websites.

Nutch has the capability to configure a URL filter which can limit the
hosts to a specific set of regular expressions.
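
A minimal sketch of that kind of host restriction, written with plain `java.util.regex` rather than the Nutch plugin API. In Nutch itself such rules normally live in the URL filter configuration (for example `conf/regex-urlfilter.txt`); the class name `HostAllowList` and the example hosts below are hypothetical and only illustrate the idea of accepting a URL when its host matches one of a fixed set of regular expressions.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Nutch: accepts only URLs on allow-listed hosts. */
final class HostAllowList {

    private final List<Pattern> hostPatterns = new ArrayList<>();

    HostAllowList(List<String> hostRegexes) {
        // Compile each expression once; it must match the entire host name.
        for (String regex : hostRegexes) {
            hostPatterns.add(Pattern.compile(regex));
        }
    }

    /** Returns true when the URL parses and its host matches any allowed pattern. */
    boolean accepts(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // malformed URL
        }
        if (host == null) {
            return false; // e.g. a URI without an authority part
        }
        host = host.toLowerCase(Locale.ROOT);
        for (Pattern pattern : hostPatterns) {
            if (pattern.matcher(host).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        HostAllowList filter = new HostAllowList(Arrays.asList(
            "(.*\\.)?blog\\.example\\.com",   // the blog and any of its subdomains
            "reviews\\.example\\.org"));
        System.out.println(filter.accepts("http://blog.example.com/post/1"));
        System.out.println(filter.accepts("http://other.example.net/"));
    }
}
```

Matching against the host alone, instead of the whole URL string, avoids accidentally accepting a URL that merely contains an allowed name somewhere in its path or query.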

> For example, I want to search for reviews of my company's products on
> some blogs I know well.

That sounds like a standard data mining requirement.

> I don't know if Nutch can help me here.

Well, it can, but not out of the box. It depends on what sort of
automation you want.
Nutch can crawl all those sites and build up a Solr/Lucene index for
you to search through, but I am guessing that won't help you very much.

> What I'm currently doing is writing a crawler that fetches pages,
> transforms them with XSLT using a template designed for each site,

Eh? You are using XSLT to transform random web pages? Doesn't the XSLT
fall over whenever it finds non-well-formed XML?

> and then parses the content.

Parses it for what? What do you do with it?

> The question is: can this be done well with Nutch, or would it add a
> lot of overhead?

I don't think this is *easy* with Nutch. The overhead may be worth it
if you want to do the web crawling on a small cluster rather than one
machine.

There may be other better data mining tools, but I'm not sure I can
recommend anything right now.

> What plugins would need to be developed?

Well that depends on what you want. Presumably you want something that
identifies the web page as a review of your product so that it can be
highlighted in the index. How do you want to do that?


> Thank you!

I've been thinking about this for some time - but to search for book
reviews instead of product reviews. I can't say that I have a working
system, but maybe others do.

Alex