Posted to user@nutch.apache.org by Ali Nazemian <al...@gmail.com> on 2014/08/06 10:24:33 UTC

Web forum crawling using nutch

Dear all,

I have used Nutch to crawl some news websites and Solr to index the
crawled pages. I was wondering how I can use Nutch to crawl web forums.
Crawling web forums raises some problems that need to be considered
(ones that are not a concern in the case of news websites). Here are
some of them:
- There should be some technique for finding out how many pages each
thread/post has and how they can be reached.
- Some forums use JavaScript for paging, and JavaScript is a client-side
programming language. Somehow it would have to be parsed by Nutch.
- Nutch's depth setting for crawling becomes useless, since each page is
counted as a new depth. Unlimited depth is not an option either, because
it could lead to infinite crawling.
- More...
I would really appreciate it if somebody could guide me through this subject.
Best regards.

-- 
A.Nazemian
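
The first problem in the list above (finding how many pages a thread has and how to reach them) can be sketched outside Nutch. This is only an illustration, not a Nutch plugin: it assumes a forum whose thread pages share one path and differ in a numeric `page` query parameter. The names `LinkCollector` and `thread_pages`, the parameter name and the example.org URLs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, parse_qs


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def thread_pages(html, base_url, page_param="page"):
    """Return {page_number: absolute_url} for links to other pages of the
    thread at base_url. ASSUMPTION: pages differ only in `page_param`."""
    collector = LinkCollector()
    collector.feed(html)
    base = urlparse(base_url)
    base_query = parse_qs(base.query)
    base_query.pop(page_param, None)
    pages = {}
    for href in collector.hrefs:
        url = urljoin(base_url, href)
        parsed = urlparse(url)
        query = parse_qs(parsed.query)
        values = query.pop(page_param, [])
        # Skip links to another path or another thread id.
        if parsed.path != base.path or query != base_query:
            continue
        if values and values[0].isdigit():
            pages[int(values[0])] = url
    return pages


# Hypothetical pager markup from page 1 of thread 7.
sample = """
<a href="/forum/thread.php?id=7&amp;page=2">2</a>
<a href="/forum/thread.php?id=7&amp;page=3">3</a>
<a href="/forum/thread.php?id=7&amp;page=12">Last</a>
<a href="/forum/thread.php?id=8&amp;page=2">Another thread</a>
<a href="/forum/index.php">Forum index</a>
"""
pages = thread_pages(sample, "http://example.org/forum/thread.php?id=7")
last_page = max(pages)  # highest page number linked from this page
```

If the pager links to a "Last" page, the highest number found gives the page count, and the intermediate URLs can be generated rather than discovered one hop at a time. Forums that build these links in JavaScript would not be covered by this approach.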

Re: Web forum crawling using nutch

Posted by Jorge Luis Betancourt Gonzalez <jl...@uci.cu>.
I don't think out-of-the-box Nutch will give you all the answers, but you should study some of the extension points Nutch has. As far as I can see, you should be able to write custom plugins that achieve your goals, but some programming is required.

Greetings,
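
As an illustration of the kind of logic such custom code might contain for the depth problem raised in the question, here is a minimal sketch in plain Python. It does not use any Nutch API: the `crawl` function, the in-memory `links` graph and the `same_thread` rule are all hypothetical. The idea shown is that a hop to another page of the same thread keeps the current depth, while any other hop adds one.

```python
from collections import deque


def crawl(seed, links, is_pagination, max_depth):
    """Breadth-first traversal over an in-memory link graph.

    links:         dict mapping a URL to the URLs found on that page
    is_pagination: callable(src, dst) -> True if dst is another page
                   of the same thread as src
    Returns {url: depth} for every URL within max_depth.
    """
    depth_of = {seed: 0}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            # Pagination hops are free; every other hop costs one level.
            step = 0 if is_pagination(url, target) else 1
            depth = depth_of[url] + step
            if depth > max_depth:
                continue
            # Visit new URLs, or revisit if a shallower route was found.
            if target not in depth_of or depth < depth_of[target]:
                depth_of[target] = depth
                queue.append(target)
    return depth_of


def same_thread(src, dst):
    """ASSUMPTION: thread pages look like '<thread>?page=<n>'."""
    return ("?page=" in src and "?page=" in dst
            and src.split("?")[0] == dst.split("?")[0])


# Hypothetical link graph of a tiny forum.
links = {
    "index": ["t1?page=1"],
    "t1?page=1": ["t1?page=2", "profile/alice"],
    "t1?page=2": ["t1?page=3"],
    "t1?page=3": ["profile/bob"],
    "profile/alice": ["profile/alice/posts"],
}
reached = crawl("index", links, same_thread, max_depth=1)
```

With `max_depth=1` all three pages of the thread are reached, while the profile pages (depth 2) are not. A real crawl would probably also need a cap on pages per thread, since free pagination hops alone do not protect against an endless pager.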


Re: Web forum crawling using nutch

Posted by Ali Nazemian <al...@gmail.com>.
Dear Patrick,
Sure, I was thinking of Selenium. It seems there is a Nutch plugin for
this purpose that works with Selenium, but I have not tested it yet.
Regards.


On Mon, Sep 1, 2014 at 6:51 PM, Patrick Kirsch <pk...@zscho.de> wrote:

> On 06.08.2014 at 10:24, Ali Nazemian wrote:
> > Dear all,
> > - Some forums use JavaScript for paging, and JavaScript is a client-side
> > programming language. Somehow it would have to be parsed by Nutch.
> Parsing plain JavaScript files (plain links) is possible.
> The situation is difficult if links are generated (e.g. by click
> events) through a JavaScript framework like jQuery.
> In this case Nutch needs to behave more like a browser and needs the help
> of Selenium, PhantomJS, XULRunner, etc.
> > - Nutch's depth setting for crawling becomes useless, since each page is
> > counted as a new depth. Unlimited depth is not an option either, because
> > it could lead to infinite crawling.
> > - More...
> > I would really appreciate it if somebody could guide me through this subject.
> > Best regards.
> >
> Regards,
>  Patrick
>



-- 
A.Nazemian

Re: Web forum crawling using nutch

Posted by Patrick Kirsch <pk...@zscho.de>.
On 06.08.2014 at 10:24, Ali Nazemian wrote:
> Dear all,
> - Some forums use JavaScript for paging, and JavaScript is a client-side
> programming language. Somehow it would have to be parsed by Nutch.
Parsing plain JavaScript files (plain links) is possible.
The situation is difficult if links are generated (e.g. by click
events) through a JavaScript framework like jQuery.
In this case Nutch needs to behave more like a browser and needs the help
of Selenium, PhantomJS, XULRunner, etc.
> - Nutch's depth setting for crawling becomes useless, since each page is
> counted as a new depth. Unlimited depth is not an option either, because
> it could lead to infinite crawling.
> - More...
> I would really appreciate it if somebody could guide me through this subject.
> Best regards.
> 
Regards,
 Patrick
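
The difference between plain links in JavaScript and generated links can be shown with a small sketch. It is not Nutch's own JavaScript parser, only an assumed regex-based approach: the pattern `URL_LITERAL`, the function `links_in_script` and the sample script are hypothetical. It pulls string literals that look like URLs out of script source.

```python
import re
from urllib.parse import urljoin

# String literals that look like absolute URLs or site-relative paths.
URL_LITERAL = re.compile(r"""["']((?:https?://|/)[^"'\s<>]+)["']""")


def links_in_script(script, base_url):
    """Return absolute URLs found as string literals in JavaScript source,
    in order of first appearance."""
    seen = []
    for match in URL_LITERAL.finditer(script):
        url = urljoin(base_url, match.group(1))
        if url not in seen:
            seen.append(url)
    return seen


# Hypothetical forum script: one plain link, one link built on click.
script = """
var next = "/forum/thread.php?id=7&page=2";
$('#more').click(function () {
    load('/forum/thread.php?id=7&page=' + (current + 1));
});
"""
found = links_in_script(script, "http://example.org/forum/")
```

The plain link is recovered in full. The link built inside the click handler comes back only as a prefix ending in `page=`, because the page number is computed at run time. That is the case where the script has to be executed by something browser-like.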