
Posted to user@nutch.apache.org by "S.L" <si...@gmail.com> on 2013/07/08 22:10:45 UTC

Intercept the current URL that Nutch is about to crawl in Nutch 1.7

Hello All,

I am a new Nutch user. I need to capture every URL that Nutch crawls
within a session and insert it into a MySQL database along with some other
metadata. I am using Nutch 1.7 and have set up the project in Eclipse. Can
anyone please give me guidance on which class or classes I would need to
modify to get the URL in the current session and insert it into the
database?

Any help would be greatly appreciated.

Thank you in advance.

Re: Intercept the current URL that Nutch is about to crawl in Nutch 1.7

Posted by "S.L" <si...@gmail.com>.

Markus,

I need to modify the data, i.e. the content of the pages, before populating
the Solr index. If I use the solrindex command to do that, which classes
would I need to change?

Alternatively, I was thinking of the approach from my original question:
intercept the URL, download the content, extract the data, and update Solr
from there.

Does one option have any advantage over the other?

Thanks.



Re: Intercept the current URL that Nutch is about to crawl in Nutch 1.7

Posted by "S.L" <si...@gmail.com>.

On second thought, I am also considering Solr instead of the MySQL DB. You
mentioned that I need to look into how to talk to DBs in Hadoop land; what
if I have to talk to Solr from Nutch?



Re: Intercept the current URL that Nutch is about to crawl in Nutch 1.7

Posted by "S.L" <si...@gmail.com>.

Awesome! Thank you for the detailed reply.



RE: Intercept the current URL that Nutch is about to crawl in Nutch 1.7

Posted by Markus Jelsma <ma...@openindex.io>.

Processing the logs would be easy, but since you need some metadata you probably need to hack into the Fetcher.java code. The fetcher has several inner classes, but you'd need the FetcherThread class, which is responsible for the actual download and anything else that needs to be done there. If you also need metadata that requires parsing the file, you need to configure the fetcher to do parsing as well.

http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup

The record is fetched around line 683 of that file. The output() method writes the record to the segment and does the optional parsing of the record. Parsing is done around line 960.

In output() you could communicate with your DB. It's not the best place, but it is easy to test. FetcherOutputFormat is more suitable for writing data. Also read about how to talk to DBs in Hadoop land.
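
A minimal sketch of the database side, in plain Java with JDBC only. Nothing in it is Nutch API: FetchRecord, FetchRecordSink, JdbcFetchRecordSink, FetchRecordBuffer, the fetched_urls table and its columns are all hypothetical names chosen for illustration. The assumption is that a hook inside FetcherThread (for example in output()) would call record() once per fetched URL, and that the MySQL JDBC driver is on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;

/** One row per fetched URL. The fields are illustrative, not Nutch types. */
final class FetchRecord {
    final String url;
    final int httpStatus;
    final long fetchTimeMillis;

    FetchRecord(String url, int httpStatus, long fetchTimeMillis) {
        this.url = url;
        this.httpStatus = httpStatus;
        this.fetchTimeMillis = fetchTimeMillis;
    }
}

/** Destination for records; keeps the fetcher hook independent of the storage. */
interface FetchRecordSink {
    void write(List<FetchRecord> batch);
}

/** JDBC-backed sink. Table and column names are assumptions. */
final class JdbcFetchRecordSink implements FetchRecordSink {
    static final String INSERT_SQL =
        "INSERT INTO fetched_urls (url, http_status, fetch_time) VALUES (?, ?, ?)";

    private final String jdbcUrl;
    private final String user;
    private final String password;

    JdbcFetchRecordSink(String jdbcUrl, String user, String password) {
        this.jdbcUrl = jdbcUrl;
        this.user = user;
        this.password = password;
    }

    @Override
    public void write(List<FetchRecord> batch) {
        // One connection per batch keeps the sketch simple; a pool would be better.
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             PreparedStatement stmt = conn.prepareStatement(INSERT_SQL)) {
            for (FetchRecord r : batch) {
                stmt.setString(1, r.url);
                stmt.setInt(2, r.httpStatus);
                stmt.setTimestamp(3, new Timestamp(r.fetchTimeMillis));
                stmt.addBatch();
            }
            stmt.executeBatch();
        } catch (SQLException e) {
            throw new IllegalStateException("could not store fetched URLs", e);
        }
    }
}

/** Collects records from several threads and hands them to the sink in batches. */
final class FetchRecordBuffer {
    private final FetchRecordSink sink;
    private final int batchSize;
    private final List<FetchRecord> pending = new ArrayList<>();

    FetchRecordBuffer(FetchRecordSink sink, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        this.sink = sink;
        this.batchSize = batchSize;
    }

    /** Would be called once per fetched URL from the fetcher hook. */
    synchronized void record(String url, int httpStatus, long fetchTimeMillis) {
        pending.add(new FetchRecord(url, httpStatus, fetchTimeMillis));
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Writes whatever is buffered; call it once more when the fetch finishes. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.write(new ArrayList<>(pending));
        pending.clear();
    }
}
```

Batching is there so that fetcher threads do not open a database connection for every URL. When the job runs distributed, each task would hold its own buffer, so a final flush() per task is needed or the last partial batch is lost.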

Cheers

 

RE: Intercept the current URL that Nutch is about to crawl in Nutch 1.7

Posted by Markus Jelsma <ma...@openindex.io>.

Well, that would be easier indeed. Nutch will fetch, parse (+ optional custom parse filters) and index (+ optional custom indexing filters) to any available indexing backend (Solr, ES). Check the Nutch tutorial on the wiki. 
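
To illustrate the idea of an indexing filter that changes page content before it reaches the backend, here is a plain-Java sketch. It deliberately does not use Nutch classes: DocFilter, FilterChain and ExampleFilters are hypothetical stand-ins, and a document is modelled as a simple map of field names to values. In a real plugin the filter would implement Nutch's org.apache.nutch.indexer.IndexingFilter interface and work on its document type instead. The sketch assumes the convention that a filter returning null drops the document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Stand-in for an indexing filter: may edit the document or return null to drop it. */
interface DocFilter {
    Map<String, String> filter(Map<String, String> doc);
}

/** Applies filters in the order they were added. */
final class FilterChain {
    private final List<DocFilter> filters = new ArrayList<>();

    FilterChain add(DocFilter filter) {
        filters.add(filter);
        return this;
    }

    /** Works on a copy of the input; returns null as soon as a filter drops the document. */
    Map<String, String> apply(Map<String, String> doc) {
        Map<String, String> current = new LinkedHashMap<>(doc);
        for (DocFilter filter : filters) {
            current = filter.filter(current);
            if (current == null) {
                return null;
            }
        }
        return current;
    }
}

final class ExampleFilters {
    private ExampleFilters() {}

    /** Trims the "content" field and collapses runs of whitespace into one space. */
    static final DocFilter NORMALIZE_WHITESPACE = doc -> {
        String content = doc.get("content");
        if (content != null) {
            doc.put("content", content.trim().replaceAll("\\s+", " "));
        }
        return doc;
    };

    /** Drops documents whose "content" field is missing or empty. */
    static final DocFilter DROP_EMPTY = doc -> {
        String content = doc.get("content");
        return (content == null || content.isEmpty()) ? null : doc;
    };
}
```

With this shape, changing the data before it is indexed means adding another filter to the chain, and the fetch, parse and index steps themselves stay untouched.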
 