Posted to user@nutch.apache.org by Karol Rybak <ka...@gmail.com> on 2007/11/27 00:02:12 UTC

Generate times

Hello, I have a crawldb consisting of about 57 million URLs, and I'm
generating segments of 1 million URLs each.

Generate takes 7 hours 39 minutes to complete on my cluster.

I have 4 machines in my cluster; each is a Pentium 4 HT 3.0 GHz with 1 GB of
RAM and a 150 GB IDE drive.

Merging URLs into the crawldb took 4 hours 34 minutes last time.

What I wanted to ask is whether those times are normal for that kind of
configuration.

Is the generate phase really so processor intensive? When I check, I see two
threads on each of the nodes, each taking up 100% of an HT pseudo-core.

Also, I have a problem with partitioning: 3 times out of 4 it fails because
it cannot open the temporary files created by the generate job.

I'm using the trunk version of Nutch with Hadoop 0.15.

I need to find a way to speed up crawldb processing.

I want to create an index of about 30 million pages that can be updated
every month.

I do not need the scoring-opic plugin, but I couldn't disable it. I do not
need it because I'm using the index to search for plagiarism in our
university students' papers.

I was thinking about moving the whole crawldb into some database
(MySQL/PostgreSQL), generating the URLs to crawl from there, and then
importing them back into Nutch using a clean crawldb and text files.
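
The import itself should be straightforward with the inject command, which
builds crawldb entries from plain text files of URLs (one per line); a rough
sketch, with placeholder paths and an assumed generated_urls/ directory
holding the lists produced from the external database:

  bin/nutch inject crawl/crawldb generated_urls/

Whether the round trip through MySQL/PostgreSQL would actually be faster
still needs to be measured.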

Please let me know if you have any suggestions on how to speed up the
crawling process.

-- 
Karol Rybak
Programista / Programmer
Sekcja aplikacji / Applications section
Wyższa Szkoła Informatyki i Zarządzania / University of Information Technology
and Management
+48(17)8661277

Re: Generate times

Posted by Espen Amble Kolstad <es...@trank.no>.
Hi,

Do you run generate with filtering? Depending on your filter settings, this
can make generate a lot slower.
If you do not need normalization (e.g. the URLs are already normalized), then
it really helps to add this to nutch-site.xml:
  <property>
    <name>urlnormalizer.scope.partition</name>
    <value>org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer</value>
  </property>

  <property>
    <name>urlnormalizer.scope.generate_host_count</name>
    <value>org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer</value>
  </property>

I've just run the first part of generate on a 43 million URL crawldb in 13.5
minutes on my 3-node cluster (3 tasks per node). Node hardware: 4 GB RAM,
~700 GB disk, 2.8 GHz quad-core.
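
For reference, generating a 1 million URL segment as in the original post
would be invoked roughly as below. This is only a sketch: the paths and the
-topN / -numFetchers values are placeholders, and the exact options depend on
the Generator version you run.

  bin/nutch generate crawl/crawldb crawl/segments -topN 1000000 -numFetchers 4

-topN caps the segment at the top-scoring N URLs due for fetch, and
-numFetchers controls how many fetch lists (partitions) are produced.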

Hope this helps!

Espen



Re: How to read crawldb

Posted by Cool Coder <te...@yahoo.com>.
Hi Jian,
I saw your site and it really looks nice. Thanks for the information; I am still looking into how to find the raw data for each link in Nutch. I am currently using JSpider, and I think it is also a good crawler. In case you decide to make your crawler open source, please let me know; I would love to use it.

  - BR


       

Re: How to read crawldb

Posted by jian chen <ch...@gmail.com>.
It is a bit convoluted at best.

I found out that the links and their metadata are stored in the crawldb
directory, and the actual raw HTTP content of the links is stored in the
individual segments.

The crawldb and the segments are MapFiles or SequenceFiles, I think. So you
could use a MapFile.Reader or SequenceFile.Reader to read them and dump them
out in whatever format you like.
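
As a rough, untested sketch of that approach, the snippet below walks one
part of the crawldb with Hadoop's SequenceFile.Reader. The on-disk layout
(crawl/crawldb/current/part-00000/data) and the Text / CrawlDatum key and
value classes are assumptions based on Nutch trunk of that era; check what
your installation actually writes before relying on this.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.util.NutchConfiguration;

  public class CrawlDbDump {
    public static void main(String[] args) throws Exception {
      Configuration conf = NutchConfiguration.create();
      FileSystem fs = FileSystem.get(conf);
      // Each reducer writes one MapFile part; in practice loop over all
      // part-NNNNN directories. The "data" file inside is a SequenceFile.
      Path data = new Path("crawl/crawldb/current/part-00000/data");
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
      Text url = new Text();
      CrawlDatum datum = new CrawlDatum();
      while (reader.next(url, datum)) {
        // Dump in whatever format you like; here just the URL and the CrawlDatum.
        System.out.println(url + "\t" + datum);
      }
      reader.close();
    }
  }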

However, so far I haven't figured out how to associate the crawldb links
with their contents. For example, while looping through the crawldb links, I
want to find each link's raw HTTP content, but I don't know how to do that
yet.

That said, it is possible to dump both into a MySQL database, keyed on the
link/URL. But that means you need to write to the MySQL database twice for
each URL, which is not good for performance.

That's why I am sticking with my own crawler for now; it works very well
for me.

Take a look at www.coolposting.com, which searches across multiple forums.
The crawler behind it is one I wrote based on the Nutch architecture,
storing each URL's content into MySQL.

If I open source my crawler, I will need to add some licensing terms to the
code before releasing it on www.jiansnet.com. Anyway, I will make the
crawler available soon, one way or the other (open source, or closed source
but free to download, etc.).

Cheers,

Jian

On Nov 27, 2007 2:20 PM, Cool Coder <te...@yahoo.com> wrote:

> Hello,
>           I am just wondering how can I read crawldb and get content of
> each stored URL. I am not sure whether this can be possible or not.
>
>  - BR
>
>

Re: How to read crawldb

Posted by Cool Coder <te...@yahoo.com>.
Thanks for the information. I tried ./bin/nutch readlinkdb, but I was not able to get all the links. I think I am missing something about the proper usage of the readlinkdb command.
I tried:

  $ ./bin/nutch readlinkdb ./nutch-index/linkdb/
Usage: LinkDbReader <linkdb> {-dump <out_dir> | -url <url>)
        -dump <out_dir> dump whole link db to a text file in <out_dir>
        -url <url>      print information about <url> to System.out

Let me mention that nutch-index is the location of the Nutch index, and it has the following directories:
  --crawldb
  --index
  --indexes
  --linkdb
  --segments

Can you tell me what usage pattern I should use to view all the links?

  - BR
   
  



       

Re: How to read crawldb

Posted by Andrzej Bialecki <ab...@getopt.org>.
Cool Coder wrote:
> Hello, I am just wondering how can I read crawldb and get content of
> each stored URL. I am not sure whether this can be possible or not.

In Nutch 0.8 and later, the page information and the link information are
stored separately, in the CrawlDb and the LinkDb. You need to have the linkdb
(see the bin/nutch invertlinks command), and then you can use the
LinkDbReader class to retrieve this information. From the command line this
is bin/nutch readlinkdb.
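
As a concrete sketch (the paths are illustrative, adjust them to your
layout): first build the linkdb from your segments, then dump it to text.
There is also a readdb counterpart for dumping the crawldb itself.

  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump
  bin/nutch readdb crawl/crawldb -dump crawldb_dump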


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


How to read crawldb

Posted by Cool Coder <te...@yahoo.com>.
Hello,
I am just wondering how I can read the crawldb and get the content of each stored URL. I am not sure whether this is possible or not.
   
  - BR

       

Re: Generate times

Posted by misc <mi...@robotgenius.net>.
Hi-

    I don't have as large a list of URLs as you, but it is in the
millions. I also see really long times for generate, about 3 hours. This is
definitely the longest part of my wait.

    I have posted here before trying to figure this out. The thing is, I
can do a Unix sort on a comparable list much more quickly, so I suspect
something is being done inefficiently. I don't fully know what is happening
inside Nutch, though, so I am not sure.

    I suspect I could cut the time spent waiting for generate by generating
multiple segments at once, but I haven't spent much time getting this
working.

                        see you
                            -Jim

