Posted to user@nutch.apache.org by Br...@emc.com on 2007/07/25 14:28:44 UTC

RE : Nutch overhead to Lucene (or: why is Nutch 4 times slower than Lucene ?)

I didn't find the answer to my question yet, but I made some progress:
Using a profiler, I saw that a lot of time is spent looking for URLs in the text documents (using a regular expression). This is something Lucene doesn't do.
However, even after recompiling the text parser without this scanning, it is just as slow.
Now it seems that a lot of time is spent in the Hadoop framework (compared to, say, the indexing by Lucene and the loading of documents from the file system).

Would that mean that the overhead of the Hadoop framework is killing the performance on a single box?
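For context, the kind of regex URL scan the profiler flagged can be sketched with the JDK alone. The pattern and class below are illustrative stand-ins, not Nutch's actual outlink-extraction code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlScan {
    // Hypothetical URL pattern; Nutch's real outlink regex is far more elaborate,
    // which makes the scan correspondingly more expensive.
    private static final Pattern URL =
            Pattern.compile("https?://[\\w.-]+(?:/[\\w./-]*)?");

    // Scan the full document text and collect every URL-looking substring.
    static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String doc = "See http://lucene.apache.org/nutch and http://hadoop.apache.org for details.";
        System.out.println(extractUrls(doc));
    }
}
```

A pass like this runs over every parsed document and is linear in the text size, which is why disabling it seemed worth trying in the first place.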


-------- Original Message --------
From: Brette, Marc
Date: Mon 23/07/2007 12:08
To: nutch-user@lucene.apache.org
Subject: Nutch overhead to Lucene (or: why is Nutch 4 times slower than Lucene ?)
 
Hi all,
I performed a little test where I index the same set of documents with Nutch (0.9) and Lucene.
This is a set of documents from TREC: 134,000+ short text documents.

With Lucene, it took 1h. With Nutch using the file:/ protocol, it took 4h10.

Could anyone explain why there is such a difference, and is there some way to eliminate part of this overhead?

Regards,
--
Marc





RE: RE : Nutch overhead to Lucene (or: why is Nutch 4 times slower than Lucene ?)

Posted by Br...@emc.com.
Some more detailed figures that confirm that the overhead comes from the 'fetch' step.

I ran (on a subset of 10,000 documents from my original document set):
- inject
- generate, fetch, updatedb (twice, because I have one level of folders)
- invertlinks
- index

1st inject:  6 sec (~20 file: URLs are used as seeds)
1st generate:  9 sec
1st fetch:  14 sec
1st updatedb:  7 sec

2nd generate:  12 sec
2nd fetch:  3 min 12 sec
2nd updatedb: 8 sec

invertlinks: 6 sec
index: 1 min 40 sec

The equivalent test takes 1 min 30 sec with Lucene 2.1.0.
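A small harness makes per-step numbers like these easier to collect consistently. Here is a minimal sketch; the empty Runnables are placeholders standing in for launching the actual bin/nutch steps:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StepTimer {
    // Step name -> elapsed wall-clock milliseconds, in run order.
    private final Map<String, Long> millis = new LinkedHashMap<>();

    // Run one named step and record how long it took.
    void time(String step, Runnable r) {
        long start = System.nanoTime();
        r.run();
        millis.put(step, (System.nanoTime() - start) / 1_000_000);
    }

    Map<String, Long> results() {
        return millis;
    }

    public static void main(String[] args) {
        StepTimer t = new StepTimer();
        // Placeholders for 'bin/nutch inject', 'bin/nutch generate', etc.
        t.time("inject", () -> {});
        t.time("generate", () -> {});
        t.results().forEach((step, ms) -> System.out.println(step + ": " + ms + " ms"));
    }
}
```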



RE: RE : Nutch overhead to Lucene (or: why is Nutch 4 times slower than Lucene ?)

Posted by Br...@emc.com.
I have some figures on a smaller set of documents (6 min for Nutch vs 1 min 30 sec for Lucene):
Seeding/startup takes 40 sec
Fetching takes 3 min 30 sec
Updating linkdb (?) takes 50 sec
Indexing takes 1 min 10 sec

I'm not completely sure, because these figures are based on the trace displayed on the screen when running the crawl command.

I'll try to get more precise traces using the individual commands (inject, generate, fetch, updatedb, invertlinks, index).

It seems that most of the time is indeed spent while fetching.
However, I'm pulling the files from the local file system in both cases (using the protocol-file plugin).
As far as I understand, Nutch keeps its own local copy after fetching. Would that explain such a big difference? (And would there be a way to avoid this copy?)
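The cost of that extra copy can be approximated with plain file I/O. This is a rough stdlib-only sketch (file counts, sizes, and names are invented and have nothing to do with Nutch's segment format); it only shows that rewriting every fetched document adds a second full read/write pass:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CopyOverhead {
    // Time reading a set of files, then time reading AND rewriting them
    // (a stand-in for the extra copy a fetch step would make).
    // Returns {readOnlyMs, readAndCopyMs}.
    static long[] compare(Path dir, int files, int bytesEach) throws IOException {
        byte[] payload = new byte[bytesEach];
        for (int i = 0; i < files; i++) {
            Files.write(dir.resolve("doc" + i + ".txt"), payload);
        }

        long t0 = System.nanoTime();
        for (int i = 0; i < files; i++) {
            Files.readAllBytes(dir.resolve("doc" + i + ".txt"));
        }
        long readMs = (System.nanoTime() - t0) / 1_000_000;

        long t1 = System.nanoTime();
        for (int i = 0; i < files; i++) {
            byte[] data = Files.readAllBytes(dir.resolve("doc" + i + ".txt"));
            Files.write(dir.resolve("copy" + i + ".bin"), data); // the extra copy
        }
        long copyMs = (System.nanoTime() - t1) / 1_000_000;

        return new long[] { readMs, copyMs };
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("copytest");
        long[] r = compare(dir, 200, 2_000);
        System.out.println("read-only: " + r[0] + " ms, read+copy: " + r[1] + " ms");
    }
}
```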


Re: RE : Nutch overhead to Lucene (or: why is Nutch 4 times slower than Lucene ?)

Posted by Doğacan Güney <do...@gmail.com>.
On 7/25/07, Brette_Marc@emc.com <Br...@emc.com> wrote:
> Fair enough.
> I'm using Lucene 2.1.0 out-of-the-box using demo class IndexFile.
> I think it uses StandardAnalyzer. I don't think there is any parser apart from text.
>
> 4H10 is the total duration of the crawl. I'm using the simple 'bin/nutch crawl ...' command.
>
> crawldb is 1.7 MB, linkdb is 2.2 MB, segments is 38 MB, index and indexes are both 13 MB
> As I said there are 134000 documents in less than 100 folders.
>
> Lucene index folder is 6.7 MB


Thanks for the details. Could you also measure how much time the
individual jobs (fetch, parse, updatedb, invertlinks, index) are
taking? It is possible that you are spending most of the time during
fetching (depending on how you are actually fetching the documents:
are you pulling them from a local file system or from a web
server?).

>
> There is probably a lot of extra IO compared to Lucene. But it seems strange that managing the database of links adds so much overhead on top of the actual job of indexing and loading the files into memory.

I am fairly sure that I/O won't be a huge burden in your case
(because the sum of the sizes of all your structures is less than 50 MB). I
was thinking that perhaps you have a huge crawldb (millions of URLs)
and you are indexing a small segment (~100-200K URLs). As you
may know, keys are sorted between the map and reduce phases (which
is actually not necessary for indexing). But sorting a couple hundred
thousand keys shouldn't be a problem.
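That last point is easy to check directly: sorting a couple hundred thousand synthetic URL keys with the JDK's built-in sort completes in a fraction of a second on a typical machine. A quick sketch (the key format is invented):

```java
import java.util.Arrays;

public class SortCost {
    // Build n synthetic URL keys in reverse order, sort them,
    // and return the elapsed sort time in milliseconds.
    static long sortKeys(int n) {
        String[] keys = new String[n];
        for (int i = 0; i < n; i++) {
            keys[i] = "http://example.org/doc/" + (n - i);
        }
        long start = System.nanoTime();
        Arrays.sort(keys);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        int n = 200_000; // about the segment size discussed above
        System.out.println("sorted " + n + " keys in " + sortKeys(n) + " ms");
    }
}
```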




-- 
Doğacan Güney

RE: RE : Nutch overhead to Lucene (or: why is Nutch 4 times slower than Lucene ?)

Posted by Br...@emc.com.
Fair enough.
I'm using Lucene 2.1.0 out-of-the-box using demo class IndexFile.
I think it uses StandardAnalyzer. I don't think there is any parser apart from text.

4H10 is the total duration of the crawl. I'm using the simple 'bin/nutch crawl ...' command.

crawldb is 1.7 MB, linkdb is 2.2 MB, segments is 38 MB, index and indexes are both 13 MB
As I said, there are 134,000 documents in less than 100 folders.

Lucene index folder is 6.7 MB

There is probably a lot of extra IO compared to Lucene. But it seems strange that managing the database of links adds so much overhead on top of the actual job of indexing and loading the files into memory.


Re: RE : Nutch overhead to Lucene (or: why is Nutch 4 times slower than Lucene ?)

Posted by Doğacan Güney <do...@gmail.com>.
On 7/25/07, Brette_Marc@emc.com <Br...@emc.com> wrote:
> I didn't find the answer to my question yet, but I made some progress:
> Using a profiler, I saw that a lot of time is spent looking for URLs in the text documents (using a regular expression). This is something Lucene doesn't do.
> However, even after recompiling the text parser without this scanning, it is just as slow.
> Now it seems that a lot of time is spent in the Hadoop framework (compared to, say, the indexing by Lucene and the loading of documents from the file system).
>
> Would that mean that the overhead of the Hadoop framework is killing the performance on a single box?

It is hard to say anything without knowing what you are comparing.

1) How are you indexing pages with Lucene (which analyzers, etc.)?

2) Is the 4H10M spent only in the indexing job, or is it the total
duration of the entire crawl?

3) How big are your crawldb, linkdb, etc.? The indexer reads a lot of
different structures to combine data; perhaps I/O takes a lot of
time...



-- 
Doğacan Güney