Posted to dev@nutch.apache.org by JoostRuiter <jo...@adnexus-recruitment.nl> on 2007/04/23 16:31:57 UTC

Performance problems and segmenting

Hi All,

First off, I'm quite the noob when it comes to Nutch, so don't bash me if
the following is an enormously stupid question.

We're using Nutch on a dual-core Pentium 4 system (800 MHz FSB) with 4 GB of
RAM and a 500 GB SATA (3 Gb/s) hard disk. We indexed 350,000 pages into a
single segment of 15 GB.


Performance is really poor: when we do get search results, they take
multiple minutes. With longer queries we get the following:

"java.lang.OutOfMemoryError: Java heap space"

What we have tried so far to improve this (the two property settings are
sketched below):
- Slice the segments into smaller chunks (max. 50,000 URLs per segment)
- Set io.map.index.skip to 8
- Set indexer.termIndexInterval to 1024
- Cluster with Hadoop (4 search nodes)
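
For reference, here is roughly how those two properties are overridden in
conf/nutch-site.xml (a sketch; the values are the ones listed above):

<property>
  <name>io.map.index.skip</name>
  <value>8</value>
  <description>Load only every (skip+1)-th MapFile index entry into memory,
  trading some lookup speed for a smaller heap.</description>
</property>
<property>
  <name>indexer.termIndexInterval</name>
  <value>1024</value>
  <description>Lucene term index interval; a larger value keeps a smaller
  term index in memory at search time.</description>
</property>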

Any ideas? Missing information? Please let me know, this is my graduation
internship and I would really like to get a good grade ;)
-- 
View this message in context: http://www.nabble.com/Perfomance-problems-and-segmenting-tf3631982.html#a10141310
Sent from the Nutch - Dev mailing list archive at Nabble.com.


Re: Performance problems and segmenting

Posted by JoostRuiter <jo...@adnexus-recruitment.nl>.
Dear Briggs,

Currently we have allocated 1 GB to the JVM for Resin/Tomcat.
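
If it turns out the webapp simply needs more headroom, the container heap can
be raised before startup. A rough sketch for Tomcat, assuming we can spare
about half of the 4 GB box for the search webapp:

# give the Tomcat JVM that serves the Nutch webapp a larger heap
export CATALINA_OPTS="-Xms512m -Xmx2048m"
$CATALINA_HOME/bin/catalina.sh start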

Greetings, 

Joost


Briggs wrote:
> 
> How much memory are you currently allocating to the search servers?
> 
> 
> 
> On 4/23/07, JoostRuiter <jo...@adnexus-recruitment.nl> wrote:
>>
>> Hi All,
>>
>> First off, I'm quite the noob when it comes to Nutch, so don't bash me if
>> the following is an enormously stupid question.
>>
>> We're using Nutch on a dual-core Pentium 4 system (800 MHz FSB) with 4 GB
>> of RAM and a 500 GB SATA (3 Gb/s) hard disk. We indexed 350,000 pages into
>> a single segment of 15 GB.
>>
>>
>> Performance is really poor: when we do get search results, they take
>> multiple minutes. With longer queries we get the following:
>>
>> "java.lang.OutOfMemoryError: Java heap space"
>>
>> What we have tried to improve this:
>> - Slice the segments into smaller chunks (max. 50,000 URLs per segment)
>> - Set io.map.index.skip to 8
>> - Set indexer.termIndexInterval to 1024
>> - Cluster with Hadoop (4 search nodes)
>>
>> Any ideas? Missing information? Please let me know, this is my graduation
>> internship and I would really like to get a good grade ;)
>> --
>> View this message in context:
>> http://www.nabble.com/Perfomance-problems-and-segmenting-tf3631982.html#a10141310
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>>
>>
> 
> 
> -- 
> "Conscious decisions by concious minds are what make reality real"
> 
> 

-- 
View this message in context: http://www.nabble.com/Perfomance-problems-and-segmenting-tf3631982.html#a10141752
Sent from the Nutch - Dev mailing list archive at Nabble.com.


Re: Performance problems and segmenting

Posted by Briggs <ac...@gmail.com>.
How much memory are you currently allocating to the search servers?



On 4/23/07, JoostRuiter <jo...@adnexus-recruitment.nl> wrote:
>
> Hi All,
>
> First off, I'm quite the noob when it comes to Nutch, so don't bash me if
> the following is an enormously stupid question.
>
> We're using Nutch on a dual-core Pentium 4 system (800 MHz FSB) with 4 GB of
> RAM and a 500 GB SATA (3 Gb/s) hard disk. We indexed 350,000 pages into a
> single segment of 15 GB.
>
>
> Performance is really poor: when we do get search results, they take
> multiple minutes. With longer queries we get the following:
>
> "java.lang.OutOfMemoryError: Java heap space"
>
> What we have tried to improve this:
> - Slice the segments into smaller chunks (max. 50,000 URLs per segment)
> - Set io.map.index.skip to 8
> - Set indexer.termIndexInterval to 1024
> - Cluster with Hadoop (4 search nodes)
>
> Any ideas? Missing information? Please let me know, this is my graduation
> internship and I would really like to get a good grade ;)
> --
> View this message in context: http://www.nabble.com/Perfomance-problems-and-segmenting-tf3631982.html#a10141310
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>
>


-- 
"Conscious decisions by concious minds are what make reality real"

Re: Performance problems and segmenting

Posted by JoostRuiter <jo...@adnexus-recruitment.nl>.
I got some additional info from our developer:

"I never
had much luck with the merge tools but you might post this snippit from
your log to the board:

2007-04-23 20:01:56,656 INFO  segment.SegmentMerger - Slice size: 50000
URLs.
2007-04-23 20:01:56,656 INFO  segment.SegmentMerger - Slice size: 50000
URLs.
2007-04-23 21:28:09,031 WARN  mapred.LocalJobRunner - job_gai7an
java.lang.OutOfMemoryError: Java heap space

That might give them a little more info, since it shows when the error occurred."
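
Since that merge ran in Hadoop's local job runner, the heap that matters is
the one bin/nutch gives the client JVM. A rough sketch of one thing we could
try, assuming bin/nutch still honours the NUTCH_HEAPSIZE variable (in MB):

# roughly double the client heap before re-running the merge
export NUTCH_HEAPSIZE=2000
nutch mergesegs arscrminternal/outseg -dir arscrminternal/segments/ -slice 50000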



JoostRuiter wrote:
> 
> Hey guys,
> 
> one more addition: we're not using DFS. We've got a single Windows XP box
> with NTFS (so no distributed index).
>
> Hope this helps, greetings..
>
> And for some strange reason we got the following error after slicing the
> segments into 50K-URL pieces:
> 
> $ nutch mergesegs arscrminternal/outseg -dir arscrminternal/segments/
> -slice 50000
> Merging 1 segments to arscrminternal/outseg/20070423163605
> SegmentMerger:   adding arscrminternal/segments/20070421110321
> SegmentMerger: using segment data from: content crawl_generate crawl_fetch
> crawl_parse parse_data parse_text
> Slice size: 50000 URLs.
> Slice size: 50000 URLs.
> Slice size: 50000 URLs.
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
>         at
> org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:627)
>         at
> org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:675)
> 
> 
> We thought making smaller chunks would help performance, but we didn't even
> get around to testing it because of the above error. Any ideas?
> 
> 
> 
> JoostRuiter wrote:
>> 
>> OK, thanks for all your input, guys! I'll discuss this with my co-worker.
>> Dennis, what more information do you need?
>> 
>> Thanks everyone!
>> 
>> 
>> Briggs wrote:
>>> 
>>> One more thing...
>>> 
>>> Are you using a distributed index?  If this is so, you do not want to
>>> do this; indexes should be local to the machine that is being
>>> searched.
>>> 
>>> On 4/23/07, Dennis Kubes <nu...@dragonflymc.com> wrote:
>>>> Without more information, this sounds like your Tomcat search webapp's
>>>> nutch-site.xml file is set up to use the DFS rather than the local file
>>>> system.  Remember that processing jobs occur on the DFS, but for
>>>> searching, indexes are best moved to the local file system.
>>>>
>>>> Dennis Kubes
>>>>
>>>> JoostRuiter wrote:
>>>> > Hi All,
>>>> >
>>>> > First off, I'm quite the noob when it comes to Nutch, so don't bash
>>>> me if
>>>> > the following is an enormously stupid question.
>>>> >
>>>> > We're using Nutch on a dual-core Pentium 4 system (800 MHz FSB) with
>>>> > 4 GB of RAM and a 500 GB SATA (3 Gb/s) hard disk. We indexed 350,000
>>>> > pages into a single segment of 15 GB.
>>>> >
>>>> >
>>>> > Performance is really poor: when we do get search results, they take
>>>> > multiple minutes. With longer queries we get the following:
>>>> >
>>>> > "java.lang.OutOfMemoryError: Java heap space"
>>>> >
>>>> > What we have tried to improve this:
>>>> > - Slice the segments into smaller chunks (max. 50,000 URLs per segment)
>>>> > - Set io.map.index.skip to 8
>>>> > - Set indexer.termIndexInterval to 1024
>>>> > - Cluster with Hadoop (4 search nodes)
>>>> >
>>>> > Any ideas? Missing information? Please let me know, this is my
>>>> graduation
>>>> > internship and I would really like to get a good grade ;)
>>>>
>>> 
>>> 
>>> -- 
>>> "Conscious decisions by conscious minds are what make reality real"
>>> 
>>> 
>> 
>> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Perfomance-problems-and-segmenting-tf3631982.html#a10158788
Sent from the Nutch - Dev mailing list archive at Nabble.com.


Re: Performance problems and segmenting

Posted by JoostRuiter <jo...@adnexus-recruitment.nl>.
Hey guys,

one more addition: we're not using DFS. We've got a single Windows XP box with
NTFS (so no distributed index).

Hope this helps, greetings..

And for some strange reason we got the following error after slicing the
segments into 50K-URL pieces:

$ nutch mergesegs arscrminternal/outseg -dir arscrminternal/segments/ -slice
50000
Merging 1 segments to arscrminternal/outseg/20070423163605
SegmentMerger:   adding arscrminternal/segments/20070421110321
SegmentMerger: using segment data from: content crawl_generate crawl_fetch
crawl_parse parse_data parse_text
Slice size: 50000 URLs.
Slice size: 50000 URLs.
Slice size: 50000 URLs.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
        at
org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:627)
        at
org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:675)


We thought making smaller chunks would help performance, but we didn't even
get around to testing it because of the above error. Any ideas?



JoostRuiter wrote:
> 
> OK, thanks for all your input, guys! I'll discuss this with my co-worker.
> Dennis, what more information do you need?
> 
> Thanks everyone!
> 
> 
> Briggs wrote:
>> 
>> One more thing...
>> 
>> Are you using a distributed index?  If this is so, you do not want to
>> do this; indexes should be local to the machine that is being
>> searched.
>> 
>> On 4/23/07, Dennis Kubes <nu...@dragonflymc.com> wrote:
>>> Without more information, this sounds like your Tomcat search webapp's
>>> nutch-site.xml file is set up to use the DFS rather than the local file
>>> system.  Remember that processing jobs occur on the DFS, but for
>>> searching, indexes are best moved to the local file system.
>>>
>>> Dennis Kubes
>>>
>>> JoostRuiter wrote:
>>> > Hi All,
>>> >
>>> > First off, I'm quite the noob when it comes to Nutch, so don't bash me
>>> if
>>> > the following is an enormously stupid question.
>>> >
>>> > We're using Nutch on a dual-core Pentium 4 system (800 MHz FSB) with
>>> > 4 GB of RAM and a 500 GB SATA (3 Gb/s) hard disk. We indexed 350,000
>>> > pages into a single segment of 15 GB.
>>> >
>>> >
>>> > Performance is really poor: when we do get search results, they take
>>> > multiple minutes. With longer queries we get the following:
>>> >
>>> > "java.lang.OutOfMemoryError: Java heap space"
>>> >
>>> > What we have tried to improve this:
>>> > - Slice the segments into smaller chunks (max. 50,000 URLs per segment)
>>> > - Set io.map.index.skip to 8
>>> > - Set indexer.termIndexInterval to 1024
>>> > - Cluster with Hadoop (4 search nodes)
>>> >
>>> > Any ideas? Missing information? Please let me know, this is my
>>> graduation
>>> > internship and I would really like to get a good grade ;)
>>>
>> 
>> 
>> -- 
>> "Conscious decisions by conscious minds are what make reality real"
>> 
>> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Perfomance-problems-and-segmenting-tf3631982.html#a10155864
Sent from the Nutch - Dev mailing list archive at Nabble.com.


Re: Performance problems and segmenting

Posted by JoostRuiter <jo...@adnexus-recruitment.nl>.
OK, thanks for all your input, guys! I'll discuss this with my co-worker.
Dennis, what more information do you need?

Thanks everyone!


Briggs wrote:
> 
> One more thing...
> 
> Are you using a distributed index?  If this is so, you do not want to
> do this; indexes should be local to the machine that is being
> searched.
> 
> On 4/23/07, Dennis Kubes <nu...@dragonflymc.com> wrote:
>> Without more information, this sounds like your Tomcat search webapp's
>> nutch-site.xml file is set up to use the DFS rather than the local file
>> system.  Remember that processing jobs occur on the DFS, but for
>> searching, indexes are best moved to the local file system.
>>
>> Dennis Kubes
>>
>> JoostRuiter wrote:
>> > Hi All,
>> >
>> > First off, I'm quite the noob when it comes to Nutch, so don't bash me
>> if
>> > the following is an enormously stupid question.
>> >
>> > We're using Nutch on a dual-core Pentium 4 system (800 MHz FSB) with
>> > 4 GB of RAM and a 500 GB SATA (3 Gb/s) hard disk. We indexed 350,000
>> > pages into a single segment of 15 GB.
>> >
>> >
>> > Performance is really poor: when we do get search results, they take
>> > multiple minutes. With longer queries we get the following:
>> >
>> > "java.lang.OutOfMemoryError: Java heap space"
>> >
>> > What we have tried to improve this:
>> > - Slice the segments into smaller chunks (max. 50,000 URLs per segment)
>> > - Set io.map.index.skip to 8
>> > - Set indexer.termIndexInterval to 1024
>> > - Cluster with Hadoop (4 search nodes)
>> >
>> > Any ideas? Missing information? Please let me know, this is my
>> graduation
>> > internship and I would really like to get a good grade ;)
>>
> 
> 
> -- 
> "Conscious decisions by conscious minds are what make reality real"
> 
> 

-- 
View this message in context: http://www.nabble.com/Perfomance-problems-and-segmenting-tf3631982.html#a10155851
Sent from the Nutch - Dev mailing list archive at Nabble.com.


Re: Performance problems and segmenting

Posted by Briggs <ac...@gmail.com>.
One more thing...

Are you using a distributed index?  If this is so, you do not want to
do this; indexes should be local to the machine that is being
searched.
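
In Nutch terms, distributed search is driven by a search-servers.txt file
under searcher.dir, listing one "host port" pair per search node; if that
file is present, the webapp queries those remote search servers instead of a
local index. A sketch with placeholder hostnames (for a single local box you
simply leave the file out):

node1.example.com 9999
node2.example.com 9999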

On 4/23/07, Dennis Kubes <nu...@dragonflymc.com> wrote:
> Without more information, this sounds like your Tomcat search webapp's
> nutch-site.xml file is set up to use the DFS rather than the local file
> system.  Remember that processing jobs occur on the DFS, but for
> searching, indexes are best moved to the local file system.
>
> Dennis Kubes
>
> JoostRuiter wrote:
> > Hi All,
> >
> > First off, I'm quite the noob when it comes to Nutch, so don't bash me if
> > the following is an enormously stupid question.
> >
> > We're using Nutch on a dual-core Pentium 4 system (800 MHz FSB) with 4 GB
> > of RAM and a 500 GB SATA (3 Gb/s) hard disk. We indexed 350,000 pages into
> > a single segment of 15 GB.
> >
> >
> > Performance is really poor: when we do get search results, they take
> > multiple minutes. With longer queries we get the following:
> >
> > "java.lang.OutOfMemoryError: Java heap space"
> >
> > What we have tried to improve this:
> > - Slice the segments into smaller chunks (max. 50,000 URLs per segment)
> > - Set io.map.index.skip to 8
> > - Set indexer.termIndexInterval to 1024
> > - Cluster with Hadoop (4 search nodes)
> >
> > Any ideas? Missing information? Please let me know, this is my graduation
> > internship and I would really like to get a good grade ;)
>


-- 
"Conscious decisions by conscious minds are what make reality real"

Re: Performance problems and segmenting

Posted by Dennis Kubes <nu...@dragonflymc.com>.
Without more information, this sounds like your Tomcat search webapp's
nutch-site.xml file is set up to use the DFS rather than the local file
system.  Remember that processing jobs occur on the DFS, but for
searching, indexes are best moved to the local file system.
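
Roughly, that means the nutch-site.xml the search webapp sees (e.g. the copy
under WEB-INF/classes) should point at the local file system and at a local
copy of the crawl. A sketch, with /data/crawl as a placeholder path:

<property>
  <name>fs.default.name</name>
  <value>local</value>
  <description>Search against the local file system, not the DFS.</description>
</property>
<property>
  <name>searcher.dir</name>
  <value>/data/crawl</value>
  <description>Local directory holding the index and segments to search.</description>
</property>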

Dennis Kubes

JoostRuiter wrote:
> Hi All,
> 
> First off, I'm quite the noob when it comes to Nutch, so don't bash me if
> the following is an enormously stupid question.
> 
> We're using Nutch on a dual-core Pentium 4 system (800 MHz FSB) with 4 GB of
> RAM and a 500 GB SATA (3 Gb/s) hard disk. We indexed 350,000 pages into a
> single segment of 15 GB.
>
>
> Performance is really poor: when we do get search results, they take
> multiple minutes. With longer queries we get the following:
>
> "java.lang.OutOfMemoryError: Java heap space"
>
> What we have tried to improve this:
> - Slice the segments into smaller chunks (max. 50,000 URLs per segment)
> - Set io.map.index.skip to 8
> - Set indexer.termIndexInterval to 1024
> - Cluster with Hadoop (4 search nodes)
> 
> Any ideas? Missing information? Please let me know, this is my graduation
> internship and I would really like to get a good grade ;)