Posted to user@nutch.apache.org by AJ Chen <ca...@gmail.com> on 2005/09/26 08:24:43 UTC

link analysis and update segments

In a whole-web or vertical crawling setting, is it right that link 
analysis and updating segments from the DB should be performed in the 
right order before indexing the segments?

There's not much talk about updating segments from the DB. I think it 
should be an important step. Could someone point out when it should be 
run and what the benefits are?

I remember it was mentioned some time ago that the link analysis tool 
does not work yet and that the number of in-links should be used 
instead. Any update? If it's still not working, how do I set it to use 
in-link counts?

Thanks,
AJ


Re: link analysis and update segments

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
1)
I can't answer this question fully. I think this method of scoring is 
good.

2)
In my case it took close to one day.

Michael Ji wrote:

> 1)
> Does that mean a page's outlink number will count toward its score?
> (I seem to have seen this logic in the code, but can't remember where.)
>
> Then my question is: how accurate will the score from this method be?
>
> I mean, theoretically, a page's score depends on the number of in-links
> and the scores of the source pages of those in-links.
>
> 2)
> How much does the link analysis tool cost? For example, if I have 10
> million pages in the WebDB, how long will it take to run?
>
> thanks,
>
> Michael Ji


Re: Is it at all necessary to merge segments in MapRed?

Posted by Doug Cutting <cu...@nutch.org>.
Gal Nitzan wrote:
> If I understand correctly, having one segment or a hundred is not 
> important?

It depends.  If you have hundreds of segments and are trying to search 
them with a single JVM then you will probably run out of file handles.
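For anyone hitting that wall, the per-process file-descriptor limit can be checked and raised before launching the search JVM. This is a generic POSIX-shell sketch, not Nutch-specific advice; the value 4096 is just an illustrative number:

```shell
# Show the current soft limit on open file descriptors for this shell.
# Each open segment index holds several descriptors, so hundreds of
# segments can exhaust a low default such as 1024.
ulimit -n

# Try to raise the limit for this shell session before starting the
# search JVM (raising it past the hard limit requires root privileges,
# hence the fallback message).
ulimit -n 4096 2>/dev/null || echo "could not raise limit"
```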

> What happens when a page is fetched a second time? Is there something 
> to deduplicate it?

The dedup command has not yet been implemented in the mapred branch. 
Coming soon.

Doug

Re: Is it at all necessary to merge segments in MapRed?

Posted by Gal Nitzan <gn...@usa.net>.
Doug Cutting wrote:
> Gal Nitzan wrote:
>> Is it at all necessary to merge segments located in the NDFS and, if 
>> it is, how?
>
> Segment merging is not yet implemented in the mapred branch.  Whether 
> it will be required remains to be seen.
>
> Doug
>
>
Thanks Doug.

If I understand correctly, having one segment or a hundred is not important?

What happens when a page is fetched a second time? Is there something to 
deduplicate it?

Regards,

Gal

Re: Is it at all necessary to merge segments in MapRed?

Posted by Doug Cutting <cu...@nutch.org>.
Gal Nitzan wrote:
> Is it at all necessary to merge segments located in the NDFS and, if 
> it is, how?

Segment merging is not yet implemented in the mapred branch.  Whether it 
will be required remains to be seen.

Doug

Is it at all necessary to merge segments in MapRed?

Posted by Gal Nitzan <gn...@usa.net>.
Well,

I was not able to find any info about that...

Is it at all necessary to merge segments located in the NDFS and, if it 
is, how?

Thanks, Gal

Re: link analysis and update segments

Posted by Michael Ji <fj...@yahoo.com>.
1)
Does that mean a page's outlink number will count toward its score?
(I seem to have seen this logic in the code, but can't remember where.)

Then my question is: how accurate will the score from this method be?

I mean, theoretically, a page's score depends on the number of in-links
and the scores of the source pages of those in-links.

2)
How much does the link analysis tool cost? For example, if I have 10
million pages in the WebDB, how long will it take to run?

thanks,

Michael Ji

--- "yoursoft@freemail.hu" <yo...@freemail.hu>
wrote:

> The linkanalisis tool needs long time to process.
> Doug wrotte some comments from it:
> The fetchlist.score.by.link.count and the
> indexer.boost.by.link.count to 
> true. And forgot using of linkanalysis tool.
> I use these method from since 2005 June, without
> problem.
> With the linkanalysis tool the scoring is better,
> but with the explained 
> setup it is near scoring - without many resource
> usage.
> 
> Michael Ji wrotte:
> 
> >Hi,
> >
> >As my understanding, link anaylsis is neccessary to
> 
> >run whenever a new fetching is updated to webdb.
> >Because the link graphic is changed ( it is
> possible
> >that new links are added and old links are deleted
> ),
> >the score for each node is changed so a
> recaculation
> >is neccessary. 
> >
> >Link analysis will update the score for each node
> (by
> >page) in webdb, then updatesegmentfromdb needs to
> run
> >to copy recalculated score to segment.
> >
> >I can't see a point that we can skip link anaylsis.
> Am
> >I missing something important? Let me know.
> >
> >thanks,
> >
> >Michael Ji,
> >
> >
> >--- AJ Chen <ca...@gmail.com> wrote:
> >
> >  
> >
> >>I assume you mean UpdateSegmentFromDB, and there
> is
> >>no need to run link 
> >>analysis tool if I want to use the number of
> inlinks
> >>for nutch score. 
> >>Right? I tried to find your patch, but couldn't
> find
> >>it. How to find it?
> >>-AJ
> >>
> >>Piotr Kosiorowski wrote:
> >>
> >>    
> >>
> >>>UpdateDB copies link information and score from
> >>>      
> >>>
> >>the WebDB to segments 
> >>    
> >>
> >>>so it is important to have score calculated
> before
> >>>      
> >>>
> >>updatedb is run. 
> >>    
> >>
> >>>One can use current standard nutch score (based
> on
> >>>      
> >>>
> >>number of inlinks) 
> >>    
> >>
> >>>or try to use analyze - I have committed a patch
> >>>      
> >>>
> >>for it some time ago 
> >>    
> >>
> >>>that might help a bit with it disk space
> >>>      
> >>>
> >>requirements so the best 
> >>    
> >>
> >>>approach would be to test it (it worked ok for
> me)
> >>>      
> >>>
> >>and if it is ok for 
> >>    
> >>
> >>>you - report it so others can also try it out.
> >>>Regards
> >>>Piotr
> >>>AJ Chen wrote:
> >>>
> >>>      
> >>>
> >>>>In a whole-web or vertical crawling setting, is
> >>>>        
> >>>>
> >>it right that link 
> >>    
> >>
> >>>>analysis and update segment from DB should be
> >>>>        
> >>>>
> >>performed in right 
> >>    
> >>
> >>>>order before indexing the segments?
> >>>>
> >>>>There's not much talk about update segment from
> >>>>        
> >>>>
> >>DB. I think it should 
> >>    
> >>
> >>>>be an important step. Could someone point out
> >>>>        
> >>>>
> >>when it should be  run 
> >>    
> >>
> >>>>and what the benefits are?
> >>>>
> >>>>I remember it was mentioned sometime ago that
> the
> >>>>        
> >>>>
> >>link analysis tool 
> >>    
> >>
> >>>>does not work yet and the number of in-links
> >>>>        
> >>>>
> >>should be used instead. 
> >>    
> >>
> >>>>Any update? If it's still not working, how to
> set
> >>>>        
> >>>>
> >>it to use link 
> >>    
> >>
> >>>>numbers?
> >>>>
> >>>>Thanks,
> >>>>AJ
> >>>>
> >>>>
> >>>>        
> >>>>
> >>>      
> >>>
> >>    
> >>
> >
> >
> >
> >		
> >__________________________________ 
> >Yahoo! Mail - PC Magazine Editors' Choice 2005 
> >http://mail.yahoo.com
> >
> >
> >  
> >
> 
> 



		

Re: link analysis and update segments

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
The linkanalysis tool needs a long time to process.
Doug wrote some comments on it: set fetchlist.score.by.link.count and 
indexer.boost.by.link.count to true, and forget about the linkanalysis 
tool. I have used this method since June 2005 without problems.
With the linkanalysis tool the scoring is better, but with the setup 
explained above the scoring is close, without heavy resource usage.
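For reference, here is a sketch of how those two properties might be set in conf/nutch-site.xml. The property names come from the advice above; the `<nutch-conf>` root element is how 0.7-era Nutch configuration files were laid out, so check your conf/nutch-default.xml for the exact format:

```xml
<?xml version="1.0"?>
<nutch-conf>
  <!-- Score new fetchlist entries by in-link count instead of running
       the separate link-analysis tool. -->
  <property>
    <name>fetchlist.score.by.link.count</name>
    <value>true</value>
  </property>
  <!-- Boost documents at index time by their in-link count. -->
  <property>
    <name>indexer.boost.by.link.count</name>
    <value>true</value>
  </property>
</nutch-conf>
```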

Michael Ji wrote:

> Hi,
>
> As I understand it, link analysis needs to run whenever a new fetch is
> updated to the WebDB. Because the link graph has changed (new links may
> have been added and old links deleted), the score for each node
> changes, so a recalculation is necessary.
>
> Link analysis updates the score for each node (page) in the WebDB;
> then updatesegmentfromdb needs to run to copy the recalculated scores
> to the segments.
>
> I can't see a point where we can skip link analysis. Am I missing
> something important? Let me know.
>
> thanks,
>
> Michael Ji
>
> --- AJ Chen <ca...@gmail.com> wrote:
>
> > I assume you mean UpdateSegmentFromDB, and there is no need to run
> > the link analysis tool if I want to use the number of inlinks for
> > the Nutch score. Right? I tried to find your patch, but couldn't
> > find it. How can I find it?
> > -AJ
> >
> > [earlier messages trimmed]


Re: link analysis and update segments

Posted by Michael Ji <fj...@yahoo.com>.
Hi,

As I understand it, link analysis needs to run whenever a new fetch is 
updated to the WebDB. Because the link graph has changed (new links may 
have been added and old links deleted), the score for each node changes, 
so a recalculation is necessary.

Link analysis updates the score for each node (page) in the WebDB; then 
updatesegmentfromdb needs to run to copy the recalculated scores to the 
segments.

I can't see a point where we can skip link analysis. Am I missing 
something important? Let me know.

thanks,

Michael Ji,
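The recalculation Michael describes is essentially an iterative fixed point over the link graph: each page's score depends on the scores of the pages linking to it. A toy sketch of the idea follows; this is a generic PageRank-style update, not Nutch's actual code, and the function name, graph, and damping value are all illustrative:

```python
# Toy illustration of why scores must be recomputed after the link graph
# changes: each page's score depends on the scores of its in-link sources,
# so adding or deleting links changes scores across the whole graph.

def recalculate_scores(inlinks, outdegree, scores, damping=0.85):
    """One iteration: each source page redistributes its score over its outlinks."""
    new_scores = {}
    for page, sources in inlinks.items():
        contrib = sum(scores[s] / outdegree[s] for s in sources)
        new_scores[page] = (1 - damping) + damping * contrib
    return new_scores

# A tiny three-page graph: A -> B, A -> C, B -> C
inlinks = {"A": [], "B": ["A"], "C": ["A", "B"]}
outdegree = {"A": 2, "B": 1, "C": 1}

scores = {p: 1.0 for p in inlinks}
for _ in range(20):  # iterate until the scores settle
    scores = recalculate_scores(inlinks, outdegree, scores)

print(scores["C"] > scores["B"] > scores["A"])  # more in-link weight, higher score
```

The same intuition explains the cheaper alternative discussed in this thread: using the raw in-link count skips the iteration entirely, at the cost of ignoring how important each linking page is.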


--- AJ Chen <ca...@gmail.com> wrote:

> I assume you mean UpdateSegmentFromDB, and there is no need to run the
> link analysis tool if I want to use the number of inlinks for the
> Nutch score. Right? I tried to find your patch, but couldn't find it.
> How can I find it?
> -AJ
>
> Piotr Kosiorowski wrote:
>
> > UpdateDB copies link information and scores from the WebDB to the
> > segments, so it is important to have scores calculated before
> > updatedb is run. One can use the current standard Nutch score (based
> > on the number of inlinks) or try to use analyze. I committed a patch
> > for it some time ago that might help a bit with its disk space
> > requirements, so the best approach would be to test it (it worked OK
> > for me) and, if it is OK for you, report back so others can also try
> > it.
> > Regards
> > Piotr
> >
> > [earlier messages trimmed]



		

Re: link analysis and update segments

Posted by AJ Chen <ca...@gmail.com>.
I assume you mean UpdateSegmentFromDB, and there is no need to run the 
link analysis tool if I want to use the number of inlinks for the Nutch 
score. Right? I tried to find your patch, but couldn't find it. How can 
I find it?
-AJ

Piotr Kosiorowski wrote:

> UpdateDB copies link information and scores from the WebDB to the 
> segments, so it is important to have scores calculated before updatedb 
> is run. One can use the current standard Nutch score (based on the 
> number of inlinks) or try to use analyze. I committed a patch for it 
> some time ago that might help a bit with its disk space requirements, 
> so the best approach would be to test it (it worked OK for me) and, if 
> it is OK for you, report back so others can also try it.
> Regards
> Piotr
> AJ Chen wrote:
>
>> In a whole-web or vertical crawling setting, is it right that link 
>> analysis and update segment from DB should be performed in right 
>> order before indexing the segments?
>>
>> There's not much talk about update segment from DB. I think it should 
>> be an important step. Could someone point out when it should be  run 
>> and what the benefits are?
>>
>> I remember it was mentioned sometime ago that the link analysis tool 
>> does not work yet and the number of in-links should be used instead. 
>> Any update? If it's still not working, how to set it to use link 
>> numbers?
>>
>> Thanks,
>> AJ
>>
>>
>
>


Re: link analysis and update segments

Posted by Piotr Kosiorowski <pk...@gmail.com>.
UpdateDB copies link information and scores from the WebDB to the 
segments, so it is important to have scores calculated before updatedb 
is run. One can use the current standard Nutch score (based on the 
number of inlinks) or try to use analyze. I committed a patch for it 
some time ago that might help a bit with its disk space requirements, so 
the best approach would be to test it (it worked OK for me) and, if it 
is OK for you, report back so others can also try it.
Regards
Piotr
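Put as a command sequence, the ordering Piotr describes for the pre-mapred tools would look roughly like this. The subcommand names below are from memory of the 0.7-era bin/nutch script and should be checked against its usage output; `<new-segment>` stands for the timestamped segment directory each generate step creates:

```shell
bin/nutch generate db segments                # build a fetchlist from the WebDB
bin/nutch fetch segments/<new-segment>        # fetch the listed pages
bin/nutch updatedb db segments/<new-segment>  # fold new pages and links into the WebDB
bin/nutch analyze db 5                        # optional: iterative link analysis (5 rounds)
bin/nutch updatesegs db segments              # copy the (re)calculated scores into the segments
bin/nutch index segments/<new-segment>        # index only after the scores are in place
```

With fetchlist.score.by.link.count enabled, the analyze step can be skipped entirely, which is the low-resource setup recommended elsewhere in this thread.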
AJ Chen wrote:
> In a whole-web or vertical crawling setting, is it right that link 
> analysis and updating segments from the DB should be performed in the 
> right order before indexing the segments?
> 
> There's not much talk about updating segments from the DB. I think it 
> should be an important step. Could someone point out when it should be 
> run and what the benefits are?
> 
> I remember it was mentioned some time ago that the link analysis tool 
> does not work yet and that the number of in-links should be used 
> instead. Any update? If it's still not working, how do I set it to use 
> in-link counts?
> 
> Thanks,
> AJ
> 
> 


New SE

Posted by David Webster <tr...@loxinfo.co.th>.
Hello, 

I've been subscribed to this list for several months and have read
numerous posts, although most sail merrily above my noggin. I hope this
is not an inappropriate post.

I want to start an SE and have done quite a lot of thinking about it ..
although I'm not a programmer by any stretch of the imagination .. and
my budget is .. um .. "challenged"

The SE I want is a clustering SE for Travel .. for specific regions.

- I want to spider and index as many pages as possible (I think) on
*one server*
- I want to exclude *all* predominant affiliate sites and all directory
sites.
- I *think* I only want to spider to 3 levels .. as the site should be
about travel resources more so than detailed information. 
- I believe updating the information (respidering) every 30 days is
sufficient .. maybe even every 60 days?
- I think the easiest part is to set up Nutch and get it working
.. and the harder part is configuring the crawler (or the indexer?) to
include only those URLs that fit the requirements I have defined above.
- One specific section of the DMOZ index would probably be okay for
seeding the database

So I'm hoping to get:

- Some reasonable comments on my plan
- A price from a consultant (free works too! ;) to get Nutch +
clustering set up and running
- A price from a consultant to configure the spider/indexer.

Thanks and best regards,

Dave W.