You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@nutch.apache.org by Dennis Kubes <ku...@apache.org> on 2008/07/14 17:57:05 UTC

How to walk a webgraph?

Does anybody know how to efficiently (non-exponentially) walk a web 
graph to detect cycles.  This would be very useful in identifying spammy 
webpage and tight knit communities.

I have a program that I will be releasing soon that does the detection 
through converting a webgraph into a tree and walking the tree nodes, 
but it is exponential in terms of intermediate map reduce output and 
computation.

Dennis

Remote connection from search.jsp to nutchbean

Posted by Fritz Bein <fr...@gmx.de>.
Hi,

can anybody tell me if it is possible to run the search interface (i.e. search.jsp) on a different computer than that computer where the nutchbean runs on? 

I want to setup a server for the purpose of delivering search results to another server that generates the html-pages using the results.

Any help is very much appreciated.
Regards,
Fritz

-- 
Der GMX SmartSurfer hilft bis zu 70% Ihrer Onlinekosten zu sparen! 
Ideal für Modem und ISDN: http://www.gmx.net/de/go/smartsurfer

Re: How to walk a webgraph?

Posted by Dennis Kubes <ku...@apache.org>.
True you would definitely want to set a depth limit.  Not just to limit 
run time but also because of the fundamental nature of the problem. 
Reciprocal links and link loops of 2,3,4,5, maybe even six tend to show 
tight knit communities.  Beyond that you start getting into the random 
nature of the web connections.

Right now the problem is not so much the run time as the intermediate 
output.  A two depth run, which identifies reciprocal and 2-node cycles 
such as A->B->C->A creates a few hundred gigs of output on an 800M url 
database.  A three depth run to identify A->B->C->D->A generates > 10T 
of intermediate output.

What I am really trying to do is find a more efficient way to either:

1) Store the webgraph to apply a computation
2) Process the graph as in walking a tree in a way that doesn't lead to 
storing (processing is fine) outlinks of outlinks of outlinks, etc.

Don't know if that helps to explain the problem or not.

Dennis

hank williams wrote:
> On Tue, Jul 15, 2008 at 10:20 AM, brainstorm <br...@gmail.com> wrote:
> 
>> On Tue, Jul 15, 2008 at 4:12 PM, hank williams <ha...@gmail.com> wrote:
>>> Note that these algorithms will *not* resolve the fundamentally
>> exponential
>>> nature of the problem,
>>
>> O(|V|+|E|), IIRC
>>
> 
> True. The problem is not actually exponential. It is the *dataset* that is
> when you talk about the web graph. The problem is essentially linear with
> the size of the dataset. If your dataset is constrained (by setting a limit)
> then so is your execution.
> 
>>
>>> efficiently as possible, but this still involves essentially visiting
>> every
>>> node in the tree at least once until you hit a cycle in the tree or
>> decide
>>> you have traveled too far from the start. Moreover, after quickly
>> perusing
>>> (in other words I could have missed it) I did not see specific mention of
>>> setting a depth of search limit. it is critical that you set a limit or
>> you
>>> will end up attempting to spider the entire internet in search of cycles.
>>> The examples shown in the references are (for good reason) constrained in
>>> terms of node count, but on the real web the pool is (obviously) much
>> larger
>>> and the problem becomes fundamentally different.
>>>
>>> Hank
>>>
>>> On Tue, Jul 15, 2008 at 9:43 AM, Dennis Kubes <ku...@apache.org> wrote:
>>>
>>>> Excellent.  Thanks was exactly what I needed.  Thanks Andrzej.
>>>>
>>>> Dennis
>>>>
>>>>
>>>> Andrzej Bialecki wrote:
>>>>
>>>>> Dennis Kubes wrote:
>>>>>
>>>>>> Does anybody know how to efficiently (non-exponentially) walk a web
>> graph
>>>>>> to detect cycles.  This would be very useful in identifying spammy
>> webpage
>>>>>> and tight knit communities.
>>>>>>
>>>>>> I have a program that I will be releasing soon that does the detection
>>>>>> through converting a webgraph into a tree and walking the tree nodes,
>> but it
>>>>>> is exponential in terms of intermediate map reduce output and
>> computation.
>>>>> Perhaps this could be helpful:
>>>>>
>>>>> * http://www.cs.berkeley.edu/~kamil/teaching/sp03/041403.pdf<http://www.cs.berkeley.edu/%7Ekamil/teaching/sp03/041403.pdf>
>> <http://www.cs.berkeley.edu/%7Ekamil/teaching/sp03/041403.pdf>see the last
>> section.
>>>>> * http://citeseer.ist.psu.edu/395003.html
>>>>>
>>>>> * http://en.wikipedia.org/wiki/Cycle_detection
>>>>>
>>>>>
>>>>>
>>>
>>> --
>>> blog: whydoeseverythingsuck.com
>>>
> 
> 
> 

Re: How to walk a webgraph?

Posted by hank williams <ha...@gmail.com>.
On Tue, Jul 15, 2008 at 10:20 AM, brainstorm <br...@gmail.com> wrote:

> On Tue, Jul 15, 2008 at 4:12 PM, hank williams <ha...@gmail.com> wrote:
> > Note that these algorithms will *not* resolve the fundamentally
> exponential
> > nature of the problem,
>
>
> O(|V|+|E|), IIRC
>

True. The problem is not actually exponential. It is the *dataset* that is
when you talk about the web graph. The problem is essentially linear with
the size of the dataset. If your dataset is constrained (by setting a limit)
then so is your execution.

>
>
> > efficiently as possible, but this still involves essentially visiting
> every
> > node in the tree at least once until you hit a cycle in the tree or
> decide
> > you have traveled too far from the start. Moreover, after quickly
> perusing
> > (in other words I could have missed it) I did not see specific mention of
> > setting a depth of search limit. it is critical that you set a limit or
> you
> > will end up attempting to spider the entire internet in search of cycles.
> > The examples shown in the references are (for good reason) constrained in
> > terms of node count, but on the real web the pool is (obviously) much
> larger
> > and the problem becomes fundamentally different.
> >
> > Hank
> >
> > On Tue, Jul 15, 2008 at 9:43 AM, Dennis Kubes <ku...@apache.org> wrote:
> >
> >> Excellent.  Thanks was exactly what I needed.  Thanks Andrzej.
> >>
> >> Dennis
> >>
> >>
> >> Andrzej Bialecki wrote:
> >>
> >>> Dennis Kubes wrote:
> >>>
> >>>> Does anybody know how to efficiently (non-exponentially) walk a web
> graph
> >>>> to detect cycles.  This would be very useful in identifying spammy
> webpage
> >>>> and tight knit communities.
> >>>>
> >>>> I have a program that I will be releasing soon that does the detection
> >>>> through converting a webgraph into a tree and walking the tree nodes,
> but it
> >>>> is exponential in terms of intermediate map reduce output and
> computation.
> >>>>
> >>>
> >>> Perhaps this could be helpful:
> >>>
> >>> * http://www.cs.berkeley.edu/~kamil/teaching/sp03/041403.pdf<http://www.cs.berkeley.edu/%7Ekamil/teaching/sp03/041403.pdf>
> <http://www.cs.berkeley.edu/%7Ekamil/teaching/sp03/041403.pdf>see the last
> section.
> >>>
> >>> * http://citeseer.ist.psu.edu/395003.html
> >>>
> >>> * http://en.wikipedia.org/wiki/Cycle_detection
> >>>
> >>>
> >>>
> >
> >
> > --
> > blog: whydoeseverythingsuck.com
> >
>



-- 
blog: whydoeseverythingsuck.com

Re: How to walk a webgraph?

Posted by brainstorm <br...@gmail.com>.
On Tue, Jul 15, 2008 at 4:12 PM, hank williams <ha...@gmail.com> wrote:
> Note that these algorithms will *not* resolve the fundamentally exponential
> nature of the problem,


O(|V|+|E|), IIRC


> efficiently as possible, but this still involves essentially visiting every
> node in the tree at least once until you hit a cycle in the tree or decide
> you have traveled too far from the start. Moreover, after quickly perusing
> (in other words I could have missed it) I did not see specific mention of
> setting a depth of search limit. it is critical that you set a limit or you
> will end up attempting to spider the entire internet in search of cycles.
> The examples shown in the references are (for good reason) constrained in
> terms of node count, but on the real web the pool is (obviously) much larger
> and the problem becomes fundamentally different.
>
> Hank
>
> On Tue, Jul 15, 2008 at 9:43 AM, Dennis Kubes <ku...@apache.org> wrote:
>
>> Excellent.  Thanks was exactly what I needed.  Thanks Andrzej.
>>
>> Dennis
>>
>>
>> Andrzej Bialecki wrote:
>>
>>> Dennis Kubes wrote:
>>>
>>>> Does anybody know how to efficiently (non-exponentially) walk a web graph
>>>> to detect cycles.  This would be very useful in identifying spammy webpage
>>>> and tight knit communities.
>>>>
>>>> I have a program that I will be releasing soon that does the detection
>>>> through converting a webgraph into a tree and walking the tree nodes, but it
>>>> is exponential in terms of intermediate map reduce output and computation.
>>>>
>>>
>>> Perhaps this could be helpful:
>>>
>>> * http://www.cs.berkeley.edu/~kamil/teaching/sp03/041403.pdf<http://www.cs.berkeley.edu/%7Ekamil/teaching/sp03/041403.pdf>see the last section.
>>>
>>> * http://citeseer.ist.psu.edu/395003.html
>>>
>>> * http://en.wikipedia.org/wiki/Cycle_detection
>>>
>>>
>>>
>
>
> --
> blog: whydoeseverythingsuck.com
>

Re: How to walk a webgraph?

Posted by hank williams <ha...@gmail.com>.
Note that these algorithms will *not* resolve the fundamentally exponential
nature of the problem, they simply provide algorithms for doing it as
efficiently as possible, but this still involves essentially visiting every
node in the tree at least once until you hit a cycle in the tree or decide
you have traveled too far from the start. Moreover, after quickly perusing
(in other words I could have missed it) I did not see specific mention of
setting a depth of search limit. it is critical that you set a limit or you
will end up attempting to spider the entire internet in search of cycles.
The examples shown in the references are (for good reason) constrained in
terms of node count, but on the real web the pool is (obviously) much larger
and the problem becomes fundamentally different.

Hank

On Tue, Jul 15, 2008 at 9:43 AM, Dennis Kubes <ku...@apache.org> wrote:

> Excellent.  Thanks was exactly what I needed.  Thanks Andrzej.
>
> Dennis
>
>
> Andrzej Bialecki wrote:
>
>> Dennis Kubes wrote:
>>
>>> Does anybody know how to efficiently (non-exponentially) walk a web graph
>>> to detect cycles.  This would be very useful in identifying spammy webpage
>>> and tight knit communities.
>>>
>>> I have a program that I will be releasing soon that does the detection
>>> through converting a webgraph into a tree and walking the tree nodes, but it
>>> is exponential in terms of intermediate map reduce output and computation.
>>>
>>
>> Perhaps this could be helpful:
>>
>> * http://www.cs.berkeley.edu/~kamil/teaching/sp03/041403.pdf<http://www.cs.berkeley.edu/%7Ekamil/teaching/sp03/041403.pdf>see the last section.
>>
>> * http://citeseer.ist.psu.edu/395003.html
>>
>> * http://en.wikipedia.org/wiki/Cycle_detection
>>
>>
>>


-- 
blog: whydoeseverythingsuck.com

Re: How to walk a webgraph?

Posted by Dennis Kubes <ku...@apache.org>.
Excellent.  Thanks was exactly what I needed.  Thanks Andrzej.

Dennis

Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> Does anybody know how to efficiently (non-exponentially) walk a web 
>> graph to detect cycles.  This would be very useful in identifying 
>> spammy webpage and tight knit communities.
>>
>> I have a program that I will be releasing soon that does the detection 
>> through converting a webgraph into a tree and walking the tree nodes, 
>> but it is exponential in terms of intermediate map reduce output and 
>> computation.
> 
> Perhaps this could be helpful:
> 
> * http://www.cs.berkeley.edu/~kamil/teaching/sp03/041403.pdf see the 
> last section.
> 
> * http://citeseer.ist.psu.edu/395003.html
> 
> * http://en.wikipedia.org/wiki/Cycle_detection
> 
> 

Re: How to walk a webgraph?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Dennis Kubes wrote:
> Does anybody know how to efficiently (non-exponentially) walk a web 
> graph to detect cycles.  This would be very useful in identifying spammy 
> webpage and tight knit communities.
> 
> I have a program that I will be releasing soon that does the detection 
> through converting a webgraph into a tree and walking the tree nodes, 
> but it is exponential in terms of intermediate map reduce output and 
> computation.

Perhaps this could be helpful:

* http://www.cs.berkeley.edu/~kamil/teaching/sp03/041403.pdf see the 
last section.

* http://citeseer.ist.psu.edu/395003.html

* http://en.wikipedia.org/wiki/Cycle_detection


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: How to walk a webgraph?

Posted by Dennis Kubes <ku...@apache.org>.
You're correct, it isn't a nutch or hadoop problem, just related :). I 
was hoping there was some mathematical trick out there which I wasn't 
aware of which might reduce the intermediate permutations.  Guess not. 
Thanks for the reply.

Dennis

hank williams wrote:
> AFAIK this is not a nutch or hadoop issue, but a basic math problem, which
> is inherent in walking trees. You have to figure out how many levels of the
> graph you are willing to walk, and live with the mathematical consequences.
> The further away you walk, the more untenable the math gets, but for a few
> levels it can be OK. But it is nothing but a brute force problem.
> 
> Hank
> 
> On Mon, Jul 14, 2008 at 11:57 AM, Dennis Kubes <ku...@apache.org> wrote:
> 
>> Does anybody know how to efficiently (non-exponentially) walk a web graph
>> to detect cycles.  This would be very useful in identifying spammy webpage
>> and tight knit communities.
>>
>> I have a program that I will be releasing soon that does the detection
>> through converting a webgraph into a tree and walking the tree nodes, but it
>> is exponential in terms of intermediate map reduce output and computation.
>>
>> Dennis
>>
> 
> 
> 

Re: How to walk a webgraph?

Posted by hank williams <ha...@gmail.com>.
AFAIK this is not a nutch or hadoop issue, but a basic math problem, which
is inherent in walking trees. You have to figure out how many levels of the
graph you are willing to walk, and live with the mathematical consequences.
The further away you walk, the more untenable the math gets, but for a few
levels it can be OK. But it is nothing but a brute force problem.

Hank

On Mon, Jul 14, 2008 at 11:57 AM, Dennis Kubes <ku...@apache.org> wrote:

> Does anybody know how to efficiently (non-exponentially) walk a web graph
> to detect cycles.  This would be very useful in identifying spammy webpage
> and tight knit communities.
>
> I have a program that I will be releasing soon that does the detection
> through converting a webgraph into a tree and walking the tree nodes, but it
> is exponential in terms of intermediate map reduce output and computation.
>
> Dennis
>



-- 
blog: whydoeseverythingsuck.com