Posted to dev@nutch.apache.org by Talat Uyarer <ta...@uyarer.com> on 2014/05/03 00:10:22 UTC

About RankingJob for Giraph

Hi all,

A long time ago, we talked with Julien and Lewis on the mailing list about
the major needs for 2.x.

I know that Giraph uses only map slots as workers. At present, our
scoring plugin architecture doesn't permit this. Giraph and OPIC have
different work types. IMHO we should create a pluggable RankingJob, similar
to IndexingJob, for Giraph and BSP-based systems. A pluggable architecture
would let us implement custom PageRank-style algorithms (HostRank,
UsageRank, etc.). Wdyt?

We use several Giraph algorithms similar to this solution at our company. If
this makes sense to everybody, I can implement it after 2.3 is released.

I look forward to your comments :)

Talat

Re: About RankingJob for Giraph

Posted by Talat Uyarer <ta...@uyarer.com>.
Hi Sebastian,
Thank you for reviewing my email. By "a pluggable RankingJob" I mean a job
that has pluggable ranking backends for graph-based algorithms. This
job would be similar to our present IndexingJob architecture. If we add a
RankingJob to our crawler workflow, we can create a dummy scoring
filter that reads each page's precomputed score when GeneratorJob
builds the fetch list. This job would let us implement custom
graph-based ranking algorithms (HostRank, TrustRank, LinkRank, etc.).
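
To make the idea concrete, here is a rough sketch of the shape such a job could take. It is written in Python only for brevity (Nutch and Giraph are Java), and every name in it (RankingBackend, PageRankBackend, RankingJob, generator_score, the dict-based webtable) is hypothetical rather than an existing Nutch or Giraph API. The PageRank backend is a plain in-memory power iteration standing in for a real Giraph computation.

```python
from typing import Dict, List, Protocol

Graph = Dict[str, List[str]]  # url -> list of outlink urls


class RankingBackend(Protocol):
    """Hypothetical backend contract: whole graph in, one score per URL out."""

    def rank(self, graph: Graph) -> Dict[str, float]: ...


class PageRankBackend:
    """Plain power-iteration PageRank, standing in for a Giraph computation."""

    def __init__(self, damping: float = 0.85, iterations: int = 20) -> None:
        self.damping = damping
        self.iterations = iterations

    def rank(self, graph: Graph) -> Dict[str, float]:
        nodes = sorted(set(graph) | {t for ts in graph.values() for t in ts})
        n = len(nodes)
        scores = {u: 1.0 / n for u in nodes}
        for _ in range(self.iterations):
            # Rank held by pages without outlinks is spread over all pages.
            dangling = sum(scores[u] for u in nodes if not graph.get(u))
            base = (1.0 - self.damping) / n + self.damping * dangling / n
            nxt = {u: base for u in nodes}
            for u, targets in graph.items():
                for t in targets:
                    nxt[t] += self.damping * scores[u] / len(targets)
            scores = nxt
        return scores


class RankingJob:
    """Hypothetical job: runs one configured backend and stores its scores."""

    def __init__(self, backend: RankingBackend) -> None:
        self.backend = backend

    def run(self, graph: Graph, webtable: Dict[str, Dict[str, float]]) -> None:
        for url, score in self.backend.rank(graph).items():
            webtable.setdefault(url, {})["score"] = score


def generator_score(webtable: Dict[str, Dict[str, float]], url: str,
                    default: float = 0.0) -> float:
    """The 'dummy' scoring filter: it only reads the precomputed score."""
    return webtable.get(url, {}).get("score", default)


# Usage: rank a tiny graph, then order URLs as a generator step might.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
webtable: Dict[str, Dict[str, float]] = {}
RankingJob(PageRankBackend()).run(graph, webtable)
fetch_order = sorted(webtable, key=lambda u: generator_score(webtable, u),
                     reverse=True)
```

Swapping in another algorithm would then only mean passing a different backend object to RankingJob, while the generator side keeps reading the same stored score.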

Talat

> The scoring plugin interface fits into the crawler work and data flow:
> links are fed into CrawlDb/Webtable, fetch lists are generated, etc.
> OPIC can be used because it's "online".  Other link rank algorithms
> define a complex work flow with additional steps, iterations, etc.
>
>> a pluggable RankingJob
>
> "Pluggable" and "job" are somewhat in contradiction.
>
> Plugins in Nutch never define jobs; most define a simple
> interface which can be called "functional" (stateless,
> return value depends only on the function arguments).
> In addition, most plugins can be used in combination
> (e.g., OPIC + custom plugin for focused crawling).
>
> Yes, it may be worth thinking about functions and data
> structures which could be shared between ranking algorithms.
> I'm skeptical whether there will be enough similarities and overlaps
> to make ranking pluggable.
>
> But, to avoid any misunderstanding: that's not against writing
> a "RankingJob for Giraph".
>
> Sebastian



-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304

Re: About RankingJob for Giraph

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Talat,

> At present, our scoring plugin architecture doesn't permit this.

The scoring plugin interface fits into the crawler work and data flow:
links are fed into CrawlDb/Webtable, fetch lists are generated, etc.
OPIC can be used because it's "online".  Other link rank algorithms
define a complex work flow with additional steps, iterations, etc.
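
The "online" property can be illustrated with a heavily simplified sketch of the OPIC idea. This is not Nutch's scoring-opic plugin; the function name and the dict-based cash and history stores are assumptions made for illustration. The point is that one update needs only the page that was just fetched and its outlinks, so it can run inside the normal crawl cycle, whereas an iterative algorithm needs repeated passes over the whole link graph.

```python
from typing import Dict, List


def opic_on_fetch(url: str, outlinks: List[str],
                  cash: Dict[str, float], history: Dict[str, float]) -> None:
    """Simplified OPIC step: record the page's cash, then pass it on."""
    amount = cash.get(url, 0.0)
    # The accumulated history is what serves as the importance estimate.
    history[url] = history.get(url, 0.0) + amount
    cash[url] = 0.0
    # Split the cash evenly over the outlinks. In this sketch a page
    # without outlinks simply drops its cash.
    for target in outlinks:
        cash[target] = cash.get(target, 0.0) + amount / len(outlinks)


# Usage: fetching "a" touches only "a" and its two outlinks.
cash = {"a": 1.0}
history: Dict[str, float] = {}
opic_on_fetch("a", ["b", "c"], cash, history)
```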

> a pluggable RankingJob

"Pluggable" and "job" are somewhat in contradiction.

Plugins in Nutch never define jobs; most define a simple
interface which can be called "functional" (stateless,
return value depends only on the function arguments).
In addition, most plugins can be used in combination
(e.g., OPIC + custom plugin for focused crawling).
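
A minimal sketch of what "functional" and "used in combination" mean here, under assumptions: the ScoringFn signature and both plugin functions are hypothetical and far simpler than Nutch's actual scoring filter interface. Each function's result depends only on its arguments, and combining plugins is just feeding one result into the next.

```python
from typing import Callable, Sequence

# Hypothetical plugin signature: (url, current_score) -> new_score.
ScoringFn = Callable[[str, float], float]


def link_score(url: str, score: float) -> float:
    """Stand-in for a link-based scorer such as OPIC: passes the score on."""
    return score


def focused_boost(url: str, score: float) -> float:
    """Stand-in for a focused-crawling plugin: doubles on-topic URLs."""
    return score * 2.0 if "nutch" in url else score


def apply_chain(plugins: Sequence[ScoringFn], url: str, initial: float) -> float:
    """Feed each plugin's output into the next; no state survives the call."""
    score = initial
    for plugin in plugins:
        score = plugin(url, score)
    return score


# Usage: the same chain gives the same answer for the same arguments.
chain = [link_score, focused_boost]
on_topic = apply_chain(chain, "http://nutch.apache.org/", 1.0)
off_topic = apply_chain(chain, "http://example.com/", 1.0)
```

A ranking algorithm that needs several passes over the whole link graph does not fit this per-call shape, which is the tension with "pluggable" described above.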

Yes, it may be worth thinking about functions and data
structures which could be shared between ranking algorithms.
I'm skeptical whether there will be enough similarities and overlaps
to make ranking pluggable.

But, to avoid any misunderstanding: that's not against writing
a "RankingJob for Giraph".

Sebastian

