You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hama.apache.org by Mark Kerzner <ma...@gmail.com> on 2010/06/07 22:00:01 UTC

Pregel article

Hi,

anybody has read it?

Thank you,
Mark

Re: Pregel article

Posted by "Edward J. Yoon" <ed...@apache.org>.

Wow, Thanks for sharing nice comment and blog post!! I'll look at it
soon and try to join to this discussion. :)

On Sat, Jul 3, 2010 at 1:08 AM, Claudio Martella
<cl...@tis.bz.it> wrote:
> I'll try to make things clear.
>
> The end of a superstep is obtained by the un-activation of all the
> vertices. In pregel the superstep is over when all the vertices call
> VoteToHalt() (in hama it's done by the sync() method). This happens at
> the end of each computation by each vertex. Each vertex is activated by
> the arrival of a message directed to that vertex. This means that each
> superstep computation is atomic and it should be considered in the
> design of the algorithm. That's a change of paradigm and that's what
> pregel author's call the "vertex's perspective programming".
>
> So no, there's no assumption about all the vertices to be active at the
> beginning of each superstep.
>
> About Felix's argument: yes, fewer supersteps mean less communication
> and synchronization overhead. At the same time, having longer supersteps
> will mean that it's more probable that certain vertices end their
> computation earlier than others, making them idle for a long time
> (waiting for the others to finish), and loosing computational time. So
> ideally it should be a good balance between long computational
> supersteps (decreasing communication overhead) and short computational
> supersteps (decreasing idle time).
> This is an intrinsical problem of BSP models because of the barrier. On
> the contrary DataFlow models don't have barriers and each computation is
> more independent, therefore more similar to the model you have in mind.
>
> Hope this helps.
>
> I attach the text from my blog post (roughly obtained with html2text) as
> requested.
>
> Cheers,
>
> Claudio
>
>
> zercal wrote:
>> The paper I found about pregel are not very detailed,
>> is it "http://portal.acm.org/citation.cfm?id=1582716.1582723"?
>> I guess, in this paper, vertices are assumed to be all actived at every superstep. simply random
>> access will reduce some communication cost but take more superstep.
>> However, is there any way of vertice selection method can be performed
>> that at every super step, each vertex knows whether to active according to
>> information it kept and received from other vertex?
>> But I can't found more detail from that paper...
>> Becides, I can not access your blogs. Would you please send me your article?
>>
>> Thank you very much!
>> from Xiong Chenyan
>>
>> ÔÚ2010-07-02 20:04:40£¬"Felix Halim" <fe...@gmail.com> Ð´µÀ£º
>>
>>> Exactly how to activate a particular vertex is not clear from the
>>> paper (is it random access?) and this feature is probably not as good
>>> as it sounds for complex graph algorithms. It might be better off to
>>> assume all vertices are active (to reduce the overhead of the flag
>>> needed and the space to make it randomly accessible, by storing it in
>>> blocks or whatever).
>>>
>>> Here is my argument:
>>>
>>> The way Pregel (and existing MR) works is iterative, where each
>>> iteration is separated by a super-step barrier where all messages have
>>> to arrive. Algorithms that have fewer super-steps are preferable than
>>> those that have large number of super-steps. In fact, we should
>>> measure Algorithms in terms of the number of super-steps required. To
>>> minimize the number of super-steps, likely we need to activate as much
>>> vertices as possible to do all the work in current super-step, rather
>>> than spill-over to the next super-step. In this case, the feature to
>>> "turn off" vertices is useless, since most of the time all vertices
>>> will be active to effectively reduce the number of super-steps.
>>>
>>> Unfortunately, I don't have experiments to backup my argument... I
>>> don't have Pregel...
>>>
>>> Felix Halim
>>>
>>>
>>> On Fri, Jul 2, 2010 at 7:37 PM, Claudio Martella
>>> <cl...@tis.bz.it> wrote:
>>>
>>>> I did too. See:
>>>>
>>>> http://blog.acaro.org/entry/pregel-is-out-but-what-is-pregel
>>>>
>>>>
>>>> Felix Halim wrote:
>>>>
>>>>> I have. See my comment in this blog:
>>>>>
>>>>> http://blog.udanax.org/2010/06/summary-of-google-pregel.html
>>>>>
>>>>> Felix Halim
>>>>>
>>>>>
>>>>> On Tue, Jun 8, 2010 at 4:00 AM, Mark Kerzner <ma...@gmail.com> wrote:
>>>>>
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> anybody has read it?
>>>>>>
>>>>>> Thank you,
>>>>>> Mark
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>> --
>>>> Claudio Martella
>>>> Digital Technologies
>>>> Unit Research & Development - Analyst
>>>>
>>>> TIS innovation park
>>>> Via Siemens 19 | Siemensstr. 19
>>>> 39100 Bolzano | 39100 Bozen
>>>> Tel. +39 0471 068 123
>>>> Fax  +39 0471 068 129
>>>> claudio.martella@tis.bz.it http://www.tis.bz.it
>>>>
>>>> Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to privacy@tis.bz.it in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it.
>>>>
>>>>
>>>>
>>>>
>
>
> --
> Claudio Martella
> Digital Technologies
> Unit Research & Development - Analyst
>
> TIS innovation park
> Via Siemens 19 | Siemensstr. 19
> 39100 Bolzano | 39100 Bozen
> Tel. +39 0471 068 123
> Fax  +39 0471 068 129
> claudio.martella@tis.bz.it http://www.tis.bz.it
>
> Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to privacy@tis.bz.it in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it.
>
>
>
>
>
> * ** ** ** ** ** * _ c c_ l l_ a a_ u u_ d d_ i i_ o o_  _ m m_ a a_ r r_ t t_ e e_ l l_ l l_ a a * ** ** ** ** ** *
> * ** ** ** ** * _ [ [_ r r_ s s_ s s_  _ f f_ e e_ e e_ d d_ ] ] _ a a_ r r_ c c_ h h_ i i_ v v_ e e _ a a_ b b_ o o_ u u_ t t * ** ** ** ** *
> * ** ** ** ** ** * _ G G_ o o_ o o_ g g_ l l_ e e_  _ P P_ r r_ e e_ g g_ e e_ l l_  _ i i_ s s_  _ o o_ u u_ t t_ . ._  _ B B_ u u_ t t_  _ w w_ h h_ a a_ t t_  _ i i_ s s_  _ P P_ r r_ e e_ g g_ e e_ l l_ ? ? * ** ** ** ** ** *
> June 21, 2010
> Google Pregel's _ p_ a_ p_ e_ r is finally out. But what is Pregel? It has been mentioned
> in many posts talking about NoSQL, GrabhDBs, Big Data, even the Facebook
> OpenGraph, so it looks like, apart of the hype fuzz, there's a little confusion
> about what it is, what it does, what it's good for and what it certainly is
> not. Let's start with the latter.
> Google Pregel is not a database (neither RDBMS nor NoSQL), no key-value store
> or any new means of storing data (big or small it might be). Putting it in the
> same lists with GraphDBs like _ N_ e_ o_ 4_ j, _ H_ y_ p_ e_ r_ G_ r_ a_ p_ h_ D_ B or even Twitter's _ F_ l_ o_ c_ k_ D_ B is
> somehow like putting MapReduce in the NoSQL group.
> GraphDBs are storage systems that use graph representations for data where each
> node represents an entity with unique ids, type and properties. An arc
> represents a relationship between two nodes and itself can have a type and
> properties. Think of a GraphDB as a RDBMS where instead of tables you have a
> graph. It's Semantic Web's triple stores brought to general purpose.
> Why would you use a GraphDB? Well, as you can describe your data in terms of
> entities and relationship, you're able to avoid defining a schema as we know
> it. It's a smaller step towards schema-less representation of data that actual
> NoSQLs provide.
> Informally, you can describe your data with ER diagrams without translating
> them into tables, keeping the representation dynamic and avoiding the costs of
> schema redefinition in RDBMs. Plus, it's very efficient and easy to write
> queries that allow you to get, for example, all the followers of a user, all
> the users he follows, the items related to him (tweets he wrote?) and maybe the
> users connected to his items (users retweeting or reply his tweets?) in one go.
> That's basically what Twitter wants to achieve with FlockDB, but with a general
> GraphDB you can describe your data and the relationships between your data as
> you wish.
> You store data in a GraphDB and you recall it in an easy and efficient way. So
> what's Pregel good for? What if if you want to mine the data in the graph (i.e.
> Google's Pagerank, Facebook's social network analysis, Twitter's retweeting/
> authority analysis)? Google reports that 80% of their distributed computation
> is based on MapReduce (Google Maps, Search Indexing, clustering in Google News,
> reports of Google Trends, Google Translate etc.) so we can only guess that the
> rest 20% is based on Pregel and the authors report they can work with graphs of
> the size of billions of vertices. Plus, implementing Pagerank is just about 15
> lines of code...
> That's what Pregel is for. So what is it? Pregel is a system for large-scale
> graph processing. It provides a fault-tolerant framework for the execution of
> graph algorithms in parallel over many machines. Think of it as MapReduce re-
> thought for graph operations.
> But what's wrong with MapReduce and graph algorithms? Nothing particularly,
> though it can lead to suboptimal performance because the graph state has to be
> passed from one phase to the other generating a lot of I/O, but in general we
> can say it has some usability issues as it doesn't provide a way to do any per-
> vertex calculation. In general, it's not easy to express graph algorithms in M/
> R. Pregel fills a gap as there are no frameworks for graph processing that
> address both distributability and fault-tolerance.
> Pregel's architecture is inspired by the Bulk Synchronous Parallel model
> introduced by Valiant. BSP is a computational model for the execution of
> parallel algorithms on top of multiple sequential Von Neumann machines. It
> gives an abstraction, just like M/R, that allows the programmer to think about
> the parallel expression of his solution without the hassle of communication and
> memory allocation in a distributed system. Before we get into details I think
> two things have to be underlined.
> First, again like M/R, although the model is used by Google to distribute
> computation among multiple computers that's not necessary, in principle BSP
> fits parallel programming on SMP or NUMA machines and mainframes.
> Second, although the model is used by Google to distribute graph processing,
> BSP can be used to distribute other kind of algorithms like matrix
> manipulation, just like M/R.
> Ok, how does BSP work? I'll take the diagram and snippet from the _ B_ S_ P_  _ p_ a_ g_ e_  _ f_ r_ o_ m
> _ W_ i_ k_ i_ p_ e_ d_ i_ a:
> ‚ÄúA BSP computer consists of processors connected by a communication network.
> Each processor has a fast local memory, and may follow different threads of
> computation.
> A BSP computation proceeds in a series of global supersteps. A superstep
> consists of three ordered stages:
>   1. Concurrent computation: Several computations take place on every
>      participating processor. Each process only uses values stored in the
>      local memory of the processor. The computations are independent in the
>      sense that they occur asynchronously of all the others.
>   2. Communication: At this stage, the processes exchange data between
>      themselves.
>   3. Barrier synchronisation: When a process reaches this point (the barrier),
>      it waits until all other processes have finished their communication
>      actions.
> The figure below shows this in a diagrammatic form. The processes are not
> regarded as having a particular linear order (from left to right or otherwise),
> and may be mapped to processors in any way.‚Äù
> [bsp architecture]
> Ok, basically at every superstep every processor executes the same algorithm on
> its data: its state and the incoming messages. At superstep t every processor
> will work on its state, which is the result of its computation at superstep t-
> 1, and the messages sent to him at superstep t-1. As a result of the superstep
> t computation the processor will send messages to other processors and these
> messages will be the incoming messages at superstep t+1. And the cycle goes on.
> The barrier synchronisation is the moment where t gets to be t+1.
> It is easy to see that each computation should take approximately the same
> amount of time, otherwise a long lasting computation will force the others to
> wait idle.
> How does Pregel implement BSP? Quoting Pregel's original paper: ‚ÄúThe input to
> a Pregel computation is a directed graph in which each vertex is uniquely
> identiÔ¨Åed by a string vertex identiÔ¨Åer. Each vertex is associated with a
> modiÔ¨Åable, user deÔ¨Åned value. The directed edges are associated with their
> source vertices, and each edge consists of a modiÔ¨Åable, user deÔ¨Åned value
> and a target vertex identiÔ¨Åer. A typical Pregel computation consists of
> input, when the graph is initialized, followed by a sequence of supersteps
> separated by global synchronization points until the algorithm terminates, and
> Ô¨Ånishing with output.
> Within each superstep the vertices compute in parallel, each executing the same
> user-deÔ¨Åned function that expresses the logic of a given algorithm. A vertex
> can modify its state or that of its outgoing edges, receive messages sent to it
> in the previous superstep, send messages to other vertices (to be received in
> the next superstep), or even mutate the topology of the graph. Edges are not
> Ô¨Årst-class citizens in this model, having no associated computation.
> Algorithm termination is based on every vertex voting to halt. In superstep 0,
> every vertex is in the active state; all active vertices participate in the
> computation of any given superstep. A vertex deactivates itself by voting to
> halt. This means that the vertex has no further work to do unless triggered
> externally, and the Pregel framework will not execute that vertex in subsequent
> supersteps unless it receives a message. If reactivated by a message, a vertex
> must explicitly deactivate itself again. The algorithm as a whole terminates
> when all vertices are simultaneously inactive and there are no messages in
> transit.‚Äù
> The mapping between BSP and Pregel is very simple: each local computation of
> BSP maps to the user-defined function in Pregel and the communication often,
> but not necessarily, corresponds to edge connectivity between nodes. The
> barrier is defined by the halt voting of all the active nodes.
> From the perspective of the API Pregel requires the implementation of the
> virtual Compute() method of the class Vertex. The class Vertex itself provides
> VoteToHalt(), SendMessageTo(), GetValue(), GetOutEdgeIterator() and const
> methods superstep() and vertex_id().
> Like M/R, it provides the possibility to define Combiners in order to reduce
> message passing overhead by combining messages together where semantically
> possible. Like Sawzall Pregel provides Aggregators which allow global
> communication by receiving messages from multiple vertices, combining them and
> sending the result back to the vertices. They are useful for statistics (think
> of an histogram of vertex degrees) or for global controlling (for example an
> aggregator can collect all the vertices' PageRank deltas to calculate the
> convergence condition).
> From the perspective of architecture, Pregel follows a master/worker
> architecture, like most of the other Google frameworks. The master is
> responsible of partitioning the graph with a hash function based on the vertex
> ID (like hash(ID) mod #partitions although a topology-aware partitioner might
> be able to minimize communication between workers by keeping messages intra-
> machine) but doesn't compute any partition. At the beginning of computation the
> workers subscribe to the computation to the master.
> Once the graph is partitioned and the partitions are assigned to workers, the
> master issues the start of the superstep. Each worker loops through all his
> active vertices, calling the Compute() method and delivering the messages
> collected in the previous superstep. The new messages are delivered before the
> end of the superstep, right before telling the master the list of active
> vertices for the next superstep.
> After the computation halts the master might ask the workers to dump their
> graph partition to disk.
> At the moment there are no projects that handle such computational power over
> graphs. _ P_ a_ r_ a_ l_ l_ e_ l_  _ B_ G_ L and _ C_ G_ M_ L_ i_ b can handle parallel processing on graphs but
> don't scale to this size and is not fault-tolerant. _ H_ a_ m_ a, an Apache incubated
> project, aims at developing a similar model to Pregel, but it's not complete
> yet.
> So, where do I go now?
> _ G_ o_ o_ g_ l_ e_ '_ s_  _ R_ e_ s_ e_ a_ r_ c_ h_  _ B_ l_ o_ g_  _ a_ n_ n_ o_ u_ n_ c_ e_ m_ e_ n_ t
> _ B_ S_ P_  _ W_ o_ r_ l_ d_ -_ w_ i_ d_ e
> _ N_ o_ S_ Q_ L_  _ G_ r_ a_ p_ h_ D_ B
>  Please enable JavaScript to view the _ c_ o_ m_ m_ e_ n_ t_ s_  _ p_ o_ w_ e_ r_ e_ d_  _ b_ y_  _ D_ i_ s_ q_ u_ s_ . _ b_ l_ o_ g_  _ c_ o_ m_ m_ e_ n_ t_ s
> _ p_ o_ w_ e_ r_ e_ d_  _ b_ y_  _ D_ i_ s_ q_ u_ s
>
>



-- 
Best Regards, Edward J. Yoon
edwardyoon@apache.org
http://blog.udanax.org

Re: Pregel article

Posted by Claudio Martella <cl...@tis.bz.it>.

I'll try to make things clear.

The end of a superstep is obtained by the un-activation of all the
vertices. In pregel the superstep is over when all the vertices call
VoteToHalt() (in hama it's done by the sync() method). This happens at
the end of each computation by each vertex. Each vertex is activated by
the arrival of a message directed to that vertex. This means that each
superstep computation is atomic and it should be considered in the
design of the algorithm. That's a change of paradigm and that's what
pregel author's call the "vertex's perspective programming".

So no, there's no assumption about all the vertices to be active at the
beginning of each superstep.

About Felix's argument: yes, fewer supersteps mean less communication
and synchronization overhead. At the same time, having longer supersteps
will mean that it's more probable that certain vertices end their
computation earlier than others, making them idle for a long time
(waiting for the others to finish), and loosing computational time. So
ideally it should be a good balance between long computational
supersteps (decreasing communication overhead) and short computational
supersteps (decreasing idle time).
This is an intrinsical problem of BSP models because of the barrier. On
the contrary DataFlow models don't have barriers and each computation is
more independent, therefore more similar to the model you have in mind.

Hope this helps.

I attach the text from my blog post (roughly obtained with html2text) as
requested.

Cheers,

Claudio

zercal wrote:
> The paper I found about pregel are not very detailed, 
> is it "http://portal.acm.org/citation.cfm?id=1582716.1582723"?
> I guess, in this paper, vertices are assumed to be all actived at every superstep. simply random
> access will reduce some communication cost but take more superstep.
> However, is there any way of vertice selection method can be performed
> that at every super step, each vertex knows whether to active according to 
> information it kept and received from other vertex? 
> But I can't found more detail from that paper...
> Becides, I can not access your blogs. Would you please send me your article?
>
> Thank you very much!
> from Xiong Chenyan
>
> ÔÚ2010-07-02 20:04:40£¬"Felix Halim" <fe...@gmail.com> Ð´µÀ£º
>   
>> Exactly how to activate a particular vertex is not clear from the
>> paper (is it random access?) and this feature is probably not as good
>> as it sounds for complex graph algorithms. It might be better off to
>> assume all vertices are active (to reduce the overhead of the flag
>> needed and the space to make it randomly accessible, by storing it in
>> blocks or whatever).
>>
>> Here is my argument:
>>
>> The way Pregel (and existing MR) works is iterative, where each
>> iteration is separated by a super-step barrier where all messages have
>> to arrive. Algorithms that have fewer super-steps are preferable than
>> those that have large number of super-steps. In fact, we should
>> measure Algorithms in terms of the number of super-steps required. To
>> minimize the number of super-steps, likely we need to activate as much
>> vertices as possible to do all the work in current super-step, rather
>> than spill-over to the next super-step. In this case, the feature to
>> "turn off" vertices is useless, since most of the time all vertices
>> will be active to effectively reduce the number of super-steps.
>>
>> Unfortunately, I don't have experiments to backup my argument... I
>> don't have Pregel...
>>
>> Felix Halim
>>
>>
>> On Fri, Jul 2, 2010 at 7:37 PM, Claudio Martella
>> <cl...@tis.bz.it> wrote:
>>     
>>> I did too. See:
>>>
>>> http://blog.acaro.org/entry/pregel-is-out-but-what-is-pregel
>>>
>>>
>>> Felix Halim wrote:
>>>       
>>>> I have. See my comment in this blog:
>>>>
>>>> http://blog.udanax.org/2010/06/summary-of-google-pregel.html
>>>>
>>>> Felix Halim
>>>>
>>>>
>>>> On Tue, Jun 8, 2010 at 4:00 AM, Mark Kerzner <ma...@gmail.com> wrote:
>>>>
>>>>         
>>>>> Hi,
>>>>>
>>>>> anybody has read it?
>>>>>
>>>>> Thank you,
>>>>> Mark
>>>>>
>>>>>
>>>>>           
>>>>         
>>> --
>>> Claudio Martella
>>> Digital Technologies
>>> Unit Research & Development - Analyst
>>>
>>> TIS innovation park
>>> Via Siemens 19 | Siemensstr. 19
>>> 39100 Bolzano | 39100 Bozen
>>> Tel. +39 0471 068 123
>>> Fax  +39 0471 068 129
>>> claudio.martella@tis.bz.it http://www.tis.bz.it
>>>
>>> Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to privacy@tis.bz.it in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it.
>>>
>>>
>>>
>>>       

-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.martella@tis.bz.it http://www.tis.bz.it

Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to privacy@tis.bz.it in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it.

Re:Re: Pregel article

Posted by zercal <ze...@126.com>.

The paper I found about pregel are not very detailed, 
is it "http://portal.acm.org/citation.cfm?id=1582716.1582723"?
I guess, in this paper, vertices are assumed to be all actived at every superstep. simply random
access will reduce some communication cost but take more superstep.
However, is there any way of vertice selection method can be performed
that at every super step, each vertex knows whether to active according to 
information it kept and received from other vertex? 
But I can't found more detail from that paper...
Becides, I can not access your blogs. Would you please send me your article?

Thank you very much!
from Xiong Chenyan

在2010-07-02 20:04:40，"Felix Halim" <fe...@gmail.com> 写道：
>Exactly how to activate a particular vertex is not clear from the
>paper (is it random access?) and this feature is probably not as good
>as it sounds for complex graph algorithms. It might be better off to
>assume all vertices are active (to reduce the overhead of the flag
>needed and the space to make it randomly accessible, by storing it in
>blocks or whatever).
>
>Here is my argument:
>
>The way Pregel (and existing MR) works is iterative, where each
>iteration is separated by a super-step barrier where all messages have
>to arrive. Algorithms that have fewer super-steps are preferable than
>those that have large number of super-steps. In fact, we should
>measure Algorithms in terms of the number of super-steps required. To
>minimize the number of super-steps, likely we need to activate as much
>vertices as possible to do all the work in current super-step, rather
>than spill-over to the next super-step. In this case, the feature to
>"turn off" vertices is useless, since most of the time all vertices
>will be active to effectively reduce the number of super-steps.
>
>Unfortunately, I don't have experiments to backup my argument... I
>don't have Pregel...
>
>Felix Halim
>
>
>On Fri, Jul 2, 2010 at 7:37 PM, Claudio Martella
><cl...@tis.bz.it> wrote:
>> I did too. See:
>>
>> http://blog.acaro.org/entry/pregel-is-out-but-what-is-pregel
>>
>>
>> Felix Halim wrote:
>>> I have. See my comment in this blog:
>>>
>>> http://blog.udanax.org/2010/06/summary-of-google-pregel.html
>>>
>>> Felix Halim
>>>
>>>
>>> On Tue, Jun 8, 2010 at 4:00 AM, Mark Kerzner <ma...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> anybody has read it?
>>>>
>>>> Thank you,
>>>> Mark
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Claudio Martella
>> Digital Technologies
>> Unit Research & Development - Analyst
>>
>> TIS innovation park
>> Via Siemens 19 | Siemensstr. 19
>> 39100 Bolzano | 39100 Bozen
>> Tel. +39 0471 068 123
>> Fax  +39 0471 068 129
>> claudio.martella@tis.bz.it http://www.tis.bz.it
>>
>> Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to privacy@tis.bz.it in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it.
>>
>>
>>

Re: Pregel article

Posted by Felix Halim <fe...@gmail.com>.

Exactly how to activate a particular vertex is not clear from the
paper (is it random access?) and this feature is probably not as good
as it sounds for complex graph algorithms. It might be better off to
assume all vertices are active (to reduce the overhead of the flag
needed and the space to make it randomly accessible, by storing it in
blocks or whatever).

Here is my argument:

The way Pregel (and existing MR) works is iterative, where each
iteration is separated by a super-step barrier where all messages have
to arrive. Algorithms that have fewer super-steps are preferable than
those that have large number of super-steps. In fact, we should
measure Algorithms in terms of the number of super-steps required. To
minimize the number of super-steps, likely we need to activate as much
vertices as possible to do all the work in current super-step, rather
than spill-over to the next super-step. In this case, the feature to
"turn off" vertices is useless, since most of the time all vertices
will be active to effectively reduce the number of super-steps.

Unfortunately, I don't have experiments to backup my argument... I
don't have Pregel...

Felix Halim

On Fri, Jul 2, 2010 at 7:37 PM, Claudio Martella
<cl...@tis.bz.it> wrote:
> I did too. See:
>
> http://blog.acaro.org/entry/pregel-is-out-but-what-is-pregel
>
>
> Felix Halim wrote:
>> I have. See my comment in this blog:
>>
>> http://blog.udanax.org/2010/06/summary-of-google-pregel.html
>>
>> Felix Halim
>>
>>
>> On Tue, Jun 8, 2010 at 4:00 AM, Mark Kerzner <ma...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> anybody has read it?
>>>
>>> Thank you,
>>> Mark
>>>
>>>
>>
>>
>
>
> --
> Claudio Martella
> Digital Technologies
> Unit Research & Development - Analyst
>
> TIS innovation park
> Via Siemens 19 | Siemensstr. 19
> 39100 Bolzano | 39100 Bozen
> Tel. +39 0471 068 123
> Fax  +39 0471 068 129
> claudio.martella@tis.bz.it http://www.tis.bz.it
>
> Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to privacy@tis.bz.it in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it.
>
>
>

Re: Pregel article

Posted by Claudio Martella <cl...@tis.bz.it>.

I did too. See:

http://blog.acaro.org/entry/pregel-is-out-but-what-is-pregel

Felix Halim wrote:
> I have. See my comment in this blog:
>
> http://blog.udanax.org/2010/06/summary-of-google-pregel.html
>
> Felix Halim
>
>
> On Tue, Jun 8, 2010 at 4:00 AM, Mark Kerzner <ma...@gmail.com> wrote:
>   
>> Hi,
>>
>> anybody has read it?
>>
>> Thank you,
>> Mark
>>
>>     
>
>   

-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.martella@tis.bz.it http://www.tis.bz.it

Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to privacy@tis.bz.it in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it.

Re: Pregel article

Posted by Felix Halim <fe...@gmail.com>.

On Mon, Jul 5, 2010 at 7:06 PM, Edward J. Yoon <ed...@apache.org> wrote:
> I agree with you that we should measure about the number of
> iterations. And, as you said, there is still I/O overhead involved in
> reading and writing materialized data every time, even if avoiding the
> situation shuffle and sort of reduce phase.

I'm particularly interested in how BSP handle the I/O overhead?
Suppose only several vertices are active among millions of vertices.
How does BSP activate those vertices?
Does the vertices directly accessible?
Usually the vertices are stored within a block of 64 MB.
Does BSP read all 64 MB just to activate one vertex?
Or BSP has some kind of indexing?

> IMO, BSP will communication only some vertices which can't be solved
> locally, and I'm sure that the number of iterations will be less or
> equal to (M/R based) Schimmy approach.

As far as I know, Schimmy approach doesn't reduce the number of iterations.
It only used to avoid shuffling the "master" graph.
So, the number of iterations for MR based and the number of supersteps
for BSP should be the same.
Here number of MR iterations (or rounds) is identical to BSP's "supersteps".

> More I hope we can compare them using Hama BSP soon.

I'm sure BSP version will be more efficient since BSP is like MR +
Schimmy built in.

Felix Halim

Re: Pregel article

Posted by "Edward J. Yoon" <ed...@apache.org>.

HI,

I'm reading that paper and seeing few design pattens.

I agree with you that we should measure about the number of
iterations. And, as you said, there is still I/O overhead involved in
reading and writing materialized data every time, even if avoiding the
situation shuffle and sort of reduce phase.

IMO, BSP will communication only some vertices which can't be solved
locally, and I'm sure that the number of iterations will be less or
equal to (M/R based) Schimmy approach.

More I hope we can compare them using Hama BSP soon.

On Fri, Jul 2, 2010 at 11:12 AM, Felix Halim <fe...@gmail.com> wrote:
> I have. See my comment in this blog:
>
> http://blog.udanax.org/2010/06/summary-of-google-pregel.html
>
> Felix Halim
>
>
> On Tue, Jun 8, 2010 at 4:00 AM, Mark Kerzner <ma...@gmail.com> wrote:
>> Hi,
>>
>> anybody has read it?
>>
>> Thank you,
>> Mark
>>
>

-- 
Best Regards, Edward J. Yoon
edwardyoon@apache.org
http://blog.udanax.org

Re: Pregel article

Posted by Felix Halim <fe...@gmail.com>.

I have. See my comment in this blog:

http://blog.udanax.org/2010/06/summary-of-google-pregel.html

Felix Halim


On Tue, Jun 8, 2010 at 4:00 AM, Mark Kerzner <ma...@gmail.com> wrote:
> Hi,
>
> anybody has read it?
>
> Thank you,
> Mark
>

Re: Pregel article

Posted by Guo Leitao <le...@gmail.com>.

I read a short version of this paper before (1 paper). I have downloaded the
full paper, and read it later today. :)

2010/6/8 Edward J. Yoon <ed...@apache.org>

> Now I'm roughly reading. I think, there is no room for doubt that It
> is a perfectly BSP-style programming model. See below interfaces :)
>
> Pregel:
>
> Compute(MessageIterator* msgs) {
>  SendMessageTo(); // Send data to neighbor
>  VoteToHalt(); // Superstep synchronization
> }
>
> Hama BSP:
>
> bps(BSPPeer bspPeer) {  // or bsp(MessageIterator<BSPMessage> msgs)
>  bspPeer.send(hostname, msg); // Send data to neighbor
>  bspPeer.sync(); // Superstep synchronization
>  bspPeer.getCurrentMessage();
> }
>
> As i mentioned before
> (
> http://mail-archives.apache.org/mod_mbox/incubator-hama-dev/201005.mbox/%3CAANLkTikqzyJmu7jC_PTstua3bC4834PuahFS3ADG13H0@mail.gmail.com%3E
> ),
> IMO, Pregel programming framework clone project is not a big deal.
>
> After reading it more, I'll blog about In/Output system and FT system
> design.
>
> Thanks.
>
> On Tue, Jun 8, 2010 at 11:51 AM, Mark Kerzner <ma...@gmail.com>
> wrote:
> > So Edward, is this going to help Hama a lot or a little?
> >
> > Mark
> >
> > On Mon, Jun 7, 2010 at 9:45 PM, Edward J. Yoon <edwardyoon@apache.org
> >wrote:
> >
> >> Oh, Cool! I'm going to blogging!
> >>
> >> On Tue, Jun 8, 2010 at 5:16 AM, Paolo Castagna
> >> <ca...@googlemail.com> wrote:
> >> > Mark Kerzner wrote:
> >> >>
> >> >> anybody has read it?
> >> >
> >> >  - Pregel: a system for large-scale graph processing
> >> >   http://doi.acm.org/10.1145/1807167.1807184
> >> >   http://portal.acm.org/citation.cfm?id=1807167.1807184
> >> >
> >> > :-)
> >> >
> >> > Paolo
> >> >
> >> >>
> >> >> Thank you,
> >> >> Mark
> >> >>
> >> >
> >>
> >>
> >>
> >> --
> >> Best Regards, Edward J. Yoon
> >> edwardyoon@apache.org
> >> http://blog.udanax.org
> >>
> >
>
>
>
> --
> Best Regards, Edward J. Yoon
> edwardyoon@apache.org
> http://blog.udanax.org
>

Re: Pregel article

Posted by "Edward J. Yoon" <ed...@apache.org>.

Now I'm roughly reading. I think, there is no room for doubt that It
is a perfectly BSP-style programming model. See below interfaces :)

Pregel:

Compute(MessageIterator* msgs) {
  SendMessageTo(); // Send data to neighbor
  VoteToHalt(); // Superstep synchronization
}

Hama BSP:

bps(BSPPeer bspPeer) {  // or bsp(MessageIterator<BSPMessage> msgs)
  bspPeer.send(hostname, msg); // Send data to neighbor
  bspPeer.sync(); // Superstep synchronization
  bspPeer.getCurrentMessage();
}

As i mentioned before
(http://mail-archives.apache.org/mod_mbox/incubator-hama-dev/201005.mbox/%3CAANLkTikqzyJmu7jC_PTstua3bC4834PuahFS3ADG13H0@mail.gmail.com%3E),
IMO, Pregel programming framework clone project is not a big deal.

After reading it more, I'll blog about In/Output system and FT system design.

Thanks.

On Tue, Jun 8, 2010 at 11:51 AM, Mark Kerzner <ma...@gmail.com> wrote:
> So Edward, is this going to help Hama a lot or a little?
>
> Mark
>
> On Mon, Jun 7, 2010 at 9:45 PM, Edward J. Yoon <ed...@apache.org>wrote:
>
>> Oh, Cool! I'm going to blogging!
>>
>> On Tue, Jun 8, 2010 at 5:16 AM, Paolo Castagna
>> <ca...@googlemail.com> wrote:
>> > Mark Kerzner wrote:
>> >>
>> >> anybody has read it?
>> >
>> >  - Pregel: a system for large-scale graph processing
>> >   http://doi.acm.org/10.1145/1807167.1807184
>> >   http://portal.acm.org/citation.cfm?id=1807167.1807184
>> >
>> > :-)
>> >
>> > Paolo
>> >
>> >>
>> >> Thank you,
>> >> Mark
>> >>
>> >
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> edwardyoon@apache.org
>> http://blog.udanax.org
>>
>



-- 
Best Regards, Edward J. Yoon
edwardyoon@apache.org
http://blog.udanax.org

Re: Pregel article

Posted by Mark Kerzner <ma...@gmail.com>.

So Edward, is this going to help Hama a lot or a little?

Mark

On Mon, Jun 7, 2010 at 9:45 PM, Edward J. Yoon <ed...@apache.org>wrote:

> Oh, Cool! I'm going to blogging!
>
> On Tue, Jun 8, 2010 at 5:16 AM, Paolo Castagna
> <ca...@googlemail.com> wrote:
> > Mark Kerzner wrote:
> >>
> >> anybody has read it?
> >
> >  - Pregel: a system for large-scale graph processing
> >   http://doi.acm.org/10.1145/1807167.1807184
> >   http://portal.acm.org/citation.cfm?id=1807167.1807184
> >
> > :-)
> >
> > Paolo
> >
> >>
> >> Thank you,
> >> Mark
> >>
> >
>
>
>
> --
> Best Regards, Edward J. Yoon
> edwardyoon@apache.org
> http://blog.udanax.org
>

Re: Pregel article

Posted by "Edward J. Yoon" <ed...@apache.org>.

Oh, Cool! I'm going to blogging!

On Tue, Jun 8, 2010 at 5:16 AM, Paolo Castagna
<ca...@googlemail.com> wrote:
> Mark Kerzner wrote:
>>
>> anybody has read it?
>
>  - Pregel: a system for large-scale graph processing
>   http://doi.acm.org/10.1145/1807167.1807184
>   http://portal.acm.org/citation.cfm?id=1807167.1807184
>
> :-)
>
> Paolo
>
>>
>> Thank you,
>> Mark
>>
>



-- 
Best Regards, Edward J. Yoon
edwardyoon@apache.org
http://blog.udanax.org

Re: Pregel article

Posted by Hyunsik Choi <hy...@gmail.com>.

I have had much interest in the pregel, and yesterday I downloaded
that paper. But, I might read the pregel paper next week because right
now I cannot have spare time.

- Hyunsik Choi

On Tue, Jun 8, 2010 at 5:16 AM, Paolo Castagna
<ca...@googlemail.com> wrote:
> Mark Kerzner wrote:
>>
>> anybody has read it?
>
>  - Pregel: a system for large-scale graph processing
>   http://doi.acm.org/10.1145/1807167.1807184
>   http://portal.acm.org/citation.cfm?id=1807167.1807184
>
> :-)
>
> Paolo
>
>>
>> Thank you,
>> Mark
>>
>

Re: Pregel article

Posted by Paolo Castagna <ca...@googlemail.com>.

Mark Kerzner wrote:
> anybody has read it?

  - Pregel: a system for large-scale graph processing
    http://doi.acm.org/10.1145/1807167.1807184
    http://portal.acm.org/citation.cfm?id=1807167.1807184

:-)

Paolo

> 
> Thank you,
> Mark
>

Pregel article

Posted by Mark Kerzner <ma...@gmail.com>.

Sorry for re-post, confused about hama-dev and hava-user groups

Hi,

anybody has read it?

Thank you,
Mark