Posted to user@pig.apache.org by Dmitriy Ryaboy <dv...@gmail.com> on 2013/04/01 18:20:23 UTC

Re: GSoC 2013

I'm somewhat familiar with WTF code (my day job is managing the analytics
infrastructure team at Twitter). WTF is implemented using Pig 0.11 (in fact
some of the Pig 11 features/improvements are directly due to this
project...), and mostly has to do with clever algorithms implemented in Pig
(an earlier version of WTF loaded the graph into main memory on large-mem
machines -- that system is open sourced, too, under
github.com/twitter/cassovary). Are you proposing to create an open-source
implementation of those algorithms? Do you suggest they should be Pig
scripts added to the Pig project, or do you want to create some new
operators? I'm not totally sure where you are going here.

GSoC proposals for Pig are usually made by students who want to work on
issues labeled as GSoC candidates on the apache jira. The students spend
some time to understand the problem stated in the jira, familiarize
themselves with the existing codebase, and put a basic technical
implementation plan and schedule into their proposal. Since in this case
you are proposing something we haven't scoped or defined well for
ourselves, we need you to be very clear and specific about what you are
trying to do, and how you plan to go about it. I think that Graph
processing in Pig (or other Hadoop-based systems) is a really interesting
topic and there is a lot of work to be done, but we really need you to be
far more detailed to be able to give you good guidance with regards to GSoC.

Best,
Dmitriy


On Sat, Mar 30, 2013 at 10:12 AM, burakkk <bu...@gmail.com> wrote:

> Sure. We can implement a graph model based on the "WTF: The Who to Follow
> Service at Twitter" article. The article says that in this way the graph
> can be stored in one machine's memory, so every node reads it from HDFS and
> caches it in memory. Every node is responsible for processing its own
> bucket of edges; I mean the work can be split. Every node can process its
> bucket using a random walk algorithm, for instance. Finally, the partial
> results can be reduced to get the final results. I hope it's clear :)
>
> Thanks
> Best Regards...
>
>
> On Fri, Mar 29, 2013 at 6:10 PM, Dmitriy Ryaboy <dv...@gmail.com>
> wrote:
>
> > Hi Burakk,
> > The general idea of making graph processing easier is a good one. I'm not
> > sure what exactly you are proposing to do, though. Could you be more
> > detailed about what you are thinking?
> >
> >
> > On Thu, Mar 28, 2013 at 1:28 PM, burakkk <bu...@gmail.com> wrote:
> >
> > > Hi,
> > > I might be a little bit late. I came up with a new idea at the last
> > > minute. Currently I'm working on social graph processing. I think we can
> > > implement a solution for Pig. With this idea I'm thinking of applying to
> > > GSoC 2013 so that I can do some tasks on it. Is there any mentor to do
> > > it with me? Are there any suggestions? :)
> > >
> > > Details:
> > > Of course I can improve some join operations. I'm not sure whether there
> > > is any implementation of fuzzy joins, for instance. These are the papers
> > > that I found:
> > >
> > > Fuzzy Joins Using MapReduce
> > > http://ilpubs.stanford.edu:8090/1006/
> > >
> > > Dimension independent similarity computation
> > > http://arxiv.org/abs/1206.2082
> > >
> > > MapReduce is Good Enough? If All You Have is a Hammer, Throw Away
> > > Everything That’s Not a Nail!
> > > http://arxiv.org/pdf/1209.2191.pdf
> > >
> > > Large Graph Processing in the Cloud
> > > http://www.ntu.edu.sg/home/bshe/sigmod10_demo.pdf
> > >
> > > ..etc
> > >
> > > Thanks
> > > Best regards..
> > >
> > >
> > > --
> > >
> > > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > > *
> > > *
> > >
> >
>
>
>
> --
>
> *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> *
> *
>
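
The bucketed scheme described in the quoted messages above (every worker caches the whole graph in memory, runs random walks for its own bucket of origin vertices, and a final reduce merges the partial results) can be sketched roughly as follows. This is only an illustration in Python, not code from the WTF paper, Cassovary, or Pig. All function names (load_graph, partition_vertices, walk_bucket, reduce_counts) are hypothetical, as are the choice of modulo bucketing, the walk length, and the use of visit counts as the score.

```python
import random
from collections import Counter, defaultdict


def load_graph(edges):
    """Build an in-memory adjacency map from (source, target) pairs."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)
    return dict(adjacency)


def partition_vertices(vertices, num_buckets):
    """Assign each origin vertex to a bucket by its id modulo num_buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for v in sorted(vertices):
        buckets[v % num_buckets].append(v)
    return buckets


def walk_bucket(adjacency, origins, walk_length, walks_per_origin, rng):
    """'Map' step: run random walks from each origin in one bucket, counting visits."""
    visits = Counter()
    for origin in origins:
        for _ in range(walks_per_origin):
            current = origin
            for _ in range(walk_length):
                neighbors = adjacency.get(current)
                if not neighbors:
                    break  # dead end: this vertex has no outgoing edges
                current = rng.choice(neighbors)
                visits[(origin, current)] += 1
    return visits


def reduce_counts(partials):
    """'Reduce' step: merge the per-bucket visit counters into one."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# The small edge list used as the example elsewhere in this thread.
edges = [(1, 2), (1, 3), (1, 10), (2, 3), (3, 4), (3, 5)]
adjacency = load_graph(edges)
buckets = partition_vertices(adjacency.keys(), num_buckets=2)
# Each bucket could run on a different machine; here they run one after another.
partials = [
    walk_bucket(adjacency, bucket, walk_length=3, walks_per_origin=100,
                rng=random.Random(i))
    for i, bucket in enumerate(buckets)
]
scores = reduce_counts(partials)
```

Because each bucket only reads the shared graph and writes its own counter, the buckets need no coordination until the final merge, which is the "nothing shared except reducers" property mentioned later in the thread.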

RE: GSoC 2013

Posted by Steve Bernstein <St...@deem.com>.
As a longtime follower of, and infrequent poster to, this list, I agree with this wisdom.

Much as I'm attracted to graph analysis, continuing focus on a rock solid foundation is a good call.

-----Original Message-----
From: Dmitriy Ryaboy [mailto:dvryaboy@gmail.com] 
Sent: Monday, April 08, 2013 11:58 AM
To: dev@pig.apache.org
Cc: user@pig.apache.org
Subject: Re: GSoC 2013

Hi,
I think this is an interesting project but is not core to "Pig" itself -- it may be more interesting / viable as a standalone project on github that uses Pig to implement graph algorithms.
At this point in its development, I feel that Pig needs to concentrate on doing the things it already does, and do them better (operator efficiency, storage efficiency, better MR plan generation, etc) rather than expand to specific verticals; we should allow our users to create their own solution suites that use Pig for specific purposes. A successful example of such a standalone project is PacketPig (https://github.com/packetloop/packetpig), a PCAP network capture analysis tool.

D


On Tue, Apr 2, 2013 at 9:48 AM, burakkk <bu...@gmail.com> wrote:

> I know that, but Giraph uses BSP. What I'm describing is a shared-nothing
> model, except for the reducers. Besides, I don't want to divide the
> iteration: one phase is still responsible for the whole iteration. Each
> distinct origin vertex will be processed in parallel.
>
> Thanks
> Best regards...
>
>
> On Tue, Apr 2, 2013 at 7:20 PM, Gianmarco De Francisci Morales
> <gdfm@gdfm.me> wrote:
>
> > FYI, Giraph has a Random Walk implementation.
> >
> > Pig does not support iteration natively, so any iterative algorithm is
> > not a very good fit for it. Just my 2c.
> >
> > Cheers,
> >
> > --
> > Gianmarco
> >
> >
> > On Tue, Apr 2, 2013 at 10:04 AM, burakkk <bu...@gmail.com> wrote:
> >
> > > So what do you suggest? Is it clear?
> > >
> > >
> > > On Mon, Apr 1, 2013 at 9:35 PM, burakkk <bu...@gmail.com> wrote:
> > >
> > > > I'm only using the WTF graph representation so that the graph fits in
> > > > memory. By the way, I haven't seen any explanation of WTF or graph
> > > > models on the Pig 0.11 release page.
> > > > I don't want to use Cassovary. I believe it can be done with Pig. I
> > > > will implement a graph representation in Pig based on the WTF paper,
> > > > and then I'll use it to implement a random walk algorithm. To do that
> > > > I may need to improve some features such as joins (fuzzy join), or
> > > > implement a new operator. I can implement it using either existing
> > > > operators or new operators. That's up to us and it doesn't really
> > > > matter. If there is already an implementation of a random walk
> > > > algorithm, please feel free to tell me, because I haven't found one.
> > > >
> > > > Are you proposing to create an open-source implementation of those
> > > > algorithms?
> > > > Yes, I'm proposing to implement a random walk algorithm and a new data
> > > > model that represents a graph. After that, people can use it when
> > > > writing Pig code.
> > > >
> > > > Do you suggest they should be Pig scripts added to the Pig project, or
> > > > do you want to create some new operators?
> > > > Maybe; it can be a UDF or a new operator.
> > > >
> > > > I made a quick example. It may not be completely accurate; I've just
> > > > tried to explain the idea.
> > > > Suppose you have a graph file like this:
> > > > user_id follower
> > > > 1 2
> > > > 1 3
> > > > 1 10
> > > > 2 3
> > > > 3 4
> > > > 3 5
> > > > ...
> > > >
> > > > The vertex list is an array of sorted vertex ids.
> > > > The node list is a matrix of vertex ids and their starting positions.
> > > >
> > > >
> > > > graph = load 'graph' using PigStorage() as (vertex:int, follower:int);
> > > > --load the graph file
> > > > vertex = COGROUP graph BY (vertex);
> > > > list = FOREACH vertex GENERATE org.apache.pig.generateVertex(vertex) as
> > > > vertexList; --load the whole vertexes from HDFS into the memory
> > > > list = FOREACH graph GENERATE org.apache.pig.generateNode(list) as
> > > > nodeList; --load the whole vertexes from HDFS into the memory
> > > > randomWalk = FOREACH vertex GENERATE
> > > > flatten(org.apache.pig.RandomWalk(list, endVertex)) as score;
> > > > --generate a score; using the node list you can traverse the graph to
> > > > --your finishing position
> > > > store...
> > > >
> > > >
> > > > Thanks
> > > > Best Regards...
> > > >
> > > >
> > > > On Mon, Apr 1, 2013 at 7:20 PM, Dmitriy Ryaboy 
> > > > <dv...@gmail.com>
> > > wrote:
> > > >
> > > >> I'm somewhat familiar with WTF code (my day job is managing the
> > > analytics
> > > >> infrastructure team at Twitter). WTF is implemented using Pig 
> > > >> 0.11
> (in
> > > >> fact
> > > >> some of the Pig 11 features/improvements are directly due to 
> > > >> this project...), and mostly has to do with clever algorithms 
> > > >> implemented
> > in
> > > >> Pig
> > > >> (an earlier version of WTF loaded the graph into main memory on
> > > large-mem
> > > >> machines -- that system is open sourced, too, under 
> > > >> github.com/twitter/cassovary). Are you proposing to create an
> > > open-source
> > > >> implementation of those algorithms? Do you suggest they should 
> > > >> be
> Pig
> > > >> scripts added to the Pig project, or do you want to create some 
> > > >> new operators? I'm not totally sure where you are going here.
> > > >>
> > > >> GSoC proposals for Pig are usually made by students who want to 
> > > >> work
> > on
> > > >> issues labeled as GSoC candidates on the apache jira. The 
> > > >> students
> > spend
> > > >> some time to understand the problem stated in the jira, 
> > > >> familiarize themselves with the existing codebase, and put a 
> > > >> basic technical implementation plan and schedule into their 
> > > >> proposal. Since in this
> > case
> > > >> you are proposing something we haven't scoped or defined well 
> > > >> for ourselves, we need you to be very clear and specific about 
> > > >> what you
> > are
> > > >> trying to do, and how you plan to go about it. I think that 
> > > >> Graph processing in Pig (or other Hadoop-based systems) is a 
> > > >> really
> > > interesting
> > > >> topic and there is a lot of work to be done, but we really need 
> > > >> you
> to
> > > be
> > > >> far more detailed to be able to give you good guidance with 
> > > >> regards
> to
> > > >> GSoC.
> > > >>
> > > >> Best,
> > > >> Dmitriy
> > > >>
> > > >>
> > > >> On Sat, Mar 30, 2013 at 10:12 AM, burakkk 
> > > >> <bu...@gmail.com>
> > > wrote:
> > > >>
> > > >> > Sure. We can implement a graph model using  "WTF: The Who to
> Follow
> > > >> Service
> > > >> > at Twitter article we can" article.This article's said that 
> > > >> > in
> this
> > > way
> > > >> > graph can be stored one machine's memory so that every node 
> > > >> > will
> > read
> > > >> from
> > > >> > HDFS and cache the graph to the memory. Every node is 
> > > >> > responsible
> > from
> > > >> its
> > > >> > bucket edge to process. I mean it can be splitted. Every node 
> > > >> > can
> be
> > > >> > processed its bucket using random walk algorithm for instance.
> > Finally
> > > >> it
> > > >> > can be reduced to get to the final results. I hope it's clear 
> > > >> > :)
> > > >> >
> > > >> > Thanks
> > > >> > Best Regards...
> > > >> >
> > > >> >
> > > >> > On Fri, Mar 29, 2013 at 6:10 PM, Dmitriy Ryaboy <
> dvryaboy@gmail.com
> > >
> > > >> > wrote:
> > > >> >
> > > >> > > Hi Burakk,
> > > >> > > The general idea of making graph processing easier is a 
> > > >> > > good
> one.
> > > I'm
> > > >> not
> > > >> > > sure what exactly you are proposing to do, though. Could 
> > > >> > > you be
> > more
> > > >> > > detailed about what you are thinking?
> > > >> > >
> > > >> > >
> > > >> > > On Thu, Mar 28, 2013 at 1:28 PM, burakkk <
> burak.isikli@gmail.com>
> > > >> wrote:
> > > >> > >
> > > >> > > > Hi,
> > > >> > > > I might be a little bit late. I come up with a new idea 
> > > >> > > > for
> the
> > > last
> > > >> > > > minute. Currently I'm working on social graph processing. 
> > > >> > > > I
> > think
> > > we
> > > >> > can
> > > >> > > > implement a solution for pig.  With this idea I'm 
> > > >> > > > thinking to
> > > apply
> > > >> the
> > > >> > > > GSOC 2013 so that I can do some tasks about it. Is there 
> > > >> > > > any
> > > mentor
> > > >> to
> > > >> > do
> > > >> > > > it with me?  Is there any suggestion? :)
> > > >> > > >
> > > >> > > > Details:
> > > >> > > > Of course I can improve some join operations. I'm not 
> > > >> > > > sure is
> > > there
> > > >> any
> > > >> > > > implementation about fuzzy joins for instance. These are 
> > > >> > > > the
> > > papers
> > > >> > that
> > > >> > > I
> > > >> > > > found
> > > >> > > >
> > > >> > > > Fuzzy Joins Using MapReduce 
> > > >> > > > http://ilpubs.stanford.edu:8090/1006/
> > > >> > > >
> > > >> > > > Dimension independent similarity computation
> > > >> > > > http://arxiv.org/abs/1206.2082
> > > >> > > >
> > > >> > > > MapReduce is Good Enough? If All You Have is a Hammer, 
> > > >> > > > Throw
> > Away
> > > >> > > > Everything That's Not a Nail!
> > > >> > > > http://arxiv.org/pdf/1209.2191.pdf
> > > >> > > >
> > > >> > > > Large Graph Processing in the Cloud 
> > > >> > > > http://www.ntu.edu.sg/home/bshe/sigmod10_demo.pdf
> > > >> > > >
> > > >> > > > ..etc
> > > >> > > >
> > > >> > > > Thanks
> > > >> > > > Best regards..
> > > >> > > >
> > > >> > > >
> > > >> > > > --
> > > >> > > >
> > > >> > > > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > > >> > > > *
> > > >> > > > *
> > > >> > > >
> > > >> > >
> > > >> >
> > > >> >
> > > >> >
> > > >> > --
> > > >> >
> > > >> > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > > >> > *
> > > >> > *
> > > >> >
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > > > *
> > > > *
> > > >
> > >
> > >
> > >
> > > --
> > >
> > > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > > *
> > > *
> > >
> >
>
>
>
> --
>
> *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> *
> *
>
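
The "vertex list" and "node list" in the quoted example resemble a compressed adjacency layout: a sorted array of vertex ids plus, for each vertex, the starting position of its edges in one flat array. The following is a minimal Python sketch of that idea, assuming the first column of the edge file is the source vertex and the second its neighbor. The names build_csr and neighbors are hypothetical and are not part of Pig, Cassovary, or the WTF paper.

```python
from bisect import bisect_left


def build_csr(edges):
    """Return (vertex_ids, offsets, targets) for a list of (source, target) edges.

    vertex_ids: sorted ids of vertices that have outgoing edges
    offsets:    offsets[i] is where the edges of vertex_ids[i] start in targets;
                a final sentinel entry marks the end of the last vertex's edges
    targets:    all edge targets, grouped by source vertex
    """
    ordered = sorted(edges)
    vertex_ids = sorted({src for src, _ in ordered})
    offsets = []
    targets = []
    position = 0
    for vid in vertex_ids:
        offsets.append(len(targets))
        while position < len(ordered) and ordered[position][0] == vid:
            targets.append(ordered[position][1])
            position += 1
    offsets.append(len(targets))  # sentinel
    return vertex_ids, offsets, targets


def neighbors(vertex_ids, offsets, targets, vid):
    """Look up the out-neighbors of vid with a binary search on vertex_ids."""
    i = bisect_left(vertex_ids, vid)
    if i == len(vertex_ids) or vertex_ids[i] != vid:
        return []  # vid has no outgoing edges
    return targets[offsets[i]:offsets[i + 1]]


# The example edge list from the quoted message.
edges = [(1, 2), (1, 3), (1, 10), (2, 3), (3, 4), (3, 5)]
vertex_ids, offsets, targets = build_csr(edges)
```

For the six example edges this gives vertex_ids [1, 2, 3], offsets [0, 3, 4, 6] and targets [2, 3, 10, 3, 4, 5]. A flat layout like this avoids per-vertex objects, which is one way a large graph can be made to fit in a single machine's memory.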

Re: GSoC 2013

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.
+1 to what Dmitriy says.

Cheers,

--
Gianmarco


On Mon, Apr 8, 2013 at 8:57 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Hi,
> I think this is an interesting project but is not core to "Pig" itself --
> it may be more interesting / viable as a standalone project on github that
> uses Pig to implement graph algorithms.
> At this point in its development, I feel that Pig needs to concentrate on
> doing the things it already does, and do them better (operator efficiency,
> storage efficiency, better MR plan generation, etc) rather than expand to
> specific verticals; we should allow our users to create their own solution
> suites that use Pig for specific purposes. A successful example of such a
> standalone project is PacketPig (https://github.com/packetloop/packetpig),
> a PCAP network capture analysis tool.
>
> D
>
>
> On Tue, Apr 2, 2013 at 9:48 AM, burakkk <bu...@gmail.com> wrote:
>
> > I know that, but Giraph uses BSP. What I'm describing is a shared-nothing
> > model, except for the reducers. Besides, I don't want to divide the
> > iteration: one phase is still responsible for the whole iteration. Each
> > distinct origin vertex will be processed in parallel.
> >
> > Thanks
> > Best regards...
> >
> >
> > On Tue, Apr 2, 2013 at 7:20 PM, Gianmarco De Francisci Morales
> > <gdfm@gdfm.me> wrote:
> >
> > > FYI, Giraph has a Random Walk implementation.
> > >
> > > Pig does not support iteration natively, so any iterative algorithm is
> > > not a very good fit for it. Just my 2c.
> > >
> > > Cheers,
> > >
> > > --
> > > Gianmarco
> > >
> > >
> > > > On Tue, Apr 2, 2013 at 10:04 AM, burakkk <bu...@gmail.com> wrote:
> > > >
> > > > > So what do you suggest? Is it clear?
> > > > >
> > > > >
> > > > > On Mon, Apr 1, 2013 at 9:35 PM, burakkk <bu...@gmail.com> wrote:
> > > > >
> > > > > > I'm only using the WTF graph representation so that the graph fits in
> > > > > > memory. By the way, I haven't seen any explanation of WTF or graph
> > > > > > models on the Pig 0.11 release page.
> > > > > > I don't want to use Cassovary. I believe it can be done with Pig. I
> > > > > > will implement a graph representation in Pig based on the WTF paper,
> > > > > > and then I'll use it to implement a random walk algorithm. To do that
> > > > > > I may need to improve some features such as joins (fuzzy join), or
> > > > > > implement a new operator. I can implement it using either existing
> > > > > > operators or new operators. That's up to us and it doesn't really
> > > > > > matter. If there is already an implementation of a random walk
> > > > > > algorithm, please feel free to tell me, because I haven't found one.
> > > > > >
> > > > > > Are you proposing to create an open-source implementation of those
> > > > > > algorithms?
> > > > > > Yes, I'm proposing to implement a random walk algorithm and a new data
> > > > > > model that represents a graph. After that, people can use it when
> > > > > > writing Pig code.
> > > > > >
> > > > > > Do you suggest they should be Pig scripts added to the Pig project, or
> > > > > > do you want to create some new operators?
> > > > > > Maybe; it can be a UDF or a new operator.
> > > > > >
> > > > > > I made a quick example. It may not be completely accurate; I've just
> > > > > > tried to explain the idea.
> > > > > > Suppose you have a graph file like this:
> > > > > > user_id follower
> > > > > > 1 2
> > > > > > 1 3
> > > > > > 1 10
> > > > > > 2 3
> > > > > > 3 4
> > > > > > 3 5
> > > > > > ...
> > > > > >
> > > > > > The vertex list is an array of sorted vertex ids.
> > > > > > The node list is a matrix of vertex ids and their starting positions.
> > > > > >
> > > > > >
> > > > > > graph = load 'graph' using PigStorage() as (vertex:int, follower:int);
> > > > > > --load the graph file
> > > > > > vertex = COGROUP graph BY (vertex);
> > > > > > list = FOREACH vertex GENERATE org.apache.pig.generateVertex(vertex) as
> > > > > > vertexList; --load the whole vertexes from HDFS into the memory
> > > > > > list = FOREACH graph GENERATE org.apache.pig.generateNode(list) as
> > > > > > nodeList; --load the whole vertexes from HDFS into the memory
> > > > > > randomWalk = FOREACH vertex GENERATE
> > > > > > flatten(org.apache.pig.RandomWalk(list, endVertex)) as score;
> > > > > > --generate a score; using the node list you can traverse the graph to
> > > > > > --your finishing position
> > > > > > store...
> > > > >
> > > > >
> > > > > Thanks
> > > > > Best Regards...
> > > > >
> > > > >
> > > > > On Mon, Apr 1, 2013 at 7:20 PM, Dmitriy Ryaboy <dvryaboy@gmail.com
> >
> > > > wrote:
> > > > >
> > > > >> I'm somewhat familiar with WTF code (my day job is managing the
> > > > analytics
> > > > >> infrastructure team at Twitter). WTF is implemented using Pig 0.11
> > (in
> > > > >> fact
> > > > >> some of the Pig 11 features/improvements are directly due to this
> > > > >> project...), and mostly has to do with clever algorithms
> implemented
> > > in
> > > > >> Pig
> > > > >> (an earlier version of WTF loaded the graph into main memory on
> > > > large-mem
> > > > >> machines -- that system is open sourced, too, under
> > > > >> github.com/twitter/cassovary). Are you proposing to create an
> > > > open-source
> > > > >> implementation of those algorithms? Do you suggest they should be
> > Pig
> > > > >> scripts added to the Pig project, or do you want to create some
> new
> > > > >> operators? I'm not totally sure where you are going here.
> > > > >>
> > > > >> GSoC proposals for Pig are usually made by students who want to
> work
> > > on
> > > > >> issues labeled as GSoC candidates on the apache jira. The students
> > > spend
> > > > >> some time to understand the problem stated in the jira,
> familiarize
> > > > >> themselves with the existing codebase, and put a basic technical
> > > > >> implementation plan and schedule into their proposal. Since in
> this
> > > case
> > > > >> you are proposing something we haven't scoped or defined well for
> > > > >> ourselves, we need you to be very clear and specific about what
> you
> > > are
> > > > >> trying to do, and how you plan to go about it. I think that Graph
> > > > >> processing in Pig (or other Hadoop-based systems) is a really
> > > > interesting
> > > > >> topic and there is a lot of work to be done, but we really need
> you
> > to
> > > > be
> > > > >> far more detailed to be able to give you good guidance with
> regards
> > to
> > > > >> GSoC.
> > > > >>
> > > > >> Best,
> > > > >> Dmitriy
> > > > >>
> > > > >>
> > > > >> On Sat, Mar 30, 2013 at 10:12 AM, burakkk <burak.isikli@gmail.com
> >
> > > > wrote:
> > > > >>
> > > > >> > Sure. We can implement a graph model using  "WTF: The Who to
> > Follow
> > > > >> Service
> > > > >> > at Twitter article we can" article.This article's said that in
> > this
> > > > way
> > > > >> > graph can be stored one machine's memory so that every node will
> > > read
> > > > >> from
> > > > >> > HDFS and cache the graph to the memory. Every node is
> responsible
> > > from
> > > > >> its
> > > > >> > bucket edge to process. I mean it can be splitted. Every node
> can
> > be
> > > > >> > processed its bucket using random walk algorithm for instance.
> > > Finally
> > > > >> it
> > > > >> > can be reduced to get to the final results. I hope it's clear :)
> > > > >> >
> > > > >> > Thanks
> > > > >> > Best Regards...
> > > > >> >
> > > > >> >
> > > > >> > On Fri, Mar 29, 2013 at 6:10 PM, Dmitriy Ryaboy <
> > dvryaboy@gmail.com
> > > >
> > > > >> > wrote:
> > > > >> >
> > > > >> > > Hi Burakk,
> > > > >> > > The general idea of making graph processing easier is a good
> > one.
> > > > I'm
> > > > >> not
> > > > >> > > sure what exactly you are proposing to do, though. Could you
> be
> > > more
> > > > >> > > detailed about what you are thinking?
> > > > >> > >
> > > > >> > >
> > > > >> > > On Thu, Mar 28, 2013 at 1:28 PM, burakkk <
> > burak.isikli@gmail.com>
> > > > >> wrote:
> > > > >> > >
> > > > >> > > > Hi,
> > > > >> > > > I might be a little bit late. I come up with a new idea for
> > the
> > > > last
> > > > >> > > > minute. Currently I'm working on social graph processing. I
> > > think
> > > > we
> > > > >> > can
> > > > >> > > > implement a solution for pig.  With this idea I'm thinking
> to
> > > > apply
> > > > >> the
> > > > >> > > > GSOC 2013 so that I can do some tasks about it. Is there any
> > > > mentor
> > > > >> to
> > > > >> > do
> > > > >> > > > it with me?  Is there any suggestion? :)
> > > > >> > > >
> > > > >> > > > Details:
> > > > >> > > > Of course I can improve some join operations. I'm not sure
> is
> > > > there
> > > > >> any
> > > > >> > > > implementation about fuzzy joins for instance. These are the
> > > > papers
> > > > >> > that
> > > > >> > > I
> > > > >> > > > found
> > > > >> > > >
> > > > >> > > > Fuzzy Joins Using MapReduce
> > > > >> > > > http://ilpubs.stanford.edu:8090/1006/
> > > > >> > > >
> > > > >> > > > Dimension independent similarity computation
> > > > >> > > > http://arxiv.org/abs/1206.2082
> > > > >> > > >
> > > > >> > > > MapReduce is Good Enough? If All You Have is a Hammer, Throw
> > > Away
> > > > >> > > > Everything That’s Not a Nail!
> > > > >> > > > http://arxiv.org/pdf/1209.2191.pdf
> > > > >> > > >
> > > > >> > > > Large Graph Processing in the Cloud
> > > > >> > > > http://www.ntu.edu.sg/home/bshe/sigmod10_demo.pdf
> > > > >> > > >
> > > > >> > > > ..etc
> > > > >> > > >
> > > > >> > > > Thanks
> > > > >> > > > Best regards..
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > --
> > > > >> > > >
> > > > >> > > > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > > > >> > > > *
> > > > >> > > > *
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> >
> > > > >> > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > > > >> > *
> > > > >> > *
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > > > > *
> > > > > *
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > > > *
> > > > *
> > > >
> > >
> >
> >
> >
> > --
> >
> > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > *
> > *
> >
>

> > > >> > > > Best regards..
> > > >> > > >
> > > >> > > >
> > > >> > > > --
> > > >> > > >
> > > >> > > > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > > >> > > > *
> > > >> > > > *
> > > >> > > >
> > > >> > >
> > > >> >
> > > >> >
> > > >> >
> > > >> > --
> > > >> >
> > > >> > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > > >> > *
> > > >> > *
> > > >> >
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > > > *
> > > > *
> > > >
> > >
> > >
> > >
> > > --
> > >
> > > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > > *
> > > *
> > >
> >
>
>
>
> --
>
> *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> *
> *
>

Re: GSoC 2013

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.
+1 to what Dmitriy says.

Cheers,

--
Gianmarco


On Mon, Apr 8, 2013 at 8:57 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Hi,
> I think this is an interesting project but is not core to "Pig" itself --
> it may be more interesting / viable as a standalone project on github that
> uses Pig to implement graph algorithms.
> At this point in its development, I feel that Pig needs to concentrate on
> doing the things it already does, and do them better (operator efficiency,
> storage efficiency, better MR plan generation, etc) rather than expand to
> specific verticals; we should allow our users to create their own solution
> suites that use Pig for specific purposes. A successful example of such a
> standalone project is PacketPig (https://github.com/packetloop/packetpig),
> a PCAP network capture analysis tool.
>
> D

Re: GSoC 2013

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Hi,
I think this is an interesting project but is not core to "Pig" itself --
it may be more interesting / viable as a standalone project on github that
uses Pig to implement graph algorithms.
At this point in its development, I feel that Pig needs to concentrate on
doing the things it already does, and do them better (operator efficiency,
storage efficiency, better MR plan generation, etc) rather than expand to
specific verticals; we should allow our users to create their own solution
suites that use Pig for specific purposes. A successful example of such a
standalone project is PacketPig (https://github.com/packetloop/packetpig),
a PCAP network capture analysis tool.

D


On Tue, Apr 2, 2013 at 9:48 AM, burakkk <bu...@gmail.com> wrote:

> I know that but giraph tries to use bsp. What I'm saying is nothing shared
> model except reducers. Besides I don't want to divide iteration. One phase
> is still responsible for whole iteration. Every different origin vertex
> will be processed in parallel.
>
> Thanks
> Best regards...

Re: GSoC 2013

Posted by burakkk <bu...@gmail.com>.
I know that, but Giraph uses BSP. What I'm proposing is a shared-nothing
model, except for the reducers. Besides, I don't want to split up the
iteration: one phase is still responsible for the whole iteration. Each
origin vertex will be processed in parallel.
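
To make the idea above concrete, here is a minimal sketch in plain Python (not Pig), under stated assumptions: the first column of the sample graph file quoted below is treated as the edge source and the second as the target, the graph is packed into a sorted vertex list plus starting offsets as described in that example, and the names build_csr, random_walk and walk_all are hypothetical. Each origin vertex is walked on its own with no shared state, and the per-origin visit counts are only gathered at the end.

```python
import random
from collections import Counter


def build_csr(edges):
    """Pack (source, target) pairs into a sorted vertex list plus offsets."""
    vertices = sorted({v for edge in edges for v in edge})
    index = {v: i for i, v in enumerate(vertices)}
    by_source = sorted(edges)
    # offsets[i]..offsets[i + 1] delimits the targets of vertices[i].
    offsets = [0] * (len(vertices) + 1)
    for src, _ in by_source:
        offsets[index[src] + 1] += 1
    for i in range(len(vertices)):
        offsets[i + 1] += offsets[i]
    targets = [dst for _, dst in by_source]
    return vertices, index, offsets, targets


def random_walk(origin, graph, steps, rng):
    """Walk from one origin; restart at the origin on a dead end."""
    _, index, offsets, targets = graph
    visits = Counter()
    current = origin
    for _ in range(steps):
        i = index[current]
        start, end = offsets[i], offsets[i + 1]
        if start == end:
            current = origin
            continue
        current = targets[rng.randrange(start, end)]
        visits[current] += 1
    return visits


def walk_all(edges, origins, steps=1000, seed=0):
    """Run one independent walk per origin; nothing is shared between walks."""
    graph = build_csr(edges)
    return {o: random_walk(o, graph, steps, random.Random(seed + o))
            for o in origins}


if __name__ == "__main__":
    sample = [(1, 2), (1, 3), (1, 10), (2, 3), (3, 4), (3, 5)]
    for origin, visits in walk_all(sample, [1, 2, 3]).items():
        print(origin, visits.most_common(3))
```

Because every walk only reads the packed graph, the per-origin calls could be spread across tasks that each hold a copy of it in memory, which is the shared-nothing shape described above.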

Thanks
Best regards...


On Tue, Apr 2, 2013 at 7:20 PM, Gianmarco De Francisci Morales <gdfm@gdfm.me> wrote:

> FYI, Giraph has a Random Walk implementation.
>
> Pig does not support iteration natively, so any iterative algorithm is not
> a very good fit for it. Just my 2c.
>
> Cheers,
>
> --
> Gianmarco
>
>
> On Tue, Apr 2, 2013 at 10:04 AM, burakkk <bu...@gmail.com> wrote:
>
> > So what do you suggest? Is it clear?
> >
> >
> > On Mon, Apr 1, 2013 at 9:35 PM, burakkk <bu...@gmail.com> wrote:
> >
> > > I'm using only WTF graph representation to fit the memory. By the way I
> > > haven't seen any explanation from the pig 0.11 release page about WTF
> or
> > > graph models.
> > > I don't wanna use Cassovary. I believe it can be done with pig. I
> > > implement a graph representation using WTF paper to pig and then I'll
> use
> > > it to implement random walk algorithm. To do that maybe I need to
> improve
> > > some features such as joins(fuzzy join) etc or implement a new
> operator.
> > I
> > > can implement it using either existing operators or new operators.
> That's
> > > up to us and it doesn't really matter. If there is already a
> > implementation
> > > to random walker algorithm, please feel free to tell. Because I haven't
> > > found it.
> > > Are you proposing to create an open-source implementation of those
> > > algorithms?
> > > Yes, I'm proposing to implement a random walk algorithm, new data model
> > > which is representing graph. After that, people can use it coding the
> > pig.
> > >
> > > Do you suggest they should be Pig scripts added to the Pig project, or
> do
> > > you want to create some new operators?
> > > Maybe, it can be UDF or new operator.
> > >
> > > I made a quick example. It may not be completely accurate, I've just
> > tried
> > > to explain it.
> > > Think about you have a graph file just like that
> > > user_id follower
> > > 1 2
> > > 1 3
> > > 1 10
> > > 2 3
> > > 3 4
> > > 3 5
> > > ...
> > >
> > > Vertex List is an array including sorted vertex ids
> > > node List is a matrix including vertex id and its starting position
> > >
> > >
> > > -- load the graph file
> > > graph = LOAD 'graph' USING PigStorage() AS (vertex:int, follower:int);
> > > vertex = COGROUP graph BY (vertex);
> > > -- load all the vertexes from HDFS into memory
> > > list = FOREACH vertex GENERATE org.apache.pig.generateVertex(vertex) AS vertexList;
> > > -- load all the vertexes from HDFS into memory
> > > list = FOREACH graph GENERATE org.apache.pig.generateNode(list) AS nodeList;
> > > -- generate a score: using the node list, traverse the graph to the finishing position
> > > randomWalk = FOREACH vertex GENERATE FLATTEN(org.apache.pig.RandomWalk(list, endVertex)) AS score;
> > > store...
> > >
> > >
> > > Thanks
> > > Best Regards...
> > >
> > >
> > > On Mon, Apr 1, 2013 at 7:20 PM, Dmitriy Ryaboy <dv...@gmail.com>
> > wrote:
> > >
> > >> I'm somewhat familiar with WTF code (my day job is managing the
> > analytics
> > >> infrastructure team at Twitter). WTF is implemented using Pig 0.11 (in
> > >> fact
> > >> some of the Pig 11 features/improvements are directly due to this
> > >> project...), and mostly has to do with clever algorithms implemented
> in
> > >> Pig
> > >> (an earlier version of WTF loaded the graph into main memory on
> > large-mem
> > >> machines -- that system is open sourced, too, under
> > >> github.com/twitter/cassovary). Are you proposing to create an
> > open-source
> > >> implementation of those algorithms? Do you suggest they should be Pig
> > >> scripts added to the Pig project, or do you want to create some new
> > >> operators? I'm not totally sure where you are going here.
> > >>
> > >> GSoC proposals for Pig are usually made by students who want to work
> on
> > >> issues labeled as GSoC candidates on the apache jira. The students
> spend
> > >> some time to understand the problem stated in the jira, familiarize
> > >> themselves with the existing codebase, and put a basic technical
> > >> implementation plan and schedule into their proposal. Since in this
> case
> > >> you are proposing something we haven't scoped or defined well for
> > >> ourselves, we need you to be very clear and specific about what you
> are
> > >> trying to do, and how you plan to go about it. I think that Graph
> > >> processing in Pig (or other Hadoop-based systems) is a really
> > interesting
> > >> topic and there is a lot of work to be done, but we really need you to
> > be
> > >> far more detailed to be able to give you good guidance with regards to
> > >> GSoC.
> > >>
> > >> Best,
> > >> Dmitriy
> > >>
> > >>
> > >> On Sat, Mar 30, 2013 at 10:12 AM, burakkk <bu...@gmail.com>
> > wrote:
> > >>
> > >> > Sure. We can implement a graph model using  "WTF: The Who to Follow
> > >> Service
> > >> > at Twitter article we can" article.This article's said that in this
> > way
> > >> > graph can be stored one machine's memory so that every node will
> read
> > >> from
> > >> > HDFS and cache the graph to the memory. Every node is responsible
> from
> > >> its
> > >> > bucket edge to process. I mean it can be splitted. Every node can be
> > >> > processed its bucket using random walk algorithm for instance.
> Finally
> > >> it
> > >> > can be reduced to get to the final results. I hope it's clear :)
> > >> >
> > >> > Thanks
> > >> > Best Regards...
> > >> >
> > >> >
> > >> > On Fri, Mar 29, 2013 at 6:10 PM, Dmitriy Ryaboy <dvryaboy@gmail.com
> >
> > >> > wrote:
> > >> >
> > >> > > Hi Burakk,
> > >> > > The general idea of making graph processing easier is a good one.
> > I'm
> > >> not
> > >> > > sure what exactly you are proposing to do, though. Could you be
> more
> > >> > > detailed about what you are thinking?
> > >> > >
> > >> > >
> > >> > > On Thu, Mar 28, 2013 at 1:28 PM, burakkk <bu...@gmail.com>
> > >> wrote:
> > >> > >
> > >> > > > Hi,
> > >> > > > I might be a little bit late. I come up with a new idea for the
> > last
> > >> > > > minute. Currently I'm working on social graph processing. I
> think
> > we
> > >> > can
> > >> > > > implement a solution for pig.  With this idea I'm thinking to
> > apply
> > >> the
> > >> > > > GSOC 2013 so that I can do some tasks about it. Is there any
> > mentor
> > >> to
> > >> > do
> > >> > > > it with me?  Is there any suggestion? :)
> > >> > > >
> > >> > > > Details:
> > >> > > > Of course I can improve some join operations. I'm not sure is
> > there
> > >> any
> > >> > > > implementation about fuzzy joins for instance. These are the
> > papers
> > >> > that
> > >> > > I
> > >> > > > found
> > >> > > >
> > >> > > > Fuzzy Joins Using MapReduce
> > >> > > > http://ilpubs.stanford.edu:8090/1006/
> > >> > > >
> > >> > > > Dimension independent similarity computation
> > >> > > > http://arxiv.org/abs/1206.2082
> > >> > > >
> > >> > > > MapReduce is Good Enough? If All You Have is a Hammer, Throw
> Away
> > >> > > > Everything That’s Not a Nail!
> > >> > > > http://arxiv.org/pdf/1209.2191.pdf
> > >> > > >
> > >> > > > Large Graph Processing in the Cloud
> > >> > > > http://www.ntu.edu.sg/home/bshe/sigmod10_demo.pdf
> > >> > > >
> > >> > > > ..etc
> > >> > > >
> > >> > > > Thanks
> > >> > > > Best regards..
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > >
> > >> > > > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > >> > > > *
> > >> > > > *
> > >> > > >
> > >> > >
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> >
> > >> > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > >> > *
> > >> > *
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > >
> > > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > > *
> > > *
> > >
> >
> >
> >
> > --
> >
> > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > *
> > *
> >
>



-- 

BURAK ISIKLI | http://burakisikli.wordpress.com

Re: GSoC 2013

Posted by Gianmarco De Francisci Morales <gd...@gdfm.me>.
FYI, Giraph has a Random Walk implementation.

Pig does not support iteration natively, so any iterative algorithm is not
a very good fit for it. Just my 2c.
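
As an illustration of what "iteration" means here, the following is a rough sketch in plain Python, not Pig: an iterative algorithm needs a loop that re-runs the same pass until the result stops changing, and that loop has to live in a driver outside the script. The names iterate_until_stable and make_pagerank_step are hypothetical, and the PageRank-style pass is only a stand-in for whatever a single job would compute per round.

```python
def iterate_until_stable(step, state, tolerance=1e-6, max_rounds=200):
    """Re-run `step` until the total change drops below `tolerance`."""
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        new_state = step(state)
        change = sum(abs(new_state[k] - state[k]) for k in state)
        state = new_state
        if change < tolerance:
            break
    return state, rounds


def make_pagerank_step(out_links, damping=0.85):
    """Build one PageRank-style pass over an adjacency map (stand-in job)."""
    n = len(out_links)

    def step(rank):
        new_rank = {v: (1.0 - damping) / n for v in out_links}
        for v, targets in out_links.items():
            if targets:
                share = damping * rank[v] / len(targets)
                for u in targets:
                    new_rank[u] += share
            else:
                # A vertex without out-links spreads its rank evenly.
                for u in new_rank:
                    new_rank[u] += damping * rank[v] / n
        return new_rank

    return step


if __name__ == "__main__":
    links = {1: [2, 3], 2: [3], 3: [1]}
    start = {v: 1.0 / len(links) for v in links}
    ranks, rounds = iterate_until_stable(make_pagerank_step(links), start)
    print(rounds, ranks)
```

In a batch setting each call to the step function would correspond to a full job launch, which is why the per-round overhead matters for iterative algorithms.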

Cheers,

--
Gianmarco


On Tue, Apr 2, 2013 at 10:04 AM, burakkk <bu...@gmail.com> wrote:

> So what do you suggest? Is it clear?
>
>
> On Mon, Apr 1, 2013 at 9:35 PM, burakkk <bu...@gmail.com> wrote:
>
> > I'm using only WTF graph representation to fit the memory. By the way I
> > haven't seen any explanation from the pig 0.11 release page about WTF or
> > graph models.
> > I don't wanna use Cassovary. I believe it can be done with pig. I
> > implement a graph representation using WTF paper to pig and then I'll use
> > it to implement random walk algorithm. To do that maybe I need to improve
> > some features such as joins(fuzzy join) etc or implement a new operator.
> I
> > can implement it using either existing operators or new operators. That's
> > up to us and it doesn't really matter. If there is already a
> implementation
> > to random walker algorithm, please feel free to tell. Because I haven't
> > found it.
> > Are you proposing to create an open-source implementation of those
> > algorithms?
> > Yes, I'm proposing to implement a random walk algorithm, new data model
> > which is representing graph. After that, people can use it coding the
> pig.
> >
> > Do you suggest they should be Pig scripts added to the Pig project, or do
> > you want to create some new operators?
> > Maybe, it can be UDF or new operator.
> >
> > I made a quick example. It may not be completely accurate, I've just
> tried
> > to explain it.
> > Think about you have a graph file just like that
> > user_id follower
> > 1 2
> > 1 3
> > 1 10
> > 2 3
> > 3 4
> > 3 5
> > ...
> >
> > Vertex List is an array including sorted vertex ids
> > node List is a matrix including vertex id and its starting position
> >
> >
> > -- load the graph file
> > graph = LOAD 'graph' USING PigStorage() AS (vertex:int, follower:int);
> > vertex = COGROUP graph BY (vertex);
> > -- load all the vertexes from HDFS into memory
> > list = FOREACH vertex GENERATE org.apache.pig.generateVertex(vertex) AS vertexList;
> > -- load all the vertexes from HDFS into memory
> > list = FOREACH graph GENERATE org.apache.pig.generateNode(list) AS nodeList;
> > -- generate a score: using the node list, traverse the graph to the finishing position
> > randomWalk = FOREACH vertex GENERATE FLATTEN(org.apache.pig.RandomWalk(list, endVertex)) AS score;
> > store...
> >
> >
> > Thanks
> > Best Regards...
> >
> >
> > On Mon, Apr 1, 2013 at 7:20 PM, Dmitriy Ryaboy <dv...@gmail.com>
> wrote:
> >
> >> I'm somewhat familiar with WTF code (my day job is managing the
> analytics
> >> infrastructure team at Twitter). WTF is implemented using Pig 0.11 (in
> >> fact
> >> some of the Pig 11 features/improvements are directly due to this
> >> project...), and mostly has to do with clever algorithms implemented in
> >> Pig
> >> (an earlier version of WTF loaded the graph into main memory on
> large-mem
> >> machines -- that system is open sourced, too, under
> >> github.com/twitter/cassovary). Are you proposing to create an
> open-source
> >> implementation of those algorithms? Do you suggest they should be Pig
> >> scripts added to the Pig project, or do you want to create some new
> >> operators? I'm not totally sure where you are going here.
> >>
> >> GSoC proposals for Pig are usually made by students who want to work on
> >> issues labeled as GSoC candidates on the apache jira. The students spend
> >> some time to understand the problem stated in the jira, familiarize
> >> themselves with the existing codebase, and put a basic technical
> >> implementation plan and schedule into their proposal. Since in this case
> >> you are proposing something we haven't scoped or defined well for
> >> ourselves, we need you to be very clear and specific about what you are
> >> trying to do, and how you plan to go about it. I think that Graph
> >> processing in Pig (or other Hadoop-based systems) is a really
> interesting
> >> topic and there is a lot of work to be done, but we really need you to
> be
> >> far more detailed to be able to give you good guidance with regards to
> >> GSoC.
> >>
> >> Best,
> >> Dmitriy
> >>
> >>
> >> On Sat, Mar 30, 2013 at 10:12 AM, burakkk <bu...@gmail.com>
> wrote:
> >>
> >> > Sure. We can implement a graph model using  "WTF: The Who to Follow
> >> Service
> >> > at Twitter article we can" article.This article's said that in this
> way
> >> > graph can be stored one machine's memory so that every node will read
> >> from
> >> > HDFS and cache the graph to the memory. Every node is responsible from
> >> its
> >> > bucket edge to process. I mean it can be splitted. Every node can be
> >> > processed its bucket using random walk algorithm for instance. Finally
> >> it
> >> > can be reduced to get to the final results. I hope it's clear :)
> >> >
> >> > Thanks
> >> > Best Regards...
> >> >
> >> >
> >> > On Fri, Mar 29, 2013 at 6:10 PM, Dmitriy Ryaboy <dv...@gmail.com>
> >> > wrote:
> >> >
> >> > > Hi Burakk,
> >> > > The general idea of making graph processing easier is a good one.
> I'm
> >> not
> >> > > sure what exactly you are proposing to do, though. Could you be more
> >> > > detailed about what you are thinking?
> >> > >
> >> > >
> >> > > On Thu, Mar 28, 2013 at 1:28 PM, burakkk <bu...@gmail.com>
> >> wrote:
> >> > >
> >> > > > Hi,
> >> > > > I might be a little bit late. I come up with a new idea for the
> last
> >> > > > minute. Currently I'm working on social graph processing. I think
> we
> >> > can
> >> > > > implement a solution for pig.  With this idea I'm thinking to
> apply
> >> the
> >> > > > GSOC 2013 so that I can do some tasks about it. Is there any
> mentor
> >> to
> >> > do
> >> > > > it with me?  Is there any suggestion? :)
> >> > > >
> >> > > > Details:
> >> > > > Of course I can improve some join operations. I'm not sure is
> there
> >> any
> >> > > > implementation about fuzzy joins for instance. These are the
> papers
> >> > that
> >> > > I
> >> > > > found
> >> > > >
> >> > > > Fuzzy Joins Using MapReduce
> >> > > > http://ilpubs.stanford.edu:8090/1006/
> >> > > >
> >> > > > Dimension independent similarity computation
> >> > > > http://arxiv.org/abs/1206.2082
> >> > > >
> >> > > > MapReduce is Good Enough? If All You Have is a Hammer, Throw Away
> >> > > > Everything That’s Not a Nail!
> >> > > > http://arxiv.org/pdf/1209.2191.pdf
> >> > > >
> >> > > > Large Graph Processing in the Cloud
> >> > > > http://www.ntu.edu.sg/home/bshe/sigmod10_demo.pdf
> >> > > >
> >> > > > ..etc
> >> > > >
> >> > > > Thanks
> >> > > > Best regards..
> >> > > >
> >> > > >
> >> > > > --
> >> > > >
> >> > > > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> >> > > > *
> >> > > > *
> >> > > >
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> >
> >> > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> >> > *
> >> > *
> >> >
> >>
> >
> >
> >
> > --
> >
> > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > *
> > *
> >
>
>
>
> --
>
> *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> *
> *
>

Re: GSoC 2013

Posted by burakkk <bu...@gmail.com>.
So what do you suggest? Is it clear?


On Mon, Apr 1, 2013 at 9:35 PM, burakkk <bu...@gmail.com> wrote:

> I'm using only WTF graph representation to fit the memory. By the way I
> haven't seen any explanation from the pig 0.11 release page about WTF or
> graph models.
> I don't wanna use Cassovary. I believe it can be done with pig. I
> implement a graph representation using WTF paper to pig and then I'll use
> it to implement random walk algorithm. To do that maybe I need to improve
> some features such as joins(fuzzy join) etc or implement a new operator. I
> can implement it using either existing operators or new operators. That's
> up to us and it doesn't really matter. If there is already a implementation
> to random walker algorithm, please feel free to tell. Because I haven't
> found it.
> Are you proposing to create an open-source implementation of those
> algorithms?
> Yes, I'm proposing to implement a random walk algorithm, new data model
> which is representing graph. After that, people can use it coding the pig.
>
> Do you suggest they should be Pig scripts added to the Pig project, or do
> you want to create some new operators?
> Maybe, it can be UDF or new operator.
>
> I made a quick example. It may not be completely accurate, I've just tried
> to explain it.
> Think about you have a graph file just like that
> user_id follower
> 1 2
> 1 3
> 1 10
> 2 3
> 3 4
> 3 5
> ...
>
> Vertex List is an array including sorted vertex ids
> node List is a matrix including vertex id and its starting position
>
>
> graph = load 'graph' using PigStorage() (vertex:int, follower:int) -
> --load the graph file
> vertex = COGROUP graph BY (vertex);
> list = FOREACH vertex GENERATE org.apache.pig.generateVertex(vertex) as
> vertexList; --load the whole vertexes from HDFS into the memory
> list = FOREACH graph GENERATE org.apache.pig.generateNode(list) as
> nodeList; --load the whole vertexes from HDFS into the memory
> randomWalk = FOREACH vertex GENERATE
> flatten(org.apache.pig.RandomWalk(list, endVertex)) as score; -- generate a
> score using the node list you can traverse the graph to the your finishing
> position
> store...
>
>
> Thanks
> Best Regards...
>
>
> On Mon, Apr 1, 2013 at 7:20 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:
>
>> I'm somewhat familiar with WTF code (my day job is managing the analytics
>> infrastructure team at Twitter). WTF is implemented using Pig 0.11 (in
>> fact
>> some of the Pig 11 features/improvements are directly due to this
>> project...), and mostly has to do with clever algorithms implemented in
>> Pig
>> (an earlier version of WTF loaded the graph into main memory on large-mem
>> machines -- that system is open sourced, too, under
>> github.com/twitter/cassovary). Are you proposing to create an open-source
>> implementation of those algorithms? Do you suggest they should be Pig
>> scripts added to the Pig project, or do you want to create some new
>> operators? I'm not totally sure where you are going here.
>>
>> GSoC proposals for Pig are usually made by students who want to work on
>> issues labeled as GSoC candidates on the apache jira. The students spend
>> some time to understand the problem stated in the jira, familiarize
>> themselves with the existing codebase, and put a basic technical
>> implementation plan and schedule into their proposal. Since in this case
>> you are proposing something we haven't scoped or defined well for
>> ourselves, we need you to be very clear and specific about what you are
>> trying to do, and how you plan to go about it. I think that Graph
>> processing in Pig (or other Hadoop-based systems) is a really interesting
>> topic and there is a lot of work to be done, but we really need you to be
>> far more detailed to be able to give you good guidance with regards to
>> GSoC.
>>
>> Best,
>> Dmitriy
>>
>>
>> On Sat, Mar 30, 2013 at 10:12 AM, burakkk <bu...@gmail.com> wrote:
>>
>> > Sure. We can implement a graph model using  "WTF: The Who to Follow
>> Service
>> > at Twitter article we can" article.This article's said that in this way
>> > graph can be stored one machine's memory so that every node will read
>> from
>> > HDFS and cache the graph to the memory. Every node is responsible from
>> its
>> > bucket edge to process. I mean it can be splitted. Every node can be
>> > processed its bucket using random walk algorithm for instance. Finally
>> it
>> > can be reduced to get to the final results. I hope it's clear :)
>> >
>> > Thanks
>> > Best Regards...
>> >
>> >
>> > On Fri, Mar 29, 2013 at 6:10 PM, Dmitriy Ryaboy <dv...@gmail.com>
>> > wrote:
>> >
>> > > Hi Burakk,
>> > > The general idea of making graph processing easier is a good one. I'm
>> not
>> > > sure what exactly you are proposing to do, though. Could you be more
>> > > detailed about what you are thinking?
>> > >
>> > >
>> > > On Thu, Mar 28, 2013 at 1:28 PM, burakkk <bu...@gmail.com>
>> wrote:
>> > >
>> > > > Hi,
>> > > > I might be a little bit late. I come up with a new idea for the last
>> > > > minute. Currently I'm working on social graph processing. I think we
>> > can
>> > > > implement a solution for pig.  With this idea I'm thinking to apply
>> the
>> > > > GSOC 2013 so that I can do some tasks about it. Is there any mentor
>> to
>> > do
>> > > > it with me?  Is there any suggestion? :)
>> > > >
>> > > > Details:
>> > > > Of course I can improve some join operations. I'm not sure is there
>> any
>> > > > implementation about fuzzy joins for instance. These are the papers
>> > that
>> > > I
>> > > > found
>> > > >
>> > > > Fuzzy Joins Using MapReduce
>> > > > http://ilpubs.stanford.edu:8090/1006/
>> > > >
>> > > > Dimension independent similarity computation
>> > > > http://arxiv.org/abs/1206.2082
>> > > >
>> > > > MapReduce is Good Enough? If All You Have is a Hammer, Throw Away
>> > > > Everything That’s Not a Nail!
>> > > > http://arxiv.org/pdf/1209.2191.pdf
>> > > >
>> > > > Large Graph Processing in the Cloud
>> > > > http://www.ntu.edu.sg/home/bshe/sigmod10_demo.pdf
>> > > >
>> > > > ..etc
>> > > >
>> > > > Thanks
>> > > > Best regards..
>> > > >
>> > > >
>> > > > --
>> > > >
>> > > > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
>> > > > *
>> > > > *
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> >
>> > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
>> > *
>> > *
>> >
>>
>
>
>
> --
>
> *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> *
> *
>



-- 

BURAK ISIKLI | http://burakisikli.wordpress.com

Re: GSoC 2013

Posted by burakkk <bu...@gmail.com>.
I'm only using the WTF graph representation so that the graph fits in
memory. By the way, I haven't seen anything on the Pig 0.11 release page
about WTF or graph models.
I don't want to use Cassovary; I believe this can be done with Pig. I'll
implement a graph representation in Pig based on the WTF paper and then use
it to implement a random walk algorithm. To do that I may need to improve
some features, such as joins (fuzzy join), or implement a new operator. I
can implement it using either existing operators or new ones; that's up to
us and doesn't really matter. If there is already an implementation of a
random walk algorithm, please let me know, because I haven't found one.

> Are you proposing to create an open-source implementation of those
> algorithms?
Yes, I'm proposing to implement a random walk algorithm and a new data
model that represents a graph. After that, people can use them when writing
Pig scripts.

> Do you suggest they should be Pig scripts added to the Pig project, or do
> you want to create some new operators?
Maybe; it could be a UDF or a new operator.

I made a quick example. It may not be completely accurate; I've just tried
to illustrate the idea.
Suppose you have a graph file like this:
user_id follower
1 2
1 3
1 10
2 3
3 4
3 5
...

The vertex list is an array of sorted vertex ids.
The node list is a matrix holding each vertex id and its starting position.
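
One possible reading of these two structures is a compressed adjacency layout: a single array of follower ids ordered by source vertex, plus a table that records, for each vertex id, the index where its followers start. The Python sketch below works under that assumption only. The names `build_lists` and `followers_of` are made up for illustration; they are not part of Pig or of the WTF paper.

```python
from bisect import bisect_left


def build_lists(edges):
    """Build (vertex_list, node_list) from (user_id, follower) pairs.

    vertex_list: follower ids, ordered by (user_id, follower).
    node_list:   rows of (user_id, start), where start is the index in
                 vertex_list at which that user's followers begin.
    """
    vertex_list = []
    node_list = []
    for user_id, follower in sorted(edges):
        # First edge seen for this user: record where its run begins.
        if not node_list or node_list[-1][0] != user_id:
            node_list.append((user_id, len(vertex_list)))
        vertex_list.append(follower)
    return vertex_list, node_list


def followers_of(user_id, vertex_list, node_list):
    """Return the followers of user_id, or [] if it has no outgoing edges."""
    ids = [row[0] for row in node_list]
    i = bisect_left(ids, user_id)  # node_list is sorted by user_id
    if i == len(ids) or ids[i] != user_id:
        return []
    start = node_list[i][1]
    # The run ends where the next user's run starts (or at the array end).
    end = node_list[i + 1][1] if i + 1 < len(node_list) else len(vertex_list)
    return vertex_list[start:end]


# The sample graph file from this message.
edges = [(1, 2), (1, 3), (1, 10), (2, 3), (3, 4), (3, 5)]
vertex_list, node_list = build_lists(edges)
print(vertex_list)
print(node_list)
print(followers_of(3, vertex_list, node_list))
```

Keeping the edges in one flat array with an offset table avoids one object per vertex, which is the usual reason for this kind of layout when a graph has to fit in a single machine's memory.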


-- generateVertex, generateNode and RandomWalk are proposed UDFs; they do
-- not exist in Pig yet.
-- load the graph file
graph = LOAD 'graph' USING PigStorage() AS (vertex:int, follower:int);
vertex = COGROUP graph BY (vertex);
-- load all the vertices from HDFS into memory
list = FOREACH vertex GENERATE
    org.apache.pig.generateVertex(vertex) AS vertexList;
-- load all the vertices from HDFS into memory
list = FOREACH graph GENERATE
    org.apache.pig.generateNode(list) AS nodeList;
-- generate a score: using the node list, traverse the graph to the
-- finishing position
randomWalk = FOREACH vertex GENERATE
    FLATTEN(org.apache.pig.RandomWalk(list, endVertex)) AS score;
store...
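
The script above does not say how the proposed RandomWalk UDF would compute its score. As a rough illustration only, the Python sketch below assumes the score is the visit frequency of a random walk with restart from a start vertex. The function name `random_walk_scores`, the `restart` probability, and the step count are all assumptions, not something defined in this thread or in Pig.

```python
import random
from collections import Counter


def random_walk_scores(adjacency, start, steps=10_000, restart=0.15, seed=0):
    """Estimate visit frequencies of a random walk with restart.

    adjacency: dict mapping a vertex id to a list of neighbour ids.
    Returns a dict mapping each visited vertex to its share of the steps.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    visits = Counter()
    current = start
    for _ in range(steps):
        neighbours = adjacency.get(current, [])
        # Jump back to the start at a dead end or with probability `restart`.
        if not neighbours or rng.random() < restart:
            current = start
        else:
            current = rng.choice(neighbours)
        visits[current] += 1
    return {vertex: count / steps for vertex, count in visits.items()}


# Adjacency for the sample graph file (user_id -> followers).
adjacency = {1: [2, 3, 10], 2: [3], 3: [4, 5]}
scores = random_walk_scores(adjacency, start=1)
for vertex, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(vertex, round(score, 3))
```

In a Pig setting, each task could run walks like this over its own bucket of start vertices and a final GROUP could combine the per-vertex counts, which matches the split-then-reduce idea described earlier in the thread.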


Thanks
Best Regards...


On Mon, Apr 1, 2013 at 7:20 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> I'm somewhat familiar with WTF code (my day job is managing the analytics
> infrastructure team at Twitter). WTF is implemented using Pig 0.11 (in fact
> some of the Pig 11 features/improvements are directly due to this
> project...), and mostly has to do with clever algorithms implemented in Pig
> (an earlier version of WTF loaded the graph into main memory on large-mem
> machines -- that system is open sourced, too, under
> github.com/twitter/cassovary). Are you proposing to create an open-source
> implementation of those algorithms? Do you suggest they should be Pig
> scripts added to the Pig project, or do you want to create some new
> operators? I'm not totally sure where you are going here.
>
> GSoC proposals for Pig are usually made by students who want to work on
> issues labeled as GSoC candidates on the apache jira. The students spend
> some time to understand the problem stated in the jira, familiarize
> themselves with the existing codebase, and put a basic technical
> implementation plan and schedule into their proposal. Since in this case
> you are proposing something we haven't scoped or defined well for
> ourselves, we need you to be very clear and specific about what you are
> trying to do, and how you plan to go about it. I think that Graph
> processing in Pig (or other Hadoop-based systems) is a really interesting
> topic and there is a lot of work to be done, but we really need you to be
> far more detailed to be able to give you good guidance with regards to
> GSoC.
>
> Best,
> Dmitriy
>
>
> On Sat, Mar 30, 2013 at 10:12 AM, burakkk <bu...@gmail.com> wrote:
>
> > Sure. We can implement a graph model using  "WTF: The Who to Follow
> Service
> > at Twitter article we can" article.This article's said that in this way
> > graph can be stored one machine's memory so that every node will read
> from
> > HDFS and cache the graph to the memory. Every node is responsible from
> its
> > bucket edge to process. I mean it can be splitted. Every node can be
> > processed its bucket using random walk algorithm for instance. Finally it
> > can be reduced to get to the final results. I hope it's clear :)
> >
> > Thanks
> > Best Regards...
> >
> >
> > On Fri, Mar 29, 2013 at 6:10 PM, Dmitriy Ryaboy <dv...@gmail.com>
> > wrote:
> >
> > > Hi Burakk,
> > > The general idea of making graph processing easier is a good one. I'm
> not
> > > sure what exactly you are proposing to do, though. Could you be more
> > > detailed about what you are thinking?
> > >
> > >
> > > On Thu, Mar 28, 2013 at 1:28 PM, burakkk <bu...@gmail.com>
> wrote:
> > >
> > > > Hi,
> > > > I might be a little bit late. I come up with a new idea for the last
> > > > minute. Currently I'm working on social graph processing. I think we
> > can
> > > > implement a solution for pig.  With this idea I'm thinking to apply
> the
> > > > GSOC 2013 so that I can do some tasks about it. Is there any mentor
> to
> > do
> > > > it with me?  Is there any suggestion? :)
> > > >
> > > > Details:
> > > > Of course I can improve some join operations. I'm not sure is there
> any
> > > > implementation about fuzzy joins for instance. These are the papers
> > that
> > > I
> > > > found
> > > >
> > > > Fuzzy Joins Using MapReduce
> > > > http://ilpubs.stanford.edu:8090/1006/
> > > >
> > > > Dimension independent similarity computation
> > > > http://arxiv.org/abs/1206.2082
> > > >
> > > > MapReduce is Good Enough? If All You Have is a Hammer, Throw Away
> > > > Everything That’s Not a Nail!
> > > > http://arxiv.org/pdf/1209.2191.pdf
> > > >
> > > > Large Graph Processing in the Cloud
> > > > http://www.ntu.edu.sg/home/bshe/sigmod10_demo.pdf
> > > >
> > > > ..etc
> > > >
> > > > Thanks
> > > > Best regards..
> > > >
> > > >
> > > > --
> > > >
> > > > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > > > *
> > > > *
> > > >
> > >
> >
> >
> >
> > --
> >
> > *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
> > *
> > *
> >
>



-- 

BURAK ISIKLI | http://burakisikli.wordpress.com
