Posted to user@giraph.apache.org by Raimon Bosch <ra...@gmail.com> on 2012/05/08 19:17:52 UTC

Questions about how to define a graph in Apache Giraph

Hi all,

I'm designing a model to graph my web visits using the data in my access
log. My idea is to create edges between my pages through queries coming
from Google, i.e. if a user searches for "used cars in NY" and hits one of
my pages (say A), and one month later another user searches for "used cars
in NY" and hits another of my pages (say B), I can create an edge between A
and B where the value of the edge will be the number of pages viewed by
those 2 users.

So my questions are more directly related to the format used in Apache
Giraph:

- Can I give values to the edges? E.g. A to B (cost is 6), and B to C (cost
is 4).

- In the shortestPath example we have an input like this:

[10,4500,[[11,1000]]]
[11,5500,[[12,1100]]]
[12,6600,[[13,1200]]]
[13,7800,[[14,1300]]]
[14,9100,[[0,1400]]]

Can you give us an overview of what this graph would look like? That would
be a nice document for the wiki page.

- What would the input for my use case look like? (A -> B (cost 6), B -> C
(cost 4))


Thanks in advance,
Raimon Bosch.

PS: Some feedback about my model would be appreciated too. I haven't found
any papers about this topic yet.

Re: Questions about how to define a graph in Apache Giraph

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Raimon,

you actually have two graphs in your use case description :) The
first one is a bipartite graph consisting of a set of search queries on
the one hand and a set of web pages on the other hand.

From this graph you want to create another graph where all pages that
share a common query are connected. This is an algorithmic problem, not
just a way of formatting the input.

There are several ways to create this second graph. One way would be to
use Mahout's ItemSimilarityJob with (query,page) tuples as input.
ItemSimilarityJob will then give you all, or the top-k, similar pages per
page as (page,page) tuples. From this output you could create the second
graph very easily.
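
As a rough illustration of that last step, here is a minimal Python sketch
that turns weighted (page,page) tuples into lines shaped like the ones in
the shortest path example. It assumes each line means
[vertex id, vertex value, [[target id, edge value], ...]], that pages have
already been mapped to numeric ids (A=1, B=2, C=3 here), and that a vertex
value of 0 is acceptable. The function name to_json_lines is hypothetical
and none of this is Giraph or Mahout API.

```python
import json
from collections import defaultdict


def to_json_lines(page_pairs, default_vertex_value=0):
    """Turn (source, target, weight) tuples into one JSON line per vertex.

    Assumed line layout (not confirmed by this thread):
    [vertex_id, vertex_value, [[target_id, edge_value], ...]]
    """
    adjacency = defaultdict(list)
    for source, target, weight in page_pairs:
        adjacency[source].append([target, weight])
        # Make sure a page with no outgoing edges still gets its own line.
        adjacency.setdefault(target, [])
    return [
        json.dumps([vertex, default_vertex_value, adjacency[vertex]],
                   separators=(",", ":"))
        for vertex in sorted(adjacency)
    ]


# Hypothetical id mapping: A=1, B=2, C=3.
# A -> B (cost 6), B -> C (cost 4)
pairs = [(1, 2, 6), (2, 3, 4)]
lines = to_json_lines(pairs)
for line in lines:
    print(line)
```

Under those assumptions, the edge values from the question (6 and 4) end up
as the second number of each inner pair.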

Alternatively you could think about implementing a pairwise similarity
algorithm yourself in Giraph. You basically would need to find all pairs
of vertices that share a common neighbor.
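
To make the idea of "pairs of vertices that share a common neighbor"
concrete, here is a small single-machine Python sketch, not a Giraph
implementation. It groups pages by query and counts, for every pair of
pages, how many queries they have in common. The function name
pages_sharing_queries and the sample data are made up for illustration, and
the edge weight used here (number of shared queries) is only one possible
choice, not the page-view count from the original question.

```python
from collections import defaultdict
from itertools import combinations


def pages_sharing_queries(query_page_tuples):
    """Count shared queries for every pair of pages.

    Input: iterable of (query, page) tuples (the bipartite graph).
    Output: dict mapping (page_a, page_b) with page_a < page_b to the
    number of distinct queries that led to both pages.
    """
    pages_by_query = defaultdict(set)
    for query, page in query_page_tuples:
        pages_by_query[query].add(page)

    shared = defaultdict(int)
    for pages in pages_by_query.values():
        # Every pair of pages reached through the same query gets one count.
        for page_a, page_b in combinations(sorted(pages), 2):
            shared[(page_a, page_b)] += 1
    return dict(shared)


# Hypothetical sample data.
visits = [
    ("used cars in NY", "A"),
    ("used cars in NY", "B"),
    ("cheap cars", "A"),
    ("cheap cars", "B"),
    ("cheap cars", "C"),
]
edges = pages_sharing_queries(visits)
print(edges)
```

Note that a query hitting many pages produces a number of pairs that grows
quadratically with the number of pages, which is the main cost to keep in
mind when moving this to a distributed setting.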

--sebastian


