Posted to common-user@hadoop.apache.org by Ravikant Dindokar <ra...@gmail.com> on 2015/06/24 08:35:32 UTC

Joins in Hadoop

Hi Hadoop user,

I want to use Hadoop to perform operations on graph data.
I have two files:

1. Edge list file
        This file contains one line for each edge in the graph.
sample:
1    2 (here 1 is the source and 2 is the sink node of the edge)
1    5
2    3
4    2
4    3
5    6
5    4
5    7
7    8
8    9
8    10

2. Partition file:
         This file contains one line for each vertex. Each line has two
values: the first number is the <vertex id> and the second is the <partition id>.
 sample : <vertex id>  <partition id>
2    1
3    1
4    1
5    2
6    2
7    2
8    1
9    1
10    1


The edge list file is 32 GB, while the partition file is 10 GB.
(The sizes are so large that a map task could at best cache only the
partition file. I have a 20-node cluster with 24 GB of memory per node.)

My aim is to get all vertices (along with their adjacency lists) that
have the same partition id into one reducer, so that I can perform
further analytics on a given partition in the reducer.

Is there any way in Hadoop to join these two files in the mapper, so
that I can map based on the partition id?

Thanks
Ravikant
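For concreteness, the desired grouping on the sample data above can be sketched in plain Python (a toy illustration only; the real files are of course far too large to handle this way):

```python
from collections import defaultdict

# Sample edge list and partition assignments from the post.
edges = [(1, 2), (1, 5), (2, 3), (4, 2), (4, 3), (5, 6),
         (5, 4), (5, 7), (7, 8), (8, 9), (8, 10)]
partition = {2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 1, 9: 1, 10: 1}

# Build each source vertex's adjacency list.
adjacency = defaultdict(list)
for src, dst in edges:
    adjacency[src].append(dst)

# Group vertices (with their adjacency lists) by partition id --
# this is what each reducer should end up holding.
by_partition = defaultdict(dict)
for vertex, part in partition.items():
    by_partition[part][vertex] = adjacency.get(vertex, [])

print(by_partition[2])  # vertices 5, 6, 7 of partition 2, with neighbours
```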

Re: Joins in Hadoop

Posted by Russell Jurney <ru...@gmail.com>.
You are insane to do this with raw MapReduce. Use Pig or Hive, or Spark,
and perform a join. This will take you less than ten minutes, including
the time to download and install Pig or Hive and run them on your data.
For example, see http://pig.apache.org/docs/r0.15.0/basic.html#join-inner

For curiosity's sake, check out this join implementation in Python:
https://github.com/bd4c/big_data_for_chimps-code/blob/master/examples/ch_07/join.py
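The linked script demonstrates a reduce-side join; the core mechanics can be sketched standalone as follows (a minimal sketch, not the linked code — the function and tag names here are made up):

```python
from itertools import groupby
from operator import itemgetter

def reduce_side_join(edges, partitions):
    """Toy reduce-side join: tag each record with its source file,
    sort by the join key, then combine records sharing a key."""
    # "Map" phase: emit (key, tag, payload) tuples from both inputs.
    tagged = [(src, "edge", dst) for src, dst in edges]
    tagged += [(v, "part", p) for v, p in partitions]
    # "Shuffle" phase: sort so records with the same key are adjacent.
    tagged.sort(key=itemgetter(0))
    # "Reduce" phase: for each key, pair the partition id with the edges.
    for key, group in groupby(tagged, key=itemgetter(0)):
        records = list(group)
        part_ids = [p for _, tag, p in records if tag == "part"]
        neighbours = [d for _, tag, d in records if tag == "edge"]
        if part_ids:
            yield key, part_ids[0], neighbours
```

In Hadoop the tagging happens in the mappers and the sort in the shuffle; the reducer then sees exactly the grouped records that `groupby` produces here.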

And this book, which explains mapreduce joins:
https://books.google.com/books?id=GxFYuVZHG60C&lpg=PP1&dq=Mapreduce%20algorithms&pg=PA59#v=snippet&q=3.5%20Relational%20joins&f=false

Using the Java MapReduce APIs to solve this problem is an exercise in
pure futility, unless you're doing this to learn, in which case these
links should help.

My book (with Flip Kromer), Big Data for Chimps, covers joins in Pig and
Python:
https://github.com/infochimps-labs/big_data_for_chimps/blob/master/Ch07-joining_patterns.asciidoc

It is due out in a few weeks.

On Wednesday, June 24, 2015, Harshit Mathur <ma...@gmail.com> wrote:

> So basically you want <vertex_id,partitionId> as your key?
> If this is the case, then you can have your own custom key object by
> implementing WritableComparable.
>
> But I am not sure if the logic permits doing this in a single map-reduce
> job. As per my understanding of your problem, what you want to achieve
> will take two jobs.
>
> On Thu, Jun 25, 2015 at 10:06 AM, Ravikant Dindokar <
> ravikant.iisc@gmail.com> wrote:
>
>> But in the reducer for Job 1, you have:
>> vertexId => {complete adjacency list, partitionId}
>>
>> so the partition ids for the vertices inside the adjacency list are not
>> available. So essentially what I am trying to get as output is
>>
>> <vertex_id,partitionId>,<list>
>> where each element of the list is of type <vertex_id,partitionId>
>>
>> can this be achieved in a single map-reduce job?
>>
>> Thanks
>> Ravikant
>>
>>
>>
>>
>> On Thu, Jun 25, 2015 at 9:25 AM, Harshit Mathur <mathursharp@gmail.com>
>> wrote:
>>
>>> Yeah, you can store it in your custom object as well, just like you are
>>> storing the adjacency list.
>>>
>>> On Wed, Jun 24, 2015 at 10:10 PM, Ravikant Dindokar <
>>> ravikant.iisc@gmail.com> wrote:
>>>
>>>> Hi Harshit,
>>>>
>>>> Is there any way to retain the partition id for each vertex in the
>>>> adjacency list?
>>>>
>>>>
>>>> Thanks
>>>> Ravikant
>>>>
>>>> On Wed, Jun 24, 2015 at 7:55 PM, Ravikant Dindokar <
>>>> ravikant.iisc@gmail.com> wrote:
>>>>
>>>>> Thanks Harshit
>>>>>
>>>>> On Wed, Jun 24, 2015 at 5:35 PM, Harshit Mathur <mathursharp@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> This may be the solution (I hope I understood the problem correctly):
>>>>>>
>>>>>> Job 1:
>>>>>>
>>>>>> You need two mappers: one reading from the edge file and the
>>>>>> other reading from the partition file.
>>>>>> Say, EdgeFileMapper and PartitionFileMapper, with a common Reducer.
>>>>>> Now you can have a custom writable (say GraphCustomObject) holding
>>>>>> the following:
>>>>>> 1) type: a marker for which mapper the object came from
>>>>>> 2) adjacency vertex list: the list of adjacent vertices
>>>>>> 3) partition id: to hold the partition id
>>>>>>
>>>>>> Now the output key and value of the EdgeFileMapper will be:
>>>>>> key => vertexId
>>>>>> value => {type=edgefile; adjacencyVertexList; partitionId=0 (not
>>>>>> present in this file)}
>>>>>>
>>>>>> The output of the PartitionFileMapper will be:
>>>>>> key => vertexId
>>>>>> value => {type=partitionfile; adjacencyVertexList=empty; partitionId}
>>>>>>
>>>>>>
>>>>>> So in the Reducer, for each vertexId we can have the complete
>>>>>> GraphCustomObject populated:
>>>>>> vertexId => {complete adjacency list, partitionId}
>>>>>>
>>>>>> The output of this reducer will be:
>>>>>> key => partitionId
>>>>>> value => {adjacencyVertexList, vertexId}
>>>>>> This will be stored as the output of Job 1.
>>>>>>
>>>>>> Job 2:
>>>>>> This job will read the output generated by the previous job and use an
>>>>>> identity mapper, so in the reducer we will have:
>>>>>> key => partitionId
>>>>>> value => the list of all adjacency lists along with their vertex ids
>>>>>>
>>>>>>
>>>>>>
>>>>>> I know my explanation seems a bit messy, sorry for that.
>>>>>>
>>>>>> BR,
>>>>>> Harshit
>>>>>>
>>>>>> On Wed, Jun 24, 2015 at 12:05 PM, Ravikant Dindokar <
>>>>>> ravikant.iisc@gmail.com> wrote:
>>>>>>
>>>>>>> [quoted original message trimmed]
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Harshit Mathur
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Harshit Mathur
>>>
>>
>>
>
>
> --
> Harshit Mathur
>


-- 
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com
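Harshit's two-job plan quoted above can be simulated end to end in plain Python (a sketch of the dataflow only, not Hadoop code; it assumes vertices missing from the edge file get an empty adjacency list):

```python
from collections import defaultdict

def job1(edges, partitions):
    """Job 1: join adjacency lists with partition ids on vertex id,
    re-keying the output by partition id."""
    adjacency = defaultdict(list)
    for src, dst in edges:                    # EdgeFileMapper records
        adjacency[src].append(dst)
    out = []
    for vertex, part in partitions:           # PartitionFileMapper records
        # Reducer: both record types meet on the vertex-id key.
        out.append((part, (vertex, adjacency.get(vertex, []))))
    return out

def job2(job1_output):
    """Job 2: identity mapper plus grouping by partition id, so each
    reducer sees every vertex (and adjacency list) of one partition."""
    grouped = defaultdict(list)
    for part, record in job1_output:
        grouped[part].append(record)
    return dict(grouped)
```

Note that vertices absent from the partition file (like vertex 1 in the sample data) are dropped here, mirroring an inner join.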


Re: Joins in Hadoop

Posted by Harshit Mathur <ma...@gmail.com>.
So basically you want <vertex_id,partitionId> as your key?
If this is the case, then you can have your own custom key object by
implementing WritableComparable.

But I am not sure if the logic permits doing this in a single map-reduce
job. As per my understanding of your problem, what you want to achieve
will take two jobs.

On Thu, Jun 25, 2015 at 10:06 AM, Ravikant Dindokar
<ravikant.iisc@gmail.com> wrote:

> But in the reducer for Job 1, you have:
> vertexId => {complete adjacency list, partitionId}
>
> so the partition ids for the vertices inside the adjacency list are not
> available. So essentially what I am trying to get as output is
>
> <vertex_id,partitionId>,<list>
> where each element of the list is of type <vertex_id,partitionId>
>
> can this be achieved in a single map-reduce job?
>
> Thanks
> Ravikant
>
>
>
>
> [earlier quoted messages trimmed]


-- 
Harshit Mathur

Re: Joins in Hadoop

Posted by Harshit Mathur <ma...@gmail.com>.
So basically you want <vertex_id,partitionId> as your key?
If that is the case, you can define a custom key object by implementing
WritableComparable.

But I am not sure the logic permits doing this in a single MapReduce
job. As I understand your problem, what you want to achieve will take
two jobs.
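For what it's worth, a minimal sketch of such a composite key in plain Java (the class name and the choice to sort by partitionId first are my assumptions; a real Hadoop key would implement WritableComparable and add write()/readFields() for serialization, which I've left out here so the compare logic stands alone):

```java
// Sketch of a composite (vertexId, partitionId) key. In Hadoop this would
// implement org.apache.hadoop.io.WritableComparable; plain Comparable is
// used here so the ordering logic can be shown without Hadoop dependencies.
public class VertexPartitionKey implements Comparable<VertexPartitionKey> {
    private final long vertexId;
    private final int partitionId;

    public VertexPartitionKey(long vertexId, int partitionId) {
        this.vertexId = vertexId;
        this.partitionId = partitionId;
    }

    @Override
    public int compareTo(VertexPartitionKey other) {
        // Sort by partitionId first so all vertices of one partition are
        // contiguous, then by vertexId to break ties.
        int byPartition = Integer.compare(partitionId, other.partitionId);
        return byPartition != 0 ? byPartition
                                : Long.compare(vertexId, other.vertexId);
    }

    @Override
    public String toString() {
        return vertexId + "," + partitionId;
    }
}
```

With a key like this you would also need a Partitioner and grouping comparator that look only at partitionId, so all keys of one partition reach the same reducer.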

On Thu, Jun 25, 2015 at 10:06 AM, Ravikant Dindokar <ravikant.iisc@gmail.com
> wrote:

> but in the reducer for Job1, you have :
> vertexId => {adjcencyVertex complete list, partitonid=0}
>
> so partition Id's for vertices in the adjacency list are not available. So
> essentially what I am trying to get output as
>
> <vertex_id,partitionId>,<list >
> where each element of list is of type <vertex_id,partitionId>
>
> can this be achieved in single map-reduce job?
>
> Thanks
> Ravikant
>
>
>
>
> On Thu, Jun 25, 2015 at 9:25 AM, Harshit Mathur <ma...@gmail.com>
> wrote:
>
>> yeah you can store it as well in your custom object like you are storing
>> adjacency list.
>>
>> On Wed, Jun 24, 2015 at 10:10 PM, Ravikant Dindokar <
>> ravikant.iisc@gmail.com> wrote:
>>
>>> Hi Harshit,
>>>
>>> Is there any way to retain the partition id for each vertex in the
>>> adjacency list?
>>>
>>>
>>> Thanks
>>> Ravikant
>>>
>>> On Wed, Jun 24, 2015 at 7:55 PM, Ravikant Dindokar <
>>> ravikant.iisc@gmail.com> wrote:
>>>
>>>> Thanks Harshit
>>>>
>>>> On Wed, Jun 24, 2015 at 5:35 PM, Harshit Mathur <ma...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>> This may be the solution (i hope i understood the problem correctly)
>>>>>
>>>>> Job 1:
>>>>>
>>>>> You need to  have two Mappers one reading from Edge File and the other
>>>>> reading from Partition file.
>>>>> Say, EdgeFileMapper and PartitionFileMapper, and a common Reducer.
>>>>> Now you can have a custom writable (say GraphCustomObject) holding the
>>>>> following,
>>>>> 1)type : a representation of the object coming from which mapper
>>>>> 2)Adjacency vertex list: list of adjacency vertex
>>>>> 3)partiton Id: to hold the partition id
>>>>>
>>>>> Now the output key and value of the EdgeFileMapper will be,
>>>>> key=> vertexId
>>>>> value=> {type=edgefile; adjcencyVertex, partitonid=0(this will not be
>>>>> present in this file)
>>>>>
>>>>> The output of PartitionFileMapper will be,
>>>>> key=>vertexId
>>>>> value=>{type=partitionfile; adjcencyVertex=0, partitonid)
>>>>>
>>>>>
>>>>> So in the Reducer for each VertexId we will can have the complete
>>>>> GraphCustomObject populated.
>>>>> vertexId => {adjcencyVertex complete list, partitonid=0}
>>>>>
>>>>> The output of this reducer will be,
>>>>> key=> partitionId
>>>>> Value=> {adjcencyVertexList, vertexId}
>>>>> This will be the stored as output of job1.
>>>>>
>>>>> Job 2
>>>>> This job will read the output generated in the previous job and use
>>>>> identity Mapper, so in the reducer we will have
>>>>> key=> partitionId
>>>>> value=> list of all the adjacency vertexlist along with vertexid
>>>>>
>>>>>
>>>>>
>>>>> I know my explanation seems a bit messy, sorry for that.
>>>>>
>>>>> BR,
>>>>> Harshit
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jun 24, 2015 at 12:05 PM, Ravikant Dindokar <
>>>>> ravikant.iisc@gmail.com> wrote:
>>>>>
>>>>>> Hi Hadoop user,
>>>>>>
>>>>>> I want to use hadoop for performing operation on graph data
>>>>>> I have two file :
>>>>>>
>>>>>> 1. Edge list file
>>>>>>         This file contains one line for each edge in the graph.
>>>>>> sample:
>>>>>> 1    2 (here 1 is source and 2 is sink node for the edge)
>>>>>> 1    5
>>>>>> 2    3
>>>>>> 4    2
>>>>>> 4    3
>>>>>> 5    6
>>>>>> 5    4
>>>>>> 5    7
>>>>>> 7    8
>>>>>> 8    9
>>>>>> 8    10
>>>>>>
>>>>>> 2. Partition file :
>>>>>>          This file contains one line for each vertex. Each line has
>>>>>> two values first number is <vertex id> and second number is <partition id >
>>>>>>  sample : <vertex id>  <partition id >
>>>>>> 2    1
>>>>>> 3    1
>>>>>> 4    1
>>>>>> 5    2
>>>>>> 6    2
>>>>>> 7    2
>>>>>> 8    1
>>>>>> 9    1
>>>>>> 10    1
>>>>>>
>>>>>>
>>>>>> The Edge list file is having size of 32Gb, while partition file is of
>>>>>> 10Gb.
>>>>>> (size is so large that map/reduce can read only partition file . I
>>>>>> have 20 node cluster with 24Gb memory per node.)
>>>>>>
>>>>>> My aim is to get all vertices (along with their adjacency list
>>>>>> )those  having same partition id in one reducer so that I can perform
>>>>>> further analytics on a given partition in reducer.
>>>>>>
>>>>>> Is there any way in hadoop to get join of these two file in mapper
>>>>>> and so that I can map based on the partition id ?
>>>>>>
>>>>>> Thanks
>>>>>> Ravikant
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Harshit Mathur
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Harshit Mathur
>>
>
>


-- 
Harshit Mathur

Re: Joins in Hadoop

Posted by Ravikant Dindokar <ra...@gmail.com>.
But in the reducer for Job 1, you have:
vertexId => {adjcencyVertex complete list, partitonid=0}

so the partition IDs for the vertices in the adjacency list are not
available. Essentially, what I am trying to get as output is

<vertex_id,partitionId>, <list>
where each element of the list is of type <vertex_id,partitionId>

Can this be achieved in a single MapReduce job?
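To pin down the target, here is that output computed over the sample data from the original mail, in plain Java (a simulation only, not a MapReduce job; the class name is made up, and it assumes the whole vertex-to-partition map fits in memory, which is exactly what does not hold at 10 GB):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionJoinSketch {
    // Joins an edge list with a vertex -> partition map and groups the
    // result by partition id, tagging every adjacency-list entry with
    // its own partition id as well.
    public static Map<Integer, List<String>> group(int[][] edges,
                                                   Map<Integer, Integer> partitionOf) {
        // Build adjacency lists from the edge list (source -> sinks).
        Map<Integer, List<Integer>> adj = new TreeMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
        }

        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> v : adj.entrySet()) {
            Integer pid = partitionOf.get(v.getKey());
            if (pid == null) {
                continue; // vertex absent from the partition file
            }
            // One row per vertex: <vertexId,partitionId>: <adj,adjPartition> ...
            StringBuilder row = new StringBuilder("<" + v.getKey() + "," + pid + ">:");
            for (int u : v.getValue()) {
                row.append(" <").append(u).append(",").append(partitionOf.get(u)).append(">");
            }
            byPartition.computeIfAbsent(pid, k -> new ArrayList<>()).add(row.toString());
        }
        return byPartition;
    }
}
```

On the sample data this puts vertices 2, 4, 8 under partition 1 and vertices 5, 7 under partition 2, with each neighbour carrying its own partition id.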

Thanks
Ravikant




On Thu, Jun 25, 2015 at 9:25 AM, Harshit Mathur <ma...@gmail.com>
wrote:

> yeah you can store it as well in your custom object like you are storing
> adjacency list.
>
> On Wed, Jun 24, 2015 at 10:10 PM, Ravikant Dindokar <
> ravikant.iisc@gmail.com> wrote:
>
>> Hi Harshit,
>>
>> Is there any way to retain the partition id for each vertex in the
>> adjacency list?
>>
>>
>> Thanks
>> Ravikant
>>
>> On Wed, Jun 24, 2015 at 7:55 PM, Ravikant Dindokar <
>> ravikant.iisc@gmail.com> wrote:
>>
>>> Thanks Harshit
>>>
>>> On Wed, Jun 24, 2015 at 5:35 PM, Harshit Mathur <ma...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> This may be the solution (i hope i understood the problem correctly)
>>>>
>>>> Job 1:
>>>>
>>>> You need to  have two Mappers one reading from Edge File and the other
>>>> reading from Partition file.
>>>> Say, EdgeFileMapper and PartitionFileMapper, and a common Reducer.
>>>> Now you can have a custom writable (say GraphCustomObject) holding the
>>>> following,
>>>> 1)type : a representation of the object coming from which mapper
>>>> 2)Adjacency vertex list: list of adjacency vertex
>>>> 3)partiton Id: to hold the partition id
>>>>
>>>> Now the output key and value of the EdgeFileMapper will be,
>>>> key=> vertexId
>>>> value=> {type=edgefile; adjcencyVertex, partitonid=0(this will not be
>>>> present in this file)
>>>>
>>>> The output of PartitionFileMapper will be,
>>>> key=>vertexId
>>>> value=>{type=partitionfile; adjcencyVertex=0, partitonid)
>>>>
>>>>
>>>> So in the Reducer for each VertexId we will can have the complete
>>>> GraphCustomObject populated.
>>>> vertexId => {adjcencyVertex complete list, partitonid=0}
>>>>
>>>> The output of this reducer will be,
>>>> key=> partitionId
>>>> Value=> {adjcencyVertexList, vertexId}
>>>> This will be the stored as output of job1.
>>>>
>>>> Job 2
>>>> This job will read the output generated in the previous job and use
>>>> identity Mapper, so in the reducer we will have
>>>> key=> partitionId
>>>> value=> list of all the adjacency vertexlist along with vertexid
>>>>
>>>>
>>>>
>>>> I know my explanation seems a bit messy, sorry for that.
>>>>
>>>> BR,
>>>> Harshit
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Jun 24, 2015 at 12:05 PM, Ravikant Dindokar <
>>>> ravikant.iisc@gmail.com> wrote:
>>>>
>>>>> Hi Hadoop user,
>>>>>
>>>>> I want to use hadoop for performing operation on graph data
>>>>> I have two file :
>>>>>
>>>>> 1. Edge list file
>>>>>         This file contains one line for each edge in the graph.
>>>>> sample:
>>>>> 1    2 (here 1 is source and 2 is sink node for the edge)
>>>>> 1    5
>>>>> 2    3
>>>>> 4    2
>>>>> 4    3
>>>>> 5    6
>>>>> 5    4
>>>>> 5    7
>>>>> 7    8
>>>>> 8    9
>>>>> 8    10
>>>>>
>>>>> 2. Partition file :
>>>>>          This file contains one line for each vertex. Each line has
>>>>> two values first number is <vertex id> and second number is <partition id >
>>>>>  sample : <vertex id>  <partition id >
>>>>> 2    1
>>>>> 3    1
>>>>> 4    1
>>>>> 5    2
>>>>> 6    2
>>>>> 7    2
>>>>> 8    1
>>>>> 9    1
>>>>> 10    1
>>>>>
>>>>>
>>>>> The Edge list file is having size of 32Gb, while partition file is of
>>>>> 10Gb.
>>>>> (size is so large that map/reduce can read only partition file . I
>>>>> have 20 node cluster with 24Gb memory per node.)
>>>>>
>>>>> My aim is to get all vertices (along with their adjacency list )those
>>>>> having same partition id in one reducer so that I can perform further
>>>>> analytics on a given partition in reducer.
>>>>>
>>>>> Is there any way in hadoop to get join of these two file in mapper and
>>>>> so that I can map based on the partition id ?
>>>>>
>>>>> Thanks
>>>>> Ravikant
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Harshit Mathur
>>>>
>>>
>>>
>>
>
>
> --
> Harshit Mathur
>

Re: Joins in Hadoop

Posted by Harshit Mathur <ma...@gmail.com>.
Yeah, you can store it in your custom object as well, the same way you
are storing the adjacency list.
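A rough sketch of such a value object in plain Java (the names are made up; a real Hadoop version would implement Writable and serialize the pairs in write()/readFields()):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a custom value that holds the adjacency list as
// (vertexId, partitionId) pairs instead of bare vertex ids, so the
// partition id of every neighbour survives into the reducer output.
public class AdjacencyWithPartitions {
    public static final class Entry {
        public final long vertexId;
        public final int partitionId;

        public Entry(long vertexId, int partitionId) {
            this.vertexId = vertexId;
            this.partitionId = partitionId;
        }
    }

    private final List<Entry> adjacency = new ArrayList<>();

    public void add(long vertexId, int partitionId) {
        adjacency.add(new Entry(vertexId, partitionId));
    }

    public List<Entry> entries() {
        return adjacency;
    }
}
```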

On Wed, Jun 24, 2015 at 10:10 PM, Ravikant Dindokar <ravikant.iisc@gmail.com
> wrote:

> Hi Harshit,
>
> Is there any way to retain the partition id for each vertex in the
> adjacency list?
>
>
> Thanks
> Ravikant
>
> On Wed, Jun 24, 2015 at 7:55 PM, Ravikant Dindokar <
> ravikant.iisc@gmail.com> wrote:
>
>> Thanks Harshit
>>
>> On Wed, Jun 24, 2015 at 5:35 PM, Harshit Mathur <ma...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>>
>>> This may be the solution (I hope I understood the problem correctly).
>>>
>>> Job 1:
>>>
>>> You need to have two Mappers, one reading from the Edge file and the
>>> other reading from the Partition file.
>>> Say, EdgeFileMapper and PartitionFileMapper, and a common Reducer.
>>> Now you can have a custom writable (say GraphCustomObject) holding the
>>> following:
>>> 1) type: a flag indicating which mapper the record came from
>>> 2) adjacency vertex list: the list of adjacent vertices
>>> 3) partition id: to hold the partition id
>>>
>>> Now the output key and value of the EdgeFileMapper will be:
>>> key=> vertexId
>>> value=> {type=edgefile; adjacencyVertex; partitionId=0 (not present in
>>> this file)}
>>>
>>> The output of the PartitionFileMapper will be:
>>> key=> vertexId
>>> value=> {type=partitionfile; adjacencyVertex=0; partitionId}
>>>
>>>
>>> So in the Reducer, for each vertexId, we can have the complete
>>> GraphCustomObject populated:
>>> vertexId => {complete adjacencyVertex list, partitionId}
>>>
>>> The output of this reducer will be:
>>> key=> partitionId
>>> value=> {adjacencyVertexList, vertexId}
>>> This will be stored as the output of Job 1.
>>>
>>> Job 2:
>>> This job will read the output generated by the previous job and use an
>>> identity Mapper, so in the reducer we will have:
>>> key=> partitionId
>>> value=> list of all the adjacency vertex lists along with their vertexIds
>>>
>>>
>>>
>>> I know my explanation seems a bit messy, sorry for that.
>>>
>>> BR,
>>> Harshit
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Jun 24, 2015 at 12:05 PM, Ravikant Dindokar <
>>> ravikant.iisc@gmail.com> wrote:
>>>
>>>> Hi Hadoop user,
>>>>
>>>> I want to use Hadoop for performing operations on graph data.
>>>> I have two files:
>>>>
>>>> 1. Edge list file
>>>>         This file contains one line for each edge in the graph.
>>>> sample:
>>>> 1    2 (here 1 is source and 2 is sink node for the edge)
>>>> 1    5
>>>> 2    3
>>>> 4    2
>>>> 4    3
>>>> 5    6
>>>> 5    4
>>>> 5    7
>>>> 7    8
>>>> 8    9
>>>> 8    10
>>>>
>>>> 2. Partition file :
>>>>          This file contains one line for each vertex. Each line has two
>>>> values: the first number is the <vertex id> and the second is the <partition id>.
>>>>  sample : <vertex id>  <partition id >
>>>> 2    1
>>>> 3    1
>>>> 4    1
>>>> 5    2
>>>> 6    2
>>>> 7    2
>>>> 8    1
>>>> 9    1
>>>> 10    1
>>>>
>>>>
>>>> The edge list file is 32 GB, while the partition file is 10 GB.
>>>> (The sizes are so large that map/reduce can read only the partition
>>>> file. I have a 20-node cluster with 24 GB of memory per node.)
>>>>
>>>> My aim is to get all vertices (along with their adjacency lists) that
>>>> have the same partition id in one reducer, so that I can perform
>>>> further analytics on a given partition in the reducer.
>>>>
>>>> Is there any way in Hadoop to join these two files in the mapper so
>>>> that I can map based on the partition id?
>>>>
>>>> Thanks
>>>> Ravikant
>>>>
>>>
>>>
>>>
>>> --
>>> Harshit Mathur
>>>
>>
>>
>


-- 
Harshit Mathur
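[Editor's note] Harshit's answer, keeping the partition id on the custom value object itself, can be sketched the same way in plain Python. `VertexRecord` is a hypothetical stand-in for the custom Writable, and `job2` plays the role of Job 2's identity mapper plus shuffle, grouping every record under its partition id.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class VertexRecord:
    # Stand-in for the custom Writable: the partition id travels with the
    # vertex and its adjacency list, so it is still available in Job 2.
    vertex_id: str
    partition_id: str
    adjacency: list

def job2(job1_output):
    # Identity mapper + shuffle on partition id: every VertexRecord ends up
    # in the reducer call for its own partition.
    by_partition = defaultdict(list)
    for record in job1_output:
        by_partition[record.partition_id].append(record)
    return by_partition

records = [
    VertexRecord("2", "1", ["3"]),
    VertexRecord("4", "1", ["2", "3"]),
    VertexRecord("8", "1", ["9", "10"]),
    VertexRecord("5", "2", ["6", "4", "7"]),
    VertexRecord("7", "2", ["8"]),
]
for partition, recs in sorted(job2(records).items()):
    print(partition, [(r.vertex_id, r.adjacency) for r in recs])
```

Each reducer call then sees one partition's full set of vertex records, with both the adjacency lists and the partition id intact, which is exactly what the per-partition analytics need.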

Re: Joins in Hadoop

Posted by Ravikant Dindokar <ra...@gmail.com>.
Hi Harshit,

Is there any way to retain the partition id for each vertex in the
adjacency list?


Thanks
Ravikant

On Wed, Jun 24, 2015 at 7:55 PM, Ravikant Dindokar <ra...@gmail.com>
wrote:

> Thanks Harshit
>
> On Wed, Jun 24, 2015 at 5:35 PM, Harshit Mathur <ma...@gmail.com>
> wrote:
>
>> Hi,
>>
>>
>> This may be the solution (I hope I understood the problem correctly).
>>
>> Job 1:
>>
>> You need to have two Mappers, one reading from the Edge file and the
>> other reading from the Partition file.
>> Say, EdgeFileMapper and PartitionFileMapper, and a common Reducer.
>> Now you can have a custom writable (say GraphCustomObject) holding the
>> following:
>> 1) type: a flag indicating which mapper the record came from
>> 2) adjacency vertex list: the list of adjacent vertices
>> 3) partition id: to hold the partition id
>>
>> Now the output key and value of the EdgeFileMapper will be:
>> key=> vertexId
>> value=> {type=edgefile; adjacencyVertex; partitionId=0 (not present in
>> this file)}
>>
>> The output of the PartitionFileMapper will be:
>> key=> vertexId
>> value=> {type=partitionfile; adjacencyVertex=0; partitionId}
>>
>>
>> So in the Reducer, for each vertexId, we can have the complete
>> GraphCustomObject populated:
>> vertexId => {complete adjacencyVertex list, partitionId}
>>
>> The output of this reducer will be:
>> key=> partitionId
>> value=> {adjacencyVertexList, vertexId}
>> This will be stored as the output of Job 1.
>>
>> Job 2:
>> This job will read the output generated by the previous job and use an
>> identity Mapper, so in the reducer we will have:
>> key=> partitionId
>> value=> list of all the adjacency vertex lists along with their vertexIds
>>
>>
>>
>> I know my explanation seems a bit messy, sorry for that.
>>
>> BR,
>> Harshit
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Jun 24, 2015 at 12:05 PM, Ravikant Dindokar <
>> ravikant.iisc@gmail.com> wrote:
>>
>>> Hi Hadoop user,
>>>
>>> I want to use Hadoop for performing operations on graph data.
>>> I have two files:
>>>
>>> 1. Edge list file
>>>         This file contains one line for each edge in the graph.
>>> sample:
>>> 1    2 (here 1 is source and 2 is sink node for the edge)
>>> 1    5
>>> 2    3
>>> 4    2
>>> 4    3
>>> 5    6
>>> 5    4
>>> 5    7
>>> 7    8
>>> 8    9
>>> 8    10
>>>
>>> 2. Partition file :
>>>          This file contains one line for each vertex. Each line has two
>>> values: the first number is the <vertex id> and the second is the <partition id>.
>>>  sample : <vertex id>  <partition id >
>>> 2    1
>>> 3    1
>>> 4    1
>>> 5    2
>>> 6    2
>>> 7    2
>>> 8    1
>>> 9    1
>>> 10    1
>>>
>>>
>>> The edge list file is 32 GB, while the partition file is 10 GB.
>>> (The sizes are so large that map/reduce can read only the partition
>>> file. I have a 20-node cluster with 24 GB of memory per node.)
>>>
>>> My aim is to get all vertices (along with their adjacency lists) that
>>> have the same partition id in one reducer, so that I can perform
>>> further analytics on a given partition in the reducer.
>>>
>>> Is there any way in Hadoop to join these two files in the mapper so
>>> that I can map based on the partition id?
>>>
>>> Thanks
>>> Ravikant
>>>
>>
>>
>>
>> --
>> Harshit Mathur
>>
>
>

Re: Joins in Hadoop

Posted by Ravikant Dindokar <ra...@gmail.com>.
Thanks Harshit

On Wed, Jun 24, 2015 at 5:35 PM, Harshit Mathur <ma...@gmail.com>
wrote:

> Hi,
>
>
> This may be the solution (I hope I understood the problem correctly).
>
> Job 1:
>
> You need to have two Mappers, one reading from the Edge file and the
> other reading from the Partition file.
> Say, EdgeFileMapper and PartitionFileMapper, and a common Reducer.
> Now you can have a custom writable (say GraphCustomObject) holding the
> following:
> 1) type: a flag indicating which mapper the record came from
> 2) adjacency vertex list: the list of adjacent vertices
> 3) partition id: to hold the partition id
>
> Now the output key and value of the EdgeFileMapper will be:
> key=> vertexId
> value=> {type=edgefile; adjacencyVertex; partitionId=0 (not present in
> this file)}
>
> The output of the PartitionFileMapper will be:
> key=> vertexId
> value=> {type=partitionfile; adjacencyVertex=0; partitionId}
>
>
> So in the Reducer, for each vertexId, we can have the complete
> GraphCustomObject populated:
> vertexId => {complete adjacencyVertex list, partitionId}
>
> The output of this reducer will be:
> key=> partitionId
> value=> {adjacencyVertexList, vertexId}
> This will be stored as the output of Job 1.
>
> Job 2:
> This job will read the output generated by the previous job and use an
> identity Mapper, so in the reducer we will have:
> key=> partitionId
> value=> list of all the adjacency vertex lists along with their vertexIds
>
>
>
> I know my explanation seems a bit messy, sorry for that.
>
> BR,
> Harshit
>
>
>
>
>
>
>
>
> On Wed, Jun 24, 2015 at 12:05 PM, Ravikant Dindokar <
> ravikant.iisc@gmail.com> wrote:
>
>> Hi Hadoop user,
>>
>> I want to use Hadoop for performing operations on graph data.
>> I have two files:
>>
>> 1. Edge list file
>>         This file contains one line for each edge in the graph.
>> sample:
>> 1    2 (here 1 is source and 2 is sink node for the edge)
>> 1    5
>> 2    3
>> 4    2
>> 4    3
>> 5    6
>> 5    4
>> 5    7
>> 7    8
>> 8    9
>> 8    10
>>
>> 2. Partition file :
>>          This file contains one line for each vertex. Each line has two
>> values: the first number is the <vertex id> and the second is the <partition id>.
>>  sample : <vertex id>  <partition id >
>> 2    1
>> 3    1
>> 4    1
>> 5    2
>> 6    2
>> 7    2
>> 8    1
>> 9    1
>> 10    1
>>
>>
>> The edge list file is 32 GB, while the partition file is 10 GB.
>> (The sizes are so large that map/reduce can read only the partition
>> file. I have a 20-node cluster with 24 GB of memory per node.)
>>
>> My aim is to get all vertices (along with their adjacency lists) that
>> have the same partition id in one reducer, so that I can perform
>> further analytics on a given partition in the reducer.
>>
>> Is there any way in Hadoop to join these two files in the mapper so
>> that I can map based on the partition id?
>>
>> Thanks
>> Ravikant
>>
>
>
>
> --
> Harshit Mathur
>

Re: Joins in Hadoop

Posted by Ravikant Dindokar <ra...@gmail.com>.
Thanks Harshit

On Wed, Jun 24, 2015 at 5:35 PM, Harshit Mathur <ma...@gmail.com>
wrote:

> Hi,
>
>
> This may be the solution (i hope i understood the problem correctly)
>
> Job 1:
>
> You need to  have two Mappers one reading from Edge File and the other
> reading from Partition file.
> Say, EdgeFileMapper and PartitionFileMapper, and a common Reducer.
> Now you can have a custom writable (say GraphCustomObject) holding the
> following,
> 1)type : a representation of the object coming from which mapper
> 2)Adjacency vertex list: list of adjacency vertex
> 3)partiton Id: to hold the partition id
>
> Now the output key and value of the EdgeFileMapper will be,
> key=> vertexId
> value=> {type=edgefile; adjcencyVertex, partitonid=0(this will not be
> present in this file)
>
> The output of PartitionFileMapper will be,
> key=>vertexId
> value=>{type=partitionfile; adjcencyVertex=0, partitonid)
>
>
> So in the Reducer for each VertexId we will can have the complete
> GraphCustomObject populated.
> vertexId => {adjcencyVertex complete list, partitonid=0}
>
> The output of this reducer will be,
> key=> partitionId
> Value=> {adjcencyVertexList, vertexId}
> This will be the stored as output of job1.
>
> Job 2
> This job will read the output generated in the previous job and use
> identity Mapper, so in the reducer we will have
> key=> partitionId
> value=> list of all the adjacency vertexlist along with vertexid
>
>
>
> I know my explanation seems a bit messy, sorry for that.
>
> BR,
> Harshit
>
>
>
>
>
>
>
>
> On Wed, Jun 24, 2015 at 12:05 PM, Ravikant Dindokar <
> ravikant.iisc@gmail.com> wrote:
>
>> Hi Hadoop user,
>>
>> I want to use Hadoop for performing operations on graph data.
>> I have two files:
>>
>> 1. Edge list file
>>         This file contains one line for each edge in the graph.
>> sample:
>> 1    2 (here 1 is source and 2 is sink node for the edge)
>> 1    5
>> 2    3
>> 4    2
>> 4    3
>> 5    6
>> 5    4
>> 5    7
>> 7    8
>> 8    9
>> 8    10
>>
>> 2. Partition file:
>>          This file contains one line for each vertex. Each line has two
>> values: the first number is <vertex id> and the second is <partition id>.
>>  sample : <vertex id>  <partition id >
>> 2    1
>> 3    1
>> 4    1
>> 5    2
>> 6    2
>> 7    2
>> 8    1
>> 9    1
>> 10    1
>>
>>
>> The edge list file is 32 GB, while the partition file is 10 GB.
>> (The sizes are so large that a map/reduce task can read only the
>> partition file. I have a 20-node cluster with 24 GB of memory per node.)
>>
>> My aim is to get all vertices (along with their adjacency lists) that
>> have the same partition id into one reducer, so that I can perform
>> further analytics on a given partition in the reducer.
>>
>> Is there any way in Hadoop to join these two files in the mapper, so
>> that I can map based on the partition id?
>>
>> Thanks
>> Ravikant
>>
>
>
>
> --
> Harshit Mathur
>

Re: Joins in Hadoop

Posted by Harshit Mathur <ma...@gmail.com>.
Hi,


This may be a solution (I hope I understood the problem correctly)

Job 1:

You need two Mappers, one reading the edge file and the other
reading the partition file (say, EdgeFileMapper and PartitionFileMapper),
and a common Reducer.
Define a custom Writable (say GraphCustomObject) holding the
following:
1) type: which mapper the record came from
2) adjacency vertex list: the list of adjacent vertices
3) partition id: the partition id of the vertex

The output key and value of the EdgeFileMapper will be:
key => vertexId
value => {type=edgefile; adjacencyVertex; partitionId=0 (a partition id is
not present in this file)}

The output of the PartitionFileMapper will be:
key => vertexId
value => {type=partitionfile; adjacencyVertex=empty; partitionId}


So in the Reducer, for each vertexId, we can populate the complete
GraphCustomObject:
vertexId => {complete adjacency list, partitionId}

The output of this reducer will be:
key => partitionId
value => {adjacencyVertexList, vertexId}
This will be stored as the output of job 1.
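
The Job 1 logic can be sketched outside Hadoop. The following is a minimal
Python simulation of the tagged reduce-side join using the sample data from
the question; the dict stands in for the shuffle, and the "edgefile" /
"partitionfile" tags play the role of the GraphCustomObject type field
(this is an illustration, not actual Hadoop API code):

```python
from collections import defaultdict

# Sample inputs from the thread: (src, dst) edges and vertex -> partition id.
edges = [(1, 2), (1, 5), (2, 3), (4, 2), (4, 3), (5, 6),
         (5, 4), (5, 7), (7, 8), (8, 9), (8, 10)]
partitions = {2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 1, 9: 1, 10: 1}

# Map phase: each record is tagged with its source file, like the 'type'
# field of the custom Writable; both mappers key by vertexId.
shuffle = defaultdict(list)
for src, dst in edges:                      # EdgeFileMapper
    shuffle[src].append(("edgefile", dst))
for vertex, pid in partitions.items():      # PartitionFileMapper
    shuffle[vertex].append(("partitionfile", pid))

# Reduce phase: for each vertexId, merge the adjacency list with the
# partition id and emit (partitionId, (vertexId, adjacencyList)).
job1_output = []
for vertex, records in shuffle.items():
    adjacency = [v for tag, v in records if tag == "edgefile"]
    pids = [v for tag, v in records if tag == "partitionfile"]
    if pids:  # drop vertices that never appear in the partition file
        job1_output.append((pids[0], (vertex, adjacency)))

print(sorted(job1_output))
```

Note that in the sample data vertex 1 has no entry in the partition file,
so the join drops it; a real job would need to decide how to handle such
vertices.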

Job 2:
This job reads the output generated by the previous job through an
identity Mapper, so in the reducer we will have:
key => partitionId
value => list of all the adjacency lists, along with their vertexIds
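
Job 2 does no real work in the map phase; the framework's grouping by key is
what gathers a whole partition into one reducer call. A small Python sketch,
assuming Job 1 emitted (partitionId, (vertexId, adjacencyList)) records for
the sample data (the job1_output values here are illustrative):

```python
from collections import defaultdict

# Hypothetical Job 1 output for the sample graph.
job1_output = [
    (1, (2, [3])), (1, (3, [])), (1, (4, [2, 3])),
    (2, (5, [6, 4, 7])), (2, (6, [])), (2, (7, [8])),
    (1, (8, [9, 10])), (1, (9, [])), (1, (10, [])),
]

# Identity map: records pass through unchanged; the shuffle groups by
# partitionId, so each reducer sees one whole partition.
by_partition = defaultdict(list)
for pid, vertex_and_adj in job1_output:
    by_partition[pid].append(vertex_and_adj)

for pid in sorted(by_partition):
    # Per-partition analytics would run here, over all of the partition's
    # vertices and their adjacency lists at once.
    print(pid, sorted(by_partition[pid]))
```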



I know my explanation seems a bit messy, sorry for that.

BR,
Harshit








On Wed, Jun 24, 2015 at 12:05 PM, Ravikant Dindokar <ravikant.iisc@gmail.com
> wrote:

> Hi Hadoop user,
>
> I want to use Hadoop for performing operations on graph data.
> I have two files:
>
> 1. Edge list file
>         This file contains one line for each edge in the graph.
> sample:
> 1    2 (here 1 is source and 2 is sink node for the edge)
> 1    5
> 2    3
> 4    2
> 4    3
> 5    6
> 5    4
> 5    7
> 7    8
> 8    9
> 8    10
>
> 2. Partition file:
>          This file contains one line for each vertex. Each line has two
> values: the first number is <vertex id> and the second is <partition id>.
>  sample : <vertex id>  <partition id >
> 2    1
> 3    1
> 4    1
> 5    2
> 6    2
> 7    2
> 8    1
> 9    1
> 10    1
>
>
> The edge list file is 32 GB, while the partition file is 10 GB.
> (The sizes are so large that a map/reduce task can read only the partition
> file. I have a 20-node cluster with 24 GB of memory per node.)
>
> My aim is to get all vertices (along with their adjacency lists) that
> have the same partition id into one reducer, so that I can perform
> further analytics on a given partition in the reducer.
>
> Is there any way in Hadoop to join these two files in the mapper, so
> that I can map based on the partition id?
>
> Thanks
> Ravikant
>



-- 
Harshit Mathur
