Posted to dev@hama.apache.org by Apurv Verma <da...@gmail.com> on 2013/01/06 15:54:07 UTC

Partitioner in Hama

Hey all,
 I found that PartitioningRunner has been removed from the codepath. I
guess this is the right way to make jobs faster.
But in the current scenario, is it possible to have something as
follows? I want all values < some integer to be assigned to peer
index 0, all values in range 0-a to peer index 1, and so on and so
forth.
With the partitioning removed, would I need to use an additional
superstep to do this classification of input records?
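
A minimal sketch of the range-based assignment asked about above, in
plain Java. The RecordPartitioner interface and the RangePartitioner
class are hypothetical names introduced only for illustration; they are
not taken from Hama's source, and the sketch only computes a partition
number. Which peer actually receives a given partition is decided by the
framework and is outside what this code controls.

```java
import java.util.Arrays;

final class RangePartitionDemo {

    // Hypothetical stand-in for a partitioner contract (key, value, task count
    // -> partition number). It is declared here only to keep the sketch
    // self-contained and is not copied from Hama's source.
    interface RecordPartitioner<K, V> {
        int getPartition(K key, V value, int numTasks);
    }

    // Maps integer keys to partitions by comparing them with sorted boundaries.
    static final class RangePartitioner implements RecordPartitioner<Integer, String> {
        private final int[] bounds; // sorted, exclusive upper bounds; assumed distinct

        RangePartitioner(int... bounds) {
            this.bounds = bounds.clone();
            Arrays.sort(this.bounds);
        }

        @Override
        public int getPartition(Integer key, String value, int numTasks) {
            int pos = Arrays.binarySearch(bounds, key);
            // Exact hit on bounds[i]: the bound is exclusive, so the key starts range i + 1.
            // Miss: binarySearch returns -(insertionPoint) - 1, and the insertion
            // point equals the number of bounds below the key, i.e. its range.
            int range = pos >= 0 ? pos + 1 : -(pos + 1);
            // Clamp so the result is always a valid partition number.
            return Math.min(range, numTasks - 1);
        }
    }

    public static void main(String[] args) {
        RangePartitioner partitioner = new RangePartitioner(10, 20);
        // With bounds 10 and 20 and three tasks: 5 -> 0, 10 -> 1, 25 -> 2.
        for (int key : new int[] {5, 10, 25}) {
            System.out.println(key + " -> partition " + partitioner.getPartition(key, "", 3));
        }
    }
}
```

With boundaries 10 and 20, keys below 10 fall in partition 0, keys from
10 up to (but not including) 20 in partition 1, and the rest in
partition 2. Results are clamped to numTasks - 1 so that fewer tasks
than ranges still yields a valid partition number.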


--
Regards,
Apurv Verma

Re: Partitioner in Hama

Posted by "Edward J. Yoon" <ed...@apache.org>.

Sorry, I was confused about the term. ;)

On Wed, Jan 9, 2013 at 3:00 PM, Edward J. Yoon <ed...@apache.org> wrote:
> Let's don't use the term "runtime partitioning" at this time.



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: Partitioner in Hama

Posted by "Edward J. Yoon" <ed...@apache.org>.

Let's not use the term "runtime partitioning" at this time.

Originally,

 * Partitioning was handled by a single client-side 'BSPJobClient'.
 * And there was separate partition-processing logic in
GraphJobRunner, called run-time partitioning.

And now, by using a BSP job to partition the input data, we can process
read and write operations in parallel. Also, data locality is
preserved, at least for read operations. Above all, we can
specify the number of BSP tasks now.

If we want to implement network-based run-time partitioning, it should
be processed internally before BSP's setup() method. I think we can
hold off on run-time partitioning until later.




-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: Partitioner in Hama

Posted by Suraj Menon <su...@apache.org>.

> Keeping run-time (network-based) partitioning within GraphJobRunner is
> not good idea.


It is not. I think I got testSubmitGraph to runtime-partition (in a
preprocessing step) the single file into 2 files in the unit tests in the
current state of my patch.


> >> - the number of splits found are not equal to the number of BSP tasks
> >> configured for the job. OR
>
> I have a question. If the input is unsorted map and I want to
> re-partition by hashing but the numbers of blocks and desired tasks
> are same, then what happens? Do you mean run-time partitioning?

You will have a runtime partitioner class defined and the partitioning flag
on by default. For the HAMA-561 case, a user can switch off partitioning
using the same flag.
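
A minimal sketch of the repartitioning decision as it is listed in this
thread, in plain Java. The method and parameter names are hypothetical
and chosen only for illustration; the real check lives in the HAMA-700
patch and may differ. The sketch assumes that "enabled runtime
partitioning" in the third rule refers to the same
"bsp.input.runtime.partitioning" flag as the second rule.

```java
final class RepartitionDecisionDemo {

    // Hypothetical helper mirroring the three rules quoted in the thread.
    // runtimePartitioningFlag stands for "bsp.input.runtime.partitioning";
    // partitionerClassSet says whether the user configured a partitioner class.
    static boolean shouldRepartition(int numSplits, int numBspTasks,
                                     boolean runtimePartitioningFlag,
                                     boolean partitionerClassSet) {
        // Rule 1: the input splits do not match the configured task count.
        boolean countsDiffer = numSplits != numBspTasks;
        // Rule 3: a partitioner class is set and runtime partitioning is enabled.
        // Under the assumption above this is implied by rule 2; it is kept
        // separate only to mirror the list as written.
        boolean partitionerRequested = partitionerClassSet && runtimePartitioningFlag;
        // Rule 2 is the flag itself.
        return countsDiffer || runtimePartitioningFlag || partitionerRequested;
    }

    public static void main(String[] args) {
        // Equal counts, partitioner set, flag on (the default described above).
        System.out.println(shouldRepartition(2, 2, true, true));
        // Same setup with the flag switched off by the user.
        System.out.println(shouldRepartition(2, 2, false, true));
    }
}
```

Under these assumptions, equal split and task counts still lead to
repartitioning while the flag is on, and switching the flag off skips
it, which matches the answer given above.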




Re: Partitioner in Hama

Posted by "Edward J. Yoon" <ed...@apache.org>.

Keeping run-time (network-based) partitioning within GraphJobRunner is
not a good idea.

>> - the number of splits found are not equal to the number of BSP tasks
>> configured for the job. OR

I have a question. If the input is an unsorted map and I want to
re-partition by hashing, but the numbers of blocks and desired tasks
are the same, then what happens? Do you mean run-time partitioning?

On Wed, Jan 9, 2013 at 7:07 AM, Suraj Menon <su...@apache.org> wrote:
> Hi Apurv, yes, those are pending test cases to be fixed. GraphJobRunner is
> expecting the input in the format of Vertex, but we have input files as
> well as record key, values defined as Text. I have fixed only one unit test
> case yet.
>
> On Tue, Jan 8, 2013 at 4:45 PM, Apurv Verma <da...@gmail.com> wrote:
>
>> Hey all,
>>  I got the problem, the partitioner was not being set for the
>> PartitionerRunner bsp task. :P I have fixed the partitioner with portions
>> from your patch Suraj. Now after this commit partitioner will obey what you
>> specified earlier, just to recapitulate.
>>
>> Repartitioning is done if :
>> - the number of splits found are not equal to the number of BSP tasks
>> configured for the job. OR
>> - the flag is set to true by the user ("bsp.input.runtime.partitioning") OR
>> - user has specified a Runtime Partitioner class and enabled runtime
>> partitioning
>>
>> There was one special thing that I discovered about partitioner , just
>> sharing with you guys. Suppose I implement a partitioner which returns 0
>> for a record, then it isn't necessary that this record will go to peer with
>> index 0. It might go to peer 1. The only certitude which partitioner's
>> provide is that all records returning 0 will go to the same peer. I needed
>> partitioner to work for PrefixSum I was implementing.
>>
>> Things to do next.
>> 1) RecordConverter , which Suraj is implementing in HAMA-700. (Please
>> update Suraj)
>>
>> B.T.W there are problems in mvn test.
>> *java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
>> org.apache.hadoop.io.ArrayWritable*
>> * at
>> org.apache.hama.graph.GraphJobRunner.loadVertices(GraphJobRunner.java:287)*
>> *
>> *
>> I don't think my commit is breaking this.
>>
>> Thanks
>>
>>
>> --
>> Regards,
>> Apurv Verma
>>
>>
>>
>>
>> On Tue, Jan 8, 2013 at 11:07 PM, Suraj Menon <su...@apache.org>
>> wrote:
>>
>> > Please explain the nature of problems you are facing with Partitioner?
>> >
>> > >Any reasons for deciding to move the
>> > > PartitioningJob inside BSPJobClient from BSPJob?
>> >
>> > Twofold, BSPJob was just a configuration holder object, didn't want to
>> > add the partitioning responsibility to the class.
>> > And also I wanted to know the number of splits, before taking the
>> > decision whether to repartition or not.
>> > Repartitioning is done if :
>> > - the number of splits found are not equal to the number of BSP tasks
>> > configured for the job. OR
>> > - the flag is set to true by the user ("bsp.input.runtime.partitioning") OR
>> > - user has specified a Runtime Partitioner class and enabled runtime
>> > partitioning
>> >
>> > Thanks,
>> > Suraj
>> >
>> > On Tue, Jan 8, 2013 at 11:31 AM, Apurv Verma <da...@gmail.com> wrote:
>> >
>> > > Thanks, let me have a careful look at it. On a cursory look, I seem to
>> > > understand the basic idea. Any reasons for deciding to move the
>> > > PartitioningJob inside BSPJobClient from BSPJob?
>> > > BTW the current partitioner doesn't work as intended, only the default
>> > > partitioner HashPartitioner works fine, if I try to put some custom
>> > > partitioner there are problems.
>> > >
>> > > Let's resolve the partitioning completely before the spilling message
>> > > queue.
>> > >
>> > >
>> > > --
>> > > Regards,
>> > > Apurv Verma
>> > >
>> > >
>> > >
>> > >
>> > > On Tue, Jan 8, 2013 at 8:39 PM, Suraj Menon <su...@apache.org>
>> > > wrote:
>> > >
>> > > > Hey Apurv, please check HAMA-700.patch_Jan7. Feel free to provide
>> > > > suggestions or even work on it.
>> > > >
>> > > > Thanks,
>> > > > Suraj
>> > > >
>> > > > On Tue, Jan 8, 2013 at 9:21 AM, Apurv Verma <da...@gmail.com> wrote:
>> > > >
>> > > > > Hey Edward,
>> > > > >  There was a compile bug which i fixed temporarily. isPartitioned was
>> > > > > not being initialized. Could you please check the last commit. I have
>> > > > > currently initialized it to false but I guess this should be configurable.
>> > > > > There was some jira where we wanted partitioning to be skipped if user
>> > > > > thinks his data is already partitioned.
>> > > > >
>> > > > > Thanks again.
>> > > > >
>> > > > >
>> > > > > --
>> > > > > Regards,
>> > > > > Apurv Verma
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Tue, Jan 8, 2013 at 3:44 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>> > > > >
>> > > > > > Thanks, then I'll finish tomorrow. Please feel free to comment there.
>> > > > > >
>> > > > > > On Tue, Jan 8, 2013 at 7:04 PM, Tommaso Teofili
>> > > > > > <to...@gmail.com> wrote:
>> > > > > > > thanks Edward, it looks good.
>> > > > > > > Tommaso
>> > > > > > >
>> > > > > > >
>> > > > > > > 2013/1/8 Edward J. Yoon <ed...@apache.org>
>> > > > > > >
>> > > > > > >> Please review this:
>> > > > > > >>
>> > > > > > >> http://wiki.apache.org/hama/Partitioning
>> > > > > > >>
>> > > > > > >> On Mon, Jan 7, 2013 at 6:17 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>> > > > > > >> > I mean, the pre-partitioning or resizing partitions is really
>> > > > > > >> > important.
>> > > > > > >> >
>> > > > > > >> > On Mon, Jan 7, 2013 at 6:15 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>> > > > > > >> >> This is another talk ...
>> > > > > > >> >>
>> > > > > > >> >> Unlike MapReduce, I think, Hama BSP will handle tasks that input is
>> > > > > > >> >> small in size but large in computational complexity, such as graph,
>> > > > > > >> >> sparse matrix, machine learning algorithms.
>> > > > > > >> >>
>> > > > > > >> >> On Mon, Jan 7, 2013 at 5:57 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>> > > > > > >> >>> Even though the numbers of splits and tasks are the same, user-defined
>> > > > > > >> >>> partitioning job should be run (because it is not only for resizing
>> > > > > > >> >>> partitions. For example, range partitioning of unsorted data set or
>> > > > > > >> >>> hash key partitioning, ..., etc).
>> > > > > > >> >>>
>> > > > > > >> >>> On Mon, Jan 7, 2013 at 5:28 AM, Suraj Menon <surajsmenon@apache.org> wrote:
>> > > > > > >> >>>>>    1. I am referring to org.apache.hama.bsp.PartitioningRunner, it's
>> > > > > > >> >>>>>    named as so in the HEAD (1429573) of trunk. It isn't removed but it
>> > > > > > >> >>>>>    isn't referred to anywhere else. I can't find any references to it
>> > > > > > >> >>>>>    in the workspace.
>> > > > > > >> >>>>>
>> > > > > > >> >>>>
>> > > > > > >> >>>> It is referred in BSPJob#waitForCompletion function as a separate BSP
>> > > > > > >> >>>> job to create the specified splits.
>> > > > > > >> >>>>
>> > > > > > >> >>>>
>> > > > > > >> >>>>>    2. job.setPartitioner is the same as setting
>> > > > > > >> >>>>>    "bsp.input.partitioner.class" . Anyways , So acc. to me partitions
>> > > > > > >> >>>>>    are not being created because of which the following happens.
>> > > > > > >> >>>>>    If I am running the task on local fs and not hdfs, there's just one
>> > > > > > >> >>>>>    input split and even if I set a partitioner to create two
>> > > > > > >> >>>>>    partitions and set bsp.setNumTasks(2) , this is overriden and only
>> > > > > > >> >>>>>    one task is executed.
>> > > > > > >> >>>>>    See BSPJobClient#submitJobInternal()
>> > > > > > >> >>>>>    where it does the following
>> > > > > > >> >>>>>    job.setNumBspTask(writeSplits(job, submitSplitFile, maxTasks));
>> > > > > > >> >>>>>    Line 326.
>> > > > > > >> >>>>>
>> > > > > > >> >>>> This job is set to run if the number of splits != number of Tasks or
>> > > > > > >> >>>> if forced by the configuration. I can share my HAMA-700 current state
>> > > > > > >> >>>> of patch with you.
>> > > > > > >> >>>>
>> > > > > > >> >>>>
>> > > > > > >> >>>>>    3. So here is what I think is happening, Partitioner is not in the
>> > > > > > >> >>>>>    codepath (try putting a breakpoint inside the partitioner and
>> > > > > > >> >>>>>    executing and non graph bsp task), so partitions are not being
>> > > > > > >> >>>>>    created and writeSplits() is returning 1.
>> > > > > > >> >>>>>    [ writeSplits() returns the number of splits in the input. ]
>> > > > > > >> >>>>>
>> > > > > > >> >>>>
>> > > > > > >> >>>> Probably because it is running as a separate process?
>> > > > > > >> >>>
>> > > > > > >> >>>
>> > > > > > >> >>>
>> > > > > > >> >>> --
>> > > > > > >> >>> Best Regards, Edward J. Yoon
>> > > > > > >> >>> @eddieyoon
>> > > > > > >> >>
>> > > > > > >> >>
>> > > > > > >> >>
>> > > > > > >> >> --
>> > > > > > >> >> Best Regards, Edward J. Yoon
>> > > > > > >> >> @eddieyoon
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >> > --
>> > > > > > >> > Best Regards, Edward J. Yoon
>> > > > > > >> > @eddieyoon
>> > > > > > >>
>> > > > > > >>
>> > > > > > >>
>> > > > > > >> --
>> > > > > > >> Best Regards, Edward J. Yoon
>> > > > > > >> @eddieyoon
>> > > > > > >>
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > --
>> > > > > > Best Regards, Edward J. Yoon
>> > > > > > @eddieyoon
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: Partitioner in Hama

Posted by Suraj Menon <su...@apache.org>.

Hi Apurv, yes, those are pending test cases to be fixed. GraphJobRunner is
expecting the input in the format of Vertex, but the input files, as well
as the record keys and values, are defined as Text. I have fixed only one
unit test case so far.

On Tue, Jan 8, 2013 at 4:45 PM, Apurv Verma <da...@gmail.com> wrote:

> Hey all,
>  I got the problem, the partitioner was not being set for the
> PartitionerRunner bsp task. :P I have fixed the partitioner with portions
> from your patch Suraj. Now after this commit partitioner will obey what you
> specified earlier, just to recapitulate.
>
> Repartitioning is done if :
> - the number of splits found are not equal to the number of BSP tasks
> configured for the job. OR
> - the flag is set to true by the user ("bsp.input.runtime.partitioning") OR
> - user has specified a Runtime Partitioner class and enabled runtime
> partitioning
>
> There was one special thing that I discovered about partitioner , just
> sharing with you guys. Suppose I implement a partitioner which returns 0
> for a record, then it isn't necessary that this record will go to peer with
> index 0. It might go to peer 1. The only certitude which partitioner's
> provide is that all records returning 0 will go to the same peer. I needed
> partitioner to work for PrefixSum I was implementing.
>
> Things to do next.
> 1) RecordConverter , which Suraj is implementing in HAMA-700. (Please
> update Suraj)
>
> B.T.W there are problems in mvn test.
> *java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> org.apache.hadoop.io.ArrayWritable*
> * at
> org.apache.hama.graph.GraphJobRunner.loadVertices(GraphJobRunner.java:287)*
> *
> *
> I don't think my commit is breaking this.
>
> Thanks
>
>
> --
> Regards,
> Apurv Verma
>
>
>
>
> On Tue, Jan 8, 2013 at 11:07 PM, Suraj Menon <su...@apache.org>
> wrote:
>
> > Please explain the nature of problems you are facing with Partitioner?
> >
> > >Any reasons for deciding to move the
> > > PartitioningJob inside BSPJobClient from BSPJob?
> >
> > Twofold, BSPJob was just a configuration holder object, didn't want to
> > add the partitioning responsibility to the class.
> > And also I wanted to know the number of splits, before taking the
> > decision whether to repartition or not.
> > Repartitioning is done if :
> > - the number of splits found are not equal to the number of BSP tasks
> > configured for the job. OR
> > - the flag is set to true by the user ("bsp.input.runtime.partitioning") OR
> > - user has specified a Runtime Partitioner class and enabled runtime
> > partitioning
> >
> > Thanks,
> > Suraj
> >
> > On Tue, Jan 8, 2013 at 11:31 AM, Apurv Verma <da...@gmail.com> wrote:
> >
> > > Thanks, let me have a careful look at it. On a cursory look, I seem to
> > > understand the basic idea. Any reasons for deciding to move the
> > > PartitioningJob inside BSPJobClient from BSPJob?
> > > BTW the current partitioner doesn't work as intended, only the default
> > > partitioner HashPartitioner works fine, if I try to put some custom
> > > partitioner there are problems.
> > >
> > > Let's resolve the partitioning completely before the spilling message
> > > queue.
> > >
> > >
> > > --
> > > Regards,
> > > Apurv Verma
> > >
> > >
> > >
> > >
> > > On Tue, Jan 8, 2013 at 8:39 PM, Suraj Menon <su...@apache.org>
> > > wrote:
> > >
> > > > Hey Apurv, please check HAMA-700.patch_Jan7. Feel free to provide
> > > > suggestions or even work on it.
> > > >
> > > > Thanks,
> > > > Suraj
> > > >
> > > > > On Tue, Jan 8, 2013 at 9:21 AM, Apurv Verma <da...@gmail.com> wrote:
> > > > >
> > > > > > Hey Edward,
> > > > > >  There was a compile bug which i fixed temporarily. isPartitioned was
> > > > > > not being initialized. Could you please check the last commit. I have
> > > > > > currently initialized it to false but I guess this should be configurable.
> > > > > > There was some jira where we wanted partitioning to be skipped if user
> > > > > > thinks his data is already partitioned.
> > > > >
> > > > > Thanks again.
> > > > >
> > > > >
> > > > > --
> > > > > Regards,
> > > > > Apurv Verma
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > On Tue, Jan 8, 2013 at 3:44 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> > > > > >
> > > > > > > Thanks, then I'll finish tomorrow. Please feel free to comment there.
> > > > > >
> > > > > > On Tue, Jan 8, 2013 at 7:04 PM, Tommaso Teofili
> > > > > > <to...@gmail.com> wrote:
> > > > > > > thanks Edward, it looks good.
> > > > > > > Tommaso
> > > > > > >
> > > > > > >
> > > > > > > 2013/1/8 Edward J. Yoon <ed...@apache.org>
> > > > > > >
> > > > > > >> Please review this:
> > > > > > >>
> > > > > > >> http://wiki.apache.org/hama/Partitioning
> > > > > > >>
> > > > > > > >> On Mon, Jan 7, 2013 at 6:17 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> > > > > > > >> > I mean, the pre-partitioning or resizing partitions is really
> > > > > > > >> > important.
> > > > > > > >> >
> > > > > > > >> > On Mon, Jan 7, 2013 at 6:15 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> > > > > > > >> >> This is another talk ...
> > > > > > > >> >>
> > > > > > > >> >> Unlike MapReduce, I think, Hama BSP will handle tasks that input is
> > > > > > > >> >> small in size but large in computational complexity, such as graph,
> > > > > > > >> >> sparse matrix, machine learning algorithms.
> > > > > > > >> >>
> > > > > > > >> >> On Mon, Jan 7, 2013 at 5:57 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
> > > > > > > >> >>> Even though the numbers of splits and tasks are the same, user-defined
> > > > > > > >> >>> partitioning job should be run (because it is not only for resizing
> > > > > > > >> >>> partitions. For example, range partitioning of unsorted data set or
> > > > > > > >> >>> hash key partitioning, ..., etc).
> > > > > > >> >>>
> > > > > > >> >>> On Mon, Jan 7, 2013 at 5:28 AM, Suraj Menon <
> > > > > surajsmenon@apache.org
> > > > > > >
> > > > > > >> wrote:
> > > > > > >> >>>>>    1. I am referring to
> > > > org.apache.hama.bsp.PartitioningRunner,
> > > > > > it's
> > > > > > >> named
> > > > > > >> >>>>>    as so in the HEAD (1429573) of trunk. It isn't
> removed
> > > but
> > > > it
> > > > > > >> isn't
> > > > > > >> >>>>>    referred to anywhere else. I can't find any
> references
> > to
> > > > it
> > > > > in
> > > > > > >> the
> > > > > > >> >>>>>    workspace.
> > > > > > >> >>>>>
> > > > > > >> >>>>
> > > > > > >> >>>> It is referred in BSPJob#waitForCompletion function as a
> > > > separate
> > > > > > BSP
> > > > > > >> job
> > > > > > >> >>>> to create the specified splits.
> > > > > > >> >>>>
> > > > > > >> >>>>
> > > > > > >> >>>>>    2. job.setPartitioner is the same as setting
> > > > > > >> >>>>>    "bsp.input.partitioner.class" . Anyways , So acc. to
> me
> > > > > > >> partitions are
> > > > > > >> >>>>> not
> > > > > > >> >>>>>    being created because of which the following happens.
> > > > > > >> >>>>>    If I am running the task on local fs and not hdfs,
> > > there's
> > > > > just
> > > > > > >> one
> > > > > > >> >>>>>    input split and even if I set a partitioner to create
> > two
> > > > > > >> partitions and
> > > > > > >> >>>>>    set bsp.setNumTasks(2) , this is overriden and only
> one
> > > > task
> > > > > is
> > > > > > >> >>>>> executed.
> > > > > > >> >>>>>    See BSPJobClient#submitJobInternal()
> > > > > > >> >>>>>    where it does the following
> > > > > > >> >>>>>    job.setNumBspTask(writeSplits(job, submitSplitFile,
> > > > > maxTasks));
> > > > > > >> Line
> > > > > > >> >>>>>    326.
> > > > > > >> >>>>>
> > > > > > >> >>>>> This job is set to run if the number of splits != number
> > of
> > > > > Tasks
> > > > > > or
> > > > > > >> if
> > > > > > >> >>>> forced by the configuration. I can share my HAMA-700
> > current
> > > > > state
> > > > > > of
> > > > > > >> patch
> > > > > > >> >>>> with you.
> > > > > > >> >>>>
> > > > > > >> >>>>
> > > > > > >> >>>>>    3. So here is what I think is happening, Partitioner
> is
> > > not
> > > > > in
> > > > > > the
> > > > > > >> >>>>>    codepath (try putting a breakpoint inside the
> > partitioner
> > > > and
> > > > > > >> executing
> > > > > > >> >>>>> and
> > > > > > >> >>>>>    non graph bsp task), so partitions are not being
> > created
> > > > and
> > > > > > >> >>>>> writeSplits()
> > > > > > >> >>>>>    is returning 1.
> > > > > > >> >>>>>    [ writeSplits() returns the number of splits in the
> > > input.
> > > > ]
> > > > > > >> >>>>>
> > > > > > >> >>>>
> > > > > > >> >>>> Probably because it is running as a separate process?
> > > > > > >> >>>
> > > > > > >> >>>
> > > > > > >> >>>
> > > > > > >> >>> --
> > > > > > >> >>> Best Regards, Edward J. Yoon
> > > > > > >> >>> @eddieyoon
> > > > > > >> >>
> > > > > > >> >>
> > > > > > >> >>
> > > > > > >> >> --
> > > > > > >> >> Best Regards, Edward J. Yoon
> > > > > > >> >> @eddieyoon
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > --
> > > > > > >> > Best Regards, Edward J. Yoon
> > > > > > >> > @eddieyoon
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> --
> > > > > > >> Best Regards, Edward J. Yoon
> > > > > > >> @eddieyoon
> > > > > > >>
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best Regards, Edward J. Yoon
> > > > > > @eddieyoon
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Partitioner in Hama

Posted by Apurv Verma <da...@gmail.com>.

Hey all,
 I found the problem: the partitioner was not being set for the
PartitioningRunner BSP task. :P I have fixed this using portions of your
patch, Suraj. After this commit the partitioner obeys the rules you
specified earlier. To recapitulate:

Repartitioning is done if:
- the number of splits found is not equal to the number of BSP tasks
configured for the job, OR
- the flag is set to true by the user ("bsp.input.runtime.partitioning"), OR
- the user has specified a Runtime Partitioner class and enabled runtime
partitioning.
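
The three conditions above can be read as one boolean check. Here is a
minimal sketch in plain Java, not the actual Hama code: the class name,
method name, and parameters are hypothetical, and I am assuming the three
conditions are simply OR-ed together as the list suggests.

```java
// Sketch only: names here are hypothetical and are not part of Hama's API.
final class RepartitionDecision {
    private RepartitionDecision() {}

    /** Returns true when any one of the three listed conditions holds. */
    static boolean shouldRepartition(int numSplits, int numBspTasks,
                                     boolean forcedByFlag,
                                     boolean hasRuntimePartitioner,
                                     boolean runtimePartitioningEnabled) {
        // Condition 1: split count differs from the configured task count.
        boolean splitCountMismatch = numSplits != numBspTasks;
        // Condition 3: a partitioner class is set AND runtime partitioning is on.
        boolean customPartitionerRequested =
                hasRuntimePartitioner && runtimePartitioningEnabled;
        // Condition 2 is the user-set flag, passed in as forcedByFlag.
        return splitCountMismatch || forcedByFlag || customPartitionerRequested;
    }

    public static void main(String[] args) {
        // One split on the local fs but two tasks requested: prints true.
        System.out.println(shouldRepartition(1, 2, false, false, false));
        // Counts match and nothing is forced: prints false.
        System.out.println(shouldRepartition(2, 2, false, false, false));
    }
}
```

Note that the split count has to be known before this check can run, so it
can only be evaluated once the input splits have been computed.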

One thing I discovered about the partitioner that is worth sharing: if I
implement a partitioner that returns 0 for a record, that record will not
necessarily go to the peer with index 0. It might go to peer 1. The only
guarantee the partitioner provides is that all records for which it returns
0 go to the same peer. I needed the partitioner to work for the PrefixSum I
was implementing.
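
To make that concrete, here is a rough sketch of the kind of range
partitioning a PrefixSum job might want. It is plain Java and does not
implement Hama's Partitioner interface; the class name, the method
partitionFor, and the bounds array are all made up for illustration. Given
the observation above, the returned number should be treated as a grouping
key rather than as a guaranteed peer index.

```java
import java.util.Arrays;

// Sketch only: a standalone range-partition function with hypothetical names.
final class RangePartitionSketch {
    private RangePartitionSketch() {}

    /**
     * Maps a value to a partition ID using strictly increasing upper bounds.
     * Values below bounds[0] get 0, values in [bounds[0], bounds[1]) get 1,
     * and so on; values at or above the last bound get bounds.length.
     */
    static int partitionFor(int value, int[] sortedUpperBounds) {
        int pos = Arrays.binarySearch(sortedUpperBounds, value);
        // When the key is absent, binarySearch returns -(insertionPoint) - 1.
        // When it is present, the value equals a bound and belongs to the
        // next range, hence pos + 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        int[] bounds = {10, 20}; // three ranges: <10, 10..19, >=20
        for (int v : new int[] {5, 10, 15, 20, 25}) {
            System.out.println(v + " -> partition " + partitionFor(v, bounds));
        }
    }
}
```

All records that map to the same partition ID end up together, which is the
only property the sketch relies on. Which peer actually receives partition 0
is left open, as described above.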

Things to do next:
1) RecordConverter, which Suraj is implementing in HAMA-700. (Suraj, please
post an update.)

BTW, there are problems in mvn test:
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
org.apache.hadoop.io.ArrayWritable
 at org.apache.hama.graph.GraphJobRunner.loadVertices(GraphJobRunner.java:287)

I don't think my commit is causing this.

Thanks


--
Regards,
Apurv Verma




On Tue, Jan 8, 2013 at 11:07 PM, Suraj Menon <su...@apache.org> wrote:

> Please explain the nature of problems you are facing with Partitioner?
>
> >Any reasons for deciding to move the
> > PartitioningJob inside BSPJobClient from BSPJob?
>
> Twofold, BSPJob was just a configuration holder object, didn't want to add
> the partitioning responsibility to the class.
> And also I wanted to know the number of splits, before taking the decision
> whether to repartition or not.
> Repartitioning is done if :
> - the number of splits found are not equal to the number of BSP tasks
> configured for the job. OR
> - the flag is set to true by the user ("bsp.input.runtime.partitioning") OR
> - user has specified a Runtime Partitioner class and enabled runtime
> partitioning
>
> Thanks,
> Suraj

Re: Partitioner in Hama

Posted by Suraj Menon <su...@apache.org>.

Could you explain the nature of the problems you are facing with the
Partitioner?

>Any reasons for deciding to move the
> PartitioningJob inside BSPJobClient from BSPJob?

Two reasons. BSPJob was just a configuration holder object, and I didn't
want to add the partitioning responsibility to that class. I also wanted to
know the number of splits before deciding whether to repartition.
Repartitioning is done if:
- the number of splits found is not equal to the number of BSP tasks
configured for the job, OR
- the flag is set to true by the user ("bsp.input.runtime.partitioning"), OR
- the user has specified a Runtime Partitioner class and enabled runtime
partitioning.

Thanks,
Suraj

On Tue, Jan 8, 2013 at 11:31 AM, Apurv Verma <da...@gmail.com> wrote:

> Thanks, let me have a careful look at it. On a cursory look, I seem to
> understand the basic idea. Any reasons for deciding to move the
> PartitioningJob inside BSPJobClient from BSPJob?
> BTW the current partitioner doesn't work as intended, only the default
> partitioner HashPartitioner works fine, if I try to put some custom
> partitioner there are problems.
>
> Let's resolve the partitioning completely before the spilling message
> queue.
>
>
> --
> Regards,
> Apurv Verma

Re: Partitioner in Hama

Posted by Apurv Verma <da...@gmail.com>.

Thanks, let me have a careful look at it. From a cursory look, I think I
understand the basic idea. Was there a reason for moving the
PartitioningJob from BSPJob into BSPJobClient?
BTW, the current partitioner doesn't work as intended: only the default
partitioner, HashPartitioner, works fine. If I plug in a custom partitioner,
there are problems.

Let's resolve partitioning completely before moving on to the spilling
message queue.


--
Regards,
Apurv Verma




On Tue, Jan 8, 2013 at 8:39 PM, Suraj Menon <su...@apache.org> wrote:

> Hey Apurv, please check HAMA-700.patch_Jan7. Feel free to provide
> suggestions or even work on it.
>
> Thanks,
> Suraj

Re: Partitioner in Hama

Posted by Suraj Menon <su...@apache.org>.

Hey Apurv, please check HAMA-700.patch_Jan7. Feel free to provide
suggestions or even work on it.

Thanks,
Suraj

On Tue, Jan 8, 2013 at 9:21 AM, Apurv Verma <da...@gmail.com> wrote:

> Hey Edward,
>  There was a compile bug which i fixed temporarily. isPartitioned was not
> being initialized. Could you please check the last commit. I have currently
> initialized it to false but I guess this should be configurable.
> There was some jira where we wanted partitioning to be skipped if user
> thinks his data is already partitioned.
>
> Thanks again.
>
>
> --
> Regards,
> Apurv Verma

Re: Partitioner in Hama

Posted by Apurv Verma <da...@gmail.com>.

Hey Edward,
 There was a compile bug, which I fixed temporarily: isPartitioned was not
being initialized. Could you please check the last commit? I have currently
initialized it to false, but I guess this should be configurable.
There was a JIRA issue where we wanted partitioning to be skipped if the
user considers the data already partitioned.
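
To illustrate what "configurable" could mean here, a small sketch in plain
Java. It is not the committed code: the configuration is stood in for by a
Map, and the key name and class name are hypothetical. The flag defaults to
false when the key is absent, which matches the temporary fix.

```java
import java.util.Map;

// Sketch only: the key and class below are hypothetical, not Hama's.
final class PartitionedFlagSketch {
    // Made-up key, used purely for this illustration.
    static final String ALREADY_PARTITIONED_KEY = "example.input.already.partitioned";

    private final boolean isPartitioned;

    PartitionedFlagSketch(Map<String, String> conf) {
        // Boolean.parseBoolean(null) is false, so a missing key means
        // "not partitioned" and the field is always initialized.
        this.isPartitioned = Boolean.parseBoolean(conf.get(ALREADY_PARTITIONED_KEY));
    }

    boolean isPartitioned() {
        return isPartitioned;
    }
}
```

With something like this, a user who knows the input is already partitioned
could set the key to "true" and the partitioning step could be skipped.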

Thanks again.


--
Regards,
Apurv Verma




On Tue, Jan 8, 2013 at 3:44 PM, Edward J. Yoon <ed...@apache.org> wrote:

> Thanks, then I'll finish tomorrow. Please feel free to comment there.

Re: Partitioner in Hama

Posted by "Edward J. Yoon" <ed...@apache.org>.

Thanks, then I'll finish tomorrow. Please feel free to comment there.




-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: Partitioner in Hama

Posted by Tommaso Teofili <to...@gmail.com>.

thanks Edward, it looks good.
Tommaso



Re: Partitioner in Hama

Posted by "Edward J. Yoon" <ed...@apache.org>.

Please review this:

http://wiki.apache.org/hama/Partitioning




-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: Partitioner in Hama

Posted by "Edward J. Yoon" <ed...@apache.org>.

I mean, pre-partitioning or resizing the partitions is really important.




-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: Partitioner in Hama

Posted by "Edward J. Yoon" <ed...@apache.org>.

This is a separate topic ...

Unlike MapReduce, I think Hama BSP will handle tasks whose input is
small in size but large in computational complexity, such as graph,
sparse matrix, and machine learning algorithms.




-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: Partitioner in Hama

Posted by "Edward J. Yoon" <ed...@apache.org>.

Even when the numbers of splits and tasks are the same, a user-defined
partitioning job should still be run, because partitioning is not only for
resizing partitions (for example, range partitioning of an unsorted data set,
hash key partitioning, etc.).
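
To make the two schemes mentioned above concrete, here is a minimal,
self-contained sketch. The class and method names (PartitionSchemes,
rangePartition, hashPartition) and the boundary values are hypothetical and
are not part of Hama's API; the sketch only shows how a key could be mapped
to a peer index under each scheme.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Hama's API.
final class PartitionSchemes {

    // Range partitioning. sortedBounds must be ascending and distinct.
    // Keys below sortedBounds[0] go to peer 0, keys in
    // [sortedBounds[0], sortedBounds[1]) go to peer 1, and so on,
    // so there are sortedBounds.length + 1 peers in total.
    static int rangePartition(int key, int[] sortedBounds) {
        int pos = Arrays.binarySearch(sortedBounds, key);
        // Found at index i: the key equals a boundary and belongs to peer i + 1.
        // Not found: binarySearch returns -(insertionPoint) - 1, and the
        // insertion point is exactly the peer index.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    // Hash key partitioning. numTasks must be positive. The sign bit is
    // masked off so the result is never negative.
    static int hashPartition(Object key, int numTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numTasks;
    }

    public static void main(String[] args) {
        int[] bounds = {10, 20}; // assumed boundaries, three peers
        for (int key : new int[] {5, 10, 15, 25}) {
            System.out.println(key + " -> range peer " + rangePartition(key, bounds)
                + ", hash peer " + hashPartition(key, 3));
        }
    }
}
```

Neither scheme depends on how many splits the input already has, which is
why the partitioning job may be needed even when the split and task counts
match.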




-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: Partitioner in Hama

Posted by Suraj Menon <su...@apache.org>.

>    1. I am referring to org.apache.hama.bsp.PartitioningRunner; that is its
>    name in the HEAD (1429573) of trunk. It hasn't been removed, but it isn't
>    referenced anywhere else. I can't find any references to it in the
>    workspace.
>

It is referenced in BSPJob#waitForCompletion, where it runs as a separate BSP
job to create the specified splits.


>    2. job.setPartitioner is the same as setting
>    "bsp.input.partitioner.class". Anyway, as I see it, partitions are not
>    being created, and as a result the following happens.
>    If I run the task on the local fs and not hdfs, there's just one
>    input split, and even if I set a partitioner to create two partitions and
>    set bsp.setNumTasks(2), this is overridden and only one task is executed.
>    See BSPJobClient#submitJobInternal(), which does the following at line 326:
>    job.setNumBspTask(writeSplits(job, submitSplitFile, maxTasks));
>

The partitioning job is set to run if the number of splits != the number of
tasks, or if forced by the configuration. I can share the current state of my
HAMA-700 patch with you.
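
As a rough sketch of the condition described above (the names
PartitioningDecision and shouldRunPartitioningJob are hypothetical; this is
not the code in the HAMA-700 patch), the check could look like this:

```java
// Hypothetical sketch of the rule stated above; not Hama's actual code.
final class PartitioningDecision {

    // Run the separate partitioning job when the input splits do not line up
    // with the requested number of BSP tasks, or when the configuration
    // forces it regardless of the counts.
    static boolean shouldRunPartitioningJob(int numSplits, int numTasks, boolean forced) {
        return forced || numSplits != numTasks;
    }

    public static void main(String[] args) {
        // One split on the local fs, two tasks requested: job should run.
        System.out.println(shouldRunPartitioningJob(1, 2, false));
        // Counts match and nothing is forced: job is skipped.
        System.out.println(shouldRunPartitioningJob(2, 2, false));
        // Counts match but the configuration forces partitioning.
        System.out.println(shouldRunPartitioningJob(2, 2, true));
    }
}
```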


>    3. So here is what I think is happening: the Partitioner is not in the
>    codepath (try putting a breakpoint inside the partitioner and executing
>    a non-graph BSP task), so partitions are not being created and
>    writeSplits() is returning 1.
>    [ writeSplits() returns the number of splits in the input. ]
>

Probably because it is running as a separate process?

Re: Partitioner in Hama

Posted by Apurv Verma <da...@gmail.com>.

   1. I am referring to org.apache.hama.bsp.PartitioningRunner; that is its
   name in the HEAD (1429573) of trunk. It hasn't been removed, but it isn't
   referenced anywhere else. I can't find any references to it in the
   workspace.
   2. job.setPartitioner is the same as setting
   "bsp.input.partitioner.class". Anyway, as I see it, partitions are not
   being created, and as a result the following happens.
   If I run the task on the local fs and not hdfs, there's just one
   input split, and even if I set a partitioner to create two partitions and
   set bsp.setNumTasks(2), this is overridden and only one task is executed.
   See BSPJobClient#submitJobInternal(), which does the following at line 326:
   job.setNumBspTask(writeSplits(job, submitSplitFile, maxTasks));

   3. So here is what I think is happening: the Partitioner is not in the
   codepath (try putting a breakpoint inside the partitioner and executing
   a non-graph BSP task), so partitions are not being created and
   writeSplits() is returning 1.
   [ writeSplits() returns the number of splits in the input. ]
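
The override described in points 2 and 3 can be modelled with a small
stand-alone sketch. JobSketch and its writeSplits(int) are hypothetical
stand-ins, not the real BSPJob or BSPJobClient; the only behaviour taken from
the description above is that the task count is replaced by whatever
writeSplits() returns.

```java
// Simplified, hypothetical model of the behaviour described above.
final class JobSketch {
    private int numBspTask;

    void setNumBspTask(int n) {
        numBspTask = n;
    }

    int getNumBspTask() {
        return numBspTask;
    }

    // Stand-in for writeSplits(job, submitSplitFile, maxTasks): here it just
    // reports the number of splits found in the input.
    static int writeSplits(int splitsInInput) {
        return splitsInInput;
    }

    public static void main(String[] args) {
        JobSketch job = new JobSketch();
        job.setNumBspTask(2);                        // what the user asked for
        job.setNumBspTask(JobSketch.writeSplits(1)); // what submission does
        System.out.println("tasks executed: " + job.getNumBspTask());
    }
}
```

With a single input split the requested value of 2 is lost, which matches
the single task I am seeing.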

--
Regards,
Apurv Verma


Re: Partitioner in Hama

Posted by Suraj Menon <su...@apache.org>.

Are you referring to org.apache.hama.bsp.PartitionRunner? I don't see a
commit removing the class.
PartitionRunner is designed to be a Hama job in itself that creates the
expected splits before the submitted job starts.
You can plug in your own Partitioner through the config key
"bsp.input.partitioner.class". Hopefully that answers your question.
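
For illustration, a pluggable partitioner selected through that key might be
sketched as follows. Everything except the key string is an assumption:
SketchPartitioner is a simplified stand-in for Hama's Partitioner interface,
ThresholdPartitioner and its THRESHOLD value are made up, and
java.util.Properties stands in for the real job configuration.

```java
import java.util.Properties;

// Simplified stand-in for a partitioner interface; hypothetical.
interface SketchPartitioner {
    int getPartition(int key, int numTasks);
}

// Hypothetical partitioner: values below THRESHOLD go to peer 0, the rest to peer 1.
final class ThresholdPartitioner implements SketchPartitioner {
    static final int THRESHOLD = 100; // made-up cut-off

    @Override
    public int getPartition(int key, int numTasks) {
        int peer = key < THRESHOLD ? 0 : 1;
        // Never return a peer index the job does not have.
        return Math.min(peer, numTasks - 1);
    }
}

// Looks up the partitioner class by name from the configuration.
final class PartitionerLoader {
    static final String KEY = "bsp.input.partitioner.class";

    static SketchPartitioner load(Properties conf) {
        String className = conf.getProperty(KEY);
        try {
            return (SketchPartitioner) Class.forName(className)
                .getDeclaredConstructor()
                .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load partitioner " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(KEY, ThresholdPartitioner.class.getName());
        SketchPartitioner p = load(conf);
        System.out.println("5 -> peer " + p.getPartition(5, 2));
        System.out.println("500 -> peer " + p.getPartition(500, 2));
    }
}
```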

I am trying to make things backward compatible [HAMA-700], but I am facing
some problems. The goal is to have runtime partitioning of graphs done by
PartitionRunner itself.

-Suraj
