You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hama.apache.org by "Edward J. Yoon" <ed...@apache.org> on 2012/12/07 01:30:52 UTC

runtimePartitioning in GraphJobRunner

In fact, there's no choice but to use runtimePartitioning (because of
VertexInputReader). Right? If so, I would like to delete all "if
(runtimePartitioning) {" conditions.

-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: runtimePartitioning in GraphJobRunner

Posted by "Edward J. Yoon" <ed...@apache.org>.

Any other opinions?

On Mon, Dec 10, 2012 at 9:34 PM, Edward J. Yoon <ed...@apache.org> wrote:
>> Just wanted to remind you why we introduced runtime partitioning.
>
> Sorry that I could not review your patch of HAMA-531 and many things
> of Hama 0.5 release. I was busy.
>
> On Mon, Dec 10, 2012 at 8:47 PM, Thomas Jungblut
> <th...@gmail.com> wrote:
>> Just wanted to remind you why we introduced runtime partitioning.
>>
>> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>>
>>> HDFS is common. It's not tunable for only Hama BSP computing.
>>>
>>> > Yes, so spilling on disk is the easiest solution to save memory. Not
>>> > changing the partitioning.
>>> > If you want to split again through the block boundaries to distribute the
>>> > data through the cluster, then do it, but this is plainly wrong.
>>>
>>> Vertex load balancing is basically uses Hash partitioner. You can't
>>> avoid data transfers.
>>>
>>> Again...,
>>>
>>> VertexInputReader and runtime partitioning make code complex as I
>>> mentioned above.
>>>
>>> > This reader is needed, so people can create vertices from their own
>>> fileformat.
>>>
>>> I don't think so. Instead of VertexInputReader, we can provide <K
>>> extends WritableComparable, V extends ArrayWritable>.
>>>
>>> Let's assume that there's a web table in Google's BigTable (HBase).
>>> User can create their own WebTableInputFormatter to read records as a
>>> <Text url, TextArrayWritable anchors>. Am I wrong?
>>>
>>> On Mon, Dec 10, 2012 at 8:21 PM, Thomas Jungblut
>>> <th...@gmail.com> wrote:
>>> > Yes, because changing the blocksize to 32m will just use 300mb of memory,
>>> > so you can add more machines to fit the number of resulting tasks.
>>> >
>>> > If each node have small memory, there's no way to process in memory
>>> >
>>> >
>>> > Yes, so spilling on disk is the easiest solution to save memory. Not
>>> > changing the partitioning.
>>> > If you want to split again through the block boundaries to distribute the
>>> > data through the cluster, then do it, but this is plainly wrong.
>>> >
>>> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>>> >
>>> >> > A Hama cluster is scalable. It means that the computing capacity
>>> >> >> should be increased by adding slaves. Right?
>>> >> >
>>> >> >
>>> >> > I'm sorry, but I don't see how this relates to the vertex input
>>> reader.
>>> >>
>>> >> Not related with input reader. It related with partitioning and load
>>> >> balancing. As I reported to you before, to process vertices within
>>> >> 256MB block, each TaskRunner requied 25~30GB memory.
>>> >>
>>> >> If each node have small memory, there's no way to process in memory
>>> >> without changing block size of HDFS.
>>> >>
>>> >> Do you think this is scalable?
>>> >>
>>> >> On Mon, Dec 10, 2012 at 7:59 PM, Thomas Jungblut
>>> >> <th...@gmail.com> wrote:
>>> >> > Oh okay, so if you want to remove that, have a lot of fun. This
>>> reader is
>>> >> > needed, so people can create vertices from their own fileformat.
>>> >> > Going back to a sequencefile input will not only break backward
>>> >> > compatibility but also make the same issues we had before.
>>> >> >
>>> >> > A Hama cluster is scalable. It means that the computing capacity
>>> >> >> should be increased by adding slaves. Right?
>>> >> >
>>> >> >
>>> >> > I'm sorry, but I don't see how this relates to the vertex input
>>> reader.
>>> >> >
>>> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>>> >> >
>>> >> >> A Hama cluster is scalable. It means that the computing capacity
>>> >> >> should be increased by adding slaves. Right?
>>> >> >>
>>> >> >> As I mentioned before, disk-queue and storing vertices on local disk
>>> >> >> are not urgent.
>>> >> >>
>>> >> >> In short, yeah, I wan to remove VertexInputReader and runtime
>>> >> >> partition in Graph package.
>>> >> >>
>>> >> >> See also,
>>> >> >>
>>> >>
>>> https://issues.apache.org/jira/browse/HAMA-531?focusedCommentId=13527756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13527756
>>> >> >>
>>> >> >> On Mon, Dec 10, 2012 at 7:31 PM, Thomas Jungblut
>>> >> >> <th...@gmail.com> wrote:
>>> >> >> > uhm, I have no idea what you want to archieve, do you want to get
>>> >> back to
>>> >> >> > client-side partitioning?
>>> >> >> >
>>> >> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>>> >> >> >
>>> >> >> >> If there's no opinion, I'll remove VertexInputReader in
>>> >> >> >> GraphJobRunner, because it make code complex. Let's consider again
>>> >> >> >> about the VertexInputReader, after fixing HAMA-531 and HAMA-632
>>> >> >> >> issues.
>>> >> >> >>
>>> >> >> >> On Fri, Dec 7, 2012 at 9:35 AM, Edward J. Yoon <
>>> >> edwardyoon@apache.org>
>>> >> >> >> wrote:
>>> >> >> >> > Or, I'd like to get rid of VertexInputReader.
>>> >> >> >> >
>>> >> >> >> > On Fri, Dec 7, 2012 at 9:30 AM, Edward J. Yoon <
>>> >> edwardyoon@apache.org
>>> >> >> >
>>> >> >> >> wrote:
>>> >> >> >> >> In fact, there's no choice but to use runtimePartitioning
>>> >> (because of
>>> >> >> >> >> VertexInputReader). Right? If so, I would like to delete all
>>> "if
>>> >> >> >> >> (runtimePartitioning) {" conditions.
>>> >> >> >> >>
>>> >> >> >> >> --
>>> >> >> >> >> Best Regards, Edward J. Yoon
>>> >> >> >> >> @eddieyoon
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> > --
>>> >> >> >> > Best Regards, Edward J. Yoon
>>> >> >> >> > @eddieyoon
>>> >> >> >>
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> --
>>> >> >> >> Best Regards, Edward J. Yoon
>>> >> >> >> @eddieyoon
>>> >> >> >>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Best Regards, Edward J. Yoon
>>> >> >> @eddieyoon
>>> >> >>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Best Regards, Edward J. Yoon
>>> >> @eddieyoon
>>> >>
>>>
>>>
>>>
>>> --
>>> Best Regards, Edward J. Yoon
>>> @eddieyoon
>>>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: runtimePartitioning in GraphJobRunner

Posted by "Edward J. Yoon" <ed...@apache.org>.

Please let me clarify the issue.

The main problem is related with data partitioning, scaling-out tasks,
and code complexity. Irrespective of HAMA-531 issue, graph package has
own network-based partitioner and it is abnormally evolved with
VertexInputReader.

Sorry for again and again.

"If there's no opinion, I'll remove VertexInputReader in
GraphJobRunner, because it make code complex. Let's consider again
about the VertexInputReader, after fixing HAMA-531 and HAMA-632
issues."

On Tue, Dec 11, 2012 at 7:39 AM, Edward J. Yoon <ed...@apache.org> wrote:
> If you can fix BSPJobClient.partition() method to partition text
> input, please do.
>
> Again ... :/
>
>>>> * If we have VertexInputReader again, we don't need to apply it to all
>>>> examples. And, random generators and examples should be managed
>>>> together now.
>
> As we discussed, I'll clean up them tomorrow.
>
> On Tue, Dec 11, 2012 at 7:21 AM, Edward J. Yoon <ed...@apache.org> wrote:
>>> Please do me a favor a code how you want the partitioning BSP job to work
>>> before removing everything. I will tell you how to use the readers without
>>> any graph duplicate code so you don't need to touch the examples at all.
>>
>> You don't need to wait. Because it will be almost same with
>> BSPJobClient.partition() method.
>>
>> On Tue, Dec 11, 2012 at 6:59 AM, Thomas Jungblut
>> <th...@gmail.com> wrote:
>>> Please do me a favor a code how you want the partitioning BSP job to work
>>> before removing everything. I will tell you how to use the readers without
>>> any graph duplicate code so you don't need to touch the examples at all.
>>>
>>> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>>>
>>>> Please review
>>>> https://issues.apache.org/jira/secure/attachment/12560155/patch_v02.txt
>>>> first.
>>>>
>>>> * If we have VertexInputReader again, we don't need to apply it to all
>>>> examples. And, random generators and examples should be managed
>>>> together now.
>>>>
>>>> On Tue, Dec 11, 2012 at 6:52 AM, Thomas Jungblut
>>>> <th...@gmail.com> wrote:
>>>> > Yes, but in patches and in Issue Hama-531, so we can review.
>>>> >
>>>> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>>>> >
>>>> >> We talked on gtalk, the conclusion is as below:
>>>> >>
>>>> >> "If there's no opinion, I'll remove VertexInputReader in
>>>> >> GraphJobRunner, because it make code complex. Let's consider again
>>>> >> about the VertexInputReader, after fixing HAMA-531 and HAMA-632
>>>> >> issues."
>>>> >>
>>>> >> I'll clean up them tomorrow.
>>>> >>
>>>> >> On Tue, Dec 11, 2012 at 4:58 AM, Suraj Menon <su...@apache.org>
>>>> >> wrote:
>>>> >> > Hi Edward, I am assuming that you want to do this because you want to
>>>> run
>>>> >> > the job using more BSP tasks in parallel to reduce the memory usage
>>>> per
>>>> >> > task and perhaps run it faster.
>>>> >> > Am I right? I am +1 if this makes things faster. However this would be
>>>> >> > expensive for people with smaller clusters, and we should have spill,
>>>> >> cache
>>>> >> > and lookup implemented for Vertices in such cases.
>>>> >> >
>>>> >> > Regarding backward compatibility, can we use the user's
>>>> VertexInputReader
>>>> >> > to read the data and then write them in sequential file format we
>>>> wan't.
>>>> >> I
>>>> >> > was discussing this with Thomas and we felt this could be done by
>>>> >> > configuring a default input reader and overriding the same by
>>>> >> > configuration. We would have to make the Vertex class Writable. I
>>>> would
>>>> >> > like to keep it backward compatible. Is this a possibility?
>>>> >> >
>>>> >> > Regarding run-time partitioning, not all partitioning would be based
>>>> on
>>>> >> > hash partitioning. I can have a partitioner based on color of the
>>>> vertex
>>>> >> or
>>>> >> > some other property of the vertex. It is a step we can skip if not
>>>> >> > configured by user.
>>>> >> >
>>>> >> > Just my 2 cents. We can deprecate things but let's not remove
>>>> >> immediately.
>>>> >> >
>>>> >> > -Suraj
>>>> >> >
>>>> >> > HAMA-632 can wait until everything is resolved. I am trying to reduce
>>>> the
>>>> >> > API complexity.
>>>> >> >
>>>> >> > On Mon, Dec 10, 2012 at 2:56 PM, Thomas Jungblut
>>>> >> > <th...@gmail.com>wrote:
>>>> >> >
>>>> >> >> You didn't get the use of the reader.
>>>> >> >> The reader doesn't care about the input format.
>>>> >> >> It just takes the input as Writable, so for Text this is
>>>> >> LongWritable/Text
>>>> >> >> pairs. For NoSQL this might be LongWritable/BytesWritable.
>>>> >> >>
>>>> >> >> It's up to you coding this for your input sequence, not for each
>>>> format.
>>>> >> >> This is not hardcoded to text, only in the examples.
>>>> >> >>
>>>> >> >> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>>>> >> >>
>>>> >> >> > Again ... User can create their own InputFormatter to read records
>>>> as
>>>> >> >> > a <Writable, ArrayWritable> from text file or sequence file, or
>>>> >> >> > NoSQLs.
>>>> >> >> >
>>>> >> >> > You can use K, V pairs and sequence file. Why do you want to use
>>>> text
>>>> >> >> > file? Should I always write text file and parse them using
>>>> >> >> > VertexInputReader?
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > On Tue, Dec 11, 2012 at 4:48 AM, Thomas Jungblut
>>>> >> >> > <th...@gmail.com> wrote:
>>>> >> >> > >>
>>>> >> >> > >> It's a gap in experience, Thomas.
>>>> >> >> > >
>>>> >> >> > >
>>>> >> >> > > Most probably you should read some good books on data extraction
>>>> and
>>>> >> >> then
>>>> >> >> > > choose your tools accordingly.
>>>> >> >> > > I never think that BSP is and will be a good extraction technique
>>>> >> for
>>>> >> >> > > unstructured data.
>>>> >> >> > >
>>>> >> >> > > But these are just my two cents here- there seems to be somewhat
>>>> >> more
>>>> >> >> > > political problems in this game than using tools appropriately.
>>>> >> >> > >
>>>> >> >> > > 2012/12/10 Thomas Jungblut <th...@gmail.com>
>>>> >> >> > >
>>>> >> >> > >> Yes, if you preprocess your data correctly.
>>>> >> >> > >> I have done the same unstructured extraction with the movie
>>>> >> database
>>>> >> >> > from
>>>> >> >> > >> IMDB and it worked fine.
>>>> >> >> > >> That's just not a job for BSP, but for MapReduce.
>>>> >> >> > >>
>>>> >> >> > >> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>>>> >> >> > >>
>>>> >> >> > >>> It's a gap in experience, Thomas. Do you think you can extract
>>>> >> >> Twitter
>>>> >> >> > >>>
>>>> >> >> > >>> mention graph using parseVertex?
>>>> >> >> > >>>
>>>> >> >> > >>> On Tue, Dec 11, 2012 at 4:34 AM, Thomas Jungblut
>>>> >> >> > >>> <th...@gmail.com> wrote:
>>>> >> >> > >>> > I have trouble understanding you here.
>>>> >> >> > >>> >
>>>> >> >> > >>> > How can I generate large sample without coding?
>>>> >> >> > >>> >
>>>> >> >> > >>> >
>>>> >> >> > >>> > Do you mean random data generation or real-life data?
>>>> >> >> > >>> > Personally I think it is really convenient to transform
>>>> >> >> unstructured
>>>> >> >> > >>> data
>>>> >> >> > >>> > in a text file to vertices.
>>>> >> >> > >>> >
>>>> >> >> > >>> >
>>>> >> >> > >>> > 2012/12/10 Edward <ed...@udanax.org>
>>>> >> >> > >>> >
>>>> >> >> > >>> >> I mean, With or without input reader. How can I generate
>>>> large
>>>> >> >> > sample
>>>> >> >> > >>> >> without coding?
>>>> >> >> > >>> >>
>>>> >> >> > >>> >> It's unnecessary feature. As I mentioned before, only good
>>>> for
>>>> >> >> > simple
>>>> >> >> > >>> and
>>>> >> >> > >>> >> small test.
>>>> >> >> > >>> >>
>>>> >> >> > >>> >> Sent from my iPhone
>>>> >> >> > >>> >>
>>>> >> >> > >>> >> On Dec 11, 2012, at 3:38 AM, Thomas Jungblut <
>>>> >> >> > >>> thomas.jungblut@gmail.com>
>>>> >> >> > >>> >> wrote:
>>>> >> >> > >>> >>
>>>> >> >> > >>> >> >>
>>>> >> >> > >>> >> >> In my case, generating test data is very annoying.
>>>> >> >> > >>> >> >
>>>> >> >> > >>> >> >
>>>> >> >> > >>> >> > Really? What is so difficult to generate tab separated
>>>> text
>>>> >> >> > data?;)
>>>> >> >> > >>> >> > I think we shouldn't do this, but there seems to be very
>>>> >> little
>>>> >> >> > >>> interest
>>>> >> >> > >>> >> in
>>>> >> >> > >>> >> > the community so I will not block your work on it.
>>>> >> >> > >>> >> >
>>>> >> >> > >>> >> > Good luck ;)
>>>> >> >> > >>> >>
>>>> >> >> > >>>
>>>> >> >> > >>>
>>>> >> >> > >>>
>>>> >> >> > >>> --
>>>> >> >> > >>> Best Regards, Edward J. Yoon
>>>> >> >> > >>> @eddieyoon
>>>> >> >> > >>>
>>>> >> >> > >>
>>>> >> >> > >>
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > --
>>>> >> >> > Best Regards, Edward J. Yoon
>>>> >> >> > @eddieyoon
>>>> >> >> >
>>>> >> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Best Regards, Edward J. Yoon
>>>> >> @eddieyoon
>>>> >>
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards, Edward J. Yoon
>>>> @eddieyoon
>>>>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: runtimePartitioning in GraphJobRunner

Posted by "Edward J. Yoon" <ed...@apache.org>.

If you can fix BSPJobClient.partition() method to partition text
input, please do.

Again ... :/

>>> * If we have VertexInputReader again, we don't need to apply it to all
>>> examples. And, random generators and examples should be managed
>>> together now.

As we discussed, I'll clean up them tomorrow.

On Tue, Dec 11, 2012 at 7:21 AM, Edward J. Yoon <ed...@apache.org> wrote:
>> Please do me a favor a code how you want the partitioning BSP job to work
>> before removing everything. I will tell you how to use the readers without
>> any graph duplicate code so you don't need to touch the examples at all.
>
> You don't need to wait. Because it will be almost same with
> BSPJobClient.partition() method.
>
> On Tue, Dec 11, 2012 at 6:59 AM, Thomas Jungblut
> <th...@gmail.com> wrote:
>> Please do me a favor a code how you want the partitioning BSP job to work
>> before removing everything. I will tell you how to use the readers without
>> any graph duplicate code so you don't need to touch the examples at all.
>>
>> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>>
>>> Please review
>>> https://issues.apache.org/jira/secure/attachment/12560155/patch_v02.txt
>>> first.
>>>
>>> * If we have VertexInputReader again, we don't need to apply it to all
>>> examples. And, random generators and examples should be managed
>>> together now.
>>>
>>> On Tue, Dec 11, 2012 at 6:52 AM, Thomas Jungblut
>>> <th...@gmail.com> wrote:
>>> > Yes, but in patches and in Issue Hama-531, so we can review.
>>> >
>>> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>>> >
>>> >> We talked on gtalk, the conclusion is as below:
>>> >>
>>> >> "If there's no opinion, I'll remove VertexInputReader in
>>> >> GraphJobRunner, because it make code complex. Let's consider again
>>> >> about the VertexInputReader, after fixing HAMA-531 and HAMA-632
>>> >> issues."
>>> >>
>>> >> I'll clean up them tomorrow.
>>> >>
>>> >> On Tue, Dec 11, 2012 at 4:58 AM, Suraj Menon <su...@apache.org>
>>> >> wrote:
>>> >> > Hi Edward, I am assuming that you want to do this because you want to
>>> run
>>> >> > the job using more BSP tasks in parallel to reduce the memory usage
>>> per
>>> >> > task and perhaps run it faster.
>>> >> > Am I right? I am +1 if this makes things faster. However this would be
>>> >> > expensive for people with smaller clusters, and we should have spill,
>>> >> cache
>>> >> > and lookup implemented for Vertices in such cases.
>>> >> >
>>> >> > Regarding backward compatibility, can we use the user's
>>> VertexInputReader
>>> >> > to read the data and then write them in sequential file format we
>>> wan't.
>>> >> I
>>> >> > was discussing this with Thomas and we felt this could be done by
>>> >> > configuring a default input reader and overriding the same by
>>> >> > configuration. We would have to make the Vertex class Writable. I
>>> would
>>> >> > like to keep it backward compatible. Is this a possibility?
>>> >> >
>>> >> > Regarding run-time partitioning, not all partitioning would be based
>>> on
>>> >> > hash partitioning. I can have a partitioner based on color of the
>>> vertex
>>> >> or
>>> >> > some other property of the vertex. It is a step we can skip if not
>>> >> > configured by user.
>>> >> >
>>> >> > Just my 2 cents. We can deprecate things but let's not remove
>>> >> immediately.
>>> >> >
>>> >> > -Suraj
>>> >> >
>>> >> > HAMA-632 can wait until everything is resolved. I am trying to reduce
>>> the
>>> >> > API complexity.
>>> >> >
>>> >> > On Mon, Dec 10, 2012 at 2:56 PM, Thomas Jungblut
>>> >> > <th...@gmail.com>wrote:
>>> >> >
>>> >> >> You didn't get the use of the reader.
>>> >> >> The reader doesn't care about the input format.
>>> >> >> It just takes the input as Writable, so for Text this is
>>> >> LongWritable/Text
>>> >> >> pairs. For NoSQL this might be LongWritable/BytesWritable.
>>> >> >>
>>> >> >> It's up to you coding this for your input sequence, not for each
>>> format.
>>> >> >> This is not hardcoded to text, only in the examples.
>>> >> >>
>>> >> >> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>>> >> >>
>>> >> >> > Again ... User can create their own InputFormatter to read records
>>> as
>>> >> >> > a <Writable, ArrayWritable> from text file or sequence file, or
>>> >> >> > NoSQLs.
>>> >> >> >
>>> >> >> > You can use K, V pairs and sequence file. Why do you want to use
>>> text
>>> >> >> > file? Should I always write text file and parse them using
>>> >> >> > VertexInputReader?
>>> >> >> >
>>> >> >> >
>>> >> >> > On Tue, Dec 11, 2012 at 4:48 AM, Thomas Jungblut
>>> >> >> > <th...@gmail.com> wrote:
>>> >> >> > >>
>>> >> >> > >> It's a gap in experience, Thomas.
>>> >> >> > >
>>> >> >> > >
>>> >> >> > > Most probably you should read some good books on data extraction
>>> and
>>> >> >> then
>>> >> >> > > choose your tools accordingly.
>>> >> >> > > I never think that BSP is and will be a good extraction technique
>>> >> for
>>> >> >> > > unstructured data.
>>> >> >> > >
>>> >> >> > > But these are just my two cents here- there seems to be somewhat
>>> >> more
>>> >> >> > > political problems in this game than using tools appropriately.
>>> >> >> > >
>>> >> >> > > 2012/12/10 Thomas Jungblut <th...@gmail.com>
>>> >> >> > >
>>> >> >> > >> Yes, if you preprocess your data correctly.
>>> >> >> > >> I have done the same unstructured extraction with the movie
>>> >> database
>>> >> >> > from
>>> >> >> > >> IMDB and it worked fine.
>>> >> >> > >> That's just not a job for BSP, but for MapReduce.
>>> >> >> > >>
>>> >> >> > >> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>>> >> >> > >>
>>> >> >> > >>> It's a gap in experience, Thomas. Do you think you can extract
>>> >> >> Twitter
>>> >> >> > >>>
>>> >> >> > >>> mention graph using parseVertex?
>>> >> >> > >>>
>>> >> >> > >>> On Tue, Dec 11, 2012 at 4:34 AM, Thomas Jungblut
>>> >> >> > >>> <th...@gmail.com> wrote:
>>> >> >> > >>> > I have trouble understanding you here.
>>> >> >> > >>> >
>>> >> >> > >>> > How can I generate large sample without coding?
>>> >> >> > >>> >
>>> >> >> > >>> >
>>> >> >> > >>> > Do you mean random data generation or real-life data?
>>> >> >> > >>> > Personally I think it is really convenient to transform
>>> >> >> unstructured
>>> >> >> > >>> data
>>> >> >> > >>> > in a text file to vertices.
>>> >> >> > >>> >
>>> >> >> > >>> >
>>> >> >> > >>> > 2012/12/10 Edward <ed...@udanax.org>
>>> >> >> > >>> >
>>> >> >> > >>> >> I mean, With or without input reader. How can I generate
>>> large
>>> >> >> > sample
>>> >> >> > >>> >> without coding?
>>> >> >> > >>> >>
>>> >> >> > >>> >> It's unnecessary feature. As I mentioned before, only good
>>> for
>>> >> >> > simple
>>> >> >> > >>> and
>>> >> >> > >>> >> small test.
>>> >> >> > >>> >>
>>> >> >> > >>> >> Sent from my iPhone
>>> >> >> > >>> >>
>>> >> >> > >>> >> On Dec 11, 2012, at 3:38 AM, Thomas Jungblut <
>>> >> >> > >>> thomas.jungblut@gmail.com>
>>> >> >> > >>> >> wrote:
>>> >> >> > >>> >>
>>> >> >> > >>> >> >>
>>> >> >> > >>> >> >> In my case, generating test data is very annoying.
>>> >> >> > >>> >> >
>>> >> >> > >>> >> >
>>> >> >> > >>> >> > Really? What is so difficult to generate tab separated
>>> text
>>> >> >> > data?;)
>>> >> >> > >>> >> > I think we shouldn't do this, but there seems to be very
>>> >> little
>>> >> >> > >>> interest
>>> >> >> > >>> >> in
>>> >> >> > >>> >> > the community so I will not block your work on it.
>>> >> >> > >>> >> >
>>> >> >> > >>> >> > Good luck ;)
>>> >> >> > >>> >>
>>> >> >> > >>>
>>> >> >> > >>>
>>> >> >> > >>>
>>> >> >> > >>> --
>>> >> >> > >>> Best Regards, Edward J. Yoon
>>> >> >> > >>> @eddieyoon
>>> >> >> > >>>
>>> >> >> > >>
>>> >> >> > >>
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > --
>>> >> >> > Best Regards, Edward J. Yoon
>>> >> >> > @eddieyoon
>>> >> >> >
>>> >> >>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Best Regards, Edward J. Yoon
>>> >> @eddieyoon
>>> >>
>>>
>>>
>>>
>>> --
>>> Best Regards, Edward J. Yoon
>>> @eddieyoon
>>>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: runtimePartitioning in GraphJobRunner

Posted by "Edward J. Yoon" <ed...@apache.org>.

> Please do me a favor a code how you want the partitioning BSP job to work
> before removing everything. I will tell you how to use the readers without
> any graph duplicate code so you don't need to touch the examples at all.

You don't need to wait. Because it will be almost same with
BSPJobClient.partition() method.

On Tue, Dec 11, 2012 at 6:59 AM, Thomas Jungblut
<th...@gmail.com> wrote:
> Please do me a favor a code how you want the partitioning BSP job to work
> before removing everything. I will tell you how to use the readers without
> any graph duplicate code so you don't need to touch the examples at all.
>
> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>
>> Please review
>> https://issues.apache.org/jira/secure/attachment/12560155/patch_v02.txt
>> first.
>>
>> * If we have VertexInputReader again, we don't need to apply it to all
>> examples. And, random generators and examples should be managed
>> together now.
>>
>> On Tue, Dec 11, 2012 at 6:52 AM, Thomas Jungblut
>> <th...@gmail.com> wrote:
>> > Yes, but in patches and in Issue Hama-531, so we can review.
>> >
>> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >
>> >> We talked on gtalk, the conclusion is as below:
>> >>
>> >> "If there's no opinion, I'll remove VertexInputReader in
>> >> GraphJobRunner, because it make code complex. Let's consider again
>> >> about the VertexInputReader, after fixing HAMA-531 and HAMA-632
>> >> issues."
>> >>
>> >> I'll clean up them tomorrow.
>> >>
>> >> On Tue, Dec 11, 2012 at 4:58 AM, Suraj Menon <su...@apache.org>
>> >> wrote:
>> >> > Hi Edward, I am assuming that you want to do this because you want to
>> run
>> >> > the job using more BSP tasks in parallel to reduce the memory usage
>> per
>> >> > task and perhaps run it faster.
>> >> > Am I right? I am +1 if this makes things faster. However this would be
>> >> > expensive for people with smaller clusters, and we should have spill,
>> >> cache
>> >> > and lookup implemented for Vertices in such cases.
>> >> >
>> >> > Regarding backward compatibility, can we use the user's
>> VertexInputReader
>> >> > to read the data and then write them in sequential file format we
>> wan't.
>> >> I
>> >> > was discussing this with Thomas and we felt this could be done by
>> >> > configuring a default input reader and overriding the same by
>> >> > configuration. We would have to make the Vertex class Writable. I
>> would
>> >> > like to keep it backward compatible. Is this a possibility?
>> >> >
>> >> > Regarding run-time partitioning, not all partitioning would be based
>> on
>> >> > hash partitioning. I can have a partitioner based on color of the
>> vertex
>> >> or
>> >> > some other property of the vertex. It is a step we can skip if not
>> >> > configured by user.
>> >> >
>> >> > Just my 2 cents. We can deprecate things but let's not remove
>> >> immediately.
>> >> >
>> >> > -Suraj
>> >> >
>> >> > HAMA-632 can wait until everything is resolved. I am trying to reduce
>> the
>> >> > API complexity.
>> >> >
>> >> > On Mon, Dec 10, 2012 at 2:56 PM, Thomas Jungblut
>> >> > <th...@gmail.com>wrote:
>> >> >
>> >> >> You didn't get the use of the reader.
>> >> >> The reader doesn't care about the input format.
>> >> >> It just takes the input as Writable, so for Text this is
>> >> LongWritable/Text
>> >> >> pairs. For NoSQL this might be LongWritable/BytesWritable.
>> >> >>
>> >> >> It's up to you coding this for your input sequence, not for each
>> format.
>> >> >> This is not hardcoded to text, only in the examples.
>> >> >>
>> >> >> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >> >>
>> >> >> > Again ... User can create their own InputFormatter to read records
>> as
>> >> >> > a <Writable, ArrayWritable> from text file or sequence file, or
>> >> >> > NoSQLs.
>> >> >> >
>> >> >> > You can use K, V pairs and sequence file. Why do you want to use
>> text
>> >> >> > file? Should I always write text file and parse them using
>> >> >> > VertexInputReader?
>> >> >> >
>> >> >> >
>> >> >> > On Tue, Dec 11, 2012 at 4:48 AM, Thomas Jungblut
>> >> >> > <th...@gmail.com> wrote:
>> >> >> > >>
>> >> >> > >> It's a gap in experience, Thomas.
>> >> >> > >
>> >> >> > >
>> >> >> > > Most probably you should read some good books on data extraction
>> and
>> >> >> then
>> >> >> > > choose your tools accordingly.
>> >> >> > > I never think that BSP is and will be a good extraction technique
>> >> for
>> >> >> > > unstructured data.
>> >> >> > >
>> >> >> > > But these are just my two cents here- there seems to be somewhat
>> >> more
>> >> >> > > political problems in this game than using tools appropriately.
>> >> >> > >
>> >> >> > > 2012/12/10 Thomas Jungblut <th...@gmail.com>
>> >> >> > >
>> >> >> > >> Yes, if you preprocess your data correctly.
>> >> >> > >> I have done the same unstructured extraction with the movie
>> >> database
>> >> >> > from
>> >> >> > >> IMDB and it worked fine.
>> >> >> > >> That's just not a job for BSP, but for MapReduce.
>> >> >> > >>
>> >> >> > >> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >> >> > >>
>> >> >> > >>> It's a gap in experience, Thomas. Do you think you can extract
>> >> >> Twitter
>> >> >> > >>>
>> >> >> > >>> mention graph using parseVertex?
>> >> >> > >>>
>> >> >> > >>> On Tue, Dec 11, 2012 at 4:34 AM, Thomas Jungblut
>> >> >> > >>> <th...@gmail.com> wrote:
>> >> >> > >>> > I have trouble understanding you here.
>> >> >> > >>> >
>> >> >> > >>> > How can I generate large sample without coding?
>> >> >> > >>> >
>> >> >> > >>> >
>> >> >> > >>> > Do you mean random data generation or real-life data?
>> >> >> > >>> > Personally I think it is really convenient to transform
>> >> >> unstructured
>> >> >> > >>> data
>> >> >> > >>> > in a text file to vertices.
>> >> >> > >>> >
>> >> >> > >>> >
>> >> >> > >>> > 2012/12/10 Edward <ed...@udanax.org>
>> >> >> > >>> >
>> >> >> > >>> >> I mean, With or without input reader. How can I generate
>> large
>> >> >> > sample
>> >> >> > >>> >> without coding?
>> >> >> > >>> >>
>> >> >> > >>> >> It's unnecessary feature. As I mentioned before, only good
>> for
>> >> >> > simple
>> >> >> > >>> and
>> >> >> > >>> >> small test.
>> >> >> > >>> >>
>> >> >> > >>> >> Sent from my iPhone
>> >> >> > >>> >>
>> >> >> > >>> >> On Dec 11, 2012, at 3:38 AM, Thomas Jungblut <
>> >> >> > >>> thomas.jungblut@gmail.com>
>> >> >> > >>> >> wrote:
>> >> >> > >>> >>
>> >> >> > >>> >> >>
>> >> >> > >>> >> >> In my case, generating test data is very annoying.
>> >> >> > >>> >> >
>> >> >> > >>> >> >
>> >> >> > >>> >> > Really? What is so difficult to generate tab separated
>> text
>> >> >> > data?;)
>> >> >> > >>> >> > I think we shouldn't do this, but there seems to be very
>> >> little
>> >> >> > >>> interest
>> >> >> > >>> >> in
>> >> >> > >>> >> > the community so I will not block your work on it.
>> >> >> > >>> >> >
>> >> >> > >>> >> > Good luck ;)
>> >> >> > >>> >>
>> >> >> > >>>
>> >> >> > >>>
>> >> >> > >>>
>> >> >> > >>> --
>> >> >> > >>> Best Regards, Edward J. Yoon
>> >> >> > >>> @eddieyoon
>> >> >> > >>>
>> >> >> > >>
>> >> >> > >>
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Best Regards, Edward J. Yoon
>> >> >> > @eddieyoon
>> >> >> >
>> >> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Best Regards, Edward J. Yoon
>> >> @eddieyoon
>> >>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: runtimePartitioning in GraphJobRunner

Posted by Thomas Jungblut <th...@gmail.com>.

Please do me a favor a code how you want the partitioning BSP job to work
before removing everything. I will tell you how to use the readers without
any graph duplicate code so you don't need to touch the examples at all.

2012/12/10 Edward J. Yoon <ed...@apache.org>

> Please review
> https://issues.apache.org/jira/secure/attachment/12560155/patch_v02.txt
> first.
>
> * If we have VertexInputReader again, we don't need to apply it to all
> examples. And, random generators and examples should be managed
> together now.
>
> On Tue, Dec 11, 2012 at 6:52 AM, Thomas Jungblut
> <th...@gmail.com> wrote:
> > Yes, but in patches and in Issue Hama-531, so we can review.
> >
> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
> >
> >> We talked on gtalk, the conclusion is as below:
> >>
> >> "If there's no opinion, I'll remove VertexInputReader in
> >> GraphJobRunner, because it make code complex. Let's consider again
> >> about the VertexInputReader, after fixing HAMA-531 and HAMA-632
> >> issues."
> >>
> >> I'll clean up them tomorrow.
> >>
> >> On Tue, Dec 11, 2012 at 4:58 AM, Suraj Menon <su...@apache.org>
> >> wrote:
> >> > Hi Edward, I am assuming that you want to do this because you want to
> run
> >> > the job using more BSP tasks in parallel to reduce the memory usage
> per
> >> > task and perhaps run it faster.
> >> > Am I right? I am +1 if this makes things faster. However this would be
> >> > expensive for people with smaller clusters, and we should have spill,
> >> cache
> >> > and lookup implemented for Vertices in such cases.
> >> >
> >> > Regarding backward compatibility, can we use the user's
> VertexInputReader
> >> > to read the data and then write them in sequential file format we
> wan't.
> >> I
> >> > was discussing this with Thomas and we felt this could be done by
> >> > configuring a default input reader and overriding the same by
> >> > configuration. We would have to make the Vertex class Writable. I
> would
> >> > like to keep it backward compatible. Is this a possibility?
> >> >
> >> > Regarding run-time partitioning, not all partitioning would be based
> on
> >> > hash partitioning. I can have a partitioner based on color of the
> vertex
> >> or
> >> > some other property of the vertex. It is a step we can skip if not
> >> > configured by user.
> >> >
> >> > Just my 2 cents. We can deprecate things but let's not remove
> >> immediately.
> >> >
> >> > -Suraj
> >> >
> >> > HAMA-632 can wait until everything is resolved. I am trying to reduce
> the
> >> > API complexity.
> >> >
> >> > On Mon, Dec 10, 2012 at 2:56 PM, Thomas Jungblut
> >> > <th...@gmail.com>wrote:
> >> >
> >> >> You didn't get the use of the reader.
> >> >> The reader doesn't care about the input format.
> >> >> It just takes the input as Writable, so for Text this is
> >> LongWritable/Text
> >> >> pairs. For NoSQL this might be LongWritable/BytesWritable.
> >> >>
> >> >> It's up to you coding this for your input sequence, not for each
> format.
> >> >> This is not hardcoded to text, only in the examples.
> >> >>
> >> >> 2012/12/10 Edward J. Yoon <ed...@apache.org>
> >> >>
> >> >> > Again ... User can create their own InputFormatter to read records
> as
> >> >> > a <Writable, ArrayWritable> from text file or sequence file, or
> >> >> > NoSQLs.
> >> >> >
> >> >> > You can use K, V pairs and sequence file. Why do you want to use
> text
> >> >> > file? Should I always write text file and parse them using
> >> >> > VertexInputReader?
> >> >> >
> >> >> >
> >> >> > On Tue, Dec 11, 2012 at 4:48 AM, Thomas Jungblut
> >> >> > <th...@gmail.com> wrote:
> >> >> > >>
> >> >> > >> It's a gap in experience, Thomas.
> >> >> > >
> >> >> > >
> >> >> > > Most probably you should read some good books on data extraction
> and
> >> >> then
> >> >> > > choose your tools accordingly.
> >> >> > > I never think that BSP is and will be a good extraction technique
> >> for
> >> >> > > unstructured data.
> >> >> > >
> >> >> > > But these are just my two cents here- there seems to be somewhat
> >> more
> >> >> > > political problems in this game than using tools appropriately.
> >> >> > >
> >> >> > > 2012/12/10 Thomas Jungblut <th...@gmail.com>
> >> >> > >
> >> >> > >> Yes, if you preprocess your data correctly.
> >> >> > >> I have done the same unstructured extraction with the movie
> >> database
> >> >> > from
> >> >> > >> IMDB and it worked fine.
> >> >> > >> That's just not a job for BSP, but for MapReduce.
> >> >> > >>
> >> >> > >> 2012/12/10 Edward J. Yoon <ed...@apache.org>
> >> >> > >>
> >> >> > >>> It's a gap in experience, Thomas. Do you think you can extract
> >> >> Twitter
> >> >> > >>>
> >> >> > >>> mention graph using parseVertex?
> >> >> > >>>
> >> >> > >>> On Tue, Dec 11, 2012 at 4:34 AM, Thomas Jungblut
> >> >> > >>> <th...@gmail.com> wrote:
> >> >> > >>> > I have trouble understanding you here.
> >> >> > >>> >
> >> >> > >>> > How can I generate large sample without coding?
> >> >> > >>> >
> >> >> > >>> >
> >> >> > >>> > Do you mean random data generation or real-life data?
> >> >> > >>> > Personally I think it is really convenient to transform
> >> >> unstructured
> >> >> > >>> data
> >> >> > >>> > in a text file to vertices.
> >> >> > >>> >
> >> >> > >>> >
> >> >> > >>> > 2012/12/10 Edward <ed...@udanax.org>
> >> >> > >>> >
> >> >> > >>> >> I mean, With or without input reader. How can I generate
> large
> >> >> > sample
> >> >> > >>> >> without coding?
> >> >> > >>> >>
> >> >> > >>> >> It's unnecessary feature. As I mentioned before, only good
> for
> >> >> > simple
> >> >> > >>> and
> >> >> > >>> >> small test.
> >> >> > >>> >>
> >> >> > >>> >> Sent from my iPhone
> >> >> > >>> >>
> >> >> > >>> >> On Dec 11, 2012, at 3:38 AM, Thomas Jungblut <
> >> >> > >>> thomas.jungblut@gmail.com>
> >> >> > >>> >> wrote:
> >> >> > >>> >>
> >> >> > >>> >> >>
> >> >> > >>> >> >> In my case, generating test data is very annoying.
> >> >> > >>> >> >
> >> >> > >>> >> >
> >> >> > >>> >> > Really? What is so difficult to generate tab separated
> text
> >> >> > data?;)
> >> >> > >>> >> > I think we shouldn't do this, but there seems to be very
> >> little
> >> >> > >>> interest
> >> >> > >>> >> in
> >> >> > >>> >> > the community so I will not block your work on it.
> >> >> > >>> >> >
> >> >> > >>> >> > Good luck ;)
> >> >> > >>> >>
> >> >> > >>>
> >> >> > >>>
> >> >> > >>>
> >> >> > >>> --
> >> >> > >>> Best Regards, Edward J. Yoon
> >> >> > >>> @eddieyoon
> >> >> > >>>
> >> >> > >>
> >> >> > >>
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Best Regards, Edward J. Yoon
> >> >> > @eddieyoon
> >> >> >
> >> >>
> >>
> >>
> >>
> >> --
> >> Best Regards, Edward J. Yoon
> >> @eddieyoon
> >>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>

Re: runtimePartitioning in GraphJobRunner

Posted by "Edward J. Yoon" <ed...@apache.org>.

Please review https://issues.apache.org/jira/secure/attachment/12560155/patch_v02.txt
first.

* If we have VertexInputReader again, we don't need to apply it to all
examples. And, random generators and examples should be managed
together now.

On Tue, Dec 11, 2012 at 6:52 AM, Thomas Jungblut
<th...@gmail.com> wrote:
> Yes, but in patches and in Issue Hama-531, so we can review.
>
> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>
>> We talked on gtalk, the conclusion is as below:
>>
>> "If there's no opinion, I'll remove VertexInputReader in
>> GraphJobRunner, because it make code complex. Let's consider again
>> about the VertexInputReader, after fixing HAMA-531 and HAMA-632
>> issues."
>>
>> I'll clean up them tomorrow.
>>
>> On Tue, Dec 11, 2012 at 4:58 AM, Suraj Menon <su...@apache.org>
>> wrote:
>> > Hi Edward, I am assuming that you want to do this because you want to run
>> > the job using more BSP tasks in parallel to reduce the memory usage per
>> > task and perhaps run it faster.
>> > Am I right? I am +1 if this makes things faster. However this would be
>> > expensive for people with smaller clusters, and we should have spill,
>> cache
>> > and lookup implemented for Vertices in such cases.
>> >
>> > Regarding backward compatibility, can we use the user's VertexInputReader
>> > to read the data and then write them in sequential file format we wan't.
>> I
>> > was discussing this with Thomas and we felt this could be done by
>> > configuring a default input reader and overriding the same by
>> > configuration. We would have to make the Vertex class Writable. I would
>> > like to keep it backward compatible. Is this a possibility?
>> >
>> > Regarding run-time partitioning, not all partitioning would be based on
>> > hash partitioning. I can have a partitioner based on color of the vertex
>> or
>> > some other property of the vertex. It is a step we can skip if not
>> > configured by user.
>> >
>> > Just my 2 cents. We can deprecate things but let's not remove
>> immediately.
>> >
>> > -Suraj
>> >
>> > HAMA-632 can wait until everything is resolved. I am trying to reduce the
>> > API complexity.
>> >
>> > On Mon, Dec 10, 2012 at 2:56 PM, Thomas Jungblut
>> > <th...@gmail.com>wrote:
>> >
>> >> You didn't get the use of the reader.
>> >> The reader doesn't care about the input format.
>> >> It just takes the input as Writable, so for Text this is
>> LongWritable/Text
>> >> pairs. For NoSQL this might be LongWritable/BytesWritable.
>> >>
>> >> It's up to you coding this for your input sequence, not for each format.
>> >> This is not hardcoded to text, only in the examples.
>> >>
>> >> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >>
>> >> > Again ... User can create their own InputFormatter to read records as
>> >> > a <Writable, ArrayWritable> from text file or sequence file, or
>> >> > NoSQLs.
>> >> >
>> >> > You can use K, V pairs and sequence file. Why do you want to use text
>> >> > file? Should I always write text file and parse them using
>> >> > VertexInputReader?
>> >> >
>> >> >
>> >> > On Tue, Dec 11, 2012 at 4:48 AM, Thomas Jungblut
>> >> > <th...@gmail.com> wrote:
>> >> > >>
>> >> > >> It's a gap in experience, Thomas.
>> >> > >
>> >> > >
>> >> > > Most probably you should read some good books on data extraction and
>> >> then
>> >> > > choose your tools accordingly.
>> >> > > I never think that BSP is and will be a good extraction technique
>> for
>> >> > > unstructured data.
>> >> > >
>> >> > > But these are just my two cents here- there seems to be somewhat
>> more
>> >> > > political problems in this game than using tools appropriately.
>> >> > >
>> >> > > 2012/12/10 Thomas Jungblut <th...@gmail.com>
>> >> > >
>> >> > >> Yes, if you preprocess your data correctly.
>> >> > >> I have done the same unstructured extraction with the movie
>> database
>> >> > from
>> >> > >> IMDB and it worked fine.
>> >> > >> That's just not a job for BSP, but for MapReduce.
>> >> > >>
>> >> > >> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >> > >>
>> >> > >>> It's a gap in experience, Thomas. Do you think you can extract
>> >> Twitter
>> >> > >>>
>> >> > >>> mention graph using parseVertex?
>> >> > >>>
>> >> > >>> On Tue, Dec 11, 2012 at 4:34 AM, Thomas Jungblut
>> >> > >>> <th...@gmail.com> wrote:
>> >> > >>> > I have trouble understanding you here.
>> >> > >>> >
>> >> > >>> > How can I generate large sample without coding?
>> >> > >>> >
>> >> > >>> >
>> >> > >>> > Do you mean random data generation or real-life data?
>> >> > >>> > Personally I think it is really convenient to transform
>> >> unstructured
>> >> > >>> data
>> >> > >>> > in a text file to vertices.
>> >> > >>> >
>> >> > >>> >
>> >> > >>> > 2012/12/10 Edward <ed...@udanax.org>
>> >> > >>> >
>> >> > >>> >> I mean, With or without input reader. How can I generate large
>> >> > sample
>> >> > >>> >> without coding?
>> >> > >>> >>
>> >> > >>> >> It's unnecessary feature. As I mentioned before, only good for
>> >> > simple
>> >> > >>> and
>> >> > >>> >> small test.
>> >> > >>> >>
>> >> > >>> >> Sent from my iPhone
>> >> > >>> >>
>> >> > >>> >> On Dec 11, 2012, at 3:38 AM, Thomas Jungblut <
>> >> > >>> thomas.jungblut@gmail.com>
>> >> > >>> >> wrote:
>> >> > >>> >>
>> >> > >>> >> >>
>> >> > >>> >> >> In my case, generating test data is very annoying.
>> >> > >>> >> >
>> >> > >>> >> >
>> >> > >>> >> > Really? What is so difficult to generate tab separated text
>> >> > data?;)
>> >> > >>> >> > I think we shouldn't do this, but there seems to be very
>> little
>> >> > >>> interest
>> >> > >>> >> in
>> >> > >>> >> > the community so I will not block your work on it.
>> >> > >>> >> >
>> >> > >>> >> > Good luck ;)
>> >> > >>> >>
>> >> > >>>
>> >> > >>>
>> >> > >>>
>> >> > >>> --
>> >> > >>> Best Regards, Edward J. Yoon
>> >> > >>> @eddieyoon
>> >> > >>>
>> >> > >>
>> >> > >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Best Regards, Edward J. Yoon
>> >> > @eddieyoon
>> >> >
>> >>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: runtimePartitioning in GraphJobRunner

Posted by Thomas Jungblut <th...@gmail.com>.

Yes, but in patches and in Issue Hama-531, so we can review.

2012/12/10 Edward J. Yoon <ed...@apache.org>

> We talked on gtalk, the conclusion is as below:
>
> "If there's no opinion, I'll remove VertexInputReader in
> GraphJobRunner, because it make code complex. Let's consider again
> about the VertexInputReader, after fixing HAMA-531 and HAMA-632
> issues."
>
> I'll clean up them tomorrow.
>
> On Tue, Dec 11, 2012 at 4:58 AM, Suraj Menon <su...@apache.org>
> wrote:
> > Hi Edward, I am assuming that you want to do this because you want to run
> > the job using more BSP tasks in parallel to reduce the memory usage per
> > task and perhaps run it faster.
> > Am I right? I am +1 if this makes things faster. However this would be
> > expensive for people with smaller clusters, and we should have spill,
> cache
> > and lookup implemented for Vertices in such cases.
> >
> > Regarding backward compatibility, can we use the user's VertexInputReader
> > to read the data and then write them in sequential file format we wan't.
> I
> > was discussing this with Thomas and we felt this could be done by
> > configuring a default input reader and overriding the same by
> > configuration. We would have to make the Vertex class Writable. I would
> > like to keep it backward compatible. Is this a possibility?
> >
> > Regarding run-time partitioning, not all partitioning would be based on
> > hash partitioning. I can have a partitioner based on color of the vertex
> or
> > some other property of the vertex. It is a step we can skip if not
> > configured by user.
> >
> > Just my 2 cents. We can deprecate things but let's not remove
> immediately.
> >
> > -Suraj
> >
> > HAMA-632 can wait until everything is resolved. I am trying to reduce the
> > API complexity.
> >
> > On Mon, Dec 10, 2012 at 2:56 PM, Thomas Jungblut
> > <th...@gmail.com>wrote:
> >
> >> You didn't get the use of the reader.
> >> The reader doesn't care about the input format.
> >> It just takes the input as Writable, so for Text this is
> LongWritable/Text
> >> pairs. For NoSQL this might be LongWritable/BytesWritable.
> >>
> >> It's up to you coding this for your input sequence, not for each format.
> >> This is not hardcoded to text, only in the examples.
> >>
> >> 2012/12/10 Edward J. Yoon <ed...@apache.org>
> >>
> >> > Again ... User can create their own InputFormatter to read records as
> >> > a <Writable, ArrayWritable> from text file or sequence file, or
> >> > NoSQLs.
> >> >
> >> > You can use K, V pairs and sequence file. Why do you want to use text
> >> > file? Should I always write text file and parse them using
> >> > VertexInputReader?
> >> >
> >> >
> >> > On Tue, Dec 11, 2012 at 4:48 AM, Thomas Jungblut
> >> > <th...@gmail.com> wrote:
> >> > >>
> >> > >> It's a gap in experience, Thomas.
> >> > >
> >> > >
> >> > > Most probably you should read some good books on data extraction and
> >> then
> >> > > choose your tools accordingly.
> >> > > I never think that BSP is and will be a good extraction technique
> for
> >> > > unstructured data.
> >> > >
> >> > > But these are just my two cents here- there seems to be somewhat
> more
> >> > > political problems in this game than using tools appropriately.
> >> > >
> >> > > 2012/12/10 Thomas Jungblut <th...@gmail.com>
> >> > >
> >> > >> Yes, if you preprocess your data correctly.
> >> > >> I have done the same unstructured extraction with the movie
> database
> >> > from
> >> > >> IMDB and it worked fine.
> >> > >> That's just not a job for BSP, but for MapReduce.
> >> > >>
> >> > >> 2012/12/10 Edward J. Yoon <ed...@apache.org>
> >> > >>
> >> > >>> It's a gap in experience, Thomas. Do you think you can extract
> >> Twitter
> >> > >>>
> >> > >>> mention graph using parseVertex?
> >> > >>>
> >> > >>> On Tue, Dec 11, 2012 at 4:34 AM, Thomas Jungblut
> >> > >>> <th...@gmail.com> wrote:
> >> > >>> > I have trouble understanding you here.
> >> > >>> >
> >> > >>> > How can I generate large sample without coding?
> >> > >>> >
> >> > >>> >
> >> > >>> > Do you mean random data generation or real-life data?
> >> > >>> > Personally I think it is really convenient to transform
> >> unstructured
> >> > >>> data
> >> > >>> > in a text file to vertices.
> >> > >>> >
> >> > >>> >
> >> > >>> > 2012/12/10 Edward <ed...@udanax.org>
> >> > >>> >
> >> > >>> >> I mean, With or without input reader. How can I generate large
> >> > sample
> >> > >>> >> without coding?
> >> > >>> >>
> >> > >>> >> It's unnecessary feature. As I mentioned before, only good for
> >> > simple
> >> > >>> and
> >> > >>> >> small test.
> >> > >>> >>
> >> > >>> >> Sent from my iPhone
> >> > >>> >>
> >> > >>> >> On Dec 11, 2012, at 3:38 AM, Thomas Jungblut <
> >> > >>> thomas.jungblut@gmail.com>
> >> > >>> >> wrote:
> >> > >>> >>
> >> > >>> >> >>
> >> > >>> >> >> In my case, generating test data is very annoying.
> >> > >>> >> >
> >> > >>> >> >
> >> > >>> >> > Really? What is so difficult to generate tab separated text
> >> > data?;)
> >> > >>> >> > I think we shouldn't do this, but there seems to be very
> little
> >> > >>> interest
> >> > >>> >> in
> >> > >>> >> > the community so I will not block your work on it.
> >> > >>> >> >
> >> > >>> >> > Good luck ;)
> >> > >>> >>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>> --
> >> > >>> Best Regards, Edward J. Yoon
> >> > >>> @eddieyoon
> >> > >>>
> >> > >>
> >> > >>
> >> >
> >> >
> >> >
> >> > --
> >> > Best Regards, Edward J. Yoon
> >> > @eddieyoon
> >> >
> >>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>

Re: runtimePartitioning in GraphJobRunner

Posted by "Edward J. Yoon" <ed...@apache.org>.

We talked on gtalk, the conclusion is as below:

"If there's no opinion, I'll remove VertexInputReader in
GraphJobRunner, because it make code complex. Let's consider again
about the VertexInputReader, after fixing HAMA-531 and HAMA-632
issues."

I'll clean up them tomorrow.

On Tue, Dec 11, 2012 at 4:58 AM, Suraj Menon <su...@apache.org> wrote:
> Hi Edward, I am assuming that you want to do this because you want to run
> the job using more BSP tasks in parallel to reduce the memory usage per
> task and perhaps run it faster.
> Am I right? I am +1 if this makes things faster. However this would be
> expensive for people with smaller clusters, and we should have spill, cache
> and lookup implemented for Vertices in such cases.
>
> Regarding backward compatibility, can we use the user's VertexInputReader
> to read the data and then write them in sequential file format we wan't. I
> was discussing this with Thomas and we felt this could be done by
> configuring a default input reader and overriding the same by
> configuration. We would have to make the Vertex class Writable. I would
> like to keep it backward compatible. Is this a possibility?
>
> Regarding run-time partitioning, not all partitioning would be based on
> hash partitioning. I can have a partitioner based on color of the vertex or
> some other property of the vertex. It is a step we can skip if not
> configured by user.
>
> Just my 2 cents. We can deprecate things but let's not remove immediately.
>
> -Suraj
>
> HAMA-632 can wait until everything is resolved. I am trying to reduce the
> API complexity.
>
> On Mon, Dec 10, 2012 at 2:56 PM, Thomas Jungblut
> <th...@gmail.com>wrote:
>
>> You didn't get the use of the reader.
>> The reader doesn't care about the input format.
>> It just takes the input as Writable, so for Text this is LongWritable/Text
>> pairs. For NoSQL this might be LongWritable/BytesWritable.
>>
>> It's up to you coding this for your input sequence, not for each format.
>> This is not hardcoded to text, only in the examples.
>>
>> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>>
>> > Again ... User can create their own InputFormatter to read records as
>> > a <Writable, ArrayWritable> from text file or sequence file, or
>> > NoSQLs.
>> >
>> > You can use K, V pairs and sequence file. Why do you want to use text
>> > file? Should I always write text file and parse them using
>> > VertexInputReader?
>> >
>> >
>> > On Tue, Dec 11, 2012 at 4:48 AM, Thomas Jungblut
>> > <th...@gmail.com> wrote:
>> > >>
>> > >> It's a gap in experience, Thomas.
>> > >
>> > >
>> > > Most probably you should read some good books on data extraction and
>> then
>> > > choose your tools accordingly.
>> > > I never think that BSP is and will be a good extraction technique for
>> > > unstructured data.
>> > >
>> > > But these are just my two cents here- there seems to be somewhat more
>> > > political problems in this game than using tools appropriately.
>> > >
>> > > 2012/12/10 Thomas Jungblut <th...@gmail.com>
>> > >
>> > >> Yes, if you preprocess your data correctly.
>> > >> I have done the same unstructured extraction with the movie database
>> > from
>> > >> IMDB and it worked fine.
>> > >> That's just not a job for BSP, but for MapReduce.
>> > >>
>> > >> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> > >>
>> > >>> It's a gap in experience, Thomas. Do you think you can extract
>> Twitter
>> > >>>
>> > >>> mention graph using parseVertex?
>> > >>>
>> > >>> On Tue, Dec 11, 2012 at 4:34 AM, Thomas Jungblut
>> > >>> <th...@gmail.com> wrote:
>> > >>> > I have trouble understanding you here.
>> > >>> >
>> > >>> > How can I generate large sample without coding?
>> > >>> >
>> > >>> >
>> > >>> > Do you mean random data generation or real-life data?
>> > >>> > Personally I think it is really convenient to transform
>> unstructured
>> > >>> data
>> > >>> > in a text file to vertices.
>> > >>> >
>> > >>> >
>> > >>> > 2012/12/10 Edward <ed...@udanax.org>
>> > >>> >
>> > >>> >> I mean, With or without input reader. How can I generate large
>> > sample
>> > >>> >> without coding?
>> > >>> >>
>> > >>> >> It's unnecessary feature. As I mentioned before, only good for
>> > simple
>> > >>> and
>> > >>> >> small test.
>> > >>> >>
>> > >>> >> Sent from my iPhone
>> > >>> >>
>> > >>> >> On Dec 11, 2012, at 3:38 AM, Thomas Jungblut <
>> > >>> thomas.jungblut@gmail.com>
>> > >>> >> wrote:
>> > >>> >>
>> > >>> >> >>
>> > >>> >> >> In my case, generating test data is very annoying.
>> > >>> >> >
>> > >>> >> >
>> > >>> >> > Really? What is so difficult to generate tab separated text
>> > data?;)
>> > >>> >> > I think we shouldn't do this, but there seems to be very little
>> > >>> interest
>> > >>> >> in
>> > >>> >> > the community so I will not block your work on it.
>> > >>> >> >
>> > >>> >> > Good luck ;)
>> > >>> >>
>> > >>>
>> > >>>
>> > >>>
>> > >>> --
>> > >>> Best Regards, Edward J. Yoon
>> > >>> @eddieyoon
>> > >>>
>> > >>
>> > >>
>> >
>> >
>> >
>> > --
>> > Best Regards, Edward J. Yoon
>> > @eddieyoon
>> >
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: runtimePartitioning in GraphJobRunner

Posted by Suraj Menon <su...@apache.org>.

Hi Edward, I am assuming that you want to do this because you want to run
the job using more BSP tasks in parallel to reduce the memory usage per
task and perhaps run it faster.
Am I right? I am +1 if this makes things faster. However this would be
expensive for people with smaller clusters, and we should have spill, cache
and lookup implemented for Vertices in such cases.

Regarding backward compatibility, can we use the user's VertexInputReader
to read the data and then write them in sequential file format we wan't. I
was discussing this with Thomas and we felt this could be done by
configuring a default input reader and overriding the same by
configuration. We would have to make the Vertex class Writable. I would
like to keep it backward compatible. Is this a possibility?

Regarding run-time partitioning, not all partitioning would be based on
hash partitioning. I can have a partitioner based on color of the vertex or
some other property of the vertex. It is a step we can skip if not
configured by user.

Just my 2 cents. We can deprecate things but let's not remove immediately.

-Suraj

HAMA-632 can wait until everything is resolved. I am trying to reduce the
API complexity.

On Mon, Dec 10, 2012 at 2:56 PM, Thomas Jungblut
<th...@gmail.com>wrote:

> You didn't get the use of the reader.
> The reader doesn't care about the input format.
> It just takes the input as Writable, so for Text this is LongWritable/Text
> pairs. For NoSQL this might be LongWritable/BytesWritable.
>
> It's up to you coding this for your input sequence, not for each format.
> This is not hardcoded to text, only in the examples.
>
> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>
> > Again ... User can create their own InputFormatter to read records as
> > a <Writable, ArrayWritable> from text file or sequence file, or
> > NoSQLs.
> >
> > You can use K, V pairs and sequence file. Why do you want to use text
> > file? Should I always write text file and parse them using
> > VertexInputReader?
> >
> >
> > On Tue, Dec 11, 2012 at 4:48 AM, Thomas Jungblut
> > <th...@gmail.com> wrote:
> > >>
> > >> It's a gap in experience, Thomas.
> > >
> > >
> > > Most probably you should read some good books on data extraction and
> then
> > > choose your tools accordingly.
> > > I never think that BSP is and will be a good extraction technique for
> > > unstructured data.
> > >
> > > But these are just my two cents here- there seems to be somewhat more
> > > political problems in this game than using tools appropriately.
> > >
> > > 2012/12/10 Thomas Jungblut <th...@gmail.com>
> > >
> > >> Yes, if you preprocess your data correctly.
> > >> I have done the same unstructured extraction with the movie database
> > from
> > >> IMDB and it worked fine.
> > >> That's just not a job for BSP, but for MapReduce.
> > >>
> > >> 2012/12/10 Edward J. Yoon <ed...@apache.org>
> > >>
> > >>> It's a gap in experience, Thomas. Do you think you can extract
> Twitter
> > >>>
> > >>> mention graph using parseVertex?
> > >>>
> > >>> On Tue, Dec 11, 2012 at 4:34 AM, Thomas Jungblut
> > >>> <th...@gmail.com> wrote:
> > >>> > I have trouble understanding you here.
> > >>> >
> > >>> > How can I generate large sample without coding?
> > >>> >
> > >>> >
> > >>> > Do you mean random data generation or real-life data?
> > >>> > Personally I think it is really convenient to transform
> unstructured
> > >>> data
> > >>> > in a text file to vertices.
> > >>> >
> > >>> >
> > >>> > 2012/12/10 Edward <ed...@udanax.org>
> > >>> >
> > >>> >> I mean, With or without input reader. How can I generate large
> > sample
> > >>> >> without coding?
> > >>> >>
> > >>> >> It's unnecessary feature. As I mentioned before, only good for
> > simple
> > >>> and
> > >>> >> small test.
> > >>> >>
> > >>> >> Sent from my iPhone
> > >>> >>
> > >>> >> On Dec 11, 2012, at 3:38 AM, Thomas Jungblut <
> > >>> thomas.jungblut@gmail.com>
> > >>> >> wrote:
> > >>> >>
> > >>> >> >>
> > >>> >> >> In my case, generating test data is very annoying.
> > >>> >> >
> > >>> >> >
> > >>> >> > Really? What is so difficult to generate tab separated text
> > data?;)
> > >>> >> > I think we shouldn't do this, but there seems to be very little
> > >>> interest
> > >>> >> in
> > >>> >> > the community so I will not block your work on it.
> > >>> >> >
> > >>> >> > Good luck ;)
> > >>> >>
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> Best Regards, Edward J. Yoon
> > >>> @eddieyoon
> > >>>
> > >>
> > >>
> >
> >
> >
> > --
> > Best Regards, Edward J. Yoon
> > @eddieyoon
> >
>

Re: runtimePartitioning in GraphJobRunner

Posted by Thomas Jungblut <th...@gmail.com>.

You didn't get the use of the reader.
The reader doesn't care about the input format.
It just takes the input as Writable, so for Text this is LongWritable/Text
pairs. For NoSQL this might be LongWritable/BytesWritable.

It's up to you coding this for your input sequence, not for each format.
This is not hardcoded to text, only in the examples.

2012/12/10 Edward J. Yoon <ed...@apache.org>

> Again ... User can create their own InputFormatter to read records as
> a <Writable, ArrayWritable> from text file or sequence file, or
> NoSQLs.
>
> You can use K, V pairs and sequence file. Why do you want to use text
> file? Should I always write text file and parse them using
> VertexInputReader?
>
>
> On Tue, Dec 11, 2012 at 4:48 AM, Thomas Jungblut
> <th...@gmail.com> wrote:
> >>
> >> It's a gap in experience, Thomas.
> >
> >
> > Most probably you should read some good books on data extraction and then
> > choose your tools accordingly.
> > I never think that BSP is and will be a good extraction technique for
> > unstructured data.
> >
> > But these are just my two cents here- there seems to be somewhat more
> > political problems in this game than using tools appropriately.
> >
> > 2012/12/10 Thomas Jungblut <th...@gmail.com>
> >
> >> Yes, if you preprocess your data correctly.
> >> I have done the same unstructured extraction with the movie database
> from
> >> IMDB and it worked fine.
> >> That's just not a job for BSP, but for MapReduce.
> >>
> >> 2012/12/10 Edward J. Yoon <ed...@apache.org>
> >>
> >>> It's a gap in experience, Thomas. Do you think you can extract Twitter
> >>>
> >>> mention graph using parseVertex?
> >>>
> >>> On Tue, Dec 11, 2012 at 4:34 AM, Thomas Jungblut
> >>> <th...@gmail.com> wrote:
> >>> > I have trouble understanding you here.
> >>> >
> >>> > How can I generate large sample without coding?
> >>> >
> >>> >
> >>> > Do you mean random data generation or real-life data?
> >>> > Personally I think it is really convenient to transform unstructured
> >>> data
> >>> > in a text file to vertices.
> >>> >
> >>> >
> >>> > 2012/12/10 Edward <ed...@udanax.org>
> >>> >
> >>> >> I mean, With or without input reader. How can I generate large
> sample
> >>> >> without coding?
> >>> >>
> >>> >> It's unnecessary feature. As I mentioned before, only good for
> simple
> >>> and
> >>> >> small test.
> >>> >>
> >>> >> Sent from my iPhone
> >>> >>
> >>> >> On Dec 11, 2012, at 3:38 AM, Thomas Jungblut <
> >>> thomas.jungblut@gmail.com>
> >>> >> wrote:
> >>> >>
> >>> >> >>
> >>> >> >> In my case, generating test data is very annoying.
> >>> >> >
> >>> >> >
> >>> >> > Really? What is so difficult to generate tab separated text
> data?;)
> >>> >> > I think we shouldn't do this, but there seems to be very little
> >>> interest
> >>> >> in
> >>> >> > the community so I will not block your work on it.
> >>> >> >
> >>> >> > Good luck ;)
> >>> >>
> >>>
> >>>
> >>>
> >>> --
> >>> Best Regards, Edward J. Yoon
> >>> @eddieyoon
> >>>
> >>
> >>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>

Re: runtimePartitioning in GraphJobRunner

Posted by "Edward J. Yoon" <ed...@apache.org>.

Sorry, text is exception.

On Tue, Dec 11, 2012 at 4:52 AM, Edward J. Yoon <ed...@apache.org> wrote:
> Again ... User can create their own InputFormatter to read records as
> a <Writable, ArrayWritable> from text file or sequence file, or
> NoSQLs.
>
> You can use K, V pairs and sequence file. Why do you want to use text
> file? Should I always write text file and parse them using
> VertexInputReader?
>
>
> On Tue, Dec 11, 2012 at 4:48 AM, Thomas Jungblut
> <th...@gmail.com> wrote:
>>>
>>> It's a gap in experience, Thomas.
>>
>>
>> Most probably you should read some good books on data extraction and then
>> choose your tools accordingly.
>> I never think that BSP is and will be a good extraction technique for
>> unstructured data.
>>
>> But these are just my two cents here- there seems to be somewhat more
>> political problems in this game than using tools appropriately.
>>
>> 2012/12/10 Thomas Jungblut <th...@gmail.com>
>>
>>> Yes, if you preprocess your data correctly.
>>> I have done the same unstructured extraction with the movie database from
>>> IMDB and it worked fine.
>>> That's just not a job for BSP, but for MapReduce.
>>>
>>> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>>>
>>>> It's a gap in experience, Thomas. Do you think you can extract Twitter
>>>>
>>>> mention graph using parseVertex?
>>>>
>>>> On Tue, Dec 11, 2012 at 4:34 AM, Thomas Jungblut
>>>> <th...@gmail.com> wrote:
>>>> > I have trouble understanding you here.
>>>> >
>>>> > How can I generate large sample without coding?
>>>> >
>>>> >
>>>> > Do you mean random data generation or real-life data?
>>>> > Personally I think it is really convenient to transform unstructured
>>>> data
>>>> > in a text file to vertices.
>>>> >
>>>> >
>>>> > 2012/12/10 Edward <ed...@udanax.org>
>>>> >
>>>> >> I mean, With or without input reader. How can I generate large sample
>>>> >> without coding?
>>>> >>
>>>> >> It's unnecessary feature. As I mentioned before, only good for simple
>>>> and
>>>> >> small test.
>>>> >>
>>>> >> Sent from my iPhone
>>>> >>
>>>> >> On Dec 11, 2012, at 3:38 AM, Thomas Jungblut <
>>>> thomas.jungblut@gmail.com>
>>>> >> wrote:
>>>> >>
>>>> >> >>
>>>> >> >> In my case, generating test data is very annoying.
>>>> >> >
>>>> >> >
>>>> >> > Really? What is so difficult to generate tab separated text data?;)
>>>> >> > I think we shouldn't do this, but there seems to be very little
>>>> interest
>>>> >> in
>>>> >> > the community so I will not block your work on it.
>>>> >> >
>>>> >> > Good luck ;)
>>>> >>
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards, Edward J. Yoon
>>>> @eddieyoon
>>>>
>>>
>>>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: runtimePartitioning in GraphJobRunner

Posted by "Edward J. Yoon" <ed...@apache.org>.

Again ... User can create their own InputFormatter to read records as
a <Writable, ArrayWritable> from text file or sequence file, or
NoSQLs.

You can use K, V pairs and sequence file. Why do you want to use text
file? Should I always write text file and parse them using
VertexInputReader?


On Tue, Dec 11, 2012 at 4:48 AM, Thomas Jungblut
<th...@gmail.com> wrote:
>>
>> It's a gap in experience, Thomas.
>
>
> Most probably you should read some good books on data extraction and then
> choose your tools accordingly.
> I never think that BSP is and will be a good extraction technique for
> unstructured data.
>
> But these are just my two cents here- there seems to be somewhat more
> political problems in this game than using tools appropriately.
>
> 2012/12/10 Thomas Jungblut <th...@gmail.com>
>
>> Yes, if you preprocess your data correctly.
>> I have done the same unstructured extraction with the movie database from
>> IMDB and it worked fine.
>> That's just not a job for BSP, but for MapReduce.
>>
>> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>>
>>> It's a gap in experience, Thomas. Do you think you can extract Twitter
>>>
>>> mention graph using parseVertex?
>>>
>>> On Tue, Dec 11, 2012 at 4:34 AM, Thomas Jungblut
>>> <th...@gmail.com> wrote:
>>> > I have trouble understanding you here.
>>> >
>>> > How can I generate large sample without coding?
>>> >
>>> >
>>> > Do you mean random data generation or real-life data?
>>> > Personally I think it is really convenient to transform unstructured
>>> data
>>> > in a text file to vertices.
>>> >
>>> >
>>> > 2012/12/10 Edward <ed...@udanax.org>
>>> >
>>> >> I mean, With or without input reader. How can I generate large sample
>>> >> without coding?
>>> >>
>>> >> It's unnecessary feature. As I mentioned before, only good for simple
>>> and
>>> >> small test.
>>> >>
>>> >> Sent from my iPhone
>>> >>
>>> >> On Dec 11, 2012, at 3:38 AM, Thomas Jungblut <
>>> thomas.jungblut@gmail.com>
>>> >> wrote:
>>> >>
>>> >> >>
>>> >> >> In my case, generating test data is very annoying.
>>> >> >
>>> >> >
>>> >> > Really? What is so difficult to generate tab separated text data?;)
>>> >> > I think we shouldn't do this, but there seems to be very little
>>> interest
>>> >> in
>>> >> > the community so I will not block your work on it.
>>> >> >
>>> >> > Good luck ;)
>>> >>
>>>
>>>
>>>
>>> --
>>> Best Regards, Edward J. Yoon
>>> @eddieyoon
>>>
>>
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: runtimePartitioning in GraphJobRunner

Posted by Thomas Jungblut <th...@gmail.com>.

>
> It's a gap in experience, Thomas.


Most probably you should read some good books on data extraction and then
choose your tools accordingly.
I never think that BSP is and will be a good extraction technique for
unstructured data.

But these are just my two cents here- there seems to be somewhat more
political problems in this game than using tools appropriately.

2012/12/10 Thomas Jungblut <th...@gmail.com>

> Yes, if you preprocess your data correctly.
> I have done the same unstructured extraction with the movie database from
> IMDB and it worked fine.
> That's just not a job for BSP, but for MapReduce.
>
> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>
>> It's a gap in experience, Thomas. Do you think you can extract Twitter
>>
>> mention graph using parseVertex?
>>
>> On Tue, Dec 11, 2012 at 4:34 AM, Thomas Jungblut
>> <th...@gmail.com> wrote:
>> > I have trouble understanding you here.
>> >
>> > How can I generate large sample without coding?
>> >
>> >
>> > Do you mean random data generation or real-life data?
>> > Personally I think it is really convenient to transform unstructured
>> data
>> > in a text file to vertices.
>> >
>> >
>> > 2012/12/10 Edward <ed...@udanax.org>
>> >
>> >> I mean, With or without input reader. How can I generate large sample
>> >> without coding?
>> >>
>> >> It's unnecessary feature. As I mentioned before, only good for simple
>> and
>> >> small test.
>> >>
>> >> Sent from my iPhone
>> >>
>> >> On Dec 11, 2012, at 3:38 AM, Thomas Jungblut <
>> thomas.jungblut@gmail.com>
>> >> wrote:
>> >>
>> >> >>
>> >> >> In my case, generating test data is very annoying.
>> >> >
>> >> >
>> >> > Really? What is so difficult to generate tab separated text data?;)
>> >> > I think we shouldn't do this, but there seems to be very little
>> interest
>> >> in
>> >> > the community so I will not block your work on it.
>> >> >
>> >> > Good luck ;)
>> >>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>>
>
>

Re: runtimePartitioning in GraphJobRunner

Posted by Thomas Jungblut <th...@gmail.com>.

Yes, if you preprocess your data correctly.
I have done the same unstructured extraction with the movie database from
IMDB and it worked fine.
That's just not a job for BSP, but for MapReduce.

2012/12/10 Edward J. Yoon <ed...@apache.org>

> It's a gap in experience, Thomas. Do you think you can extract Twitter
> mention graph using parseVertex?
>
> On Tue, Dec 11, 2012 at 4:34 AM, Thomas Jungblut
> <th...@gmail.com> wrote:
> > I have trouble understanding you here.
> >
> > How can I generate large sample without coding?
> >
> >
> > Do you mean random data generation or real-life data?
> > Personally I think it is really convenient to transform unstructured data
> > in a text file to vertices.
> >
> >
> > 2012/12/10 Edward <ed...@udanax.org>
> >
> >> I mean, With or without input reader. How can I generate large sample
> >> without coding?
> >>
> >> It's unnecessary feature. As I mentioned before, only good for simple
> and
> >> small test.
> >>
> >> Sent from my iPhone
> >>
> >> On Dec 11, 2012, at 3:38 AM, Thomas Jungblut <thomas.jungblut@gmail.com
> >
> >> wrote:
> >>
> >> >>
> >> >> In my case, generating test data is very annoying.
> >> >
> >> >
> >> > Really? What is so difficult to generate tab separated text data?;)
> >> > I think we shouldn't do this, but there seems to be very little
> interest
> >> in
> >> > the community so I will not block your work on it.
> >> >
> >> > Good luck ;)
> >>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>

Re: runtimePartitioning in GraphJobRunner

Posted by "Edward J. Yoon" <ed...@apache.org>.

It's a gap in experience, Thomas. Do you think you can extract Twitter
mention graph using parseVertex?

On Tue, Dec 11, 2012 at 4:34 AM, Thomas Jungblut
<th...@gmail.com> wrote:
> I have trouble understanding you here.
>
> How can I generate large sample without coding?
>
>
> Do you mean random data generation or real-life data?
> Personally I think it is really convenient to transform unstructured data
> in a text file to vertices.
>
>
> 2012/12/10 Edward <ed...@udanax.org>
>
>> I mean, With or without input reader. How can I generate large sample
>> without coding?
>>
>> It's unnecessary feature. As I mentioned before, only good for simple and
>> small test.
>>
>> Sent from my iPhone
>>
>> On Dec 11, 2012, at 3:38 AM, Thomas Jungblut <th...@gmail.com>
>> wrote:
>>
>> >>
>> >> In my case, generating test data is very annoying.
>> >
>> >
>> > Really? What is so difficult to generate tab separated text data?;)
>> > I think we shouldn't do this, but there seems to be very little interest
>> in
>> > the community so I will not block your work on it.
>> >
>> > Good luck ;)
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: runtimePartitioning in GraphJobRunner

Posted by Thomas Jungblut <th...@gmail.com>.

I have trouble understanding you here.

How can I generate large sample without coding?


Do you mean random data generation or real-life data?
Personally I think it is really convenient to transform unstructured data
in a text file to vertices.


2012/12/10 Edward <ed...@udanax.org>

> I mean, With or without input reader. How can I generate large sample
> without coding?
>
> It's unnecessary feature. As I mentioned before, only good for simple and
> small test.
>
> Sent from my iPhone
>
> On Dec 11, 2012, at 3:38 AM, Thomas Jungblut <th...@gmail.com>
> wrote:
>
> >>
> >> In my case, generating test data is very annoying.
> >
> >
> > Really? What is so difficult to generate tab separated text data?;)
> > I think we shouldn't do this, but there seems to be very little interest
> in
> > the community so I will not block your work on it.
> >
> > Good luck ;)
>

Re: runtimePartitioning in GraphJobRunner

Posted by Edward <ed...@udanax.org>.

I mean, With or without input reader. How can I generate large sample without coding?

It's unnecessary feature. As I mentioned before, only good for simple and small test.

Sent from my iPhone

On Dec 11, 2012, at 3:38 AM, Thomas Jungblut <th...@gmail.com> wrote:

>> 
>> In my case, generating test data is very annoying.
> 
> 
> Really? What is so difficult to generate tab separated text data?;)
> I think we shouldn't do this, but there seems to be very little interest in
> the community so I will not block your work on it.
> 
> Good luck ;)

Re: runtimePartitioning in GraphJobRunner

Posted by Thomas Jungblut <th...@gmail.com>.

>
> In my case, generating test data is very annoying.


Really? What is so difficult to generate tab separated text data?;)
I think we shouldn't do this, but there seems to be very little interest in
the community so I will not block your work on it.

Good luck ;)

Re: runtimePartitioning in GraphJobRunner

Posted by "Edward J. Yoon" <ed...@apache.org>.

Anyway, then I'm removing them tomorrow.

On Mon, Dec 10, 2012 at 10:09 PM, Edward J. Yoon <ed...@apache.org> wrote:
> You know what? If graph is not stored well in somewhere, graph should
> be extracted from unstructured data. parseVertex API is only good for
> simple test/debug programs, because it's human readable text.
>
> In my case, generating test data is very annoying.
>
> On Mon, Dec 10, 2012 at 9:51 PM, Thomas Jungblut
> <th...@gmail.com> wrote:
>> That's nothing personal, just about how we solve the problems we face.
>> We need just some trade-off between API compatibility and scalability
>> improvement.
>>
>> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>>
>>> I don't dislike your Intuitive input reader. Once cleaning is done, we
>>> can think about it again.
>>>
>>> On Mon, Dec 10, 2012 at 9:37 PM, Thomas Jungblut
>>> <th...@gmail.com> wrote:
>>> > no problem, forgot what I've done there anyways.
>>> >
>>> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>>> >
>>> >> > Just wanted to remind you why we introduced runtime partitioning.
>>> >>
>>> >> Sorry that I could not review your patch of HAMA-531 and many things
>>> >> of Hama 0.5 release. I was busy.
>>> >>
>>> >> On Mon, Dec 10, 2012 at 8:47 PM, Thomas Jungblut
>>> >> <th...@gmail.com> wrote:
>>> >> > Just wanted to remind you why we introduced runtime partitioning.
>>> >> >
>>> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>>> >> >
>>> >> >> HDFS is common. It's not tunable for only Hama BSP computing.
>>> >> >>
>>> >> >> > Yes, so spilling on disk is the easiest solution to save memory.
>>> Not
>>> >> >> > changing the partitioning.
>>> >> >> > If you want to split again through the block boundaries to
>>> distribute
>>> >> the
>>> >> >> > data through the cluster, then do it, but this is plainly wrong.
>>> >> >>
>>> >> >> Vertex load balancing is basically uses Hash partitioner. You can't
>>> >> >> avoid data transfers.
>>> >> >>
>>> >> >> Again...,
>>> >> >>
>>> >> >> VertexInputReader and runtime partitioning make code complex as I
>>> >> >> mentioned above.
>>> >> >>
>>> >> >> > This reader is needed, so people can create vertices from their own
>>> >> >> fileformat.
>>> >> >>
>>> >> >> I don't think so. Instead of VertexInputReader, we can provide <K
>>> >> >> extends WritableComparable, V extends ArrayWritable>.
>>> >> >>
>>> >> >> Let's assume that there's a web table in Google's BigTable (HBase).
>>> >> >> User can create their own WebTableInputFormatter to read records as a
>>> >> >> <Text url, TextArrayWritable anchors>. Am I wrong?
>>> >> >>
>>> >> >> On Mon, Dec 10, 2012 at 8:21 PM, Thomas Jungblut
>>> >> >> <th...@gmail.com> wrote:
>>> >> >> > Yes, because changing the blocksize to 32m will just use 300mb of
>>> >> memory,
>>> >> >> > so you can add more machines to fit the number of resulting tasks.
>>> >> >> >
>>> >> >> > If each node have small memory, there's no way to process in memory
>>> >> >> >
>>> >> >> >
>>> >> >> > Yes, so spilling on disk is the easiest solution to save memory.
>>> Not
>>> >> >> > changing the partitioning.
>>> >> >> > If you want to split again through the block boundaries to
>>> distribute
>>> >> the
>>> >> >> > data through the cluster, then do it, but this is plainly wrong.
>>> >> >> >
>>> >> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>>> >> >> >
>>> >> >> >> > A Hama cluster is scalable. It means that the computing capacity
>>> >> >> >> >> should be increased by adding slaves. Right?
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> > I'm sorry, but I don't see how this relates to the vertex input
>>> >> >> reader.
>>> >> >> >>
>>> >> >> >> Not related with input reader. It related with partitioning and
>>> load
>>> >> >> >> balancing. As I reported to you before, to process vertices within
>>> >> >> >> 256MB block, each TaskRunner requied 25~30GB memory.
>>> >> >> >>
>>> >> >> >> If each node have small memory, there's no way to process in
>>> memory
>>> >> >> >> without changing block size of HDFS.
>>> >> >> >>
>>> >> >> >> Do you think this is scalable?
>>> >> >> >>
>>> >> >> >> On Mon, Dec 10, 2012 at 7:59 PM, Thomas Jungblut
>>> >> >> >> <th...@gmail.com> wrote:
>>> >> >> >> > Oh okay, so if you want to remove that, have a lot of fun. This
>>> >> >> reader is
>>> >> >> >> > needed, so people can create vertices from their own fileformat.
>>> >> >> >> > Going back to a sequencefile input will not only break backward
>>> >> >> >> > compatibility but also make the same issues we had before.
>>> >> >> >> >
>>> >> >> >> > A Hama cluster is scalable. It means that the computing capacity
>>> >> >> >> >> should be increased by adding slaves. Right?
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> > I'm sorry, but I don't see how this relates to the vertex input
>>> >> >> reader.
>>> >> >> >> >
>>> >> >> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>>> >> >> >> >
>>> >> >> >> >> A Hama cluster is scalable. It means that the computing
>>> capacity
>>> >> >> >> >> should be increased by adding slaves. Right?
>>> >> >> >> >>
>>> >> >> >> >> As I mentioned before, disk-queue and storing vertices on local
>>> >> disk
>>> >> >> >> >> are not urgent.
>>> >> >> >> >>
>>> >> >> >> >> In short, yeah, I wan to remove VertexInputReader and runtime
>>> >> >> >> >> partition in Graph package.
>>> >> >> >> >>
>>> >> >> >> >> See also,
>>> >> >> >> >>
>>> >> >> >>
>>> >> >>
>>> >>
>>> https://issues.apache.org/jira/browse/HAMA-531?focusedCommentId=13527756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13527756
>>> >> >> >> >>
>>> >> >> >> >> On Mon, Dec 10, 2012 at 7:31 PM, Thomas Jungblut
>>> >> >> >> >> <th...@gmail.com> wrote:
>>> >> >> >> >> > uhm, I have no idea what you want to archieve, do you want to
>>> >> get
>>> >> >> >> back to
>>> >> >> >> >> > client-side partitioning?
>>> >> >> >> >> >
>>> >> >> >> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>>> >> >> >> >> >
>>> >> >> >> >> >> If there's no opinion, I'll remove VertexInputReader in
>>> >> >> >> >> >> GraphJobRunner, because it make code complex. Let's consider
>>> >> again
>>> >> >> >> >> >> about the VertexInputReader, after fixing HAMA-531 and
>>> HAMA-632
>>> >> >> >> >> >> issues.
>>> >> >> >> >> >>
>>> >> >> >> >> >> On Fri, Dec 7, 2012 at 9:35 AM, Edward J. Yoon <
>>> >> >> >> edwardyoon@apache.org>
>>> >> >> >> >> >> wrote:
>>> >> >> >> >> >> > Or, I'd like to get rid of VertexInputReader.
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > On Fri, Dec 7, 2012 at 9:30 AM, Edward J. Yoon <
>>> >> >> >> edwardyoon@apache.org
>>> >> >> >> >> >
>>> >> >> >> >> >> wrote:
>>> >> >> >> >> >> >> In fact, there's no choice but to use runtimePartitioning
>>> >> >> >> (because of
>>> >> >> >> >> >> >> VertexInputReader). Right? If so, I would like to delete
>>> all
>>> >> >> "if
>>> >> >> >> >> >> >> (runtimePartitioning) {" conditions.
>>> >> >> >> >> >> >>
>>> >> >> >> >> >> >> --
>>> >> >> >> >> >> >> Best Regards, Edward J. Yoon
>>> >> >> >> >> >> >> @eddieyoon
>>> >> >> >> >> >> >
>>> >> >> >> >> >> >
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > --
>>> >> >> >> >> >> > Best Regards, Edward J. Yoon
>>> >> >> >> >> >> > @eddieyoon
>>> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >> --
>>> >> >> >> >> >> Best Regards, Edward J. Yoon
>>> >> >> >> >> >> @eddieyoon
>>> >> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >> --
>>> >> >> >> >> Best Regards, Edward J. Yoon
>>> >> >> >> >> @eddieyoon
>>> >> >> >> >>
>>> >> >> >>
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> --
>>> >> >> >> Best Regards, Edward J. Yoon
>>> >> >> >> @eddieyoon
>>> >> >> >>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Best Regards, Edward J. Yoon
>>> >> >> @eddieyoon
>>> >> >>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Best Regards, Edward J. Yoon
>>> >> @eddieyoon
>>> >>
>>>
>>>
>>>
>>> --
>>> Best Regards, Edward J. Yoon
>>> @eddieyoon
>>>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: runtimePartitioning in GraphJobRunner

Posted by "Edward J. Yoon" <ed...@apache.org>.

You know what? If graph is not stored well in somewhere, graph should
be extracted from unstructured data. parseVertex API is only good for
simple test/debug programs, because it's human readable text.

In my case, generating test data is very annoying.

On Mon, Dec 10, 2012 at 9:51 PM, Thomas Jungblut
<th...@gmail.com> wrote:
> That's nothing personal, just about how we solve the problems we face.
> We need just some trade-off between API compatibility and scalability
> improvement.
>
> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>
>> I don't dislike your Intuitive input reader. Once cleaning is done, we
>> can think about it again.
>>
>> On Mon, Dec 10, 2012 at 9:37 PM, Thomas Jungblut
>> <th...@gmail.com> wrote:
>> > no problem, forgot what I've done there anyways.
>> >
>> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >
>> >> > Just wanted to remind you why we introduced runtime partitioning.
>> >>
>> >> Sorry that I could not review your patch of HAMA-531 and many things
>> >> of Hama 0.5 release. I was busy.
>> >>
>> >> On Mon, Dec 10, 2012 at 8:47 PM, Thomas Jungblut
>> >> <th...@gmail.com> wrote:
>> >> > Just wanted to remind you why we introduced runtime partitioning.
>> >> >
>> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >> >
>> >> >> HDFS is common. It's not tunable for only Hama BSP computing.
>> >> >>
>> >> >> > Yes, so spilling on disk is the easiest solution to save memory.
>> Not
>> >> >> > changing the partitioning.
>> >> >> > If you want to split again through the block boundaries to
>> distribute
>> >> the
>> >> >> > data through the cluster, then do it, but this is plainly wrong.
>> >> >>
>> >> >> Vertex load balancing is basically uses Hash partitioner. You can't
>> >> >> avoid data transfers.
>> >> >>
>> >> >> Again...,
>> >> >>
>> >> >> VertexInputReader and runtime partitioning make code complex as I
>> >> >> mentioned above.
>> >> >>
>> >> >> > This reader is needed, so people can create vertices from their own
>> >> >> fileformat.
>> >> >>
>> >> >> I don't think so. Instead of VertexInputReader, we can provide <K
>> >> >> extends WritableComparable, V extends ArrayWritable>.
>> >> >>
>> >> >> Let's assume that there's a web table in Google's BigTable (HBase).
>> >> >> User can create their own WebTableInputFormatter to read records as a
>> >> >> <Text url, TextArrayWritable anchors>. Am I wrong?
>> >> >>
>> >> >> On Mon, Dec 10, 2012 at 8:21 PM, Thomas Jungblut
>> >> >> <th...@gmail.com> wrote:
>> >> >> > Yes, because changing the blocksize to 32m will just use 300mb of
>> >> memory,
>> >> >> > so you can add more machines to fit the number of resulting tasks.
>> >> >> >
>> >> >> > If each node have small memory, there's no way to process in memory
>> >> >> >
>> >> >> >
>> >> >> > Yes, so spilling on disk is the easiest solution to save memory.
>> Not
>> >> >> > changing the partitioning.
>> >> >> > If you want to split again through the block boundaries to
>> distribute
>> >> the
>> >> >> > data through the cluster, then do it, but this is plainly wrong.
>> >> >> >
>> >> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >> >> >
>> >> >> >> > A Hama cluster is scalable. It means that the computing capacity
>> >> >> >> >> should be increased by adding slaves. Right?
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > I'm sorry, but I don't see how this relates to the vertex input
>> >> >> reader.
>> >> >> >>
>> >> >> >> Not related with input reader. It related with partitioning and
>> load
>> >> >> >> balancing. As I reported to you before, to process vertices within
>> >> >> >> 256MB block, each TaskRunner requied 25~30GB memory.
>> >> >> >>
>> >> >> >> If each node have small memory, there's no way to process in
>> memory
>> >> >> >> without changing block size of HDFS.
>> >> >> >>
>> >> >> >> Do you think this is scalable?
>> >> >> >>
>> >> >> >> On Mon, Dec 10, 2012 at 7:59 PM, Thomas Jungblut
>> >> >> >> <th...@gmail.com> wrote:
>> >> >> >> > Oh okay, so if you want to remove that, have a lot of fun. This
>> >> >> reader is
>> >> >> >> > needed, so people can create vertices from their own fileformat.
>> >> >> >> > Going back to a sequencefile input will not only break backward
>> >> >> >> > compatibility but also make the same issues we had before.
>> >> >> >> >
>> >> >> >> > A Hama cluster is scalable. It means that the computing capacity
>> >> >> >> >> should be increased by adding slaves. Right?
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > I'm sorry, but I don't see how this relates to the vertex input
>> >> >> reader.
>> >> >> >> >
>> >> >> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >> >> >> >
>> >> >> >> >> A Hama cluster is scalable. It means that the computing
>> capacity
>> >> >> >> >> should be increased by adding slaves. Right?
>> >> >> >> >>
>> >> >> >> >> As I mentioned before, disk-queue and storing vertices on local
>> >> disk
>> >> >> >> >> are not urgent.
>> >> >> >> >>
>> >> >> >> >> In short, yeah, I wan to remove VertexInputReader and runtime
>> >> >> >> >> partition in Graph package.
>> >> >> >> >>
>> >> >> >> >> See also,
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> https://issues.apache.org/jira/browse/HAMA-531?focusedCommentId=13527756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13527756
>> >> >> >> >>
>> >> >> >> >> On Mon, Dec 10, 2012 at 7:31 PM, Thomas Jungblut
>> >> >> >> >> <th...@gmail.com> wrote:
>> >> >> >> >> > uhm, I have no idea what you want to archieve, do you want to
>> >> get
>> >> >> >> back to
>> >> >> >> >> > client-side partitioning?
>> >> >> >> >> >
>> >> >> >> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >> >> >> >> >
>> >> >> >> >> >> If there's no opinion, I'll remove VertexInputReader in
>> >> >> >> >> >> GraphJobRunner, because it make code complex. Let's consider
>> >> again
>> >> >> >> >> >> about the VertexInputReader, after fixing HAMA-531 and
>> HAMA-632
>> >> >> >> >> >> issues.
>> >> >> >> >> >>
>> >> >> >> >> >> On Fri, Dec 7, 2012 at 9:35 AM, Edward J. Yoon <
>> >> >> >> edwardyoon@apache.org>
>> >> >> >> >> >> wrote:
>> >> >> >> >> >> > Or, I'd like to get rid of VertexInputReader.
>> >> >> >> >> >> >
>> >> >> >> >> >> > On Fri, Dec 7, 2012 at 9:30 AM, Edward J. Yoon <
>> >> >> >> edwardyoon@apache.org
>> >> >> >> >> >
>> >> >> >> >> >> wrote:
>> >> >> >> >> >> >> In fact, there's no choice but to use runtimePartitioning
>> >> >> >> (because of
>> >> >> >> >> >> >> VertexInputReader). Right? If so, I would like to delete
>> all
>> >> >> "if
>> >> >> >> >> >> >> (runtimePartitioning) {" conditions.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> --
>> >> >> >> >> >> >> Best Regards, Edward J. Yoon
>> >> >> >> >> >> >> @eddieyoon
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> > --
>> >> >> >> >> >> > Best Regards, Edward J. Yoon
>> >> >> >> >> >> > @eddieyoon
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> --
>> >> >> >> >> >> Best Regards, Edward J. Yoon
>> >> >> >> >> >> @eddieyoon
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> --
>> >> >> >> >> Best Regards, Edward J. Yoon
>> >> >> >> >> @eddieyoon
>> >> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> --
>> >> >> >> Best Regards, Edward J. Yoon
>> >> >> >> @eddieyoon
>> >> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Best Regards, Edward J. Yoon
>> >> >> @eddieyoon
>> >> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Best Regards, Edward J. Yoon
>> >> @eddieyoon
>> >>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: runtimePartitioning in GraphJobRunner

Posted by Thomas Jungblut <th...@gmail.com>.

That's nothing personal, just about how we solve the problems we face.
We need just some trade-off between API compatibility and scalability
improvement.

2012/12/10 Edward J. Yoon <ed...@apache.org>

> I don't dislike your Intuitive input reader. Once cleaning is done, we
> can think about it again.
>
> On Mon, Dec 10, 2012 at 9:37 PM, Thomas Jungblut
> <th...@gmail.com> wrote:
> > no problem, forgot what I've done there anyways.
> >
> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
> >
> >> > Just wanted to remind you why we introduced runtime partitioning.
> >>
> >> Sorry that I could not review your patch of HAMA-531 and many things
> >> of Hama 0.5 release. I was busy.
> >>
> >> On Mon, Dec 10, 2012 at 8:47 PM, Thomas Jungblut
> >> <th...@gmail.com> wrote:
> >> > Just wanted to remind you why we introduced runtime partitioning.
> >> >
> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
> >> >
> >> >> HDFS is common. It's not tunable for only Hama BSP computing.
> >> >>
> >> >> > Yes, so spilling on disk is the easiest solution to save memory.
> Not
> >> >> > changing the partitioning.
> >> >> > If you want to split again through the block boundaries to
> distribute
> >> the
> >> >> > data through the cluster, then do it, but this is plainly wrong.
> >> >>
> >> >> Vertex load balancing is basically uses Hash partitioner. You can't
> >> >> avoid data transfers.
> >> >>
> >> >> Again...,
> >> >>
> >> >> VertexInputReader and runtime partitioning make code complex as I
> >> >> mentioned above.
> >> >>
> >> >> > This reader is needed, so people can create vertices from their own
> >> >> fileformat.
> >> >>
> >> >> I don't think so. Instead of VertexInputReader, we can provide <K
> >> >> extends WritableComparable, V extends ArrayWritable>.
> >> >>
> >> >> Let's assume that there's a web table in Google's BigTable (HBase).
> >> >> User can create their own WebTableInputFormatter to read records as a
> >> >> <Text url, TextArrayWritable anchors>. Am I wrong?
> >> >>
> >> >> On Mon, Dec 10, 2012 at 8:21 PM, Thomas Jungblut
> >> >> <th...@gmail.com> wrote:
> >> >> > Yes, because changing the blocksize to 32m will just use 300mb of
> >> memory,
> >> >> > so you can add more machines to fit the number of resulting tasks.
> >> >> >
> >> >> > If each node have small memory, there's no way to process in memory
> >> >> >
> >> >> >
> >> >> > Yes, so spilling on disk is the easiest solution to save memory.
> Not
> >> >> > changing the partitioning.
> >> >> > If you want to split again through the block boundaries to
> distribute
> >> the
> >> >> > data through the cluster, then do it, but this is plainly wrong.
> >> >> >
> >> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
> >> >> >
> >> >> >> > A Hama cluster is scalable. It means that the computing capacity
> >> >> >> >> should be increased by adding slaves. Right?
> >> >> >> >
> >> >> >> >
> >> >> >> > I'm sorry, but I don't see how this relates to the vertex input
> >> >> reader.
> >> >> >>
> >> >> >> Not related with input reader. It related with partitioning and
> load
> >> >> >> balancing. As I reported to you before, to process vertices within
> >> >> >> 256MB block, each TaskRunner requied 25~30GB memory.
> >> >> >>
> >> >> >> If each node have small memory, there's no way to process in
> memory
> >> >> >> without changing block size of HDFS.
> >> >> >>
> >> >> >> Do you think this is scalable?
> >> >> >>
> >> >> >> On Mon, Dec 10, 2012 at 7:59 PM, Thomas Jungblut
> >> >> >> <th...@gmail.com> wrote:
> >> >> >> > Oh okay, so if you want to remove that, have a lot of fun. This
> >> >> reader is
> >> >> >> > needed, so people can create vertices from their own fileformat.
> >> >> >> > Going back to a sequencefile input will not only break backward
> >> >> >> > compatibility but also make the same issues we had before.
> >> >> >> >
> >> >> >> > A Hama cluster is scalable. It means that the computing capacity
> >> >> >> >> should be increased by adding slaves. Right?
> >> >> >> >
> >> >> >> >
> >> >> >> > I'm sorry, but I don't see how this relates to the vertex input
> >> >> reader.
> >> >> >> >
> >> >> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
> >> >> >> >
> >> >> >> >> A Hama cluster is scalable. It means that the computing
> capacity
> >> >> >> >> should be increased by adding slaves. Right?
> >> >> >> >>
> >> >> >> >> As I mentioned before, disk-queue and storing vertices on local
> >> disk
> >> >> >> >> are not urgent.
> >> >> >> >>
> >> >> >> >> In short, yeah, I wan to remove VertexInputReader and runtime
> >> >> >> >> partition in Graph package.
> >> >> >> >>
> >> >> >> >> See also,
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> https://issues.apache.org/jira/browse/HAMA-531?focusedCommentId=13527756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13527756
> >> >> >> >>
> >> >> >> >> On Mon, Dec 10, 2012 at 7:31 PM, Thomas Jungblut
> >> >> >> >> <th...@gmail.com> wrote:
> >> >> >> >> > uhm, I have no idea what you want to archieve, do you want to
> >> get
> >> >> >> back to
> >> >> >> >> > client-side partitioning?
> >> >> >> >> >
> >> >> >> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
> >> >> >> >> >
> >> >> >> >> >> If there's no opinion, I'll remove VertexInputReader in
> >> >> >> >> >> GraphJobRunner, because it make code complex. Let's consider
> >> again
> >> >> >> >> >> about the VertexInputReader, after fixing HAMA-531 and
> HAMA-632
> >> >> >> >> >> issues.
> >> >> >> >> >>
> >> >> >> >> >> On Fri, Dec 7, 2012 at 9:35 AM, Edward J. Yoon <
> >> >> >> edwardyoon@apache.org>
> >> >> >> >> >> wrote:
> >> >> >> >> >> > Or, I'd like to get rid of VertexInputReader.
> >> >> >> >> >> >
> >> >> >> >> >> > On Fri, Dec 7, 2012 at 9:30 AM, Edward J. Yoon <
> >> >> >> edwardyoon@apache.org
> >> >> >> >> >
> >> >> >> >> >> wrote:
> >> >> >> >> >> >> In fact, there's no choice but to use runtimePartitioning
> >> >> >> (because of
> >> >> >> >> >> >> VertexInputReader). Right? If so, I would like to delete
> all
> >> >> "if
> >> >> >> >> >> >> (runtimePartitioning) {" conditions.
> >> >> >> >> >> >>
> >> >> >> >> >> >> --
> >> >> >> >> >> >> Best Regards, Edward J. Yoon
> >> >> >> >> >> >> @eddieyoon
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> > --
> >> >> >> >> >> > Best Regards, Edward J. Yoon
> >> >> >> >> >> > @eddieyoon
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> --
> >> >> >> >> >> Best Regards, Edward J. Yoon
> >> >> >> >> >> @eddieyoon
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> --
> >> >> >> >> Best Regards, Edward J. Yoon
> >> >> >> >> @eddieyoon
> >> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Best Regards, Edward J. Yoon
> >> >> >> @eddieyoon
> >> >> >>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Best Regards, Edward J. Yoon
> >> >> @eddieyoon
> >> >>
> >>
> >>
> >>
> >> --
> >> Best Regards, Edward J. Yoon
> >> @eddieyoon
> >>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>

Re: runtimePartitioning in GraphJobRunner

Posted by "Edward J. Yoon" <ed...@apache.org>.

I don't dislike your Intuitive input reader. Once cleaning is done, we
can think about it again.

On Mon, Dec 10, 2012 at 9:37 PM, Thomas Jungblut
<th...@gmail.com> wrote:
> no problem, forgot what I've done there anyways.
>
> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>
>> > Just wanted to remind you why we introduced runtime partitioning.
>>
>> Sorry that I could not review your patch of HAMA-531 and many things
>> of Hama 0.5 release. I was busy.
>>
>> On Mon, Dec 10, 2012 at 8:47 PM, Thomas Jungblut
>> <th...@gmail.com> wrote:
>> > Just wanted to remind you why we introduced runtime partitioning.
>> >
>> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >
>> >> HDFS is common. It's not tunable for only Hama BSP computing.
>> >>
>> >> > Yes, so spilling on disk is the easiest solution to save memory. Not
>> >> > changing the partitioning.
>> >> > If you want to split again through the block boundaries to distribute
>> the
>> >> > data through the cluster, then do it, but this is plainly wrong.
>> >>
>> >> Vertex load balancing is basically uses Hash partitioner. You can't
>> >> avoid data transfers.
>> >>
>> >> Again...,
>> >>
>> >> VertexInputReader and runtime partitioning make code complex as I
>> >> mentioned above.
>> >>
>> >> > This reader is needed, so people can create vertices from their own
>> >> fileformat.
>> >>
>> >> I don't think so. Instead of VertexInputReader, we can provide <K
>> >> extends WritableComparable, V extends ArrayWritable>.
>> >>
>> >> Let's assume that there's a web table in Google's BigTable (HBase).
>> >> User can create their own WebTableInputFormatter to read records as a
>> >> <Text url, TextArrayWritable anchors>. Am I wrong?
>> >>
>> >> On Mon, Dec 10, 2012 at 8:21 PM, Thomas Jungblut
>> >> <th...@gmail.com> wrote:
>> >> > Yes, because changing the blocksize to 32m will just use 300mb of
>> memory,
>> >> > so you can add more machines to fit the number of resulting tasks.
>> >> >
>> >> > If each node have small memory, there's no way to process in memory
>> >> >
>> >> >
>> >> > Yes, so spilling on disk is the easiest solution to save memory. Not
>> >> > changing the partitioning.
>> >> > If you want to split again through the block boundaries to distribute
>> the
>> >> > data through the cluster, then do it, but this is plainly wrong.
>> >> >
>> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >> >
>> >> >> > A Hama cluster is scalable. It means that the computing capacity
>> >> >> >> should be increased by adding slaves. Right?
>> >> >> >
>> >> >> >
>> >> >> > I'm sorry, but I don't see how this relates to the vertex input
>> >> reader.
>> >> >>
>> >> >> Not related with input reader. It related with partitioning and load
>> >> >> balancing. As I reported to you before, to process vertices within
>> >> >> 256MB block, each TaskRunner requied 25~30GB memory.
>> >> >>
>> >> >> If each node have small memory, there's no way to process in memory
>> >> >> without changing block size of HDFS.
>> >> >>
>> >> >> Do you think this is scalable?
>> >> >>
>> >> >> On Mon, Dec 10, 2012 at 7:59 PM, Thomas Jungblut
>> >> >> <th...@gmail.com> wrote:
>> >> >> > Oh okay, so if you want to remove that, have a lot of fun. This
>> >> reader is
>> >> >> > needed, so people can create vertices from their own fileformat.
>> >> >> > Going back to a sequencefile input will not only break backward
>> >> >> > compatibility but also make the same issues we had before.
>> >> >> >
>> >> >> > A Hama cluster is scalable. It means that the computing capacity
>> >> >> >> should be increased by adding slaves. Right?
>> >> >> >
>> >> >> >
>> >> >> > I'm sorry, but I don't see how this relates to the vertex input
>> >> reader.
>> >> >> >
>> >> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >> >> >
>> >> >> >> A Hama cluster is scalable. It means that the computing capacity
>> >> >> >> should be increased by adding slaves. Right?
>> >> >> >>
>> >> >> >> As I mentioned before, disk-queue and storing vertices on local
>> disk
>> >> >> >> are not urgent.
>> >> >> >>
>> >> >> >> In short, yeah, I wan to remove VertexInputReader and runtime
>> >> >> >> partition in Graph package.
>> >> >> >>
>> >> >> >> See also,
>> >> >> >>
>> >> >>
>> >>
>> https://issues.apache.org/jira/browse/HAMA-531?focusedCommentId=13527756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13527756
>> >> >> >>
>> >> >> >> On Mon, Dec 10, 2012 at 7:31 PM, Thomas Jungblut
>> >> >> >> <th...@gmail.com> wrote:
>> >> >> >> > uhm, I have no idea what you want to archieve, do you want to
>> get
>> >> >> back to
>> >> >> >> > client-side partitioning?
>> >> >> >> >
>> >> >> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >> >> >> >
>> >> >> >> >> If there's no opinion, I'll remove VertexInputReader in
>> >> >> >> >> GraphJobRunner, because it make code complex. Let's consider
>> again
>> >> >> >> >> about the VertexInputReader, after fixing HAMA-531 and HAMA-632
>> >> >> >> >> issues.
>> >> >> >> >>
>> >> >> >> >> On Fri, Dec 7, 2012 at 9:35 AM, Edward J. Yoon <
>> >> >> edwardyoon@apache.org>
>> >> >> >> >> wrote:
>> >> >> >> >> > Or, I'd like to get rid of VertexInputReader.
>> >> >> >> >> >
>> >> >> >> >> > On Fri, Dec 7, 2012 at 9:30 AM, Edward J. Yoon <
>> >> >> edwardyoon@apache.org
>> >> >> >> >
>> >> >> >> >> wrote:
>> >> >> >> >> >> In fact, there's no choice but to use runtimePartitioning
>> >> >> (because of
>> >> >> >> >> >> VertexInputReader). Right? If so, I would like to delete all
>> >> "if
>> >> >> >> >> >> (runtimePartitioning) {" conditions.
>> >> >> >> >> >>
>> >> >> >> >> >> --
>> >> >> >> >> >> Best Regards, Edward J. Yoon
>> >> >> >> >> >> @eddieyoon
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > --
>> >> >> >> >> > Best Regards, Edward J. Yoon
>> >> >> >> >> > @eddieyoon
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> --
>> >> >> >> >> Best Regards, Edward J. Yoon
>> >> >> >> >> @eddieyoon
>> >> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> --
>> >> >> >> Best Regards, Edward J. Yoon
>> >> >> >> @eddieyoon
>> >> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Best Regards, Edward J. Yoon
>> >> >> @eddieyoon
>> >> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Best Regards, Edward J. Yoon
>> >> @eddieyoon
>> >>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: runtimePartitioning in GraphJobRunner

Posted by Thomas Jungblut <th...@gmail.com>.

no problem, forgot what I've done there anyways.

2012/12/10 Edward J. Yoon <ed...@apache.org>

> > Just wanted to remind you why we introduced runtime partitioning.
>
> Sorry that I could not review your patch of HAMA-531 and many things
> of Hama 0.5 release. I was busy.
>
> On Mon, Dec 10, 2012 at 8:47 PM, Thomas Jungblut
> <th...@gmail.com> wrote:
> > Just wanted to remind you why we introduced runtime partitioning.
> >
> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
> >
> >> HDFS is common. It's not tunable for only Hama BSP computing.
> >>
> >> > Yes, so spilling on disk is the easiest solution to save memory. Not
> >> > changing the partitioning.
> >> > If you want to split again through the block boundaries to distribute
> the
> >> > data through the cluster, then do it, but this is plainly wrong.
> >>
> >> Vertex load balancing is basically uses Hash partitioner. You can't
> >> avoid data transfers.
> >>
> >> Again...,
> >>
> >> VertexInputReader and runtime partitioning make code complex as I
> >> mentioned above.
> >>
> >> > This reader is needed, so people can create vertices from their own
> >> fileformat.
> >>
> >> I don't think so. Instead of VertexInputReader, we can provide <K
> >> extends WritableComparable, V extends ArrayWritable>.
> >>
> >> Let's assume that there's a web table in Google's BigTable (HBase).
> >> User can create their own WebTableInputFormatter to read records as a
> >> <Text url, TextArrayWritable anchors>. Am I wrong?
> >>
> >> On Mon, Dec 10, 2012 at 8:21 PM, Thomas Jungblut
> >> <th...@gmail.com> wrote:
> >> > Yes, because changing the blocksize to 32m will just use 300mb of
> memory,
> >> > so you can add more machines to fit the number of resulting tasks.
> >> >
> >> > If each node have small memory, there's no way to process in memory
> >> >
> >> >
> >> > Yes, so spilling on disk is the easiest solution to save memory. Not
> >> > changing the partitioning.
> >> > If you want to split again through the block boundaries to distribute
> the
> >> > data through the cluster, then do it, but this is plainly wrong.
> >> >
> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
> >> >
> >> >> > A Hama cluster is scalable. It means that the computing capacity
> >> >> >> should be increased by adding slaves. Right?
> >> >> >
> >> >> >
> >> >> > I'm sorry, but I don't see how this relates to the vertex input
> >> reader.
> >> >>
> >> >> Not related with input reader. It related with partitioning and load
> >> >> balancing. As I reported to you before, to process vertices within
> >> >> 256MB block, each TaskRunner requied 25~30GB memory.
> >> >>
> >> >> If each node have small memory, there's no way to process in memory
> >> >> without changing block size of HDFS.
> >> >>
> >> >> Do you think this is scalable?
> >> >>
> >> >> On Mon, Dec 10, 2012 at 7:59 PM, Thomas Jungblut
> >> >> <th...@gmail.com> wrote:
> >> >> > Oh okay, so if you want to remove that, have a lot of fun. This
> >> reader is
> >> >> > needed, so people can create vertices from their own fileformat.
> >> >> > Going back to a sequencefile input will not only break backward
> >> >> > compatibility but also make the same issues we had before.
> >> >> >
> >> >> > A Hama cluster is scalable. It means that the computing capacity
> >> >> >> should be increased by adding slaves. Right?
> >> >> >
> >> >> >
> >> >> > I'm sorry, but I don't see how this relates to the vertex input
> >> reader.
> >> >> >
> >> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
> >> >> >
> >> >> >> A Hama cluster is scalable. It means that the computing capacity
> >> >> >> should be increased by adding slaves. Right?
> >> >> >>
> >> >> >> As I mentioned before, disk-queue and storing vertices on local
> disk
> >> >> >> are not urgent.
> >> >> >>
> >> >> >> In short, yeah, I wan to remove VertexInputReader and runtime
> >> >> >> partition in Graph package.
> >> >> >>
> >> >> >> See also,
> >> >> >>
> >> >>
> >>
> https://issues.apache.org/jira/browse/HAMA-531?focusedCommentId=13527756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13527756
> >> >> >>
> >> >> >> On Mon, Dec 10, 2012 at 7:31 PM, Thomas Jungblut
> >> >> >> <th...@gmail.com> wrote:
> >> >> >> > uhm, I have no idea what you want to archieve, do you want to
> get
> >> >> back to
> >> >> >> > client-side partitioning?
> >> >> >> >
> >> >> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
> >> >> >> >
> >> >> >> >> If there's no opinion, I'll remove VertexInputReader in
> >> >> >> >> GraphJobRunner, because it make code complex. Let's consider
> again
> >> >> >> >> about the VertexInputReader, after fixing HAMA-531 and HAMA-632
> >> >> >> >> issues.
> >> >> >> >>
> >> >> >> >> On Fri, Dec 7, 2012 at 9:35 AM, Edward J. Yoon <
> >> >> edwardyoon@apache.org>
> >> >> >> >> wrote:
> >> >> >> >> > Or, I'd like to get rid of VertexInputReader.
> >> >> >> >> >
> >> >> >> >> > On Fri, Dec 7, 2012 at 9:30 AM, Edward J. Yoon <
> >> >> edwardyoon@apache.org
> >> >> >> >
> >> >> >> >> wrote:
> >> >> >> >> >> In fact, there's no choice but to use runtimePartitioning
> >> >> (because of
> >> >> >> >> >> VertexInputReader). Right? If so, I would like to delete all
> >> "if
> >> >> >> >> >> (runtimePartitioning) {" conditions.
> >> >> >> >> >>
> >> >> >> >> >> --
> >> >> >> >> >> Best Regards, Edward J. Yoon
> >> >> >> >> >> @eddieyoon
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > --
> >> >> >> >> > Best Regards, Edward J. Yoon
> >> >> >> >> > @eddieyoon
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> --
> >> >> >> >> Best Regards, Edward J. Yoon
> >> >> >> >> @eddieyoon
> >> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Best Regards, Edward J. Yoon
> >> >> >> @eddieyoon
> >> >> >>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Best Regards, Edward J. Yoon
> >> >> @eddieyoon
> >> >>
> >>
> >>
> >>
> >> --
> >> Best Regards, Edward J. Yoon
> >> @eddieyoon
> >>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>

Re: runtimePartitioning in GraphJobRunner

Posted by "Edward J. Yoon" <ed...@apache.org>.

> Just wanted to remind you why we introduced runtime partitioning.

Sorry that I could not review your patch of HAMA-531 and many things
of Hama 0.5 release. I was busy.

On Mon, Dec 10, 2012 at 8:47 PM, Thomas Jungblut
<th...@gmail.com> wrote:
> Just wanted to remind you why we introduced runtime partitioning.
>
> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>
>> HDFS is common. It's not tunable for only Hama BSP computing.
>>
>> > Yes, so spilling on disk is the easiest solution to save memory. Not
>> > changing the partitioning.
>> > If you want to split again through the block boundaries to distribute the
>> > data through the cluster, then do it, but this is plainly wrong.
>>
>> Vertex load balancing is basically uses Hash partitioner. You can't
>> avoid data transfers.
>>
>> Again...,
>>
>> VertexInputReader and runtime partitioning make code complex as I
>> mentioned above.
>>
>> > This reader is needed, so people can create vertices from their own
>> fileformat.
>>
>> I don't think so. Instead of VertexInputReader, we can provide <K
>> extends WritableComparable, V extends ArrayWritable>.
>>
>> Let's assume that there's a web table in Google's BigTable (HBase).
>> User can create their own WebTableInputFormatter to read records as a
>> <Text url, TextArrayWritable anchors>. Am I wrong?
>>
>> On Mon, Dec 10, 2012 at 8:21 PM, Thomas Jungblut
>> <th...@gmail.com> wrote:
>> > Yes, because changing the blocksize to 32m will just use 300mb of memory,
>> > so you can add more machines to fit the number of resulting tasks.
>> >
>> > If each node have small memory, there's no way to process in memory
>> >
>> >
>> > Yes, so spilling on disk is the easiest solution to save memory. Not
>> > changing the partitioning.
>> > If you want to split again through the block boundaries to distribute the
>> > data through the cluster, then do it, but this is plainly wrong.
>> >
>> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >
>> >> > A Hama cluster is scalable. It means that the computing capacity
>> >> >> should be increased by adding slaves. Right?
>> >> >
>> >> >
>> >> > I'm sorry, but I don't see how this relates to the vertex input
>> reader.
>> >>
>> >> Not related with input reader. It related with partitioning and load
>> >> balancing. As I reported to you before, to process vertices within
>> >> 256MB block, each TaskRunner requied 25~30GB memory.
>> >>
>> >> If each node have small memory, there's no way to process in memory
>> >> without changing block size of HDFS.
>> >>
>> >> Do you think this is scalable?
>> >>
>> >> On Mon, Dec 10, 2012 at 7:59 PM, Thomas Jungblut
>> >> <th...@gmail.com> wrote:
>> >> > Oh okay, so if you want to remove that, have a lot of fun. This
>> reader is
>> >> > needed, so people can create vertices from their own fileformat.
>> >> > Going back to a sequencefile input will not only break backward
>> >> > compatibility but also make the same issues we had before.
>> >> >
>> >> > A Hama cluster is scalable. It means that the computing capacity
>> >> >> should be increased by adding slaves. Right?
>> >> >
>> >> >
>> >> > I'm sorry, but I don't see how this relates to the vertex input
>> reader.
>> >> >
>> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >> >
>> >> >> A Hama cluster is scalable. It means that the computing capacity
>> >> >> should be increased by adding slaves. Right?
>> >> >>
>> >> >> As I mentioned before, disk-queue and storing vertices on local disk
>> >> >> are not urgent.
>> >> >>
>> >> >> In short, yeah, I wan to remove VertexInputReader and runtime
>> >> >> partition in Graph package.
>> >> >>
>> >> >> See also,
>> >> >>
>> >>
>> https://issues.apache.org/jira/browse/HAMA-531?focusedCommentId=13527756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13527756
>> >> >>
>> >> >> On Mon, Dec 10, 2012 at 7:31 PM, Thomas Jungblut
>> >> >> <th...@gmail.com> wrote:
>> >> >> > uhm, I have no idea what you want to archieve, do you want to get
>> >> back to
>> >> >> > client-side partitioning?
>> >> >> >
>> >> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >> >> >
>> >> >> >> If there's no opinion, I'll remove VertexInputReader in
>> >> >> >> GraphJobRunner, because it make code complex. Let's consider again
>> >> >> >> about the VertexInputReader, after fixing HAMA-531 and HAMA-632
>> >> >> >> issues.
>> >> >> >>
>> >> >> >> On Fri, Dec 7, 2012 at 9:35 AM, Edward J. Yoon <
>> >> edwardyoon@apache.org>
>> >> >> >> wrote:
>> >> >> >> > Or, I'd like to get rid of VertexInputReader.
>> >> >> >> >
>> >> >> >> > On Fri, Dec 7, 2012 at 9:30 AM, Edward J. Yoon <
>> >> edwardyoon@apache.org
>> >> >> >
>> >> >> >> wrote:
>> >> >> >> >> In fact, there's no choice but to use runtimePartitioning
>> >> (because of
>> >> >> >> >> VertexInputReader). Right? If so, I would like to delete all
>> "if
>> >> >> >> >> (runtimePartitioning) {" conditions.
>> >> >> >> >>
>> >> >> >> >> --
>> >> >> >> >> Best Regards, Edward J. Yoon
>> >> >> >> >> @eddieyoon
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > --
>> >> >> >> > Best Regards, Edward J. Yoon
>> >> >> >> > @eddieyoon
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> --
>> >> >> >> Best Regards, Edward J. Yoon
>> >> >> >> @eddieyoon
>> >> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Best Regards, Edward J. Yoon
>> >> >> @eddieyoon
>> >> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Best Regards, Edward J. Yoon
>> >> @eddieyoon
>> >>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: runtimePartitioning in GraphJobRunner

Posted by "Edward J. Yoon" <ed...@apache.org>.

If we provide an random input data generators for our examples, newbie
will be able to easily test/evalute Hama cluster. I'm sure this will
give a good first impression to users.

On Mon, Dec 10, 2012 at 8:47 PM, Thomas Jungblut
<th...@gmail.com> wrote:
> Just wanted to remind you why we introduced runtime partitioning.
>
> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>
>> HDFS is common. It's not tunable for only Hama BSP computing.
>>
>> > Yes, so spilling on disk is the easiest solution to save memory. Not
>> > changing the partitioning.
>> > If you want to split again through the block boundaries to distribute the
>> > data through the cluster, then do it, but this is plainly wrong.
>>
>> Vertex load balancing is basically uses Hash partitioner. You can't
>> avoid data transfers.
>>
>> Again...,
>>
>> VertexInputReader and runtime partitioning make code complex as I
>> mentioned above.
>>
>> > This reader is needed, so people can create vertices from their own
>> fileformat.
>>
>> I don't think so. Instead of VertexInputReader, we can provide <K
>> extends WritableComparable, V extends ArrayWritable>.
>>
>> Let's assume that there's a web table in Google's BigTable (HBase).
>> User can create their own WebTableInputFormatter to read records as a
>> <Text url, TextArrayWritable anchors>. Am I wrong?
>>
>> On Mon, Dec 10, 2012 at 8:21 PM, Thomas Jungblut
>> <th...@gmail.com> wrote:
>> > Yes, because changing the blocksize to 32m will just use 300mb of memory,
>> > so you can add more machines to fit the number of resulting tasks.
>> >
>> > If each node have small memory, there's no way to process in memory
>> >
>> >
>> > Yes, so spilling on disk is the easiest solution to save memory. Not
>> > changing the partitioning.
>> > If you want to split again through the block boundaries to distribute the
>> > data through the cluster, then do it, but this is plainly wrong.
>> >
>> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >
>> >> > A Hama cluster is scalable. It means that the computing capacity
>> >> >> should be increased by adding slaves. Right?
>> >> >
>> >> >
>> >> > I'm sorry, but I don't see how this relates to the vertex input
>> reader.
>> >>
>> >> Not related with input reader. It related with partitioning and load
>> >> balancing. As I reported to you before, to process vertices within
>> >> 256MB block, each TaskRunner requied 25~30GB memory.
>> >>
>> >> If each node have small memory, there's no way to process in memory
>> >> without changing block size of HDFS.
>> >>
>> >> Do you think this is scalable?
>> >>
>> >> On Mon, Dec 10, 2012 at 7:59 PM, Thomas Jungblut
>> >> <th...@gmail.com> wrote:
>> >> > Oh okay, so if you want to remove that, have a lot of fun. This
>> reader is
>> >> > needed, so people can create vertices from their own fileformat.
>> >> > Going back to a sequencefile input will not only break backward
>> >> > compatibility but also make the same issues we had before.
>> >> >
>> >> > A Hama cluster is scalable. It means that the computing capacity
>> >> >> should be increased by adding slaves. Right?
>> >> >
>> >> >
>> >> > I'm sorry, but I don't see how this relates to the vertex input
>> reader.
>> >> >
>> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >> >
>> >> >> A Hama cluster is scalable. It means that the computing capacity
>> >> >> should be increased by adding slaves. Right?
>> >> >>
>> >> >> As I mentioned before, disk-queue and storing vertices on local disk
>> >> >> are not urgent.
>> >> >>
>> >> >> In short, yeah, I wan to remove VertexInputReader and runtime
>> >> >> partition in Graph package.
>> >> >>
>> >> >> See also,
>> >> >>
>> >>
>> https://issues.apache.org/jira/browse/HAMA-531?focusedCommentId=13527756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13527756
>> >> >>
>> >> >> On Mon, Dec 10, 2012 at 7:31 PM, Thomas Jungblut
>> >> >> <th...@gmail.com> wrote:
>> >> >> > uhm, I have no idea what you want to archieve, do you want to get
>> >> back to
>> >> >> > client-side partitioning?
>> >> >> >
>> >> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >> >> >
>> >> >> >> If there's no opinion, I'll remove VertexInputReader in
>> >> >> >> GraphJobRunner, because it make code complex. Let's consider again
>> >> >> >> about the VertexInputReader, after fixing HAMA-531 and HAMA-632
>> >> >> >> issues.
>> >> >> >>
>> >> >> >> On Fri, Dec 7, 2012 at 9:35 AM, Edward J. Yoon <
>> >> edwardyoon@apache.org>
>> >> >> >> wrote:
>> >> >> >> > Or, I'd like to get rid of VertexInputReader.
>> >> >> >> >
>> >> >> >> > On Fri, Dec 7, 2012 at 9:30 AM, Edward J. Yoon <
>> >> edwardyoon@apache.org
>> >> >> >
>> >> >> >> wrote:
>> >> >> >> >> In fact, there's no choice but to use runtimePartitioning
>> >> (because of
>> >> >> >> >> VertexInputReader). Right? If so, I would like to delete all
>> "if
>> >> >> >> >> (runtimePartitioning) {" conditions.
>> >> >> >> >>
>> >> >> >> >> --
>> >> >> >> >> Best Regards, Edward J. Yoon
>> >> >> >> >> @eddieyoon
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > --
>> >> >> >> > Best Regards, Edward J. Yoon
>> >> >> >> > @eddieyoon
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> --
>> >> >> >> Best Regards, Edward J. Yoon
>> >> >> >> @eddieyoon
>> >> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Best Regards, Edward J. Yoon
>> >> >> @eddieyoon
>> >> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Best Regards, Edward J. Yoon
>> >> @eddieyoon
>> >>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: runtimePartitioning in GraphJobRunner

Posted by Thomas Jungblut <th...@gmail.com>.

Just wanted to remind you why we introduced runtime partitioning.

2012/12/10 Edward J. Yoon <ed...@apache.org>

> HDFS is common. It's not tunable for only Hama BSP computing.
>
> > Yes, so spilling on disk is the easiest solution to save memory. Not
> > changing the partitioning.
> > If you want to split again through the block boundaries to distribute the
> > data through the cluster, then do it, but this is plainly wrong.
>
> Vertex load balancing is basically uses Hash partitioner. You can't
> avoid data transfers.
>
> Again...,
>
> VertexInputReader and runtime partitioning make code complex as I
> mentioned above.
>
> > This reader is needed, so people can create vertices from their own
> fileformat.
>
> I don't think so. Instead of VertexInputReader, we can provide <K
> extends WritableComparable, V extends ArrayWritable>.
>
> Let's assume that there's a web table in Google's BigTable (HBase).
> User can create their own WebTableInputFormatter to read records as a
> <Text url, TextArrayWritable anchors>. Am I wrong?
>
> On Mon, Dec 10, 2012 at 8:21 PM, Thomas Jungblut
> <th...@gmail.com> wrote:
> > Yes, because changing the blocksize to 32m will just use 300mb of memory,
> > so you can add more machines to fit the number of resulting tasks.
> >
> > If each node have small memory, there's no way to process in memory
> >
> >
> > Yes, so spilling on disk is the easiest solution to save memory. Not
> > changing the partitioning.
> > If you want to split again through the block boundaries to distribute the
> > data through the cluster, then do it, but this is plainly wrong.
> >
> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
> >
> >> > A Hama cluster is scalable. It means that the computing capacity
> >> >> should be increased by adding slaves. Right?
> >> >
> >> >
> >> > I'm sorry, but I don't see how this relates to the vertex input
> reader.
> >>
> >> Not related with input reader. It related with partitioning and load
> >> balancing. As I reported to you before, to process vertices within
> >> 256MB block, each TaskRunner requied 25~30GB memory.
> >>
> >> If each node have small memory, there's no way to process in memory
> >> without changing block size of HDFS.
> >>
> >> Do you think this is scalable?
> >>
> >> On Mon, Dec 10, 2012 at 7:59 PM, Thomas Jungblut
> >> <th...@gmail.com> wrote:
> >> > Oh okay, so if you want to remove that, have a lot of fun. This
> reader is
> >> > needed, so people can create vertices from their own fileformat.
> >> > Going back to a sequencefile input will not only break backward
> >> > compatibility but also make the same issues we had before.
> >> >
> >> > A Hama cluster is scalable. It means that the computing capacity
> >> >> should be increased by adding slaves. Right?
> >> >
> >> >
> >> > I'm sorry, but I don't see how this relates to the vertex input
> reader.
> >> >
> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
> >> >
> >> >> A Hama cluster is scalable. It means that the computing capacity
> >> >> should be increased by adding slaves. Right?
> >> >>
> >> >> As I mentioned before, disk-queue and storing vertices on local disk
> >> >> are not urgent.
> >> >>
> >> >> In short, yeah, I wan to remove VertexInputReader and runtime
> >> >> partition in Graph package.
> >> >>
> >> >> See also,
> >> >>
> >>
> https://issues.apache.org/jira/browse/HAMA-531?focusedCommentId=13527756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13527756
> >> >>
> >> >> On Mon, Dec 10, 2012 at 7:31 PM, Thomas Jungblut
> >> >> <th...@gmail.com> wrote:
> >> >> > uhm, I have no idea what you want to archieve, do you want to get
> >> back to
> >> >> > client-side partitioning?
> >> >> >
> >> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
> >> >> >
> >> >> >> If there's no opinion, I'll remove VertexInputReader in
> >> >> >> GraphJobRunner, because it make code complex. Let's consider again
> >> >> >> about the VertexInputReader, after fixing HAMA-531 and HAMA-632
> >> >> >> issues.
> >> >> >>
> >> >> >> On Fri, Dec 7, 2012 at 9:35 AM, Edward J. Yoon <
> >> edwardyoon@apache.org>
> >> >> >> wrote:
> >> >> >> > Or, I'd like to get rid of VertexInputReader.
> >> >> >> >
> >> >> >> > On Fri, Dec 7, 2012 at 9:30 AM, Edward J. Yoon <
> >> edwardyoon@apache.org
> >> >> >
> >> >> >> wrote:
> >> >> >> >> In fact, there's no choice but to use runtimePartitioning
> >> (because of
> >> >> >> >> VertexInputReader). Right? If so, I would like to delete all
> "if
> >> >> >> >> (runtimePartitioning) {" conditions.
> >> >> >> >>
> >> >> >> >> --
> >> >> >> >> Best Regards, Edward J. Yoon
> >> >> >> >> @eddieyoon
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > --
> >> >> >> > Best Regards, Edward J. Yoon
> >> >> >> > @eddieyoon
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Best Regards, Edward J. Yoon
> >> >> >> @eddieyoon
> >> >> >>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Best Regards, Edward J. Yoon
> >> >> @eddieyoon
> >> >>
> >>
> >>
> >>
> >> --
> >> Best Regards, Edward J. Yoon
> >> @eddieyoon
> >>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>

Re: runtimePartitioning in GraphJobRunner

Posted by "Edward J. Yoon" <ed...@apache.org>.

HDFS is common. It's not tunable for only Hama BSP computing.

> Yes, so spilling on disk is the easiest solution to save memory. Not
> changing the partitioning.
> If you want to split again through the block boundaries to distribute the
> data through the cluster, then do it, but this is plainly wrong.

Vertex load balancing is basically uses Hash partitioner. You can't
avoid data transfers.

Again...,

VertexInputReader and runtime partitioning make code complex as I
mentioned above.

> This reader is needed, so people can create vertices from their own fileformat.

I don't think so. Instead of VertexInputReader, we can provide <K
extends WritableComparable, V extends ArrayWritable>.

Let's assume that there's a web table in Google's BigTable (HBase).
User can create their own WebTableInputFormatter to read records as a
<Text url, TextArrayWritable anchors>. Am I wrong?

On Mon, Dec 10, 2012 at 8:21 PM, Thomas Jungblut
<th...@gmail.com> wrote:
> Yes, because changing the blocksize to 32m will just use 300mb of memory,
> so you can add more machines to fit the number of resulting tasks.
>
> If each node have small memory, there's no way to process in memory
>
>
> Yes, so spilling on disk is the easiest solution to save memory. Not
> changing the partitioning.
> If you want to split again through the block boundaries to distribute the
> data through the cluster, then do it, but this is plainly wrong.
>
> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>
>> > A Hama cluster is scalable. It means that the computing capacity
>> >> should be increased by adding slaves. Right?
>> >
>> >
>> > I'm sorry, but I don't see how this relates to the vertex input reader.
>>
>> Not related with input reader. It related with partitioning and load
>> balancing. As I reported to you before, to process vertices within
>> 256MB block, each TaskRunner requied 25~30GB memory.
>>
>> If each node have small memory, there's no way to process in memory
>> without changing block size of HDFS.
>>
>> Do you think this is scalable?
>>
>> On Mon, Dec 10, 2012 at 7:59 PM, Thomas Jungblut
>> <th...@gmail.com> wrote:
>> > Oh okay, so if you want to remove that, have a lot of fun. This reader is
>> > needed, so people can create vertices from their own fileformat.
>> > Going back to a sequencefile input will not only break backward
>> > compatibility but also make the same issues we had before.
>> >
>> > A Hama cluster is scalable. It means that the computing capacity
>> >> should be increased by adding slaves. Right?
>> >
>> >
>> > I'm sorry, but I don't see how this relates to the vertex input reader.
>> >
>> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >
>> >> A Hama cluster is scalable. It means that the computing capacity
>> >> should be increased by adding slaves. Right?
>> >>
>> >> As I mentioned before, disk-queue and storing vertices on local disk
>> >> are not urgent.
>> >>
>> >> In short, yeah, I wan to remove VertexInputReader and runtime
>> >> partition in Graph package.
>> >>
>> >> See also,
>> >>
>> https://issues.apache.org/jira/browse/HAMA-531?focusedCommentId=13527756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13527756
>> >>
>> >> On Mon, Dec 10, 2012 at 7:31 PM, Thomas Jungblut
>> >> <th...@gmail.com> wrote:
>> >> > uhm, I have no idea what you want to archieve, do you want to get
>> back to
>> >> > client-side partitioning?
>> >> >
>> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >> >
>> >> >> If there's no opinion, I'll remove VertexInputReader in
>> >> >> GraphJobRunner, because it make code complex. Let's consider again
>> >> >> about the VertexInputReader, after fixing HAMA-531 and HAMA-632
>> >> >> issues.
>> >> >>
>> >> >> On Fri, Dec 7, 2012 at 9:35 AM, Edward J. Yoon <
>> edwardyoon@apache.org>
>> >> >> wrote:
>> >> >> > Or, I'd like to get rid of VertexInputReader.
>> >> >> >
>> >> >> > On Fri, Dec 7, 2012 at 9:30 AM, Edward J. Yoon <
>> edwardyoon@apache.org
>> >> >
>> >> >> wrote:
>> >> >> >> In fact, there's no choice but to use runtimePartitioning
>> (because of
>> >> >> >> VertexInputReader). Right? If so, I would like to delete all "if
>> >> >> >> (runtimePartitioning) {" conditions.
>> >> >> >>
>> >> >> >> --
>> >> >> >> Best Regards, Edward J. Yoon
>> >> >> >> @eddieyoon
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Best Regards, Edward J. Yoon
>> >> >> > @eddieyoon
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Best Regards, Edward J. Yoon
>> >> >> @eddieyoon
>> >> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Best Regards, Edward J. Yoon
>> >> @eddieyoon
>> >>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: runtimePartitioning in GraphJobRunner

Posted by Thomas Jungblut <th...@gmail.com>.

Yes, because changing the blocksize to 32m will just use 300mb of memory,
so you can add more machines to fit the number of resulting tasks.

If each node have small memory, there's no way to process in memory


Yes, so spilling on disk is the easiest solution to save memory. Not
changing the partitioning.
If you want to split again through the block boundaries to distribute the
data through the cluster, then do it, but this is plainly wrong.

2012/12/10 Edward J. Yoon <ed...@apache.org>

> > A Hama cluster is scalable. It means that the computing capacity
> >> should be increased by adding slaves. Right?
> >
> >
> > I'm sorry, but I don't see how this relates to the vertex input reader.
>
> Not related with input reader. It related with partitioning and load
> balancing. As I reported to you before, to process vertices within
> 256MB block, each TaskRunner requied 25~30GB memory.
>
> If each node have small memory, there's no way to process in memory
> without changing block size of HDFS.
>
> Do you think this is scalable?
>
> On Mon, Dec 10, 2012 at 7:59 PM, Thomas Jungblut
> <th...@gmail.com> wrote:
> > Oh okay, so if you want to remove that, have a lot of fun. This reader is
> > needed, so people can create vertices from their own fileformat.
> > Going back to a sequencefile input will not only break backward
> > compatibility but also make the same issues we had before.
> >
> > A Hama cluster is scalable. It means that the computing capacity
> >> should be increased by adding slaves. Right?
> >
> >
> > I'm sorry, but I don't see how this relates to the vertex input reader.
> >
> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
> >
> >> A Hama cluster is scalable. It means that the computing capacity
> >> should be increased by adding slaves. Right?
> >>
> >> As I mentioned before, disk-queue and storing vertices on local disk
> >> are not urgent.
> >>
> >> In short, yeah, I wan to remove VertexInputReader and runtime
> >> partition in Graph package.
> >>
> >> See also,
> >>
> https://issues.apache.org/jira/browse/HAMA-531?focusedCommentId=13527756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13527756
> >>
> >> On Mon, Dec 10, 2012 at 7:31 PM, Thomas Jungblut
> >> <th...@gmail.com> wrote:
> >> > uhm, I have no idea what you want to archieve, do you want to get
> back to
> >> > client-side partitioning?
> >> >
> >> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
> >> >
> >> >> If there's no opinion, I'll remove VertexInputReader in
> >> >> GraphJobRunner, because it make code complex. Let's consider again
> >> >> about the VertexInputReader, after fixing HAMA-531 and HAMA-632
> >> >> issues.
> >> >>
> >> >> On Fri, Dec 7, 2012 at 9:35 AM, Edward J. Yoon <
> edwardyoon@apache.org>
> >> >> wrote:
> >> >> > Or, I'd like to get rid of VertexInputReader.
> >> >> >
> >> >> > On Fri, Dec 7, 2012 at 9:30 AM, Edward J. Yoon <
> edwardyoon@apache.org
> >> >
> >> >> wrote:
> >> >> >> In fact, there's no choice but to use runtimePartitioning
> (because of
> >> >> >> VertexInputReader). Right? If so, I would like to delete all "if
> >> >> >> (runtimePartitioning) {" conditions.
> >> >> >>
> >> >> >> --
> >> >> >> Best Regards, Edward J. Yoon
> >> >> >> @eddieyoon
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Best Regards, Edward J. Yoon
> >> >> > @eddieyoon
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Best Regards, Edward J. Yoon
> >> >> @eddieyoon
> >> >>
> >>
> >>
> >>
> >> --
> >> Best Regards, Edward J. Yoon
> >> @eddieyoon
> >>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>

Re: runtimePartitioning in GraphJobRunner

Posted by "Edward J. Yoon" <ed...@apache.org>.

> A Hama cluster is scalable. It means that the computing capacity
>> should be increased by adding slaves. Right?
>
>
> I'm sorry, but I don't see how this relates to the vertex input reader.

Not related with input reader. It related with partitioning and load
balancing. As I reported to you before, to process vertices within
256MB block, each TaskRunner requied 25~30GB memory.

If each node have small memory, there's no way to process in memory
without changing block size of HDFS.

Do you think this is scalable?

On Mon, Dec 10, 2012 at 7:59 PM, Thomas Jungblut
<th...@gmail.com> wrote:
> Oh okay, so if you want to remove that, have a lot of fun. This reader is
> needed, so people can create vertices from their own fileformat.
> Going back to a sequencefile input will not only break backward
> compatibility but also make the same issues we had before.
>
> A Hama cluster is scalable. It means that the computing capacity
>> should be increased by adding slaves. Right?
>
>
> I'm sorry, but I don't see how this relates to the vertex input reader.
>
> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>
>> A Hama cluster is scalable. It means that the computing capacity
>> should be increased by adding slaves. Right?
>>
>> As I mentioned before, disk-queue and storing vertices on local disk
>> are not urgent.
>>
>> In short, yeah, I wan to remove VertexInputReader and runtime
>> partition in Graph package.
>>
>> See also,
>> https://issues.apache.org/jira/browse/HAMA-531?focusedCommentId=13527756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13527756
>>
>> On Mon, Dec 10, 2012 at 7:31 PM, Thomas Jungblut
>> <th...@gmail.com> wrote:
>> > uhm, I have no idea what you want to archieve, do you want to get back to
>> > client-side partitioning?
>> >
>> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
>> >
>> >> If there's no opinion, I'll remove VertexInputReader in
>> >> GraphJobRunner, because it make code complex. Let's consider again
>> >> about the VertexInputReader, after fixing HAMA-531 and HAMA-632
>> >> issues.
>> >>
>> >> On Fri, Dec 7, 2012 at 9:35 AM, Edward J. Yoon <ed...@apache.org>
>> >> wrote:
>> >> > Or, I'd like to get rid of VertexInputReader.
>> >> >
>> >> > On Fri, Dec 7, 2012 at 9:30 AM, Edward J. Yoon <edwardyoon@apache.org
>> >
>> >> wrote:
>> >> >> In fact, there's no choice but to use runtimePartitioning (because of
>> >> >> VertexInputReader). Right? If so, I would like to delete all "if
>> >> >> (runtimePartitioning) {" conditions.
>> >> >>
>> >> >> --
>> >> >> Best Regards, Edward J. Yoon
>> >> >> @eddieyoon
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Best Regards, Edward J. Yoon
>> >> > @eddieyoon
>> >>
>> >>
>> >>
>> >> --
>> >> Best Regards, Edward J. Yoon
>> >> @eddieyoon
>> >>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: runtimePartitioning in GraphJobRunner

Posted by Thomas Jungblut <th...@gmail.com>.

Oh okay, so if you want to remove that, have a lot of fun. This reader is
needed, so people can create vertices from their own fileformat.
Going back to a sequencefile input will not only break backward
compatibility but also make the same issues we had before.

A Hama cluster is scalable. It means that the computing capacity
> should be increased by adding slaves. Right?


I'm sorry, but I don't see how this relates to the vertex input reader.

2012/12/10 Edward J. Yoon <ed...@apache.org>

> A Hama cluster is scalable. It means that the computing capacity
> should be increased by adding slaves. Right?
>
> As I mentioned before, disk-queue and storing vertices on local disk
> are not urgent.
>
> In short, yeah, I wan to remove VertexInputReader and runtime
> partition in Graph package.
>
> See also,
> https://issues.apache.org/jira/browse/HAMA-531?focusedCommentId=13527756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13527756
>
> On Mon, Dec 10, 2012 at 7:31 PM, Thomas Jungblut
> <th...@gmail.com> wrote:
> > uhm, I have no idea what you want to archieve, do you want to get back to
> > client-side partitioning?
> >
> > 2012/12/10 Edward J. Yoon <ed...@apache.org>
> >
> >> If there's no opinion, I'll remove VertexInputReader in
> >> GraphJobRunner, because it make code complex. Let's consider again
> >> about the VertexInputReader, after fixing HAMA-531 and HAMA-632
> >> issues.
> >>
> >> On Fri, Dec 7, 2012 at 9:35 AM, Edward J. Yoon <ed...@apache.org>
> >> wrote:
> >> > Or, I'd like to get rid of VertexInputReader.
> >> >
> >> > On Fri, Dec 7, 2012 at 9:30 AM, Edward J. Yoon <edwardyoon@apache.org
> >
> >> wrote:
> >> >> In fact, there's no choice but to use runtimePartitioning (because of
> >> >> VertexInputReader). Right? If so, I would like to delete all "if
> >> >> (runtimePartitioning) {" conditions.
> >> >>
> >> >> --
> >> >> Best Regards, Edward J. Yoon
> >> >> @eddieyoon
> >> >
> >> >
> >> >
> >> > --
> >> > Best Regards, Edward J. Yoon
> >> > @eddieyoon
> >>
> >>
> >>
> >> --
> >> Best Regards, Edward J. Yoon
> >> @eddieyoon
> >>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>

Re: runtimePartitioning in GraphJobRunner

Posted by "Edward J. Yoon" <ed...@apache.org>.

A Hama cluster is scalable. It means that the computing capacity
should be increased by adding slaves. Right?

As I mentioned before, disk-queue and storing vertices on local disk
are not urgent.

In short, yeah, I wan to remove VertexInputReader and runtime
partition in Graph package.

See also, https://issues.apache.org/jira/browse/HAMA-531?focusedCommentId=13527756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13527756

On Mon, Dec 10, 2012 at 7:31 PM, Thomas Jungblut
<th...@gmail.com> wrote:
> uhm, I have no idea what you want to archieve, do you want to get back to
> client-side partitioning?
>
> 2012/12/10 Edward J. Yoon <ed...@apache.org>
>
>> If there's no opinion, I'll remove VertexInputReader in
>> GraphJobRunner, because it make code complex. Let's consider again
>> about the VertexInputReader, after fixing HAMA-531 and HAMA-632
>> issues.
>>
>> On Fri, Dec 7, 2012 at 9:35 AM, Edward J. Yoon <ed...@apache.org>
>> wrote:
>> > Or, I'd like to get rid of VertexInputReader.
>> >
>> > On Fri, Dec 7, 2012 at 9:30 AM, Edward J. Yoon <ed...@apache.org>
>> wrote:
>> >> In fact, there's no choice but to use runtimePartitioning (because of
>> >> VertexInputReader). Right? If so, I would like to delete all "if
>> >> (runtimePartitioning) {" conditions.
>> >>
>> >> --
>> >> Best Regards, Edward J. Yoon
>> >> @eddieyoon
>> >
>> >
>> >
>> > --
>> > Best Regards, Edward J. Yoon
>> > @eddieyoon
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: runtimePartitioning in GraphJobRunner

Posted by Thomas Jungblut <th...@gmail.com>.

uhm, I have no idea what you want to archieve, do you want to get back to
client-side partitioning?

2012/12/10 Edward J. Yoon <ed...@apache.org>

> If there's no opinion, I'll remove VertexInputReader in
> GraphJobRunner, because it make code complex. Let's consider again
> about the VertexInputReader, after fixing HAMA-531 and HAMA-632
> issues.
>
> On Fri, Dec 7, 2012 at 9:35 AM, Edward J. Yoon <ed...@apache.org>
> wrote:
> > Or, I'd like to get rid of VertexInputReader.
> >
> > On Fri, Dec 7, 2012 at 9:30 AM, Edward J. Yoon <ed...@apache.org>
> wrote:
> >> In fact, there's no choice but to use runtimePartitioning (because of
> >> VertexInputReader). Right? If so, I would like to delete all "if
> >> (runtimePartitioning) {" conditions.
> >>
> >> --
> >> Best Regards, Edward J. Yoon
> >> @eddieyoon
> >
> >
> >
> > --
> > Best Regards, Edward J. Yoon
> > @eddieyoon
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>

Re: runtimePartitioning in GraphJobRunner

Posted by "Edward J. Yoon" <ed...@apache.org>.

If there's no opinion, I'll remove VertexInputReader in
GraphJobRunner, because it make code complex. Let's consider again
about the VertexInputReader, after fixing HAMA-531 and HAMA-632
issues.

On Fri, Dec 7, 2012 at 9:35 AM, Edward J. Yoon <ed...@apache.org> wrote:
> Or, I'd like to get rid of VertexInputReader.
>
> On Fri, Dec 7, 2012 at 9:30 AM, Edward J. Yoon <ed...@apache.org> wrote:
>> In fact, there's no choice but to use runtimePartitioning (because of
>> VertexInputReader). Right? If so, I would like to delete all "if
>> (runtimePartitioning) {" conditions.
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: runtimePartitioning in GraphJobRunner

Posted by "Edward J. Yoon" <ed...@apache.org>.

Or, I'd like to get rid of VertexInputReader.

On Fri, Dec 7, 2012 at 9:30 AM, Edward J. Yoon <ed...@apache.org> wrote:
> In fact, there's no choice but to use runtimePartitioning (because of
> VertexInputReader). Right? If so, I would like to delete all "if
> (runtimePartitioning) {" conditions.
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon



-- 
Best Regards, Edward J. Yoon
@eddieyoon