Posted to user@hama.apache.org by Leonidas Fegaras <fe...@cse.uta.edu> on 2014/10/20 15:51:35 UTC

Re: Question about FileInputFormat splits

Dear Hama developers,
I still have a problem setting the split size of an HDFS input file 
using Hama 0.6.4.  For example, when I use:

BSPJob job = new BSPJob(conf,BSPop.class);
job.setNumBspTask(10);
job.setLong("bsp.min.split.size",10000L);   // 10000 bytes

For a small file with 2 blocks, this will use only 2 BSP tasks (one for 
each block), instead of 10.
This used to work in Hama 0.5.0.
Any suggestions?
Thanks.
Leonidas Fegaras

On 01/04/2013 05:45 PM, Edward J. Yoon wrote:
> Hello,
>
>> than a block. But if you have more nodes in your cluster than data blocks,
>> you may get faster execution if you allow splits smaller than a block. Is
> You're right. So, we're working on partitioning issues now.
>
>> you may get faster execution if you allow splits smaller than a block. Is
>> there any way to use splits smaller than a block in Hama 0.6.0?
> Yes. But, Hama 0.6.1 version will support it.
>
> On Sat, Jan 5, 2013 at 4:59 AM, Leonidas Fegaras <fe...@cse.uta.edu> wrote:
>> Dear Hama developers,
>> It seems that the splits generated by the FileInputFormat in Hama 0.6.0
>> cannot be smaller than a block. In Hama 0.5.0, I could set any split size
>> using  job.set("bsp.min.split.size",...) and set the task numbers using
>> job.setNumBspTask(...). This is ignored by Hama 0.6.0 for a split smaller
>> than a block. But if you have more nodes in your cluster than data blocks,
>> you may get faster execution if you allow splits smaller than a block. Is
>> there any way to use splits smaller than a block in Hama 0.6.0?
>> Thanks for your help,
>> Leonidas
>>
>
>

Re: Question about FileInputFormat splits

Posted by "Edward J. Yoon" <ed...@apache.org>.

Hello,

Here's a similar unit test program:
http://svn.apache.org/repos/asf/hama/trunk/core/src/test/java/org/apache/hama/bsp/TestPartitioning.java

For one or more text input files, you can set the number of BSP tasks
as desired if you use the input partitioner.

I'm not sure why your test doesn't work. I'll check it in a distributed
environment.


On Thu, Oct 23, 2014 at 2:05 AM, Leonidas Fegaras <fe...@cse.uta.edu> wrote:
> Hi Edward,
> I am testing my programs with:
> job.setPartitioner(org.apache.hama.bsp.HashPartitioner.class);
> The splitter works fine for Hadoop sequence files but it gets errors for
> text files.
> From the messages below, it seems that the splitter didn't produce a
> part-00001 file.
> Then the BSPJobClient.readSplitFile method gets 4 splits but the split IDs
> are 0, 2, 3, and 4.
> Is this a Hama bug or is my InputFormat wrong? (It works fine without
> setPartitioner.)
> Thanks.
> Leonidas
>
> 14/10/22 09:17:59 INFO bsp.FileInputFormat: Total input paths to process : 1
> 14/10/22 09:17:59 INFO bsp.FileInputFormat: Total input paths to process : 1
> 14/10/22 09:17:59 INFO bsp.FileInputFormat: Total input paths to process : 1
> 14/10/22 09:18:00 INFO bsp.BSPJobClient: Running job: job_201410220850_0006
> 14/10/22 09:18:03 INFO bsp.BSPJobClient: Current supersteps number: 0
> 14/10/22 09:18:09 INFO bsp.BSPJobClient: Current supersteps number: 2
> 14/10/22 09:18:12 INFO bsp.BSPJobClient: The total number of supersteps: 2
> 14/10/22 09:18:12 INFO bsp.BSPJobClient: Counters: 6
> 14/10/22 09:18:12 INFO bsp.BSPJobClient: org.apache.hama.bsp.JobInProgress$JobCounter
> 14/10/22 09:18:12 INFO bsp.BSPJobClient:     SUPERSTEPS=2
> 14/10/22 09:18:12 INFO bsp.BSPJobClient:     LAUNCHED_TASKS=1
> 14/10/22 09:18:12 INFO bsp.BSPJobClient: org.apache.hama.bsp.BSPPeerImpl$PeerCounter
> 14/10/22 09:18:12 INFO bsp.BSPJobClient:     SUPERSTEP_SUM=2
> 14/10/22 09:18:12 INFO bsp.BSPJobClient:     TIME_IN_SYNC_MS=117
> 14/10/22 09:18:12 INFO bsp.BSPJobClient:     IO_BYTES_READ=511839
> 14/10/22 09:18:12 INFO bsp.BSPJobClient:     TASK_INPUT_RECORDS=12373
> 14/10/22 09:18:12 INFO bsp.FileInputFormat: Total input paths to process : 4
> java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 4
>     at org.apache.hama.bsp.BSPJobClient.readSplitFile(BSPJobClient.java:611)
>     at org.apache.hama.bsp.JobInProgress.initTasks(JobInProgress.java:261)
>     at org.apache.hama.bsp.QueueManager.initJob(QueueManager.java:44)
>     at org.apache.hama.bsp.SimpleTaskScheduler$JobListener.jobAdded(SimpleTaskScheduler.java:117)
>     at org.apache.hama.bsp.BSPMaster.addJob(BSPMaster.java:753)
>     at org.apache.hama.bsp.BSPMaster.submitJob(BSPMaster.java:614)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:601)
>     at org.apache.hama.ipc.RPC$Server.call(RPC.java:613)
>     at org.apache.hama.ipc.Server$Handler$1.run(Server.java:1211)
>     at org.apache.hama.ipc.Server$Handler$1.run(Server.java:1207)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
>     at org.apache.hama.ipc.Server$Handler.run(Server.java:1206)
>
>> hadoop fs -ls /tmp/hama-parts/job_201410220850_0005
> Found 4 items
> -rw-r--r--   3 hadoop supergroup     240516 2014-10-22 09:18 /tmp/hama-parts/job_201410220850_0005/part-00000
> -rw-r--r--   3 hadoop supergroup     242699 2014-10-22 09:18 /tmp/hama-parts/job_201410220850_0005/part-00002
> -rw-r--r--   3 hadoop supergroup       5710 2014-10-22 09:18 /tmp/hama-parts/job_201410220850_0005/part-00003
> -rw-r--r--   3 hadoop supergroup     247892 2014-10-22 09:18 /tmp/hama-parts/job_201410220850_0005/part-00004
>



-- 
Best Regards, Edward J. Yoon
CEO at DataSayer Co., Ltd.

Re: Question about FileInputFormat splits

Posted by Leonidas Fegaras <fe...@cse.uta.edu>.

Hi Edward,
I am testing my programs with:
job.setPartitioner(org.apache.hama.bsp.HashPartitioner.class);
The splitter works fine for Hadoop sequence files but it gets errors for
text files.
From the messages below, it seems that the splitter didn't produce a
part-00001 file.
Then the BSPJobClient.readSplitFile method gets 4 splits but the split
IDs are 0, 2, 3, and 4.
Is this a Hama bug or is my InputFormat wrong? (It works fine without
setPartitioner.)
Thanks.
Leonidas

14/10/22 09:17:59 INFO bsp.FileInputFormat: Total input paths to process : 1
14/10/22 09:17:59 INFO bsp.FileInputFormat: Total input paths to process : 1
14/10/22 09:17:59 INFO bsp.FileInputFormat: Total input paths to process : 1
14/10/22 09:18:00 INFO bsp.BSPJobClient: Running job: job_201410220850_0006
14/10/22 09:18:03 INFO bsp.BSPJobClient: Current supersteps number: 0
14/10/22 09:18:09 INFO bsp.BSPJobClient: Current supersteps number: 2
14/10/22 09:18:12 INFO bsp.BSPJobClient: The total number of supersteps: 2
14/10/22 09:18:12 INFO bsp.BSPJobClient: Counters: 6
14/10/22 09:18:12 INFO bsp.BSPJobClient: org.apache.hama.bsp.JobInProgress$JobCounter
14/10/22 09:18:12 INFO bsp.BSPJobClient:     SUPERSTEPS=2
14/10/22 09:18:12 INFO bsp.BSPJobClient:     LAUNCHED_TASKS=1
14/10/22 09:18:12 INFO bsp.BSPJobClient: org.apache.hama.bsp.BSPPeerImpl$PeerCounter
14/10/22 09:18:12 INFO bsp.BSPJobClient:     SUPERSTEP_SUM=2
14/10/22 09:18:12 INFO bsp.BSPJobClient:     TIME_IN_SYNC_MS=117
14/10/22 09:18:12 INFO bsp.BSPJobClient:     IO_BYTES_READ=511839
14/10/22 09:18:12 INFO bsp.BSPJobClient:     TASK_INPUT_RECORDS=12373
14/10/22 09:18:12 INFO bsp.FileInputFormat: Total input paths to process : 4
java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 4
     at org.apache.hama.bsp.BSPJobClient.readSplitFile(BSPJobClient.java:611)
     at org.apache.hama.bsp.JobInProgress.initTasks(JobInProgress.java:261)
     at org.apache.hama.bsp.QueueManager.initJob(QueueManager.java:44)
     at org.apache.hama.bsp.SimpleTaskScheduler$JobListener.jobAdded(SimpleTaskScheduler.java:117)
     at org.apache.hama.bsp.BSPMaster.addJob(BSPMaster.java:753)
     at org.apache.hama.bsp.BSPMaster.submitJob(BSPMaster.java:614)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     at java.lang.reflect.Method.invoke(Method.java:601)
     at org.apache.hama.ipc.RPC$Server.call(RPC.java:613)
     at org.apache.hama.ipc.Server$Handler$1.run(Server.java:1211)
     at org.apache.hama.ipc.Server$Handler$1.run(Server.java:1207)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:415)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
     at org.apache.hama.ipc.Server$Handler.run(Server.java:1206)

 > hadoop fs -ls /tmp/hama-parts/job_201410220850_0005
Found 4 items
-rw-r--r--   3 hadoop supergroup     240516 2014-10-22 09:18 /tmp/hama-parts/job_201410220850_0005/part-00000
-rw-r--r--   3 hadoop supergroup     242699 2014-10-22 09:18 /tmp/hama-parts/job_201410220850_0005/part-00002
-rw-r--r--   3 hadoop supergroup       5710 2014-10-22 09:18 /tmp/hama-parts/job_201410220850_0005/part-00003
-rw-r--r--   3 hadoop supergroup     247892 2014-10-22 09:18 /tmp/hama-parts/job_201410220850_0005/part-00004
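
One possible explanation for the gap in the listing above, which nothing in this thread confirms, is that hash partitioning left one partition without any records, so no file was written for it. The self-contained Java sketch below illustrates only that general effect. The class EmptyPartitionDemo, its methods, and the sample keys are hypothetical, and the modulo formula is a common hash-partitioning scheme, not a copy of Hama's HashPartitioner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration; not Hama source code. */
final class EmptyPartitionDemo {

    /** A common hash-partitioning formula (an assumption, not taken from Hama). */
    static int partitionFor(Object key, int numTasks) {
        // Mask the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numTasks;
    }

    /** Returns the partition IDs in [0, numTasks) that receive no key at all. */
    static List<Integer> emptyPartitions(List<?> keys, int numTasks) {
        boolean[] used = new boolean[numTasks];
        for (Object key : keys) {
            used[partitionFor(key, numTasks)] = true;
        }
        List<Integer> empty = new ArrayList<>();
        for (int id = 0; id < numTasks; id++) {
            if (!used[id]) {
                empty.add(id);
            }
        }
        return empty;
    }

    public static void main(String[] args) {
        // Integer.hashCode() is the value itself, so with 5 tasks these
        // invented keys land only in partitions 0, 2, 3, and 4.
        List<Integer> keys = Arrays.asList(0, 2, 3, 4, 5, 7, 8, 9);
        System.out.println(emptyPartitions(keys, 5)); // prints [1]
    }
}
```

If a reader then sized an array by the number of files found (4) but indexed it by partition ID (up to 4), an out-of-bounds access at index 4 would follow. Whether BSPJobClient.readSplitFile actually does this is not established here.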





On 10/20/2014 04:59 PM, Edward J. Yoon wrote:
> Hi, does it work as you expected? I thought bsp.input.runtime.partitioning
> should be true. :0
>
> --
> Best Regards, Edward J. Yoon
> Chief Executive Officer
> DataSayer Co., Ltd.

Re: Question about FileInputFormat splits

Posted by "Edward J. Yoon" <ed...@datasayer.com>.

Hi, does it work as you expected? I thought bsp.input.runtime.partitioning should be true. :0

--
Best Regards, Edward J. Yoon
Chief Executive Officer
DataSayer Co., Ltd.

> On Oct 21, 2014, at 6:31 AM, Leonidas Fegaras <fe...@cse.uta.edu> wrote:
> 
> Hi Edward,
> OK. It works now. I used the following in hama-site.xml:
> 
>  <property>
>    <name>bsp.input.runtime.partitioning</name>
>    <value>false</value>
>  </property>
> 
> and re-started bspd. The correct code for the Job is:
> 
> job.setNumBspTask(10);
> job.setPartitioner(org.apache.hama.bsp.HashPartitioner.class);
> 
> Maybe you should explain this in the Hama Wiki.
> Thanks.
> Leonidas

Re: Question about FileInputFormat splits

Posted by Leonidas Fegaras <fe...@cse.uta.edu>.

Hi Edward,
OK. It works now. I used the following in hama-site.xml:

   <property>
     <name>bsp.input.runtime.partitioning</name>
     <value>false</value>
   </property>

and re-started bspd. The correct code for the Job is:

job.setNumBspTask(10);
job.setPartitioner(org.apache.hama.bsp.HashPartitioner.class);

Maybe you should explain this in the Hama Wiki.
Thanks.
Leonidas

On 10/20/2014 02:19 PM, Leonidas Fegaras wrote:
> Hi Edward,
> Thank you for the reply.
> But I want the opposite: I want to create more tasks than blocks, not
> fewer tasks than blocks.
> That is, I want to be able to send less than one block to each task (for
> example, only 10000 bytes). Sending less data to a task will speed up
> execution and will require less memory at each node. Hadoop map-reduce,
> Spark, and Flink allow you to use a split size smaller than a block.
> Also, I used to be able to do this with Hama 0.5.0 but not with Hama
> 0.6.4. Did you remove this capability because it is a bad idea or
> because it is very hard to implement?
>
> Based on your instructions, I tried the following:
>
>       job.setNumBspTask(10);
>       job.setBoolean("bsp.input.runtime.partitioning", false);
>       job.setPartitioner(org.apache.hama.bsp.HashPartitioner.class);
>
> I get the following error:
>
> java.lang.ArrayIndexOutOfBoundsException: 1
>       at org.apache.hama.bsp.BSPJobClient.writeSplits(BSPJobClient.java:556)
>       at
> org.apache.hama.bsp.BSPJobClient.submitJobInternal(BSPJobClient.java:354)
>       at org.apache.hama.bsp.BSPJobClient.submitJob(BSPJobClient.java:296)
>       at org.apache.hama.bsp.BSPJob.submit(BSPJob.java:219)
>       at org.apache.hama.bsp.BSPJob.waitForCompletion(BSPJob.java:226)
>
> Thanks.
> Leonidas

Re: Question about FileInputFormat splits

Posted by Leonidas Fegaras <fe...@cse.uta.edu>.

Hi Edward,
Thank you for the reply.
But I want the opposite: I want to create more tasks than blocks, not 
fewer tasks than blocks.
That is, I want to be able to send less than one block to each task (for 
example, only 10000 bytes). Sending less data to a task will speed up
execution and will require less memory at each node. Hadoop map-reduce, 
Spark, and Flink allow you to use a split size smaller than a block. 
Also, I used to be able to do this with Hama 0.5.0 but not with Hama 
0.6.4. Did you remove this capability because it is a bad idea or 
because it is very hard to implement?
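
To make the request concrete, the sketch below shows how a file could be cut into byte ranges smaller than a block. It is plain Java written for illustration: the class SubBlockRanges and its ranges method are hypothetical and are not part of Hama, and the 25000-byte file length is an invented example value. A real input format would also have to align each range to record boundaries, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of sub-block splitting; not Hama source code. */
final class SubBlockRanges {

    /**
     * Cuts [0, fileLength) into consecutive ranges of at most splitSize bytes.
     * Each entry is a two-element array: {offset, length}.
     */
    static List<long[]> ranges(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> out = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            // The last range may be shorter than splitSize.
            long length = Math.min(splitSize, fileLength - offset);
            out.add(new long[] { offset, length });
        }
        return out;
    }

    public static void main(String[] args) {
        // Invented example: a 25000-byte file with 10000-byte splits.
        for (long[] r : ranges(25000L, 10000L)) {
            System.out.println("offset=" + r[0] + " length=" + r[1]);
        }
    }
}
```

With these example values the file is covered by three ranges, the last one holding the remaining 5000 bytes, so three tasks could each read less than a block.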

Based on your instructions, I tried the following:

     job.setNumBspTask(10);
     job.setBoolean("bsp.input.runtime.partitioning", false);
     job.setPartitioner(org.apache.hama.bsp.HashPartitioner.class);

I get the following error:

java.lang.ArrayIndexOutOfBoundsException: 1
     at org.apache.hama.bsp.BSPJobClient.writeSplits(BSPJobClient.java:556)
     at 
org.apache.hama.bsp.BSPJobClient.submitJobInternal(BSPJobClient.java:354)
     at org.apache.hama.bsp.BSPJobClient.submitJob(BSPJobClient.java:296)
     at org.apache.hama.bsp.BSPJob.submit(BSPJob.java:219)
     at org.apache.hama.bsp.BSPJob.waitForCompletion(BSPJob.java:226)

Thanks.
Leonidas


On 10/20/2014 10:06 AM, Edward J. Yoon wrote:
> Hi Leonidas,
>
> The bsp.min.split.size property is used to prevent the creation of too
> many tasks, as in Hadoop MR (NOTE: if bsp.min.split.size is less than
> the block size, then one block is sent to each task).
>
> I guess this will work fine. BTW, if you set the input partitioner,
> then the input partitioner creates as many new partitions as you
> specified with the setNumBspTask() method (a graph job pre-processes
> its input with the hash partitioner by default).
>
> Thanks.
>
> --
> Best Regards, Edward J. Yoon
> Chief Executive Officer
> DataSayer Co., Ltd.

Re: Question about FileInputFormat splits

Posted by "Edward J. Yoon" <ed...@datasayer.com>.

Hi Leonidas,

The bsp.min.split.size property is used to prevent the creation of too many tasks, as in Hadoop MR (NOTE: if bsp.min.split.size is less than the block size, then one block is sent to each task).

I guess this will work fine. BTW, if you set the input partitioner, then the input partitioner creates as many new partitions as you specified with the setNumBspTask() method (a graph job pre-processes its input with the hash partitioner by default).
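
The lower-bound behaviour described in the note above can be pictured with a minimal Java sketch. It is a simplified model, not Hama's actual FileInputFormat code: the class SplitSizeModel and both methods are hypothetical, the formula (effective split size is the larger of the minimum split size and the block size) is only an assumption drawn from that note, and the 64 MB block size is an example value.

```java
/** Simplified, hypothetical model of the split sizing described above; not Hama source code. */
final class SplitSizeModel {

    /** Assumption: a split is never smaller than one block, so the minimum can only raise the size. */
    static long effectiveSplitSize(long minSplitSize, long blockSize) {
        return Math.max(minSplitSize, blockSize);
    }

    /** Number of splits (and so tasks) needed to cover the file, rounding up. */
    static long numSplits(long fileLength, long minSplitSize, long blockSize) {
        long splitSize = effectiveSplitSize(minSplitSize, blockSize);
        return (fileLength + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // assumed 64 MB block size (example value)
        long fileLength = 2 * blockSize;    // a file that spans exactly 2 blocks

        // A 10000-byte minimum is below the block size, so it changes nothing.
        System.out.println(numSplits(fileLength, 10000L, blockSize)); // prints 2

        // A minimum of 2 blocks merges both blocks into a single split.
        System.out.println(numSplits(fileLength, 2 * blockSize, blockSize)); // prints 1
    }
}
```

Under this model the property can reduce the number of tasks but cannot increase it, which matches the 2 tasks reported for the 2-block file.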

Thanks.

--
Best Regards, Edward J. Yoon
Chief Executive Officer
DataSayer Co., Ltd.

> On Oct 20, 2014, at 10:51 PM, Leonidas Fegaras <fe...@cse.uta.edu> wrote:
> 
> Dear Hama developers,
> I still have a problem setting the split size of an HDFS input file using Hama 0.6.4.  For example, when I use:
> 
> BSPJob job = new BSPJob(conf,BSPop.class);
> job.setNumBspTask(10);
> job.setLong("bsp.min.split.size",10000L);   // 10000 bytes
> 
> For a small file with 2 blocks, this will use only 2 BSP tasks (one for each block), instead of 10.
> This used to work in Hama 0.5.0.
> Any suggestions?
> Thanks.
> Leonidas Fegaras