Posted to user@hama.apache.org by Leonidas Fegaras <fe...@cse.uta.edu> on 2014/10/20 15:51:35 UTC
Re: Question about FileInputFormat splits
Dear Hama developers,
I still have a problem setting the split size of an HDFS input file
using Hama 0.6.4. For example, when I use:
BSPJob job = new BSPJob(conf,BSPop.class);
job.setNumBspTask(10);
job.setLong("bsp.min.split.size",10000L); // 10000 bytes
For a small file with 2 blocks, this will use only 2 BSP tasks (one for
each block), instead of 10.
This used to work in Hama 0.5.0.
Any suggestions?
Thanks.
Leonidas Fegaras
On 01/04/2013 05:45 PM, Edward J. Yoon wrote:
> Hello,
>
>> than a block. But if you have more nodes in your cluster than data blocks,
>> you may get faster execution if you allow splits smaller than a block. Is
> You're right. So, we're working on partitioning issues now.
>
>> you may get faster execution if you allow splits smaller than a block. Is
>> there any way to use splits smaller than a block in Hama 0.6.0?
> Yes. But, Hama 0.6.1 version will support it.
>
> On Sat, Jan 5, 2013 at 4:59 AM, Leonidas Fegaras <fe...@cse.uta.edu> wrote:
>> Dear Hama developers,
>> It seems that the splits generated by the FileInputFormat in Hama 0.6.0
>> cannot be smaller than a block. In Hama 0.5.0, I could set any split size
>> using job.set("bsp.min.split.size",...) and set the task numbers using
>> job.setNumBspTask(...). This is ignored by Hama 0.6.0 for a split smaller
>> than a block. But if you have more nodes in your cluster than data blocks,
>> you may get faster execution if you allow splits smaller than a block. Is
>> there any way to use splits smaller than a block in Hama 0.6.0?
>> Thanks for your help,
>> Leonidas
>>
>
>
Re: Question about FileInputFormat splits
Posted by "Edward J. Yoon" <ed...@apache.org>.
Hello,
Here's a similar unit test program:
http://svn.apache.org/repos/asf/hama/trunk/core/src/test/java/org/apache/hama/bsp/TestPartitioning.java
For one or more text input files, you can set the number of
BSP tasks as you desire if you use the input partitioner.
I'm not sure why your test doesn't work. I'll check it in a distributed
environment.
On Thu, Oct 23, 2014 at 2:05 AM, Leonidas Fegaras <fe...@cse.uta.edu> wrote:
> Hi Edward,
> I am testing my programs with:
> job.setPartitioner(org.apache.hama.bsp.HashPartitioner.class);
> The splitter works fine for Hadoop sequence files, but it fails with errors
> for text files.
> From the messages below, it seems that the splitter didn't produce a
> split-00001 file.
> Then the BSPJobClient.readSplitFile method gets 4 splits, but the split IDs
> are 0, 2, 3, and 4.
> Is this a Hama bug or is my InputFormat wrong? (it works fine without
> setPartitioner)
> Thanks.
> Leonidas
>
> 14/10/22 09:17:59 INFO bsp.FileInputFormat: Total input paths to process : 1
> 14/10/22 09:17:59 INFO bsp.FileInputFormat: Total input paths to process : 1
> 14/10/22 09:17:59 INFO bsp.FileInputFormat: Total input paths to process : 1
> 14/10/22 09:18:00 INFO bsp.BSPJobClient: Running job: job_201410220850_0006
> 14/10/22 09:18:03 INFO bsp.BSPJobClient: Current supersteps number: 0
> 14/10/22 09:18:09 INFO bsp.BSPJobClient: Current supersteps number: 2
> 14/10/22 09:18:12 INFO bsp.BSPJobClient: The total number of supersteps: 2
> 14/10/22 09:18:12 INFO bsp.BSPJobClient: Counters: 6
> 14/10/22 09:18:12 INFO bsp.BSPJobClient: org.apache.hama.bsp.JobInProgress$JobCounter
> 14/10/22 09:18:12 INFO bsp.BSPJobClient: SUPERSTEPS=2
> 14/10/22 09:18:12 INFO bsp.BSPJobClient: LAUNCHED_TASKS=1
> 14/10/22 09:18:12 INFO bsp.BSPJobClient: org.apache.hama.bsp.BSPPeerImpl$PeerCounter
> 14/10/22 09:18:12 INFO bsp.BSPJobClient: SUPERSTEP_SUM=2
> 14/10/22 09:18:12 INFO bsp.BSPJobClient: TIME_IN_SYNC_MS=117
> 14/10/22 09:18:12 INFO bsp.BSPJobClient: IO_BYTES_READ=511839
> 14/10/22 09:18:12 INFO bsp.BSPJobClient: TASK_INPUT_RECORDS=12373
> 14/10/22 09:18:12 INFO bsp.FileInputFormat: Total input paths to process : 4
> java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 4
>     at org.apache.hama.bsp.BSPJobClient.readSplitFile(BSPJobClient.java:611)
>     at org.apache.hama.bsp.JobInProgress.initTasks(JobInProgress.java:261)
>     at org.apache.hama.bsp.QueueManager.initJob(QueueManager.java:44)
>     at org.apache.hama.bsp.SimpleTaskScheduler$JobListener.jobAdded(SimpleTaskScheduler.java:117)
>     at org.apache.hama.bsp.BSPMaster.addJob(BSPMaster.java:753)
>     at org.apache.hama.bsp.BSPMaster.submitJob(BSPMaster.java:614)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:601)
>     at org.apache.hama.ipc.RPC$Server.call(RPC.java:613)
>     at org.apache.hama.ipc.Server$Handler$1.run(Server.java:1211)
>     at org.apache.hama.ipc.Server$Handler$1.run(Server.java:1207)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
>     at org.apache.hama.ipc.Server$Handler.run(Server.java:1206)
>
>> hadoop fs -ls /tmp/hama-parts/job_201410220850_0005
> Found 4 items
> -rw-r--r-- 3 hadoop supergroup 240516 2014-10-22 09:18 /tmp/hama-parts/job_201410220850_0005/part-00000
> -rw-r--r-- 3 hadoop supergroup 242699 2014-10-22 09:18 /tmp/hama-parts/job_201410220850_0005/part-00002
> -rw-r--r-- 3 hadoop supergroup 5710 2014-10-22 09:18 /tmp/hama-parts/job_201410220850_0005/part-00003
> -rw-r--r-- 3 hadoop supergroup 247892 2014-10-22 09:18 /tmp/hama-parts/job_201410220850_0005/part-00004
>
--
Best Regards, Edward J. Yoon
CEO at DataSayer Co., Ltd.
Re: Question about FileInputFormat splits
Posted by Leonidas Fegaras <fe...@cse.uta.edu>.
Hi Edward,
I am testing my programs with:
job.setPartitioner(org.apache.hama.bsp.HashPartitioner.class);
The splitter works fine for Hadoop sequence files, but it fails with errors
for text files.
From the messages below, it seems that the splitter didn't produce a
split-00001 file.
Then the BSPJobClient.readSplitFile method gets 4 splits, but the split
IDs are 0, 2, 3, and 4.
Is this a Hama bug or is my InputFormat wrong? (it works fine without
setPartitioner)
Thanks.
Leonidas
14/10/22 09:17:59 INFO bsp.FileInputFormat: Total input paths to process : 1
14/10/22 09:17:59 INFO bsp.FileInputFormat: Total input paths to process : 1
14/10/22 09:17:59 INFO bsp.FileInputFormat: Total input paths to process : 1
14/10/22 09:18:00 INFO bsp.BSPJobClient: Running job: job_201410220850_0006
14/10/22 09:18:03 INFO bsp.BSPJobClient: Current supersteps number: 0
14/10/22 09:18:09 INFO bsp.BSPJobClient: Current supersteps number: 2
14/10/22 09:18:12 INFO bsp.BSPJobClient: The total number of supersteps: 2
14/10/22 09:18:12 INFO bsp.BSPJobClient: Counters: 6
14/10/22 09:18:12 INFO bsp.BSPJobClient: org.apache.hama.bsp.JobInProgress$JobCounter
14/10/22 09:18:12 INFO bsp.BSPJobClient: SUPERSTEPS=2
14/10/22 09:18:12 INFO bsp.BSPJobClient: LAUNCHED_TASKS=1
14/10/22 09:18:12 INFO bsp.BSPJobClient: org.apache.hama.bsp.BSPPeerImpl$PeerCounter
14/10/22 09:18:12 INFO bsp.BSPJobClient: SUPERSTEP_SUM=2
14/10/22 09:18:12 INFO bsp.BSPJobClient: TIME_IN_SYNC_MS=117
14/10/22 09:18:12 INFO bsp.BSPJobClient: IO_BYTES_READ=511839
14/10/22 09:18:12 INFO bsp.BSPJobClient: TASK_INPUT_RECORDS=12373
14/10/22 09:18:12 INFO bsp.FileInputFormat: Total input paths to process : 4
java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 4
    at org.apache.hama.bsp.BSPJobClient.readSplitFile(BSPJobClient.java:611)
    at org.apache.hama.bsp.JobInProgress.initTasks(JobInProgress.java:261)
    at org.apache.hama.bsp.QueueManager.initJob(QueueManager.java:44)
    at org.apache.hama.bsp.SimpleTaskScheduler$JobListener.jobAdded(SimpleTaskScheduler.java:117)
    at org.apache.hama.bsp.BSPMaster.addJob(BSPMaster.java:753)
    at org.apache.hama.bsp.BSPMaster.submitJob(BSPMaster.java:614)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.hama.ipc.RPC$Server.call(RPC.java:613)
    at org.apache.hama.ipc.Server$Handler$1.run(Server.java:1211)
    at org.apache.hama.ipc.Server$Handler$1.run(Server.java:1207)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hama.ipc.Server$Handler.run(Server.java:1206)
> hadoop fs -ls /tmp/hama-parts/job_201410220850_0005
Found 4 items
-rw-r--r-- 3 hadoop supergroup 240516 2014-10-22 09:18 /tmp/hama-parts/job_201410220850_0005/part-00000
-rw-r--r-- 3 hadoop supergroup 242699 2014-10-22 09:18 /tmp/hama-parts/job_201410220850_0005/part-00002
-rw-r--r-- 3 hadoop supergroup 5710 2014-10-22 09:18 /tmp/hama-parts/job_201410220850_0005/part-00003
-rw-r--r-- 3 hadoop supergroup 247892 2014-10-22 09:18 /tmp/hama-parts/job_201410220850_0005/part-00004
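One possible reading of the trace and listing above, sketched in plain Java: if the part files are placed into an array that is sized by the number of files found but indexed by each file's numeric suffix, then a gap in the numbering (no part-00001) pushes the highest ID past the end of the array. This is only a guess at the failure mode, not Hama's actual readSplitFile code; the class and method names below are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: a hypothetical model of the failure, not Hama code.
final class SplitIndexSketch {

    // Extracts the numeric suffix of a name such as "part-00004".
    static int partitionId(String fileName) {
        return Integer.parseInt(fileName.substring(fileName.lastIndexOf('-') + 1));
    }

    // Places each file into the array slot chosen by its partition ID.
    // ASSUMPTION: the array is sized by the number of files found.
    static String[] indexById(List<String> fileNames) {
        String[] slots = new String[fileNames.size()];
        for (String name : fileNames) {
            slots[partitionId(name)] = name; // throws if an ID >= number of files
        }
        return slots;
    }

    public static void main(String[] args) {
        // The four files from the listing: IDs 0, 2, 3, 4 (ID 1 is missing).
        List<String> parts = Arrays.asList(
                "part-00000", "part-00002", "part-00003", "part-00004");
        try {
            indexById(parts);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("index out of bounds: " + e.getMessage());
        }
    }
}
```

With four contiguous IDs (0 to 3) the same method completes normally, which would be consistent with the job working when every partition file is present.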
On 10/20/2014 04:59 PM, Edward J. Yoon wrote:
> Hi, does it work as you expected? I thought bsp.input.runtime.partitioning
> should be true. :0
>
> --
> Best Regards, Edward J. Yoon
> Chief Executive Officer
> DataSayer Co., Ltd.
Re: Question about FileInputFormat splits
Posted by "Edward J. Yoon" <ed...@datasayer.com>.
Hi, does it work as you expected? I thought bsp.input.runtime.partitioning should be true. :0
--
Best Regards, Edward J. Yoon
Chief Executive Officer
DataSayer Co., Ltd.
> On 2014-10-21, at 6:31 AM, Leonidas Fegaras <fe...@cse.uta.edu> wrote:
>
> Hi Edward,
> OK. It works now. I used the following in hama-site.xml:
>
> <property>
> <name>bsp.input.runtime.partitioning</name>
> <value>false</value>
> </property>
>
> and re-started bspd. The correct code for the Job is:
>
> job.setNumBspTask(10);
> job.setPartitioner(org.apache.hama.bsp.HashPartitioner.class);
>
> Maybe you should explain this in the Hama Wiki.
> Thanks.
> Leonidas
Re: Question about FileInputFormat splits
Posted by Leonidas Fegaras <fe...@cse.uta.edu>.
Hi Edward,
OK. It works now. I used the following in hama-site.xml:
<property>
<name>bsp.input.runtime.partitioning</name>
<value>false</value>
</property>
and re-started bspd. The correct code for the Job is:
job.setNumBspTask(10);
job.setPartitioner(org.apache.hama.bsp.HashPartitioner.class);
Maybe you should explain this in the Hama Wiki.
Thanks.
Leonidas
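As a rough illustration of why an input partitioner decouples the number of tasks from the number of HDFS blocks, here is a minimal Java sketch of hash partitioning. It is not Hama's HashPartitioner: the class and method names are hypothetical, and the formula (mask the sign bit, then take the remainder) is a common one assumed here for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a generic hash partitioner, not Hama's implementation.
final class HashPartitionSketch {

    // Maps a key to one of numTasks partitions; the mask keeps the result non-negative.
    static int partitionFor(Object key, int numTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numTasks;
    }

    // Distributes records over numTasks buckets, however many blocks the input occupies.
    static List<List<String>> partition(List<String> records, int numTasks) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            buckets.add(new ArrayList<String>());
        }
        for (String record : records) {
            buckets.get(partitionFor(record, numTasks)).add(record);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            records.add("record-" + i);
        }
        // Ten buckets, matching job.setNumBspTask(10) in the message above.
        List<List<String>> buckets = partition(records, 10);
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("task " + i + ": " + buckets.get(i).size() + " records");
        }
    }
}
```

Every record lands in exactly one bucket, so the bucket count is set by the requested task count alone. Nothing in this scheme guarantees that every bucket receives at least one record.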
On 10/20/2014 02:19 PM, Leonidas Fegaras wrote:
> Hi Edward,
> Thank you for the reply.
> But I want the opposite: I want to create more tasks than blocks, not
> fewer tasks than blocks.
> That is, I want to be able to send less than one block to each task (for
> example, only 10000 bytes). Sending less data to a task will speed up
> execution and will require less memory at each node. Hadoop MapReduce,
> Spark, and Flink allow you to use a split size smaller than a block.
> Also, I used to be able to do this with Hama 0.5.0 but not with Hama
> 0.6.4. Did you remove this capability because it is a bad idea or
> because it is very hard to implement?
>
> Based on your instructions, I tried the following:
>
> job.setNumBspTask(10);
> job.setBoolean("bsp.input.runtime.partitioning",false);
> job.setPartitioner(org.apache.hama.bsp.HashPartitioner.class);
>
> I get the following error:
>
> java.lang.ArrayIndexOutOfBoundsException: 1
>     at org.apache.hama.bsp.BSPJobClient.writeSplits(BSPJobClient.java:556)
>     at org.apache.hama.bsp.BSPJobClient.submitJobInternal(BSPJobClient.java:354)
>     at org.apache.hama.bsp.BSPJobClient.submitJob(BSPJobClient.java:296)
>     at org.apache.hama.bsp.BSPJob.submit(BSPJob.java:219)
>     at org.apache.hama.bsp.BSPJob.waitForCompletion(BSPJob.java:226)
>
> Thanks.
> Leonidas
Re: Question about FileInputFormat splits
Posted by Leonidas Fegaras <fe...@cse.uta.edu>.
Hi Edward,
Thank you for the reply.
But I want the opposite: I want to create more tasks than blocks, not
fewer tasks than blocks.
That is, I want to be able to send less than one block to each task (for
example, only 10000 bytes). Sending less data to a task will speed up
execution and will require less memory at each node. Hadoop MapReduce,
Spark, and Flink allow you to use a split size smaller than a block.
Also, I used to be able to do this with Hama 0.5.0 but not with Hama
0.6.4. Did you remove this capability because it is a bad idea or
because it is very hard to implement?
Based on your instructions, I tried the following:
job.setNumBspTask(10);
job.setBoolean("bsp.input.runtime.partitioning",false);
job.setPartitioner(org.apache.hama.bsp.HashPartitioner.class);
I get the following error:
java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.hama.bsp.BSPJobClient.writeSplits(BSPJobClient.java:556)
    at org.apache.hama.bsp.BSPJobClient.submitJobInternal(BSPJobClient.java:354)
    at org.apache.hama.bsp.BSPJobClient.submitJob(BSPJobClient.java:296)
    at org.apache.hama.bsp.BSPJob.submit(BSPJob.java:219)
    at org.apache.hama.bsp.BSPJob.waitForCompletion(BSPJob.java:226)
Thanks.
Leonidas
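To make the request above concrete, here is a small Java sketch of what "more tasks than blocks" means in terms of byte ranges: the file is cut into chunks smaller than a block, one chunk per task. It is an illustration under simplifying assumptions, not Hama or Hadoop code; the names are hypothetical, and a real text input format would also have to align each range to record (line) boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: sub-block byte ranges, ignoring record boundaries.
final class SubBlockSplitSketch {

    // Returns {offset, length} pairs covering [0, fileLength) in chunks of at most splitSize bytes.
    static List<long[]> byteRanges(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            ranges.add(new long[] { offset, Math.min(splitSize, fileLength - offset) });
        }
        return ranges;
    }

    // Split size needed to spread fileLength bytes over numTasks tasks (ceiling division).
    static long splitSizeFor(long fileLength, int numTasks) {
        return (fileLength + numTasks - 1) / numTasks;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // assumed 64 MB HDFS block
        long fileLength = 2 * blockSize;    // a file occupying two blocks
        long splitSize = splitSizeFor(fileLength, 10);
        for (long[] range : byteRanges(fileLength, splitSize)) {
            System.out.println("offset=" + range[0] + " length=" + range[1]);
        }
    }
}
```

Each range here is well under one block, so several tasks would read from the same block.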
On 10/20/2014 10:06 AM, Edward J. Yoon wrote:
> Hi Leonidas,
>
> The bsp.min.split.size property is used to prevent the creation of too
> many tasks, as in Hadoop MR (NOTE: if bsp.min.split.size is less than the
> block size, then one block is sent to each task).
>
> I guess this will work fine. BTW, if you set an input partitioner, it
> creates new partitions, as many as you specified with the
> setNumBspTask() method (a graph job pre-partitions its input by hash
> by default).
>
> Thanks.
>
> --
> Best Regards, Edward J. Yoon
> Chief Executive Officer
> DataSayer Co., Ltd.
Re: Question about FileInputFormat splits
Posted by "Edward J. Yoon" <ed...@datasayer.com>.
Hi Leonidas,
The bsp.min.split.size property is used to prevent the creation of too many tasks, as in Hadoop MR (NOTE: if bsp.min.split.size is less than the block size, then one block is sent to each task).
I guess this will work fine. BTW, if you set an input partitioner, it creates new partitions, as many as you specified with the setNumBspTask() method (a graph job pre-partitions its input by hash by default).
Thanks.
--
Best Regards, Edward J. Yoon
Chief Executive Officer
DataSayer Co., Ltd.
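The rule described above can be sketched in a few lines of Java: a minimum split size acts only as a floor, so a value below the block size leaves the split at one block. This models the behaviour as stated in the message, not Hama's actual FileInputFormat code; the class and method names and the 64 MB block size are assumptions for illustration.

```java
// Illustration only: models the described behaviour, not Hama's implementation.
final class MinSplitSizeSketch {

    // The minimum can only raise the split size above the block size;
    // a minimum below the block size has no effect.
    static long splitSize(long minSplitSize, long blockSize) {
        return Math.max(minSplitSize, blockSize);
    }

    // Number of splits (and hence tasks) for a file of the given length.
    static long numSplits(long fileLength, long minSplitSize, long blockSize) {
        long size = splitSize(minSplitSize, blockSize);
        return (fileLength + size - 1) / size; // ceiling division
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // assumed 64 MB HDFS block
        long fileLength = 2 * blockSize;    // a file occupying two full blocks
        // A 10000-byte minimum changes nothing: one split per block. Prints 2.
        System.out.println(numSplits(fileLength, 10_000L, blockSize));
        // A minimum of two blocks merges them into one split. Prints 1.
        System.out.println(numSplits(fileLength, 2 * blockSize, blockSize));
    }
}
```

Under this model, bsp.min.split.size can reduce the number of tasks but never increase it, which matches the two-task result reported for the two-block file.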
> On 2014-10-20, at 10:51 PM, Leonidas Fegaras <fe...@cse.uta.edu> wrote:
>
> Dear Hama developers,
> I still have a problem setting the split size of an HDFS input file using Hama 0.6.4. For example, when I use:
>
> BSPJob job = new BSPJob(conf,BSPop.class);
> job.setNumBspTask(10);
> job.setLong("bsp.min.split.size",10000L); // 10000 bytes
>
> For a small file with 2 blocks, this will use only 2 BSP tasks (one for each block), instead of 10.
> This used to work in Hama 0.5.0.
> Any suggestions?
> Thanks.
> Leonidas Fegaras