Posted to user@hama.apache.org by Behroz Sikander <be...@gmail.com> on 2015/05/25 18:14:09 UTC

Hama partition 1000 files on 3 tasks/machine

Hi,
I have a problem regarding data partitioning but was not able to find any
solution online.

Problem: I have around 1000 files that I want to process using Hama. Each
file has the same schema/structure but different data. How can I divide
these files across my cluster? I mean, if I have 3 tasks/machines, then each
task should process around 333 files.

So,
1- How can I take a thousand files as input in Hama? With my current
understanding, Hama will open 1000 tasks (one task for each file).
2- How can I divide the files across different machines (a custom
Partitioner, maybe)?
3- If this approach is not supported, then what would be an alternative
approach to solving this?

Regards,
Behroz Sikander

Re: Hama partition 1000 files on 3 tasks/machine

Posted by Behroz Sikander <be...@gmail.com>.
Thank you for the input.

As you mentioned, I have accessed the files directly and logically divided
them across the tasks in my code. I am still working on it, but I am
confident that it will work.
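
For reference, here is a minimal sketch of the idea (the input directory
and class names are only placeholders for my real code):

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

public class FileSplitBSP extends
    BSP<NullWritable, NullWritable, NullWritable, NullWritable, NullWritable> {

  @Override
  public void bsp(BSPPeer<NullWritable, NullWritable, NullWritable,
      NullWritable, NullWritable> peer)
      throws IOException, SyncException, InterruptedException {
    FileSystem fs = FileSystem.get(peer.getConfiguration());
    // Every task sees the same directory listing.
    FileStatus[] files = fs.listStatus(new Path("/data/input")); // placeholder
    // Deterministic split: task i takes files i, i+N, i+2N, ... so the
    // tasks never have to coordinate over who owns which file.
    for (int i = peer.getPeerIndex(); i < files.length; i += peer.getNumPeers()) {
      processFile(fs, files[i].getPath());
    }
  }

  private void processFile(FileSystem fs, Path file) throws IOException {
    // open with fs.open(file) and parse the known schema here
  }
}

With 3 peers (for example via job.setNumBspTask(3) when the job has no
input format configured), each task ends up with roughly 333 of the
1000 files.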

Thanks.


RE: Hama partition 1000 files on 3 tasks/machine

Posted by "Edward J. Yoon" <ed...@samsung.com>.
Yeah, that's also a good alternative. Users can directly access external
resources (such as HDFS, NoSQL stores, and RDBMSs) and partition the data
using the messaging APIs.
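
For example, a rough sketch (the readExternalRecords() reader here is
hypothetical; the peer calls are the ordinary BSPPeer API, with Text
being org.apache.hadoop.io.Text):

// One peer reads the external source and scatters the records to all
// peers round-robin over the messaging API; after sync(), every peer
// consumes the share that was routed to it.
public void bsp(BSPPeer<NullWritable, NullWritable, NullWritable,
    NullWritable, Text> peer)
    throws IOException, SyncException, InterruptedException {
  if (peer.getPeerIndex() == 0) {
    int n = peer.getNumPeers();
    int i = 0;
    for (String record : readExternalRecords()) { // hypothetical reader
      peer.send(peer.getPeerName(i++ % n), new Text(record));
    }
  }
  peer.sync(); // barrier; delivers all queued messages
  Text msg;
  while ((msg = peer.getCurrentMessage()) != null) {
    // process the records assigned to this peer
  }
}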

However, I think we need to provide a solution at the framework level.

--
Best Regards, Edward J. Yoon



Re: Hama partition 1000 files on 3 tasks/machine

Posted by Chia-Hung Lin <cl...@googlemail.com>.
An alternative thought:

In addition to the (key/value) interface provided by Hama, each
process (within the bsp function) should be able to read data from an
external source with a Reader-related class; the processes may need to
use something like ZooKeeper for coordination.
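
Roughly, inside the bsp function each process could read its claimed
files straight from HDFS, e.g. (untested fragment; fs is a FileSystem
obtained from the peer's configuration, file a Path this process owns):

// Read one HDFS file line by line.
BufferedReader reader =
    new BufferedReader(new InputStreamReader(fs.open(file)));
try {
  String line;
  while ((line = reader.readLine()) != null) {
    // parse one record of the shared schema
  }
} finally {
  reader.close();
}

If the file-to-process assignment is deterministic (for example by peer
index), the ZooKeeper coordination can even be skipped.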

FYI

RE: Hama partition 1000 files on 3 tasks/machine

Posted by "Edward J. Yoon" <ed...@samsung.com>.
Hi,

Currently, the task capacity of the cluster should be larger than the number
of blocks or files in the input dataset. The alternative is to merge them
into one file using the hadoop fs -getmerge command.
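
For example (the paths are only illustrative):

hadoop fs -getmerge /user/you/input /tmp/merged.txt
hadoop fs -put /tmp/merged.txt /user/you/merged/

getmerge concatenates the files under /user/you/input into one local
file, and put uploads the merged file back to HDFS.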

--
Best Regards, Edward J. Yoon
