Posted to common-user@hadoop.apache.org by Erik Forsberg <fo...@opera.com> on 2010/01/22 16:22:28 UTC

Scheduling, prioritizing some jobs

Hi!

I'm currently trying to wrap my head around the different schedulers
available. Running Cloudera 0.18.3, there's both the Fair Scheduler and
the Capacity Scheduler for me to play with.

I have a number of jobs that are run nightly. Some of them have tighter
deadlines than others, so if there is a job with a tight deadline, I
would like it to get *all* mapper/reducer slots so that it finishes
as quickly as possible. Once a mapper slot is free and no longer needed
by that job, it should be made available to other jobs.

Within the jobs, there are sub-priorities. As an example, assuming we
analyse some kind of web access logs, it's more important that the daily
report gets done before the monthly report, and so on. Also, the
monthly report for the tight-deadline jobs should be prioritized over
the daily report for the non-tight jobs.

I think what I'm after is a way to prioritize jobs, but the available
number of priorities (VERY_HIGH, HIGH, et al.) is not enough.

Ideas on how to do this? I hope I've managed to explain what I want.

Cheers,
\EF
-- 
Erik Forsberg <fo...@opera.com>
Developer, Opera Software - http://www.opera.com/

Re: Scheduling, prioritizing some jobs

Posted by Erik Forsberg <fo...@opera.com>.
On Thu, 28 Jan 2010 11:12:07 -0800
Matei Zaharia <ma...@eecs.berkeley.edu> wrote:

> Hi Erik,
> 
> With four priority levels like this, you should just be able to use
> Hadoop's priorities, because it has five of them (very high, high,
> normal, low and very low). You can just use the default scheduler for
> this (i.e. don't enable either the fair or the capacity scheduler).
> Or am I missing something about your question?

You are right, given my silly example :-). In reality, I require more
than 5 different priorities. 

I think I might be able to use the CapacityScheduler, setting up one
queue (highprio) that is guaranteed 100% of the cluster, and another one
(lowprio) that is guaranteed 0% of the cluster. If I understand
correctly, that will give me the equivalent of 10 different priorities
(5 priority levels per queue). I haven't tried that yet, though.
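For reference, a small sketch of the client-side part of that idea (the queue names are made up; the queues themselves would have to be declared in mapred.queue.names and given their 100%/0% capacities in the capacity scheduler's configuration, whose exact property names I'd have to check for this version):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobPriority;

public class QueueRouting {
  // Point an already-configured job at one of the two assumed queues and
  // pick one of the five priorities within it, i.e. 2 x 5 = 10 levels.
  public static void route(JobConf conf, boolean tightDeadline, JobPriority prio) {
    conf.set("mapred.job.queue.name", tightDeadline ? "highprio" : "lowprio");
    conf.setJobPriority(prio);
  }
}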

Regards,
\EF

> > *) I have four scheduling pools: highprio-daily, highprio-monthly,
> >   lowprio-daily, lowprio-monthly.
> > 
> > *) Tasks for jobs that are put in highprio-daily always get priority
> >   before tasks in highprio-monthly. Highprio-monthly always get
> >   priority before lowprio-daily, and lowprio-daily always get
> > priority before lowprio-monthly?
> > 
> >   If there are several jobs in the same pool, run them in order of
> >   submission. A job should finish as quickly as possible, so if the
> >   currently most highly prioritized job needs all task slots, it
> >   should get them.
> > 
> > Thanks!
> > \EF
> > -- 
> > Erik Forsberg <fo...@opera.com>
> > Developer, Opera Software - http://www.opera.com/
> 


-- 
Erik Forsberg <fo...@opera.com>
Developer, Opera Software - http://www.opera.com/

Re: Scheduling, prioritizing some jobs

Posted by Matei Zaharia <ma...@eecs.berkeley.edu>.
Hi Erik,

With four priority levels like this, you should just be able to use Hadoop's priorities, because it has five of them (very high, high, normal, low and very low). You can just use the default scheduler for this (i.e. don't enable either the fair or the capacity scheduler). Or am I missing something about your question?
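For what it's worth, a quick sketch of using those built-in levels from the old JobConf API (the class, job name and paths are placeholders; a running job's priority can also be changed from the command line with something like hadoop job -set-priority <job-id> VERY_HIGH):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobPriority;

public class PrioritizedJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(PrioritizedJob.class);
    conf.setJobName("daily-report");                 // placeholder name

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // One of the five built-in levels: VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW.
    conf.setJobPriority(JobPriority.VERY_HIGH);

    JobClient.runJob(conf);
  }
}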

Matei

On Jan 25, 2010, at 11:25 PM, Erik Forsberg wrote:

> Hi!
> 
> Simplifying my question: Can I configure Hadoop so that:
> 
> *) I have four scheduling pools: highprio-daily, highprio-monthly,
>   lowprio-daily, lowprio-monthly.
> 
> *) Tasks for jobs that are put in highprio-daily always get priority
>   before tasks in highprio-monthly. Highprio-monthly always get
>   priority before lowprio-daily, and lowprio-daily always get priority
>   before lowprio-monthly?
> 
>   If there are several jobs in the same pool, run them in order of
>   submission. A job should finish as quickly as possible, so if the
>   currently most highly prioritized job needs all task slots, it
>   should get them.
> 
> Thanks!
> \EF
> -- 
> Erik Forsberg <fo...@opera.com>
> Developer, Opera Software - http://www.opera.com/


Re: Scheduling, prioritizing some jobs

Posted by Erik Forsberg <fo...@opera.com>.
Hi!

Simplifying my question: Can I configure Hadoop so that:

*) I have four scheduling pools: highprio-daily, highprio-monthly,
   lowprio-daily, lowprio-monthly.

*) Tasks for jobs that are put in highprio-daily always get priority
   before tasks in highprio-monthly. Highprio-monthly always get
   priority before lowprio-daily, and lowprio-daily always get priority
   before lowprio-monthly?

   If there are several jobs in the same pool, run them in order of
   submission. A job should finish as quickly as possible, so if the
   currently most highly prioritized job needs all task slots, it
   should get them.
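(A hedged aside on mechanics, following up on the pools above: with the fair scheduler, the pool a job lands in is taken from whatever property mapred.fairscheduler.poolnameproperty names on the JobTracker. Assuming that is set to pool.name, a job would choose its pool roughly like the sketch below; whether strict ordering between pools can be enforced is exactly the open question.)

import org.apache.hadoop.mapred.JobConf;

public class PoolAssignment {
  // Assumes the JobTracker's fair scheduler has
  // mapred.fairscheduler.poolnameproperty = pool.name; the job then goes
  // into whichever pool this property names (e.g. "highprio-daily").
  public static void assignPool(JobConf conf, String poolName) {
    conf.set("pool.name", poolName);
  }
}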

Thanks!
\EF
-- 
Erik Forsberg <fo...@opera.com>
Developer, Opera Software - http://www.opera.com/

Re: map side only behavior

Posted by Gang Luo <lg...@yahoo.com.cn>.
Thanks Jeff.

But what if the mapper's output size surpasses the capacity of the output buffer? There will be multiple spills for this mapper in that case. If there is no merging, how do we ensure there is *only one* output file?

 
-Gang



----- Original Message ----
From: Jeff Zhang <zj...@gmail.com>
To: common-user@hadoop.apache.org
Sent: 2010/1/29 (Fri) 11:05:33 AM
Subject: Re: map side only behavior

No, the merge and sort will not happen in the map task, and each map task
will generate one output file.



2010/1/29 Gang Luo <lg...@yahoo.com.cn>

> Hi all,
> If I only use map side to process my data (set # of reducers to 0 ), what
> is the behavior of hadoop? Will it merge and sort each of the spills
> generated by one mapper?
>
> -Gang
>
>
> ----- Original Message ----
> From: Gang Luo <lg...@yahoo.com.cn>
> To: common-user@hadoop.apache.org
> Sent: 2010/1/29 (Fri) 8:54:33 AM
> Subject: Re: fine granularity operation on HDFS
>
> Yeah, I see how it works. Thanks Amogh.
>
>
> -Gang
>
>
>
> ----- Original Message ----
> From: Amogh Vasekar <am...@yahoo-inc.com>
> To: "common-user@hadoop.apache.org" <co...@hadoop.apache.org>
> Sent: 2010/1/28 (Thu) 10:00:22 AM
> Subject: Re: fine granularity operation on HDFS
>
> Hi Gang,
> Yes PathFilters work only on file paths. I meant you can include such type
> of logic at split level.
> The input format's getSplits() method is responsible for computing and
> adding splits to a list container, for which JT initializes mapper tasks.
> You can override the getSplits() method to add only a few , say, based on
> the location or offset, to the list. Here's the reference :
> while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
>          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
>          splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
>                                   blkLocations[blkIndex].getHosts()));
>          bytesRemaining -= splitSize;
>        }
>
>        if (bytesRemaining != 0) {
>          splits.add(new FileSplit(path, length-bytesRemaining,
> bytesRemaining,
>                     blkLocations[blkLocations.length-1].getHosts()));
>
> Before splits.add you can use your logic for discarding. However, you need
> to ensure your record reader takes care of incomplete records at boundaries.
>
> To get the block locations to load separately, the FileSystem class APIs
> expose few methods like getBlockLocations etc ..
> Hope this helps.
>
> Amogh
>
> On 1/28/10 7:26 PM, "Gang Luo" <lg...@yahoo.com.cn> wrote:
>
> Thanks Amogh.
>
> For the second part of my question, I actually mean loading block
> separately from HDFS. I don't know whether it is realistic. Anyway, for my
> goal is to process different division of a file separately, to do that at
> split level is OK. But even I can get the splits from inputformat, how to
> "add only a few splits you need to mapper and discard the others"?
> (pathfilters only works for file, but not block, I think).
>
> Thanks.
> -Gang
>
>
>
>



-- 
Best Regards

Jeff Zhang




task-specific parameter

Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi all,
Usually we set a parameter before the job starts. I am wondering whether we can do that after the job starts, e.g. in the configure() method of a map task. Or can we even do it in the map function to dynamically tune parameters? How about the reduce side?

Besides, if I can do that in a specific task, are these modifications only visible to that task, or do they have a global impact? (Currently, I am focusing on io.sort.record.percent.)
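(For reference, the usual pattern is to set such a parameter on the client-side JobConf before submission and read it back in configure(); a minimal sketch follows, with io.sort.record.percent as the example key. As far as I understand, each task works on its own deserialized copy of the job configuration, so anything a task changes in configure() or map() stays local to that task and has no global effect; whether the task's own sort buffer would even notice a change made that late is something I would not rely on.)

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TuningMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private float recordPercent;

  public void configure(JobConf job) {
    // This JobConf is the task's own local copy. Reading is fine; calling
    // job.set(...) here would only change this copy, not the job globally.
    recordPercent = job.getFloat("io.sort.record.percent", 0.05f);
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    output.collect(value, new LongWritable(1));
  }
}

On the client, the value would normally be set before submission with something like conf.setFloat("io.sort.record.percent", 0.10f).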

Thanks.

-Gang




Re: map side only behavior

Posted by Aaron Kimball <aa...@cloudera.com>.
In a map-only job, map tasks will be connected directly to the OutputFormat.
So calling output.collect() / context.write() in the mapper will emit data
straight to files in HDFS without sorting the data. There is no sort buffer
involved. If you want exactly one output file, follow Nick's advice.
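To make that concrete, a minimal sketch of a map-only job with the old API (class, job name and paths are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapOnlyJob.class);
    conf.setJobName("map-only-example");             // placeholder name

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Zero reducers: each map task's output.collect() calls go straight
    // through the OutputFormat to HDFS, one file per map task, unsorted.
    conf.setNumReduceTasks(0);

    JobClient.runJob(conf);
  }
}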

- Aaron

On Fri, Jan 29, 2010 at 8:32 AM, Jones, Nick <ni...@amd.com> wrote:

> A single unity reducer should enforce a merge and sort to generate one
> file.
>
> Nick Jones
>
> -----Original Message-----
> From: Jeff Zhang [mailto:zjffdu@gmail.com]
> Sent: Friday, January 29, 2010 10:06 AM
> To: common-user@hadoop.apache.org
> Subject: Re: map side only behavior
>
> No, the merge and sort will not happen in mapper task. And each mapper task
> will generate one output file.
>
>
>
> 2010/1/29 Gang Luo <lg...@yahoo.com.cn>
>
> > Hi all,
> > If I only use map side to process my data (set # of reducers to 0 ), what
> > is the behavior of hadoop? Will it merge and sort each of the spills
> > generated by one mapper?
> >
> > -Gang
> >
> >
> > ----- Original Message ----
> > From: Gang Luo <lg...@yahoo.com.cn>
> > To: common-user@hadoop.apache.org
> > Sent: 2010/1/29 (Fri) 8:54:33 AM
> > Subject: Re: fine granularity operation on HDFS
> >
> > Yeah, I see how it works. Thanks Amogh.
> >
> >
> > -Gang
> >
> >
> >
> > ----- Original Message ----
> > From: Amogh Vasekar <am...@yahoo-inc.com>
> > To: "common-user@hadoop.apache.org" <co...@hadoop.apache.org>
> > Sent: 2010/1/28 (Thu) 10:00:22 AM
> > Subject: Re: fine granularity operation on HDFS
> >
> > Hi Gang,
> > Yes PathFilters work only on file paths. I meant you can include such
> type
> > of logic at split level.
> > The input format's getSplits() method is responsible for computing and
> > adding splits to a list container, for which JT initializes mapper tasks.
> > You can override the getSplits() method to add only a few , say, based on
> > the location or offset, to the list. Here's the reference :
> > while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
> >          int blkIndex = getBlockIndex(blkLocations,
> length-bytesRemaining);
> >          splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
> >                                   blkLocations[blkIndex].getHosts()));
> >          bytesRemaining -= splitSize;
> >        }
> >
> >        if (bytesRemaining != 0) {
> >          splits.add(new FileSplit(path, length-bytesRemaining,
> > bytesRemaining,
> >                     blkLocations[blkLocations.length-1].getHosts()));
> >
> > Before splits.add you can use your logic for discarding. However, you
> need
> > to ensure your record reader takes care of incomplete records at
> boundaries.
> >
> > To get the block locations to load separately, the FileSystem class APIs
> > expose few methods like getBlockLocations etc ..
> > Hope this helps.
> >
> > Amogh
> >
> > On 1/28/10 7:26 PM, "Gang Luo" <lg...@yahoo.com.cn> wrote:
> >
> > Thanks Amogh.
> >
> > For the second part of my question, I actually mean loading block
> > separately from HDFS. I don't know whether it is realistic. Anyway, for
> my
> > goal is to process different division of a file separately, to do that at
> > split level is OK. But even I can get the splits from inputformat, how to
> > "add only a few splits you need to mapper and discard the others"?
> > (pathfilters only works for file, but not block, I think).
> >
> > Thanks.
> > -Gang
> >
> >
> >
> >
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>

RE: map side only behavior

Posted by "Jones, Nick" <ni...@amd.com>.
A single identity reducer should enforce a merge and sort to generate one file.
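To spell that out, a minimal sketch using the stock IdentityReducer with the old API (class, job name and paths are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SingleOutputFileJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SingleOutputFileJob.class);
    conf.setJobName("single-output-file");           // placeholder name

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // A single identity reducer: all map output is shuffled, merged and
    // sorted into exactly one part file on the reduce side.
    conf.setReducerClass(IdentityReducer.class);
    conf.setNumReduceTasks(1);

    JobClient.runJob(conf);
  }
}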

Nick Jones

-----Original Message-----
From: Jeff Zhang [mailto:zjffdu@gmail.com] 
Sent: Friday, January 29, 2010 10:06 AM
To: common-user@hadoop.apache.org
Subject: Re: map side only behavior

No, the merge and sort will not happen in the map task, and each map task
will generate one output file.



2010/1/29 Gang Luo <lg...@yahoo.com.cn>

> Hi all,
> If I only use map side to process my data (set # of reducers to 0 ), what
> is the behavior of hadoop? Will it merge and sort each of the spills
> generated by one mapper?
>
> -Gang
>
>
> ----- Original Message ----
> From: Gang Luo <lg...@yahoo.com.cn>
> To: common-user@hadoop.apache.org
> Sent: 2010/1/29 (Fri) 8:54:33 AM
> Subject: Re: fine granularity operation on HDFS
>
> Yeah, I see how it works. Thanks Amogh.
>
>
> -Gang
>
>
>
> ----- Original Message ----
> From: Amogh Vasekar <am...@yahoo-inc.com>
> To: "common-user@hadoop.apache.org" <co...@hadoop.apache.org>
> Sent: 2010/1/28 (Thu) 10:00:22 AM
> Subject: Re: fine granularity operation on HDFS
>
> Hi Gang,
> Yes PathFilters work only on file paths. I meant you can include such type
> of logic at split level.
> The input format's getSplits() method is responsible for computing and
> adding splits to a list container, for which JT initializes mapper tasks.
> You can override the getSplits() method to add only a few , say, based on
> the location or offset, to the list. Here's the reference :
> while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
>          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
>          splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
>                                   blkLocations[blkIndex].getHosts()));
>          bytesRemaining -= splitSize;
>        }
>
>        if (bytesRemaining != 0) {
>          splits.add(new FileSplit(path, length-bytesRemaining,
> bytesRemaining,
>                     blkLocations[blkLocations.length-1].getHosts()));
>
> Before splits.add you can use your logic for discarding. However, you need
> to ensure your record reader takes care of incomplete records at boundaries.
>
> To get the block locations to load separately, the FileSystem class APIs
> expose few methods like getBlockLocations etc ..
> Hope this helps.
>
> Amogh
>
> On 1/28/10 7:26 PM, "Gang Luo" <lg...@yahoo.com.cn> wrote:
>
> Thanks Amogh.
>
> For the second part of my question, I actually mean loading block
> separately from HDFS. I don't know whether it is realistic. Anyway, for my
> goal is to process different division of a file separately, to do that at
> split level is OK. But even I can get the splits from inputformat, how to
> "add only a few splits you need to mapper and discard the others"?
> (pathfilters only works for file, but not block, I think).
>
> Thanks.
> -Gang
>
>
>
>



-- 
Best Regards

Jeff Zhang

Re: map side only behavior

Posted by Jeff Zhang <zj...@gmail.com>.
No, the merge and sort will not happen in the map task, and each map task
will generate one output file.



2010/1/29 Gang Luo <lg...@yahoo.com.cn>

> Hi all,
> If I only use map side to process my data (set # of reducers to 0 ), what
> is the behavior of hadoop? Will it merge and sort each of the spills
> generated by one mapper?
>
> -Gang
>
>
> ----- Original Message ----
> From: Gang Luo <lg...@yahoo.com.cn>
> To: common-user@hadoop.apache.org
> Sent: 2010/1/29 (Fri) 8:54:33 AM
> Subject: Re: fine granularity operation on HDFS
>
> Yeah, I see how it works. Thanks Amogh.
>
>
> -Gang
>
>
>
> ----- Original Message ----
> From: Amogh Vasekar <am...@yahoo-inc.com>
> To: "common-user@hadoop.apache.org" <co...@hadoop.apache.org>
> Sent: 2010/1/28 (Thu) 10:00:22 AM
> Subject: Re: fine granularity operation on HDFS
>
> Hi Gang,
> Yes PathFilters work only on file paths. I meant you can include such type
> of logic at split level.
> The input format's getSplits() method is responsible for computing and
> adding splits to a list container, for which JT initializes mapper tasks.
> You can override the getSplits() method to add only a few , say, based on
> the location or offset, to the list. Here's the reference :
> while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
>          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
>          splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
>                                   blkLocations[blkIndex].getHosts()));
>          bytesRemaining -= splitSize;
>        }
>
>        if (bytesRemaining != 0) {
>          splits.add(new FileSplit(path, length-bytesRemaining,
> bytesRemaining,
>                     blkLocations[blkLocations.length-1].getHosts()));
>
> Before splits.add you can use your logic for discarding. However, you need
> to ensure your record reader takes care of incomplete records at boundaries.
>
> To get the block locations to load separately, the FileSystem class APIs
> expose few methods like getBlockLocations etc ..
> Hope this helps.
>
> Amogh
>
> On 1/28/10 7:26 PM, "Gang Luo" <lg...@yahoo.com.cn> wrote:
>
> Thanks Amogh.
>
> For the second part of my question, I actually mean loading block
> separately from HDFS. I don't know whether it is realistic. Anyway, for my
> goal is to process different division of a file separately, to do that at
> split level is OK. But even I can get the splits from inputformat, how to
> "add only a few splits you need to mapper and discard the others"?
> (pathfilters only works for file, but not block, I think).
>
> Thanks.
> -Gang
>
>
>
>



-- 
Best Regards

Jeff Zhang

map side only behavior

Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi all,
If I only use the map side to process my data (i.e. set the number of reducers to 0), what is the behavior of Hadoop? Will it merge and sort each of the spills generated by one mapper?

-Gang


----- Original Message ----
From: Gang Luo <lg...@yahoo.com.cn>
To: common-user@hadoop.apache.org
Sent: 2010/1/29 (Fri) 8:54:33 AM
Subject: Re: fine granularity operation on HDFS

Yeah, I see how it works. Thanks Amogh.


-Gang



----- Original Message ----
From: Amogh Vasekar <am...@yahoo-inc.com>
To: "common-user@hadoop.apache.org" <co...@hadoop.apache.org>
Sent: 2010/1/28 (Thu) 10:00:22 AM
Subject: Re: fine granularity operation on HDFS

Hi Gang,
Yes PathFilters work only on file paths. I meant you can include such type of logic at split level.
The input format's getSplits() method is responsible for computing and adding splits to a list container, for which JT initializes mapper tasks. You can override the getSplits() method to add only a few , say, based on the location or offset, to the list. Here's the reference :
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
                                   blkLocations[blkIndex].getHosts()));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
                     blkLocations[blkLocations.length-1].getHosts()));

Before splits.add you can use your logic for discarding. However, you need to ensure your record reader takes care of incomplete records at boundaries.

To get the block locations to load separately, the FileSystem class APIs expose few methods like getBlockLocations etc ..
Hope this helps.

Amogh

On 1/28/10 7:26 PM, "Gang Luo" <lg...@yahoo.com.cn> wrote:

Thanks Amogh.

For the second part of my question, I actually mean loading block separately from HDFS. I don't know whether it is realistic. Anyway, for my goal is to process different division of a file separately, to do that at split level is OK. But even I can get the splits from inputformat, how to "add only a few splits you need to mapper and discard the others"? (pathfilters only works for file, but not block, I think).

Thanks.
-Gang




Re: fine granularity operation on HDFS

Posted by Gang Luo <lg...@yahoo.com.cn>.
Yeah, I see how it works. Thanks Amogh.

 
-Gang



----- Original Message ----
From: Amogh Vasekar <am...@yahoo-inc.com>
To: "common-user@hadoop.apache.org" <co...@hadoop.apache.org>
Sent: 2010/1/28 (Thu) 10:00:22 AM
Subject: Re: fine granularity operation on HDFS

Hi Gang,
Yes PathFilters work only on file paths. I meant you can include such type of logic at split level.
The input format's getSplits() method is responsible for computing and adding splits to a list container, for which JT initializes mapper tasks. You can override the getSplits() method to add only a few , say, based on the location or offset, to the list. Here's the reference :
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
                                   blkLocations[blkIndex].getHosts()));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
                     blkLocations[blkLocations.length-1].getHosts()));

Before splits.add you can use your logic for discarding. However, you need to ensure your record reader takes care of incomplete records at boundaries.

To get the block locations to load separately, the FileSystem class APIs expose few methods like getBlockLocations etc ..
Hope this helps.

Amogh

On 1/28/10 7:26 PM, "Gang Luo" <lg...@yahoo.com.cn> wrote:

Thanks Amogh.

For the second part of my question, I actually mean loading block separately from HDFS. I don't know whether it is realistic. Anyway, for my goal is to process different division of a file separately, to do that at split level is OK. But even I can get the splits from inputformat, how to "add only a few splits you need to mapper and discard the others"? (pathfilters only works for file, but not block, I think).

Thanks.
-Gang




Re: fine granularity operation on HDFS

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi Gang,
Yes, PathFilters work only on file paths; I meant you can include that type of logic at the split level.
The InputFormat's getSplits() method is responsible for computing splits and adding them to a list, from which the JobTracker initializes map tasks. You can override getSplits() to add only a few splits to the list, say based on location or offset. Here's the reference:
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
  int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
  splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
                           blkLocations[blkIndex].getHosts()));
  bytesRemaining -= splitSize;
}

if (bytesRemaining != 0) {
  splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
                           blkLocations[blkLocations.length-1].getHosts()));
}

Before the splits.add() call you can apply your logic for discarding splits. However, you need to ensure your record reader takes care of incomplete records at the boundaries.
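As a rough sketch of what such an override could look like, built here on TextInputFormat (the filter rule and the configuration key are made up, and the record-boundary caveat above still applies):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class OffsetFilteringInputFormat extends TextInputFormat {
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    InputSplit[] all = super.getSplits(job, numSplits);
    // Made-up key: only keep splits that start below this byte offset.
    long maxStart = job.getLong("example.max.split.start", Long.MAX_VALUE);
    List<InputSplit> kept = new ArrayList<InputSplit>();
    for (InputSplit split : all) {
      FileSplit fs = (FileSplit) split;
      if (fs.getStart() < maxStart) {
        kept.add(fs);
      }
    }
    return kept.toArray(new InputSplit[kept.size()]);
  }
}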

To get the block locations so you can load them separately, the FileSystem class exposes a few methods, like getFileBlockLocations().
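And a hedged sketch of reading just one block's byte range directly through the FileSystem API (the path comes from the command line, the buffer size is arbitrary, and it assumes the file has at least one block):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadOneBlock {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]);

    FileStatus status = fs.getFileStatus(path);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    long start = blocks[0].getOffset();      // first block of the file
    long length = blocks[0].getLength();

    FSDataInputStream in = fs.open(path);
    in.seek(start);                          // jump to the block's start offset
    byte[] buffer = new byte[(int) Math.min(length, 64 * 1024)];
    int read = in.read(buffer);              // read (part of) that block
    System.out.println("Read " + read + " bytes starting at offset " + start);
    in.close();
  }
}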
Hope this helps.

Amogh

On 1/28/10 7:26 PM, "Gang Luo" <lg...@yahoo.com.cn> wrote:

Thanks Amogh.

For the second part of my question, I actually mean loading block separately from HDFS. I don't know whether it is realistic. Anyway, for my goal is to process different division of a file separately, to do that at split level is OK. But even I can get the splits from inputformat, how to "add only a few splits you need to mapper and discard the others"? (pathfilters only works for file, but not block, I think).

Thanks.
-Gang

Re: fine granularity operation on HDFS

Posted by Gang Luo <lg...@yahoo.com.cn>.
Thanks Amogh.

For the second part of my question, I actually meant loading blocks separately from HDFS; I don't know whether that is realistic. Anyway, since my goal is to process different divisions of a file separately, doing it at the split level is OK. But even if I can get the splits from the InputFormat, how do I "add only a few splits you need to mapper and discard the others"? (PathFilters only work on files, not blocks, I think.)

Thanks.
-Gang


----- Original Message ----
From: Amogh Vasekar <am...@yahoo-inc.com>
To: "common-user@hadoop.apache.org" <co...@hadoop.apache.org>
Sent: 2010/1/27 (Wed) 1:40:26 PM
Subject: Re: fine granularity operation on HDFS

Hi,
>>now that I can get the splits of a file in hadoop, is it possible to name some splits (not all) as the input to mapper?
I'm assuming when you say "splits of a file in hadoop" you mean splits generated from the inputformat and not the blocks stored in HDFS.
The [File]InputFormat you use gives you access to splits, locations etc. You can use this to add only a few splits you need to mapper and discard the others ( something you can do on files as a whole using PathFilters ).

>>Or can I manually read some of these splits (not the whole file) using HDFS api?
You mean you list these splits somewhere in a file beforehand so individual mappers can read one line (split) ?

Amogh



Re: fine granularity operation on HDFS

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi,
>>now that I can get the splits of a file in hadoop, is it possible to name some splits (not all) as the input to mapper?
I'm assuming that when you say "splits of a file in Hadoop" you mean the splits generated by the InputFormat, not the blocks stored in HDFS.
The [File]InputFormat you use gives you access to splits, locations, etc. You can use this to add only the few splits you need to the mapper and discard the others (something you can do on whole files using PathFilters).
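For completeness, a small sketch of the whole-file filtering mentioned above: a PathFilter applied via FileSystem.listStatus(), with only the surviving paths handed to the job (the ".log" suffix rule is just an example):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class FilteredInput {
  public static void addFilteredInputs(JobConf conf, Path dir) throws IOException {
    FileSystem fs = dir.getFileSystem(conf);

    // Example rule only: keep files whose names end in ".log".
    PathFilter logsOnly = new PathFilter() {
      public boolean accept(Path p) {
        return p.getName().endsWith(".log");
      }
    };

    List<Path> inputs = new ArrayList<Path>();
    for (FileStatus status : fs.listStatus(dir, logsOnly)) {
      inputs.add(status.getPath());
    }
    FileInputFormat.setInputPaths(conf, inputs.toArray(new Path[inputs.size()]));
  }
}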

>>Or can I manually read some of these splits (not the whole file) using HDFS api?
You mean you list these splits somewhere in a file beforehand so individual mappers can read one line (split) ?

Amogh



fine granularity operation on HDFS

Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi all,
Now that I can get the splits of a file in Hadoop, is it possible to name some splits (not all of them) as the input to the mapper? Or can I manually read some of these splits (not the whole file) using the HDFS API?

-Gang


