Posted to user@hadoop.apache.org by Jason Yang <li...@gmail.com> on 2012/09/12 05:15:30 UTC

How to split a sequence file

Hi,

I have a sequence file written by SequenceFileOutputFormat with key/value
type of <Text, BytesWritable>, like below:

Text     BytesWritable
-------  ----------------------------------------
id_A_01  7F2B3C687F2B3C687F2B3C68
id_A_02  2F2B3C687F2B3C687F2B3C686AB23C68D73C68D7
id_A_03  5F2B3C68D77F2B3C687F2B3A
...
id_B_01  1AB23C68D73C68D76AB23C68D73C68D7
id_B_02  5AB23C68D73C68D76AB68D76A1
id_B_03  F2B23C68D7B23C68D7B23C68D7

If I want all the records with the same key prefix to be processed by the
same mapper, say records with key id_A_XX by one mapper and records with
key id_B_XX by another mapper, what should I do?


Should I implement my own InputFormat that extends
SequenceFileInputFormat?

Any help would be appreciated.
-- 
YANG, Lin

Re: How to split a sequence file

Posted by Jason Yang <li...@gmail.com>.

hey guys,

Thanks for all your suggestions.

To wrap up, there are two ways to achieve this:
1. Use multiple sequence files, then write a WholeFileInputFormat that
treats each file as a single split by overriding isSplitable() to return false.
2. Distribute the records with a partitioner and do the processing in the
reducers; however, the shuffle adds some network and I/O cost.
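
For option 2, the partitioner only needs to look at the key prefix. Here is a minimal sketch in plain Java, assuming the prefix is everything before the last underscore. The class and method names are hypothetical, not Hadoop API. In a real job the body of partitionFor would go into an override of Partitioner.getPartition(key, value, numPartitions), using the Text key's string form.

```java
/** Hypothetical helper illustrating prefix-based partitioning; not part of Hadoop. */
final class PrefixPartitionSketch {
    private PrefixPartitionSketch() {}

    /** Returns the key up to (not including) the last '_', e.g. "id_A_01" -> "id_A". */
    static String prefixOf(String key) {
        int cut = key.lastIndexOf('_');
        return cut < 0 ? key : key.substring(0, cut);
    }

    /**
     * Hashes only the prefix, with the same mask-and-modulo formula that
     * Hadoop's default HashPartitioner applies to the whole key.
     */
    static int partitionFor(String key, int numPartitions) {
        return (prefixOf(key).hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Assumed sample keys taken from the original post.
        for (String key : new String[] {"id_A_01", "id_A_02", "id_B_01"}) {
            System.out.println(key + " -> partition " + partitionFor(key, 4));
        }
    }
}
```

Records with the same prefix always get the same partition number. Two different prefixes can still land in the same partition, so the reducer should be prepared to see more than one group.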

BTW, since the computation can be parallelized in either the Mapper or the
Reducer, what's the difference between them?

2012/9/12 Ajay Srivastava <Aj...@guavus.com>

> Hi Jason,
> I am wondering about the use case for distributing records to mappers on
> the basis of key. If possible, could you please share your scenario?
> Is it a map-only job? Why not distribute the records using a partitioner
> and do the processing in the reducers?
>
>
> Regards,
> Ajay Srivastava
>
>
> On 12-Sep-2012, at 8:45 AM, Jason Yang wrote:
>
> > Hi,
> >
> > I have a sequence file written by SequenceFileOutputFormat with
> key/value type of <Text, BytesWritable>, like below:
> >
> > Text                             BytesWritable
> > -------------------------------------------------------------
> > id_A_01  7F2B3C687F2B3C687F2B3C68
> > id_A_02  2F2B3C687F2B3C687F2B3C686AB23C68D73C68D7
> > id_A_03  5F2B3C68D77F2B3C687F2B3A
> > ...
> > id_B_01  1AB23C68D73C68D76AB23C68D73C68D7
> > id_B_02  5AB23C68D73C68D76AB68D76A1
> > id_B_03  F2B23C68D7B23C68D7B23C68D7
> >
> > If I want all the records with the same key prefix to be processed by a
> same mapper, say records with key id_A_XX are processed by a mapper and
> records with key id_B_XX are processed by another mapper, what should I do?
> >
> > Should I implement our own InputFormat inherited from
> SequenceFileInputFormat ?
> >
> > Any help would be appreciated.
> > --
> > YANG, Lin
> >
>
>


-- 
YANG, Lin

Re: How to split a sequence file

Posted by Ajay Srivastava <Aj...@guavus.com>.

Hi Jason,
I am wondering about the use case for distributing records to mappers on the basis of key. If possible, could you please share your scenario?
Is it a map-only job? Why not distribute the records using a partitioner and do the processing in the reducers?


Regards,
Ajay Srivastava 


On 12-Sep-2012, at 8:45 AM, Jason Yang wrote:

> Hi, 
> 
> I have a sequence file written by SequenceFileOutputFormat with key/value type of <Text, BytesWritable>, like below:
> 
> Text                             BytesWritable
> -------------------------------------------------------------
> id_A_01  7F2B3C687F2B3C687F2B3C68
> id_A_02  2F2B3C687F2B3C687F2B3C686AB23C68D73C68D7
> id_A_03  5F2B3C68D77F2B3C687F2B3A
> ...
> id_B_01  1AB23C68D73C68D76AB23C68D73C68D7
> id_B_02  5AB23C68D73C68D76AB68D76A1
> id_B_03  F2B23C68D7B23C68D7B23C68D7
> 
> If I want all the records with the same key prefix to be processed by a same mapper, say records with key id_A_XX are processed by a mapper and records with key id_B_XX are processed by another mapper, what should I do?  
> 
> Should I implement our own InputFormat inherited from SequenceFileInputFormat ?
> 
> Any help would be appreciated.
> -- 
> YANG, Lin
>

Re: How to split a sequence file

Posted by Robert Dyer <ps...@gmail.com>.

If the file is pre-sorted, why not just make multiple sequence files -
1 for each split?

Then you don't have to compute InputSplits because the physical files
are already split.
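
A rough sketch of that idea in plain Java, assuming the prefix is everything before the last underscore and an invented "<prefix>.seq" naming scheme. All names here are hypothetical. In a real job each bucket would be written with its own SequenceFile writer, and the input format would return false from isSplitable() so that each file goes to exactly one mapper.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: decides which output file each record key belongs to. */
final class PerPrefixFilesSketch {
    private PerPrefixFilesSketch() {}

    /** Maps a key such as "id_A_01" to an assumed file name "id_A.seq". */
    static String fileNameFor(String key) {
        int cut = key.lastIndexOf('_');
        String prefix = cut < 0 ? key : key.substring(0, cut);
        return prefix + ".seq";
    }

    /** Groups keys by target file name, keeping the order in which they arrive. */
    static Map<String, List<String>> assign(List<String> keys) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        for (String key : keys) {
            String name = fileNameFor(key);
            List<String> bucket = files.get(name);
            if (bucket == null) {
                bucket = new ArrayList<>();
                files.put(name, bucket);
            }
            bucket.add(key);
        }
        return files;
    }
}
```

This version does not require the input to be sorted, but it produces one file per distinct prefix, which may be a lot of small files if there are many prefixes.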

On Tue, Sep 11, 2012 at 11:00 PM, Harsh J <ha...@cloudera.com> wrote:
> Hey Jason,
>
> Is the file pre-sorted? You could override the InputFormat's
> #getSplits method to return InputSplits at identified key boundaries,
> as one solution - this would require reading the file up-front (at
> submit-time) and building the input splits out of it.
>
> On Wed, Sep 12, 2012 at 8:45 AM, Jason Yang <li...@gmail.com> wrote:
>> Hi,
>>
>> I have a sequence file written by SequenceFileOutputFormat with key/value
>> type of <Text, BytesWritable>, like below:
>>
>> Text                             BytesWritable
>> -------------------------------------------------------------
>> id_A_01  7F2B3C687F2B3C687F2B3C68
>> id_A_02  2F2B3C687F2B3C687F2B3C686AB23C68D73C68D7
>> id_A_03  5F2B3C68D77F2B3C687F2B3A
>> ...
>> id_B_01  1AB23C68D73C68D76AB23C68D73C68D7
>> id_B_02  5AB23C68D73C68D76AB68D76A1
>> id_B_03  F2B23C68D7B23C68D7B23C68D7
>>
>> If I want all the records with the same key prefix to be processed by a same
>> mapper, say records with key id_A_XX are processed by a mapper and records
>> with key id_B_XX are processed by another mapper, what should I do?
>>
>> Should I implement our own InputFormat inherited from
>> SequenceFileInputFormat ?
>>
>> Any help would be appreciated.
>> --
>> YANG, Lin
>>
>
>
>
> --
> Harsh J

Re: How to split a sequence file

Posted by Harsh J <ha...@cloudera.com>.

Hey Jason,

Is the file pre-sorted? You could override the InputFormat's
#getSplits method to return InputSplits at identified key boundaries,
as one solution - this would require reading the file up-front (at
submit-time) and building the input splits out of it.
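
To illustrate the boundary scan, here is a minimal sketch in plain Java that walks pre-sorted keys and emits one range per key prefix. It assumes the prefix is everything before the last underscore and works on record indexes. The class and its members are hypothetical, not Hadoop API. A real getSplits() override would read the keys with a SequenceFile reader, work with byte offsets instead of indexes, and would have to account for how the record reader aligns a split start to the file's sync markers.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds runs of consecutive keys that share a prefix. */
final class PrefixSplitSketch {
    private PrefixSplitSketch() {}

    /** One run of records with the same prefix, as inclusive record indexes. */
    static final class Range {
        final String prefix;
        final int first;
        final int last;

        Range(String prefix, int first, int last) {
            this.prefix = prefix;
            this.first = first;
            this.last = last;
        }
    }

    /** Returns the key up to (not including) the last '_', e.g. "id_A_01" -> "id_A". */
    static String prefixOf(String key) {
        int cut = key.lastIndexOf('_');
        return cut < 0 ? key : key.substring(0, cut);
    }

    /** Scans keys that are already sorted and closes a range whenever the prefix changes. */
    static List<Range> rangesByPrefix(List<String> sortedKeys) {
        List<Range> ranges = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= sortedKeys.size(); i++) {
            boolean atEnd = i == sortedKeys.size();
            String current = prefixOf(sortedKeys.get(start));
            if (atEnd || !prefixOf(sortedKeys.get(i)).equals(current)) {
                ranges.add(new Range(current, start, i - 1));
                start = i;
            }
        }
        return ranges;
    }
}
```

For the six sample keys in the original post this yields two ranges, one for id_A (records 0 to 2) and one for id_B (records 3 to 5). If the file is not sorted, the same prefix will show up in several ranges and this approach no longer guarantees one mapper per prefix.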

On Wed, Sep 12, 2012 at 8:45 AM, Jason Yang <li...@gmail.com> wrote:
> Hi,
>
> I have a sequence file written by SequenceFileOutputFormat with key/value
> type of <Text, BytesWritable>, like below:
>
> Text                             BytesWritable
> -------------------------------------------------------------
> id_A_01  7F2B3C687F2B3C687F2B3C68
> id_A_02  2F2B3C687F2B3C687F2B3C686AB23C68D73C68D7
> id_A_03  5F2B3C68D77F2B3C687F2B3A
> ...
> id_B_01  1AB23C68D73C68D76AB23C68D73C68D7
> id_B_02  5AB23C68D73C68D76AB68D76A1
> id_B_03  F2B23C68D7B23C68D7B23C68D7
>
> If I want all the records with the same key prefix to be processed by a same
> mapper, say records with key id_A_XX are processed by a mapper and records
> with key id_B_XX are processed by another mapper, what should I do?
>
> Should I implement our own InputFormat inherited from
> SequenceFileInputFormat ?
>
> Any help would be appreciated.
> --
> YANG, Lin
>



-- 
Harsh J
