Posted to common-user@hadoop.apache.org by Song Liu <la...@gmail.com> on 2010/04/05 15:57:09 UTC

Get Line Number from InputFormat

Dear all,
   TextInputFormat sends <Offset, Line> pairs to the Mapper; however, the
offset is sometimes meaningless and confusing. Is it possible to have an
InputFormat which outputs <Line NO., Line> to the mapper?

Thanks a lot.

Song

RE: Get Line Number from InputFormat

Posted by Michael Segel <mi...@hotmail.com>.

OK, so working out your position in the file from the offset and a known fixed-length record format (er, what you meant by "structured") will give you a line number.
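
A minimal sketch of that calculation, assuming every record (including its trailing newline) is exactly the same number of bytes. The function name and the 6-byte record length are made up for illustration; the offsets stand in for the keys TextInputFormat would pass to the mapper.

```python
def line_number_from_offset(offset: int, record_length: int) -> int:
    """Return the 0-based line number of the record that starts at `offset`.

    Assumes a hypothetical fixed-width format in which every record,
    newline included, occupies exactly `record_length` bytes.
    """
    if record_length <= 0:
        raise ValueError("record_length must be positive")
    if offset % record_length != 0:
        raise ValueError("offset does not fall on a record boundary")
    return offset // record_length


# Three records of 5 data bytes plus '\n' each.
data = b"aaaaa\nbbbbb\nccccc\n"
RECORD_LENGTH = 6

# Collect the byte offset at which each line starts.
offsets = []
position = 0
for record in data.splitlines(keepends=True):
    offsets.append(position)
    position += len(record)

line_numbers = [line_number_from_offset(o, RECORD_LENGTH) for o in offsets]
print(line_numbers)  # prints [0, 1, 2]
```

As soon as one record has a different length, the division no longer gives the right answer, which is the limitation discussed next.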

But let's look at the question with a more practical and wider application in mind.  In most applications where you have a single record per line, you will not have a fixed-length record format, so you really don't have a good way to calculate your line number from your position in the file.

Let's also look at how important a line number is in practical use.
Sort of like a row_id in a partitioned table, a line number loses its meaning once the data is split up.

If the line number had a specific meaning and the application ended its records with a '\n' (or CR LF),
then an alternative would be to add a field that contained the line number.
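
One way to picture that alternative is a small preprocessing step, run before the file is loaded into the cluster, that writes the line number as the first field of every record. This is only a sketch: the function name, the tab separator, and the 1-based numbering are all assumptions, not anything prescribed by Hadoop.

```python
import io


def add_line_numbers(src, dst, sep="\t"):
    """Copy text lines from `src` to `dst`, prefixing each with its line number.

    The separator and the 1-based numbering are illustrative choices.
    Returns the number of lines written.
    """
    count = 0
    for count, line in enumerate(src, start=1):
        dst.write(f"{count}{sep}{line}")
    return count


# In-memory stand-ins for the input and output files.
src = io.StringIO("alpha\nbeta\ngamma\n")
dst = io.StringIO()
written = add_line_numbers(src, dst)
numbered = dst.getvalue()
print(numbered, end="")
```

Because the number travels inside the record itself, it survives however the file is later split among mappers.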

HTH

-Mike

PS. Wouldn't you call a record in XML structured? Yet of an unknown length? ;-)

(Sorry, I haven't had my first cup of coffee yet. :-)   )
> From: amogh@yahoo-inc.com
> To: common-user@hadoop.apache.org
> Date: Tue, 6 Apr 2010 12:14:56 +0530
> Subject: Re: Get Line Number from InputFormat
> 
> Hi,
> If your records are structured / of equal size, then getting the line number is straightforward.
> If not, you'll need to construct your own sequence of numbers; someone's been kind enough to publish a way of doing this on his blog:
> 
> http://www.data-miners.com/blog/2009/11/hadoop-and-mapreduce-parallel-program.html
> 
> Amogh

Re: Get Line Number from InputFormat

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi,
If your records are structured / of equal size, then getting the line number is straightforward.
If not, you'll need to construct your own sequence of numbers; someone's been kind enough to publish a way of doing this on his blog:

http://www.data-miners.com/blog/2009/11/hadoop-and-mapreduce-parallel-program.html
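
For illustration, one common way to construct such a sequence is a two-pass scheme: first count the lines in each split, then turn those counts into a starting number per split and number the lines locally. The sketch below simulates this in plain Python with lists standing in for splits. It is not taken from the linked post and may differ from it; all function names are hypothetical.

```python
from itertools import accumulate


def count_lines_per_split(splits):
    """Pass 1: each 'mapper' counts the lines in its own split."""
    return [len(split) for split in splits]


def starting_line_numbers(counts):
    """Exclusive prefix sum: the global number of the first line in each split."""
    if not counts:
        return []
    return [0] + list(accumulate(counts))[:-1]


def number_lines(splits):
    """Pass 2: each 'mapper' adds its split's starting number to a local index."""
    starts = starting_line_numbers(count_lines_per_split(splits))
    numbered = []
    for start, split in zip(starts, splits):
        for local_index, line in enumerate(split):
            numbered.append((start + local_index, line))
    return numbered


# Three splits of unequal size, in file order.
splits = [["a", "b", "c"], ["d"], ["e", "f"]]
result = number_lines(splits)
print(result)
```

The scheme depends on knowing the order of the splits within the file, and it costs an extra pass over the data just to obtain the counts.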

Amogh


On 4/5/10 7:59 PM, "Michael Segel" <mi...@hotmail.com> wrote:

Song,

I'm not sure what you want is realistic or even worthwhile.

You have a file, and it's split into chunks of 64MB (the default) or something larger, depending on your cloud settings.
You have a map job that starts from a specific point in the file, but that does not mean that it's starting at a specific line, or that Hadoop will know which line of the file that is. (Your records are not always going to end at the end of a line, or be laid out one line per record.)

Does that make sense?
The offset has more meaning than an arbitrary line number.

-Mike


RE: Get Line Number from InputFormat

Posted by Michael Segel <mi...@hotmail.com>.


> Date: Mon, 5 Apr 2010 14:57:09 +0100
> From: lamfeeling2@gmail.com
> To: common-user@hadoop.apache.org
> Subject: Get Line Number from InputFormat
> 
> Dear all,
>    TextInputFormat sends <Offset, Line> pairs to the Mapper; however, the
> offset is sometimes meaningless and confusing. Is it possible to have an
> InputFormat which outputs <Line NO., Line> to the mapper?
> 
> Thanks a lot.
> 
> Song

Song,

I'm not sure what you want is realistic or even worthwhile.

You have a file, and it's split into chunks of 64MB (the default) or something larger, depending on your cloud settings.
You have a map job that starts from a specific point in the file, but that does not mean that it's starting at a specific line, or that Hadoop will know which line of the file that is. (Your records are not always going to end at the end of a line, or be laid out one line per record.)
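
To make the point concrete, here is a toy simulation of byte-range splits over a small buffer. It is a simplified sketch of how a line-oriented reader can cope with a split that starts mid-line (skip the partial first line, and read any line that begins at or before the end of the range). It is not Hadoop's actual reader code, and the function name and the 6-byte split size are invented for the example.

```python
def read_split(data: bytes, start: int, end: int):
    """Yield (offset, line) for the lines owned by the byte range [start, end)."""
    pos = start
    if start != 0:
        # Discard the first, possibly partial, line: the previous split reads it.
        newline = data.find(b"\n", start)
        pos = len(data) if newline == -1 else newline + 1
    # Read every line that begins at or before `end`, even if it runs past it.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline + 1
        yield pos, data[pos:stop].rstrip(b"\n")
        pos = stop


data = b"one\ntwo\nthree\nfour\n"
split_size = 6  # deliberately tiny so that boundaries fall inside lines

# Cut the buffer into fixed-size byte ranges, ignoring line boundaries.
splits = [(s, min(s + split_size, len(data)))
          for s in range(0, len(data), split_size)]
per_split = [list(read_split(data, s, e)) for s, e in splits]

for (s, e), records in zip(splits, per_split):
    print((s, e), records)
```

Each reader can report the byte offset of every line it owns, but nothing inside a split says how many lines came before it. The reader for the range starting at byte 6 sees "three" at offset 8 and has no way of knowing that it is the third line without help from the other splits.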

Does that make sense?
The offset has more meaning than an arbitrary line number.

-Mike