Posted to common-user@hadoop.apache.org by Fengyun RAO <ra...@gmail.com> on 2014/02/27 10:59:38 UTC

What if file format is dependent upon first few lines?

Below is a fake sample of a Microsoft IIS log:
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-07-04 20:00:00
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 2.2.2.2 someuserAgent 200 0 0 390
2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 3.3.3.3 someuserAgent 200 0 0 390
2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 4.4.4.4 someuserAgent 200 0 0 390
...

The first four lines describe the file format, which is needed to parse
each log line. This means the log file can NOT simply be split; otherwise
the second split would lose the "file format" information.

How could each mapper get the first few lines in the file?

RE: What if file format is dependent upon first few lines?

Posted by java8964 <ja...@hotmail.com>.
If the file is big enough that you want to split it for parallel processing, one option is to get the full file path from the InputSplit in your mapper, open the file (since you have the path, you can read from the beginning), read the first 4 lines, and process the current split based on their content.
I believe a file in HDFS supports concurrent reads without any problem.
Yong
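
A minimal sketch of the idea above, using only the Java standard library so it runs without a cluster. In a real job the path would presumably come from `((FileSplit) context.getInputSplit()).getPath()` in the mapper's `setup()`, and the file would be opened with `FileSystem.open(path)`; here a local `java.nio.file.Path` stands in for both. The names `IisHeaderSketch`, `readFields` and `parseLine` are made up for illustration, and the code assumes the `#Fields:` directive sits among the leading `#` lines, as in the sample.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: every mapper calls readFields() once, whatever split it owns.
class IisHeaderSketch {
    /** Reads the leading '#' directive lines and returns the column names from "#Fields:". */
    static List<String> readFields(Path file) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            // Directives sit at the top of the file; stop at the first data line.
            while ((line = in.readLine()) != null && line.startsWith("#")) {
                if (line.startsWith("#Fields:")) {
                    String names = line.substring("#Fields:".length()).trim();
                    return Arrays.asList(names.split("\\s+"));
                }
            }
        }
        throw new IOException("no #Fields directive at the start of " + file);
    }

    /** Pairs each space-separated value of a data line with its column name. */
    static Map<String, String> parseLine(List<String> fields, String line) {
        String[] values = line.trim().split("\\s+");
        if (values.length != fields.size()) {
            throw new IllegalArgumentException(
                    "expected " + fields.size() + " columns, got " + values.length);
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < values.length; i++) {
            record.put(fields.get(i), values[i]);
        }
        return record;
    }
}
```

Each mapper would call `readFields` once before processing records and then `parseLine` for every line of its own split, so the extra cost is one short read of the file head per task.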

Re: What if file format is dependent upon first few lines?

Posted by Harsh J <ha...@cloudera.com>.
A mapper's record reader implementation need not be strictly
restricted to the input split boundary. It is a loose relationship:
you can always seek(0), read the lines you need to prepare, then
seek(offset) and continue reading.

Apache Avro (http://avro.apache.org) has a similar format: the header
contains the schema a reader needs to work.
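
A rough sketch of the seek(0) / seek(offset) pattern described above, with `java.io.RandomAccessFile` standing in for the seekable HDFS input stream so the example runs locally. The names `SplitReaderSketch`, `readFields` and `readColumn` are hypothetical. The boundary rule used here (a split other than the first skips its first, possibly partial, line, and every split reads any line that begins at or before its end offset) is an assumption modeled on how line-oriented readers commonly handle splits; it is not stated in this thread.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a record reader; not a Hadoop RecordReader implementation.
class SplitReaderSketch {
    /** Seeks to byte 0 and returns the column names from the "#Fields:" directive. */
    static List<String> readFields(RandomAccessFile in) throws IOException {
        in.seek(0);
        String line;
        while ((line = in.readLine()) != null && line.startsWith("#")) {
            if (line.startsWith("#Fields:")) {
                return Arrays.asList(line.substring("#Fields:".length()).trim().split("\\s+"));
            }
        }
        throw new IOException("no #Fields directive at the start of the file");
    }

    /** Returns one column's values for the records owned by the split at [start, start + length]. */
    static List<String> readColumn(String path, long start, long length, String column)
            throws IOException {
        List<String> out = new ArrayList<>();
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            // Step 1: seek(0) and read the header, regardless of where the split begins.
            List<String> fields = readFields(in);
            int index = fields.indexOf(column);
            if (index < 0) {
                throw new IllegalArgumentException("unknown column: " + column);
            }
            // Step 2: seek(offset) and continue reading this split's records.
            long end = start + length;
            in.seek(start);
            if (start != 0) {
                in.readLine(); // this line is read by the previous split
            }
            while (in.getFilePointer() <= end) {
                String line = in.readLine();
                if (line == null) {
                    break; // end of file
                }
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // blank line or header directive
                }
                out.add(line.trim().split("\\s+")[index]);
            }
        }
        return out;
    }
}
```

With this rule, reading the same file as two adjacent splits returns each record exactly once, and both splits resolve `c-ip` to the same column because each one reads the header itself.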


-- 
Harsh J
