Posted to common-user@hadoop.apache.org by Sugandha Naolekar <su...@gmail.com> on 2014/02/25 07:10:13 UTC

Reading a file in a customized way

Hello,

Irrespective of how the file blocks are placed in HDFS, I want my map() to
be invoked in a customized manner. For example, I want to process a large
JSON file (a single file). This file is well under the default block size
(128 MB), so ideally only one mapper will be launched; that is, the map task
will run only once, right? But I want my map function to process every
feature of this JSON file, i.e. each feature should become one map record.
To read the JSON this way, will I have to build custom input splits and use
a custom record reader? Please find a sample of the JSON file below:

{
"type": "FeatureCollection",
"features": [
{ "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
"CLAZZ": 31.000000, "ROAD_TYPE": 3.000000, "END_ID": 22278.000000,
"OSM_META": "", "REVERSE_LE": 128.579933, "X1": 77.542660, "OSM_SOURCE":
1524649946.000000, "COST": 0.003129, "OSM_TARGET": 529780893.000000, "X2":
77.542832, "Y2": 12.990992, "CONGESTED_": 138.579933, "Y1": 12.989879,
"REVERSE_CO": 0.003129, "CONGESTION": 10.000000, "OSM_ID": 38033028.000000,
"START_ID": 34570.000000, "KM": 0.000000, "LENGTH": 128.579933,
"REVERSE__1": 138.579933, "SPEED_IN_K": 40.000000, "ROW_FLAG": "F" },
"geometry": { "type": "LineString", "coordinates": [ [ 8632009.414824,
1458576.029252 ], [ 8632012.876860, 1458598.957830 ], [ 8632028.595172,
1458703.170565 ] ] } }
,
{ "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
"CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID": 33451.000000,
"OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE":
1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, "X2":
77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 12.993107,
"REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID":
138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH":
217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, "ROW_FLAG":
"F" }, "geometry": { "type": "LineString", "coordinates": [ [
8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }

]

}
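
To make the intent concrete, a record reader along the following lines is
roughly what I have in mind. This is only a sketch: the class name
FeatureRecordReader and its line-matching heuristic are my own, and it
assumes the whole file is read as a single split and that every feature
object starts on its own line, as in the sample above.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// Sketch: emit each "Feature" object of the GeoJSON file as one record.
public class FeatureRecordReader extends RecordReader<LongWritable, Text> {
  private LineReader in;
  private FSDataInputStream fileIn;
  private long count = 0;
  private final LongWritable key = new LongWritable();
  private final Text value = new Text();

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException {
    Configuration conf = context.getConfiguration();
    Path file = ((FileSplit) split).getPath();
    fileIn = file.getFileSystem(conf).open(file);
    in = new LineReader(fileIn, conf);
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    Text line = new Text();
    StringBuilder feature = null;
    while (in.readLine(line) > 0) {
      String s = line.toString();
      if (feature == null) {
        // A feature object in the sample starts on its own line...
        if (s.trim().startsWith("{ \"type\": \"Feature\"")) {
          feature = new StringBuilder(s);
        }
      } else {
        feature.append(' ').append(s);
      }
      // ...and its last line ends with the closing "} }".
      if (feature != null && s.trim().endsWith("} }")) {
        key.set(count++);
        value.set(feature.toString());
        return true;
      }
    }
    return false;
  }

  @Override public LongWritable getCurrentKey() { return key; }
  @Override public Text getCurrentValue() { return value; }
  @Override public float getProgress() { return 0.0f; }
  @Override public void close() throws IOException { fileIn.close(); }
}

Is something like this the right direction, or is there a standard input
format for it?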

Also, in what manner does Hadoop generally split text files into blocks?
Line by line? Can this be customized? If not, can we read a record that
spans two blocks? E.g., each feature in this JSON spans multiple lines.
Could it happen that one line of a feature lands in a block on one machine
while the rest of its lines land in a block on another machine?

--
Thanks & Regards,
Sugandha Naolekar

Re: Reading a file in a customized way

Posted by sudhakara st <su...@gmail.com>.
Use WholeFileInputFormat/WholeFileRecordReader (Tom White's The Hadoop
Definitive Guide, page 240) to read the file name as the key and the entire
contents of the file as the value passed to the mapper.
Before getting into this, it is better to read up on the HDFS architecture
and the MapReduce flow:
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
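
A minimal sketch of that pattern (modelled on the book's example rather than
copied from it; these new-API class and method names are my rendering) would
look like this:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // one split, and hence one mapper, per file
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new RecordReader<NullWritable, BytesWritable>() {
      private FileSplit fileSplit;
      private TaskAttemptContext ctx;
      private final BytesWritable value = new BytesWritable();
      private boolean processed = false;

      @Override
      public void initialize(InputSplit s, TaskAttemptContext c) {
        fileSplit = (FileSplit) s;
        ctx = c;
      }

      @Override
      public boolean nextKeyValue() throws IOException {
        if (processed) return false;
        // Read the whole file into memory; assumes the file is small.
        byte[] contents = new byte[(int) fileSplit.getLength()];
        Path file = fileSplit.getPath();
        FSDataInputStream in =
            file.getFileSystem(ctx.getConfiguration()).open(file);
        try {
          IOUtils.readFully(in, contents, 0, contents.length);
          value.set(contents, 0, contents.length);
        } finally {
          IOUtils.closeStream(in);
        }
        processed = true;
        return true;
      }

      @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
      @Override public BytesWritable getCurrentValue() { return value; }
      @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
      @Override public void close() { }
    };
  }
}

In the mapper you can then recover the file name from the input split, e.g.
((FileSplit) context.getInputSplit()).getPath(), if you want it as the key.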


-- 

Regards,
...sudhakara

RE: Reading a file in a customized way

Posted by Shumin Guo <gs...@gmail.com>.
You can extend FileInputFormat and override isSplitable() to return false.
More info is in the Javadoc:
https://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html
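
For instance, a minimal sketch in the old (mapred) API that this Javadoc
covers (subclassing TextInputFormat here is just an illustrative choice):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Sketch: mark the input non-splittable so one mapper sees the whole file.
public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;  // the entire file becomes a single input split
  }
}

Then register it on the job with
conf.setInputFormat(NonSplittableTextInputFormat.class).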

Shumin

RE: Reading a file in a customized way

Posted by java8964 <ja...@hotmail.com>.
See my reply to another email today about a similar question:
"RE: Can the file storage in HDFS be customized?"

Thanks,
Yong
