Posted to common-user@hadoop.apache.org by Sugandha Naolekar <su...@gmail.com> on 2014/02/25 07:10:13 UTC
Reading a file in a customized way
Hello,
Irrespective of how the file's blocks are placed in HDFS, I want my map() to be invoked in a customized manner. For example, I want to process a huge JSON file (a single file). This file is definitely smaller than the default block size (128 MB), so ideally only one mapper will be launched; that is, the map task will be called only once, right? However, I want my map function to process each feature of this JSON file, i.e., each feature should become a separate map record. To read the JSON this way, will I have to define the input splits myself and use a custom record reader? Please find a sample of the JSON file below:
{
"type": "FeatureCollection",
"features": [
{ "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
"CLAZZ": 31.000000, "ROAD_TYPE": 3.000000, "END_ID": 22278.000000,
"OSM_META": "", "REVERSE_LE": 128.579933, "X1": 77.542660, "OSM_SOURCE":
1524649946.000000, "COST": 0.003129, "OSM_TARGET": 529780893.000000, "X2":
77.542832, "Y2": 12.990992, "CONGESTED_": 138.579933, "Y1": 12.989879,
"REVERSE_CO": 0.003129, "CONGESTION": 10.000000, "OSM_ID": 38033028.000000,
"START_ID": 34570.000000, "KM": 0.000000, "LENGTH": 128.579933,
"REVERSE__1": 138.579933, "SPEED_IN_K": 40.000000, "ROW_FLAG": "F" },
"geometry": { "type": "LineString", "coordinates": [ [ 8632009.414824,
1458576.029252 ], [ 8632012.876860, 1458598.957830 ], [ 8632028.595172,
1458703.170565 ] ] } }
,
{ "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
"CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID": 33451.000000,
"OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE":
1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, "X2":
77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 12.993107,
"REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID":
138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH":
217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, "ROW_FLAG":
"F" }, "geometry": { "type": "LineString", "coordinates": [ [
8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }
]
}
Also, in what manner does Hadoop generally split text files into blocks? Line by line? Can this be customized? If not, can a record be read across 2 blocks? For example, each feature in the JSON spans multiple lines. Could it happen that one line of a feature is placed in a block on one machine while the rest of its lines end up in a block on another machine?
--
Thanks & Regards,
Sugandha Naolekar
Re: Reading a file in a customized way
Posted by sudhakara st <su...@gmail.com>.
Use WholeFileInputFormat/WholeFileRecordReader (The Hadoop Definitive Guide by Tom White, page 240) to pass the file name as the key and the entire contents of the file as the value to the mapper.
Before getting into this, it is better to read about the HDFS architecture and the MapReduce flow:
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
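The whole-file pattern from the book can be sketched roughly as follows (new-API class and method names; an illustrative reconstruction, not the book's exact listing — the book's version emits NullWritable as the key, and the mapper can recover the file name from the input split if it needs it):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Treats each whole file as a single record: the mapper receives the
// complete file contents as one value, so the JSON is never cut mid-feature.
public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // never split: one mapper sees the whole file
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new WholeFileRecordReader();
  }
}

class WholeFileRecordReader
    extends RecordReader<NullWritable, BytesWritable> {

  private FileSplit fileSplit;
  private Configuration conf;
  private final BytesWritable value = new BytesWritable();
  private boolean processed = false;

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) {
    this.fileSplit = (FileSplit) split;
    this.conf = context.getConfiguration();
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    if (processed) {
      return false; // exactly one record per file
    }
    byte[] contents = new byte[(int) fileSplit.getLength()];
    Path file = fileSplit.getPath();
    FileSystem fs = file.getFileSystem(conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(file);
      IOUtils.readFully(in, contents, 0, contents.length);
      value.set(contents, 0, contents.length);
    } finally {
      IOUtils.closeStream(in);
    }
    processed = true;
    return true;
  }

  @Override
  public NullWritable getCurrentKey() { return NullWritable.get(); }

  @Override
  public BytesWritable getCurrentValue() { return value; }

  @Override
  public float getProgress() { return processed ? 1.0f : 0.0f; }

  @Override
  public void close() { /* stream already closed in nextKeyValue() */ }
}
```

With this format, the mapper itself can then parse the JSON and iterate over the "features" array, emitting one output record per feature.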
--
Regards,
...sudhakara
RE: Reading a file in a customized way
Posted by Shumin Guo <gs...@gmail.com>.
You can extend FileInputFormat and override isSplitable() to return false. More info is in the Javadoc:
https://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html
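A minimal sketch of that approach, using the old org.apache.hadoop.mapred API that the linked Javadoc documents (TextInputFormat is assumed here only as a convenient concrete base class; any FileInputFormat subclass works the same way):

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Subclass an existing FileInputFormat and refuse to split files, so each
// input file is read start-to-finish by a single mapper and a multi-line
// record can never be cut at a block boundary.
public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}
```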
Shumin
RE: Reading a file in a customized way
Posted by java8964 <ja...@hotmail.com>.
See my reply to another email today for a similar question:
"RE: Can the file storage in HDFS be customized?"
Thanks,
Yong
> the file from 2 blocks ? e,g; Each feature as seen in the json is a
> combination of multiple lines. Now, can there be a possibility where, the
> one line of the feature tag is placed in one block of one machine and the rest of
> the lines in other machine's block?
>
> --
> Thanks & Regards,
> Sugandha Naolekar
>
>
>
>
--
Regards,
...sudhakara
RE: Reading a file in a customized way
Posted by java8964 <ja...@hotmail.com>.
See my reply to another email today about a similar question:
"RE: Can the file storage in HDFS be customized?"
Thanks,
Yong
From: sugandha.n87@gmail.com
Date: Tue, 25 Feb 2014 11:40:13 +0530
Subject: Reading a file in a customized way
To: user@hadoop.apache.org
Hello,
Irrespective of the file blocks placed in HDFS, I
want my map() to be called/invoked in a customized manner. For. eg. I
want to process a huge JSON File(single file). Now this file is definitely less than the default block size(128 MB). Thus, ideally, only one mapper will be called. Means, the map task will be called only once right? But, I want my map
function to process every feature of this json file. Thus, every feature
task will be the map task. Thus, to read this json, will I have to get
the inputsplits and use custom record reader? Please find the sample of the json file below:
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
"CLAZZ": 31.000000, "ROAD_TYPE": 3.000000, "END_ID": 22278.000000,
"OSM_META": "", "REVERSE_LE": 128.579933, "X1": 77.542660, "OSM_SOURCE":
1524649946.000000, "COST": 0.003129, "OSM_TARGET": 529780893.000000,
"X2": 77.542832, "Y2": 12.990992, "CONGESTED_": 138.579933, "Y1":
12.989879, "REVERSE_CO": 0.003129, "CONGESTION": 10.000000, "OSM_ID":
38033028.000000, "START_ID": 34570.000000, "KM": 0.000000, "LENGTH":
128.579933, "REVERSE__1": 138.579933, "SPEED_IN_K": 40.000000,
"ROW_FLAG": "F" }, "geometry": { "type": "LineString", "coordinates": [ [
8632009.414824, 1458576.029252 ], [ 8632012.876860, 1458598.957830 ], [
8632028.595172, 1458703.170565 ] ] } }
,
{ "type": "Feature",
"properties": { "OSM_NAME": "", "FLAGS": 3.000000, "CLAZZ": 42.000000,
"ROAD_TYPE": 3.000000, "END_ID": 33451.000000, "OSM_META": "",
"REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE":
1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000,
"X2": 77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1":
12.993107, "REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID":
138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH":
217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000,
"ROW_FLAG": "F" }, "geometry": { "type": "LineString", "coordinates": [ [
8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] }
}
]
}
Also, generally the text files are split and placed in blocks in what manner by hadoop? Line by line? Can this be customized? If not, can we read the file from 2 blocks? E.g., each feature as seen in the json is a combination of multiple lines. Now, can there be a possibility where one line of the feature tag is placed in one block of one machine and the rest of the lines in another machine's block?
--
Thanks & Regards,
Sugandha Naolekar