Posted to hdfs-user@hadoop.apache.org by Jay Vyas <ja...@gmail.com> on 2012/12/06 22:37:43 UTC

DFS and the RecordReader

Hi guys:

Where and how does Hadoop's RecordReader decide whether or not it needs
to copy a file to local disk?

Clearly, since the InputSplit (which has metadata about file inputs) is
the input to the RecordReader, the RecordReader would have to implement
some kind of smart decision making... I'm looking for something like:

// Pseudocode
if (!file.existsLocally())
    copyFileToDisk(file.getPath());

return new InputStream(file);

I've looked here:

http://grepcode.com/file/repo1.maven.org/maven2/org.jvnet.hudson.hadoop/hadoop-core/0.19.1-hudson-2/org/apache/hadoop/hdfs/DFSClient.java#DFSClient.create%28java.lang.String%2Corg.apache.hadoop.fs.permission.FsPermission%2Cboolean%2Cshort%2Clong%2Corg.apache.hadoop.util.Progressable%2Cint%29

but don't see anything.

-- 
Jay Vyas
http://jayunit100.blogspot.com

Re: DFS and the RecordReader

Posted by Jay Vyas <ja...@gmail.com>.
This is an excellent explanation, thanks Harsh. I'll diagram this out with some details and share.

> TextInputFormat uses LineRecordReader: [1] - Line 52
> LineRecordReader has the calls you look for and wraps over a
> LineReader implementation, to take care of reading lines over block
> boundaries: [2] - Line 88
> LineReader has all the functional code to make it work for anyone
> reading lines off of text files: [3]
> 
> [1] - http://svn.apache.org/viewvc/hadoop/common/tags/release-2.0.2-alpha/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java?view=markup
> [2] - http://svn.apache.org/viewvc/hadoop/common/tags/release-2.0.2-alpha/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java?view=markup
> [3] - http://svn.apache.org/viewvc/hadoop/common/tags/release-2.0.2-alpha/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java?view=markup
> 
> On Fri, Dec 7, 2012 at 4:17 AM, Jay Vyas <ja...@gmail.com> wrote:
>> Hmm... so when a record reader calls fs.open(...), I guess I'm looking for
>> an example of how the input stream is created...?
> 
> 
> 
> -- 
> Harsh J

Re: DFS and the RecordReader

Posted by Harsh J <ha...@cloudera.com>.
Ah ok, understood what you seem to be looking for.

Let's follow the simple LineReader implementation in that case.

TextInputFormat uses LineRecordReader: [1] - Line 52
LineRecordReader has the calls you look for and wraps over a
LineReader implementation, to take care of reading lines over block
boundaries: [2] - Line 88
LineReader has all the functional code to make it work for anyone
reading lines off of text files: [3]

[1] - http://svn.apache.org/viewvc/hadoop/common/tags/release-2.0.2-alpha/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java?view=markup
[2] - http://svn.apache.org/viewvc/hadoop/common/tags/release-2.0.2-alpha/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java?view=markup
[3] - http://svn.apache.org/viewvc/hadoop/common/tags/release-2.0.2-alpha/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java?view=markup
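For concreteness, the open-and-seek flow inside a LineRecordReader-style
initialize() looks roughly like the sketch below. This is an illustrative
simplification of the 2.0.2-alpha sources linked above (compression and
buffer handling omitted), not the verbatim Hadoop code:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

public class SketchLineRecordReader {
  private LineReader in;
  private long start, pos, end;

  public void initialize(InputSplit genericSplit, TaskAttemptContext context)
      throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration conf = context.getConfiguration();
    start = split.getStart();
    end = start + split.getLength();
    Path file = split.getPath();

    // fs.open() returns an FSDataInputStream; for an hdfs:// path this
    // wraps a DFSInputStream that fetches bytes from datanodes (local or
    // remote) on demand as read() is called. Nothing is staged on disk.
    FileSystem fs = file.getFileSystem(conf);
    FSDataInputStream fileIn = fs.open(file);
    fileIn.seek(start);                 // jump to this split's byte offset
    in = new LineReader(fileIn, conf);  // LineReader does the line framing
    pos = start;
  }
}

Note that the real LineRecordReader also discards the first (usually
partial) line of any split that does not begin at offset 0, and reads past
its own end offset to finish the last line; that is how lines spanning
block boundaries get handled.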

On Fri, Dec 7, 2012 at 4:17 AM, Jay Vyas <ja...@gmail.com> wrote:
> Hmm... so when a record reader calls fs.open(...), I guess I'm looking for
> an example of how the input stream is created...?



-- 
Harsh J

Re: DFS and the RecordReader

Posted by Jay Vyas <ja...@gmail.com>.
Hmm... so when a record reader calls fs.open(...), I guess I'm looking for
an example of how the input stream is created...?

Re: DFS and the RecordReader

Posted by Harsh J <ha...@cloudera.com>.
Hi,

Not sure what you're talking about. A RecordReader, or for that matter
any DFS input stream, does not pull data to local disk before reading it.
Non-data-local reads are streamed over the network, just as data-local
reads are streamed from local disk.

There is no such logic as the one you seek.
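To make that concrete, reading a file off HDFS is just opening a stream
and reading from it. A minimal sketch (the namenode URI and file path
below are hypothetical, for illustration only):

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamedRead {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Hypothetical namenode address and file path.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
    try (FSDataInputStream in = fs.open(new Path("/data/input.txt"))) {
      byte[] buf = new byte[4096];
      long total = 0;
      int n;
      while ((n = in.read(buf)) != -1) {
        total += n;  // bytes stream in from datanodes as read() is called
      }
      System.out.println("Read " + total + " bytes with no local copy made");
    }
  }
}

The same FileSystem/FSDataInputStream API is what a RecordReader uses;
whether a given block is served by a local or a remote datanode is decided
inside the DFS client, not by the reader.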

On Fri, Dec 7, 2012 at 3:07 AM, Jay Vyas <ja...@gmail.com> wrote:
> Hi guys:
>
> Where and how does Hadoop's RecordReader decide whether or not it needs to
> copy a file to local disk?
>
> Clearly, since the InputSplit (which has metadata about file inputs) is the
> input to the RecordReader, the RecordReader would have to implement some
> kind of smart decision making... I'm looking for something like:
>
> // Pseudocode
> if (!file.existsLocally())
>    copyFileToDisk(file.getPath());
>
> return new InputStream(file);
>
> I've looked here:
>
> http://grepcode.com/file/repo1.maven.org/maven2/org.jvnet.hudson.hadoop/hadoop-core/0.19.1-hudson-2/org/apache/hadoop/hdfs/DFSClient.java#DFSClient.create%28java.lang.String%2Corg.apache.hadoop.fs.permission.FsPermission%2Cboolean%2Cshort%2Clong%2Corg.apache.hadoop.util.Progressable%2Cint%29
>
> but don't see anything.
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com



-- 
Harsh J
