Posted to hdfs-user@hadoop.apache.org by "Wheeler, Bill NPO" <bi...@intel.com> on 2012/12/03 17:22:53 UTC

Using Hadoop infrastructure with input streams instead of key/value input

I am trying to use Hadoop's partitioning/scheduling/storage infrastructure to process many HDFS files of data in parallel (1 HDFS file per map task), but in a way that does not naturally fit into the key/value pair input framework.  Specifically, my application's "map" function equivalent does not want to receive formatted data as key/value pairs; instead, I'd like to receive a Hadoop input stream object for my map processing so that I can read bytes out in many different ways, with much greater flexibility and efficiency than I'd get with the key/value pair input constraint.  The input stream would handle the complexity of fetching local and remote HDFS data blocks as needed on my behalf.  The result of the map processing would then conform to key/value pair map outputs and be subsequently processed by traditional reduce code.

I'm guessing that I am not the only person who would like to read HDFS file input directly, as this capability could open up new types of Hadoop use models.  Is there any support for acquiring input streams directly in Java map code?  And is there any support for doing the same in C++ map code, a la Pipes?

For added context, my application is in the video analytics space, requiring me to read video files.  I have implemented a solution, but it is a hack with less than ideal characteristics: I have RecordReader code which simply passes the HDFS filename through in the key field of my key/value input.  I'm using Pipes to implement the map function in C++ code.  The C++ map code then performs a system call, "hadoop fs -copyToLocal hdfs_filename local_filename", to put the entire HDFS file on the datanode's local file system, where it is readable by C++ IO calls.  I then simply open this file and process it.  It would be much better to avoid all the extra IO associated with "copyToLocal" and instead somehow receive an input stream object from which to read directly from HDFS.
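
For reference, here is a minimal sketch of the kind of InputFormat/RecordReader arrangement described above (class and field choices are illustrative only, not my actual code); it treats each file as unsplittable and emits a single record carrying the file's path:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // One whole file per map task: never split the file, and emit exactly one
    // record whose value is the file's HDFS path (the path could equally be
    // carried in the key field, as in my current code).
    public class FilePathInputFormat extends FileInputFormat<NullWritable, Text> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;                       // one map task per file
        }

        @Override
        public RecordReader<NullWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new RecordReader<NullWritable, Text>() {
                private Text path;
                private boolean emitted = false;

                @Override
                public void initialize(InputSplit s, TaskAttemptContext c) {
                    path = new Text(((FileSplit) s).getPath().toString());
                }

                @Override
                public boolean nextKeyValue() {
                    if (emitted) return false;  // only one record per file
                    emitted = true;
                    return true;
                }

                @Override public NullWritable getCurrentKey()  { return NullWritable.get(); }
                @Override public Text getCurrentValue()        { return path; }
                @Override public float getProgress()           { return emitted ? 1.0f : 0.0f; }
                @Override public void close() throws IOException { }
            };
        }
    }

Setting this as the job's input format (job.setInputFormatClass(FilePathInputFormat.class)) should then feed exactly one path per map invocation.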

Any way of doing this in a more elegant fashion?

Thanks,
Bill

Re: Using Hadoop infrastructure with input streams instead of key/value input

Posted by Liuva Rosabal Valdes <lr...@estudiantes.uci.cu>.
Please stop sending me this mail; I am no longer interested in the subject.



10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS...
CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION

http://www.uci.cu
http://www.facebook.com/universidad.uci
http://www.flickr.com/photos/universidad_uci

Re: Using Hadoop infrastructure with input streams instead of key/value input

Posted by Steve Lewis <lo...@gmail.com>.
I presume a single file is handled by one and only one mapper. In that case,
you can pass the path as a string and do something like this:

       public void map(Object key, Text value, Context context)
               throws IOException, InterruptedException {
           // The value carries the HDFS path that the record reader passed through.
           // (Uses org.apache.hadoop.fs.FileSystem and org.apache.hadoop.fs.Path.)
           String hdfsPath = value.toString();
           final FileSystem fs = FileSystem.get(context.getConfiguration());
           Path src = new Path(hdfsPath);
           InputStream is = null;
           try {
               is = fs.open(src);   // stream directly from HDFS
               // ... handle the stream ...
           } finally {
               if (is != null)
                   is.close();
           }
       }

      You might try streaming to a C program
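
Also worth noting: fs.open() returns an org.apache.hadoop.fs.FSDataInputStream, which supports seek() and positioned reads, so the mapper can jump around inside a large video file rather than reading it front to back.  A rough sketch, assuming the fs and src variables from the snippet above (the offsets are just illustrative):

    FSDataInputStream in = fs.open(src);
    byte[] header = new byte[4096];
    in.readFully(0L, header);                  // positioned read of the first 4 KB
    in.seek(1000000L);                         // jump to an arbitrary byte offset
    byte[] chunk = new byte[64 * 1024];
    int n = in.read(chunk, 0, chunk.length);   // sequential read from that position
    in.close();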







-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: Using Hadoop infrastructure with input streams instead of key/value input

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.
Hi,

I have not tried this myself before, but would libhdfs help?

http://hadoop.apache.org/docs/stable/libhdfs.html

Thanks
Hemanth


