Posted to common-dev@hadoop.apache.org by Jaliya Ekanayake <jn...@gmail.com> on 2009/08/20 19:29:55 UTC
Re: Using Hadoop with executables and binary data
Hi Stefan,
I am sorry for the late reply; somehow the response email slipped past me.
Could you explain a bit about how to use Hadoop Streaming with binary data
formats?
I can find explanations for using it with text data formats, but not for
binary files.
Thank you,
Jaliya
Stefan Podkowinski
Mon, 10 Aug 2009 01:40:05 -0700
Jaliya,
did you consider Hadoop Streaming for your case?
http://wiki.apache.org/hadoop/HadoopStreaming
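[For the file-per-task pattern described later in this thread, a streaming job can stay text-based even when the payloads are binary: each input line is just a URI, and the mapper shells out to the executable. A minimal sketch of such a mapper — `prog.exe` is the unmodifiable executable from the thread, and the `uri + ".out"` output convention is an assumption of this sketch, not part of the original setup:]

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch: each stdin line is a URI to
# one binary input file; the mapper runs the executable on it and
# emits "uri<TAB>status" so failures are visible in the job output.
import subprocess
import sys

def run_one(uri):
    # prog.exe is the unmodifiable executable from the thread; the
    # output-path convention (uri + ".out") is an assumption here.
    out = uri + ".out"
    rc = subprocess.call(["prog.exe", uri, out])
    return "ok" if rc == 0 else "failed:%d" % rc

def main(lines, runner=run_one):
    # Separated from stdin handling so the logic is testable.
    results = []
    for line in lines:
        uri = line.strip()
        if uri:
            results.append("%s\t%s" % (uri, runner(uri)))
    return results

if __name__ == "__main__":
    for rec in main(sys.stdin):
        print(rec)
```

[The key/value text that streaming passes around never contains the binary data itself, only the URI, which sidesteps the newline/tab escaping problems that raw binary would cause.]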
On Wed, Jul 29, 2009 at 8:35 AM, Jaliya Ekanayake <je...@cs.indiana.edu> wrote:
> Dear Hadoop devs,
>
>
>
> Please help me to figure out a way to program the following problem using
> Hadoop.
>
> I have a program that I need to invoke in parallel using Hadoop. The
> program takes a binary input file and produces a binary output file:
>
>
>
> Input.bin -> prog.exe -> output.bin
>
>
>
> The input data set is about 1TB in size. Each input data file is about
> 33MB in size, so there are about 31,000 files.
>
> Each output binary file is about 9KB in size.
>
>
>
> I have implemented this program using Hadoop in the following way.
>
>
>
> I keep the input data in a shared parallel file system (Lustre File
> System).
>
> Then, I collect the input file names and write them to a collection of
> files in HDFS (let's say hdfs_input_0.txt ..).
>
> Each hdfs_input file contains roughly an equal number of URIs to the
> original input files.
>
> The map task simply takes a string value, which is a URI to an original
> input data file, and executes the program as an external program.
>
> The output of the program is also written to the shared file system
> (Lustre File System).
>
>
>
> The problem with this approach is that I am not utilizing the true
> benefit of MapReduce: the use of local disks.
>
> Could you please suggest a way to use local disks for the above problem?
>
>
>
> I thought of the following way, but would like to verify with you whether
> there is a better way.
>
>
>
> 1. Upload the original data files to HDFS.
>
> 2. In the map task, read the data file as a binary object.
>
> 3. Save it to the local file system.
>
> 4. Call the executable.
>
> 5. Push the output from the local file system to HDFS.
>
>
>
> Any suggestion is greatly appreciated.
>
>
> Thank you,
>
> Jaliya
>
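[The five steps above can be sketched as one helper that a map task would call per input file. `hadoop fs -get` and `hadoop fs -put` are the standard HDFS shell commands, but the path layout, output naming, and `prog.exe` are placeholders taken from this thread, not a tested configuration:]

```python
# Sketch of the stage-run-push cycle for one input file, assuming the
# `hadoop fs` CLI is on the PATH. Paths and prog.exe are placeholders.
import os
import subprocess
import tempfile

def process_one(hdfs_input, hdfs_output_dir, run=subprocess.check_call):
    workdir = tempfile.mkdtemp(prefix="maptask-")
    local_in = os.path.join(workdir, "input.bin")
    local_out = os.path.join(workdir, "output.bin")
    # Steps 1-3: pull the binary input from HDFS to the local disk.
    run(["hadoop", "fs", "-get", hdfs_input, local_in])
    # Step 4: run the unmodifiable executable against the local copy.
    run(["prog.exe", local_in, local_out])
    # Step 5: push the (small, ~9KB) result back to HDFS.
    dest = hdfs_output_dir + "/" + os.path.basename(hdfs_input) + ".out"
    run(["hadoop", "fs", "-put", local_out, dest])
    return dest
```

[Injecting `run` keeps the helper testable without a live cluster; in a real map task the `subprocess.check_call` default would raise on any non-zero exit, failing the task so Hadoop can retry it.]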
RE: Using Hadoop with executables and binary data
Posted by Jaliya Ekanayake <jn...@gmail.com>.
Thanks for the quick reply.
I looked at it, but I still could not figure out how to use HDFS to store
the input data (binary) and call an executable.
Please note that I cannot modify the executable.
Maybe I am asking a dumb question, but could you please explain a bit
about how to handle the scenario I have described?
Thanks,
Jaliya
-----Original Message-----
From: Aaron Kimball [mailto:aaron@cloudera.com]
Sent: Thursday, August 20, 2009 3:00 PM
To: common-dev@hadoop.apache.org
Cc: core-dev@hadoop.apache.org; core-user@hadoop.apache.org;
spodxx@gmail.com
Subject: Re: Using Hadoop with executables and binary data
Look into "typed bytes":
http://dumbotics.com/2009/02/24/hadoop-1722-and-typed-bytes/
On Thu, Aug 20, 2009 at 10:29 AM, Jaliya Ekanayake <jn...@gmail.com> wrote:
> Hi Stefan,
>
>
>
> I am sorry, for the late reply. Somehow the response email has slipped my
> eyes.
>
> Could you explain a bit on how to use Hadoop streaming with binary data
> formats.
>
> I can see, explanations on using it with text data formats, but not for
> binary files.
>
>
> Thank you,
>
> Jaliya
Re: Using Hadoop with executables and binary data
Posted by Aaron Kimball <aa...@cloudera.com>.
Look into "typed bytes":
http://dumbotics.com/2009/02/24/hadoop-1722-and-typed-bytes/
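[Typed bytes (HADOOP-1722) frames every value as a one-byte type code followed by the payload; for raw binary data (type code 0) the payload is a 4-byte big-endian length plus the bytes themselves, which is what lets a streaming job carry whole binary records without newline or tab escaping. A small sketch of that framing — worth verifying against the typed-bytes appendix of the Hadoop version in use:]

```python
import struct

TYPE_BYTES = 0  # typed-bytes type code for a raw byte sequence

def encode_bytes(payload):
    # One-byte type code, 4-byte big-endian length, then the raw bytes.
    return struct.pack(">Bi", TYPE_BYTES, len(payload)) + payload

def decode_bytes(buf):
    # Inverse of encode_bytes; returns the raw payload.
    code, length = struct.unpack(">Bi", buf[:5])
    assert code == TYPE_BYTES, "not a bytes record"
    return buf[5:5 + length]
```

[With `-io typedbytes` the streaming framework does this framing itself; the sketch just shows why arbitrary binary survives the trip, since length-prefixed records need no delimiter characters.]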
On Thu, Aug 20, 2009 at 10:29 AM, Jaliya Ekanayake <jn...@gmail.com> wrote:
> Hi Stefan,
>
>
>
> I am sorry, for the late reply. Somehow the response email has slipped my
> eyes.
>
> Could you explain a bit on how to use Hadoop streaming with binary data
> formats.
>
> I can see, explanations on using it with text data formats, but not for
> binary files.
>
>
> Thank you,
>
> Jaliya