Posted to common-user@hadoop.apache.org by Jason Wang <ja...@gmail.com> on 2012/10/18 06:02:43 UTC

hadoop streaming with custom RecordReader class

Hi all,
I'm experimenting with hadoop streaming on build 1.0.3.

To give background info, I'm streaming a text file into a mapper written in
C.  With the default settings, streaming uses TextInputFormat, which
creates one record from each line.  The problem I am having is that I need
record boundaries to fall every 4 lines.  When the input is split up for
the mappers, I end up with partial records at the split boundaries because
of this.  To address this, my approach was to write a new RecordReader
class in Java that is almost identical to LineRecordReader, but with a
modified next() method that reads 4 lines instead of one.
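
A minimal sketch of such a reader, written against the old
org.apache.hadoop.mapred API that streaming uses, might look like the
following (names are illustrative, and aligning the 4-line groups across
split boundaries is not handled here):

package mypackage;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;

// Delegates to LineRecordReader and glues 4 consecutive lines into one record.
public class NLineRecordReader implements RecordReader<LongWritable, Text> {
  private final LineRecordReader lines;
  private final Text line = new Text();

  public NLineRecordReader(Configuration job, FileSplit split) throws IOException {
    lines = new LineRecordReader(job, split);
  }

  public boolean next(LongWritable key, Text value) throws IOException {
    // key ends up holding the offset of the last line read into this record
    StringBuilder record = new StringBuilder();
    for (int i = 0; i < 4; i++) {
      if (!lines.next(key, line)) {
        if (record.length() == 0) return false; // input exhausted
        break;                                  // emit a short final record
      }
      if (i > 0) record.append('\n');
      record.append(line.toString());
    }
    value.set(record.toString());
    return true;
  }

  public LongWritable createKey() { return lines.createKey(); }
  public Text createValue() { return lines.createValue(); }
  public long getPos() throws IOException { return lines.getPos(); }
  public float getProgress() throws IOException { return lines.getProgress(); }
  public void close() throws IOException { lines.close(); }
}

One caveat: readers plugged in through streaming's -inputreader option are
expected to follow streaming's own conventions (the bundled example,
StreamXmlRecordReader, extends
org.apache.hadoop.streaming.StreamBaseRecordReader), which may be related to
the observation elsewhere in the thread that the specified -inputreader
class appears to be ignored.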

I then compiled the new class and created a jar.  I wanted to import this
at run time using the -libjars argument, like so:

hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar -libjars
NLineRecordReader.jar -files test_stream.sh -inputreader
mypackage.NLineRecordReader -input /Users/hadoop/test/test.txt -output
/Users/hadoop/test/output -mapper "test_stream.sh" -reducer NONE

Unfortunately, I keep getting the following error:
-inputreader: class not found: mypackage.NLineRecordReader

My question is twofold.  Am I using the right approach to handle the 4-line
records with a custom RecordReader implementation?  And why isn't -libjars
working to make my class available to hadoop streaming at runtime?

Thanks,
Jason

Re: hadoop streaming with custom RecordReader class

Posted by Jason Wang <ja...@gmail.com>.
Thanks a bunch Harsh, that was my problem.  It was strange because even with
no package specified, it was not able to find the class.  It's working
now, though it seems that hadoop streaming ignores the specified
-inputreader class completely, but that's a different issue.

On Thu, Oct 18, 2012 at 12:58 AM, Harsh J <ha...@cloudera.com> wrote:

> Also, consider using Maven for these kinda development, helps build
> sane jars automatically :)
>
> On Thu, Oct 18, 2012 at 11:28 AM, Harsh J <ha...@cloudera.com> wrote:
> > (3)'s your problem for sure.
> >
> > Try this:
> >
> > mkdir mypackage
> > mv <class file> mypackage/
> > jar cvf NLineRecordReader.jar mypackage
> > [Use this jar]
> >
> > On Thu, Oct 18, 2012 at 10:54 AM, Jason Wang <ja...@gmail.com> wrote:
> >> 1. I did try using NLineInputFormat, but this causes the
> >> "stream.map.input.ignoreKey" to no longer work.  As per the streaming
> >> documentation:
> >>
> >> "The configuration parameter is valid only if stream.map.input.writer.class
> >> is org.apache.hadoop.streaming.io.TextInputWriter.class."
> >>
> >> My mapper prefers the streaming stdin to not have the key as part of the
> >> input.  I could obviously parse that out in the mapper, but the mapper
> >> belongs to a 3rd party. This is why I tried to do the RecordReader route.
> >>
> >> 2. Yes - I did export the classpath before running.
> >>
> >> 3. This may be the problem:
> >>
> >> bash-3.2$ jar -tf NLineRecordReader.jar
> >> META-INF/
> >> META-INF/MANIFEST.MF
> >> NLineRecordReader.class
> >>
> >> I have specified "package mypackage;" at the top of the java file though.
> >> Then compiled using "javac" and then "jar cf".
> >>
> >> 4. The class is public.
> >>
> >> On Wed, Oct 17, 2012 at 11:53 PM, Harsh J <ha...@cloudera.com> wrote:
> >>>
> >>> Hi Jason,
> >>>
> >>> A few questions (in order):
> >>>
> >>> 1. Does Hadoop's own NLineInputFormat not suffice?
> >>>
> >>> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
> >>>
> >>> 2. Do you make sure to pass your jar into the front-end too?
> >>>
> >>> $ export HADOOP_CLASSPATH=/path/to/your/jar
> >>> $ command…
> >>>
> >>> 3. Does jar -tf <yourjar> carry a proper mypackage.NLineRecordReader?
> >>>
> >>> 4. Is your class marked public?
> >>>
> >>> On Thu, Oct 18, 2012 at 9:32 AM, Jason Wang <ja...@gmail.com> wrote:
> >>> > Hi all,
> >>> > I'm experimenting with hadoop streaming on build 1.0.3.
> >>> >
> >>> > To give background info, i'm streaming a text file into mapper written in C.
> >>> > Using the default settings, streaming uses TextInputFormat which creates one
> >>> > record from each line.  The problem I am having is that I need record
> >>> > boundaries to be every 4 lines.  When the splitter breaks up the input into
> >>> > the mappers, I have partial records on the boundaries due to this.  To
> >>> > address this, my approach was to write a new RecordReader class almost in
> >>> > java that is almost identical to LineRecordReader, but with a modified
> >>> > next() method that reads 4 lines instead of one.
> >>> >
> >>> > I then compiled the new class and created a jar.  I wanted to import this at
> >>> > run time using the -libjars argument, like such:
> >>> >
> >>> > hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar -libjars
> >>> > NLineRecordReader.jar -files test_stream.sh -inputreader
> >>> > mypackage.NLineRecordReader -input /Users/hadoop/test/test.txt -output
> >>> > /Users/hadoop/test/output -mapper "test_stream.sh" -reducer NONE
> >>> >
> >>> > Unfortunately, I keep getting the following error:
> >>> > -inputreader: class not found: mypackage.NLineRecordReader
> >>> >
> >>> > My question is 2 fold.  Am I using the right approach to handle the 4 line
> >>> > records with the custom RecordReader implementation?  And why isn't -libjars
> >>> > working to include my class to hadoop streaming at runtime?
> >>> >
> >>> > Thanks,
> >>> > Jason
> >>>
> >>> --
> >>> Harsh J
> >
> > --
> > Harsh J
>
> --
> Harsh J

Re: hadoop streaming with custom RecordReader class

Posted by Harsh J <ha...@cloudera.com>.
Also, consider using Maven for this kinda development, it helps build
sane jars automatically :)
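
A minimal sketch of that setup, assuming the standard Maven conventions: the
source goes at src/main/java/mypackage/NLineRecordReader.java, hadoop-core is
declared as a dependency in the pom (for 1.0.3 the Maven Central artifact is
org.apache.hadoop:hadoop-core:1.0.3), and then a plain

mvn package

produces a jar under target/ with the package directories laid out correctly.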

On Thu, Oct 18, 2012 at 11:28 AM, Harsh J <ha...@cloudera.com> wrote:
> (3)'s your problem for sure.
>
> Try this:
>
> mkdir mypackage
> mv <class file> mypackage/
> jar cvf NLineRecordReader.jar mypackage
> [Use this jar]
>
> On Thu, Oct 18, 2012 at 10:54 AM, Jason Wang <ja...@gmail.com> wrote:
>> 1. I did try using NLineInputFormat, but this causes the
>> "stream.map.input.ignoreKey" to no longer work.  As per the streaming
>> documentation:
>>
>> "The configuration parameter is valid only if stream.map.input.writer.class
>> is org.apache.hadoop.streaming.io.TextInputWriter.class."
>>
>> My mapper prefers the streaming stdin to not have the key as part of the
>> input.  I could obviously parse that out in the mapper, but the mapper
>> belongs to a 3rd party. This is why I tried to do the RecordReader route.
>>
>> 2. Yes - I did export the classpath before running.
>>
>> 3. This may be the problem:
>>
>> bash-3.2$ jar -tf NLineRecordReader.jar
>> META-INF/
>> META-INF/MANIFEST.MF
>> NLineRecordReader.class
>>
>> I have specified "package mypackage;" at the top of the java file though.
>> Then compiled using "javac" and then "jar cf".
>>
>> 4. The class is public.
>>
>>
>>
>> On Wed, Oct 17, 2012 at 11:53 PM, Harsh J <ha...@cloudera.com> wrote:
>>>
>>> Hi Jason,
>>>
>>> A few questions (in order):
>>>
>>> 1. Does Hadoop's own NLineInputFormat not suffice?
>>>
>>> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>>>
>>> 2. Do you make sure to pass your jar into the front-end too?
>>>
>>> $ export HADOOP_CLASSPATH=/path/to/your/jar
>>> $ command…
>>>
>>> 3. Does jar -tf <yourjar> carry a proper mypackage.NLineRecordReader?
>>>
>>> 4. Is your class marked public?
>>>
>>> On Thu, Oct 18, 2012 at 9:32 AM, Jason Wang <ja...@gmail.com>
>>> wrote:
>>> > Hi all,
>>> > I'm experimenting with hadoop streaming on build 1.0.3.
>>> >
>>> > To give background info, i'm streaming a text file into mapper written
>>> > in C.
>>> > Using the default settings, streaming uses TextInputFormat which creates
>>> > one
>>> > record from each line.  The problem I am having is that I need record
>>> > boundaries to be every 4 lines.  When the splitter breaks up the input
>>> > into
>>> > the mappers, I have partial records on the boundaries due to this.  To
>>> > address this, my approach was to write a new RecordReader class almost
>>> > in
>>> > java that is almost identical to LineRecordReader, but with a modified
>>> > next() method that reads 4 lines instead of one.
>>> >
>>> > I then compiled the new class and created a jar.  I wanted to import
>>> > this at
>>> > run time using the -libjars argument, like such:
>>> >
>>> > hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar -libjars
>>> > NLineRecordReader.jar -files test_stream.sh -inputreader
>>> > mypackage.NLineRecordReader -input /Users/hadoop/test/test.txt -output
>>> > /Users/hadoop/test/output -mapper "test_stream.sh" -reducer NONE
>>> >
>>> > Unfortunately, I keep getting the following error:
>>> > -inputreader: class not found: mypackage.NLineRecordReader
>>> >
>>> > My question is 2 fold.  Am I using the right approach to handle the 4
>>> > line
>>> > records with the custom RecordReader implementation?  And why isn't
>>> > -libjars
>>> > working to include my class to hadoop streaming at runtime?
>>> >
>>> > Thanks,
>>> > Jason
>>>
>>>
>>>
>>> --
>>> Harsh J
>>
>>
>
>
>
> --
> Harsh J



-- 
Harsh J

Re: hadoop streaming with custom RecordReader class

Posted by Harsh J <ha...@cloudera.com>.
(3)'s your problem for sure.

Try this:

mkdir mypackage
mv <class file> mypackage/
jar cvf NLineRecordReader.jar mypackage
[Use this jar]
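
Equivalently, javac can lay out the package directory itself. A sketch,
assuming the source declares "package mypackage;" and Hadoop 1.0.3 lives at
$HADOOP_HOME:

mkdir classes
javac -classpath $HADOOP_HOME/hadoop-core-1.0.3.jar -d classes NLineRecordReader.java
jar cvf NLineRecordReader.jar -C classes .

The -d flag makes javac write classes/mypackage/NLineRecordReader.class, and
-C tells jar to preserve that layout inside the jar.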

On Thu, Oct 18, 2012 at 10:54 AM, Jason Wang <ja...@gmail.com> wrote:
> 1. I did try using NLineInputFormat, but this causes the
> "stream.map.input.ignoreKey" to no longer work.  As per the streaming
> documentation:
>
> "The configuration parameter is valid only if stream.map.input.writer.class
> is org.apache.hadoop.streaming.io.TextInputWriter.class."
>
> My mapper prefers the streaming stdin to not have the key as part of the
> input.  I could obviously parse that out in the mapper, but the mapper
> belongs to a 3rd party. This is why I tried to do the RecordReader route.
>
> 2. Yes - I did export the classpath before running.
>
> 3. This may be the problem:
>
> bash-3.2$ jar -tf NLineRecordReader.jar
> META-INF/
> META-INF/MANIFEST.MF
> NLineRecordReader.class
>
> I have specified "package mypackage;" at the top of the java file though.
> Then compiled using "javac" and then "jar cf".
>
> 4. The class is public.
>
>
>
> On Wed, Oct 17, 2012 at 11:53 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>> Hi Jason,
>>
>> A few questions (in order):
>>
>> 1. Does Hadoop's own NLineInputFormat not suffice?
>>
>> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>>
>> 2. Do you make sure to pass your jar into the front-end too?
>>
>> $ export HADOOP_CLASSPATH=/path/to/your/jar
>> $ command…
>>
>> 3. Does jar -tf <yourjar> carry a proper mypackage.NLineRecordReader?
>>
>> 4. Is your class marked public?
>>
>> On Thu, Oct 18, 2012 at 9:32 AM, Jason Wang <ja...@gmail.com>
>> wrote:
>> > Hi all,
>> > I'm experimenting with hadoop streaming on build 1.0.3.
>> >
>> > To give background info, i'm streaming a text file into mapper written
>> > in C.
>> > Using the default settings, streaming uses TextInputFormat which creates
>> > one
>> > record from each line.  The problem I am having is that I need record
>> > boundaries to be every 4 lines.  When the splitter breaks up the input
>> > into
>> > the mappers, I have partial records on the boundaries due to this.  To
>> > address this, my approach was to write a new RecordReader class almost
>> > in
>> > java that is almost identical to LineRecordReader, but with a modified
>> > next() method that reads 4 lines instead of one.
>> >
>> > I then compiled the new class and created a jar.  I wanted to import
>> > this at
>> > run time using the -libjars argument, like such:
>> >
>> > hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar -libjars
>> > NLineRecordReader.jar -files test_stream.sh -inputreader
>> > mypackage.NLineRecordReader -input /Users/hadoop/test/test.txt -output
>> > /Users/hadoop/test/output -mapper "test_stream.sh" -reducer NONE
>> >
>> > Unfortunately, I keep getting the following error:
>> > -inputreader: class not found: mypackage.NLineRecordReader
>> >
>> > My question is 2 fold.  Am I using the right approach to handle the 4
>> > line
>> > records with the custom RecordReader implementation?  And why isn't
>> > -libjars
>> > working to include my class to hadoop streaming at runtime?
>> >
>> > Thanks,
>> > Jason
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J

Re: hadoop streaming with custom RecordReader class

Posted by Jason Wang <ja...@gmail.com>.
1. I did try using NLineInputFormat, but that causes
"stream.map.input.ignoreKey" to stop working.  As per the streaming
documentation:

"The configuration parameter is valid only if stream.map.input.writer.class
is org.apache.hadoop.streaming.io.TextInputWriter.class."

My mapper needs the streaming stdin not to include the key as part of the
input.  I could obviously strip it out in the mapper, but the mapper belongs
to a 3rd party.  This is why I went the RecordReader route.

2. Yes - I did export the classpath before running.

3. This may be the problem:

bash-3.2$ jar -tf NLineRecordReader.jar
META-INF/
META-INF/MANIFEST.MF
NLineRecordReader.class

I did specify "package mypackage;" at the top of the java file, though, and
then compiled with "javac" and "jar cf" (see the expected listing after
point 4 below).

4. The class is public.
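
For a class declared in package mypackage, the listing would need to show
the package directory instead, i.e. something like:

jar -tf NLineRecordReader.jar
META-INF/
META-INF/MANIFEST.MF
mypackage/
mypackage/NLineRecordReader.class

A bare NLineRecordReader.class at the jar root means the class was jarred
without its package path, so mypackage.NLineRecordReader cannot be resolved.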



On Wed, Oct 17, 2012 at 11:53 PM, Harsh J <ha...@cloudera.com> wrote:

> Hi Jason,
>
> A few questions (in order):
>
> 1. Does Hadoop's own NLineInputFormat not suffice?
>
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>
> 2. Do you make sure to pass your jar into the front-end too?
>
> $ export HADOOP_CLASSPATH=/path/to/your/jar
> $ command…
>
> 3. Does jar -tf <yourjar> carry a proper mypackage.NLineRecordReader?
>
> 4. Is your class marked public?
>
> On Thu, Oct 18, 2012 at 9:32 AM, Jason Wang <ja...@gmail.com> wrote:
> > Hi all,
> > I'm experimenting with hadoop streaming on build 1.0.3.
> >
> > To give background info, i'm streaming a text file into mapper written in C.
> > Using the default settings, streaming uses TextInputFormat which creates one
> > record from each line.  The problem I am having is that I need record
> > boundaries to be every 4 lines.  When the splitter breaks up the input into
> > the mappers, I have partial records on the boundaries due to this.  To
> > address this, my approach was to write a new RecordReader class almost in
> > java that is almost identical to LineRecordReader, but with a modified
> > next() method that reads 4 lines instead of one.
> >
> > I then compiled the new class and created a jar.  I wanted to import this at
> > run time using the -libjars argument, like such:
> >
> > hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar -libjars
> > NLineRecordReader.jar -files test_stream.sh -inputreader
> > mypackage.NLineRecordReader -input /Users/hadoop/test/test.txt -output
> > /Users/hadoop/test/output -mapper "test_stream.sh" -reducer NONE
> >
> > Unfortunately, I keep getting the following error:
> > -inputreader: class not found: mypackage.NLineRecordReader
> >
> > My question is 2 fold.  Am I using the right approach to handle the 4 line
> > records with the custom RecordReader implementation?  And why isn't -libjars
> > working to include my class to hadoop streaming at runtime?
> >
> > Thanks,
> > Jason
>
> --
> Harsh J

Re: hadoop streaming with custom RecordReader class

Posted by Harsh J <ha...@cloudera.com>.
Hi Jason,

A few questions (in order):

1. Does Hadoop's own NLineInputFormat not suffice?
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html

2. Do you make sure to pass your jar into the front-end too?

$ export HADOOP_CLASSPATH=/path/to/your/jar
$ command…

3. Does jar -tf <yourjar> carry a proper mypackage.NLineRecordReader?

4. Is your class marked public?

On Thu, Oct 18, 2012 at 9:32 AM, Jason Wang <ja...@gmail.com> wrote:
> Hi all,
> I'm experimenting with Hadoop streaming on build 1.0.3.
>
> To give background info, I'm streaming a text file into a mapper written
> in C.  Using the default settings, streaming uses TextInputFormat, which
> creates one record from each line.  The problem I am having is that I need
> record boundaries to be every 4 lines.  When the splitter breaks up the
> input into the mappers, I end up with partial records on the boundaries.
> To address this, my approach was to write a new RecordReader class in Java
> that is almost identical to LineRecordReader, but with a modified next()
> method that reads 4 lines instead of one.
>
> I then compiled the new class and created a jar.  I wanted to import this
> at run time using the -libjars argument, like so:
>
> hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar -libjars
> NLineRecordReader.jar -files test_stream.sh -inputreader
> mypackage.NLineRecordReader -input /Users/hadoop/test/test.txt -output
> /Users/hadoop/test/output -mapper "test_stream.sh" -reducer NONE
>
> Unfortunately, I keep getting the following error:
> -inputreader: class not found: mypackage.NLineRecordReader
>
> My question is two-fold.  Am I using the right approach to handle the
> 4-line records with the custom RecordReader implementation?  And why isn't
> -libjars working to include my class in Hadoop streaming at runtime?
>
> Thanks,
> Jason



-- 
Harsh J
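
For anyone who wants to try option 1, a streaming invocation along these
lines is the usual shape (a sketch only, not verified on 1.0.3;
mapred.line.input.format.linespermap is the old-API setting for N):

hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar \
  -D mapred.line.input.format.linespermap=4 \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
  -files test_stream.sh \
  -input /Users/hadoop/test/test.txt \
  -output /Users/hadoop/test/output \
  -mapper test_stream.sh \
  -reducer NONE

Keep in mind that NLineInputFormat controls split boundaries (each map task
gets 4 lines of input), but each of those lines still arrives as its own
record, so it does not by itself turn 4 lines into one record.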
