Posted to common-user@hadoop.apache.org by Francesco Tamberi <ta...@cli.di.unipi.it> on 2008/04/04 12:00:35 UTC

Streaming + custom input format

Hi All,
I have a streaming tool chain written in C++/Python that performs some operations on really big text files (on the order of gigabytes); the chain reads files and writes its results to standard output.
The chain needs to read well-structured files, so I need to control how Hadoop splits them: it should split a file only at suitable places.
What's the best way to do that?
I'm trying to define a custom input format this way, but I'm not sure it's correct:

public class MyInputFormat extends FileInputFormat<LongWritable, Text> {
	...

	public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
		...
	}
}

That said, I tried to run it (on Hadoop 0.15.3, 0.16.0, and 0.16.1) with:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.16.1-streaming.jar -file ./allTools.sh -mapper "allTools.sh" -jobconf mapred.reduce.tasks=0 -file pathToMyClass.class -inputformat MyClass -input test.txt -output test-output

But it raises an exception: "-inputformat : class not found : MyClass".
I tried passing a jar instead of the class file, putting it in HADOOP_CLASSPATH, and putting it in the system CLASSPATH, but always with the same result.

Thank you for your patience!
-- Francesco




Re: Streaming + custom input format

Posted by Yuri Pradkin <yu...@isi.edu>.
It does work for me.  I have to BOTH ship the extra jar using -file AND 
include it in the classpath on the local system (via HADOOP_CLASSPATH).
I'm not sure what "nothing happened" means.  BTW, I'm using the 0.16.2 
release.
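
For reference, a sketch of that combination against the 0.16.2 streaming jar. The jar name and paths here (myformat.jar, org.myName.MyInputFormat) are hypothetical placeholders; the options themselves are the ones already used in this thread:

export HADOOP_CLASSPATH=/path/to/myformat.jar    # makes the class visible to the submitting JVM

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.16.2-streaming.jar \
    -file ./allTools.sh -mapper "allTools.sh" \
    -file /path/to/myformat.jar \
    -inputformat org.myName.MyInputFormat \
    -jobconf mapred.reduce.tasks=0 \
    -input test.txt -output test-output

The first -file ships the mapper script as before; the second ships the jar with the job so the tasks can load the input format, while HADOOP_CLASSPATH covers the client side, where -inputformat is resolved.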

On Friday 04 April 2008 10:19:54 am Francesco Tamberi wrote:
> I already tried that... nothing happened...
> Thank you,
> -- Francesco
>
> Ted Dunning wrote:
> > I saw that, but I don't know if it will put a jar into the classpath at
> > the other end.

Re: Streaming + custom input format

Posted by Francesco Tamberi <ta...@cli.di.unipi.it>.
I already tried that... nothing happened...
Thank you,
-- Francesco

Ted Dunning wrote:
> I saw that, but I don't know if it will put a jar into the classpath at the
> other end.
>
>
> On 4/4/08 9:56 AM, "Yuri Pradkin" <yu...@isi.edu> wrote:
>
>   
>> There is a -file option to streaming:
>> -file     <file>     File/dir to be shipped in the Job jar file
>>
>> On Friday 04 April 2008 09:24:59 am Ted Dunning wrote:
>>     
>>> At one point, it
>>> was necessary to unpack the streaming.jar file and put your own classes and
>>> jars into that.  Last time I looked at the code, however, there was support
>>> for that happening magically, but in the 30 seconds I have allotted to help
>>> you (sorry bout that), I can't see that there is a command line option to
>>> trigger that, unless it is the one for including a file in the jar file.
>>>       
>>     
>
>   

Re: Streaming + custom input format

Posted by Ted Dunning <td...@veoh.com>.
I saw that, but I don't know if it will put a jar into the classpath at the
other end.


On 4/4/08 9:56 AM, "Yuri Pradkin" <yu...@isi.edu> wrote:

> There is a -file option to streaming:
> -file     <file>     File/dir to be shipped in the Job jar file
> 
> On Friday 04 April 2008 09:24:59 am Ted Dunning wrote:
>> At one point, it
>> was necessary to unpack the streaming.jar file and put your own classes and
>> jars into that.  Last time I looked at the code, however, there was support
>> for that happening magically, but in the 30 seconds I have allotted to help
>> you (sorry bout that), I can't see that there is a command line option to
>> trigger that, unless it is the one for including a file in the jar file.
> 
> 


Re: Streaming + custom input format

Posted by Yuri Pradkin <yu...@isi.edu>.
There is a -file option to streaming:
	-file     <file>     File/dir to be shipped in the Job jar file

On Friday 04 April 2008 09:24:59 am Ted Dunning wrote:
> At one point, it
> was necessary to unpack the streaming.jar file and put your own classes and
> jars into that.  Last time I looked at the code, however, there was support
> for that happening magically, but in the 30 seconds I have allotted to help
> you (sorry bout that), I can't see that there is a command line option to
> trigger that, unless it is the one for including a file in the jar file.



Re: Streaming + custom input format

Posted by Ted Dunning <td...@veoh.com>.


On 4/4/08 10:18 AM, "Francesco Tamberi" <ta...@cli.di.unipi.it> wrote:

> Thanks for your fast reply!
> 
> Ted Dunning wrote:
>> Take a look at the way that the text input format moves to the next line
>> after a split point.
>> 
>>   
> I'm not sure I understand... is my way correct, or are you suggesting
> another one?

I am not sure if I was suggesting something different, but I think it was.

It sounded like you were going to find good split points in the getSplits
method.

The TextInputFormat doesn't try to be so clever, since that would involve
(serialized) reading of parts of the file.  Instead, it picks the break
points *WITHOUT* reference to the contents of the files.  Then, when the
mapper lights up, the input format jumps to the assigned point, reads the
remainder of the line at that point and then starts sending full lines to
the mapper, continuing until it hits the end of the file OR passes the
beginning of the next split.  This means that it may read additional data
after the assigned end point, but that extra data is guaranteed to be
ignored by the input format in charge of reading that split.

This is a very clever and simple solution to the problem that depends only
on being able to find a boundary between records.  If you can do that, then
you are golden.
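
A minimal, untested sketch of that idea against the old org.apache.hadoop.mapred API (the one the code in this thread uses) might look like the following. skipPartialRecord() and readOneRecord() are hypothetical, format-specific helpers that would have to be written for the actual record structure:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class MyInputFormat extends FileInputFormat<LongWritable, Text> {

    // No getSplits() override: the default byte-oriented splits are fine, because
    // the record reader below re-aligns itself to record boundaries.
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new MyRecordReader((FileSplit) split, job);
    }

    static class MyRecordReader implements RecordReader<LongWritable, Text> {
        private final FSDataInputStream in;
        private final long start;   // first byte of this split
        private final long end;     // first byte past this split
        private long pos;           // current position in the file

        MyRecordReader(FileSplit split, JobConf job) throws IOException {
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(job);
            in = fs.open(file);
            start = split.getStart();
            end = start + split.getLength();
            in.seek(start);
            if (start != 0) {
                // We almost certainly landed mid-record: discard bytes up to the next
                // record boundary; the previous split's reader will emit that record.
                skipPartialRecord(in);
            }
            pos = in.getPos();
        }

        public boolean next(LongWritable key, Text value) throws IOException {
            // Stop once a record would *start* at or beyond the split end.  Reading may
            // still run past 'end' to finish the last record, exactly as described above
            // for TextInputFormat.
            if (pos >= end) {
                return false;
            }
            key.set(pos);
            boolean gotRecord = readOneRecord(in, value);
            pos = in.getPos();
            return gotRecord;
        }

        public LongWritable createKey() { return new LongWritable(); }
        public Text createValue() { return new Text(); }
        public long getPos() { return pos; }
        public void close() throws IOException { in.close(); }

        public float getProgress() {
            return end == start ? 0.0f
                : Math.min(1.0f, (pos - start) / (float) (end - start));
        }

        // Hypothetical, format-specific helper: advance the stream to the start of the
        // next record (e.g. scan forward for a delimiter the file format guarantees).
        private void skipPartialRecord(FSDataInputStream in) throws IOException {
            // ...
        }

        // Hypothetical, format-specific helper: read one complete record into 'value';
        // return false at end of file.
        private boolean readOneRecord(FSDataInputStream in, Text value) throws IOException {
            // ...
            return false;
        }
    }
}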


Re: Streaming + custom input format

Posted by Francesco Tamberi <ta...@cli.di.unipi.it>.
Thanks for your fast reply!

Ted Dunning wrote:
> Take a look at the way that the text input format moves to the next line
> after a split point.
>
>   
I'm not sure I understand... is my way correct, or are you suggesting 
another one?
> There are a couple of possible causes of your input-format-not-found
> problem.
>
> First, is your input in a package?  If so, you need to provide a complete
> name for the class.
>
>   
I forgot to mention it, but I provided the complete class name (i.e., 
org.myName.ClassName).
> Secondly, you have to give streaming information about how to package up
> your input format class for transfer to the cluster.  Having access to the
> class on your initial invoking machine is not sufficient.  At one point, it
> was necessary to unpack the streaming.jar file and put your own classes and
> jars into that.  Last time I looked at the code, however, there was support
> for that happening magically, but in the 30 seconds I have allotted to help
> you (sorry bout that), I can't see that there is a command line option to
> trigger that, unless it is the one for including a file in the jar file.
>
>   
I'll try including my jar/class in streaming.jar... it's not so clean, but 
it would be great if it works! I'll keep you informed ;)
Thank you again
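
One way to do that (sketch only; the paths are made up, and this assumes the compiled class lives under org/myName/) is to update a private copy of the streaming jar with the standard jar tool and then submit the job with that copy:

cd /path/to/compiled/classes        # directory containing org/myName/MyInputFormat.class
cp $HADOOP_HOME/contrib/streaming/hadoop-0.16.1-streaming.jar my-streaming.jar
jar uf my-streaming.jar org/myName/MyInputFormat.class

and then run "hadoop jar /path/to/compiled/classes/my-streaming.jar ..." with the same streaming options as before.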
> On 4/4/08 3:00 AM, "Francesco Tamberi" <ta...@cli.di.unipi.it> wrote:
>
>   
>> Hi All,
>> I have a streaming tool chain written in C++/Python that performs some
>> operations on really big text files (on the order of gigabytes); the chain
>> reads files and writes its results to standard output.
>> The chain needs to read well-structured files, so I need to control how
>> Hadoop splits them: it should split a file only at suitable places.
>> What's the best way to do that?
>> I'm trying to define a custom input format this way, but I'm not sure it's
>> correct:
>>
>> public class MyInputFormat extends FileInputFormat<LongWritable, Text> {
>> ...
>>
>> public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
>> ...
>> }
>> }
>>
>> That said, I tried to run it (on Hadoop 0.15.3, 0.16.0, and 0.16.1) with:
>>
>> $HADOOP_HOME/bin/hadoop jar
>> $HADOOP_HOME/contrib/streaming/hadoop-0.16.1-streaming.jar -file ./allTools.sh
>> -mapper "allTools.sh" -jobconf mapred.reduce.tasks=0 -file pathToMyClass.class
>> -inputformat MyClass -input test.txt -output test-output
>>
>> But it raises an exception: "-inputformat : class not found : MyClass".
>> I tried passing a jar instead of the class file, putting it in HADOOP_CLASSPATH,
>> and putting it in the system CLASSPATH, but always with the same result.
>>
>> Thank you for your patience!
>> -- Francesco
>>
>>
>>
>>     
>
>   

Re: Streaming + custom input format

Posted by Ted Dunning <td...@veoh.com>.
Take a look at the way that the text input format moves to the next line
after a split point.

There are a couple of possible causes of your input-format-not-found
problem.

First, is your input in a package?  If so, you need to provide a complete
name for the class.

Secondly, you have to give streaming information about how to package up
your input format class for transfer to the cluster.  Having access to the
class on your initial invoking machine is not sufficient.  At one point, it
was necessary to unpack the streaming.jar file and put your own classes and
jars into that.  Last time I looked at the code, however, there was support
for that happening magically, but in the 30 seconds I have allotted to help
you (sorry bout that), I can't see that there is a command line option to
trigger that, unless it is the one for including a file in the jar file.


On 4/4/08 3:00 AM, "Francesco Tamberi" <ta...@cli.di.unipi.it> wrote:

> Hi All,
> I have a streaming tool chain written in C++/Python that performs some
> operations on really big text files (on the order of gigabytes); the chain
> reads files and writes its results to standard output.
> The chain needs to read well-structured files, so I need to control how
> Hadoop splits them: it should split a file only at suitable places.
> What's the best way to do that?
> I'm trying to define a custom input format this way, but I'm not sure it's
> correct:
> 
> public class MyInputFormat extends FileInputFormat<LongWritable, Text> {
> ...
> 
> public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
> ...
> }
> }
> 
> That said, I tried to run it (on Hadoop 0.15.3, 0.16.0, and 0.16.1) with:
> 
> $HADOOP_HOME/bin/hadoop jar
> $HADOOP_HOME/contrib/streaming/hadoop-0.16.1-streaming.jar -file ./allTools.sh
> -mapper "allTools.sh" -jobconf mapred.reduce.tasks=0 -file pathToMyClass.class
> -inputformat MyClass -input test.txt -output test-output
> 
> But it raises an exception: "-inputformat : class not found : MyClass".
> I tried passing a jar instead of the class file, putting it in HADOOP_CLASSPATH,
> and putting it in the system CLASSPATH, but always with the same result.
> 
> Thank you for your patience!
> -- Francesco
> 
> 
>