Posted to common-user@hadoop.apache.org by "Yuri K." <mr...@hotmail.com> on 2010/03/24 16:23:58 UTC

Manually splitting files in blocks

Dear Hadoopers,

I'm trying to find out how and where Hadoop splits a file into blocks and
decides which datanodes to send them to.

My specific problem:
I have two types of data files.
One large file is used as a database file where information is sorted like
this:
[BEGIN DATAROW]
... lots of data 1
[END DATAROW]

[BEGIN DATAROW]
... lots of data 2
[END DATAROW]
and so on.

The other, smaller files contain raw data that is to be compared against a
data row in the large file.

So my question is: is it possible to manually control how Hadoop splits the
large data file into blocks? Obviously I want each BEGIN/END section to stay
within one block to optimize performance. That way I can replicate the
smaller files to every node, and the nodes can work independently of each
other.

thanks, yk
-- 
View this message in context: http://old.nabble.com/Manually-splitting-files-in-blocks-tp28015936p28015936.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


RE: Manually splitting files in blocks

Posted by Ankit Bhatnagar <ab...@vantage.com>.
So this is how it goes (rough sketch below):

1- CustomInputFormat (or whatever name you like) extends TextInputFormat
2- this class has a method that returns the RecordReader object
3- you have to create a CustomRecordReader as well that reads your records
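
A minimal sketch of those three steps with the new mapreduce API; the class
names (DataRowInputFormat, DataRowRecordReader) and the exact tag handling
are illustrative guesses, not code from this thread:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class DataRowInputFormat extends TextInputFormat {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // simplest (least parallel) choice: never split the file, so every
        // reader sees only complete [BEGIN DATAROW]...[END DATAROW] sections
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new DataRowRecordReader();
    }

    // step 3: a reader that glues the lines between the tags into one record
    public static class DataRowRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader lines = new LineRecordReader();
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException {
            lines.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            StringBuilder row = new StringBuilder();
            boolean inRow = false;
            while (lines.nextKeyValue()) {
                String line = lines.getCurrentValue().toString();
                if (line.startsWith("[BEGIN DATAROW]")) {
                    inRow = true;
                    key.set(lines.getCurrentKey().get());  // byte offset of the row
                } else if (line.startsWith("[END DATAROW]") && inRow) {
                    value.set(row.toString());
                    return true;
                } else if (inRow) {
                    row.append(line).append('\n');
                }
            }
            return false;  // no more complete rows
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() throws IOException { return lines.getProgress(); }
        @Override public void close() throws IOException { lines.close(); }
    }
}

With isSplitable() returning false each mapper reads a whole file, which only
makes sense if the big file is not huge; the getSplits() approach mentioned
later in the thread avoids that trade-off.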

Ankit



-----Original Message-----
From: Yuri K. [mailto:mr_greenshit@hotmail.com] 
Sent: Friday, March 26, 2010 10:49 AM
To: core-user@hadoop.apache.org
Subject: Re: Manually splitting files in blocks


OK, so far so good, and thanks for the reply. I'm trying to implement a custom
file input format, but I can set it only in the job configuration:
job.setInputFormatClass(CustomFileInputFormat.class);

How do I make Hadoop apply the file format, or the custom file split, when I
upload new files to HDFS? Do I need a custom upload interface for that, or is
there a Hadoop config option for it?

thanks



-- 
View this message in context: http://old.nabble.com/Manually-splitting-files-in-blocks-tp28015936p28043517.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Manually splitting files in blocks

Posted by Patrick Angeles <pa...@cloudera.com>.
> My understanding (please correct me, list) is that Hadoop will always split
> your files based on the block size setting. The InputSplit and RecordReaders
> are used by jobs to retrieve chunks of files for processing - that is, there
> are two separate splits happening here: one "physical" split for storage and
> one "logical" split for processing.
>

That's right. The physical splits are HDFS "blocks". An InputSplit is a
logical split and represents the unit of work that is sent to a single
Mapper. A RecordReader provides a record-oriented view of the data. In most
cases, the last record in each InputSplit will span the split's boundary, or
even a block's boundary. In the latter case, data is transferred from the
datanode that holds the next contiguous block so that the reader can
construct a full record.
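
For concreteness, the two kinds of split are controlled by different knobs. A
minimal sketch, assuming the 0.20-era property name dfs.block.size and
made-up sizes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitKnobs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "physical" split: block size used for files written with this configuration
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);

        Job job = new Job(conf, "datarow-compare");
        // "logical" split: lower bound on the size of each InputSplit handed to a mapper
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
    }
}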

Re: Manually splitting files in blocks

Posted by Nick Dimiduk <nd...@gmail.com>.
Replies inline below.

On Fri, Mar 26, 2010 at 7:49 AM, Yuri K. <mr...@hotmail.com> wrote:

>
> ok so far so good. thanks for the reply. i'm trying to implement a custom
> file input format. but i can set it only in the job configuration:
> job.setInputFormatClass(CustomFileInputFormat.class);
>
>
This is exactly right. The custom input code ends up bundled in your job jar
and is available to the job at runtime just like any other dependency
library. Alternatively, you could package your new input format into its own
jar and "install" it on the cluster by pushing it out to $HADOOP_HOME/lib on
every machine. Unless you're building common infrastructure for a disparate
set of users, I'd recommend the former approach.
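
Concretely, a driver fragment could look like the sketch below; it assumes
the CustomFileInputFormat class from the question is on the classpath, and
the input/output paths are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompareDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "compare-datarows");
        // ships the jar that contains this class (and the custom input format)
        job.setJarByClass(CompareDriver.class);
        job.setInputFormatClass(CustomFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/datarows.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}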

how do i make hadoop implement the file format, or the custom file split
> when i upload new files to the hdfs? do i need a custom upload interface
> for
> that or is there a hadoop config option for that?
>

My understanding (please correct me, list) is that Hadoop will always split
your files based on the block size setting. The InputSplit and RecordReaders
are used by jobs to retrieve chunks of files for processing - that is, there
are two separate splits happening here: one "physical" split for storage and
one "logical" split for processing.

Cheers,
-Nick



Re: Manually splitting files in blocks

Posted by Antonio Barbuzzi <an...@gmail.com>.
HDFS splits files into chunks regardless of their content.

This is what I have understood so far:
an InputFormat object reads the file and returns a list of InputSplit
objects (each one usually holds the boundaries of the file section your map
task will read). Moreover, InputFormat has a method that returns a
RecordReader, which knows how to read and interpret your InputSplit.

Therefore, when none of the existing InputSplit implementations works for
your file, you can:

- if you are able to start reading from an arbitrary offset in the file
(there are sync points in the file, such as \n or spaces in a text file),
just define your own custom RecordReader.
- if your file is splittable into fixed-length chunks, subclass InputFormat
and override its getSplits method so that every split handed to the job
contains only whole records (a rough sketch of this second option follows).
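
A minimal sketch of that second option, assuming fixed-length binary records;
the class name, record length and records-per-split constant are illustrative
only, and the matching RecordReader is left out:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FixedLengthInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

    private static final long RECORD_LEN = 4096;          // bytes per record (illustrative)
    private static final long RECORDS_PER_SPLIT = 100000;  // whole records per InputSplit

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        long splitLen = RECORD_LEN * RECORDS_PER_SPLIT;  // always a multiple of the record length
        for (FileStatus file : listStatus(job)) {
            Path path = file.getPath();
            for (long offset = 0; offset < file.getLen(); offset += splitLen) {
                long len = Math.min(splitLen, file.getLen() - offset);
                // no locality hints here; a fuller version would look up block locations
                splits.add(new FileSplit(path, offset, len, new String[0]));
            }
        }
        return splits;
    }

    @Override
    public RecordReader<LongWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // a matching reader (not shown) would read RECORD_LEN bytes per nextKeyValue()
        throw new UnsupportedOperationException("record reader omitted from this sketch");
    }
}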

BR,
Antonio Barbuzzi


-------- Original Message  --------
Subject: Re: Manually splitting files in blocks
From: Yuri K. <mr...@hotmail.com>
To: core-user@hadoop.apache.org
Date: Fri Mar 26 2010 15:49:09 GMT+0100 (CET)

> ok so far so good. thanks for the reply. i'm trying to implement a custom
> file input format. but i can set it only in the job configuration:
> job.setInputFormatClass(CustomFileInputFormat.class);
> 
> how do i make hadoop implement the file format, or the custom file split
> when i upload new files to the hdfs? do i need a custom upload interface for
> that or is there a hadoop config option for that?
> 
> tnx


Re: Manually splitting files in blocks

Posted by "Yuri K." <mr...@hotmail.com>.
OK, so far so good, and thanks for the reply. I'm trying to implement a custom
file input format, but I can set it only in the job configuration:
job.setInputFormatClass(CustomFileInputFormat.class);

How do I make Hadoop apply the file format, or the custom file split, when I
upload new files to HDFS? Do I need a custom upload interface for that, or is
there a Hadoop config option for it?

thanks


ANKITBHATNAGAR wrote:
> You should create a CustomInputSplit and CustomRecordReader (should have
> start and end tag )

-- 
View this message in context: http://old.nabble.com/Manually-splitting-files-in-blocks-tp28015936p28043517.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Manually splitting files in blocks

Posted by ANKITBHATNAGAR <ab...@vantage.com>.




You should create a custom InputSplit and a custom RecordReader (the reader
should look for the start and end tags).




-- 
View this message in context: http://old.nabble.com/Manually-splitting-files-in-blocks-tp28015936p28021294.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Manually splitting files in blocks

Posted by Sonal Goyal <so...@gmail.com>.
Hi Yuri,

You can also check the source code of FileInputFormat and create your own
RecordReader implementation.
Thanks and Regards,
Sonal
www.meghsoft.com


On Wed, Mar 24, 2010 at 9:08 PM, Patrick Angeles <pa...@cloudera.com> wrote:

> Yuri,
>
> Probably the easiest thing is to actually create distinct files and
> configure the block size per file such that HDFS doesn't split it into
> smaller blocks for you.
>
> - P

Re: Manually splitting files in blocks

Posted by Patrick Angeles <pa...@cloudera.com>.
Yuri,

Probably the easiest thing is to actually create distinct files and
configure the block size per file so that HDFS doesn't split them into
smaller blocks for you.

- P
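
One way to do that when writing the file, as a minimal sketch: the
FileSystem.create() overload below takes a per-file block size, and the paths
and sizes here are made up:

import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class UploadWithBigBlock {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 512L * 1024 * 1024;  // larger than the file, so it stays in one block
        short replication = 3;
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);

        FSDataOutputStream out = fs.create(new Path("/data/datarows.txt"),
                true, bufferSize, replication, blockSize);
        IOUtils.copyBytes(new FileInputStream("datarows.txt"), out, conf, true);
    }
}

Passing -D dfs.block.size=... as a generic option to the fs shell on the
upload should have the same effect, but check that your version honors it.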
