Posted to common-user@hadoop.apache.org by Sugandha Naolekar <su...@gmail.com> on 2014/02/26 06:24:28 UTC

Logic of isSplittable() of class FileInputFormat

Hello,

Suppose a single 129 MB file is stored as two HDFS blocks because the
maximum block size is 128 MB, and each block is read according to the
InputFormat in use. What, then, is the significance of the isSplittable()
method?

If it is set to false, will the entire block be considered a single input
split? How will TextInputFormat react to it?
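
For reference, here is a minimal, self-contained sketch of how input splits are derived from a file's length. It is a simplified model, not Hadoop's actual code: the class and method names are made up for illustration, it assumes the split size equals the block size (default minimum and maximum split sizes) and a non-empty file, and it ignores block locations. The 10% "slop" mirrors, as far as I know, the tolerance FileInputFormat applies before cutting another split. Note that the real hook in the Hadoop API is spelled isSplitable (one "t"), and that TextInputFormat reports a file as splittable unless it is compressed with a codec that does not support splitting, such as gzip.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitModel {
    // The last split may be up to 10% larger than the target split size
    // (assumed to match the tolerance used by Hadoop's FileInputFormat).
    static final double SPLIT_SLOP = 1.1;

    /** Returns the lengths in bytes of the splits for one non-empty file. */
    static List<Long> computeSplits(long fileLength, long splitSize, boolean splittable) {
        List<Long> splits = new ArrayList<>();
        if (!splittable) {
            // Not splittable: the whole file becomes one split, however many
            // blocks it occupies.
            splits.add(fileLength);
            return splits;
        }
        long remaining = fileLength;
        // Cut full-size splits while clearly more than one split's worth remains.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(splitSize);
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits.add(remaining); // the tail, possibly slightly over splitSize
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb;
        System.out.println(computeSplits(129 * mb, blockSize, true).size());  // prints 1
        System.out.println(computeSplits(300 * mb, blockSize, true).size());  // prints 3
        System.out.println(computeSplits(300 * mb, blockSize, false).size()); // prints 1
    }
}
```

Under these assumptions the 129 MB file yields a single split even when it is splittable, because the extra 1 MB falls within the slop. The difference only shows up for larger files: a splittable 300 MB file gives three splits (128 MB, 128 MB, 44 MB), while a non-splittable one gives a single split covering the whole file, not a single block.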


--
Thanks & Regards,
Sugandha Naolekar

Re: Logic of isSplittable() of class FileInputFormat

Posted by Devin Suiter RDX <ds...@rdx.com>.
Or, as another example, I'm writing a program to analyze a large email
dump. The emails are more than one line long. TextInputFormat will split
them up by line, in addition to deserializing them to text. I'm going to
need to customize the RecordReader to split based on the MIME metadata
length of the emails instead of the newline character, and also to preserve
them in stream form so the parser can parse them properly.

Or I could subclass InputFormat so that isSplittable() returns false, and
then I would only have to handle the part about preserving the data as an
InputStream. Incidentally, tips on that are welcome if anyone on the list
wants to help.

So there are some reasons why isSplittable() can be overridden. I think
there is also a performance trade-off at some point once the files get big,
with the mapper having to spill records to disk if the data being mapped
gets too big for the JVM memory...

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Wed, Feb 26, 2014 at 6:04 AM, Dieter De Witte <dr...@gmail.com> wrote:

> if you have a simple one line record format you should allow files to be
> splitted, since your simulations will be better balanced.
>
>
> 2014-02-26 11:31 GMT+01:00 Sugandha Naolekar <su...@gmail.com>:
>
>> Oh. Ok. Thanks. So basically, to be on the safer side, one can always set
>> its value as false and keep the data of records consistent. I mean, the
>> length of all the records should be the same.
>>
>> --
>> Thanks & Regards,
>> Sugandha Naolekar
>>
>>
>>
>>
>>
>> On Wed, Feb 26, 2014 at 3:57 PM, Dieter De Witte <dr...@gmail.com>wrote:
>>
>>> No, an example could be that records have a variable number of lines, if
>>> you would then allow a file to be split your record may be broken, so then
>>> you could override isSplittable to be always false.
>>>
>>>
>>> 2014-02-26 11:22 GMT+01:00 Sugandha Naolekar <su...@gmail.com>:
>>>
>>> So basically what I can deduce from it is, isSplittable() only applies
>>>> to stream compressed files. Right?
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Sugandha Naolekar
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 2:06 PM, Jeff Zhang <je...@gopivotal.com>wrote:
>>>>
>>>>> Hi Sugandha,
>>>>>
>>>>> Take gz file as an example, It is not splittable because of the
>>>>> compression algorithm it is used.  It can not guarantee that one record is
>>>>> located in one block, if one record is in 2 blocks, your program will crash
>>>>> since you can not get the whole record.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Feb 26, 2014 at 1:24 PM, Sugandha Naolekar <
>>>>> sugandha.n87@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> If a single file is split of size 129 MB is split in two
>>>>>> halves/blocks of HDFS as the max block size id 128 MB. And each of the
>>>>>> blocks is read depending on the InputFormat it supports. Thus, what is the
>>>>>> significance of isSplittable() method then?
>>>>>>
>>>>>> If it is set to false, entire block will be considered as single
>>>>>> input split? How will TextInputFormat react to it?
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Sugandha Naolekar
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Logic of isSplittable() of class FileInputFormat

Posted by Dieter De Witte <dr...@gmail.com>.
If you have a simple one-line record format, you should allow files to be
split, since your simulations will be better balanced.
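
To illustrate why one-line records can be split safely, here is a small sketch of the boundary convention that line-oriented readers such as Hadoop's follow, as I understand it: a reader whose split does not start at offset 0 throws away the first (possibly partial) line, and every reader finishes the last line it started even if that line runs past the end of its split. The class and method names are hypothetical, and the model only handles single-byte '\n' delimiters on an in-memory array.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LineSplitDemo {
    /** Returns the lines owned by the split covering bytes [start, end) of data. */
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Not the first split: discard everything up to and including the
            // next newline; the previous split's reader owns that line.
            pos = skipLine(data, pos);
        }
        // Read every line that begins at or before 'end'; the last one may
        // extend beyond the split boundary.
        while (pos <= end && pos < data.length) {
            int next = skipLine(data, pos);
            boolean endsWithNewline = next > pos && data[next - 1] == '\n';
            int lineEnd = endsWithNewline ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    /** Returns the offset just past the next '\n' at or after pos, or data.length. */
    static int skipLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        int boundary = 9; // falls in the middle of "bravo"
        System.out.println(readSplit(data, 0, boundary));           // prints [alpha, bravo]
        System.out.println(readSplit(data, boundary, data.length)); // prints [charlie, delta]
    }
}
```

Each line is read exactly once even though the boundary cuts "bravo" in half, which is why a plain one-record-per-line file can be processed by several mappers in parallel.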


2014-02-26 11:31 GMT+01:00 Sugandha Naolekar <su...@gmail.com>:

> Oh. Ok. Thanks. So basically, to be on the safer side, one can always set
> its value as false and keep the data of records consistent. I mean, the
> length of all the records should be the same.
>
> --
> Thanks & Regards,
> Sugandha Naolekar
>
>
>
>
>
> On Wed, Feb 26, 2014 at 3:57 PM, Dieter De Witte <dr...@gmail.com>wrote:
>
>> No, an example could be that records have a variable number of lines, if
>> you would then allow a file to be split your record may be broken, so then
>> you could override isSplittable to be always false.
>>
>>
>> 2014-02-26 11:22 GMT+01:00 Sugandha Naolekar <su...@gmail.com>:
>>
>> So basically what I can deduce from it is, isSplittable() only applies to
>>> stream compressed files. Right?
>>>
>>> --
>>> Thanks & Regards,
>>> Sugandha Naolekar
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Feb 26, 2014 at 2:06 PM, Jeff Zhang <je...@gopivotal.com>wrote:
>>>
>>>> Hi Sugandha,
>>>>
>>>> Take gz file as an example, It is not splittable because of the
>>>> compression algorithm it is used.  It can not guarantee that one record is
>>>> located in one block, if one record is in 2 blocks, your program will crash
>>>> since you can not get the whole record.
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 1:24 PM, Sugandha Naolekar <
>>>> sugandha.n87@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> If a single file is split of size 129 MB is split in two halves/blocks
>>>>> of HDFS as the max block size id 128 MB. And each of the blocks is read
>>>>> depending on the InputFormat it supports. Thus, what is the significance of
>>>>> isSplittable() method then?
>>>>>
>>>>> If it is set to false, entire block will be considered as single input
>>>>> split? How will TextInputFormat react to it?
>>>>>
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Sugandha Naolekar
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Logic of isSplittable() of class FileInputFormat

Posted by Sugandha Naolekar <su...@gmail.com>.
Oh, OK. Thanks. So basically, to be on the safe side, one can always set
its value to false and keep the record data consistent. I mean, the
length of all the records should be the same.

--
Thanks & Regards,
Sugandha Naolekar





On Wed, Feb 26, 2014 at 3:57 PM, Dieter De Witte <dr...@gmail.com> wrote:

> No, an example could be that records have a variable number of lines, if
> you would then allow a file to be split your record may be broken, so then
> you could override isSplittable to be always false.
>
>
> 2014-02-26 11:22 GMT+01:00 Sugandha Naolekar <su...@gmail.com>:
>
> So basically what I can deduce from it is, isSplittable() only applies to
>> stream compressed files. Right?
>>
>> --
>> Thanks & Regards,
>> Sugandha Naolekar
>>
>>
>>
>>
>>
>> On Wed, Feb 26, 2014 at 2:06 PM, Jeff Zhang <je...@gopivotal.com>wrote:
>>
>>> Hi Sugandha,
>>>
>>> Take gz file as an example, It is not splittable because of the
>>> compression algorithm it is used.  It can not guarantee that one record is
>>> located in one block, if one record is in 2 blocks, your program will crash
>>> since you can not get the whole record.
>>>
>>>
>>>
>>>
>>> On Wed, Feb 26, 2014 at 1:24 PM, Sugandha Naolekar <
>>> sugandha.n87@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> If a single file is split of size 129 MB is split in two halves/blocks
>>>> of HDFS as the max block size id 128 MB. And each of the blocks is read
>>>> depending on the InputFormat it supports. Thus, what is the significance of
>>>> isSplittable() method then?
>>>>
>>>> If it is set to false, entire block will be considered as single input
>>>> split? How will TextInputFormat react to it?
>>>>
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Sugandha Naolekar
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Re: Logic of isSplittable() of class FileInputFormat

Posted by Dieter De Witte <dr...@gmail.com>.
No. One example is a format in which records span a variable number of
lines: if you allowed such a file to be split, a record might be broken
across two splits, so you could override isSplittable to always return false.
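
A small sketch of the failure mode, assuming a made-up format in which records are separated by a blank line (the class name, field names, and sample data are all hypothetical). It parses the same content once as a whole and once as two pieces cut at an arbitrary offset, the way a naive reader would see two splits. In the Hadoop mapreduce API the corresponding hook is, to my knowledge, `protected boolean isSplitable(JobContext context, Path filename)` on FileInputFormat (note the single "t"); returning false there keeps each file in one split.

```java
import java.util.ArrayList;
import java.util.List;

public class MultiLineRecordDemo {
    /** Splits text into records separated by a blank line (hypothetical format). */
    static List<String> parseRecords(String text) {
        List<String> records = new ArrayList<>();
        for (String record : text.split("\n\n")) {
            if (!record.isEmpty()) {
                records.add(record);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String file = "id: 1\nbody: first\n\n"
                + "id: 2\nbody: second\nmore: text\n\n"
                + "id: 3\nbody: third\n";

        // Whole file as one split: three intact records.
        System.out.println(parseRecords(file).size()); // prints 3

        // Cut in the middle of record 2 and parse each piece independently.
        int cut = file.indexOf("more:");
        List<String> left = parseRecords(file.substring(0, cut));
        List<String> right = parseRecords(file.substring(cut));
        System.out.println(left.size() + right.size()); // prints 4
        System.out.println(right.get(0));               // prints more: text
    }
}
```

The second run produces four fragments instead of three records, and the fragment "more: text" has lost its id. A custom RecordReader that can resynchronize on the record separator is the alternative when the files are too large to give up splitting.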


2014-02-26 11:22 GMT+01:00 Sugandha Naolekar <su...@gmail.com>:

> So basically what I can deduce from it is, isSplittable() only applies to
> stream compressed files. Right?
>
> --
> Thanks & Regards,
> Sugandha Naolekar
>
>
>
>
>
> On Wed, Feb 26, 2014 at 2:06 PM, Jeff Zhang <je...@gopivotal.com> wrote:
>
>> Hi Sugandha,
>>
>> Take gz file as an example, It is not splittable because of the
>> compression algorithm it is used.  It can not guarantee that one record is
>> located in one block, if one record is in 2 blocks, your program will crash
>> since you can not get the whole record.
>>
>>
>>
>>
>> On Wed, Feb 26, 2014 at 1:24 PM, Sugandha Naolekar <
>> sugandha.n87@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> If a single file is split of size 129 MB is split in two halves/blocks
>>> of HDFS as the max block size id 128 MB. And each of the blocks is read
>>> depending on the InputFormat it supports. Thus, what is the significance of
>>> isSplittable() method then?
>>>
>>> If it is set to false, entire block will be considered as single input
>>> split? How will TextInputFormat react to it?
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Sugandha Naolekar
>>>
>>>
>>>
>>>
>>
>

Re: Logic of isSplittable() of class FileInputFormat

Posted by Sugandha Naolekar <su...@gmail.com>.
So basically, what I can deduce from this is that isSplittable() only applies
to stream-compressed files. Right?

--
Thanks & Regards,
Sugandha Naolekar





On Wed, Feb 26, 2014 at 2:06 PM, Jeff Zhang <je...@gopivotal.com> wrote:

> Hi Sugandha,
>
> Take gz file as an example, It is not splittable because of the
> compression algorithm it is used.  It can not guarantee that one record is
> located in one block, if one record is in 2 blocks, your program will crash
> since you can not get the whole record.
>
>
>
>
> On Wed, Feb 26, 2014 at 1:24 PM, Sugandha Naolekar <sugandha.n87@gmail.com
> > wrote:
>
>> Hello,
>>
>> If a single file is split of size 129 MB is split in two halves/blocks of
>> HDFS as the max block size id 128 MB. And each of the blocks is read
>> depending on the InputFormat it supports. Thus, what is the significance of
>> isSplittable() method then?
>>
>> If it is set to false, entire block will be considered as single input
>> split? How will TextInputFormat react to it?
>>
>>
>> --
>> Thanks & Regards,
>> Sugandha Naolekar
>>
>>
>>
>>
>

Re: Logic of isSplittable() of class FileInputFormat

Posted by Jeff Zhang <je...@gopivotal.com>.
Hi Sugandha,

Take a gz file as an example. It is not splittable because of the compression
algorithm it uses. There is no guarantee that one record is located in
one block; if a record spans 2 blocks, your program will crash since you
cannot get the whole record.
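
To make the gz example concrete: as far as I know, Hadoop's TextInputFormat
decides splittability from the file's compression codec. A file with no codec
is splittable, and a compressed file is splittable only if its codec supports
it (bzip2 does, gzip does not). The sketch below is a simplified stand-in that
keys on the file extension instead of a codec object. CodecSketch and its
method are hypothetical names, not Hadoop API, and treating every unknown
extension as uncompressed is an assumption made for brevity.

```java
import java.util.Locale;

// Simplified model of a codec-based splittability check; not Hadoop API.
class CodecSketch {
    /** Returns true if a file with this name could be cut into several splits. */
    static boolean isSplittable(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".bz2")) {
            // bzip2 is block-oriented, so a reader can resynchronize mid-file.
            return true;
        }
        if (name.endsWith(".gz") || name.endsWith(".deflate")) {
            // A gzip/deflate stream can only be decoded from its beginning.
            return false;
        }
        // Assumption: anything else is treated as uncompressed text.
        return true;
    }

    public static void main(String[] args) {
        for (String f : new String[] {"logs.txt", "logs.txt.gz", "logs.txt.bz2"}) {
            System.out.println(f + " -> splittable=" + isSplittable(f));
        }
    }
}
```

So the flag is not limited to compressed input: compression is one common
reason to return false, and a record format that cannot be resynchronized
mid-file is another.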




On Wed, Feb 26, 2014 at 1:24 PM, Sugandha Naolekar
<su...@gmail.com> wrote:

> Hello,
>
> If a single file is split of size 129 MB is split in two halves/blocks of
> HDFS as the max block size id 128 MB. And each of the blocks is read
> depending on the InputFormat it supports. Thus, what is the significance of
> isSplittable() method then?
>
> If it is set to false, entire block will be considered as single input
> split? How will TextInputFormat react to it?
>
>
> --
> Thanks & Regards,
> Sugandha Naolekar
>
>
>
>
