Posted to common-user@hadoop.apache.org by chandra <ch...@cognizant.com> on 2008/09/24 11:17:29 UTC

1 file per record

Hi,

By setting isSplitable to false, we can send 1 file with n records to 1 mapper.

Is there any way to make 1 complete file go in as a single record?

Thanks in advance
Chandravadana S




This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information.
If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message. 
Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email or any action taken in reliance on this e-mail is strictly 
prohibited and may be unlawful.

Re: 1 file per record

Posted by chandravadana <Ch...@cognizant.com>.
Hi,

By setting isSplitable to false, we prevent the files from splitting. We can
verify that from the number of map tasks. But how do we check whether the
records are proper?

Chandravadana S

 

Enis Soztutar wrote:
> 
> Nope, not right now. But this has came up before. Perhaps you will 
> contribute one?
> 
> 
> chandravadana wrote:
>> thanks
>>
>> is there any built in record reader which performs this function..
>>
>>
>>
>> Enis Soztutar wrote:
>>   
>>> Yes, you can use MultiFileInputFormat.
>>>
>>> You can extend the MultiFileInputFormat to return a RecordReader, which 
>>> reads a record for each file in the MultiFileSplit.
>>>
>>> Enis
>>>
>>> chandra wrote:
>>>     
>>>> hi..
>>>>
>>>> By setting isSplitable false, we can set 1 file with n records 1
>>>> mapper.
>>>>
>>>> Is there any way to set 1 complete file per record..
>>>>
>>>> Thanks in advance
>>>> Chandravadana S
>>>>
>>>>
>>>>
>>>>

-- 
View this message in context: http://www.nabble.com/1-file-per-record-tp19644985p19667750.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: 1 file per record

Posted by Enis Soztutar <en...@gmail.com>.
Nope, not right now. But this has come up before. Perhaps you will 
contribute one?


chandravadana wrote:
> thanks
>
> is there any built in record reader which performs this function..
>
>
>
> Enis Soztutar wrote:
>   
>> Yes, you can use MultiFileInputFormat.
>>
>> You can extend the MultiFileInputFormat to return a RecordReader, which 
>> reads a record for each file in the MultiFileSplit.
>>
>> Enis
>>
>> chandra wrote:
>>     
>>> hi..
>>>
>>> By setting isSplitable false, we can set 1 file with n records 1 mapper.
>>>
>>> Is there any way to set 1 complete file per record..
>>>
>>> Thanks in advance
>>> Chandravadana S
>>>
>>>
>>>
>>>

Re: 1 file per record

Posted by chandravadana <Ch...@cognizant.com>.
Thanks.

Is there any built-in record reader which performs this function?



Enis Soztutar wrote:
> 
> Yes, you can use MultiFileInputFormat.
> 
> You can extend the MultiFileInputFormat to return a RecordReader, which 
> reads a record for each file in the MultiFileSplit.
> 
> Enis
> 
> chandra wrote:
>> hi..
>>
>> By setting isSplitable false, we can set 1 file with n records 1 mapper.
>>
>> Is there any way to set 1 complete file per record..
>>
>> Thanks in advance
>> Chandravadana S
>>
>>
>>
>>

-- 
View this message in context: http://www.nabble.com/1-file-per-record-tp19644985p19646442.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: 1 file per record

Posted by Sean Arietta <sa...@virginia.edu>.
I have a similar issue and would like some clarification if possible. Suppose
each file is meant to be emitted as one single record to a set of map
tasks. That is, each key-value pair will include data from one file and one
file alone. 

I have written custom InputFormats and RecordReaders before so I am familiar
with the general process. Does it suffice to just return an empty array from
the InputFormat.getSplits() function and then take care of the actual record
emitting from inside the custom RecordReader? 

Thanks for your time!

-Sean


owen.omalley wrote:
> 
> On Oct 2, 2008, at 1:50 AM, chandravadana wrote:
> 
>> If we dont specify numSplits in getsplits(), then what is the default
>> number of splits taken...
> 
> The getSplits() is either library or user code, so it depends which  
> class you are using as your InputFormat. The FileInputFormats  
> (TextInputFormat and SequenceFileInputFormat) basically divide input  
> files by blocks, unless the requested number of mappers is really high.
> 
> -- Owen
> 
> 

-- 
View this message in context: http://www.nabble.com/1-file-per-record-tp19644985p22551968.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: 1 file per record

Posted by chandravadana <Ch...@cognizant.com>.

Suppose I use TextInputFormat, I set isSplitable to false, and there are 5
files. What happens to numSplits now? Will that be set to 0?

S.Chandravadana


owen.omalley wrote:
> 
> On Oct 2, 2008, at 1:50 AM, chandravadana wrote:
> 
>> If we dont specify numSplits in getsplits(), then what is the default
>> number of splits taken...
> 
> The getSplits() is either library or user code, so it depends which  
> class you are using as your InputFormat. The FileInputFormats  
> (TextInputFormat and SequenceFileInputFormat) basically divide input  
> files by blocks, unless the requested number of mappers is really high.
> 
> -- Owen
> 
> 

-- 
View this message in context: http://www.nabble.com/1-file-per-record-tp19644985p19794194.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: 1 file per record

Posted by Owen O'Malley <om...@apache.org>.
On Oct 2, 2008, at 1:50 AM, chandravadana wrote:

> If we dont specify numSplits in getsplits(), then what is the default
> number of splits taken...

The getSplits() is either library or user code, so it depends which  
class you are using as your InputFormat. The FileInputFormats  
(TextInputFormat and SequenceFileInputFormat) basically divide input  
files by blocks, unless the requested number of mappers is really high.

-- Owen
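The block-based default Owen describes can be sketched as simple arithmetic: the per-split goal size is the total input divided by the requested mapper count, but the actual split size is capped at the block size unless the goal shrinks below it. A hedged plain-Java sketch; `computeSplitSize`, the 64 MB block size, and the 1-byte minimum are illustrative stand-ins for the era's FileInputFormat internals, not copied from the Hadoop source:

```java
public class SplitSizeDemo {
    // Illustrative version of the split-size rule: at least minSize,
    // at most blockSize unless the per-split goal is smaller still.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;   // assumed HDFS block size
        long totalSize = 640 * mb;  // assumed total input size

        // Low requested mapper count: goal per split is large, block size wins,
        // so a 640 MB input yields ten 64 MB splits.
        long goal = totalSize / 2;
        System.out.println(computeSplitSize(goal, 1, blockSize) / mb); // 64

        // Very high requested mapper count: goal shrinks below one block,
        // so splits become smaller than a block.
        long goalHigh = totalSize / 1000;
        System.out.println(computeSplitSize(goalHigh, 1, blockSize) < blockSize); // true
    }
}
```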

Re: 1 file per record

Posted by chandravadana <Ch...@cognizant.com>.

Hi all,

I have a doubt. If we don't specify numSplits in getSplits(), then what is
the default number of splits taken?


-- 
Best Regards
S.Chandravadana 

-- 
View this message in context: http://www.nabble.com/1-file-per-record-tp19644985p19775580.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


RE: 1 file per record

Posted by "Goel, Ankur" <an...@corp.aol.com>.
The way this is done in hadoop-land is you create your custom
InputFormat and override the getSplits(), isSplitable() and
getRecordReader() APIs.

The idea is that the application knows how to construct splits of the data
(which is no splits at all in your case) and how to detect record boundaries
and read records. 

I suggest you override an existing RecordReader implementation or
create your own to fit your case.
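To make the "read whole files, not lines" point concrete, here is a plain-Java sketch of the read loop such a RecordReader would run for one file. It uses only the standard library; `WholeFileRead` and `readFully` are hypothetical names, and a real reader would pull its stream from the FileSystem and InputSplit rather than a local temp file:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class WholeFileRead {
    // Drain an entire stream into one byte[] -- the core of what a
    // whole-file RecordReader.next() does instead of returning one line.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Simulate one input file with three lines.
        Path tmp = Files.createTempFile("record", ".txt");
        Files.write(tmp, "line1\nline2\nline3\n".getBytes("UTF-8"));
        byte[] record = readFully(Files.newInputStream(tmp));
        // All 18 bytes arrive as a single record, not three line records.
        System.out.println(record.length); // 18
        Files.delete(tmp);
    }
}
```

With this shape, the map function receives the whole file contents in one call, so nothing gets overwritten line by line.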

-----Original Message-----
From: chandravadana [mailto:Chandravadana.Selvachamy@cognizant.com] 
Sent: Friday, September 26, 2008 3:04 PM
To: core-user@hadoop.apache.org
Subject: Re: 1 file per record 



hi

i'm writing an appln which computes using the entire data from a file.
for that purpose i dont want to split my file and the entire file shd go to
map task..
i've been able to override isSplitable() do it and the file is not getting
split now..
then..
i had to store the input values to an array..(in map func) and then proceed
with my computation. when i displayed that array i found only the last line
of the file getting displayed... does this mean that data is read line by
line by the line reader and not continously.
if so, what shd i do inorder to read complete contents of the file...

Thank you
Chandravadana S


Enis Soztutar wrote:
> 
> Yes, you can use MultiFileInputFormat.
> 
> You can extend the MultiFileInputFormat to return a RecordReader,
which 
> reads a record for each file in the MultiFileSplit.
> 
> Enis
> 
> chandra wrote:
>> hi..
>>
>> By setting isSplitable false, we can set 1 file with n records 1
mapper.
>>
>> Is there any way to set 1 complete file per record..
>>
>> Thanks in advance
>> Chandravadana S
>>
>>
>>
>>

-- 
View this message in context:
http://www.nabble.com/1-file-per-record-tp19644985p19685269.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: 1 file per record

Posted by chandravadana <Ch...@cognizant.com>.

Hi,

I'm writing an application which computes using the entire data from a file.
For that purpose I don't want to split my file; the entire file should go to
one map task.
I've been able to override isSplitable() to do it, and the file is not getting
split now.
Then I had to store the input values in an array (in the map function) and
proceed with my computation. When I displayed that array, I found only the
last line of the file getting displayed. Does this mean that the data is read
line by line by the line reader, and not continuously?
If so, what should I do in order to read the complete contents of the file?

Thank you
Chandravadana S


Enis Soztutar wrote:
> 
> Yes, you can use MultiFileInputFormat.
> 
> You can extend the MultiFileInputFormat to return a RecordReader, which 
> reads a record for each file in the MultiFileSplit.
> 
> Enis
> 
> chandra wrote:
>> hi..
>>
>> By setting isSplitable false, we can set 1 file with n records 1 mapper.
>>
>> Is there any way to set 1 complete file per record..
>>
>> Thanks in advance
>> Chandravadana S
>>
>>
>>
>>

-- 
View this message in context: http://www.nabble.com/1-file-per-record-tp19644985p19685269.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: 1 file per record

Posted by Enis Soztutar <en...@gmail.com>.
Yes, you can use MultiFileInputFormat.

You can extend the MultiFileInputFormat to return a RecordReader, which 
reads a record for each file in the MultiFileSplit.

Enis
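A hedged sketch of the resulting behaviour in plain Java (no Hadoop classes; `OneRecordPerFile` and `readAll` are illustrative names): treat a list of paths as one split and emit a single (filename, whole contents) record per file, which is what the extended MultiFileInputFormat's RecordReader would produce for each path in its MultiFileSplit.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OneRecordPerFile {
    // For each path in the "split", emit one (filename -> whole contents)
    // record, mirroring a RecordReader that reads a record per file.
    static Map<String, String> readAll(List<Path> split) throws IOException {
        Map<String, String> records = new LinkedHashMap<>();
        for (Path p : split) {
            records.put(p.getFileName().toString(),
                        new String(Files.readAllBytes(p), "UTF-8"));
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a split holding two small input files.
        Path a = Files.createTempFile("a", ".txt");
        Path b = Files.createTempFile("b", ".txt");
        Files.write(a, "first file\n".getBytes("UTF-8"));
        Files.write(b, "second file\n".getBytes("UTF-8"));
        Map<String, String> recs = readAll(List.of(a, b));
        System.out.println(recs.size()); // 2: one record per file
        Files.delete(a);
        Files.delete(b);
    }
}
```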

chandra wrote:
> hi..
>
> By setting isSplitable false, we can set 1 file with n records 1 mapper.
>
> Is there any way to set 1 complete file per record..
>
> Thanks in advance
> Chandravadana S
>
>
>
>