Posted to common-user@hadoop.apache.org by Public Network Services <pu...@gmail.com> on 2013/02/23 15:13:43 UTC

Getting custom input splits from files that are not byte-aligned or line-aligned

Hi...

I use an application that processes text files containing data records of
variable size that are not line-aligned.

The application implementation includes a Java library with a "reader"
object that extracts records one by one in a "pull" fashion, as strings;
i.e., for each such "reader" object the client code can call

reader.next()


and get an entire record as a String. Proceeding in this fashion, the
client code can consume a file of arbitrary length from start to end, at
which point next() returns null.
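
A minimal consumption loop over such a reader looks like the following
(RecordStringReader is a hypothetical stand-in for the library's actual
interface, which is not shown here):

// Hypothetical stand-in for the vendor library's pull-style reader.
interface RecordStringReader extends java.io.Closeable {
    /** Returns the next record as a String, or null at end of input. */
    String next() throws java.io.IOException;
}

class ReaderDemo {
    // Pull records until next() signals end of file with null.
    static void consumeAll(RecordStringReader reader) throws java.io.IOException {
        String record;
        while ((record = reader.next()) != null) {
            System.out.println(record); // placeholder for real record handling
        }
    }
}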

Another peculiarity is that the extracted record strings may lose some
secondary information (e.g., trimmed whitespace), so exact byte alignment
of the records to the underlying data is not possible.

How could the above code be used to efficiently split compliant text files
of large size (ranging from hundreds of megabytes to several gigabytes, or
even terabytes)?

The source code I have seen in FileInputFormat and numerous other
implementations is line- or byte-aligned, so it is not applicable to the
above case.

It would actually be very useful if there were a template implementation
that left only the string record "reader" object unspecified and did
everything else, but apparently there is none.

Two alternatives that should work are:

   1. Split the files outside Hadoop (e.g., into pieces smaller than the
   64 MB default block size) and supply them to HDFS afterwards, returning
   false from the isSplitable() method of the custom InputFormat (a sketch
   follows this list).
   2. Read the records and write them into new HDFS files in the getSplits()
   method of the custom InputFormat, creating one FileSplit reference for
   each of these files once it is filled to the desired size.
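
For option 1, a minimal sketch of such an InputFormat could look like the
following. It reuses the hypothetical RecordStringReader above; the class
name and the openLibraryReader() adapter are assumptions standing in for
whatever factory the actual library provides:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class RecordStringInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // files were pre-split outside Hadoop; never split again
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new RecordReader<LongWritable, Text>() {
            private RecordStringReader reader;
            private long count;
            private final Text value = new Text();

            @Override
            public void initialize(InputSplit s, TaskAttemptContext ctx)
                    throws IOException {
                Path path = ((FileSplit) s).getPath();
                FSDataInputStream in =
                    path.getFileSystem(ctx.getConfiguration()).open(path);
                reader = openLibraryReader(in); // assumed library factory
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                String record = reader.next(); // null at end of file
                if (record == null) return false;
                count++;
                value.set(record);
                return true;
            }

            @Override public LongWritable getCurrentKey() { return new LongWritable(count); }
            @Override public Text getCurrentValue() { return value; }
            @Override public float getProgress() { return 0.0f; } // no byte offsets available
            @Override public void close() throws IOException { reader.close(); }
        };
    }

    // Assumed adapter: however the vendor library builds its reader from a stream.
    static RecordStringReader openLibraryReader(java.io.InputStream in) {
        throw new UnsupportedOperationException("library-specific");
    }
}

Because isSplitable() returns false, each pre-split file becomes exactly
one map task, which is what makes the external splitting step safe.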

Is there any better approach and/or any example code relevant to the above?

Thanks!

Re: Getting custom input splits from files that are not byte-aligned or line-aligned

Posted by Public Network Services <pu...@gmail.com>.
This appears to be the case.

My main issue is not reading the records (the library offers that
functionality) but putting them into splits after reading (option 2 in my
original post).
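
A rough sketch of that pre-chunking approach, written as a getSplits()
override on the RecordStringInputFormat sketched earlier (the subclass
name, staging path, and target size are all assumptions), might be:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class PreChunkingInputFormat extends RecordStringInputFormat {

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        long targetSize = 64L * 1024 * 1024;              // desired chunk size
        FileSystem fs = FileSystem.get(job.getConfiguration());
        Path stagingDir = new Path("/tmp/record-chunks"); // assumed staging location
        List<InputSplit> splits = new ArrayList<InputSplit>();
        int chunk = 0;

        for (FileStatus status : listStatus(job)) {       // FileInputFormat helper
            RecordStringReader reader = openLibraryReader(fs.open(status.getPath()));
            FSDataOutputStream out = null;
            Path chunkPath = null;
            long written = 0;
            String record;
            while ((record = reader.next()) != null) {
                if (out == null) {                        // start a new chunk file
                    chunkPath = new Path(stagingDir, "chunk-" + (chunk++));
                    out = fs.create(chunkPath);
                    written = 0;
                }
                byte[] bytes = (record + "\n").getBytes("UTF-8");
                out.write(bytes);
                written += bytes.length;
                if (written >= targetSize) {              // chunk full: seal as one split
                    out.close();
                    splits.add(new FileSplit(chunkPath, 0, written, new String[0]));
                    out = null;
                }
            }
            reader.close();
            if (out != null) {                            // seal the final, partial chunk
                out.close();
                splits.add(new FileSplit(chunkPath, 0, written, new String[0]));
            }
        }
        return splits;
    }
}

Note that writing one record per line assumes records cannot themselves
contain newlines; if they can, a SequenceFile of Text values would be a
safer staging format. Also, every record is read and rewritten up front,
which doubles the I/O before the job even starts.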


On Sat, Feb 23, 2013 at 11:05 AM, Wellington Chevreuil <
wellington.chevreuil@gmail.com> wrote:

> Hi,
>
> I think you'll have to implement your own custom FileInputFormat, using
> the library you mentioned to properly read your file records and
> distribute them across map tasks.
>
> Regards,
> Wellington.
> On 23/02/2013 14:14, "Public Network Services" <
> publicnetworkservices@gmail.com> wrote:
>
>> [...]
>


Re: Getting custom input splits from files that are not byte-aligned or line-aligned

Posted by Wellington Chevreuil <we...@gmail.com>.
Hi,

I think you'll have to implement your own custom FileInputFormat, using
the library you mentioned to properly read your file records and
distribute them across map tasks.

Regards,
Wellington.
On 23/02/2013 14:14, "Public Network Services" <
publicnetworkservices@gmail.com> wrote:

> [...]
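
For completeness, wiring the custom format from the earlier sketch into a
job driver takes only the standard Hadoop calls (RecordStringInputFormat
and the driver class name are the assumed names from the sketches above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RecordStringJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "record-string-job");
        job.setJarByClass(RecordStringJobDriver.class);
        job.setInputFormatClass(RecordStringInputFormat.class); // custom format
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // run and report status
    }
}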
