Posted to hdfs-user@hadoop.apache.org by Kaliyug Antagonist <ka...@gmail.com> on 2013/01/16 16:31:27 UTC

Loading file to HDFS with custom chunk structure

I want to load a SegY <http://en.wikipedia.org/wiki/SEG_Y> file onto HDFS
of a 3-node Apache Hadoop cluster.

To summarize, the SegY file consists of:

   1. 3200 bytes *textual header*
   2. 400 bytes *binary header*
   3. Variable bytes *data*

99.99% of the file's size comes from the variable-byte data, which is a
collection of thousands of contiguous traces. For any SegY file to make
sense, it must have the textual header + binary header + at least one trace
of data. What I want to achieve is to split a large SegY file across the
Hadoop cluster so that a smaller SegY file is available on each node for
local processing.

The scenario is as follows:

   1. The SegY file is large (above 10 GB) and is resting on the
   local file system of the NameNode machine.
   2. The file is to be split across the nodes in such a way that each node
   has a small SegY file with a strict structure - 3200 bytes *textual header* +
   400 bytes *binary header* + variable bytes *data*. Obviously, I can't
   blindly use FSDataOutputStream or hadoop fs -copyFromLocal, as these may not
   ensure the format in which the chunks of the larger file are required.

Please guide me as to how I must proceed.

Thanks and regards !
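
A minimal sketch of the kind of header-preserving split described above, for
illustration only: it assumes fixed-length traces (a 240-byte trace header plus
ns 4-byte samples, with ns read from the binary header and no extended textual
headers) and copies both file headers into every output part, so each part
remains a self-contained SegY file. The class name SegySplitter is hypothetical.

import java.io.*;

/**
 * Splits a SEG Y file into smaller SEG Y files, copying the 3200-byte textual
 * header and the 400-byte binary header into every part.
 * Assumption: fixed-length traces of 240 bytes + ns * 4 bytes of samples.
 */
public class SegySplitter {

    public static void split(File in, File outDir, int tracesPerPart) throws IOException {
        try (DataInputStream dis = new DataInputStream(
                new BufferedInputStream(new FileInputStream(in)))) {

            byte[] textual = new byte[3200];
            byte[] binary = new byte[400];
            dis.readFully(textual);
            dis.readFully(binary);

            // Samples per trace: 2-byte big-endian field at offset 20 of the binary header.
            int ns = ((binary[20] & 0xFF) << 8) | (binary[21] & 0xFF);
            byte[] trace = new byte[240 + ns * 4];

            int part = 0;
            boolean more = true;
            while (more) {
                File partFile = new File(outDir, in.getName() + ".part" + part++);
                int copied = 0;
                try (DataOutputStream dos = new DataOutputStream(
                        new BufferedOutputStream(new FileOutputStream(partFile)))) {
                    dos.write(textual);   // both headers go into every part
                    dos.write(binary);
                    while (copied < tracesPerPart) {
                        try {
                            dis.readFully(trace);
                        } catch (EOFException eof) {
                            more = false;
                            break;
                        }
                        dos.write(trace);
                        copied++;
                    }
                }
                if (copied == 0) {
                    partFile.delete();    // drop a header-only trailing part
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        split(new File(args[0]), new File(args[1]), Integer.parseInt(args[2]));
    }
}

A real implementation would also have to honour the sample-format code in the
binary header, since samples are not always 4 bytes wide.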

Re: Loading file to HDFS with custom chunk structure

Posted by Mohammad Tariq <do...@gmail.com>.
First of all, the software will get just the block residing on that DN
and not the entire file.

What is your primary intention? To process the SEGY data using MR,
or through the tool you are talking about? I had tried something similar
through SU, but it didn't quite work for me, and because of the time
constraint I could not continue with it. So I can't comment on that with
100% confidence.

And if you are OK with converting the SEGY files into SequenceFiles
and doing the processing there, then you actually don't need any other tool.
You just have to think about how to implement the processing algorithm you
want as an MR job. In fact, a few processing procedures can actually
be implemented very easily, as libraries are already available for them.
For example, Apache provides libraries for FFT and inverse FFT and so
forth.
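
No specific library is named above, but Apache Commons Math ships an FFT
implementation that fits this description. As a rough illustration (the
library choice and the TraceFft class are assumptions, not details from the
thread), a per-trace transform inside a mapper could call something like:

import org.apache.commons.math3.complex.Complex;
import org.apache.commons.math3.transform.DftNormalization;
import org.apache.commons.math3.transform.FastFourierTransformer;
import org.apache.commons.math3.transform.TransformType;

public class TraceFft {

    /** Forward FFT of one trace's samples; the length must be a power of two. */
    public static Complex[] forward(double[] samples) {
        FastFourierTransformer fft = new FastFourierTransformer(DftNormalization.STANDARD);
        return fft.transform(samples, TransformType.FORWARD);
    }

    /** Inverse FFT back to the (complex) time-domain signal. */
    public static Complex[] inverse(Complex[] spectrum) {
        FastFourierTransformer fft = new FastFourierTransformer(DftNormalization.STANDARD);
        return fft.transform(spectrum, TransformType.INVERSE);
    }
}

Commons Math requires the input length to be a power of two, so traces would
typically be zero-padded before the transform.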

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Tue, Jan 22, 2013 at 8:42 PM, Kaliyug Antagonist <
kaliyugantagonist@gmail.com> wrote:

> I'm already using Cloudera SeismicHadoop and do not wish to take its track.
>
> Suppose there is a software installed on every node that will expect a
> SegY file for processing. Now, suppose I wish to call this software via
> Hadoop Streaming API and expecting that the software must get a reasonably
> large file for processing, I'll have to do something to pull bytes from the
> HDFS, say from a SequenceFile. These bytes must have the fixed textual
> header + fixed binary header + n(trace header + trace data) - how do I
> achieve this?
>
>
> On Wed, Jan 16, 2013 at 9:26 PM, Mohammad Tariq <do...@gmail.com>wrote:
>
>> You might also find this link <https://github.com/cloudera/seismichadoop> useful.
>>
>> Warm Regards,
>> Tariq
>> https://mtariq.jux.com/
>> cloudfront.blogspot.com
>>
>>
>> On Wed, Jan 16, 2013 at 9:19 PM, Mohammad Tariq <do...@gmail.com>wrote:
>>
>>> Since SEGY files are flat binary files, you might have a tough
>>> time in dealing with them as there is no native InputFormat for
>>> that. You can strip off the EBCDIC+Binary header(Initial 3600
>>> Bytes) and store the SEGY file as Sequence Files, where each
>>> trace (Trace Header+Trace Data) would be the value and the
>>> trace no. could be the key.
>>>
>>> Otherwise you have to write a custom InputFormat to deal with
>>> that. It would enhance the performance as well, since Sequence
>>> Files are already in key-value form.
>>>
>>> Warm Regards,
>>> Tariq
>>> https://mtariq.jux.com/
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Wed, Jan 16, 2013 at 9:13 PM, Mohit Anchlia <mo...@gmail.com>wrote:
>>>
>>>> Look at  the block size concept in Hadoop and see if that is what you
>>>> are looking for
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Jan 16, 2013, at 7:31 AM, Kaliyug Antagonist <
>>>> kaliyugantagonist@gmail.com> wrote:
>>>>
>>>> I want to load a SegY <http://en.wikipedia.org/wiki/SEG_Y> file onto
>>>> HDFS of a 3-node Apache Hadoop cluster.
>>>>
>>>> To summarize, the SegY file consists of :
>>>>
>>>>    1. 3200 bytes *textual header*
>>>>    2. 400 bytes *binary header*
>>>>    3. Variable bytes *data*
>>>>
>>>> The 99.99% size of the file is due to the variable bytes data which is
>>>> collection of thousands of contiguous traces. For any SegY file to make
>>>> sense, it must have the textual header+binary header+at least one trace of
>>>> data. What I want to achieve is to split a large SegY file onto the Hadoop
>>>> cluster so that a smaller SegY file is available on each node for local
>>>> processing.
>>>>
>>>> The scenario is as follows:
>>>>
>>>>    1. The SegY file is large in size(above 10GB) and is resting on the
>>>>    local file system of the NameNode machine
>>>>    2. The file is to be split on the nodes in such a way each node has
>>>>    a small SegY file with a strict structure - 3200 bytes *textual
>>>>    header* + 400 bytes *binary header* + variable bytes *data*. Obviously,
>>>>    I can't blindly use FSDataOutputStream or hadoop fs -copyFromLocal
>>>>    as this may not ensure the format in which the chunks of the larger file
>>>>    are required
>>>>
>>>> Please guide me as to how I must proceed.
>>>>
>>>> Thanks and regards !
>>>>
>>>>
>>>
>>
>

Re: Loading file to HDFS with custom chunk structure

Posted by Kaliyug Antagonist <ka...@gmail.com>.
I'm already using Cloudera SeismicHadoop and do not wish to go down that route.

Suppose there is software installed on every node that expects a SegY
file for processing. Now, suppose I wish to call this software via the Hadoop
Streaming API. Expecting that the software must get a reasonably large
file for processing, I'll have to do something to pull bytes from HDFS,
say from a SequenceFile. These bytes must form the fixed textual header +
fixed binary header + n × (trace header + trace data) - how do I achieve this?
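
One possible shape for that, sketched under two assumptions the thread leaves
open (the traces were stored as <IntWritable trace number, BytesWritable trace
bytes> records, and the 3600 header bytes were kept in a separate small HDFS
file when the SequenceFile was built):

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;

/** Rebuilds a small, self-contained SegY file on local disk from HDFS data. */
public class SegyReassembler {

    public static void rebuild(Configuration conf, Path headerFile, Path seqFile,
                               String localOut, int maxTraces) throws IOException {
        FileSystem fs = FileSystem.get(conf);

        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(localOut)))) {

            // 1. Copy the fixed 3600-byte file header (textual + binary) verbatim.
            byte[] header = new byte[3600];
            try (FSDataInputStream in = fs.open(headerFile)) {
                in.readFully(0, header);
            }
            out.write(header);

            // 2. Append up to maxTraces whole traces pulled from the SequenceFile.
            IntWritable key = new IntWritable();
            BytesWritable value = new BytesWritable();
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, seqFile, conf);
            try {
                int n = 0;
                while (n < maxTraces && reader.next(key, value)) {
                    out.write(value.getBytes(), 0, value.getLength());
                    n++;
                }
            } finally {
                reader.close();
            }
        }
    }
}

The rebuilt local file could then be handed to the native software by a
streaming mapper or a wrapper script on each node.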


On Wed, Jan 16, 2013 at 9:26 PM, Mohammad Tariq <do...@gmail.com> wrote:

> You might also find this link <https://github.com/cloudera/seismichadoop> useful.
>
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
>
> On Wed, Jan 16, 2013 at 9:19 PM, Mohammad Tariq <do...@gmail.com>wrote:
>
>> Since SEGY files are flat binary files, you might have a tough
>> time in dealing with them as there is no native InputFormat for
>> that. You can strip off the EBCDIC+Binary header(Initial 3600
>> Bytes) and store the SEGY file as Sequence Files, where each
>> trace (Trace Header+Trace Data) would be the value and the
>> trace no. could be the key.
>>
>> Otherwise you have to write a custom InputFormat to deal with
>> that. It would enhance the performance as well, since Sequence
>> Files are already in key-value form.
>>
>> Warm Regards,
>> Tariq
>> https://mtariq.jux.com/
>> cloudfront.blogspot.com
>>
>>
>> On Wed, Jan 16, 2013 at 9:13 PM, Mohit Anchlia <mo...@gmail.com>wrote:
>>
>>> Look at  the block size concept in Hadoop and see if that is what you
>>> are looking for
>>>
>>> Sent from my iPhone
>>>
>>> On Jan 16, 2013, at 7:31 AM, Kaliyug Antagonist <
>>> kaliyugantagonist@gmail.com> wrote:
>>>
>>> I want to load a SegY <http://en.wikipedia.org/wiki/SEG_Y> file onto
>>> HDFS of a 3-node Apache Hadoop cluster.
>>>
>>> To summarize, the SegY file consists of :
>>>
>>>    1. 3200 bytes *textual header*
>>>    2. 400 bytes *binary header*
>>>    3. Variable bytes *data*
>>>
>>> The 99.99% size of the file is due to the variable bytes data which is
>>> collection of thousands of contiguous traces. For any SegY file to make
>>> sense, it must have the textual header+binary header+at least one trace of
>>> data. What I want to achieve is to split a large SegY file onto the Hadoop
>>> cluster so that a smaller SegY file is available on each node for local
>>> processing.
>>>
>>> The scenario is as follows:
>>>
>>>    1. The SegY file is large in size(above 10GB) and is resting on the
>>>    local file system of the NameNode machine
>>>    2. The file is to be split on the nodes in such a way each node has
>>>    a small SegY file with a strict structure - 3200 bytes *textual
>>>    header* + 400 bytes *binary header* + variable bytes *data*. Obviously,
>>>    I can't blindly use FSDataOutputStream or hadoop fs -copyFromLocal
>>>    as this may not ensure the format in which the chunks of the larger file
>>>    are required
>>>
>>> Please guide me as to how I must proceed.
>>>
>>> Thanks and regards !
>>>
>>>
>>
>

Re: Loading file to HDFS with custom chunk structure

Posted by Mohammad Tariq <do...@gmail.com>.
You might also find this link <https://github.com/cloudera/seismichadoop> useful.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Wed, Jan 16, 2013 at 9:19 PM, Mohammad Tariq <do...@gmail.com> wrote:

> Since SEGY files are flat binary files, you might have a tough
> time in dealing with them as there is no native InputFormat for
> that. You can strip off the EBCDIC+Binary header(Initial 3600
> Bytes) and store the SEGY file as Sequence Files, where each
> trace (Trace Header+Trace Data) would be the value and the
> trace no. could be the key.
>
> Otherwise you have to write a custom InputFormat to deal with
> that. It would enhance the performance as well, since Sequence
> Files are already in key-value form.
>
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
>
> On Wed, Jan 16, 2013 at 9:13 PM, Mohit Anchlia <mo...@gmail.com>wrote:
>
>> Look at  the block size concept in Hadoop and see if that is what you are
>> looking for
>>
>> Sent from my iPhone
>>
>> On Jan 16, 2013, at 7:31 AM, Kaliyug Antagonist <
>> kaliyugantagonist@gmail.com> wrote:
>>
>> I want to load a SegY <http://en.wikipedia.org/wiki/SEG_Y> file onto
>> HDFS of a 3-node Apache Hadoop cluster.
>>
>> To summarize, the SegY file consists of :
>>
>>    1. 3200 bytes *textual header*
>>    2. 400 bytes *binary header*
>>    3. Variable bytes *data*
>>
>> The 99.99% size of the file is due to the variable bytes data which is
>> collection of thousands of contiguous traces. For any SegY file to make
>> sense, it must have the textual header+binary header+at least one trace of
>> data. What I want to achieve is to split a large SegY file onto the Hadoop
>> cluster so that a smaller SegY file is available on each node for local
>> processing.
>>
>> The scenario is as follows:
>>
>>    1. The SegY file is large in size(above 10GB) and is resting on the
>>    local file system of the NameNode machine
>>    2. The file is to be split on the nodes in such a way each node has a
>>    small SegY file with a strict structure - 3200 bytes *textual header* + 400 bytes
>>    *binary header* + variable bytes *data*. Obviously, I can't blindly
>>    use FSDataOutputStream or hadoop fs -copyFromLocal as this may not ensure
>>    the format in which the chunks of the larger file are required
>>
>> Please guide me as to how I must proceed.
>>
>> Thanks and regards !
>>
>>
>

Re: Loading file to HDFS with custom chunk structure

Posted by Mohammad Tariq <do...@gmail.com>.
Since SEGY files are flat binary files, you might have a tough
time dealing with them, as there is no native InputFormat for
them. You can strip off the EBCDIC + binary header (the initial 3600
bytes) and store the SEGY file as SequenceFiles, where each
trace (trace header + trace data) would be the value and the
trace number could be the key.

Otherwise you have to write a custom InputFormat to deal with
the raw format. The SequenceFile route would enhance performance as well,
since SequenceFiles are already in key-value form.
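
A minimal sketch of the SequenceFile conversion suggested above (illustrative
only: it assumes fixed-length traces of 240 header bytes plus ns 4-byte samples,
with ns taken from the binary header, and picks IntWritable/BytesWritable as the
key/value types):

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;

/** Converts a local SEG Y file into an HDFS SequenceFile of <trace no, trace bytes>. */
public class SegyToSequenceFile {

    public static void convert(Configuration conf, String localSegy, Path hdfsOut)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);

        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(localSegy)))) {

            byte[] fileHeader = new byte[3600];          // 3200 textual + 400 binary
            in.readFully(fileHeader);
            // Samples per trace: 2-byte big-endian field at file offset 3220.
            int ns = ((fileHeader[3220] & 0xFF) << 8) | (fileHeader[3221] & 0xFF);
            byte[] trace = new byte[240 + ns * 4];

            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, hdfsOut, IntWritable.class, BytesWritable.class);
            try {
                int traceNo = 0;
                while (true) {
                    try {
                        in.readFully(trace);             // one trace header + trace data
                    } catch (EOFException eof) {
                        break;
                    }
                    writer.append(new IntWritable(traceNo++), new BytesWritable(trace));
                }
            } finally {
                writer.close();
            }
        }
    }
}

The stripped 3600 header bytes should be kept somewhere (e.g. a small side file
on HDFS) if valid SegY files ever need to be reconstructed later.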

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Wed, Jan 16, 2013 at 9:13 PM, Mohit Anchlia <mo...@gmail.com>wrote:

> Look at  the block size concept in Hadoop and see if that is what you are
> looking for
>
> Sent from my iPhone
>
> On Jan 16, 2013, at 7:31 AM, Kaliyug Antagonist <
> kaliyugantagonist@gmail.com> wrote:
>
> I want to load a SegY <http://en.wikipedia.org/wiki/SEG_Y> file onto HDFS
> of a 3-node Apache Hadoop cluster.
>
> To summarize, the SegY file consists of :
>
>    1. 3200 bytes *textual header*
>    2. 400 bytes *binary header*
>    3. Variable bytes *data*
>
> The 99.99% size of the file is due to the variable bytes data which is
> collection of thousands of contiguous traces. For any SegY file to make
> sense, it must have the textual header+binary header+at least one trace of
> data. What I want to achieve is to split a large SegY file onto the Hadoop
> cluster so that a smaller SegY file is available on each node for local
> processing.
>
> The scenario is as follows:
>
>    1. The SegY file is large in size(above 10GB) and is resting on the
>    local file system of the NameNode machine
>    2. The file is to be split on the nodes in such a way each node has a
>    small SegY file with a strict structure - 3200 bytes *textual header* + 400 bytes
>    *binary header* + variable bytes *data*. Obviously, I can't blindly use
>    FSDataOutputStream or hadoop fs -copyFromLocal as this may not ensure the
>    format in which the chunks of the larger file are required
>
> Please guide me as to how I must proceed.
>
> Thanks and regards !
>
>

Re: Loading file to HDFS with custom chunk structure

Posted by Mohit Anchlia <mo...@gmail.com>.
Look at the block size concept in Hadoop and see if that is what you are looking for.

Sent from my iPhone

On Jan 16, 2013, at 7:31 AM, Kaliyug Antagonist <ka...@gmail.com> wrote:

> I want to load a SegY file onto HDFS of a 3-node Apache Hadoop cluster.
> 
> To summarize, the SegY file consists of :
> 
> 3200 bytes textual header
> 400 bytes binary header
> Variable bytes data
> The 99.99% size of the file is due to the variable bytes data which is collection of thousands of contiguous traces. For any SegY file to make sense, it must have the textual header+binary header+at least one trace of data. What I want to achieve is to split a large SegY file onto the Hadoop cluster so that a smaller SegY file is available on each node for local processing.
> 
> The scenario is as follows:
> 
> The SegY file is large in size(above 10GB) and is resting on the local file system of the NameNode machine
> The file is to be split on the nodes in such a way each node has a small SegY file with a strict structure - 3200 bytes textual header + 400 bytes binary header + variable bytes data. Obviously, I can't blindly use FSDataOutputStream or hadoop fs -copyFromLocal as this may not ensure the format in which the chunks of the larger file are required
> Please guide me as to how I must proceed.
> 
> Thanks and regards !

Re: Loading file to HDFS with custom chunk structure

Posted by Mohammad Tariq <do...@gmail.com>.
Hello there,

    You don't have to split the file. When you push anything into
HDFS, it automatically gets split into small chunks of uniform
size (usually 64 MB or 128 MB). And the MapReduce framework
ensures that each block is processed locally on the node where it
is located.
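
For reference, the chunk (block) size is also controllable per file through the
FileSystem API if the defaults don't fit; a small sketch (the 128 MB value and
the replication factor are just examples):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {

    /** Opens an HDFS output stream whose file will use a 128 MB block size. */
    public static FSDataOutputStream createWithBlockSize(Configuration conf, Path path)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);
        short replication = 3;
        long blockSize = 128L * 1024 * 1024;   // block size for this file only
        return fs.create(path, true, bufferSize, replication, blockSize);
    }
}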

Do you have any specific requirement?

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Wed, Jan 16, 2013 at 9:01 PM, Kaliyug Antagonist <
kaliyugantagonist@gmail.com> wrote:

> I want to load a SegY <http://en.wikipedia.org/wiki/SEG_Y> file onto HDFS
> of a 3-node Apache Hadoop cluster.
>
> To summarize, the SegY file consists of :
>
>    1. 3200 bytes *textual header*
>    2. 400 bytes *binary header*
>    3. Variable bytes *data*
>
> The 99.99% size of the file is due to the variable bytes data which is
> collection of thousands of contiguous traces. For any SegY file to make
> sense, it must have the textual header+binary header+at least one trace of
> data. What I want to achieve is to split a large SegY file onto the Hadoop
> cluster so that a smaller SegY file is available on each node for local
> processing.
>
> The scenario is as follows:
>
>    1. The SegY file is large in size(above 10GB) and is resting on the
>    local file system of the NameNode machine
>    2. The file is to be split on the nodes in such a way each node has a
>>    small SegY file with a strict structure - 3200 bytes *textual header* + 400 bytes
>>    *binary header* + variable bytes *data*. Obviously, I can't blindly
>    FSDataOutputStream or hadoop fs -copyFromLocal as this may not ensure the
>    format in which the chunks of the larger file are required
>
> Please guide me as to how I must proceed.
>
> Thanks and regards !
>
