Posted to hdfs-user@hadoop.apache.org by Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> on 2014/07/19 06:12:32 UTC

what exactly does data in HDFS look like?

And by that I mean: is there an HDFS file type? I feel like I’m missing something. Let’s say I have a HUGE JSON file that I import into HDFS. Does it retain its JSON format in HDFS? What if it’s just random tweets I’m streaming? Is it kind of like a normal disk, where there are all kinds of files sitting on disk in their own format, and it’s just that in HDFS they are spread out over nodes?

B.

Re: what exactly does data in HDFS look like?

Posted by Bertrand Dechoux <de...@gmail.com>.
But basically you are right: it is the same concept as with a classical
file system. A file is seen as a sequence of bytes. For various efficiency
reasons, the whole sequence is not stored as one piece but is first split
into blocks (subsequences). With a local file system, these blocks live on
the local drives. With HDFS, they are somewhere within the cluster (and
replicated, of course).

So really, the filesystem doesn't care about what is inside the file; the
format is something it is entirely oblivious to.
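
To make that concrete, here is a minimal sketch that asks the NameNode
where a file's blocks and their replicas live, through the standard
FileSystem API (the path below is just a placeholder, and it assumes the
client's Configuration points at your cluster):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            FileStatus status =
                    fs.getFileStatus(new Path("/user/bob/tweets.json"));
            // One BlockLocation per block, each listing the datanodes that
            // hold a replica of that block.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        b.getOffset(), b.getLength(),
                        String.join(",", b.getHosts()));
            }
            fs.close();
        }
    }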

Bertrand


On Sat, Jul 19, 2014 at 7:02 AM, Shahab Yunus <sh...@gmail.com>
wrote:

> The data itself is eventually stored in the form of files. Each block of
> the file, and its replicas, are stored as files and directories on
> different nodes. The NameNode keeps and maintains the metadata about each
> file: where its blocks (and replica blocks) exist in the cluster.
>
> As for the format, it is stored as bytes. In the normal case you use the
> FileSystem API's output stream classes (e.g. FSDataOutputStream) to write
> data, and in those instances it is written in byte form (conversion to
> bytes, i.e. serializing the data). When you read the data, you use the
> counterpart input stream classes, and those convert the data from bytes
> back to text (i.e. deserialization). Point being, HDFS is oblivious to
> whether the file was JSON or XML.
>
> This would be more evident if you look at the code for reading from and
> writing to HDFS (writing example below):
>
> https://sites.google.com/site/hadoopandhive/home/how-to-write-a-file-in-hdfs-using-hadoop
>
> Now, on the other hand, if you are using compression or other storage
> formats like Avro or Parquet, then those formats come with their own
> classes which take care of serialization and deserialization.
>
> For basic cases, this should be helpful:
>
> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-3/data-flow
>
> More here on data storage:
>
> http://stackoverflow.com/questions/2358402/where-hdfs-stores-files-locally-by-default
> http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Organization
> https://developer.yahoo.com/hadoop/tutorial/module1.html#data
>
> Regards,
> Shahab
>
>
> On Sat, Jul 19, 2014 at 12:12 AM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   And by that I mean: is there an HDFS file type? I feel like I’m missing
>> something. Let’s say I have a HUGE JSON file that I import into HDFS. Does
>> it retain its JSON format in HDFS? What if it’s just random tweets I’m
>> streaming? Is it kind of like a normal disk, where there are all kinds of
>> files sitting on disk in their own format, and it’s just that in HDFS they
>> are spread out over nodes?
>>
>> B.
>>
>
>

Re: what exactly does data in HDFS look like?

Posted by Shahab Yunus <sh...@gmail.com>.
The data itself is eventually stored in the form of files. Each block of
the file, and its replicas, are stored as files and directories on
different nodes. The NameNode keeps and maintains the metadata about each
file: where its blocks (and replica blocks) exist in the cluster.

As for the format, it is stored as bytes. In the normal case you use the
FileSystem API's output stream classes (e.g. FSDataOutputStream) to write
data, and in those instances it is written in byte form (conversion to
bytes, i.e. serializing the data). When you read the data, you use the
counterpart input stream classes, and those convert the data from bytes
back to text (i.e. deserialization). Point being, HDFS is oblivious to
whether the file was JSON or XML.

This would be more evident if you look at the code for reading from and
writing to HDFS (writing example below):
https://sites.google.com/site/hadoopandhive/home/how-to-write-a-file-in-hdfs-using-hadoop
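
To make that concrete, here is a minimal sketch of that write/read round
trip (the path and the JSON contents are placeholders, and it assumes the
default Configuration picks up your cluster settings). HDFS stores
whatever bytes you hand it; turning those bytes back into JSON, XML or
anything else is the client's job:

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsWriteRead {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/user/bob/example.json"); // placeholder

            // Write: the JSON text is serialized to bytes; HDFS just
            // stores the bytes, block by block.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("{\"tweet\":\"hello hdfs\"}"
                        .getBytes(StandardCharsets.UTF_8));
            }

            // Read: the same bytes come back; interpreting them as JSON
            // (or anything else) is entirely up to the client.
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
            fs.close();
        }
    }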

Now, on the other hand, if you are using compression or other storage
formats like Avro or Parquet, then those formats come with their own
classes which take care of serialization and deserialization.
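
For instance, with Avro a write might look like the sketch below (the
schema, path and record contents are made up for illustration; it assumes
the Avro and Hadoop client jars are on the classpath). Avro's
DataFileWriter does the serialization, and HDFS still only ever sees the
resulting bytes:

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AvroToHdfs {
        public static void main(String[] args) throws Exception {
            // Hypothetical one-field schema; in practice, load your own.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Tweet\","
              + "\"fields\":[{\"name\":\"text\",\"type\":\"string\"}]}");

            FileSystem fs = FileSystem.get(new Configuration());
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(
                         new GenericDatumWriter<GenericRecord>(schema))) {
                // Serialize records into Avro's container format, written
                // straight onto an HDFS output stream.
                writer.create(schema,
                        fs.create(new Path("/user/bob/tweets.avro"), true));
                GenericRecord rec = new GenericData.Record(schema);
                rec.put("text", "hello avro on hdfs");
                writer.append(rec);
            }
            fs.close();
        }
    }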

For basic cases, this should be helpful:
https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-3/data-flow

More here on data storage:
http://stackoverflow.com/questions/2358402/where-hdfs-stores-files-locally-by-default
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Organization
https://developer.yahoo.com/hadoop/tutorial/module1.html#data

Regards,
Shahab


On Sat, Jul 19, 2014 at 12:12 AM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   And by that I mean: is there an HDFS file type? I feel like I’m missing
> something. Let’s say I have a HUGE JSON file that I import into HDFS. Does
> it retain its JSON format in HDFS? What if it’s just random tweets I’m
> streaming? Is it kind of like a normal disk, where there are all kinds of
> files sitting on disk in their own format, and it’s just that in HDFS they
> are spread out over nodes?
>
> B.
>
