Posted to user@hadoop.apache.org by Aleks Laz <al...@none.at> on 2014/11/27 17:49:14 UTC

Newbie Question about 37TB binary storage on HDFS

Dear All.

Since ~2012 we have collected a lot of binary data (JPEGs).

The storage hierarchy looks like this:

                     <YEAR>/<MONTH>/<DAY>
<MOUNT_ROOT>/cams/<ID>/2014/11/19/

The binary data sit in the directories below <DAY>, ~1000 files per 
directory, on an XFS mount.

Because the platform is now growing, we need to build a more 
scalable setup.

I have read

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
http://wiki.apache.org/hadoop/FAQ#HDFS
...
and I hope I have understood the main concepts behind HDFS.

Since the people on this list are far more experienced with Hadoop and 
HDFS than I am, I hope you can answer some basic questions.

Our application is an nginx/php-fpm/PostgreSQL setup.
The target design is nginx + proxy features / php-fpm / $DB / $Storage.

.) Can I mix HDFS for binary data storage and data analysis?

.) What is the preferred way to use HDFS with PHP?
.) How difficult is it to use HDFS with PHP?
    Google has a lot of answers to this question (WebHDFS, NFS, Thrift, 
...) but which one is now 'the' solution and still 'supported' by the 
Hadoop community?
    Btw.: the link on http://wiki.apache.org/hadoop/HDFS-APIs for PHP is 
a 404.
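
    From what I understand, WebHDFS is a plain HTTP REST API, so any 
language with an HTTP client (PHP via curl included) should be able to 
talk to it. A minimal, untested sketch in Java, assuming a hypothetical 
NameNode at namenode.example.com:50070 and the path layout above:

  import java.io.InputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;
  import java.nio.file.Files;
  import java.nio.file.Paths;
  import java.nio.file.StandardCopyOption;

  public class WebHdfsFetch {
      public static void main(String[] args) throws Exception {
          // op=OPEN reads a file; the NameNode replies with a 307
          // redirect to a DataNode, which HttpURLConnection follows
          // automatically for GET requests.
          URL url = new URL("http://namenode.example.com:50070/webhdfs/v1"
                  + "/cams/42/2014/11/19/img_0001.jpg?op=OPEN");
          HttpURLConnection conn = (HttpURLConnection) url.openConnection();
          try (InputStream in = conn.getInputStream()) {
              Files.copy(in, Paths.get("img_0001.jpg"),
                      StandardCopyOption.REPLACE_EXISTING);
          }
      }
  }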


.) What's a good solution for distributing the 37 TB, or the upcoming 
~120 TB?
   [ ] N servers with one 37 TB mountpoint per server?
   [ ] N servers with x TB mountpoints per server?
   [ ] other:

.) Is HDFS a good choice for $Storage?
.) Is HBase a good choice for $DB?
    The DB size is smaller than 1 GB; I would use HBase just for the HA 
features of Hadoop.

.) Since HDFS is a file system, I could use
      /cams , for binary data
      /DB   , for DB storage
      /logs , for log storage
    but is this wise? On 'disk' they live on different RAIDs.

.) Should I plan a dedicated network+NIC for the 'cluster 
communication', as with most other cluster software?
    From what I have read it does not look necessary, but from a 
security point of view, yes.

.) Could the communication between the components (Hadoop, ZooKeeper, 
...) perhaps be set up with TLS?

Thank you very much for reading this mail all the way to this line ;-)

Any feedback is very welcome and appreciated.

Best Regards
Aleks

Re: Newbie Question about 37TB binary storage on HDFS

Posted by Aleks Laz <al...@none.at>.
On 2014-11-27 21:05, Wilm Schumacher wrote:
> On 2014-11-27 20:44, Aleks Laz wrote:

[snipp]

>> Could you please try to answer the questions below for HBase, or
>> should I subscribe to
>> 
>> https://hbase.apache.org/mail-lists.html => User List
>> 
>> and ask there?
> You should ask there, so we don't bother the other people here who are
> not interested ;).

In case anybody is interested:

https://mail-archives.apache.org/mod_mbox/hbase-user/201411.mbox/%3Ccc155b92feb49ee9eb026b0cf2ab5c18@none.at%3E

[snipp]

> As your data is written only once but read often, HBase seems a
> perfect fit for your needs (as opposed to Cassandra or something
> else). See you on the hbase list ;)
> 
> Best,
> 
> Wilm

BR Aleks

Re: Newbie Question about 37TB binary storage on HDFS

Posted by Wilm Schumacher <wi...@cawoom.com>.

On 2014-11-27 20:44, Aleks Laz wrote:
> After a quick look at
> 
> https://hbase.apache.org/book/architecture.html#arch.overview
> 
> this sounds like a real option.
> Is this in the current version?
Nope, you have to compile it into your HBase version. But I didn't do
any performance tests; you should ask the experts.

> Could you please try to answer the questions below for HBase, or
> should I subscribe to
> 
> https://hbase.apache.org/mail-lists.html => User List
> 
> and ask there?
You should ask there, so we don't bother the other people here who are
not interested ;).

> Is this too much for Hadoop or HBase?
Well, it depends. Not for HBase, but possibly for Hadoop. Hadoop is for
streaming LARGE data, but only a "few" files. Because of the design of
HDFS, more specifically the namenode, there is the so-called "small
files problem": every file, directory and block occupies roughly 150
bytes of namenode memory, so the namenode's heap caps the file count.

http://blog.cloudera.com/blog/2009/02/the-small-files-problem/

So you are bound to roughly 20-100 million files, a limit you will reach.
However, Hadoop can use "container files", e.g. SequenceFiles or,
better, MapFiles. So if your data isn't changing very often, you could
pack a day or a month of images into one of these container files and
end up with hundreds of files, which will work quite well. But I think
you could run into latency issues if you need fast fetches of a single
image.
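
A rough, untested sketch of the container idea in Java, with
hypothetical paths, packing one day of images from local disk into a
single SequenceFile keyed by file name:

  import java.io.File;
  import java.nio.file.Files;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class PackDay {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // One container file per camera and day (hypothetical layout).
          Path out = new Path("/cams/42/2014/11/19.seq");
          try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                  SequenceFile.Writer.file(out),
                  SequenceFile.Writer.keyClass(Text.class),
                  SequenceFile.Writer.valueClass(BytesWritable.class))) {
              // Assumes the source directory exists and is non-empty.
              for (File img : new File("/mnt/cams/42/2014/11/19").listFiles()) {
                  // key = original file name, value = raw image bytes
                  byte[] bytes = Files.readAllBytes(img.toPath());
                  writer.append(new Text(img.getName()), new BytesWritable(bytes));
              }
          }
      }
  }

A MapFile adds a sorted index on top of this, which makes single-image
lookups cheaper.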

> I hadn't thought about this aspect.
> 
>> Furthermore, another question: are the images volatile? Meaning: are
>> the images often changed by the application?
> 
> The pictures are more or less static.
> Meaning: once saved to disk, the images are seldom if ever changed.
Okay, so the MapFile plan could work if for some reason you do not want
to use HBase but HDFS directly.

As your data is written only once but read often, HBase seems a perfect
fit for your needs (as opposed to Cassandra or something else). See you
on the hbase list ;)

Best,

Wilm

Re: Newbie Question about 37TB binary storage on HDFS

Posted by Aleks Laz <al...@none.at>.
Hi.

On 2014-11-27 18:16, Wilm Schumacher wrote:
> Hi,
> 
> I would like to open up another option for you: you could pump the
> data into HBase directly.
> 
> Together with
> https://issues.apache.org/jira/browse/HBASE-11339
> this would be a good fit.

Thank you.

After a quick look at

https://hbase.apache.org/book/architecture.html#arch.overview

this sounds like a real option.
Is this in the current version?

Could you please try to answer the questions below for HBase, or should 
I subscribe to

https://hbase.apache.org/mail-lists.html => User List

and ask there?

> And I would like to ask about the mean size of the images. If it is
> ~10 MB (a large but normal-sized image) and you plan to save 120 TB,
> this would be around 12 million images. Is that correct?

The sizes are currently ~1-5 MB, but this could change.

To be honest, there are many more than 22,488,987 files (the count is 
still running) in ~680 <ID> dirs with this hierarchy; 37 TB over ~22.5 
million files works out to roughly 1.6 MB per file on average.

>>                     <YEAR>/<MONTH>/<DAY>
>> <MOUNT_ROOT>/cams/<ID>/2014/11/19/

Is this too much for Hadoop or HBase?
I hadn't thought about this aspect.

> Furthermore, another question: are the images volatile? Meaning: are
> the images often changed by the application?

The pictures are more or less static.
Meaning: once saved to disk, the images are seldom if ever changed.

BR Aleks

> Best,
> 
> Wilm
> 
> On 2014-11-27 17:49, Aleks Laz wrote:
>> Dear All.
>> 
>> Since ~2012 we have collected a lot of binary data (JPEGs).
>> 
>> The storage hierarchy looks like this:
>> 
>>                     <YEAR>/<MONTH>/<DAY>
>> <MOUNT_ROOT>/cams/<ID>/2014/11/19/
>> 
>> The binary data sit in the directories below <DAY>, ~1000 files per
>> directory, on an XFS mount.
>> 
>> Because the platform is now growing, we need to build a more
>> scalable setup.
>> 
>> I have read
>> 
>> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
>> http://wiki.apache.org/hadoop/FAQ#HDFS
>> ...
>> and I hope I have understood the main concepts behind HDFS.
>> 
>> Since the people on this list are far more experienced with Hadoop
>> and HDFS than I am, I hope you can answer some basic questions.
>> 
>> Our application is an nginx/php-fpm/PostgreSQL setup.
>> The target design is nginx + proxy features / php-fpm / $DB /
>> $Storage.
>> 
>> .) Can I mix HDFS for binary data storage and data analysis?
>> 
>> .) What is the preferred way to use HDFS with PHP?
>> .) How difficult is it to use HDFS with PHP?
>>    Google has a lot of answers to this question (WebHDFS, NFS,
>> Thrift, ...) but which one is now 'the' solution and still
>> 'supported' by the Hadoop community?
>>    Btw.: the link on http://wiki.apache.org/hadoop/HDFS-APIs for PHP
>> is a 404.
>> 
>> .) What's a good solution for distributing the 37 TB, or the
>> upcoming ~120 TB?
>>   [ ] N servers with one 37 TB mountpoint per server?
>>   [ ] N servers with x TB mountpoints per server?
>>   [ ] other:
>> 
>> .) Is HDFS a good choice for $Storage?
>> .) Is HBase a good choice for $DB?
>>    The DB size is smaller than 1 GB; I would use HBase just for the
>> HA features of Hadoop.
>> 
>> .) Since HDFS is a file system, I could use
>>      /cams , for binary data
>>      /DB   , for DB storage
>>      /logs , for log storage
>>    but is this wise? On 'disk' they live on different RAIDs.
>> 
>> .) Should I plan a dedicated network+NIC for the 'cluster
>> communication', as with most other cluster software?
>>    From what I have read it does not look necessary, but from a
>> security point of view, yes.
>> 
>> .) Could the communication between the components (Hadoop,
>> ZooKeeper, ...) perhaps be set up with TLS?
>> 
>> Thank you very much for reading this mail all the way to this line ;-)
>> 
>> Any feedback is very welcome and appreciated.
>> 
>> Best Regards
>> Aleks

Re: Newbie Question about 37TB binary storage on HDFS

Posted by Wilm Schumacher <wi...@cawoom.com>.
Hi,

I would like to open up another option for you: you could pump the data
into HBase directly.

Together with
https://issues.apache.org/jira/browse/HBASE-11339
(the "MOB" support for medium-sized objects) this would be a good fit.
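
For illustration only, a rough sketch of the write path against the
HBase 1.0+ client API, assuming a hypothetical table 'images' with one
column family 'd':

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class StoreImage {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          try (Connection conn = ConnectionFactory.createConnection(conf);
               Table table = conn.getTable(TableName.valueOf("images"))) {
              // Row key mirrors the current directory layout:
              // <ID>/<YEAR>/<MONTH>/<DAY>/<file>
              Put put = new Put(Bytes.toBytes("42/2014/11/19/img_0001.jpg"));
              put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("jpg"),
                      java.nio.file.Files.readAllBytes(
                              java.nio.file.Paths.get("img_0001.jpg")));
              table.put(put);
          }
      }
  }

With MOB enabled on the column family, values in the megabyte range are
kept out of the normal compaction path.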

And I would like to ask about the mean size of the images. If it is
~10 MB (a large but normal-sized image) and you plan to save 120 TB,
that would be around 12 million images. Is that correct?

Furthermore, another question: are the images volatile? Meaning: are
the images often changed by the application?

Best,

Wilm

On 2014-11-27 17:49, Aleks Laz wrote:
> Dear All.
> 
> Since ~2012 we have collected a lot of binary data (JPEGs).
> 
> The storage hierarchy looks like this:
> 
>                     <YEAR>/<MONTH>/<DAY>
> <MOUNT_ROOT>/cams/<ID>/2014/11/19/
> 
> The binary data sit in the directories below <DAY>, ~1000 files per
> directory, on an XFS mount.
> 
> Because the platform is now growing, we need to build a more
> scalable setup.
> 
> I have read
> 
> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
> http://wiki.apache.org/hadoop/FAQ#HDFS
> ...
> and I hope I have understood the main concepts behind HDFS.
> 
> Since the people on this list are far more experienced with Hadoop
> and HDFS than I am, I hope you can answer some basic questions.
> 
> Our application is an nginx/php-fpm/PostgreSQL setup.
> The target design is nginx + proxy features / php-fpm / $DB / $Storage.
> 
> .) Can I mix HDFS for binary data storage and data analysis?
> 
> .) What is the preferred way to use HDFS with PHP?
> .) How difficult is it to use HDFS with PHP?
>    Google has a lot of answers to this question (WebHDFS, NFS,
> Thrift, ...) but which one is now 'the' solution and still
> 'supported' by the Hadoop community?
>    Btw.: the link on http://wiki.apache.org/hadoop/HDFS-APIs for PHP
> is a 404.
> 
> .) What's a good solution for distributing the 37 TB, or the
> upcoming ~120 TB?
>   [ ] N servers with one 37 TB mountpoint per server?
>   [ ] N servers with x TB mountpoints per server?
>   [ ] other:
> 
> .) Is HDFS a good choice for $Storage?
> .) Is HBase a good choice for $DB?
>    The DB size is smaller than 1 GB; I would use HBase just for the
> HA features of Hadoop.
> 
> .) Since HDFS is a file system, I could use
>      /cams , for binary data
>      /DB   , for DB storage
>      /logs , for log storage
>    but is this wise? On 'disk' they live on different RAIDs.
> 
> .) Should I plan a dedicated network+NIC for the 'cluster
> communication', as with most other cluster software?
>    From what I have read it does not look necessary, but from a
> security point of view, yes.
> 
> .) Could the communication between the components (Hadoop,
> ZooKeeper, ...) perhaps be set up with TLS?
> 
> Thank you very much for reading this mail all the way to this line ;-)
> 
> Any feedback is very welcome and appreciated.
> 
> Best Regards
> Aleks
