Posted to user@cassandra.apache.org by Willie Slepecki <sc...@gmail.com> on 2013/11/17 06:10:06 UTC

Struggling to understand CFS and its use.

Hi all.  I'm in the bar napkin phase of coming up with a big app.  The
application is going to be a large graph app, so I was drawn to Cassandra
because of Titan, and because Cassandra's replication is far superior to
that of Neo4j and the other open source systems I have looked at.

The last issue I'm dealing with before starting to write code is random
file storage.  Users will be able to upload anything (images, PDFs, etc.),
and I need to put those files somewhere.  (For the record, Amazon S3 is not
an option, long story.)  So I'm looking at either a hugely expensive RAID
array or an insanely complex distributed file system.  Given the budget I'm
dealing with, most likely a distributed file system.

In the past hour or so, I stumbled on CFS.  I think I know what it is, and
that it's not going to work for me, but I just wanted to make sure.

From what I can tell, it is a file system that does not like small files
(15 KB images and such) because for each file you upload, it's going to
allocate a 2 MB block.
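
To put rough numbers on that concern, here is a minimal sketch. It assumes
every file really does occupy at least one whole 2 MB block, which is only my
reading and not something confirmed from the CFS source. The constants and
function names are made up for illustration.

```python
KIB = 1024
MIB = 1024 * KIB

BLOCK_BYTES = 2 * MIB  # assumed per-file allocation unit (the "2 meg block")
FILE_BYTES = 15 * KIB  # the example small image size from this post


def allocated_bytes(file_bytes, block_bytes=BLOCK_BYTES):
    """Bytes consumed if storage is handed out only in whole blocks."""
    # Ceiling division, with a minimum of one block per file.
    blocks = max(1, -(-file_bytes // block_bytes))
    return blocks * block_bytes


def overhead_factor(file_bytes, block_bytes=BLOCK_BYTES):
    """How many times larger the allocation is than the file (file_bytes > 0)."""
    return allocated_bytes(file_bytes, block_bytes) / file_bytes


if __name__ == "__main__":
    print(f"one 15 KB file allocates {allocated_bytes(FILE_BYTES)} bytes")
    print(f"overhead factor: {overhead_factor(FILE_BYTES):.1f}x")
```

If the assumption holds, a library of small images would need well over a
hundred times its logical size on disk, before replication.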

Second, it looks like it's similar to HDFS in that the "FS" is misleading,
and it should probably have been named CDS (Cassandra Data Store).  I mean
that in the sense that it wasn't designed for mapping a drive and dropping
files in with Explorer.  It is intended more as a convenient way to upload
large files of structured data to your analytics engine (MapReduce or
whatever), so that back-end processes can rip them apart and tell you cool
things you didn't know.  Or, for us really old guys, think of it as an easy
way to dump a butt load of data into your data warehouse without having to
write an ETL first; instead you write the ETL when you want to do something
with the data.

Third, it looks like it's commercial, from that stax something company.

Am I wrong about any of this?

Thanks

-- 
You want it fast, cheap, or right.  Pick two!!

Re: Struggling to understand CFS and its use.

Posted by Jon Haddad <jo...@jonhaddad.com>.

Having used (and moved off of) Titan, I do not recommend it as a primary database.  Until it overcomes its extremely unoptimized graph traversals, it will increase the load on your database by several orders of magnitude.

As a secondary analytics database, it might do fine.  Just don’t rely on it for anything time-sensitive.

Jon



Re: Struggling to understand CFS and its use.

Posted by Ben Coverston <be...@datastax.com>.

+1 to what Ed said.

CFS is a good way to satisfy the HDFS requirement when running MR jobs on
Cassandra (you just want to run MR, but you don't want the whole Hadoop
stack). The source data for your MR jobs should be in Cassandra keyspaces
and column families (KS/CFs).




-- 
Ben Coverston
DataStax -- The Apache Cassandra Company

Re: Struggling to understand CFS and its use.

Posted by Edward Capriolo <ed...@gmail.com>.

CFS was written so that Brisk (now defunct) did not need a separate Hadoop
HDFS stack (NameNode + DataNodes) to do MapReduce work. It is better used as
an alternative to HDFS than as a general-purpose distributed file system.



Re: Struggling to understand CFS and its use.

Posted by Robert Coli <rc...@eventbrite.com>.

On Sat, Nov 16, 2013 at 9:10 PM, Willie Slepecki <sc...@gmail.com> wrote:

> The last issue I'm dealing with before starting to write code is random
> file storage.  The application will have the ability to upload whatever,
> images, pdf, etc, and I need to put them somewhere.  (for the record,
> Amazon S3 is not an option, long story)  So I'm looking at a hugely
> expensive raid array, or an insanely complex distributed file system.
>


> <cdfs> From what I can tell, it is a file system that does not like small
> files ...   [not a fs] I mean that in the sense, it wasn't designed to
> map a drive to and drop files in with explorer ... Third, it looks like
> it's commercial, from that stax something company.  ... Am I wrong about
> any of this?
>

No.

If you don't require a POSIX filesystem with locking etc. (and if you do,
you are probably Doing It Wrong), you may want to use MogileFS.

https://code.google.com/p/mogilefs/

Summary:

- distributed file system designed to keep redundant copies of arbitrarily
sized files, which are uploaded and accessed via HTTP
- uses MySQL as the metadata store, so you keep it available in the same
way you (probably already know how to) keep MySQL available
- scales to more files than almost anyone has
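
As a rough illustration of the shape described in that summary (a metadata
store that maps each file key to the HTTP URLs of its redundant copies), here
is a minimal sketch. It is not MogileFS's API: the node URLs, table layout,
and function names are all hypothetical, and SQLite stands in for MySQL only
so the example is self-contained.

```python
import hashlib
import sqlite3

# Hypothetical storage nodes; a real deployment would list actual HTTP hosts.
NODES = [
    "http://store1.example",
    "http://store2.example",
    "http://store3.example",
]


def open_metadata_store():
    """Create an in-memory SQLite table standing in for the MySQL metadata store."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE replica (key TEXT, url TEXT, PRIMARY KEY (key, url))")
    return db


def record_upload(db, key, copies=2):
    """Pick `copies` distinct nodes for `key` and record one URL per replica."""
    if not 1 <= copies <= len(NODES):
        raise ValueError("copies must be between 1 and the number of nodes")
    # Hash the key so placement is deterministic and spread across nodes.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(NODES)
    urls = [f"{NODES[(start + i) % len(NODES)]}/{key}" for i in range(copies)]
    with db:  # commits on success, rolls back on error
        db.executemany(
            "INSERT OR REPLACE INTO replica VALUES (?, ?)",
            [(key, url) for url in urls],
        )
    return urls


def locate(db, key):
    """Return every URL where a copy of `key` was recorded, sorted."""
    rows = db.execute(
        "SELECT url FROM replica WHERE key = ? ORDER BY url", (key,)
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = open_metadata_store()
    record_upload(db, "avatars/42.png", copies=2)
    print(locate(db, "avatars/42.png"))
```

The sketch only tracks where copies live; actually transferring the bytes
would be an HTTP PUT or GET against each recorded URL, and the availability
of the whole thing rests on the metadata database.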

=Rob