Posted to user@cassandra.apache.org by Vasileios Vlachos <va...@gmail.com> on 2012/10/16 17:49:55 UTC

Using Cassandra to store binary files?

Hello All,

We need to store about 40 GB of binary files in a redundant way, and since we
are already using Cassandra for other applications, we were thinking that we
could solve that problem using the same Cassandra cluster. Each
individual file will be approximately 1 MB.

We are thinking that the data structure should be very simple in this
case: one CF with just a single column, which will contain the actual
file. The row key would then uniquely identify each file. Speed is not an
issue when we are retrieving the files; not impacting the other applications
that use Cassandra is more important to us. To prevent performance issues
for those applications on our Cassandra cluster, we think
we should disable key_cache and row_cache for this column family.
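For illustration, a CF along these lines could be created from cassandra-cli with both caches off. This is only a sketch: the name Files and the validator choices are assumptions, and the exact caching syntax depends on the Cassandra version (1.1+ uses the single caching attribute shown here; older releases use separate keys_cached/rows_cached settings):

```
create column family Files
  with key_validation_class = UTF8Type
  and comparator = UTF8Type
  and default_validation_class = BytesType
  and caching = 'NONE';
```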

Has anyone tried this before, or does anyone think this is a bad idea?
Do you think our current plan is sensible? Any input would be much
appreciated. Thank you in advance.

Regards,

Vasilis

Re: Using Cassandra to store binary files?

Posted by "Hiller, Dean" <De...@nrel.gov>.
Astyanax provides a streaming file feature and was written by Netflix, who are probably storing a huge number of files with it. I was going to use that feature for one product, but I never got around to building it. I still use Astyanax under the hood of PlayOrm (we use a combination: some relational data in Cassandra via PlayOrm, plus our own NoSQL work against the raw Astyanax APIs). It gets rid of our need for an RDBMS at all, which is nice.

Later,
Dean


Re: Using Cassandra to store binary files?

Posted by Vasileios Vlachos <va...@gmail.com>.
Hello,

Thank you all for your responses.

Performance is not an issue at all, as I described, so this shouldn't be
problematic; at least that is our current understanding. We will try it and
post back if anything interesting comes up. Many thanks.

Regards,

Vasilis




Re: Using Cassandra to store binary files?

Posted by "Hiller, Dean" <De...@nrel.gov>.
I am not sure. If I were to implement it myself, though, I would probably
have postfixed the row keys with 1, 2, 3, 4, … <lastValue> and stored the
lastValue in the first row, so that my program knows all the rows.

I.e., I am not sure an index is really needed in that case.

Dean
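The scheme described above can be sketched in a few lines of Python. This is only an illustration of the key layout, not a real Cassandra client: the dict stands in for the column family, and all names (store_file, read_file, CHUNK_SIZE) are made up.

```python
CHUNK_SIZE = 1024 * 1024  # 1 MB chunks, well under the default 16 MB Thrift limit

def store_file(cf, name, data):
    """Split data into rows name:1, name:2, ... and record the chunk
    count (the lastValue above) in the head row so readers know all the rows."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)] or [b""]
    cf[name] = str(len(chunks)).encode()  # head row holds lastValue
    for n, chunk in enumerate(chunks, start=1):
        cf["%s:%d" % (name, n)] = chunk

def read_file(cf, name):
    """Read the head row for the chunk count, then reassemble in order."""
    last = int(cf[name].decode())
    return b"".join(cf["%s:%d" % (name, n)] for n in range(1, last + 1))

cf = {}  # stand-in for the column family
blob = b"x" * (3 * CHUNK_SIZE + 5)
store_file(cf, "report.bin", blob)
assert read_file(cf, "report.bin") == blob
```

No separate index is needed because the head row alone tells a reader how many chunk rows to fetch.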



Re: Using Cassandra to store binary files?

Posted by Michael Kjellman <mk...@barracuda.com>.
Ah, so they just wrote chunking into Astyanax? Do they create an index
somewhere so they know how to reassemble the file on the way out?




Re: Using Cassandra to store binary files?

Posted by "Hiller, Dean" <De...@nrel.gov>.
Yes, Astyanax stores the file in many rows, so it reads from many disks, giving you a performance advantage vs. storing each file in one row. At least that is my understanding, so read performance "should" be really good in that case.

Dean
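As a rough illustration of why spreading a file over many rows can help reads: the chunk rows can be fetched concurrently and joined in order. fetch_chunk here is a made-up stand-in for a real per-row Cassandra read, not Astyanax's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_chunk(name, n):
    # Stand-in for a real read of row "name:n"; with the file spread
    # across many rows, each call could hit a different node/disk.
    return ("%s:%d" % (name, n)).encode()

def read_chunks(name, count, workers=8):
    """Issue all chunk reads in parallel, then reassemble them in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch_chunk, name, n) for n in range(1, count + 1)]
    return b"".join(f.result() for f in futures)

print(read_chunks("f", 3))  # b'f:1f:2f:3'
```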


Re: Using Cassandra to store binary files?

Posted by Michael Kjellman <mk...@barracuda.com>.
When we started with Cassandra in production almost 2 years ago, it was originally for the sole purpose of storing blobs in a redundant way. I ignored the warnings, as my own tests showed it would be okay (and two years later it is "ok"). If you plan on using Cassandra for more later, be careful: now that features such as secondary indexes and CQL have matured, I'm stuck with a large amount of data in Cassandra that could maybe be in a better place. Does it work? Yes. Would I do it again? Not 100% sure. Compactions of these column families take forever.

Also, by default there is a 16MB limit. Yes, this is adjustable but currently Thrift does not stream data. I didn't know that Netflix had worked around this (referring to Dean's reply) — I'll have to look through the source to see how they are overcoming the limitations of the protocol. Last I read there were no plans to make Thrift stream. Looks like there is a bug at https://issues.apache.org/jira/browse/CASSANDRA-265

You might want to take a look at the following page: http://wiki.apache.org/cassandra/CassandraLimitations

I wanted an easy key-value store when I originally picked Cassandra. As our project needs changed and Cassandra began playing a more critical role as it matured (since the 0.7 days), in retrospect HDFS might have been a better option long term: I will really never need indexing etc. on my binary blobs, and being able to grab/reassemble a file by its key was convenient at the time but maybe not the most forward thinking. Hope that helps a bit.

Also, your read performance won't be amazing by any means with blobs. Not sure if your priority is reads or writes. In our case it was writes so it wasn't a large loss.

Best,
michael

