Posted to user@cassandra.apache.org by Srinivasa T N <se...@gmail.com> on 2015/01/02 11:45:24 UTC

Storing large files for later processing through hadoop

Hi All,
   The problem I am trying to address is: store the raw files (the files
are in XML format and around 700 MB each) in Cassandra, later fetch them
and process them in a Hadoop cluster, and write the processed data back
into Cassandra.  Regarding this, I wanted a few clarifications:

1) The FAQ (
https://wiki.apache.org/cassandra/FAQ#large_file_and_blob_storage) says
that I can only have files of around 64 MB, but at the same time it points
to the JIRA issue https://issues.apache.org/jira/browse/CASSANDRA-16,
which was resolved back in version 0.6.  So, in the present version of
Cassandra (2.0.11), is there any limit on the size of a file stored in a
column, and if so, what is it?
2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch the
file from Cassandra to HDFS when I want to process it in the Hadoop cluster?

Regards,
Seenu.

Re: Storing large files for later processing through hadoop

Posted by mck <mi...@apache.org>.
> Since the Hadoop MR streaming job requires the file to be processed to be present in HDFS,
> I was thinking whether it can get the file directly from Cassandra instead of me manually
> fetching it and placing it in a directory before submitting the Hadoop job?


Hadoop M/R can get data directly from Cassandra. See CqlInputFormat.
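
For example, something along these lines (untested sketch: the "audit"
keyspace, "raw_xml" table and contact point are made up, and the helper
methods shift a little between 2.x releases, so check the
hadoop_cql3_word_count example shipped with your Cassandra version):

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraInputSetup {

    // Wire a Hadoop job to read its input splits straight from a
    // Cassandra table instead of from files in HDFS.
    static Job newCassandraSourcedJob(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "process-raw-xml-from-cassandra");
        job.setJarByClass(CassandraInputSetup.class);
        job.setInputFormatClass(CqlInputFormat.class);

        // Where and what to read: any contact point, then keyspace and table.
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "cassandra-node1");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "audit", "raw_xml");
        ConfigHelper.setInputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
        CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "100");

        // Mapper, reducer and the output side are set as usual; with
        // CqlInputFormat the mapper is handed one CQL row per call.
        return job;
    }
}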

~mck

Re: Storing large files for later processing through hadoop

Posted by Wilm Schumacher <wi...@gmail.com>.
On 03.01.2015 at 07:07, Srinivasa T N wrote:
> Hi Wilm,
>    The reason is that for auditing purposes, I want to store the
> original files as well.
Well, then I would use an HDFS cluster for storage, as that seems to be
exactly what you need. If you colocate the HDFS DataNodes with YARN's
ResourceManager, you could also save a lot of hardware or the cost of
external services. Colocating them is not generally recommended, but in
your special case it should work, since you only use HDFS to store the
XML files for that one purpose.

But I'm more familiar with Hadoop, HDFS and HBase than with Cassandra,
so perhaps I'm biased.

And what Jacob proposed could be a solution, too. It would spare you a lot of headaches ;).

Best wishes,

Wilm


Re: Storing large files for later processing through hadoop

Posted by Jacob Rhoden <ja...@me.com>.
If it's for auditing, I'd recommend pushing the files out somewhere external. Amazon S3 works well for this type of thing, and you don't have to worry too much about backups and the like.
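
For example, with the AWS SDK for Java it is little more than (untested
sketch: the bucket name and key prefix are made up, credentials come from
the default provider chain):

import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;

public class AuditUpload {
    public static void main(String[] args) {
        AmazonS3 s3 = new AmazonS3Client();
        File raw = new File(args[0]);
        // Key the raw XML by date so old files are easy to find, or to
        // expire later with a bucket lifecycle rule.
        s3.putObject("my-audit-bucket", "raw-xml/2015/01/" + raw.getName(), raw);
    }
}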

______________________________
Sent from iPhone

> On 3 Jan 2015, at 5:07 pm, Srinivasa T N <se...@gmail.com> wrote:
> 
> Hi Wilm,
>    The reason is that for auditing purposes, I want to store the original files as well.
> 
> Regards,
> Seenu.
> 
>> On Fri, Jan 2, 2015 at 11:09 PM, Wilm Schumacher <wi...@gmail.com> wrote:
>> Hi,
>> 
>> perhaps I totally misunderstood your problem, but why "bother" with
>> Cassandra for the storage in the first place?
>> 
>> If your Hadoop MR job is only run once for each file (as you wrote
>> above), why not copy the data directly to HDFS, run your MR job, and use
>> Cassandra as the sink?
>> 
>> Since HDFS and YARN are more or less completely independent, you could
>> perhaps use the "master" as the ResourceManager (YARN) as well as the
>> NameNode and a DataNode (HDFS), launch your MR job directly, and, as
>> mentioned, use Cassandra as the sink for the reduced data. That way you
>> won't need dedicated hardware, since you only need HDFS once: process
>> the files and delete them afterwards.
>> 
>> Best wishes,
>> 
>> Wilm
> 

Re: Storing large files for later processing through hadoop

Posted by Srinivasa T N <se...@gmail.com>.
Hi Wilm,
   The reason is that for auditing purposes, I want to store the
original files as well.

Regards,
Seenu.

On Fri, Jan 2, 2015 at 11:09 PM, Wilm Schumacher <wi...@gmail.com>
wrote:

> Hi,
>
> perhaps I totally misunderstood your problem, but why "bother" with
> Cassandra for the storage in the first place?
>
> If your Hadoop MR job is only run once for each file (as you wrote
> above), why not copy the data directly to HDFS, run your MR job, and use
> Cassandra as the sink?
>
> Since HDFS and YARN are more or less completely independent, you could
> perhaps use the "master" as the ResourceManager (YARN) as well as the
> NameNode and a DataNode (HDFS), launch your MR job directly, and, as
> mentioned, use Cassandra as the sink for the reduced data. That way you
> won't need dedicated hardware, since you only need HDFS once: process
> the files and delete them afterwards.
>
> Best wishes,
>
> Wilm
>

Re: Storing large files for later processing through hadoop

Posted by Wilm Schumacher <wi...@gmail.com>.
Hi,

perhaps I totally misunderstood your problem, but why "bother" with
Cassandra for the storage in the first place?

If your Hadoop MR job is only run once for each file (as you wrote
above), why not copy the data directly to HDFS, run your MR job, and use
Cassandra as the sink?

Since HDFS and YARN are more or less completely independent, you could
perhaps use the "master" as the ResourceManager (YARN) as well as the
NameNode and a DataNode (HDFS), launch your MR job directly, and, as
mentioned, use Cassandra as the sink for the reduced data. That way you
won't need dedicated hardware, since you only need HDFS once: process
the files and delete them afterwards.
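
To make the "Cassandra as the sink" part concrete, the output side of such
a job looks roughly like this with CqlOutputFormat, modelled on the
hadoop_cql3_word_count example in the Cassandra source tree (untested
sketch: the reports.metrics table, its columns and the contact point are
made up, and the helpers vary a bit between 2.x releases):

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlOutputFormat;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class CassandraSinkSetup {

    // Driver side: the reduced data goes straight into a Cassandra table.
    static void configureOutput(Job job) {
        job.setOutputFormatClass(CqlOutputFormat.class);
        job.setOutputKeyClass(Map.class);    // bound primary key columns
        job.setOutputValueClass(List.class); // bound '?' values of the UPDATE

        ConfigHelper.setOutputInitialAddress(job.getConfiguration(), "cassandra-node1");
        ConfigHelper.setOutputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
        ConfigHelper.setOutputColumnFamily(job.getConfiguration(), "reports", "metrics");
        CqlConfigHelper.setOutputCql(job.getConfiguration(),
                "UPDATE reports.metrics SET value = ?");
    }

    // Reducer side: emit one UPDATE per key; the map carries the primary
    // key columns, the list carries the bound values in order.
    static class ToCassandra
            extends Reducer<Text, IntWritable, Map<String, ByteBuffer>, List<ByteBuffer>> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : vals)
                sum += v.get();

            Map<String, ByteBuffer> keys = new LinkedHashMap<String, ByteBuffer>();
            keys.put("metric_name", ByteBufferUtil.bytes(key.toString()));

            List<ByteBuffer> variables = new ArrayList<ByteBuffer>();
            variables.add(ByteBufferUtil.bytes(sum));

            ctx.write(keys, variables);
        }
    }
}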

Best wishes,

Wilm

Re: Storing large files for later processing through hadoop

Posted by Srinivasa T N <se...@gmail.com>.
I agree that Cassandra is a columnar store.  Storing the raw XML file,
parsing it with Hadoop, and then storing the extracted values happens
only once.  The extracted data, on which further operations will be
done, fits well with the time-series storage Cassandra provides, and
that is why I am trying to make it do something it was not designed for.

Regards,
Seenu.



On Fri, Jan 2, 2015 at 10:42 PM, Eric Stevens <mi...@gmail.com> wrote:

> > Can this split and combine be done automatically by cassandra when
> inserting/fetching the file without application being bothered about it?
>
> There are client libraries which offer recipes for this, but in general,
> no.
>
> You're trying to do something with Cassandra that it's not designed to
> do.  You can get there from here, but you're not going to have a good
> time.  If you need a document store, you should use a NoSQL solution
> designed with that in mind (Cassandra is a columnar store).  If you need a
> distributed filesystem, you should use one of those.
>
> If you do want to continue forward and do this with Cassandra, then you
> should definitely not do it on the same cluster that handles normal clients,
> as the kind of workload you'd be subjecting this cluster to is going to
> cause all sorts of trouble for normal clients, particularly with respect
> to GC pressure, compaction and streaming problems, and many other
> consequences of vastly exceeding recommended limits.
>
> On Fri, Jan 2, 2015 at 9:53 AM, Srinivasa T N <se...@gmail.com> wrote:
>
>>
>>
>> On Fri, Jan 2, 2015 at 5:54 PM, mck <mi...@apache.org> wrote:
>>
>>>
>>> You could manually chunk them down to 64 MB pieces.
>>>
>>> Can this split and combine be done automatically by cassandra when
>> inserting/fetching the file without application being bothered about it?
>>
>>
>>>
>>> > 2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch
>>> > the file from Cassandra to HDFS when I want to process it in the Hadoop
>>> cluster?
>>>
>>>
>>> We keep HDFS as a volatile filesystem simply for hadoop internals. No
>>> need for backups of it, no need to upgrade data, and we're free to wipe
>>> it whenever hadoop has been stopped.
>>> ~mck
>>>
>>
>> Since the Hadoop MR streaming job requires the file to be processed to be
>> present in HDFS, I was thinking whether it can get the file directly from
>> Cassandra instead of me manually fetching it and placing it in a directory
>> before submitting the Hadoop job?
>>
>>
>> >> There was a DataStax project a while back aimed at replacing HDFS with
>> >> Cassandra, but I don't think it's alive anymore.
>>
>> I think you are referring to Brisk project (
>> http://blog.octo.com/en/introduction-to-datastax-brisk-an-hadoop-and-cassandra-distribution/)
>> but I don't know its current status.
>>
>> Can I use http://gerrymcnicol.azurewebsites.net/ for my task in hand?
>>
>> Regards,
>> Seenu.
>>
>
>

Re: Storing large files for later processing through hadoop

Posted by Eric Stevens <mi...@gmail.com>.
> Can this split and combine be done automatically by cassandra when
inserting/fetching the file without application being bothered about it?

There are client libraries which offer recipes for this, but in general,
no.

You're trying to do something with Cassandra that it's not designed to do.
You can get there from here, but you're not going to have a good time.  If
you need a document store, you should use a NoSQL solution designed with
that in mind (Cassandra is a columnar store).  If you need a distributed
filesystem, you should use one of those.

If you do want to continue forward and do this with Cassandra, then you
should definitely not do it on the same cluster that handles normal clients,
as the kind of workload you'd be subjecting this cluster to is going to
cause all sorts of trouble for normal clients, particularly with respect
to GC pressure, compaction and streaming problems, and many other
consequences of vastly exceeding recommended limits.

On Fri, Jan 2, 2015 at 9:53 AM, Srinivasa T N <se...@gmail.com> wrote:

>
>
> On Fri, Jan 2, 2015 at 5:54 PM, mck <mi...@apache.org> wrote:
>
>>
>> You could manually chunk them down to 64 MB pieces.
>>
>> Can this split and combine be done automatically by cassandra when
> inserting/fetching the file without application being bothered about it?
>
>
>>
>> > 2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch
>> > the file from Cassandra to HDFS when I want to process it in the Hadoop
>> cluster?
>>
>>
>> We keep HDFS as a volatile filesystem simply for hadoop internals. No
>> need for backups of it, no need to upgrade data, and we're free to wipe
>> it whenever hadoop has been stopped.
>> ~mck
>>
>
> Since the Hadoop MR streaming job requires the file to be processed to be
> present in HDFS, I was thinking whether it can get the file directly from
> Cassandra instead of me manually fetching it and placing it in a directory
> before submitting the Hadoop job?
>
>
> >> There was a DataStax project a while back aimed at replacing HDFS with
> >> Cassandra, but I don't think it's alive anymore.
>
> I think you are referring to Brisk project (
> http://blog.octo.com/en/introduction-to-datastax-brisk-an-hadoop-and-cassandra-distribution/)
> but I don't know its current status.
>
> Can I use http://gerrymcnicol.azurewebsites.net/ for my task in hand?
>
> Regards,
> Seenu.
>

Re: Storing large files for later processing through hadoop

Posted by Srinivasa T N <se...@gmail.com>.
On Fri, Jan 2, 2015 at 5:54 PM, mck <mi...@apache.org> wrote:

>
> You could manually chunk them down to 64 MB pieces.
>
> Can this split and combine be done automatically by cassandra when
inserting/fetching the file without application being bothered about it?


>
> > 2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch
> > the file from Cassandra to HDFS when I want to process it in the Hadoop
> cluster?
>
>
> We keep HDFS as a volatile filesystem simply for hadoop internals. No
> need for backups of it, no need to upgrade data, and we're free to wipe
> it whenever hadoop has been stopped.
> ~mck
>

Since the Hadoop MR streaming job requires the file to be processed to be
present in HDFS, I was thinking whether it can get the file directly from
Cassandra instead of me manually fetching it and placing it in a directory
before submitting the Hadoop job?


>> There was a DataStax project a while back aimed at replacing HDFS with
>> Cassandra, but I don't think it's alive anymore.

I think you are referring to Brisk project (
http://blog.octo.com/en/introduction-to-datastax-brisk-an-hadoop-and-cassandra-distribution/)
but I don't know its current status.

Can I use http://gerrymcnicol.azurewebsites.net/ for my task in hand?

Regards,
Seenu.

Re: Storing large files for later processing through hadoop

Posted by mck <mi...@apache.org>.
> 1) The FAQ … says that I can only have files of around 64 MB …

See http://wiki.apache.org/cassandra/CassandraLimitations
 "A single column value may not be larger than 2GB; in practice, "single
 digits of MB" is a more reasonable limit, since there is no streaming
 or random access of blob values."

CASSANDRA-16  only covers pushing those objects through compaction.
Getting the objects in and out of the heap during normal requests is
still a problem.

You could manually chunk them down to 64 MB pieces.
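
For example, with the DataStax Java driver the write side is roughly
(untested sketch: the files.chunks table, the contact point and the 1 MB
chunk size are made up, pick whatever fits your heap):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class ChunkedUpload {

    // One row per chunk, clustered by chunk number so the file can be
    // reassembled with a single partition read:
    //   CREATE TABLE files.chunks (
    //       file_id text, chunk_no int, data blob,
    //       PRIMARY KEY (file_id, chunk_no));
    private static final int CHUNK_SIZE = 1024 * 1024;

    public static void main(String[] args) throws IOException {
        Cluster cluster = Cluster.builder().addContactPoint("cassandra-node1").build();
        Session session = cluster.connect();
        PreparedStatement insert = session.prepare(
                "INSERT INTO files.chunks (file_id, chunk_no, data) VALUES (?, ?, ?)");

        InputStream in = new FileInputStream(args[0]);
        byte[] buf = new byte[CHUNK_SIZE];
        int chunkNo = 0;
        int read;
        while ((read = in.read(buf)) > 0) {
            ByteBuffer data = ByteBuffer.wrap(Arrays.copyOf(buf, read));
            session.execute(insert.bind(args[0], chunkNo++, data));
        }
        in.close();
        cluster.close();
    }
}

Reading it back is the reverse: select the whole partition for the file_id
and concatenate the chunks in clustering order.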


> 2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch
> the file from Cassandra to HDFS when I want to process it in the Hadoop cluster?


We keep HDFS as a volatile filesystem simply for hadoop internals. No
need for backups of it, no need to upgrade data, and we're free to wipe
it whenever hadoop has been stopped. 

Otherwise all our hadoop jobs still read from and write to Cassandra.
Cassandra is our "big data" platform, with hadoop/spark just providing
additional aggregation abilities. I think this is the more effective
approach, rather than trying to completely gut out HDFS.

There was a DataStax project a while back aimed at replacing HDFS with
Cassandra, but I don't think it's alive anymore.

~mck