Posted to dev@cassandra.apache.org by Amit Kumar <ku...@gmail.com> on 2013/03/23 20:43:14 UTC

reading sstables stored in hdfs

I am starting some work on an input format that would let us read
sstables stored in HDFS, and I wonder whether anyone has worked on
something similar before. I did come across

http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html

However, it has not been open-sourced or made available yet.

I am writing for a sanity check before I go too deep into this.

I have a few questions and am hoping someone here will be able to help.

So far, I have been able to read sstables stored on the local file
system using SSTableScanner and SSTableReader. I am wondering what
would be a good way to proceed: would it make sense to write a custom
RandomAccessFile-style reader (like RandomAccessReader and
CompressedRandomAccessReader) that uses Hadoop's FileSystem API?
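
To make the question concrete, here is a minimal sketch of the kind of
reader I have in mind. It is not Cassandra's or Hadoop's API: the names
PositionedSource, ByteArraySource and BufferedSeekableReader are
hypothetical, and the in-memory source only stands in for HDFS so the
example runs with the JDK alone. The assumption is that a real source
would delegate to the positional read that Hadoop's FSDataInputStream
provides, read(long position, byte[] buffer, int offset, int length).

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class SeekableReaderSketch {

    /** Hypothetical abstraction over any store that supports positional reads. */
    interface PositionedSource {
        long length() throws IOException;

        /** Reads up to len bytes starting at position; returns the count, or -1 at end. */
        int read(long position, byte[] buf, int off, int len) throws IOException;
    }

    /** In-memory stand-in for an HDFS-backed source (assumption, for illustration). */
    static final class ByteArraySource implements PositionedSource {
        private final byte[] data;

        ByteArraySource(byte[] data) { this.data = data; }

        public long length() { return data.length; }

        public int read(long position, byte[] buf, int off, int len) {
            if (position >= data.length) return -1;
            int n = (int) Math.min(len, data.length - position);
            System.arraycopy(data, (int) position, buf, off, n);
            return n;
        }
    }

    /** Buffered reader with seek(), in the spirit of a random-access file reader. */
    static final class BufferedSeekableReader {
        private final PositionedSource source;
        private final byte[] buffer;
        private long bufferStart = 0; // file offset of buffer[0]
        private int bufferLen = 0;    // number of valid bytes in buffer
        private long position = 0;    // current logical file offset

        BufferedSeekableReader(PositionedSource source, int bufferSize) {
            this.source = source;
            this.buffer = new byte[bufferSize];
        }

        void seek(long newPosition) throws IOException {
            if (newPosition < 0 || newPosition > source.length()) {
                throw new IOException("seek out of range: " + newPosition);
            }
            position = newPosition; // the buffer is refilled lazily on the next read
        }

        long getFilePointer() { return position; }

        /** Returns the next byte as 0..255, or -1 at end of data. */
        int read() throws IOException {
            if (position < bufferStart || position >= bufferStart + bufferLen) {
                int n = source.read(position, buffer, 0, buffer.length);
                if (n <= 0) return -1;
                bufferStart = position;
                bufferLen = n;
            }
            return buffer[(int) (position++ - bufferStart)] & 0xFF;
        }

        void readFully(byte[] dst) throws IOException {
            for (int i = 0; i < dst.length; i++) {
                int b = read();
                if (b < 0) throw new EOFException();
                dst[i] = (byte) b;
            }
        }
    }

    /** Small demonstration: seek forward, read, seek backward, read again. */
    static String demo() {
        try {
            byte[] data = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
            BufferedSeekableReader r = new BufferedSeekableReader(new ByteArraySource(data), 4);
            byte[] three = new byte[3];
            r.seek(10);
            r.readFully(three);
            byte[] two = new byte[2];
            r.seek(2);
            r.readFully(two);
            return new String(three, StandardCharsets.US_ASCII) + "|"
                    + new String(two, StandardCharsets.US_ASCII) + "|" + r.getFilePointer();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Keeping the positional read behind a small interface is a design guess
on my part: it would let the same buffering and seek logic be tested
locally and then pointed at HDFS by swapping only the source.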


I searched but may have missed it: is there any documentation on the
binary format of the data, index, and stats files? That might make it
simpler for me to prototype without having to go through the Cassandra
internals. I am currently working off our production deployment, which
runs 1.1.0.

Any guidance you can give would be appreciated (I am new to Cassandra
internals).

Many thanks
Amit

Re: reading sstables stored in hdfs

Posted by Amit Kumar <ku...@gmail.com>.

Thanks Jonathan, I have been spending time with them to get to know
them better. Is there any documentation about the on-disk file format
of the data, index, and stats files?


Amit

On Sat, Mar 23, 2013 at 5:14 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> For the gory details you're going to need to explore SSTableReader
> and/or SSTableWriter.
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder, http://www.datastax.com
> @spyced

Re: reading sstables stored in hdfs

Posted by Jonathan Ellis <jb...@gmail.com>.

For the gory details you're going to need to explore SSTableReader
and/or SSTableWriter.

On Sat, Mar 23, 2013 at 7:01 PM, Amit Kumar <ku...@gmail.com> wrote:
> We don't want to set up a parallel workflow for analytics. We
> already use Hadoop for analytics, and it would be trivial to copy
> newly created sstables to HDFS periodically and then have mappers
> read the sstables in parallel. Going through Thrift is an option,
> but an inefficient one, and one that impacts production Cassandra.
>
> Amit



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com
@spyced

Re: reading sstables stored in hdfs

Posted by Amit Kumar <ku...@gmail.com>.

We don't want to set up a parallel workflow for analytics. We already
use Hadoop for analytics, and it would be trivial to copy newly
created sstables to HDFS periodically and then have mappers read the
sstables in parallel. Going through Thrift is an option, but an
inefficient one, and one that impacts production Cassandra.

Amit



On Sat, Mar 23, 2013 at 2:40 PM, Michael Kjellman
<mk...@barracuda.com> wrote:
> Just curious, why would you want to store sstables in HDFS?

Re: reading sstables stored in hdfs

Posted by Michael Kjellman <mk...@barracuda.com>.

Just curious, why would you want to store sstables in HDFS?

On 3/23/13 12:43 PM, "Amit Kumar" <ku...@gmail.com> wrote:

>I am starting some work on an input format that would let us read
>sstables stored in HDFS, and I wonder whether anyone has worked on
>something similar before. I did come across
>
>http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
>
>However, it has not been open-sourced or made available yet.
>
>I am writing for a sanity check before I go too deep into this.
>
>I have a few questions and am hoping someone here will be able to help.
>
>So far, I have been able to read sstables stored on the local file
>system using SSTableScanner and SSTableReader. I am wondering what
>would be a good way to proceed: would it make sense to write a custom
>RandomAccessFile-style reader (like RandomAccessReader and
>CompressedRandomAccessReader) that uses Hadoop's FileSystem API?
>
>
>I searched but may have missed it: is there any documentation on the
>binary format of the data, index, and stats files? That might make it
>simpler for me to prototype without having to go through the Cassandra
>internals. I am currently working off our production deployment, which
>runs 1.1.0.
>
>Any guidance you can give would be appreciated (I am new to Cassandra
>internals).
>
>Many thanks
>Amit


Copy, by Barracuda, helps you store, protect, and share all your amazing
things. Start today: www.copy.com.