Posted to user@cassandra.apache.org by java8964 java8964 <ja...@hotmail.com> on 2013/09/03 20:06:20 UTC

RE: map/reduce performance time and sstable reader…

I am trying to do the same thing. In our project, we want to load data from Cassandra into a Hadoop cluster, and SSTables are one obvious option, since you can get the data changed since the last batch load directly from the SSTable incremental backup files.

But based on my research so far (I may be wrong, as I have only done limited research on SSTables, and I hope someone on this forum can tell me if I am wrong), it may NOT be a good option:

1) sstable2json does not look like a scalable solution for getting data out of Cassandra, and it needs access to the "data" directory to read some metadata from the system keyspace for the column family being dumped, which may not be an option in your MR environment.
2) So far I am thinking of reusing the same API that sstable2json uses, but then I have to provide that metadata to the API myself: validator types, partitioner, etc. I am surprised that, as a backup, the column family SSTable dump files DON'T contain this information themselves. Shouldn't the API be able to work this out from the SSTable files ONLY? A minimal sketch of opening an SSTable through this API follows after this list.
3) The big trouble comes if you want to parse the SSTables in your MR code. Internally, the API loads the index and compression metadata from the Index/CompressionInfo files, which it assumes live in the same place as the data file, and it opens them with local file streams. So if the data files are in a DFS (Distributed File System), I have so far found no way to tell the API to read from a DFS stream instead of a local file input stream. Basically you have 2 options: a) copy the files from the DFS to the local file system (same as what the Knewton guys did at https://github.com/Knewton/KassandraMRHelper), or b) develop your own API to access the SSTable files directly (my guess is that the Netflix guys did it this way; they have a project called "Aegisthus", see here: http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html, but it is not open source).
4) About performance, I am not sure: sstable2json uses the same Cassandra API underneath, but running in MR gives us scalability, as we can reuse the Hadoop framework for the many benefits it brings.
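For anyone wanting to see what point 2 means in code, below is a minimal sketch of the path sstable2json takes, assuming the Cassandra 1.2-era internal classes (Descriptor, SSTableReader, SSTableScanner). These are internals rather than a public API, and names and signatures move between releases, but the sketch shows where the schema dependency from point 1 and the local-file assumption from point 3 come in:

import org.apache.cassandra.config.DatabaseDescriptor;
import org.apache.cassandra.config.Schema;
import org.apache.cassandra.db.DecoratedKey;
import org.apache.cassandra.db.columniterator.OnDiskAtomIterator;
import org.apache.cassandra.io.sstable.Descriptor;
import org.apache.cassandra.io.sstable.SSTableReader;
import org.apache.cassandra.io.sstable.SSTableScanner;

public class SSTableDump
{
    public static void main(String[] args) throws Exception
    {
        // Parse keyspace/column family/generation out of the file name,
        // e.g. .../MyKeyspace/MyCF/MyKeyspace-MyCF-ic-1-Data.db
        Descriptor descriptor = Descriptor.fromFilename(args[0]);

        // The dependency from point 1: the reader needs the schema
        // (comparator, validators, partitioner), which is loaded from the
        // locally configured system keyspace, not from the sstable itself.
        DatabaseDescriptor.loadSchemas();
        if (Schema.instance.getCFMetaData(descriptor.ksname, descriptor.cfname) == null)
            throw new IllegalStateException("unknown column family " + descriptor.cfname);

        // The assumption from point 3: open() finds the -Index.db and
        // -CompressionInfo.db components by rewriting the data file name,
        // and reads them with local file streams.
        SSTableReader reader = SSTableReader.open(descriptor);
        SSTableScanner scanner = reader.getDirectScanner();
        try
        {
            while (scanner.hasNext())
            {
                OnDiskAtomIterator row = scanner.next();
                DecoratedKey key = row.getKey();
                System.out.println(key);
                // row also iterates the columns/atoms of this row
            }
        }
        finally
        {
            scanner.close();
        }
    }
}

Note that nothing in this API takes a stream: the reader is handed a file name and derives the sibling component names from it, which is exactly why pointing it at files sitting in a DFS does not work without copying them local first.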
Yong

> From: Dean.Hiller@nrel.gov
> To: user@cassandra.apache.org
> Date: Fri, 30 Aug 2013 07:25:09 -0600
> Subject: map/reduce performance time and sstable reader…
> 
> Has anyone done performance tests on sstable reading vs. M/R? I did a quick test reading all the SSTables in an LCS column family, 23 tables in all, and took the average time sstable2json needed (writing to /dev/null to make it faster), which was 7 seconds per table (writing to stdout took 16 seconds per table). That works out to an estimate of 12.5 hours, or up to 27 hours going by the stdout numbers. I suspect the map/reduce time may be much worse, since there are not as many repeated rows in LCS?
> 
> I.e. I am wondering if I should just read from SSTables directly instead of using map/reduce? I am about to dig around in the code of M/R and sstable2json to see what each is doing specifically.
> 
> Thanks,
> Dean

Re: map/reduce performance time and sstable reader…

Posted by "Hiller, Dean" <De...@nrel.gov>.
We are considering creating our own InputFormat for Hadoop and running the tasktrackers on every 3rd node (i.e. RF=3) so that we cover all ranges; a skeleton of this idea is sketched below. On our current data set, our M/R overhead appears to be 13 days vs. 12.5 hours for just reading SSTables directly.

I personally don't think parsing SSTables (using the Hadoop M/R framework) is a big deal for us, since we run task trackers on the cassandra nodes we need them on. I.e. we don't need to copy to a DFS to do this, I believe (at least not in our situation).
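For the archives, a skeleton of that InputFormat idea might look like the code below. This is a hypothetical sketch, not our actual code: LocalSSTableInputFormat and SSTableRecordReader are invented names, each local -Data.db file is treated as one unsplittable input, and it leans on the same 1.2-era internal Cassandra classes discussed earlier in the thread, whose signatures shift between releases:

import java.io.IOException;

import org.apache.cassandra.config.DatabaseDescriptor;
import org.apache.cassandra.db.columniterator.OnDiskAtomIterator;
import org.apache.cassandra.io.sstable.Descriptor;
import org.apache.cassandra.io.sstable.SSTableReader;
import org.apache.cassandra.io.sstable.SSTableScanner;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical: one map task per -Data.db file, read with the Cassandra
// sstable API on the local disk of a tasktracker co-located with the node.
public class LocalSSTableInputFormat extends FileInputFormat<BytesWritable, NullWritable>
{
    @Override
    protected boolean isSplitable(JobContext context, Path filename)
    {
        // The data file is only decodable together with its local
        // Index/CompressionInfo siblings, so never split it across maps.
        return false;
    }

    @Override
    public RecordReader<BytesWritable, NullWritable> createRecordReader(InputSplit split,
                                                                        TaskAttemptContext context)
    {
        return new SSTableRecordReader();
    }

    // Emits one record per sstable row: just the raw row key here; a real
    // version would also serialize the columns into the value.
    static class SSTableRecordReader extends RecordReader<BytesWritable, NullWritable>
    {
        private SSTableScanner scanner;
        private final BytesWritable key = new BytesWritable();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException
        {
            try
            {
                // Same schema dependency as discussed earlier in the thread.
                DatabaseDescriptor.loadSchemas();
            }
            catch (Exception e)
            {
                throw new IOException(e);
            }
            String dataFile = ((FileSplit) split).getPath().toUri().getPath();
            SSTableReader reader = SSTableReader.open(Descriptor.fromFilename(dataFile));
            scanner = reader.getDirectScanner();
        }

        @Override
        public boolean nextKeyValue() throws IOException
        {
            if (!scanner.hasNext())
                return false;
            OnDiskAtomIterator row = scanner.next();
            byte[] raw = ByteBufferUtil.getArray(row.getKey().key);
            key.set(raw, 0, raw.length);
            return true;
        }

        @Override
        public BytesWritable getCurrentKey() { return key; }

        @Override
        public NullWritable getCurrentValue() { return NullWritable.get(); }

        @Override
        public float getProgress() { return 0f; }

        @Override
        public void close() throws IOException { scanner.close(); }
    }
}

The part this glosses over is locality: the job still has to be arranged so each tasktracker only receives splits for files on its own disk (file:// input paths plus the RF=3 node placement above), which is the piece we would need to get right.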

I already wrote a client on top of SSTableReader that parses sstables, so I can take a look at some of our data while our 13 day M/R job is running (we are 4 days in already, with no failures and no performance degradation).

later,
Dean

From: java8964 java8964 <ja...@hotmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, September 3, 2013 12:06 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: RE: map/reduce performance time and sstable reader…

I am trying to do the same thing. In our project, we want to load data from Cassandra into a Hadoop cluster, and SSTables are one obvious option, since you can get the data changed since the last batch load directly from the SSTable incremental backup files.

But based on my research so far (I may be wrong, as I have only done limited research on SSTables, and I hope someone on this forum can tell me if I am wrong), it may NOT be a good option:

1) sstable2json does not look like a scalable solution for getting data out of Cassandra, and it needs access to the "data" directory to read some metadata from the system keyspace for the column family being dumped, which may not be an option in your MR environment.
2) So far I am thinking of reusing the same API that sstable2json uses, but then I have to provide that metadata to the API myself: validator types, partitioner, etc. I am surprised that, as a backup, the column family SSTable dump files DON'T contain this information themselves. Shouldn't the API be able to work this out from the SSTable files ONLY?
3) The big trouble comes if you want to parse the SSTables in your MR code. Internally, the API loads the index and compression metadata from the Index/CompressionInfo files, which it assumes live in the same place as the data file, and it opens them with local file streams. So if the data files are in a DFS (Distributed File System), I have so far found no way to tell the API to read from a DFS stream instead of a local file input stream. Basically you have 2 options: a) copy the files from the DFS to the local file system (same as what the Knewton guys did at https://github.com/Knewton/KassandraMRHelper; a sketch of this copy step follows after this list), or b) develop your own API to access the SSTable files directly (my guess is that the Netflix guys did it this way; they have a project called "Aegisthus", see here: http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html, but it is not open source).
4) About performance, I am not sure: sstable2json uses the same Cassandra API underneath, but running in MR gives us scalability, as we can reuse the Hadoop framework for the many benefits it brings.
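For reference, option (a) above (the Knewton-style copy to local disk) only needs the standard Hadoop FileSystem API. A minimal hypothetical sketch; SSTableLocalizer is an invented name, and it just pulls a data file plus every sibling component out of the DFS so the Cassandra reader can open them locally:

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SSTableLocalizer
{
    // Copies e.g. MyKeyspace-MyCF-ic-1-Data.db and all of its sibling
    // components out of the DFS, and returns the local Data file to open.
    public static File localize(Configuration conf, Path dataFileInDfs, File scratchDir)
        throws Exception
    {
        FileSystem fs = dataFileInDfs.getFileSystem(conf);
        // MyKeyspace-MyCF-ic-1-Data.db -> common prefix MyKeyspace-MyCF-ic-1-
        String prefix = dataFileInDfs.getName().replace("Data.db", "");
        for (FileStatus status : fs.listStatus(dataFileInDfs.getParent()))
        {
            // Grab Data.db plus Index.db, CompressionInfo.db, Filter.db,
            // Statistics.db, etc., which the reader expects side by side.
            if (status.getPath().getName().startsWith(prefix))
            {
                File local = new File(scratchDir, status.getPath().getName());
                fs.copyToLocalFile(status.getPath(), new Path(local.getAbsolutePath()));
            }
        }
        return new File(scratchDir, prefix + "Data.db");
    }
}

From there the local Data file can be opened exactly as in the dump sketch earlier in the thread.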

Yong

> From: Dean.Hiller@nrel.gov
> To: user@cassandra.apache.org
> Date: Fri, 30 Aug 2013 07:25:09 -0600
> Subject: map/reduce performance time and sstable reader…
>
> Has anyone done performance tests on sstable reading vs. M/R? I did a quick test reading all the SSTables in an LCS column family, 23 tables in all, and took the average time sstable2json needed (writing to /dev/null to make it faster), which was 7 seconds per table (writing to stdout took 16 seconds per table). That works out to an estimate of 12.5 hours, or up to 27 hours going by the stdout numbers. I suspect the map/reduce time may be much worse, since there are not as many repeated rows in LCS?
>
> I.e. I am wondering if I should just read from SSTables directly instead of using map/reduce? I am about to dig around in the code of M/R and sstable2json to see what each is doing specifically.
>
> Thanks,
> Dean