Posted to commits@cassandra.apache.org by "DOAN DuyHai (JIRA)" <ji...@apache.org> on 2016/03/18 21:30:33 UTC
[jira] [Comment Edited] (CASSANDRA-11383) SASI index build leads to massive OOM
[ https://issues.apache.org/jira/browse/CASSANDRA-11383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202088#comment-15202088 ]
DOAN DuyHai edited comment on CASSANDRA-11383 at 3/18/16 8:30 PM:
------------------------------------------------------------------
[~jkrupan]
1. Not that large; see below the Spark script used to generate the randomized data:
{noformat}
import java.util.UUID
import com.datastax.spark.connector._
import java.util.UUID
import com.datastax.spark.connector._

case class Resource(dsrId: UUID, relSeq: Long, seq: Long, dspReleaseCode: String,
                    commercialOfferCode: String, transferCode: String, mediaCode: String,
                    modelCode: String, unicWork: String,
                    title: String, status: String, contributorsName: List[String],
                    periodEndMonthInt: Int, dspCode: String, territoryCode: String,
                    payingNetQty: Long, authorizedSocietiesTxt: String, relType: String)

// Value pools; repeated entries make those values more likely to be drawn
val allDsps = List("youtube", "itunes", "spotify", "deezer", "vevo", "google-play", "7digital", "spotify", "youtube", "spotify", "youtube", "youtube", "youtube")
val allCountries = List("FR", "UK", "BE", "IT", "NL", "ES", "FR", "FR")
val allPeriodsEndMonths: Seq[Int] = for (year <- 2013 to 2015; month <- 1 to 12) yield (year.toString + f"$month%02d").toInt
val allModelCodes = List("PayAsYouGo", "AdFunded", "Subscription")
val allMediaCodes = List("Music", "Ringtone")
val allTransferCodes = List("Streaming", "Download")
val allCommercialOffers = List("Premium", "Free")
val status = "Declared"
val authorizedSocietiesTxt: String = "sacem sgae"
val relType = "whatever"

// Distinct (title, contributors) pairs loaded from a CSV sample
val titlesAndContributors: Array[(String, String)] = sc.textFile("/tmp/top_100.csv")
  .map(line => { val split = line.split(";"); (split(1), split(2)) })
  .distinct
  .collect

// 100 batches of 40M randomly generated rows
for (i <- 1 to 100) {
  sc.parallelize((1 to 40000000).map(_ => UUID.randomUUID))
    .map(dsrId => {
      val r = new java.util.Random(System.currentTimeMillis())
      val relSeq = r.nextLong()
      val seq = r.nextLong()
      val dspReleaseCode = seq.toString
      val dspCode = allDsps(r.nextInt(allDsps.size))
      val periodEndMonth = allPeriodsEndMonths(r.nextInt(allPeriodsEndMonths.size))
      val territoryCode = allCountries(r.nextInt(allCountries.size))
      val modelCode = allModelCodes(r.nextInt(allModelCodes.size))
      val mediaCode = allMediaCodes(r.nextInt(allMediaCodes.size))
      val transferCode = allTransferCodes(r.nextInt(allTransferCodes.size))
      val commercialOffer = allCommercialOffers(r.nextInt(allCommercialOffers.size))
      val titleAndContributor: (String, String) = titlesAndContributors(r.nextInt(titlesAndContributors.size))
      val title = titleAndContributor._1
      val contributorsName = titleAndContributor._2.split(",").toList
      val unicWork = title + "|" + titleAndContributor._2
      val payingNetQty = r.nextInt(100).toLong
      Resource(dsrId, relSeq, seq, dspReleaseCode, commercialOffer, transferCode, mediaCode, modelCode,
        unicWork, title, status, contributorsName, periodEndMonth, dspCode, territoryCode, payingNetQty,
        authorizedSocietiesTxt, relType)
    })
    .saveToCassandra("keyspace", "resource")
  Thread.sleep(500)
}
{noformat}
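A side note on the script above (an observation, not part of the original ticket): seeding {{java.util.Random}} with {{System.currentTimeMillis()}} inside the {{map}} means every record processed within the same millisecond gets the same seed and therefore draws an identical value sequence, so the generated data contains far more duplicate combinations than truly random data would. A minimal JVM sketch of the effect:

```java
import java.util.Random;

public class SeedDemo {
    public static void main(String[] args) {
        // Two Randoms created "at the same time" share the millisecond seed...
        long seed = System.currentTimeMillis();
        Random r1 = new Random(seed);
        Random r2 = new Random(seed);

        // ...and so emit identical sequences: two rows generated in the same
        // millisecond would share relSeq, seq, dspCode, and every other draw.
        for (int i = 0; i < 5; i++) {
            System.out.println(r1.nextLong() == r2.nextLong()); // prints "true" five times
        }
    }
}
```

Seeding once per partition, or mixing the element itself into the seed, would avoid the correlation; it does not change the OOM behaviour, only the realism of the data set.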
2. Does the OOM occur if SASI indexes are created one at a time - serially, waiting for each index to build fully before moving on to the next? --> *Yes it does*; see the log file with CMS settings attached above
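For context, the serial build described above amounts to issuing one {{CREATE CUSTOM INDEX}} statement at a time and waiting for each build to complete before the next. A sketch with hypothetical index and column names (the actual schema is not shown in this ticket), using the PREFIX / case-insensitive and SPARSE options described below:

```sql
-- Hypothetical names; one statement at a time, waiting for the build to finish:
CREATE CUSTOM INDEX resource_title_idx ON keyspace.resource (title)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
  'mode': 'PREFIX',
  'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
  'case_sensitive': 'false'
};

-- SPARSE mode for the single numeric index:
CREATE CUSTOM INDEX resource_qty_idx ON keyspace.resource (payingnetqty)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'SPARSE' };
```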
3. Do you need a 32G heap to build just one index? I cringe when I see a heap larger than 14G. See if you can get a single SASI index build to work in 10-12G or less.
--> Well, the 32 GB heap was for analytics use cases and I was using G1 GC. But changing to CMS with an 8 GB heap gives the same result, OOM; see the log file with CMS settings attached above
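For reference, the switch from G1 to CMS amounts to JVM flags along these lines (a sketch only; the exact settings used are recorded in the attached new_system_log_CMS_8GB_OOM.log):

```
-Xms8G
-Xmx8G
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
```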
> SASI index build leads to massive OOM
> -------------------------------------
>
> Key: CASSANDRA-11383
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11383
> Project: Cassandra
> Issue Type: Bug
> Components: CQL
> Environment: C* 3.4
> Reporter: DOAN DuyHai
> Attachments: CASSANDRA-11383.patch, new_system_log_CMS_8GB_OOM.log, system.log_sasi_build_oom
>
>
> 13 bare metal machines
> - 6 cores CPU (12 HT)
> - 64Gb RAM
> - 4 SSD in RAID0
> JVM settings:
> - G1 GC
> - Xms32G, Xmx32G
> Data set:
> - ≈ 100 Gb per node
> - 1.3 Tb cluster-wide
> - ≈ 20Gb for all SASI indices
> C* settings:
> - concurrent_compactors: 1
> - compaction_throughput_mb_per_sec: 256
> - memtable_heap_space_in_mb: 2048
> - memtable_offheap_space_in_mb: 2048
> I created 9 SASI indices
> - 8 indices with text field, NonTokenizingAnalyser, PREFIX mode, case-insensitive
> - 1 index with numeric field, SPARSE mode
> After a while, the nodes just go OOM.
> I attach log files. You can see a lot of GC happening while index segments are flushed to disk. At some point the nodes OOM ...
> /cc [~xedin]
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)