Posted to commits@cassandra.apache.org by "DOAN DuyHai (JIRA)" <ji...@apache.org> on 2016/03/18 21:30:33 UTC

[jira] [Comment Edited] (CASSANDRA-11383) SASI index build leads to massive OOM

    [ https://issues.apache.org/jira/browse/CASSANDRA-11383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202088#comment-15202088 ] 

DOAN DuyHai edited comment on CASSANDRA-11383 at 3/18/16 8:30 PM:
------------------------------------------------------------------

[~jkrupan] 

1. Not that large; see below the Spark script used to generate the randomized data:

{noformat}
    import java.util.UUID
    import com.datastax.spark.connector._
    case class Resource(dsrId:UUID, relSeq:Long, seq:Long, dspReleaseCode:String,
                        commercialOfferCode:String, transferCode:String, mediaCode:String,
                        modelCode:String, unicWork:String,
                        title:String, status:String, contributorsName:List[String],
                        periodEndMonthInt:Int, dspCode:String, territoryCode:String,
                        payingNetQty:Long, authorizedSocietiesTxt: String, relType:String)

    val allDsps = List("youtube", "itunes", "spotify", "deezer", "vevo", "google-play", "7digital", "spotify", "youtube", "spotify", "youtube", "youtube", "youtube")
    val allCountries = List("FR", "UK", "BE", "IT", "NL", "ES", "FR", "FR")
    val allPeriodsEndMonths:Seq[Int] = for(year <- 2013 to 2015; month <- 1 to 12) yield (year.toString + f"$month%02d").toInt
    val allModelCodes = List("PayAsYouGo", "AdFunded", "Subscription")
    val allMediaCodes = List("Music","Ringtone")
    val allTransferCodes = List("Streaming","Download")
    val allCommercialOffers = List("Premium","Free")
    val status = "Declared"
    val authorizedSocietiesTxt: String="sacem sgae"
    val relType = "whatever"
    val titlesAndContributors: Array[(String, String)] = sc.textFile("/tmp/top_100.csv").map(line => {val split = line.split(";"); (split(1),split(2))}).distinct.collect

    // 100 batches of 40M rows each, saved to Cassandra with a short pause in between
    for (i <- 1 to 100) {
        sc.parallelize((1 to 40000000).map(_ => UUID.randomUUID)).
          map(dsrId => {
            // default-seeded Random: seeding with System.currentTimeMillis()
            // would give identical sequences for rows created in the same ms
            val r = new java.util.Random()

            val relSeq = r.nextLong()
            val seq = r.nextLong()
            val dspReleaseCode = seq.toString
            val dspCode = allDsps(r.nextInt(allDsps.size))
            val periodEndMonth = allPeriodsEndMonths(r.nextInt(allPeriodsEndMonths.size))
            val territoryCode = allCountries(r.nextInt(allCountries.size))
            val modelCode = allModelCodes(r.nextInt(allModelCodes.size))
            val mediaCode = allMediaCodes(r.nextInt(allMediaCodes.size))
            val transferCode = allTransferCodes(r.nextInt(allTransferCodes.size))
            val commercialOffer = allCommercialOffers(r.nextInt(allCommercialOffers.size))
            val titleAndContributor: (String, String) = titlesAndContributors(r.nextInt(titlesAndContributors.size))
            val title = titleAndContributor._1
            val contributorsName = titleAndContributor._2.split(",").toList
            val unicWork = title + "|" + titleAndContributor._2
            val payingNetQty = r.nextInt(100).toLong
            Resource(dsrId, relSeq, seq, dspReleaseCode, commercialOffer, transferCode, mediaCode, modelCode,
              unicWork, title, status, contributorsName, periodEndMonth, dspCode, territoryCode, payingNetQty,
              authorizedSocietiesTxt, relType)

          }).
          saveToCassandra("keyspace", "resource")

        Thread.sleep(500)
    }
{noformat}
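For completeness, the SASI indexes from the description look roughly like this in CQL (a sketch only: the index names and column choices here are assumptions, while the modes, analyzer and case-insensitivity come from the issue description):

{noformat}
-- one of the 8 text indexes: PREFIX mode, non-tokenizing, case-insensitive
CREATE CUSTOM INDEX resource_title_idx ON keyspace.resource (title)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
    'mode': 'PREFIX',
    'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
    'case_sensitive': 'false'
};

-- the numeric index: SPARSE mode
CREATE CUSTOM INDEX resource_period_idx ON keyspace.resource (period_end_month_int)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'SPARSE' };
{noformat}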

2. Does OOM occur if SASI indexes are created one at a time - serially, waiting for the full index to build before moving on to the next?  --> *Yes, it does*; see the log file with CMS settings attached above

3. Do you need a 32G heap to build just one index? I cringe when I see a heap larger than 14G. See if you can get a single SASI index build to work in 10-12G or less.

 --> Well, the 32 GB heap was for analytics use cases and I was using G1 GC. But changing to CMS with an 8 GB heap gives the same result: OOM. See the log file with CMS settings attached above
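For context, the CMS run used the stock cassandra-env.sh collector flags with the heap pinned to 8 GB, i.e. something along these lines (the exact flag values are assumptions; the attached log has the actual settings):

{noformat}
-Xms8G -Xmx8G
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
{noformat}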




> SASI index build leads to massive OOM
> -------------------------------------
>
>                 Key: CASSANDRA-11383
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11383
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CQL
>         Environment: C* 3.4
>            Reporter: DOAN DuyHai
>         Attachments: CASSANDRA-11383.patch, new_system_log_CMS_8GB_OOM.log, system.log_sasi_build_oom
>
>
> 13 bare metal machines
> - 6 cores CPU (12 HT)
> - 64Gb RAM
> - 4 SSD in RAID0
>  JVM settings:
> - G1 GC
> - Xms32G, Xmx32G
> Data set:
>  - ≈ 100Gb/per node
>  - 1.3 Tb cluster-wide
>  - ≈ 20Gb for all SASI indices
> C* settings:
> - concurrent_compactors: 1
> - compaction_throughput_mb_per_sec: 256
> - memtable_heap_space_in_mb: 2048
> - memtable_offheap_space_in_mb: 2048
> I created 9 SASI indices
>  - 8 indices with text field, NonTokenizingAnalyser,  PREFIX mode, case-insensitive
>  - 1 index with numeric field, SPARSE mode
>  After a while, the nodes just went OOM.
>  I attach the log files. You can see a lot of GC activity while index segments are flushed to disk. At some point the nodes OOM ...
> /cc [~xedin]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)