Posted to dev@kylin.apache.org by Samuel Bock <sb...@marinsoftware.com> on 2015/03/02 19:14:15 UTC

Re: OutOfMemoryError on step #3 of Cube build

Hi Jiang,

I'm afraid our investigation is moving on for the moment, so I don't have
the resources to set that up. If we end up coming back around, however, I
will try it.

Thanks,
sam

On Fri, Feb 27, 2015 at 7:31 PM, 蒋旭 <ji...@qq.com> wrote:

> Hi Sam,
>
> Could you try the pre-join on your test data set first? That way you can
> verify whether Kylin meets your requirements on the test data set.
>
> If the pre-join solution works, we can add a "pre-join" option to the cube
> definition and automate it in the cube build engine. Then you could change
> the dimension data easily without impacting the cube build.
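>
> For example, here is a minimal sketch of that pre-join in Hive, using the
> table and column names from the cube JSON further down this thread (the
> output table name FACTS_PREJOINED is just a placeholder):
>
>   -- Flatten the derived KEYWORDS columns onto the fact table so the
>   -- cube build no longer needs a snapshot of the 25M-row lookup table.
>   CREATE TABLE FACTS_PREJOINED AS
>   SELECT f.*,
>          k.PUBLISHER_GROUP_ID,
>          k.PUBLISHER_CAMPAIGN_ID,
>          k.PUBLISHER_ID
>   FROM FACTS f
>   LEFT JOIN KEYWORDS k
>     ON f.KEYWORD_DIM_ID = k.DIM_ID;
>
> You would then point the cube's fact_table at FACTS_PREJOINED and model
> the publisher columns as ordinary dimensions instead of derived ones.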
>
> Thanks
> Jiang Xu
>
> ------------------ Original Message ------------------
> *From:* Samuel Bock <sb...@marinsoftware.com>
> *Sent:* 2015-02-28 02:26
> *To:* 蒋旭 <ji...@qq.com>
> *Cc:* dev <de...@kylin.incubator.apache.org>
> *Subject:* Re: OutOfMemoryError on step #3 of Cube build
>
> While that might be possible when putting together a test dataset, the
> actual system will need to retain the ability to change dimension data
> easily. A prejoined table would make that significantly harder (among other
> things).
>
> thanks,
> Sam
>
>
> On Wed, Feb 25, 2015 at 4:38 PM, 蒋旭 <ji...@qq.com> wrote:
>
> > As a workaround, could you pre-join the big dimension table with the
> > fact table in Hive? Then you can run Kylin on the pre-joined table.
> >
> > We will work on an optimization for big dimension tables later.
> >
> > Thanks
> > Jiang Xu
> >
> > ------------------ Original Message ------------------
> > *From:* Samuel Bock <sb...@marinsoftware.com>
> > *Sent:* 2015-02-26 03:28
> > *To:* dev <de...@kylin.incubator.apache.org>
> > *Subject:* Re: OutOfMemoryError on step #3 of Cube build
> >
> > Thank you for the follow-up.
> >
> > Our dimension table is 25 million rows for our test data set, and would
> > be far larger in production. Given that, it sounds like our data doesn't
> > fit the Kylin use case. I appreciate the assistance in tracking down the
> > source of this issue.
> >
> > cheers,
> > sam
> >
> > On Tue, Feb 24, 2015 at 7:28 PM, Shi, Shaofeng <sh...@ebay.com> wrote:
> >
> > > Hi Samuel,
> > >
> > > Kylin only supports star schemas: one fact table joined with multiple
> > > lookup tables. The lookup tables need to be small so that Kylin can
> > > read them into memory for the join and cube build. Also, as you found,
> > > Kylin takes a snapshot of each lookup table and persists it in HBase;
> > > that should be the problem. In your case, how many rows are there in
> > > the KEYWORDS table?
> > >
> > > On 2/21/15, 2:12 AM, "Samuel Bock" <sb...@marinsoftware.com> wrote:
> > >
> > > >Thank you for your response,
> > > >
> > > >I went into the code, and I'm fairly confident that I've isolated the
> > > >problem. The OutOfMemoryError occurs during the dimension dictionary
> > > >step, but is not actually related to the dictionary itself (since, as
> > > >you mentioned, that is skipped when dictionary=false). The problem
> > > >arises from the second half of that step, which builds the dimension
> > > >table snapshot. Looking at the code, building the snapshot loads the
> > > >entire table into memory as strings (SnapshotTable.takeSnapshot), then
> > > >serializes that to an in-memory ByteArrayOutputStream
> > > >(ResourceStore.putResource), and finally copies the stream's internal
> > > >byte array in order to store it in HBase
> > > >(HBaseResourceStore.checkAndPutResourceImpl). That means there needs
> > > >to be space for three in-memory copies of the full dimension table.
> > > >Given that even our test subset of the dimension table is 25 million
> > > >rows by 14 columns, that becomes problematic. From experimentation, it
> > > >breaks even with a 95 GB heap.
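> > > >
> > > >(A rough back-of-the-envelope estimate, under approximate JVM sizing
> > > >assumptions: 25 million rows x 14 columns is about 350 million cell
> > > >values, and at a ballpark of ~50 bytes per java.lang.String the first
> > > >copy alone is on the order of 15-20 GB. Note also that
> > > >ByteArrayOutputStream is backed by a single byte[], and the JVM caps
> > > >any one array near Integer.MAX_VALUE elements, so once the serialized
> > > >snapshot passes ~2 GB the buffer's doubling growth would request an
> > > >oversized array and plausibly fail with "Requested array size exceeds
> > > >VM limit" regardless of heap size, which matches the error in the log
> > > >below.)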
> > > >
> > > >For completeness, the log leading up to the crash (minus the
> > > >pointless zk messages) is:
> > > > - Start to execute command:
> > > > -cubename foo -segmentname FULL_BUILD -input /tmp/kylin-7d2b7588-17c0-4d80-9962-14ca63929186/foo/fact_distinct_columns
> > > >[QuartzScheduler_Worker-1]:[2015-02-19 22:59:01,284][INFO][com.kylinolap.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:57)] - Building snapshot of KEYWORDS
> > > >[QuartzScheduler_Worker-2]:[2015-02-19 22:59:53,241][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)] - 0 pending jobs
> > > >[QuartzScheduler_Worker-3]:[2015-02-19 23:00:53,252][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)] - 0 pending jobs
> > > >[QuartzScheduler_Worker-1]:[2015-02-19 23:01:01,278][INFO][com.kylinolap.dict.lookup.FileTableReader.autoDetectDelim(FileTableReader.java:156)] - Auto detect delim to be ' ', split line to 14 columns -- 1020_18768_4_127200_4647593_group_341686994 group 19510703 0 18768 1020 341686994 4647593 371981 4 127200 CONTENT 2015-01-21 22:16:36.227246
> > > >[http-bio-7070-exec-8]:[2015-02-19 23:02:07,980][DEBUG][com.kylinolap.rest.service.AdminService.getConfigAsString(AdminService.java:91)] - Get Kylin Runtime Config
> > > >[QuartzScheduler_Worker-4]:[2015-02-19 23:02:53,934][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)] - 0 pending jobs
> > > >[QuartzScheduler_Worker-1]:[2015-02-19 23:03:10,216][DEBUG][com.kylinolap.common.persistence.ResourceStore.putResource(ResourceStore.java:166)] - Saving resource /table_snapshot/part-00000.csv/f87954d5-fdfa-4903-9f82-771d85df6367.snapshot (Store kylin_metadata_qa@hbase)
> > > >[QuartzScheduler_Worker-6]:[2015-02-19 23:04:53,230][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)] - 0 pending jobs
> > > >java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> > > >Dumping heap to java_pid3705.hprof ...
> > > >
> > > >
> > > >The cube JSON is:
> > > >
> > > >{
> > > >  "uuid": "ba6105ca-a18d-4839-bed0-c89b86817110",
> > > >  "name": "foo",
> > > >  "description": "",
> > > >  "dimensions": [
> > > >    {
> > > >      "id": 1,
> > > >      "name": "KEYWORDS_DERIVED",
> > > >      "join": {
> > > >        "type": "left",
> > > >        "primary_key": [
> > > >          "DIM_ID"
> > > >        ],
> > > >        "foreign_key": [
> > > >          "KEYWORD_DIM_ID"
> > > >        ]
> > > >      },
> > > >      "hierarchy": null,
> > > >      "table": "KEYWORDS",
> > > >      "column": "{FK}",
> > > >      "datatype": null,
> > > >      "derived": [
> > > >        "PUBLISHER_GROUP_ID",
> > > >        "PUBLISHER_CAMPAIGN_ID",
> > > >        "PUBLISHER_ID"
> > > >      ]
> > > >    }
> > > >  ],
> > > >  "measures": [
> > > >    {
> > > >      "id": 1,
> > > >      "name": "_COUNT_",
> > > >      "function": {
> > > >        "expression": "COUNT",
> > > >        "parameter": {
> > > >          "type": "constant",
> > > >          "value": "1"
> > > >        },
> > > >        "returntype": "bigint"
> > > >      },
> > > >      "dependent_measure_ref": null
> > > >    },
> > > >    {
> > > >      "id": 2,
> > > >      "name": "CONVERSIONS",
> > > >      "function": {
> > > >        "expression": "SUM",
> > > >        "parameter": {
> > > >          "type": "column",
> > > >          "value": "CONVERSIONS"
> > > >        },
> > > >        "returntype": "bigint"
> > > >      },
> > > >      "dependent_measure_ref": null
> > > >    }
> > > >  ],
> > > >  "rowkey": {
> > > >    "rowkey_columns": [
> > > >      {
> > > >        "column": "KEYWORD_DIM_ID",
> > > >        "length": 0,
> > > >        "dictionary": "false",
> > > >        "mandatory": false
> > > >      }
> > > >    ],
> > > >    "aggregation_groups": [
> > > >      [
> > > >        "KEYWORD_DIM_ID"
> > > >      ]
> > > >    ]
> > > >  },
> > > >  "signature": "T+aYTH/KlCwwmVAGRQR3hQ==",
> > > >  "capacity": "LARGE",
> > > >  "last_modified": 1424367558297,
> > > >  "fact_table": "FACTS",
> > > >  "null_string": null,
> > > >  "filter_condition": "KEYWORDS.PUBLISHER_GROUP_ID=386784931",
> > > >  "cube_partition_desc": {
> > > >    "partition_date_column": null,
> > > >    "partition_date_start": 0,
> > > >    "cube_partition_type": "APPEND"
> > > >  },
> > > >  "hbase_mapping": {
> > > >    "column_family": [
> > > >      {
> > > >        "name": "F1",
> > > >        "columns": [
> > > >          {
> > > >            "qualifier": "M",
> > > >            "measure_refs": [
> > > >              "_COUNT_",
> > > >              "CONVERSIONS"
> > > >            ]
> > > >          }
> > > >        ]
> > > >      }
> > > >    ]
> > > >  },
> > > >  "notify_list": [
> > > >    "sam"
> > > >  ]
> > > >}
> > > >
> > > >
> > > >Cheers,
> > > >sam
> > > >
> > > >On Thu, Feb 19, 2015 at 9:49 PM, 周千昊 <z....@gmail.com> wrote:
> > > >
> > > >> Also, since you set the dictionary to false, there should not be
> > > >> any memory consumption while building the dictionary.
> > > >> So can you also give us the JSON description of the cube? (In the
> > > >> cube tab, click the corresponding cube, then click the JSON button.)
> > > >>
> > > >>
> > > >> On Fri Feb 20 2015 at 1:39:15 PM 周千昊 <z....@gmail.com> wrote:
> > > >>
> > > >> > Hi, Samuel
> > > >> >      Can you give us some detailed logs, so we can dig into the
> > > >> > root cause?
> > > >> >
> > > >> > On Fri Feb 20 2015 at 2:44:32 AM Samuel Bock <sbock@marinsoftware.com>
> > > >> > wrote:
> > > >> >
> > > >> >> Hello all,
> > > >> >>
> > > >> >> We are in the process of evaluating Kylin for use as an OLAP
> > > >> >> engine. To that end, we are trying to get a minimum viable setup
> > > >> >> with a representative sample of our data in order to gather
> > > >> >> performance metrics. We have Kylin running against a 10-node
> > > >> >> cluster, the provided cubes build successfully, and the system
> > > >> >> seems functional. Attempting to build a simple cube against our
> > > >> >> data results in an OutOfMemoryError in the Kylin server process
> > > >> >> (so far we have given it up to a 46 GB heap). I was wondering if
> > > >> >> you could give me some guidance as to likely causes, or any
> > > >> >> configurations I'm likely to have missed, before I start diving
> > > >> >> into the source. I have changed the "dictionary" setting to false,
> > > >> >> as recommended for high-cardinality dimensions, but have not
> > > >> >> changed the configuration significantly apart from that.
> > > >> >>
> > > >> >> For reference, the sizes of the Hive tables we're building the
> > > >> >> cubes from are:
> > > >> >> dimension table: 25,399,061 rows
> > > >> >> fact table: 270,940,921 rows
> > > >> >>
> > > >> >> (And as a note, there are no pertinent log messages except to
> > > >> >> indicate that it is in the Build Dimension Dictionary step.)
> > > >> >>
> > > >> >> Thank you,
> > > >> >> sam bock
> > > >> >>
> > > >> >
> > > >>
> > >
> > >
> >
> >
>
>