Posted to dev@kylin.apache.org by 蒋旭 <ji...@qq.com> on 2015/02/26 01:38:11 UTC

Re: OutOfMemoryError on step #3 of Cube build

As a workaround, could you prejoin the big dimension table with the fact table in Hive? Then you can run Kylin on the prejoined table.
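
For example, a rough sketch of that workaround over Hive JDBC (only an illustration, not tested against your schema; the connection URL, credentials, and the set of carried-over columns are assumptions to adapt, while the table and join-key names come from your cube JSON):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class PrejoinFactTable {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; point the URL at your own host.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {
                // Materialize the join once in Hive so Kylin never has to
                // snapshot the big KEYWORDS lookup table. The derived
                // columns from the cube JSON are carried onto the fact row.
                stmt.execute(
                    "CREATE TABLE FACTS_PREJOINED AS "
                    + "SELECT f.*, "
                    + "       k.PUBLISHER_GROUP_ID, "
                    + "       k.PUBLISHER_CAMPAIGN_ID, "
                    + "       k.PUBLISHER_ID "
                    + "FROM FACTS f "
                    + "LEFT JOIN KEYWORDS k ON f.KEYWORD_DIM_ID = k.DIM_ID");
            }
        }
    }

The cube definition would then point fact_table at FACTS_PREJOINED and declare the former derived columns as plain dimensions.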


We will work on optimizing big dimension tables later.


Thanks
Jiang Xu


------------------ Original Message ------------------
From: Samuel Bock <sb...@marinsoftware.com>
Sent: 2015-02-26 03:28
To: dev <de...@kylin.incubator.apache.org>
Subject: Re: OutOfMemoryError on step #3 of Cube build



Thank you for the follow up,

Our dimension table is 25 million rows for our test data set, and would be
far larger in production. Given that, it sounds like our data doesn't fit
the Kylin use case. I appreciate the assistance in tracking down the source
of this issue,

cheers,
sam

On Tue, Feb 24, 2015 at 7:28 PM, Shi, Shaofeng <sh...@ebay.com> wrote:

> Hi Samuel,
>
> Kylin only supports the star schema: one fact table joined with multiple
> lookup tables. The lookup tables need to be small so that Kylin can read
> them into memory for the join and the cube build. Also, as you found,
> Kylin takes a snapshot of the lookup tables and persists them in HBase;
> that should be the problem. In your case, how many rows are there in the
> KEYWORDS table?
>
> On 2/21/15, 2:12 AM, "Samuel Bock" <sb...@marinsoftware.com> wrote:
>
> >Thank you for your response,
> >
> >I went into the code, and I'm fairly confident that I've isolated the
> >problem. The OutOfMemoryError occurs in the dimension dictionary step,
> >but is not actually related to the dictionary itself (since, as you
> >mentioned, that is skipped when dictionary=false). The problem arises
> >from the second half of that step, in which it builds the dimension
> >table snapshot. Looking at the code, the process of building the
> >snapshot table loads the entire table into memory as strings
> >(SnapshotTable.takeSnapshot), then serializes that to an in-memory
> >ByteArrayOutputStream (ResourceStore.putResource), then finally creates
> >a copy of the stream's internal byte array in order to store it in HBase
> >(HBaseResourceStore.checkAndPutResourceImpl). That means there needs to
> >be space for three in-memory copies of the full dimension table. Given
> >that even our test subset dimension table is 25 million rows by 14
> >columns, that becomes problematic. From experimentation, it breaks even
> >with a 95 GB heap.
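> >
> >To make the three copies concrete, here is a minimal self-contained
> >sketch of that pattern (my own illustration, not the actual Kylin code;
> >the row count and contents are made up):
> >
> >    import java.io.ByteArrayOutputStream;
> >    import java.io.DataOutputStream;
> >    import java.io.IOException;
> >    import java.util.ArrayList;
> >    import java.util.List;
> >
> >    public class SnapshotCopies {
> >        public static void main(String[] args) throws IOException {
> >            // Copy 1: the whole lookup table held as row strings, as in
> >            // SnapshotTable.takeSnapshot.
> >            List<String[]> rows = new ArrayList<String[]>();
> >            for (int i = 0; i < 1000000; i++) {
> >                rows.add(new String[] { "key" + i, "col2", "col3" });
> >            }
> >
> >            // Copy 2: the serialized bytes accumulate in an in-memory
> >            // buffer, as in ResourceStore.putResource. Growth doubles
> >            // the backing array, so a request near the ~2 GB Java array
> >            // limit can fail with "Requested array size exceeds VM
> >            // limit" long before the heap itself is exhausted.
> >            ByteArrayOutputStream buf = new ByteArrayOutputStream();
> >            DataOutputStream out = new DataOutputStream(buf);
> >            for (String[] row : rows) {
> >                for (String cell : row) {
> >                    out.writeUTF(cell);
> >                }
> >            }
> >            out.flush();
> >
> >            // Copy 3: toByteArray() duplicates the internal buffer,
> >            // which is what handing the value to HBase in
> >            // HBaseResourceStore.checkAndPutResourceImpl requires.
> >            byte[] value = buf.toByteArray();
> >            System.out.println("serialized bytes: " + value.length);
> >        }
> >    }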
> >
> >For completeness, the log leading up to the crash (minus the pointless
> >zk messages) is:
> > - Start to execute command:
> > -cubename foo -segmentname FULL_BUILD -input /tmp/kylin-7d2b7588-17c0-4d80-9962-14ca63929186/foo/fact_distinct_columns
> >[QuartzScheduler_Worker-1]:[2015-02-19 22:59:01,284][INFO][com.kylinolap.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:57)]
> >- Building snapshot of KEYWORDS
> >[QuartzScheduler_Worker-2]:[2015-02-19 22:59:53,241][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)]
> >- 0 pending jobs
> >[QuartzScheduler_Worker-3]:[2015-02-19 23:00:53,252][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)]
> >- 0 pending jobs
> >[QuartzScheduler_Worker-1]:[2015-02-19 23:01:01,278][INFO][com.kylinolap.dict.lookup.FileTableReader.autoDetectDelim(FileTableReader.java:156)]
> >- Auto detect delim to be ' ', split line to 14 columns -- 1020_18768_4_127200_4647593_group_341686994 group 19510703 0 18768 1020 341686994 4647593 371981 4 127200 CONTENT 2015-01-21 22:16:36.227246
> >[http-bio-7070-exec-8]:[2015-02-19 23:02:07,980][DEBUG][com.kylinolap.rest.service.AdminService.getConfigAsString(AdminService.java:91)]
> >- Get Kylin Runtime Config
> >[QuartzScheduler_Worker-4]:[2015-02-19 23:02:53,934][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)]
> >- 0 pending jobs
> >[QuartzScheduler_Worker-1]:[2015-02-19 23:03:10,216][DEBUG][com.kylinolap.common.persistence.ResourceStore.putResource(ResourceStore.java:166)]
> >- Saving resource /table_snapshot/part-00000.csv/f87954d5-fdfa-4903-9f82-771d85df6367.snapshot (Store kylin_metadata_qa@hbase)
> >[QuartzScheduler_Worker-6]:[2015-02-19 23:04:53,230][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)]
> >- 0 pending jobs
> >java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> >Dumping heap to java_pid3705.hprof ...
> >
> >
> >The cube JSON is:
> >
> >{
> >  "uuid": "ba6105ca-a18d-4839-bed0-c89b86817110",
> >  "name": "foo",
> >  "description": "",
> >  "dimensions": [
> >    {
> >      "id": 1,
> >      "name": "KEYWORDS_DERIVED",
> >      "join": {
> >        "type": "left",
> >        "primary_key": [
> >          "DIM_ID"
> >        ],
> >        "foreign_key": [
> >          "KEYWORD_DIM_ID"
> >        ]
> >      },
> >      "hierarchy": null,
> >      "table": "KEYWORDS",
> >      "column": "{FK}",
> >      "datatype": null,
> >      "derived": [
> >        "PUBLISHER_GROUP_ID",
> >        "PUBLISHER_CAMPAIGN_ID",
> >        "PUBLISHER_ID"
> >      ]
> >    }
> >  ],
> >  "measures": [
> >    {
> >      "id": 1,
> >      "name": "_COUNT_",
> >      "function": {
> >        "expression": "COUNT",
> >        "parameter": {
> >          "type": "constant",
> >          "value": "1"
> >        },
> >        "returntype": "bigint"
> >      },
> >      "dependent_measure_ref": null
> >    },
> >    {
> >      "id": 2,
> >      "name": "CONVERSIONS",
> >      "function": {
> >        "expression": "SUM",
> >        "parameter": {
> >          "type": "column",
> >          "value": "CONVERSIONS"
> >        },
> >        "returntype": "bigint"
> >      },
> >      "dependent_measure_ref": null
> >    }
> >  ],
> >  "rowkey": {
> >    "rowkey_columns": [
> >      {
> >        "column": "KEYWORD_DIM_ID",
> >        "length": 0,
> >        "dictionary": "false",
> >        "mandatory": false
> >      }
> >    ],
> >    "aggregation_groups": [
> >      [
> >        "KEYWORD_DIM_ID"
> >      ]
> >    ]
> >  },
> >  "signature": "T+aYTH/KlCwwmVAGRQR3hQ==",
> >  "capacity": "LARGE",
> >  "last_modified": 1424367558297,
> >  "fact_table": "FACTS",
> >  "null_string": null,
> >  "filter_condition": "KEYWORDS.PUBLISHER_GROUP_ID=386784931",
> >  "cube_partition_desc": {
> >    "partition_date_column": null,
> >    "partition_date_start": 0,
> >    "cube_partition_type": "APPEND"
> >  },
> >  "hbase_mapping": {
> >    "column_family": [
> >      {
> >        "name": "F1",
> >        "columns": [
> >          {
> >            "qualifier": "M",
> >            "measure_refs": [
> >              "_COUNT_",
> >              "CONVERSIONS"
> >            ]
> >          }
> >        ]
> >      }
> >    ]
> >  },
> >  "notify_list": [
> >    "sam"
> >  ]
> >}
> >
> >
> >Cheers,
> >sam
> >
> >On Thu, Feb 19, 2015 at 9:49 PM, 周千昊 <z....@gmail.com> wrote:
> >
> >> Also, since you set the dictionary to false, there should not be any
> >> memory consumed while building the dictionary.
> >> So can you also give us the JSON description of the cube? (In the
> >> Cube tab, click the corresponding cube, then click the JSON button.)
> >>
> >>
> >> On Fri Feb 20 2015 at 1:39:15 PM 周千昊 <z....@gmail.com> wrote:
> >>
> >> > Hi, Samuel
> >> >      Can you give us some detailed logs, so we can dig into the
> >> > root cause?
> >> >
> >> > On Fri Feb 20 2015 at 2:44:32 AM Samuel Bock <sbock@marinsoftware.com>
> >> > wrote:
> >> >
> >> >> Hello all,
> >> >>
> >> >> We are in the process of evaluating Kylin for use as an OLAP
> >> >> engine. To that end, we are trying to get a minimum viable setup
> >> >> with a representative sample of our data in order to gather
> >> >> performance metrics. We have Kylin running against a 10-node
> >> >> cluster, the provided cubes build successfully, and the system
> >> >> seems functional. Attempting to build a simple cube against our
> >> >> data results in an OutOfMemoryError in the Kylin server process (so
> >> >> far we have given it up to a 46 GB heap). I was wondering if you
> >> >> could give me some guidance as to likely causes, or any
> >> >> configuration I'm likely to have missed, before I start diving into
> >> >> the source. I have changed the "dictionary" setting to false, as
> >> >> recommended for high-cardinality dimensions, but have not changed
> >> >> the configuration significantly apart from that.
> >> >>
> >> >> For reference, the sizes of the Hive tables we're building the
> >> >> cubes from are:
> >> >> dimension table: 25,399,061 rows
> >> >> fact table: 270,940,921 rows
> >> >>
> >> >> (And as a note, there are no pertinent log messages except to
> >> >> indicate that it is in the Build Dimension Dictionary step.)
> >> >>
> >> >> Thank you,
> >> >> sam bock
> >> >>
> >> >
> >>
>
>


Re: OutOfMemoryError on step #3 of Cube build

Posted by 蒋旭 <ji...@qq.com>.
Hi Sam,


Thanks for your response! Could you try the pre-join on your test data set first? You can verify whether Kylin can meet your requirements on the test data set or not.


If the pre-join solution works, we can add a "pre-join" option to the cube definition and automate it in the cube build engine. Then you could change the dimension data easily without impacting the cube build.


Thanks
Jiang Xu


------------------ Original Message ------------------
From: "Samuel Bock" <sb...@marinsoftware.com>
Sent: Saturday, 2015-02-28, 2:23 AM
To: "蒋旭" <ji...@qq.com>
Cc: "dev" <de...@kylin.incubator.apache.org>
Subject: Re: OutOfMemoryError on step #3 of Cube build



While that might be possible when putting together a test dataset, the
actual system will need to retain the ability to change dimension data
easily. A prejoined table would make that significantly harder (among other
things).

thanks,
Sam



Re: OutOfMemoryError on step #3 of Cube build

Posted by Samuel Bock <sb...@marinsoftware.com>.
While that might be possible when putting together a test dataset, the
actual system will need to retain the ability to change dimension data
easily. A prejoined table would make that significantly harder (among other
things).

thanks,
Sam


On Wed, Feb 25, 2015 at 4:38 PM, 蒋旭 <ji...@qq.com> wrote:

> As a workaround, could you prejoin the big dimension table with the fact
> table in Hive? Then you can run Kylin on the prejoined table.
>
> We will work on optimizing big dimension tables later.
>
> Thanks
> Jiang Xu
>