Posted to dev@kylin.apache.org by yu feng <ol...@gmail.com> on 2015/09/10 12:20:48 UTC

rubbish files exist in HDFS

I saw this core improvement in release 1.0; JIRA URL:
https://issues.apache.org/jira/browse/KYLIN-926

However, after testing it and checking the source code, I found some rubbish
(I am not sure) files in HDFS.

First, Kylin only drops the intermediate table in Hive, but the table is an
EXTERNAL table, so its files still exist in the Kylin tmp directory in HDFS
(I checked that).

Second, the cuboid files take a large amount of space in HDFS, and Kylin does
not delete them after the cube build (fact_distinct_columns files remain
too). I am not sure whether they have other uses; please remind me if they do.

Third, after I discard a job, I think Kylin should delete the intermediate
files and drop the intermediate Hive table, even if it deletes them
asynchronously. I think that data has no further use; please remind me if it
does.

This rubbish data still exists in the current version (kylin-1.0); please
check, thanks.
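
For the first point, note that in Hive, DROP TABLE on an EXTERNAL table
removes only the metastore entry and leaves the data files in place, so the
directory has to be removed separately. A minimal sketch of such a cleanup
using the standard Hadoop FileSystem API; the path argument is hypothetical
and depends on the configured Kylin working directory:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DropLeftoverDir {
        public static void main(String[] args) throws Exception {
            // args[0]: directory of the dropped intermediate table, e.g. a
            // kylin_intermediate_* dir under the Kylin tmp path (hypothetical)
            FileSystem fs = FileSystem.get(new Configuration());
            Path leftover = new Path(args[0]);
            if (fs.exists(leftover)) {
                fs.delete(leftover, true); // recursive, like 'hadoop fs -rm -r'
            }
        }
    }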

Re: rubbish files exist in HDFS

Posted by ShaoFeng Shi <sh...@gmail.com>.
BTW, you can actually treat v1.0 as 0.7.3; it is compatible with 0.7.x.
I suggest you upgrade directly to v1.1, which will be released soon and
includes a couple of bug fixes and performance enhancements.


Re: rubbish files exist in HDFS

Posted by "Shi, Shaofeng" <sh...@ebay.com>.
For v1.0 or earlier, please refer to this doc to do a manual cleanup:

https://kylin.incubator.apache.org/docs/howto/howto_cleanup_storage.html
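
As a rough illustration of what such a manual sweep involves: list the
entries under the Kylin working directory and flag old ones as cleanup
candidates. This is only a sketch; the working-directory path is an
assumption, and every candidate should be cross-checked against the live
cubes and jobs before anything is deleted:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListOldJobDirs {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path workingDir = new Path(args[0]); // e.g. /kylin (assumption)
            long cutoff = System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000;
            for (FileStatus status : fs.listStatus(workingDir)) {
                if (status.getModificationTime() < cutoff) {
                    // candidate only; verify against job/cube metadata first
                    System.out.println("candidate: " + status.getPath());
                }
            }
        }
    }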

Re: rubbish files exist in HDFS

Posted by Luke Han <lu...@gmail.com>.
Hi Abhilash,
   I would recommend upgrading to v1.0 or v1.1 (which is in the release
process now).

   Thanks.
Luke


Best Regards!
---------------------

Luke Han


Re: rubbish files exist in HDFS

Posted by Abhilash L L <ab...@infoworks.io>.
Hello,

    We observed that purging and dropping a cube does not delete the
dictionaries / snapshots and also does not drop the table in HBase.

    Also, it leaves a lot of temporary data in HDFS.

    We are on 0.7.2. I hope this is fixed shortly and with priority.

    I saw that the ticket has been fixed in v1.1 and v2. Can this be
backported to 0.7.2?

Regards,
Abhilash
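
Until that is fixed, a leftover HTable can be removed by hand. A hedged
sketch using the HBase client API of that era (the table name is
hypothetical; identify the real one from the cube metadata first):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class DropLeftoverHTable {
        public static void main(String[] args) throws Exception {
            HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
            String table = args[0]; // e.g. "KYLIN_ABC123" (hypothetical)
            if (admin.tableExists(table)) {
                admin.disableTable(table); // a table must be disabled first
                admin.deleteTable(table);
            }
            admin.close();
        }
    }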


Re: rubbish files exist in HDFS

Posted by yu feng <ol...@gmail.com>.
After building another cube successfully, I rechecked this bug and found the
reason. Thanks to all of you...


Re: rubbish files exist in HDFS

Posted by ShaoFeng Shi <sh...@gmail.com>.
If "rowkey_stats" isn't found, Kylin should throw an exception and exit
instead of silently using 1 region; I'm going to change this, please let me
know if you disagree.
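
A minimal sketch of that fail-fast behavior, using the Hadoop FileSystem API
(the path handling and message are illustrative, not the actual Kylin code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RowkeyStatsCheck {
        // Illustrative sketch: abort instead of silently falling back to a
        // one-region HTable when the rowkey_stats output is missing.
        static void ensureRowkeyStatsExists(Configuration conf, Path statsPath)
                throws Exception {
            FileSystem fs = FileSystem.get(conf);
            if (!fs.exists(statsPath)) {
                throw new IllegalStateException("rowkey_stats not found at "
                        + statsPath + "; refusing to create HTable with 1 region");
            }
        }
    }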


Re: rubbish files exist in HDFS

Posted by Yerui Sun <su...@gmail.com>.
Hi, yu feng,
  Let me guess the reason for your problem.

  The number of reducers in the convert-to-HFile job depends on the number of
regions in the corresponding HTable.

  For now, all HTables were created with only one region, caused by the wrong
path of rowkey_stats. I've opened a JIRA for this issue:
https://issues.apache.org/jira/browse/KYLIN-968. The patch has been available
since last night.

  Here are some clues to confirm my guess:
  1. Find the corresponding HTable name in the log and check its regions; it
should have only one region.
  2. Check your Kylin working directory on HDFS; there should be a path like
'../kylin-null/../rowkey_stats'.
  3. Grep your kylin.log in the Tomcat dir; you should find a log line
containing 'no region split, HTable will be one region'.

  If you hit all three clues, I think KYLIN-968 could resolve your problem.
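
For the first clue, a hedged sketch of checking the region count with the
HBase admin API of that era (newer HBase versions use Connection/Admin
instead):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CheckRegionCount {
        public static void main(String[] args) throws Exception {
            HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
            // args[0]: the HTable name found in the Kylin log
            int regions = admin.getTableRegions(TableName.valueOf(args[0])).size();
            System.out.println(args[0] + " has " + regions + " region(s)");
            admin.close();
        }
    }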

  
> On Sep 11, 2015, at 00:54, yu feng <ol...@gmail.com> wrote:
> 
> OK, I find another problem (I am a problem maker, ^_^). Today I built this
> cube, which has 15 dimensions (one mandatory dimension, two hierarchy
> dimensions, and the others normal dimensions), and I found the cuboid files
> are 1.9TB; the step of converting cuboids to HFiles is too slow. I checked
> the log of this job and found there are 9000+ mappers and only one reducer.
> 
> I discarded this job when our Hadoop administrator told me the node which
> runs this reducer is out of disk space. I had to stop it, and I wonder why
> there is only one reducer (I did not check the source code of this job). By
> the way, my original data is only hundreds of MB. I think this would cause
> more problems if the original data were bigger or there were many more
> dimensions.
> 
> 2015-09-10 23:46 GMT+08:00 Luke Han <lu...@gmail.com>:
> 
>> The 2.0 release will not come soon; there is huge refactoring and a bunch
>> of new features, and we have to make sure there are no critical bugs before
>> release.
>> 
>> The same function is also available under the v1.x branch; please stay
>> tuned for updates on that.
>> 
>> Thanks.
>> 
>> 
>> Best Regards!
>> ---------------------
>> 
>> Luke Han
>> 
>> On Thu, Sep 10, 2015 at 7:50 PM, yu feng <ol...@gmail.com> wrote:
>> 
>>> What good news! I hope you can release that version as quickly as
>>> possible. Today I built a cube whose cuboid files total 1.9TB; if we merge
>>> cubes based on cuboid files, I think it will be very slow.
>>> 
>>> 2015-09-10 19:34 GMT+08:00 Shi, Shaofeng <sh...@ebay.com>:
>>> 
>>>> We have implemented the merge from HTable directly in Kylin 2.0, which
>>>> hasn’t been released/announced.
>>>> 
>>>> On 9/10/15, 7:22 PM, "yu feng" <ol...@gmail.com> wrote:
>>>> 
>>>>> I think Kylin could finish the merge depending only on the tables in
>>>>> HBase; that would make merging cubes quicker, wouldn't it?
>>>>> 
>>>>> 2015-09-10 19:16 GMT+08:00 yu feng <ol...@gmail.com>:
>>>>> 
>>>>>> After checking the source code, I find you are right: cuboid files will
>>>>>> be used while merging segments. But a new question comes: why doesn't
>>>>>> Kylin merge segments based on the HBase tables? I cannot find where an
>>>>>> HBase table is used as the input format of a MapReduce job, but Kylin
>>>>>> does use HFileOutputFormat as the output format while converting
>>>>>> cuboids to HFiles.
>>>>>> 
>>>>>> From this, I find Kylin actually takes more space for a cube: not only
>>>>>> the HFiles but also the cuboid files. The former are used for query and
>>>>>> the latter for merge, and the cuboid files are bigger than the HFiles.
>>>>>> 
>>>>>> I think we could do something to optimize this... I want to know your
>>>>>> opinions about it.
>>>>>> 
>>>>>> 2015-09-10 18:36 GMT+08:00 Yerui Sun <su...@gmail.com>:
>>>>>> 
>>>>>>> Hi, yu feng,
>>>>>>>  I've also noticed these files and opened a JIRA:
>>>>>>> https://issues.apache.org/jira/browse/KYLIN-978, and I'll post a patch
>>>>>>> tonight.
>>>>>>> 
>>>>>>>  Here are my opinions on your three questions; feel free to correct me:
>>>>>>> 
>>>>>>>  First, the data path of the intermediate Hive table should be deleted
>>>>>>> after building; I agree with that.
>>>>>>> 
>>>>>>>  Second, the cuboid files will be used for merge and will be deleted
>>>>>>> when the merge job completes; we need to, and must, leave them on HDFS.
>>>>>>> The fact_distinct_columns files should be deleted. Additionally, the
>>>>>>> rowkey_stats and hfile paths should also be deleted.
>>>>>>> 
>>>>>>>  Third, there are no garbage collection steps if a job is discarded;
>>>>>>> maybe we need a patch for this.
>>>>>>> 
>>>>>>> 
>>>>>>> Short answer:
>>>>>>>  KYLIN-978 will clean all HDFS paths except the cuboid files after the
>>>>>>> build job and merge job complete.
>>>>>>>  The HDFS paths will not be cleaned up if a job was discarded; we need
>>>>>>> improvement on this.
>>>>>>> 
>>>>>>> 
>>>>>>> Best Regards,
>>>>>>> Yerui Sun
>>>>>>> sunyerui@gmail.com
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> 在 2015年9月10日,18:20,yu feng <ol...@gmail.com> 写道:
>>>>>>>> 
>>>>>>>> I see this core Improvement in release 1.0, JIRA url :
>>>>>>>> https://issues.apache.org/jira/browse/KYLIN-926
>>>>>>>> 
>>>>>>>> However, after my test and check the source code , I find some
>>>>>>> rubbish(I am not
>>>>>>>> sure) file in HDFS.
>>>>>>>> 
>>>>>>>> First, kylin only drop the Intermediate table in hive, but the
>>> table
>>>>>>> is
>>>>>>> an
>>>>>>>> EXTERNAL table, the file still exist in kylin tmp directory in
>>> HDFS(I
>>>>>>> check
>>>>>>>> that..)
>>>>>>>> 
>>>>>>>> Second, the cuboid files take a large space in HDFS, and kylin do
>>> not
>>>>>>>> delete after the cube build(fact_distinct_columns files exist
>> too).
>>>>>>> I am
>>>>>>>> not sure if those has other effects, remind me please if it has..
>>>>>>>> 
>>>>>>>> Third, After I discard a job, I think kylin should delete the
>>>>>>> Intermediate
>>>>>>>> files and drop Intermediate hive table, even though delete
>>>>>>>> them asynchronous. I think those data do not have any
>>>>>>> effects..remind me
>>>>>>>> please if it has..
>>>>>>>> 
>>>>>>>> These are rubbish datas still exist in current
>> version(kylin-1.0),
>>>>>>> please
>>>>>>>> check, thanks..
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
>> 


Re: rubbish files exist in HDFS

Posted by yu feng <ol...@gmail.com>.
OK, I have found another problem (I am a problem maker, ^_^). Today I built a
cube with 15 dimensions (one mandatory dimension, two hierarchy dimensions,
and the rest normal dimensions), and its cuboid files come to 1.9 TB. The
step that converts cuboids to HFiles is far too slow; checking the job log, I
found 9000+ mappers but only one reducer.

I discarded the job when our Hadoop administrator told me that the node
running this single reducer was out of disk space, so I had to stop it. I
wonder why there is only one reducer (I have not checked the source code of
this job). By the way, my original data is only a few hundred MB; I think
this would cause even bigger problems with larger source data or more
dimensions.
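
As a hedged illustration of what KYLIN-968 restores (the names and split
keys here are made up; this is a sketch rather than Kylin's real code): the
HTable should be created pre-split with the keys computed by the
rowkey_stats step, so that the HFile job above gets one reducer per region
instead of one reducer in total.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTableSketch {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    HTableDescriptor desc =
        new HTableDescriptor(TableName.valueOf("KYLIN_EXAMPLE_TABLE")); // hypothetical
    desc.addFamily(new HColumnDescriptor("F1"));
    // In Kylin these split keys would come from the rowkey_stats output;
    // they are hard-coded here purely for illustration:
    byte[][] splitKeys = {
        Bytes.toBytes("\u0000\u0001"),
        Bytes.toBytes("\u0000\u0002"),
        Bytes.toBytes("\u0000\u0003")
    };
    admin.createTable(desc, splitKeys); // creates splitKeys.length + 1 regions
    admin.close();
  }
}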


Re: rubbish files exist in HDFS

Posted by Luke Han <lu...@gmail.com>.
The 2.0 release will not come soon; it involves a huge refactor and a bunch
of new features, and we have to make sure there are no critical bugs before
releasing.

The same functionality is also available on the v1.x branch; please stay
tuned for updates on that.

Thanks.


Best Regards!
---------------------

Luke Han


Re: rubbish files exist in HDFS

Posted by yu feng <ol...@gmail.com>.
What good news! I hope you can release that version as quickly as possible.
Today I built a cube whose cuboid files total 1.9 TB; if we merge cubes
based on cuboid files, I think it will be very slow.


Re: rubbish files exist in HDFS

Posted by "Shi, Shaofeng" <sh...@ebay.com>.
We have implemented merging directly from the HTable in Kylin 2.0, which
hasn't been released/announced yet.



Re: rubbish files exist in HDFS

Posted by Luke Han <lu...@gmail.com>.
You are right, Yu: those files are used as the input source during a merge.

They can be cleaned up after the merge completes,

so they are really just long-lived temporary files.

Thanks.


Best Regards!
---------------------

Luke Han


Re: rubbish files exist in HDFS

Posted by yu feng <ol...@gmail.com>.
I think Kylin could finish the merge based only on the tables in HBase. That
would make merging cubes much faster, wouldn't it?


Re: rubbish files exist in HDFS

Posted by yu feng <ol...@gmail.com>.
After checking the source code, I find you are right: the cuboid files are
used while merging segments. But a new question comes up: why doesn't Kylin
merge segments based just on the HFiles? I cannot find anywhere that an
HBase table is taken as the input format of a MapReduce job, although Kylin
does take HFileOutputFormat as the output format when converting cuboids to
HFiles.

From this, I see that Kylin actually takes more space per cube: not only the
HFiles but also the cuboid files. The former are used for queries and the
latter for merges, and the cuboid files are larger than the HFiles.

I think we could do something to optimize this; I would like to know your
opinions about it.
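
On the question above of taking an HBase table as MapReduce input: HBase
does ship a TableInputFormat for this, usually wired up through
TableMapReduceUtil. A minimal sketch (table and class names are made up, and
this is not how Kylin actually implements merging):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class MergeFromHTableSketch {

  // Reads cuboid rows from a segment's HTable; one map call per row.
  static class CuboidMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context ctx) {
      // aggregate/merge logic would go here
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(HBaseConfiguration.create(),
        "merge segment from htable (sketch)");
    Scan scan = new Scan();
    scan.setCaching(500);       // larger scanner cache for MR throughput
    scan.setCacheBlocks(false); // don't pollute the region server block cache
    TableMapReduceUtil.initTableMapperJob(
        "KYLIN_SEGMENT_TABLE",  // hypothetical source HTable name
        scan, CuboidMapper.class, Text.class, Text.class, job);
    // a reducer plus HFileOutputFormat would then write the merged segment
    job.waitForCompletion(true);
  }
}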


Re: rubbish files exist in HDFS

Posted by "Shi, Shaofeng" <sh...@ebay.com>.
Good summary and answer, thank you Yerui!



Re: rubbish files exist in HDFS

Posted by Yerui Sun <su...@gmail.com>.
Hi, yu feng,
  I've also noticed these files and opened a JIRA: https://issues.apache.org/jira/browse/KYLIN-978; I'll post a patch tonight.

  Here are my opinions on your three questions; feel free to correct me:

  First, the data path of the intermediate hive table should be deleted after building; I agree with that.

  Second, the cuboid files are used for merging and are deleted when the merge job completes, so we need to, and must, leave them on HDFS. The fact_distinct_columns files should be deleted. Additionally, the rowkey_stats and hfile paths should also be deleted.

  Third, there is no garbage collection step when a job is discarded; maybe we need a patch for this.


Short answer:
  KYLIN-978 will clean all HDFS paths except the cuboid files after a build job or merge job completes.
  The HDFS paths will not be cleaned up if a job was discarded; we need an improvement for this.
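
As an illustration only, the cleanup step described above could look roughly
like the following sketch (the job directory layout here is assumed, not
Kylin's verified layout):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class JobPathCleanupSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical per-job working directory under the Kylin HDFS root:
    Path jobDir = new Path("/kylin/kylin_metadata/kylin-<job-id>");
    // Delete the temp outputs; the cuboid files are intentionally kept for merges.
    for (String sub : new String[] {"fact_distinct_columns", "rowkey_stats", "hfile"}) {
      Path p = new Path(jobDir, sub);
      if (fs.exists(p)) {
        fs.delete(p, true); // recursive delete
      }
    }
  }
}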
 

Best Regards,
Yerui Sun
sunyerui@gmail.com


