Posted by Alexander Sterligov <> on 2017/08/09 09:54:33 UTC

HFile is empty if kylin.hbase.cluster.fs is set to s3


I set kylin.hbase.cluster.fs to the S3 bucket where HBase lives.

The "Convert Cuboid Data to HFile" step finished without errors. The statistics
at the end of the job said that it had written lots of data to S3.

But there are no HFiles in the kylin_metadata folder
(kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/<table name>/hfile),
only a _temporary folder and a _SUCCESS file.

_temporary contains HFiles inside attempt folders. It looks like they were
not copied from _temporary to the result dir. But there are no errors in either
the Kylin log or the reducers' logs.

Loading the empty HFiles then produces empty segments.

Is that a bug, or am I doing something wrong?
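
The symptom described above (a _SUCCESS marker and a _temporary tree, but no committed HFiles) can be checked mechanically. The following is a minimal sketch, not part of Kylin or HBase: it walks a local directory that stands in for the S3 output path and separates files committed under <column family>/ subdirectories from files still sitting under _temporary. The function names, the attempt folder name, and the file name are hypothetical; only the _temporary/1 level, the _SUCCESS marker, and the F1 column family come from the logs in this thread.

```python
import tempfile
from pathlib import Path


def classify_hfile_output(output_dir: Path) -> dict:
    """Split files under a job output dir into committed vs. left in _temporary."""
    committed, uncommitted = [], []
    for path in sorted(output_dir.rglob("*")):
        # Directories and the _SUCCESS marker are not HFiles.
        if not path.is_file() or path.name == "_SUCCESS":
            continue
        relative = path.relative_to(output_dir)
        if relative.parts[0] == "_temporary":
            uncommitted.append(relative.as_posix())
        elif len(relative.parts) == 2:
            # Expected committed layout: <column family>/<hfile>
            committed.append(relative.as_posix())
    return {"committed": committed, "uncommitted": uncommitted}


def demo() -> dict:
    """Build a hypothetical layout that mimics the reported symptom."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "hfile"
        # Hypothetical attempt folder and file name, for illustration only.
        attempt = out / "_temporary" / "1" / "attempt_0001_r_000000_0" / "F1"
        attempt.mkdir(parents=True)
        (attempt / "a897b4d3").write_bytes(b"x")
        (out / "_SUCCESS").touch()
        return classify_hfile_output(out)


if __name__ == "__main__":
    print(demo())
```

An empty "committed" list together with a non-empty "uncommitted" list corresponds to the situation reported here: the job wrote data, but nothing was promoted out of _temporary.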

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by Alexander Sterligov <>.
I totally agree:

> Another problem is that I had to raise the HBase RPC timeout, because bulk
> loading from HDFS takes long. That was not trivial. 3 minutes works well,
> but with the drawback of queries or metadata writes hanging for 3 minutes if
> something bad happens.

On Thu, Sep 7, 2017 at 1:23 PM, ShaoFeng Shi <> wrote:

> Setting hbase.rpc.timeout to a large value has a drawback, I think: it will
> cause other RPC operations to wait longer. So the best way is to write the
> HFile directly to the S3 bucket that HBase reads. I'm not sure whether HBase
> still needs a move operation; if it does, that will become another problem.
> 2017-09-07 18:02 GMT+08:00 Alexander Sterligov <>:
>> Just in case - I've changed it in /etc/hbase/conf/hbase-site.xml
>> On Thu, Sep 7, 2017 at 12:59 PM, ShaoFeng Shi <>
>> wrote:
>>> Thanks; I also set a larger value for the rpc timeout, but it didn't
>>> change the behavior. I'm using EMR 5.5, not sure whether it is a bug.
>>> 2017-09-07 17:24 GMT+08:00 Alexander Sterligov <>:
>>>> Hi,
>>>> I've set a large HBase timeout:
>>>> <property>
>>>>   <name>hbase.rpc.timeout</name>
>>>>   <value>1800000</value>
>>>> </property>
>>>> On Thu, Sep 7, 2017 at 12:02 PM, ShaoFeng Shi <>
>>>> wrote:
>>>>> Hi Alexander,
>>>>> I encountered a problem when using HDFS for cube building and S3 for
>>>>> HBase on EMR. In the "Load HFile to HBase Table" step, Kylin failed
>>>>> with a timeout error:
>>>>> Thu Sep 07 15:33:27 GMT+08:00 2017, RpcRetryingCaller{globalStartTime=1504769048975, pause=100, retries=35}, Call to ip-10-0-0-28.ec2.internal/ failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=41, waitTime=60001, operationTimeout=60000
>>>>> On the HBase region server, I saw HBase uploading the HFile to S3. Since the
>>>>> cube is a little big (13GB), this takes much longer than usual. The Kylin
>>>>> client closed the connection because it considered the call timed out:
>>>>> 2017-09-07 08:01:12,275 INFO  [RpcServer.FifoWFPBQ.default.handler=16,queue=1,port=16020] regionserver.HRegionFileSystem: Bulk-load file hdfs://ip-10-0-0-118.ec2.internal:8020/kylin/kylin_default_instance/kylin-cdcb5f57-2ea9-47d9-85db-7a6c7490cc55/test/hfile/F1/a897b4d33ed648e6a5d0bfb05cffdfd6 is on different filesystem than the destination store. Copying file over to destination filesystem.
>>>>> 2017-09-07 08:01:23,919 INFO  [RpcServer.FifoWFPBQ.default.handler=22,queue=1,port=16020] s3.MultipartUploadManager: completed multipart upload of 8 parts 965420145 bytes
>>>>> 2017-09-07 08:26:33,838 WARN  [RpcServer.FifoWFPBQ.default.handler=20,queue=2,port=16020] ipc.RpcServer: (responseTooSlow): {"call":"BulkLoadHFile(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$BulkLoadHFileRequest)","starttimems":1504770958916,"responsesize":2,"method":"BulkLoadHFile","param":"TODO: class org.apache.hadoop.hbase.protobuf.generated.ClientProtos$BulkLoadHFileRequest","processingtimems":1834922,"client":"10.0.0.243:49152","queuetimems":0,"class":"HRegionServer"}
>>>>> 2017-09-07 08:26:33,838 WARN  [RpcServer.FifoWFPBQ.default.handler=20,queue=2,port=16020] ipc.RpcServer: RpcServer.FifoWFPBQ.default.handler=20,queue=2,port=16020: caught a ClosedChannelException, this means that the server / was processing a request but the client went away. The error message was: null
>>>>> So I wonder how you bypassed this problem: did you set a very large
>>>>> timeout value for HBase, or is your cube not that big? Thanks.
>>>>> 2017-08-14 14:19 GMT+08:00 Alexander Sterligov <>:
>>>>>> Here is the ticket for the HFile on S3 issue -
>>>>>> ra/browse/KYLIN-2788
>>>>>> On Mon, Aug 14, 2017 at 9:17 AM, Alexander Sterligov <> wrote:
>>>>>>> I forgot there was one more issue with S3 -
>>>>>>> Global dictionary in 2.0 doesn't work out of the box. I patched
>>>>>>> Kylin as described in the ticket.
>>>>>>> On Sun, Aug 13, 2017 at 4:24 AM, ShaoFeng Shi <> wrote:
>>>>>>>> Nice. The issue of writing HFiles to S3 needs more
>>>>>>>> investigation. Please open a Kylin JIRA for tracking. We will update it there
>>>>>>>> if we have any findings.
>>>>>>>> 2017-08-12 23:52 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>> Query performance is pretty much the same as on the slides about Kylin. I have
>>>>>>>>> a high bucket cache hit rate (>90%), so data is almost always read from local
>>>>>>>>> disk. For some other use cases it might be different.
>>>>>>>>> On Aug 12, 2017 at 17:59, "ShaoFeng Shi" <> wrote:
>>>>>>>>> Cool; how about the query performance with data on S3?
>>>>>>>>> 2017-08-11 23:27 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>>> Yes, that's the only one for now.
>>>>>>>>>> On Fri, Aug 11, 2017 at 6:23 PM, ShaoFeng Shi <> wrote:
>>>>>>>>>>> No need to add them, I think, because I see they are already in the
>>>>>>>>>>> configuration of that step.
>>>>>>>>>>> Is this the only issue you see with Kylin on EMR+S3?
>>>>>>>>>>> [image: inline image 1]
>>>>>>>>>>> 2017-08-11 20:26 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>>>>> What if we add the direct output settings to kylin_job_conf.xml
>>>>>>>>>>>> and kylin_job_conf_inmem.xml?
>>>>>>>>>>>> hbase.zookeeper.quorum, for example, doesn't work if it is not
>>>>>>>>>>>> specified in these configs.
>>>>>>>>>>>> On Fri, Aug 11, 2017 at 3:13 PM, ShaoFeng Shi <> wrote:
>>>>>>>>>>>>> EMR enables direct output in mapred-site.xml, but in
>>>>>>>>>>>>> this step these settings don't seem to work (although the job's
>>>>>>>>>>>>> configuration shows they are there). I disabled direct output, but the
>>>>>>>>>>>>> behavior did not change. I did some searching but found nothing. I need to drop the
>>>>>>>>>>>>> EMR cluster now, and may get back to it later.
>>>>>>>>>>>>> If you have any ideas or findings, please share them. We'd like
>>>>>>>>>>>>> Kylin to have better support for the cloud.
>>>>>>>>>>>>> Thanks for your feedback!
>>>>>>>>>>>>> 2017-08-11 19:19 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>>>>>>> Any ideas how to fix that?
>>>>>>>>>>>>>> On Fri, Aug 11, 2017 at 2:16 PM, ShaoFeng Shi <> wrote:
>>>>>>>>>>>>>>> I got the same problem as you:
>>>>>>>>>>>>>>> 2017-08-11 08:44:16,342 WARN  [Job 2c86b4b6-7639-4a97-ba63-63c9dca095f6-2255] mapreduce.LoadIncrementalHFiles:422 : Bulk load operation did not find any files to load in directory s3://privatekeybucket-anac5h41523l/kylin/kylin_default_instance/kylin-2c86b4b6-7639-4a97-ba63-63c9dca095f6/kylin_sales_cube_clone3/hfile.  Does it contain files in subdirectories that correspond to column family names?
>>>>>>>>>>>>>>> In the S3 view, I see that the files exist in the "_temporary" folder;
>>>>>>>>>>>>>>> it seems they were not moved to the target folder on completion. It seems EMR tries to
>>>>>>>>>>>>>>> write directly to the output path, but actually does not.
>>>>>>>>>>>>>>> 2017-08-11 16:34 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>>>>>>>>> No, defaultFs is hdfs.
>>>>>>>>>>>>>>>> I've seen such behavior when I set the working dir to S3 but
>>>>>>>>>>>>>>>> didn't set cluster-fs at all. Maybe you have a typo in the name of the
>>>>>>>>>>>>>>>> property. I used the old one, "kylin.hbase.cluster.fs".
>>>>>>>>>>>>>>>> When both working-dir and cluster-fs were set to S3, I got the
>>>>>>>>>>>>>>>> _temporary dir of the convert job on S3, but no HFiles. I also saw the correct
>>>>>>>>>>>>>>>> output path for the job in the log. I didn't check whether the job creates
>>>>>>>>>>>>>>>> temporary files in S3 and then copies the results to HDFS, but I hardly believe that
>>>>>>>>>>>>>>>> happens.
>>>>>>>>>>>>>>>> Do you see the proper arguments for the step in the log?
>>>>>>>>>>>>>>>> On Aug 11, 2017, at 11:17, ShaoFeng Shi <> wrote:
>>>>>>>>>>>>>>>> Hi Alexander,
>>>>>>>>>>>>>>>> That makes sense. Using S3 for cube build and storage is
>>>>>>>>>>>>>>>> required for a cloud Hadoop environment.
>>>>>>>>>>>>>>>> I tried to reproduce this problem. I created an EMR cluster with S3
>>>>>>>>>>>>>>>> as HBase storage; in, I set "kylin.env.hdfs-working-dir"
>>>>>>>>>>>>>>>> and "" to the S3 bucket. But
>>>>>>>>>>>>>>>> in the "Convert Cuboid Data to HFile" step, Kylin still
>>>>>>>>>>>>>>>> writes to local HDFS. Did you modify core-site.xml to make S3 the
>>>>>>>>>>>>>>>> default FS?
>>>>>>>>>>>>>>>> 2017-08-10 22:53 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>>>>>>>>>> Yes, I worked around this problem that way and it works.
>>>>>>>>>>>>>>>>> One problem with this solution is that I have to use a pretty
>>>>>>>>>>>>>>>>> large HDFS and it's expensive. I also have to garbage-collect it
>>>>>>>>>>>>>>>>> manually, because the data is not moved to S3 but copied. The Kylin cleanup job doesn't
>>>>>>>>>>>>>>>>> work for it, because the main metadata folder is on S3. So it would be really
>>>>>>>>>>>>>>>>> nice to put everything on S3.
>>>>>>>>>>>>>>>>> Another problem is that I had to raise the HBase RPC timeout,
>>>>>>>>>>>>>>>>> because bulk loading from HDFS takes long. That was not trivial. 3 minutes
>>>>>>>>>>>>>>>>> works well, but with the drawback of queries or metadata writes hanging for 3
>>>>>>>>>>>>>>>>> minutes if something bad happens. But that's a rare event.
>>>>>>>>>>>>>>>>> On Aug 10, 2017 at 17:42, "ShaoFeng Shi" <> wrote:
>>>>>>>>>>>>>>>>> How about leaving "kylin.hbase.cluster.fs" empty?
>>>>>>>>>>>>>>>>>> This property is for a two-cluster deployment (one Hadoop cluster for cube build, the
>>>>>>>>>>>>>>>>>> other for query).
>>>>>>>>>>>>>>>>>> When it is empty, the HFile will be written to the default FS
>>>>>>>>>>>>>>>>>> (HDFS in EMR), and then loaded into HBase. I'm not sure whether EMR HBase
>>>>>>>>>>>>>>>>>> (using S3 as storage) can bulk load files from HDFS or not. If it can, that
>>>>>>>>>>>>>>>>>> would be great, as the write performance of HDFS would be better than S3.
>>>>>>>>>>>>>>>>>> 2017-08-10 22:29 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>>>>>>>>>>>> I also thought about it, but no, it's not consistency.
>>>>>>>>>>>>>>>>>>> Consistent view is enabled. I use the same S3 for my own
>>>>>>>>>>>>>>>>>>> map-reduce jobs and it's OK.
>>>>>>>>>>>>>>>>>>> I also checked whether it lost consistency (emrfs diff). No
>>>>>>>>>>>>>>>>>>> problems.
>>>>>>>>>>>>>>>>>>> In case of S3 inconsistency, files disappear right
>>>>>>>>>>>>>>>>>>> after they are written and appear some time later. The HFiles didn't appear
>>>>>>>>>>>>>>>>>>> after a day, but _temporary is there.
>>>>>>>>>>>>>>>>>>> It's 100% reproducible. I think I'll investigate this
>>>>>>>>>>>>>>>>>>> problem by running the conversion job manually.
>>>>>>>>>>>>>>>>>>> On Aug 10, 2017 at 17:18, "ShaoFeng Shi" <> wrote:
>>>>>>>>>>>>>>>>>>> Did you enable the Consistent View? This article
>>>>>>>>>>>>>>>>>>>> explains the challenge of using S3 directly for an ETL process:
>>>>>>>>>>>>>>>>>>>> s/big-data/ensuring-consistency-when-using-amazon-s3-and-amazon-elastic-mapreduce-for-etl-workflows/
>>>>>>>>>>>>>>>>>>>> 2017-08-09 18:19 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>>>>>>>>>>>>>> Yes, it's empty. Also I see this message in the log:
>>>>>>>>>>>>>>>>>>>>> 2017-08-09 09:02:35,947 WARN  [Job 1e436685-7102-4621-a4cb-6472b866126d-7608] mapreduce.LoadIncrementalHFiles:234 : Skipping non-directory s3://joom.emr.fs/home/production/bi/kylin/kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/main_event_1_main/hfile/_SUCCESS
>>>>>>>>>>>>>>>>>>>>> 2017-08-09 09:02:36,009 WARN  [Job 1e436685-7102-4621-a4cb-6472b866126d-7608] mapreduce.LoadIncrementalHFiles:252 : Skipping non-file FileStatusExt{path=s3://joom.emr.fs/home/production/bi/kylin/kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/main_event_1_main/hfile/_temporary/1; isDirectory=true; modification_time=0; access_time=0; owner=; group=; permission=rwxrwxrwx; isSymlink=false}
>>>>>>>>>>>>>>>>>>>>> 2017-08-09 09:02:36,014 WARN  [Job 1e436685-7102-4621-a4cb-6472b866126d-7608] mapreduce.LoadIncrementalHFiles:422 : Bulk load operation did not find any files to load in directory s3://joom.emr.fs/home/production/bi/kylin/kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/main_event_1_main/hfile.  Does it contain files in subdirectories that correspond to column family names?
>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 9, 2017 at 1:15 PM, ShaoFeng Shi <> wrote:
>>>>>>>>>>>>>>>>>>>>>> The HFile will be moved to the HBase data folder when the
>>>>>>>>>>>>>>>>>>>>>> bulk load finishes. Did you check whether the HTable has data?
>>>>>>>>>>>>>>>>>>>>>> 2017-08-09 17:54 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>>>>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>>>>>>>>>>> I set kylin.hbase.cluster.fs to the S3 bucket where
>>>>>>>>>>>>>>>>>>>>>>> HBase lives.
>>>>>>>>>>>>>>>>>>>>>>> The "Convert Cuboid Data to HFile" step finished
>>>>>>>>>>>>>>>>>>>>>>> without errors. The statistics at the end of the job said that it had written
>>>>>>>>>>>>>>>>>>>>>>> lots of data to S3.
>>>>>>>>>>>>>>>>>>>>>>> But there are no HFiles in the kylin_metadata folder
>>>>>>>>>>>>>>>>>>>>>>> (kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/<table
>>>>>>>>>>>>>>>>>>>>>>> name>/hfile), only a _temporary folder and a _SUCCESS file.
>>>>>>>>>>>>>>>>>>>>>>> _temporary contains HFiles inside attempt folders.
>>>>>>>>>>>>>>>>>>>>>>> It looks like they were not copied from _temporary to the result dir. But
>>>>>>>>>>>>>>>>>>>>>>> there are no errors in either the Kylin log or the reducers' logs.
>>>>>>>>>>>>>>>>>>>>>>> Loading the empty HFiles then produces empty segments.
>>>>>>>>>>>>>>>>>>>>>>> Is that a bug, or am I doing something wrong?
>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>> Shaofeng Shi 史少锋

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by ShaoFeng Shi <>.
Setting hbase.rpc.timeout to a large value has a drawback, I think: it will
cause other RPC operations to wait longer. So the best way is to write the
HFile directly to the S3 bucket that HBase reads. I'm not sure whether HBase
still needs a move operation; if it does, that will become another problem.
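
To put numbers on the timeout trade-off, here is a small arithmetic sketch using only values quoted in this thread: the 60000 ms operationTimeout from the CallTimeoutException, the 1800000 ms hbase.rpc.timeout value, and the 1834922 ms processingtimems from the responseTooSlow log line. The helper names are hypothetical, and treating hbase.rpc.timeout as the only limit on the bulk-load call is an assumption made for illustration; the thread itself reports that raising it did not change the behavior in one setup.

```python
def ms_to_minutes(ms: int) -> float:
    """Convert milliseconds to minutes."""
    return ms / 60_000


def timeout_covers(processing_ms: int, timeout_ms: int) -> bool:
    """True if a call that takes processing_ms would finish within timeout_ms."""
    return processing_ms <= timeout_ms


# Values copied from the log lines and configuration quoted in the thread.
observed_ms = 1_834_922  # processingtimems of the BulkLoadHFile call
default_ms = 60_000      # operationTimeout reported in the CallTimeoutException
raised_ms = 1_800_000    # hbase.rpc.timeout value set in hbase-site.xml

if __name__ == "__main__":
    print("observed call:", round(ms_to_minutes(observed_ms), 1), "min")
    for label, timeout in (("default", default_ms), ("raised", raised_ms)):
        print(label, round(ms_to_minutes(timeout), 1), "min, covers call:",
              timeout_covers(observed_ms, timeout))
```

Under that assumption, the observed call (about 30.6 minutes) would exceed both the 1-minute and the 30-minute setting, which illustrates why sizing a timeout to the copy duration is fragile compared with avoiding the cross-filesystem copy.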


Best regards,

Shaofeng Shi 史少锋

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by Alexander Sterligov <>.
Just in case - I've changed it in /etc/hbase/conf/hbase-site.xml
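
For readers who want to confirm which value a given hbase-site.xml actually carries, the following is a minimal sketch that parses a Hadoop-style configuration document and looks up one property. It works on an inline sample string rather than the real /etc/hbase/conf/hbase-site.xml; the sample content and the read_property helper are assumptions for illustration, not output from any cluster in this thread.

```python
import xml.etree.ElementTree as ET

# Sample content in the usual Hadoop <configuration>/<property> layout.
HBASE_SITE_SAMPLE = """<configuration>
  <property>
    <name>hbase.rpc.timeout</name>
    <value>1800000</value>
  </property>
</configuration>"""


def read_property(xml_text: str, name: str):
    """Return the value of the named property as a string, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


timeout_ms = int(read_property(HBASE_SITE_SAMPLE, "hbase.rpc.timeout"))

if __name__ == "__main__":
    print("hbase.rpc.timeout =", timeout_ms, "ms")
```

To check a real file, the same function could be given the file's text, for example the result of reading the path mentioned above.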

On Thu, Sep 7, 2017 at 12:59 PM, ShaoFeng Shi <> wrote:

> Thanks; I also set a larger value for the rpc timeout, but it didn't
> change the behavior. I'm using EMR 5.5, not sure whether it is a bug.
> 2017-09-07 17:24 GMT+08:00 Alexander Sterligov <>:
>> Hi,
>> I've set a large HBase timeout:
>> <property>
>>   <name>hbase.rpc.timeout</name>
>>   <value>1800000</value>
>> </property>
>> On Thu, Sep 7, 2017 at 12:02 PM, ShaoFeng Shi <>
>> wrote:
>>> Hi Alexander,
>>> I encounter a problem when using HDFS for cubing building, and S3 for
>>> HBase on EMR. In the "Load HFile to HBase Table" step, Kylin got a failure
>>> with time out error:
>>> Thu Sep 07 15:33:27 GMT+08:00 2017, RpcRetryingCaller{globalStartTime=1504769048975,
>>> pause=100, retries=35}, Call to
>>> ip-10-0-0-28.ec2.internal/ failed on local exception:
>>> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=41,
>>> waitTime=60001, operationTimeout=60000
>>> In HBase region server, I saw HBase uploads the HFile to S3; Since the
>>> cube is a little big (13GB), it takes much longer time than usual. Kylin
>>> client closed the connection as it thought timeout:
>>> 2017-09-07 08:01:12,275 INFO  [RpcServer.FifoWFPBQ.default.handler=16,queue=1,port=16020]
>>> regionserver.HRegionFileSystem: Bulk-load file
>>> hdfs://ip-10-0-0-118.ec2.internal:8020/kylin/kylin_default_i
>>> nstance/kylin-cdcb5f57-2ea9-47d9-85db-7a6c7490cc55/test/hfil
>>> e/F1/a897b4d33ed648e6a5d0bfb05cffdfd6 is on different filesystem than
>>> the destination store. Copying file over to destination filesystem.
>>> 2017-09-07 08:01:23,919 INFO  [RpcServer.FifoWFPBQ.default.handler=22,queue=1,port=16020]
>>> s3.MultipartUploadManager: completed multipart upload of 8 parts 965420145
>>> bytes
>>> 2017-09-07 08:26:33,838 WARN  [RpcServer.FifoWFPBQ.default.handler=20,queue=2,port=16020]
>>> ipc.RpcServer: (responseTooSlow): {"call":"BulkLoadHFile(org.apa
>>> che.hadoop.hbase.protobuf.generated.ClientProtos$BulkLoadHFi
>>> leRequest)","starttimems":1504770958916,"responsesize":2,"me
>>> thod":"BulkLoadHFile","param":"TODO: class
>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$Bulk
>>> LoadHFileRequest","processingtimems":1834922,"client":"
>>> ","queuetimems":0,"class":"HRegionServer"}
>>> 2017-09-07 08:26:33,838 WARN  [RpcServer.FifoWFPBQ.default.handler=20,queue=2,port=16020]
>>> ipc.RpcServer: RpcServer.FifoWFPBQ.default.handler=20,queue=2,port=16020:
>>> caught a ClosedChannelException, this means that the server /
>>> was processing a request but the client went away. The
>>> error message was: null
>>> So I wonder how you bypassed this problem: did you set a very large
>>> timeout value for HBase, or is your cube not that big? Thanks.
>>> 2017-08-14 14:19 GMT+08:00 Alexander Sterligov <>:
>>>> Here is ticket for hfile on s3 issue -
>>>> ra/browse/KYLIN-2788
>>>> On Mon, Aug 14, 2017 at 9:17 AM, Alexander Sterligov <
>>>>> wrote:
>>>>> I forgot there was one more issue with s3 -
>>>>> Global dictionary in 2.0 doesn't work out of the box. I patched kylin
>>>>> as described in ticket.
>>>>> On Sun, Aug 13, 2017 at 4:24 AM, ShaoFeng Shi <>
>>>>> wrote:
>>>>>> Nice. The issue with writing HFiles to S3 needs more
>>>>>> investigation. Please open a Kylin JIRA for tracking. We will update it
>>>>>> if we find anything.
>>>>>> 2017-08-12 23:52 GMT+08:00 Alexander Sterligov <>:
>>>>>>> Query performance is pretty much the same as on the slides about Kylin. I
>>>>>>> have a high bucket cache hit rate (>90%), so data is almost always read from
>>>>>>> local disk. For some other use cases it might be different.
>>>>>>> On Aug 12, 2017 17:59, "ShaoFeng Shi" <
>>>>>>>> wrote:
>>>>>>> Cool; how about the query performance with data on s3?
>>>>>>> 2017-08-11 23:27 GMT+08:00 Alexander Sterligov <>
>>>>>>> :
>>>>>>>> Yes, that's the only one for now.
>>>>>>>> On Fri, Aug 11, 2017 at 6:23 PM, ShaoFeng Shi <
>>>>>>>>> wrote:
>>>>>>>>> No need to add them, I think, because I see they are already in the
>>>>>>>>> configuration of that step.
>>>>>>>>> Is this the only issue you see with Kylin on EMR+S3?
>>>>>>>>> [image: inline image 1]
>>>>>>>>> 2017-08-11 20:26 GMT+08:00 Alexander Sterligov <
>>>>>>>>>> What if we add the direct output settings to kylin_job_conf.xml
>>>>>>>>>> and kylin_job_conf_inmem.xml?
>>>>>>>>>> hbase.zookeeper.quorum, for example, doesn't work if it is not specified
>>>>>>>>>> in these configs.
>>>>>>>>>> On Fri, Aug 11, 2017 at 3:13 PM, ShaoFeng Shi <
>>>>>>>>>>> wrote:
>>>>>>>>>>> EMR enables direct output in mapred-site.xml, but in this
>>>>>>>>>>> step these settings don't seem to work (although the job's configuration
>>>>>>>>>>> shows they are there). I disabled direct output but the behavior did not
>>>>>>>>>>> change. I searched but found nothing. I need to drop the EMR cluster now,
>>>>>>>>>>> and may get back to it later.
>>>>>>>>>>> If you have any ideas or findings, please share them. We'd like to
>>>>>>>>>>> make Kylin support the cloud better.
>>>>>>>>>>> Thanks for your feedback!
>>>>>>>>>>> 2017-08-11 19:19 GMT+08:00 Alexander Sterligov <
>>>>>>>>>>>> Any ideas how to fix that?
>>>>>>>>>>>> On Fri, Aug 11, 2017 at 2:16 PM, ShaoFeng Shi <
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> I got the same problem as you:
>>>>>>>>>>>>> 2017-08-11 08:44:16,342 WARN  [Job
>>>>>>>>>>>>> 2c86b4b6-7639-4a97-ba63-63c9dca095f6-2255]
>>>>>>>>>>>>> mapreduce.LoadIncrementalHFiles:422 : Bulk load operation did
>>>>>>>>>>>>> not find any files to load in directory s3://privatekeybucket-anac5h41
>>>>>>>>>>>>> 523l/kylin/kylin_default_instance/kylin-2c86b4b6-7639-4a97-b
>>>>>>>>>>>>> a63-63c9dca095f6/kylin_sales_cube_clone3/hfile.  Does it
>>>>>>>>>>>>> contain files in subdirectories that correspond to column family names?
>>>>>>>>>>>>> In the S3 view, I see the files exist in the "_temporary" folder;
>>>>>>>>>>>>> it seems they were not moved to the target folder on completion. It seems
>>>>>>>>>>>>> EMR tries to write directly to the output path, but actually does not.
>>>>>>>>>>>>> 2017-08-11 16:34 GMT+08:00 Alexander Sterligov <
>>>>>>>>>>>>>> No, defaultFs is hdfs.
>>>>>>>>>>>>>> I’ve seen such behavior when I set the working dir to s3 but
>>>>>>>>>>>>>> didn’t set cluster-fs at all. Maybe you have a typo in the name of the
>>>>>>>>>>>>>> property. I used the old one, «kylin.hbase.cluster.fs».
>>>>>>>>>>>>>> When both working-dir and cluster-fs were set to s3, I got the
>>>>>>>>>>>>>> _temporary dir of the convert job on s3, but no hfiles. I also saw the
>>>>>>>>>>>>>> correct output path for the job in the log. I didn’t check whether the job
>>>>>>>>>>>>>> creates temporary files in s3 and then copies the results to hdfs, but I
>>>>>>>>>>>>>> doubt it does.
>>>>>>>>>>>>>> Do you see proper arguments for the step in the log?
>>>>>>>>>>>>>> On Aug 11, 2017, at 11:17, ShaoFeng Shi <
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> Hi Alexander,
>>>>>>>>>>>>>> That makes sense. Using S3 for Cube build and storage is
>>>>>>>>>>>>>> required for a cloud hadoop environment.
>>>>>>>>>>>>>> I tried to reproduce this problem. I created a EMR with S3 as
>>>>>>>>>>>>>> HBase storage, in, I set "kylin.env.hdfs-working-dir"
>>>>>>>>>>>>>> and "" to the S3 bucket. But
>>>>>>>>>>>>>> in the "Convert Cuboid Data to HFile" step, Kylin still
>>>>>>>>>>>>>> writes to local HDFS; Did you modify the core-site.xml to make S3 as the
>>>>>>>>>>>>>> default FS?
>>>>>>>>>>>>>> 2017-08-10 22:53 GMT+08:00 Alexander Sterligov <
>>>>>>>>>>>>>>> Yes, I worked around the problem that way and it works.
>>>>>>>>>>>>>>> One problem with this solution is that I have to use a pretty
>>>>>>>>>>>>>>> large hdfs and it's expensive. I also have to garbage collect it manually,
>>>>>>>>>>>>>>> because the data is not moved to s3, but copied. The Kylin cleanup job
>>>>>>>>>>>>>>> doesn't work for it, because the main metadata folder is on s3. So it would
>>>>>>>>>>>>>>> be really nice to put everything on s3.
>>>>>>>>>>>>>>> Another problem is that I had to raise the hbase rpc timeout,
>>>>>>>>>>>>>>> because bulk loading from hdfs takes long. That was not trivial. 3 minutes
>>>>>>>>>>>>>>> works well, but with the drawback that queries or metadata writes hang for
>>>>>>>>>>>>>>> 3 minutes if something bad happens. But that's a rare event.
>>>>>>>>>>>>>>> On Aug 10, 2017 17:42, "ShaoFeng Shi" <
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> How about leaving "kylin.hbase.cluster.fs" empty? This
>>>>>>>>>>>>>>>> property is for two-cluster deployment (one Hadoop for cube build, the
>>>>>>>>>>>>>>>> other for query).
>>>>>>>>>>>>>>>> When it is empty, the HFile will be written to the default fs
>>>>>>>>>>>>>>>> (HDFS in EMR), and then loaded to HBase. I'm not sure whether EMR HBase
>>>>>>>>>>>>>>>> (using S3 as storage) can bulk load files from HDFS or not. If it can, that
>>>>>>>>>>>>>>>> would be great as the write performance of HDFS would be better than S3.
>>>>>>>>>>>>>>>> 2017-08-10 22:29 GMT+08:00 Alexander Sterligov <
>>>>>>>>>>>>>>>>> I also thought about it, but no, it's not consistency.
>>>>>>>>>>>>>>>>> Consistent view is enabled. I use the same s3 for my own
>>>>>>>>>>>>>>>>> map-reduce jobs and it's ok.
>>>>>>>>>>>>>>>>> I also checked whether it lost consistency (emrfs diff). No
>>>>>>>>>>>>>>>>> problems.
>>>>>>>>>>>>>>>>> In case of s3 inconsistency, files disappear right after
>>>>>>>>>>>>>>>>> they are written and appear some time later. The hfiles didn't appear
>>>>>>>>>>>>>>>>> after a day, but _temporary is there.
>>>>>>>>>>>>>>>>> It's 100% reproducible, so I think I'll investigate this
>>>>>>>>>>>>>>>>> problem by running the conversion job manually.
>>>>>>>>>>>>>>>>> On Aug 10, 2017 17:18, "ShaoFeng Shi" <
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> Did you enable the Consistent View? This article explains
>>>>>>>>>>>>>>>>>> the challenge when using S3 directly for ETL process:
>>>>>>>>>>>>>>>>>> s/big-data/ensuring-consistenc
>>>>>>>>>>>>>>>>>> y-when-using-amazon-s3-and-ama
>>>>>>>>>>>>>>>>>> zon-elastic-mapreduce-for-etl-workflows/
>>>>>>>>>>>>>>>>>> 2017-08-09 18:19 GMT+08:00 Alexander Sterligov <
>>>>>>>>>>>>>>>>>>> Yes, it's empty. Also I see this message in the log:
>>>>>>>>>>>>>>>>>>> 2017-08-09 09:02:35,947 WARN  [Job
>>>>>>>>>>>>>>>>>>> 1e436685-7102-4621-a4cb-6472b866126d-7608]
>>>>>>>>>>>>>>>>>>> mapreduce.LoadIncrementalHFiles:234 : Skipping
>>>>>>>>>>>>>>>>>>> non-directory s3://joom.emr.fs/home/producti
>>>>>>>>>>>>>>>>>>> on/bi/kylin/kylin_metadata/kyl
>>>>>>>>>>>>>>>>>>> in-1e436685-7102-4621-a4cb-6472b866126d
>>>>>>>>>>>>>>>>>>> /main_event_1_main/hfile/_SUCCESS
>>>>>>>>>>>>>>>>>>> 2017-08-09 09:02:36,009 WARN  [Job
>>>>>>>>>>>>>>>>>>> 1e436685-7102-4621-a4cb-6472b866126d-7608]
>>>>>>>>>>>>>>>>>>> mapreduce.LoadIncrementalHFiles:252 : Skipping non-file
>>>>>>>>>>>>>>>>>>> FileStatusExt{path=s3://joom.e
>>>>>>>>>>>>>>>>>>> mr.fs/home/production/bi/kylin
>>>>>>>>>>>>>>>>>>> /kylin_metadata/kylin-1e436685
>>>>>>>>>>>>>>>>>>> -7102-4621-a4cb-6472b866126d/m
>>>>>>>>>>>>>>>>>>> ain_event_1_main/hfile/_temporary/1; isDirectory=true;
>>>>>>>>>>>>>>>>>>> modification_time=0; access_time=0; owner=; group=; permission=rwxrwxrwx;
>>>>>>>>>>>>>>>>>>> isSymlink=false}
>>>>>>>>>>>>>>>>>>> 2017-08-09 09:02:36,014 WARN  [Job
>>>>>>>>>>>>>>>>>>> 1e436685-7102-4621-a4cb-6472b866126d-7608]
>>>>>>>>>>>>>>>>>>> mapreduce.LoadIncrementalHFiles:422 : Bulk load
>>>>>>>>>>>>>>>>>>> operation did not find any files to load in directory
>>>>>>>>>>>>>>>>>>> s3://joom.emr.fs/home/producti
>>>>>>>>>>>>>>>>>>> on/bi/kylin/kylin_metadata/kyl
>>>>>>>>>>>>>>>>>>> in-1e436685-7102-4621-a4cb-647
>>>>>>>>>>>>>>>>>>> 2b866126d/main_event_1_main/hfile.  Does it contain
>>>>>>>>>>>>>>>>>>> files in subdirectories that correspond to column family names?
>>>>>>>>>>>>>>>>>>> On Wed, Aug 9, 2017 at 1:15 PM, ShaoFeng Shi <
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> The HFile will be moved to HBase data folder when bulk
>>>>>>>>>>>>>>>>>>>> load finished; Did you check whether the HTable has data?
>>>>>>>>>>>>>>>>>>>> 2017-08-09 17:54 GMT+08:00 Alexander Sterligov <
>>>>>>>>>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>>>>>>>>> I set kylin.hbase.cluster.fs to s3 bucket where hbase
>>>>>>>>>>>>>>>>>>>>> lives.
>>>>>>>>>>>>>>>>>>>>> Step "Convert Cuboid Data to HFile" finished without
>>>>>>>>>>>>>>>>>>>>> errors. Statistics at the end of the job said that it had written lots of
>>>>>>>>>>>>>>>>>>>>> data to s3.
>>>>>>>>>>>>>>>>>>>>> But there are no hfiles in the kylin_metadata folder
>>>>>>>>>>>>>>>>>>>>> (kylin_metadata /kylin-1e436685-7102-4621-a4cb-6472b866126d/<table
>>>>>>>>>>>>>>>>>>>>> name>/hfile), only a _temporary folder and a _SUCCESS file.
>>>>>>>>>>>>>>>>>>>>> _temporary contains hfiles inside attempt folders. It
>>>>>>>>>>>>>>>>>>>>> looks like they were not copied from _temporary to the result dir. But
>>>>>>>>>>>>>>>>>>>>> there are no errors in either the kylin log or the reducers' logs.
>>>>>>>>>>>>>>>>>>>>> Then loading empty hfiles produces empty segments.
>>>>>>>>>>>>>>>>>>>>> Is that a bug or am I doing something wrong?
> --
> Best regards,
> Shaofeng Shi 史少锋
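
The symptom quoted above (only _temporary and _SUCCESS under the hfile directory, and the bulk load warning about column family subdirectories) can be turned into a quick check before the load step. This is only a sketch over a plain list of relative paths; the function name classify_hfile_dir, the three labels, and the attempt folder name in the example are made up for illustration, and obtaining the listing from S3 or HDFS is left out.

```python
def classify_hfile_dir(relative_paths):
    """Classify a listing of paths relative to the .../hfile output directory.

    Returns "loadable" if some file sits under a column family directory,
    "uncommitted" if files exist only under _temporary, and "empty" otherwise.
    """
    has_family_files = False
    has_temporary_files = False
    for path in relative_paths:
        parts = path.strip("/").split("/")
        if len(parts) < 2:
            continue  # top-level entries such as _SUCCESS carry no data
        if parts[0] == "_temporary":
            has_temporary_files = True
        elif not parts[0].startswith(("_", ".")):
            has_family_files = True  # e.g. F1/<hfile name>
    if has_family_files:
        return "loadable"
    if has_temporary_files:
        return "uncommitted"
    return "empty"


# Hypothetical listing shaped like the one described in this thread.
listing = ["_SUCCESS", "_temporary/1/attempt_example/F1/hfile_example"]
print(classify_hfile_dir(listing))
```

A result of "uncommitted" corresponds to the case reported here: the reducers wrote HFiles, but nothing was moved out of _temporary.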

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by ShaoFeng Shi <>.
Thanks; I also set a larger value for the rpc timeout, but it didn't change
the behavior. I'm using EMR 5.5, not sure whether it is a bug.
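
For reference, the figures in the log excerpts quoted further down in this message can be compared directly. The sketch below only does that arithmetic: the two strings are copied from the quoted logs, while the helper names first_int and covers are made up for illustration. It does not establish why raising the timeout had no effect.

```python
import re

# Copied from the client error and the region server warning quoted in this thread.
CLIENT_ERROR = "Call id=41, waitTime=60001, operationTimeout=60000"
SERVER_WARN = '"processingtimems":1834922,"client":"'

# Value suggested in the thread for hbase.rpc.timeout, in milliseconds.
SUGGESTED_TIMEOUT_MS = 1800000


def first_int(pattern, text):
    """Return the first captured integer for the pattern, or None."""
    match = re.search(pattern, text)
    return int(match.group(1)) if match else None


def covers(timeout_ms, processing_ms):
    """True if a call taking processing_ms would finish within timeout_ms."""
    return timeout_ms >= processing_ms


operation_timeout_ms = first_int(r"operationTimeout=(\d+)", CLIENT_ERROR)
processing_ms = first_int(r'"processingtimems":(\d+)', SERVER_WARN)

print("server-side BulkLoadHFile minutes:", round(processing_ms / 60000, 1))
print("60 s timeout covers it:", covers(operation_timeout_ms, processing_ms))
print("30 min timeout covers it:", covers(SUGGESTED_TIMEOUT_MS, processing_ms))
```

With these numbers the BulkLoadHFile call ran for roughly 30.6 minutes on the server, which is slightly longer than a 1800000 ms timeout, so that particular load would not have fit even under the larger value.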

2017-09-07 17:24 GMT+08:00 Alexander Sterligov <>:

> Hi,
> I've set large hbase timeout:
> <property>
>     <name>hbase.rpc.timeout</name>
>     <value>1800000</value>
>   </property>
> On Thu, Sep 7, 2017 at 12:02 PM, ShaoFeng Shi <>
> wrote:
>> Hi Alexander,
>> I encountered a problem when using HDFS for cube building and S3 for
>> HBase on EMR. In the "Load HFile to HBase Table" step, Kylin failed
>> with a timeout error:
>> Thu Sep 07 15:33:27 GMT+08:00 2017, RpcRetryingCaller{globalStartTime=1504769048975,
>> pause=100, retries=35}, Call to
>> ip-10-0-0-28.ec2.internal/ failed on local exception:
>> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=41,
>> waitTime=60001, operationTimeout=60000
>> In the HBase region server log, I saw HBase uploading the HFile to S3. Since
>> the cube is fairly big (13GB), this takes much longer than usual. The Kylin
>> client closed the connection because it considered the call timed out:
>> 2017-09-07 08:01:12,275 INFO  [RpcServer.FifoWFPBQ.default.handler=16,queue=1,port=16020]
>> regionserver.HRegionFileSystem: Bulk-load file
>> hdfs://ip-10-0-0-118.ec2.internal:8020/kylin/kylin_default_i
>> nstance/kylin-cdcb5f57-2ea9-47d9-85db-7a6c7490cc55/test/hfil
>> e/F1/a897b4d33ed648e6a5d0bfb05cffdfd6 is on different filesystem than
>> the destination store. Copying file over to destination filesystem.
>> 2017-09-07 08:01:23,919 INFO  [RpcServer.FifoWFPBQ.default.handler=22,queue=1,port=16020]
>> s3.MultipartUploadManager: completed multipart upload of 8 parts 965420145
>> bytes
>> 2017-09-07 08:26:33,838 WARN  [RpcServer.FifoWFPBQ.default.handler=20,queue=2,port=16020]
>> ipc.RpcServer: (responseTooSlow): {"call":"BulkLoadHFile(org.apa
>> che.hadoop.hbase.protobuf.generated.ClientProtos$BulkLoadHFi
>> leRequest)","starttimems":1504770958916,"responsesize":2,"
>> method":"BulkLoadHFile","param":"TODO: class
>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$Bulk
>> LoadHFileRequest","processingtimems":1834922,"client":"
>> ","queuetimems":0,"class":"HRegionServer"}
>> 2017-09-07 08:26:33,838 WARN  [RpcServer.FifoWFPBQ.default.handler=20,queue=2,port=16020]
>> ipc.RpcServer: RpcServer.FifoWFPBQ.default.handler=20,queue=2,port=16020:
>> caught a ClosedChannelException, this means that the server /
>> was processing a request but the client went away. The
>> error message was: null
>> So I wonder how you bypassed this problem: did you set a very large
>> timeout value for HBase, or is your cube not that big? Thanks.
>> 2017-08-14 14:19 GMT+08:00 Alexander Sterligov <>:
>>> Here is ticket for hfile on s3 issue -
>>> ra/browse/KYLIN-2788
>>> On Mon, Aug 14, 2017 at 9:17 AM, Alexander Sterligov <
>>>> wrote:
>>>> I forgot there was one more issue with s3 -
>>>> Global dictionary in 2.0 doesn't work out of the box. I patched kylin
>>>> as described in ticket.
>>>> On Sun, Aug 13, 2017 at 4:24 AM, ShaoFeng Shi <>
>>>> wrote:
>>>>> Nice. The issue with writing HFiles to S3 needs more
>>>>> investigation. Please open a Kylin JIRA for tracking. We will update it
>>>>> if we find anything.
>>>>> 2017-08-12 23:52 GMT+08:00 Alexander Sterligov <>:
>>>>>> Query performance is pretty much the same as on the slides about Kylin. I
>>>>>> have a high bucket cache hit rate (>90%), so data is almost always read from
>>>>>> local disk. For some other use cases it might be different.
>>>>>> On Aug 12, 2017 17:59, "ShaoFeng Shi" <
>>>>>>> wrote:
>>>>>> Cool; how about the query performance with data on s3?
>>>>>> 2017-08-11 23:27 GMT+08:00 Alexander Sterligov <>:
>>>>>>> Yes, that's the only one for now.
>>>>>>> On Fri, Aug 11, 2017 at 6:23 PM, ShaoFeng Shi <
>>>>>>>> wrote:
>>>>>>>> No need to add them, I think, because I see they are already in the
>>>>>>>> configuration of that step.
>>>>>>>> Is this the only issue you see with Kylin on EMR+S3?
>>>>>>>> [image: inline image 1]
>>>>>>>> 2017-08-11 20:26 GMT+08:00 Alexander Sterligov <
>>>>>>>> >:
>>>>>>>>> What if we add the direct output settings to kylin_job_conf.xml
>>>>>>>>> and kylin_job_conf_inmem.xml?
>>>>>>>>> hbase.zookeeper.quorum, for example, doesn't work if it is not specified
>>>>>>>>> in these configs.
>>>>>>>>> On Fri, Aug 11, 2017 at 3:13 PM, ShaoFeng Shi <
>>>>>>>>>> wrote:
>>>>>>>>>> EMR enables direct output in mapred-site.xml, but in this
>>>>>>>>>> step these settings don't seem to work (although the job's configuration
>>>>>>>>>> shows they are there). I disabled direct output but the behavior did not
>>>>>>>>>> change. I searched but found nothing. I need to drop the EMR cluster now,
>>>>>>>>>> and may get back to it later.
>>>>>>>>>> If you have any ideas or findings, please share them. We'd like to
>>>>>>>>>> make Kylin support the cloud better.
>>>>>>>>>> Thanks for your feedback!
>>>>>>>>>> 2017-08-11 19:19 GMT+08:00 Alexander Sterligov <
>>>>>>>>>>> Any ideas how to fix that?
>>>>>>>>>>> On Fri, Aug 11, 2017 at 2:16 PM, ShaoFeng Shi <
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> I got the same problem as you:
>>>>>>>>>>>> 2017-08-11 08:44:16,342 WARN  [Job
>>>>>>>>>>>> 2c86b4b6-7639-4a97-ba63-63c9dca095f6-2255]
>>>>>>>>>>>> mapreduce.LoadIncrementalHFiles:422 : Bulk load operation did
>>>>>>>>>>>> not find any files to load in directory s3://privatekeybucket-anac5h41
>>>>>>>>>>>> 523l/kylin/kylin_default_instance/kylin-2c86b4b6-7639-4a97-b
>>>>>>>>>>>> a63-63c9dca095f6/kylin_sales_cube_clone3/hfile.  Does it
>>>>>>>>>>>> contain files in subdirectories that correspond to column family names?
>>>>>>>>>>>> In the S3 view, I see the files exist in the "_temporary" folder;
>>>>>>>>>>>> it seems they were not moved to the target folder on completion. It seems
>>>>>>>>>>>> EMR tries to write directly to the output path, but actually does not.
>>>>>>>>>>>> 2017-08-11 16:34 GMT+08:00 Alexander Sterligov <
>>>>>>>>>>>>> No, defaultFs is hdfs.
>>>>>>>>>>>>> I’ve seen such behavior when I set the working dir to s3 but didn’t
>>>>>>>>>>>>> set cluster-fs at all. Maybe you have a typo in the name of the property.
>>>>>>>>>>>>> I used the old one, «kylin.hbase.cluster.fs».
>>>>>>>>>>>>> When both working-dir and cluster-fs were set to s3, I got the
>>>>>>>>>>>>> _temporary dir of the convert job on s3, but no hfiles. I also saw the
>>>>>>>>>>>>> correct output path for the job in the log. I didn’t check whether the job
>>>>>>>>>>>>> creates temporary files in s3 and then copies the results to hdfs, but I
>>>>>>>>>>>>> doubt it does.
>>>>>>>>>>>>> Do you see proper arguments for the step in the log?
>>>>>>>>>>>>> On Aug 11, 2017, at 11:17, ShaoFeng Shi <>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> Hi Alexander,
>>>>>>>>>>>>> That makes sense. Using S3 for Cube build and storage is
>>>>>>>>>>>>> required for a cloud hadoop environment.
>>>>>>>>>>>>> I tried to reproduce this problem. I created a EMR with S3 as
>>>>>>>>>>>>> HBase storage, in, I set "kylin.env.hdfs-working-dir"
>>>>>>>>>>>>> and "" to the S3 bucket. But in
>>>>>>>>>>>>> the "Convert Cuboid Data to HFile" step, Kylin still writes
>>>>>>>>>>>>> to local HDFS; Did you modify the core-site.xml to make S3 as the default
>>>>>>>>>>>>> FS?
>>>>>>>>>>>>> 2017-08-10 22:53 GMT+08:00 Alexander Sterligov <
>>>>>>>>>>>>>> Yes, I worked around the problem that way and it works.
>>>>>>>>>>>>>> One problem with this solution is that I have to use a pretty
>>>>>>>>>>>>>> large hdfs and it's expensive. I also have to garbage collect it manually,
>>>>>>>>>>>>>> because the data is not moved to s3, but copied. The Kylin cleanup job
>>>>>>>>>>>>>> doesn't work for it, because the main metadata folder is on s3. So it would
>>>>>>>>>>>>>> be really nice to put everything on s3.
>>>>>>>>>>>>>> Another problem is that I had to raise the hbase rpc timeout,
>>>>>>>>>>>>>> because bulk loading from hdfs takes long. That was not trivial. 3 minutes
>>>>>>>>>>>>>> works well, but with the drawback that queries or metadata writes hang for
>>>>>>>>>>>>>> 3 minutes if something bad happens. But that's a rare event.
>>>>>>>>>>>>>> On Aug 10, 2017 17:42, "ShaoFeng Shi" <
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> How about leaving "kylin.hbase.cluster.fs" empty? This
>>>>>>>>>>>>>>> property is for two-cluster deployment (one Hadoop for cube build, the
>>>>>>>>>>>>>>> other for query).
>>>>>>>>>>>>>>> When it is empty, the HFile will be written to the default fs (HDFS
>>>>>>>>>>>>>>> in EMR), and then loaded to HBase. I'm not sure whether EMR HBase (using S3
>>>>>>>>>>>>>>> as storage) can bulk load files from HDFS or not. If it can, that would be
>>>>>>>>>>>>>>> great as the write performance of HDFS would be better than S3.
>>>>>>>>>>>>>>> 2017-08-10 22:29 GMT+08:00 Alexander Sterligov <
>>>>>>>>>>>>>>>> I also thought about it, but no, it's not consistency.
>>>>>>>>>>>>>>>> Consistent view is enabled. I use the same s3 for my own
>>>>>>>>>>>>>>>> map-reduce jobs and it's ok.
>>>>>>>>>>>>>>>> I also checked whether it lost consistency (emrfs diff). No
>>>>>>>>>>>>>>>> problems.
>>>>>>>>>>>>>>>> In case of s3 inconsistency, files disappear right after
>>>>>>>>>>>>>>>> they are written and appear some time later. The hfiles didn't appear
>>>>>>>>>>>>>>>> after a day, but _temporary is there.
>>>>>>>>>>>>>>>> It's 100% reproducible, so I think I'll investigate this
>>>>>>>>>>>>>>>> problem by running the conversion job manually.
>>>>>>>>>>>>>>>> On Aug 10, 2017 17:18, "ShaoFeng Shi" <
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> Did you enable the Consistent View? This article explains
>>>>>>>>>>>>>>>>> the challenge when using S3 directly for ETL process:
>>>>>>>>>>>>>>>>> s/big-data/ensuring-consistenc
>>>>>>>>>>>>>>>>> y-when-using-amazon-s3-and-ama
>>>>>>>>>>>>>>>>> zon-elastic-mapreduce-for-etl-workflows/
>>>>>>>>>>>>>>>>> 2017-08-09 18:19 GMT+08:00 Alexander Sterligov <
>>>>>>>>>>>>>>>>>> Yes, it's empty. Also I see this message in the log:
>>>>>>>>>>>>>>>>>> 2017-08-09 09:02:35,947 WARN  [Job
>>>>>>>>>>>>>>>>>> 1e436685-7102-4621-a4cb-6472b866126d-7608]
>>>>>>>>>>>>>>>>>> mapreduce.LoadIncrementalHFiles:234 : Skipping
>>>>>>>>>>>>>>>>>> non-directory s3://joom.emr.fs/home/producti
>>>>>>>>>>>>>>>>>> on/bi/kylin/kylin_metadata/kyl
>>>>>>>>>>>>>>>>>> in-1e436685-7102-4621-a4cb-6472b866126d
>>>>>>>>>>>>>>>>>> /main_event_1_main/hfile/_SUCCESS
>>>>>>>>>>>>>>>>>> 2017-08-09 09:02:36,009 WARN  [Job
>>>>>>>>>>>>>>>>>> 1e436685-7102-4621-a4cb-6472b866126d-7608]
>>>>>>>>>>>>>>>>>> mapreduce.LoadIncrementalHFiles:252 : Skipping non-file
>>>>>>>>>>>>>>>>>> FileStatusExt{path=s3://joom.e
>>>>>>>>>>>>>>>>>> mr.fs/home/production/bi/kylin
>>>>>>>>>>>>>>>>>> /kylin_metadata/kylin-1e436685
>>>>>>>>>>>>>>>>>> -7102-4621-a4cb-6472b866126d/m
>>>>>>>>>>>>>>>>>> ain_event_1_main/hfile/_temporary/1; isDirectory=true;
>>>>>>>>>>>>>>>>>> modification_time=0; access_time=0; owner=; group=; permission=rwxrwxrwx;
>>>>>>>>>>>>>>>>>> isSymlink=false}
>>>>>>>>>>>>>>>>>> 2017-08-09 09:02:36,014 WARN  [Job
>>>>>>>>>>>>>>>>>> 1e436685-7102-4621-a4cb-6472b866126d-7608]
>>>>>>>>>>>>>>>>>> mapreduce.LoadIncrementalHFiles:422 : Bulk load
>>>>>>>>>>>>>>>>>> operation did not find any files to load in directory
>>>>>>>>>>>>>>>>>> s3://joom.emr.fs/home/producti
>>>>>>>>>>>>>>>>>> on/bi/kylin/kylin_metadata/kyl
>>>>>>>>>>>>>>>>>> in-1e436685-7102-4621-a4cb-647
>>>>>>>>>>>>>>>>>> 2b866126d/main_event_1_main/hfile.  Does it contain
>>>>>>>>>>>>>>>>>> files in subdirectories that correspond to column family names?
>>>>>>>>>>>>>>>>>> On Wed, Aug 9, 2017 at 1:15 PM, ShaoFeng Shi <
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> The HFile will be moved to HBase data folder when bulk
>>>>>>>>>>>>>>>>>>> load finished; Did you check whether the HTable has data?
>>>>>>>>>>>>>>>>>>> 2017-08-09 17:54 GMT+08:00 Alexander Sterligov <
>>>>>>>>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>>>>>>>> I set kylin.hbase.cluster.fs to s3 bucket where hbase
>>>>>>>>>>>>>>>>>>>> lives.
>>>>>>>>>>>>>>>>>>>> Step "Convert Cuboid Data to HFile" finished without
>>>>>>>>>>>>>>>>>>>> errors. Statistics at the end of the job said that it had written lots of
>>>>>>>>>>>>>>>>>>>> data to s3.
>>>>>>>>>>>>>>>>>>>> But there are no hfiles in the kylin_metadata folder
>>>>>>>>>>>>>>>>>>>> (kylin_metadata /kylin-1e436685-7102-4621-a4cb-6472b866126d/<table
>>>>>>>>>>>>>>>>>>>> name>/hfile), only a _temporary folder and a _SUCCESS file.
>>>>>>>>>>>>>>>>>>>> _temporary contains hfiles inside attempt folders. It
>>>>>>>>>>>>>>>>>>>> looks like they were not copied from _temporary to the result dir. But
>>>>>>>>>>>>>>>>>>>> there are no errors in either the kylin log or the reducers' logs.
>>>>>>>>>>>>>>>>>>>> Then loading empty hfiles produces empty segments.
>>>>>>>>>>>>>>>>>>>> Is that a bug or am I doing something wrong?

Best regards,

Shaofeng Shi 史少锋

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by Alexander Sterligov <>.

I've set a large hbase timeout:
<property>
    <name>hbase.rpc.timeout</name>
    <value>1800000</value>
</property>

On Thu, Sep 7, 2017 at 12:02 PM, ShaoFeng Shi <> wrote:

> Hi Alexander,
> I encountered a problem when using HDFS for cube building and S3 for
> HBase on EMR. In the "Load HFile to HBase Table" step, Kylin failed
> with a timeout error:
> Thu Sep 07 15:33:27 GMT+08:00 2017, RpcRetryingCaller{globalStartTime=1504769048975,
> pause=100, retries=35}, Call to
> ip-10-0-0-28.ec2.internal/ failed on local exception:
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=41,
> waitTime=60001, operationTimeout=60000
> In the HBase region server log, I saw HBase uploading the HFile to S3. Since
> the cube is fairly big (13GB), this takes much longer than usual. The Kylin
> client closed the connection because it considered the call timed out:
> 2017-09-07 08:01:12,275 INFO  [RpcServer.FifoWFPBQ.default.handler=16,queue=1,port=16020]
> regionserver.HRegionFileSystem: Bulk-load file hdfs://ip-10-0-0-118.ec2.
> internal:8020/kylin/kylin_default_instance/kylin-cdcb5f57-2ea9-47d9-85db-
> 7a6c7490cc55/test/hfile/F1/a897b4d33ed648e6a5d0bfb05cffdfd6 is on
> different filesystem than the destination store. Copying file over to
> destination filesystem.
> 2017-09-07 08:01:23,919 INFO  [RpcServer.FifoWFPBQ.default.handler=22,queue=1,port=16020]
> s3.MultipartUploadManager: completed multipart upload of 8 parts 965420145
> bytes
> 2017-09-07 08:26:33,838 WARN  [RpcServer.FifoWFPBQ.default.handler=20,queue=2,port=16020]
> ipc.RpcServer: (responseTooSlow): {"call":"BulkLoadHFile(org.
> apache.hadoop.hbase.protobuf.generated.ClientProtos$
> BulkLoadHFileRequest)","starttimems":1504770958916,"
> responsesize":2,"method":"BulkLoadHFile","param":"TODO: class
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$
> BulkLoadHFileRequest","processingtimems":1834922,"client":"
> 2017-09-07 08:26:33,838 WARN  [RpcServer.FifoWFPBQ.default.handler=20,queue=2,port=16020]
> ipc.RpcServer: RpcServer.FifoWFPBQ.default.handler=20,queue=2,port=16020:
> caught a ClosedChannelException, this means that the server /
> was processing a request but the client went away. The
> error message was: null
> So I wonder how you bypassed this problem: did you set a very large
> timeout value for HBase, or is your cube not that big? Thanks.
> 2017-08-14 14:19 GMT+08:00 Alexander Sterligov <>:
>> Here is ticket for hfile on s3 issue -
>> ra/browse/KYLIN-2788
>> On Mon, Aug 14, 2017 at 9:17 AM, Alexander Sterligov <
>> > wrote:
>>> I forgot there was one more issue with s3 -
>>> ra/browse/KYLIN-2740.
>>> Global dictionary in 2.0 doesn't work out of the box. I patched kylin as
>>> described in ticket.
>>> On Sun, Aug 13, 2017 at 4:24 AM, ShaoFeng Shi <>
>>> wrote:
>>>> Nice. The issue with writing HFiles to S3 needs more
>>>> investigation. Please open a Kylin JIRA for tracking. We will update it
>>>> if we find anything.
>>>> 2017-08-12 23:52 GMT+08:00 Alexander Sterligov <>:
>>>>> Query performance is pretty much the same as on the slides about Kylin. I
>>>>> have a high bucket cache hit rate (>90%), so data is almost always read from
>>>>> local disk. For some other use cases it might be different.
>>>>> On Aug 12, 2017 17:59, "ShaoFeng Shi" <
>>>>>> wrote:
>>>>> Cool; how about the query performance with data on s3?
>>>>> 2017-08-11 23:27 GMT+08:00 Alexander Sterligov <>:
>>>>>> Yes, that's the only one for now.
>>>>>> On Fri, Aug 11, 2017 at 6:23 PM, ShaoFeng Shi <
>>>>>> > wrote:
>>>>>>> No need to add them, I think, because I see they are already in the
>>>>>>> configuration of that step.
>>>>>>> Is this the only issue you see with Kylin on EMR+S3?
>>>>>>> [image: inline image 1]
>>>>>>> 2017-08-11 20:26 GMT+08:00 Alexander Sterligov <>
>>>>>>> :
>>>>>>>> What if we add the direct output settings to kylin_job_conf.xml
>>>>>>>> and kylin_job_conf_inmem.xml?
>>>>>>>> hbase.zookeeper.quorum, for example, doesn't work if it is not specified
>>>>>>>> in these configs.
>>>>>>>> On Fri, Aug 11, 2017 at 3:13 PM, ShaoFeng Shi <
>>>>>>>>> wrote:
>>>>>>>>> EMR enables direct output in mapred-site.xml, but in this
>>>>>>>>> step these settings don't seem to work (although the job's configuration
>>>>>>>>> shows they are there). I disabled direct output but the behavior did not
>>>>>>>>> change. I searched but found nothing. I need to drop the EMR cluster now,
>>>>>>>>> and may get back to it later.
>>>>>>>>> If you have any ideas or findings, please share them. We'd like to
>>>>>>>>> make Kylin support the cloud better.
>>>>>>>>> Thanks for your feedback!
>>>>>>>>> 2017-08-11 19:19 GMT+08:00 Alexander Sterligov <
>>>>>>>>>> Any ideas how to fix that?
>>>>>>>>>> On Fri, Aug 11, 2017 at 2:16 PM, ShaoFeng Shi <
>>>>>>>>>>> wrote:
>>>>>>>>>>> I got the same problem as you:
>>>>>>>>>>> 2017-08-11 08:44:16,342 WARN  [Job 2c86b4b6-7639-4a97-ba63-63c9dca095f6-2255]
>>>>>>>>>>> mapreduce.LoadIncrementalHFiles:422 : Bulk load operation did
>>>>>>>>>>> not find any files to load in directory s3://privatekeybucket-anac5h41
>>>>>>>>>>> 523l/kylin/kylin_default_instance/kylin-2c86b4b6-7639-4a97-b
>>>>>>>>>>> a63-63c9dca095f6/kylin_sales_cube_clone3/hfile.  Does it
>>>>>>>>>>> contain files in subdirectories that correspond to column family names?
>>>>>>>>>>> In the S3 view, I see the files exist in the "_temporary" folder;
>>>>>>>>>>> it seems they were not moved to the target folder on completion. It seems
>>>>>>>>>>> EMR tries to write directly to the output path, but actually does not.
>>>>>>>>>>> 2017-08-11 16:34 GMT+08:00 Alexander Sterligov <
>>>>>>>>>>>> No, defaultFs is hdfs.
>>>>>>>>>>>> I’ve seen such behavior when I set the working dir to s3 but didn’t
>>>>>>>>>>>> set cluster-fs at all. Maybe you have a typo in the name of the property.
>>>>>>>>>>>> I used the old one, «kylin.hbase.cluster.fs».
>>>>>>>>>>>> When both working-dir and cluster-fs were set to s3, I got the
>>>>>>>>>>>> _temporary dir of the convert job on s3, but no hfiles. I also saw the
>>>>>>>>>>>> correct output path for the job in the log. I didn’t check whether the job
>>>>>>>>>>>> creates temporary files in s3 and then copies the results to hdfs, but I
>>>>>>>>>>>> doubt it does.
>>>>>>>>>>>> Do you see proper arguments for the step in the log?
>>>>>>>>>>>> On Aug 11, 2017, at 11:17, ShaoFeng Shi <>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> Hi Alexander,
>>>>>>>>>>>> That makes sense. Using S3 for Cube build and storage is
>>>>>>>>>>>> required for a cloud hadoop environment.
>>>>>>>>>>>> I tried to reproduce this problem. I created an EMR cluster with S3 as
>>>>>>>>>>>> HBase storage, in, I set "kylin.env.hdfs-working-dir"
>>>>>>>>>>>> and "" to the S3 bucket. But in
>>>>>>>>>>>> the "Convert Cuboid Data to HFile" step, Kylin still writes to
>>>>>>>>>>>> local HDFS. Did you modify core-site.xml to make S3 the default FS?
>>>>>>>>>>>> 2017-08-10 22:53 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>>>>>> Yes, I worked around this problem that way and it works.
>>>>>>>>>>>>> One problem with this solution is that I have to use a pretty
>>>>>>>>>>>>> large hdfs, and it's expensive. I also have to garbage collect it
>>>>>>>>>>>>> manually, because the data is not moved to s3 but copied. The Kylin cleanup
>>>>>>>>>>>>> job doesn't work for it, because the main metadata folder is on s3. So it
>>>>>>>>>>>>> would be really nice to put everything on s3.
>>>>>>>>>>>>> Another problem is that I had to raise the hbase rpc timeout,
>>>>>>>>>>>>> because bulk loading from hdfs takes long. That was not trivial. 3 minutes
>>>>>>>>>>>>> works well, but with the drawback that queries or metadata writes hang for 3
>>>>>>>>>>>>> minutes if something bad happens. But that's a rare event.
>>>>>>>>>>>>> On Aug 10, 2017, 17:42, "ShaoFeng Shi" <> wrote:
>>>>>>>>>>>>> How about leaving "kylin.hbase.cluster.fs" empty? This
>>>>>>>>>>>>>> property is for two-cluster deployment (one Hadoop for cube build, the
>>>>>>>>>>>>>> other for query);
>>>>>>>>>>>>>> When it is empty, the HFile will be written to the default fs (HDFS
>>>>>>>>>>>>>> in EMR), and then loaded into HBase. I'm not sure whether EMR HBase (using S3
>>>>>>>>>>>>>> as storage) can bulk load files from HDFS or not. If it can, that would be
>>>>>>>>>>>>>> great, as the write performance of HDFS would be better than S3.
>>>>>>>>>>>>>> 2017-08-10 22:29 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>>>>>>>> I also thought about it, but no, it's not consistency.
>>>>>>>>>>>>>>> Consistent view is enabled. I use the same s3 for my own
>>>>>>>>>>>>>>> map-reduce jobs and it's ok.
>>>>>>>>>>>>>>> I also checked whether it lost consistency (emrfs diff). No
>>>>>>>>>>>>>>> problems.
>>>>>>>>>>>>>>> In case of inconsistency, s3 files disappear right after
>>>>>>>>>>>>>>> they are written and appear some time later. The hfiles didn't appear after a
>>>>>>>>>>>>>>> day, but _temporary is there.
>>>>>>>>>>>>>>> It's 100% reproducible. I think I'll investigate this
>>>>>>>>>>>>>>> problem by running the conversion job manually.
>>>>>>>>>>>>>>> On Aug 10, 2017, 17:18, "ShaoFeng Shi" <> wrote:
>>>>>>>>>>>>>>> Did you enable the Consistent View? This article explains
>>>>>>>>>>>>>>>> the challenges of using S3 directly for ETL processes:
>>>>>>>>>>>>>>>> s/big-data/ensuring-consistency-when-using-amazon-s3-and-amazon-elastic-mapreduce-for-etl-workflows/
>>>>>>>>>>>>>>>> 2017-08-09 18:19 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>>>>>>>>>> Yes, it's empty. Also I see this message in the log:
>>>>>>>>>>>>>>>>> 2017-08-09 09:02:35,947 WARN  [Job 1e436685-7102-4621-a4cb-6472b866126d-7608]
>>>>>>>>>>>>>>>>> mapreduce.LoadIncrementalHFiles:234 : Skipping non-directory
>>>>>>>>>>>>>>>>> s3://joom.emr.fs/home/production/bi/kylin/kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/main_event_1_main/hfile/_SUCCESS
>>>>>>>>>>>>>>>>> 2017-08-09 09:02:36,009 WARN  [Job 1e436685-7102-4621-a4cb-6472b866126d-7608]
>>>>>>>>>>>>>>>>> mapreduce.LoadIncrementalHFiles:252 : Skipping non-file
>>>>>>>>>>>>>>>>> FileStatusExt{path=s3://joom.emr.fs/home/production/bi/kylin/kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/main_event_1_main/hfile/_temporary/1;
>>>>>>>>>>>>>>>>> isDirectory=true; modification_time=0; access_time=0; owner=; group=;
>>>>>>>>>>>>>>>>> permission=rwxrwxrwx; isSymlink=false}
>>>>>>>>>>>>>>>>> 2017-08-09 09:02:36,014 WARN  [Job 1e436685-7102-4621-a4cb-6472b866126d-7608]
>>>>>>>>>>>>>>>>> mapreduce.LoadIncrementalHFiles:422 : Bulk load operation
>>>>>>>>>>>>>>>>> did not find any files to load in directory
>>>>>>>>>>>>>>>>> s3://joom.emr.fs/home/production/bi/kylin/kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/main_event_1_main/hfile.
>>>>>>>>>>>>>>>>> Does it contain files in subdirectories that correspond to column family names?
>>>>>>>>>>>>>>>>> On Wed, Aug 9, 2017 at 1:15 PM, ShaoFeng Shi <> wrote:
>>>>>>>>>>>>>>>>>> The HFile will be moved to the HBase data folder when the bulk
>>>>>>>>>>>>>>>>>> load finishes. Did you check whether the HTable has data?
>>>>>>>>>>>>>>>>>> 2017-08-09 17:54 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>>>>>>> I set kylin.hbase.cluster.fs to the s3 bucket where hbase
>>>>>>>>>>>>>>>>>>> lives.
>>>>>>>>>>>>>>>>>>> Step "Convert Cuboid Data to HFile" finished without
>>>>>>>>>>>>>>>>>>> errors. Statistics at the end of the job said that it had written lots of
>>>>>>>>>>>>>>>>>>> data to s3.
>>>>>>>>>>>>>>>>>>> But there are no hfiles in the kylin_metadata folder
>>>>>>>>>>>>>>>>>>> (kylin_metadata /kylin-1e436685-7102-4621-a4cb-6472b866126d/<table
>>>>>>>>>>>>>>>>>>> name>/hfile), only a _temporary folder and a _SUCCESS file.
>>>>>>>>>>>>>>>>>>> _temporary contains hfiles inside attempt folders. It
>>>>>>>>>>>>>>>>>>> looks like they were not copied from _temporary to the result dir. But there
>>>>>>>>>>>>>>>>>>> are no errors in either the kylin log or the reducers' logs.
>>>>>>>>>>>>>>>>>>> Then loading empty hfiles produces empty segments.
>>>>>>>>>>>>>>>>>>> Is that a bug or am I doing something wrong?
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>> Shaofeng Shi 史少锋

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by ShaoFeng Shi <>.
Hi Alexander,

I encountered a problem when using HDFS for cube building and S3 for HBase
on EMR. In the "Load HFile to HBase Table" step, Kylin failed with a
timeout error:

Thu Sep 07 15:33:27 GMT+08:00 2017,
RpcRetryingCaller{globalStartTime=1504769048975, pause=100, retries=35}, Call to ip-10-0-0-28.ec2.internal/
failed on local exception:
org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=41,
waitTime=60001, operationTimeout=60000

In the HBase region server log, I saw that HBase uploads the HFile to S3. Since
the cube is fairly big (13GB), this takes much longer than usual. The Kylin
client closed the connection because it considered the call timed out:

2017-09-07 08:01:12,275 INFO
regionserver.HRegionFileSystem: Bulk-load file
is on different filesystem than the destination store. Copying file over to
destination filesystem.
2017-09-07 08:01:23,919 INFO
s3.MultipartUploadManager: completed multipart upload of 8 parts 965420145

2017-09-07 08:26:33,838 WARN
[RpcServer.FifoWFPBQ.default.handler=20,queue=2,port=16020] ipc.RpcServer:
2017-09-07 08:26:33,838 WARN
[RpcServer.FifoWFPBQ.default.handler=20,queue=2,port=16020] ipc.RpcServer:
RpcServer.FifoWFPBQ.default.handler=20,queue=2,port=16020: caught a
ClosedChannelException, this means that the server / was
processing a request but the client went away. The error message was: null

So I wonder how you got around this problem: did you set a very large
timeout value for HBase, or is your cube not that big? Thanks.
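
A rough back-of-the-envelope check of the numbers in the logs above, as a minimal Python sketch. It assumes that the trailing number in the "completed multipart upload" line (965420145) is a byte count, that the two region server timestamps bracket the copy of that one file, and that the whole 13GB cube would be copied sequentially at the same rate inside a single call. None of these assumptions is confirmed by the logs; the function names are made up for illustration.

```python
from datetime import datetime

# Values copied from the log excerpts above.
COPY_START = "2017-09-07 08:01:12,275"   # "Copying file over to destination filesystem"
UPLOAD_DONE = "2017-09-07 08:01:23,919"  # "completed multipart upload of 8 parts"
UPLOADED_BYTES = 965420145               # assumption: the trailing number is a byte count
OPERATION_TIMEOUT_MS = 60000             # operationTimeout from the CallTimeoutException
CUBE_BYTES = 13 * 1024 ** 3              # "13GB" cube, read here as GiB (assumption)


def parse_ts(text):
    """Parse a log timestamp of the form 'YYYY-MM-DD HH:MM:SS,mmm'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def copy_throughput(start, end, size_bytes):
    """Bytes per second achieved between two log timestamps."""
    elapsed = (parse_ts(end) - parse_ts(start)).total_seconds()
    return size_bytes / elapsed


def estimated_copy_seconds(total_bytes, bytes_per_second):
    """Time to copy total_bytes sequentially at a constant rate."""
    return total_bytes / bytes_per_second


rate = copy_throughput(COPY_START, UPLOAD_DONE, UPLOADED_BYTES)
seconds = estimated_copy_seconds(CUBE_BYTES, rate)
print(f"observed rate: {rate / 1e6:.1f} MB/s")
print(f"estimated copy time for the whole cube: {seconds:.0f} s")
print(f"exceeds operationTimeout: {seconds * 1000 > OPERATION_TIMEOUT_MS}")
```

Under these assumptions the estimate lands at a few minutes, well above the 60-second operationTimeout in the exception. The real per-call time depends on how the HFiles are grouped per region and per RPC, so treat this only as an order-of-magnitude hint when choosing a timeout.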

2017-08-14 14:19 GMT+08:00 Alexander Sterligov <>:

> Here is the ticket for the hfile on s3 issue -
> jira/browse/KYLIN-2788
> On Mon, Aug 14, 2017 at 9:17 AM, Alexander Sterligov <>
> wrote:
>> I forgot there was one more issue with s3 -
>> ra/browse/KYLIN-2740.
>> The global dictionary in 2.0 doesn't work out of the box. I patched kylin as
>> described in the ticket.
>> On Sun, Aug 13, 2017 at 4:24 AM, ShaoFeng Shi <>
>> wrote:
>>> Nice. The issue of writing hfiles to S3 needs more
>>> investigation. Please open a Kylin JIRA for tracking. We will update it there
>>> if we have any findings.
>>> 2017-08-12 23:52 GMT+08:00 Alexander Sterligov <>:
>>>> Query performance is pretty much the same as on the slides about kylin. I have a high
>>>> bucket cache hit ratio (>90%), so data is almost always read from local disk. For
>>>> some other use cases it might be different.
>>>> On Aug 12, 2017, 17:59, "ShaoFeng Shi" <> wrote:
>>>> Cool; how about the query performance with data on s3?
>>>> 2017-08-11 23:27 GMT+08:00 Alexander Sterligov <>:
>>>>> Yes, that's the only one for now.
>>>>> On Fri, Aug 11, 2017 at 6:23 PM, ShaoFeng Shi <>
>>>>> wrote:
>>>>>> No need to add them, I think, because I see they are already in the
>>>>>> configuration of that step.
>>>>>> Is this the only issue you see with Kylin on EMR+S3?
>>>>>> [image: inline image 1]
>>>>>> 2017-08-11 20:26 GMT+08:00 Alexander Sterligov <>:
>>>>>>> What if we add direct output to kylin_job_conf.xml
>>>>>>> and kylin_job_conf_inmem.xml?
>>>>>>> hbase.zookeeper.quorum, for example, doesn't work if it is not specified in
>>>>>>> these configs.
>>>>>>> On Fri, Aug 11, 2017 at 3:13 PM, ShaoFeng Shi <> wrote:
>>>>>>>> EMR enables the direct output in mapred-site.xml, while in this
>>>>>>>> step it seems these settings don't work (although the job's configuration
>>>>>>>> shows they are there). I disabled the direct output, but the behavior did
>>>>>>>> not change. I searched but found nothing. I need to drop the EMR cluster now,
>>>>>>>> and may get back to it later.
>>>>>>>> If you have any ideas or findings, please share them. We'd like to
>>>>>>>> make Kylin support the cloud better.
>>>>>>>> Thanks for your feedback!
>>>>>>>> 2017-08-11 19:19 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>> Any ideas how to fix that?
>>>>>>>>> On Fri, Aug 11, 2017 at 2:16 PM, ShaoFeng Shi <> wrote:
>>>>>>>>>> I got the same problem as you:
>>>>>>>>>> 2017-08-11 08:44:16,342 WARN  [Job 2c86b4b6-7639-4a97-ba63-63c9dca095f6-2255]
>>>>>>>>>> mapreduce.LoadIncrementalHFiles:422 : Bulk load operation did
>>>>>>>>>> not find any files to load in directory
>>>>>>>>>> s3://privatekeybucket-anac5h41523l/kylin/kylin_default_instance/kylin-2c86b4b6-7639-4a97-ba63-63c9dca095f6/kylin_sales_cube_clone3/hfile.
>>>>>>>>>> Does it contain files in subdirectories that correspond to column family names?
>>>>>>>>>> In the S3 view, I see that the files exist in the "_temporary" folder; it seems
>>>>>>>>>> they were not moved to the target folder on completion. It seems EMR tries to
>>>>>>>>>> write directly to the output path, but actually does not.
>>>>>>>>>> 2017-08-11 16:34 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>>>> No, defaultFs is hdfs.
>>>>>>>>>>> I've seen such behavior when I set the working dir to s3 but didn't
>>>>>>>>>>> set cluster-fs at all. Maybe you have a typo in the name of the property. I
>>>>>>>>>>> used the old one, "kylin.hbase.cluster.fs".
>>>>>>>>>>> When both working-dir and cluster-fs were set to s3, I got the
>>>>>>>>>>> _temporary dir of the convert job on s3, but no hfiles. I also saw the correct
>>>>>>>>>>> output path for the job in the log. I didn't check whether the job creates
>>>>>>>>>>> temporary files in s3 and then copies the results to hdfs, but I hardly believe it
>>>>>>>>>>> happens.
>>>>>>>>>>> Do you see proper arguments for the step in the log?
>>>>>>>>>>> On Aug 11, 2017, at 11:17, ShaoFeng Shi <> wrote:
>>>>>>>>>>> Hi Alexander,
>>>>>>>>>>> That makes sense. Using S3 for Cube build and storage is
>>>>>>>>>>> required for a cloud hadoop environment.
>>>>>>>>>>> I tried to reproduce this problem. I created an EMR cluster with S3 as
>>>>>>>>>>> HBase storage, in, I set "kylin.env.hdfs-working-dir"
>>>>>>>>>>> and "" to the S3 bucket. But in
>>>>>>>>>>> the "Convert Cuboid Data to HFile" step, Kylin still writes to
>>>>>>>>>>> local HDFS. Did you modify core-site.xml to make S3 the default FS?
>>>>>>>>>>> 2017-08-10 22:53 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>>>>> Yes, I worked around this problem that way and it works.
>>>>>>>>>>>> One problem with this solution is that I have to use a pretty large
>>>>>>>>>>>> hdfs, and it's expensive. I also have to garbage collect it manually,
>>>>>>>>>>>> because the data is not moved to s3 but copied. The Kylin cleanup job doesn't
>>>>>>>>>>>> work for it, because the main metadata folder is on s3. So it would be really
>>>>>>>>>>>> nice to put everything on s3.
>>>>>>>>>>>> Another problem is that I had to raise the hbase rpc timeout,
>>>>>>>>>>>> because bulk loading from hdfs takes long. That was not trivial. 3 minutes
>>>>>>>>>>>> works well, but with the drawback that queries or metadata writes hang for 3
>>>>>>>>>>>> minutes if something bad happens. But that's a rare event.
>>>>>>>>>>>> On Aug 10, 2017, 17:42, "ShaoFeng Shi" <> wrote:
>>>>>>>>>>>> How about leaving "kylin.hbase.cluster.fs" empty? This
>>>>>>>>>>>>> property is for two-cluster deployment (one Hadoop for cube build, the
>>>>>>>>>>>>> other for query);
>>>>>>>>>>>>> When it is empty, the HFile will be written to the default fs (HDFS
>>>>>>>>>>>>> in EMR), and then loaded into HBase. I'm not sure whether EMR HBase (using S3
>>>>>>>>>>>>> as storage) can bulk load files from HDFS or not. If it can, that would be
>>>>>>>>>>>>> great, as the write performance of HDFS would be better than S3.
>>>>>>>>>>>>> 2017-08-10 22:29 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>>>>>>> I also thought about it, but no, it's not consistency.
>>>>>>>>>>>>>> Consistent view is enabled. I use the same s3 for my own
>>>>>>>>>>>>>> map-reduce jobs and it's ok.
>>>>>>>>>>>>>> I also checked whether it lost consistency (emrfs diff). No
>>>>>>>>>>>>>> problems.
>>>>>>>>>>>>>> In case of inconsistency, s3 files disappear right after
>>>>>>>>>>>>>> they are written and appear some time later. The hfiles didn't appear after a
>>>>>>>>>>>>>> day, but _temporary is there.
>>>>>>>>>>>>>> It's 100% reproducible. I think I'll investigate this problem
>>>>>>>>>>>>>> by running the conversion job manually.
>>>>>>>>>>>>>> On Aug 10, 2017, 17:18, "ShaoFeng Shi" <> wrote:
>>>>>>>>>>>>>> Did you enable the Consistent View? This article explains the
>>>>>>>>>>>>>>> challenges of using S3 directly for ETL processes:
>>>>>>>>>>>>>>> y-when-using-amazon-s3-and-amazon-elastic-mapreduce-for-etl-workflows/
>>>>>>>>>>>>>>> 2017-08-09 18:19 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>>>>>>>>> Yes, it's empty. Also I see this message in the log:
>>>>>>>>>>>>>>>> 2017-08-09 09:02:35,947 WARN  [Job 1e436685-7102-4621-a4cb-6472b866126d-7608]
>>>>>>>>>>>>>>>> mapreduce.LoadIncrementalHFiles:234 : Skipping non-directory
>>>>>>>>>>>>>>>> s3://joom.emr.fs/home/production/bi/kylin/kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/main_event_1_main/hfile/_SUCCESS
>>>>>>>>>>>>>>>> 2017-08-09 09:02:36,009 WARN  [Job 1e436685-7102-4621-a4cb-6472b866126d-7608]
>>>>>>>>>>>>>>>> mapreduce.LoadIncrementalHFiles:252 : Skipping non-file
>>>>>>>>>>>>>>>> FileStatusExt{path=s3://joom.emr.fs/home/production/bi/kylin/kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/main_event_1_main/hfile/_temporary/1;
>>>>>>>>>>>>>>>> isDirectory=true; modification_time=0; access_time=0; owner=; group=;
>>>>>>>>>>>>>>>> permission=rwxrwxrwx; isSymlink=false}
>>>>>>>>>>>>>>>> 2017-08-09 09:02:36,014 WARN  [Job 1e436685-7102-4621-a4cb-6472b866126d-7608]
>>>>>>>>>>>>>>>> mapreduce.LoadIncrementalHFiles:422 : Bulk load operation
>>>>>>>>>>>>>>>> did not find any files to load in directory
>>>>>>>>>>>>>>>> s3://joom.emr.fs/home/production/bi/kylin/kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/main_event_1_main/hfile.
>>>>>>>>>>>>>>>> Does it contain files in subdirectories that correspond to column family names?
>>>>>>>>>>>>>>>> On Wed, Aug 9, 2017 at 1:15 PM, ShaoFeng Shi <> wrote:
>>>>>>>>>>>>>>>>> The HFile will be moved to the HBase data folder when the bulk
>>>>>>>>>>>>>>>>> load finishes. Did you check whether the HTable has data?
>>>>>>>>>>>>>>>>> 2017-08-09 17:54 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>>>>>> I set kylin.hbase.cluster.fs to the s3 bucket where hbase
>>>>>>>>>>>>>>>>>> lives.
>>>>>>>>>>>>>>>>>> Step "Convert Cuboid Data to HFile" finished without
>>>>>>>>>>>>>>>>>> errors. Statistics at the end of the job said that it had written lots of
>>>>>>>>>>>>>>>>>> data to s3.
>>>>>>>>>>>>>>>>>> But there are no hfiles in the kylin_metadata folder
>>>>>>>>>>>>>>>>>> (kylin_metadata /kylin-1e436685-7102-4621-a4cb-6472b866126d/<table
>>>>>>>>>>>>>>>>>> name>/hfile), only a _temporary folder and a _SUCCESS file.
>>>>>>>>>>>>>>>>>> _temporary contains hfiles inside attempt folders. It
>>>>>>>>>>>>>>>>>> looks like they were not copied from _temporary to the result dir. But there
>>>>>>>>>>>>>>>>>> are no errors in either the kylin log or the reducers' logs.
>>>>>>>>>>>>>>>>>> Then loading empty hfiles produces empty segments.
>>>>>>>>>>>>>>>>>> Is that a bug or am I doing something wrong?

Best regards,

Shaofeng Shi 史少锋

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by Alexander Sterligov <>.
Here is the ticket for the hfile on s3 issue -
jira/browse/KYLIN-2788

On Mon, Aug 14, 2017 at 9:17 AM, Alexander Sterligov <> wrote:

> I forgot there was one more issue with s3 -
> jira/browse/KYLIN-2740.
> The global dictionary in 2.0 doesn't work out of the box. I patched kylin as
> described in the ticket.
> On Sun, Aug 13, 2017 at 4:24 AM, ShaoFeng Shi <>
> wrote:
>> Nice. The issue of writing hfiles to S3 needs more
>> investigation. Please open a Kylin JIRA for tracking. We will update it there
>> if we have any findings.
>> 2017-08-12 23:52 GMT+08:00 Alexander Sterligov <>:
>>> Query performance is pretty much the same as on the slides about kylin. I have a high
>>> bucket cache hit ratio (>90%), so data is almost always read from local disk. For
>>> some other use cases it might be different.
>>> On Aug 12, 2017, 17:59, "ShaoFeng Shi" <> wrote:
>>> Cool; how about the query performance with data on s3?
>>> 2017-08-11 23:27 GMT+08:00 Alexander Sterligov <>:
>>>> Yes, that's the only one for now.
>>>> On Fri, Aug 11, 2017 at 6:23 PM, ShaoFeng Shi <>
>>>> wrote:
>>>>> No need to add them, I think, because I see they are already in the
>>>>> configuration of that step.
>>>>> Is this the only issue you see with Kylin on EMR+S3?
>>>>> [image: inline image 1]
>>>>> 2017-08-11 20:26 GMT+08:00 Alexander Sterligov <>:
>>>>>> What if we add direct output to kylin_job_conf.xml
>>>>>> and kylin_job_conf_inmem.xml?
>>>>>> hbase.zookeeper.quorum, for example, doesn't work if it is not specified in
>>>>>> these configs.
>>>>>> On Fri, Aug 11, 2017 at 3:13 PM, ShaoFeng Shi <> wrote:
>>>>>>> EMR enables the direct output in mapred-site.xml, while in this step
>>>>>>> it seems these settings doesn't work (althoug the job's configuration shows
>>>>>>> they are there). I disabled the direct output but the behavior has no
>>>>>>> change. I did some search but no finding. I need drop the EMR now, and may
>>>>>>> get back it later.
>>>>>>> If you have any idea or findings, please share it. We'd like to make
>>>>>>> Kylin has better support for cloud.
>>>>>>> Thanks for your feedback!
>>>>>>> 2017-08-11 19:19 GMT+08:00 Alexander Sterligov <>
>>>>>>> :
>>>>>>>> Any ideas how to fix that?
>>>>>>>> On Fri, Aug 11, 2017 at 2:16 PM, ShaoFeng Shi <
>>>>>>>>> wrote:
>>>>>>>>> I got the same problem as you:
>>>>>>>>> 2017-08-11 08:44:16,342 WARN  [Job 2c86b4b6-7639-4a97-ba63-63c9dca095f6-2255]
>>>>>>>>> mapreduce.LoadIncrementalHFiles:422 : Bulk load operation did not
>>>>>>>>> find any files to load in directory s3://privatekeybucket-anac5h41
>>>>>>>>> 523l/kylin/kylin_default_instance/kylin-2c86b4b6-7639-4a97-b
>>>>>>>>> a63-63c9dca095f6/kylin_sales_cube_clone3/hfile.  Does it contain
>>>>>>>>> files in subdirectories that correspond to column family names?
>>>>>>>>> In S3 view, I see the files exist in "_temporary" folder, seems
>>>>>>>>> were not moved to the target folder on complete. It seems EMR try to direct
>>>>>>>>> write to otuput path, but actually not.
>>>>>>>>> 2017-08-11 16:34 GMT+08:00 Alexander Sterligov <
>>>>>>>>>> No, defaultFs is hdfs.
>>>>>>>>>> I’ve seen such behavior when set working dir to s3, but didn’t
>>>>>>>>>> set cluster-fs at all. Maybe you have a typo in the name of the property. I
>>>>>>>>>> used the old one «kylin.hbase.cluster.fs»
>>>>>>>>>> When both working-dir and cluster-fs were set to s3 I got
>>>>>>>>>> _temporary dir of convert job at s3, but no hfiles. Also I saw correct
>>>>>>>>>> output path for the job in the log. But I didn’t check if job creates
>>>>>>>>>> temporary files in s3, but then copies results to hdfs. I hardly believe it
>>>>>>>>>> happens.
>>>>>>>>>> Do you see proper arguments for the step in the log?
>>>>>>>>>> 11 авг. 2017 г., в 11:17, ShaoFeng Shi <>
>>>>>>>>>> написал(а):
>>>>>>>>>> Hi Alexander,
>>>>>>>>>> That makes sense. Using S3 for Cube build and storage is required
>>>>>>>>>> for a cloud hadoop environment.
>>>>>>>>>> I tried to reproduce this problem. I created a EMR with S3 as
>>>>>>>>>> HBase storage, in, I set "kylin.env.hdfs-working-dir"
>>>>>>>>>> and "" to the S3 bucket. But in
>>>>>>>>>> the "Convert Cuboid Data to HFile" step, Kylin still writes to
>>>>>>>>>> local HDFS; Did you modify the core-site.xml to make S3 as the default FS?
>>>>>>>>>> 2017-08-10 22:53 GMT+08:00 Alexander Sterligov <
>>>>>>>>>>> Yes, I workarounded this problem in such way and it works.
>>>>>>>>>>> One problem of such solution is that I have to use pretty large
>>>>>>>>>>> hdfs and it'expensive. And also I have to manually garbage collect it,
>>>>>>>>>>> because it is not moved to s3, but copied. Kylin cleanup job doesn't work
>>>>>>>>>>> for it, because main metadata folder is at s3. So it would be really nice
>>>>>>>>>>> to put everything to s3.
>>>>>>>>>>> Another problem is that I had to rise hbase rpc timeout, because
>>>>>>>>>>> bulk loading from hdfs takes long. That was not trivial. 3 minutes work
>>>>>>>>>>> good, but with drawback of queries or metadata writes handing for 3 minutes
>>>>>>>>>>> if something bad happen. But that's rare event.
>>>>>>>>>>> 10 авг. 2017 г. 17:42 пользователь "ShaoFeng Shi" <
>>>>>>>>>>>> написал:
>>>>>>>>>>> How about leaving empty for "kylin.hbase.cluster.fs"? This
>>>>>>>>>>>> property is for two-cluster deployment (one Hadoop for cube build, the
>>>>>>>>>>>> other for query);
>>>>>>>>>>>> When be empty, the HFile will be written to default fs (HDFS in
>>>>>>>>>>>> EMR), and then load to HBase. I'm not sure whether EMR HBase (using S3 as
>>>>>>>>>>>> storage) can bulk load files from HDFS or not. If it can, that would be
>>>>>>>>>>>> great as the write performance of HDFS would be better than S3.
>>>>>>>>>>>> 2017-08-10 22:29 GMT+08:00 Alexander Sterligov <
>>>>>>>>>>>>> I also thought about it, but no, it's not consistency.
>>>>>>>>>>>>> Consistency view is enabled. I use same s3 for my own
>>>>>>>>>>>>> map-reduce jobs and it's ok.
>>>>>>>>>>>>> I also checked if it lost consistency (emrfs diff). No
>>>>>>>>>>>>> problems.
>>>>>>>>>>>>> In case of inconsistency of s3 files disappear right after
>>>>>>>>>>>>> they were written and appear some time after. Hfiles didn't appear after a
>>>>>>>>>>>>> day, but _template is there.
>>>>>>>>>>>>> It's 100% reproducable, I think I'll investigate this problem
>>>>>>>>>>>>> by running conversion job manually.
>>>>>>>>>>>>> 10 авг. 2017 г. 17:18 пользователь "ShaoFeng Shi" <
>>>>>>>>>>>>>> написал:
>>>>>>>>>>>>> Did you enable the Consistent View? This article explains the
>>>>>>>>>>>>>> challenge when using S3 directly for ETL process:
>>>>>>>>>>>>>> y-when-using-amazon-s3-and-amazon-elastic-mapreduce-for-etl-
>>>>>>>>>>>>>> workflows/
>>>>>>>>>>>>>> 2017-08-09 18:19 GMT+08:00 Alexander Sterligov <
>>>>>>>>>>>>>>> Yes, it's empty. Also I see this message in the log:
>>>>>>>>>>>>>>> 2017-08-09 09:02:35,947 WARN  [Job
>>>>>>>>>>>>>>> 1e436685-7102-4621-a4cb-6472b866126d-7608]
>>>>>>>>>>>>>>> mapreduce.LoadIncrementalHFiles:234 : Skipping
>>>>>>>>>>>>>>> non-directory s3://joom.emr.fs/home/producti
>>>>>>>>>>>>>>> on/bi/kylin/kylin_metadata/kylin-1e436685-7102-4621-a4cb-647
>>>>>>>>>>>>>>> 2b866126d
>>>>>>>>>>>>>>> /main_event_1_main/hfile/_SUCCESS
>>>>>>>>>>>>>>> 2017-08-09 09:02:36,009 WARN  [Job
>>>>>>>>>>>>>>> 1e436685-7102-4621-a4cb-6472b866126d-7608]
>>>>>>>>>>>>>>> mapreduce.LoadIncrementalHFiles:252 : Skipping non-file
>>>>>>>>>>>>>>> FileStatusExt{path=s3://joom.emr.fs/home/production/bi/kylin
>>>>>>>>>>>>>>> /kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/m
>>>>>>>>>>>>>>> ain_event_1_main/hfile/_temporary/1; isDirectory=true;
>>>>>>>>>>>>>>> modification_time=0; access_time=0; owner=; group=; permission=rwxrwxrwx;
>>>>>>>>>>>>>>> isSymlink=false}
>>>>>>>>>>>>>>> 2017-08-09 09:02:36,014 WARN  [Job
>>>>>>>>>>>>>>> 1e436685-7102-4621-a4cb-6472b866126d-7608]
>>>>>>>>>>>>>>> mapreduce.LoadIncrementalHFiles:422 : Bulk load operation
>>>>>>>>>>>>>>> did not find any files to load in directory
>>>>>>>>>>>>>>> s3://joom.emr.fs/home/production/bi/kylin/kylin_metadata/kyl
>>>>>>>>>>>>>>> in-1e436685-7102-4621-a4cb-6472b866126d/main_event_1_main/hfile.
>>>>>>>>>>>>>>> Does it contain files in subdirectories that correspond to column family
>>>>>>>>>>>>>>> names?
>>>>>>>>>>>>>>> On Wed, Aug 9, 2017 at 1:15 PM, ShaoFeng Shi <
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> The HFile will be moved to HBase data folder when bulk load
>>>>>>>>>>>>>>>> finished; Did you check whether the HTable has data?
>>>>>>>>>>>>>>>> 2017-08-09 17:54 GMT+08:00 Alexander Sterligov <
>>>>>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>>>>> I set kylin.hbase.cluster.fs to s3 bucket where hbase
>>>>>>>>>>>>>>>>> lives.
>>>>>>>>>>>>>>>>> Step "Convert Cuboid Data to HFile" finished without
>>>>>>>>>>>>>>>>> errors. Statistics at the end of the job said that it has written lot's of
>>>>>>>>>>>>>>>>> data to s3.
>>>>>>>>>>>>>>>>> But there is no hfiles in kylin_metadata folder
>>>>>>>>>>>>>>>>> (kylin_metadata /kylin-1e436685-7102-4621-a4cb-6472b866126d/<table
>>>>>>>>>>>>>>>>> name>/hfile), but only _temporary folder and _SUCCESS file.
>>>>>>>>>>>>>>>>> _temporary contains hfiles inside attempt folders. it
>>>>>>>>>>>>>>>>> looks like there were not copied from _temporary to result dir. But there
>>>>>>>>>>>>>>>>> is no errors neither in kylin log, nor in reducers' logs.
>>>>>>>>>>>>>>>>> Then loading empty hfiles produces empty segments.
>>>>>>>>>>>>>>>>> Is that a bug or I'm doing something wrong?
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>> Shaofeng Shi 史少锋
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>> Shaofeng Shi 史少锋
>>>>>>>>>>>> --
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Shaofeng Shi 史少锋
>>>>>>>>>> --
>>>>>>>>>> Best regards,
>>>>>>>>>> Shaofeng Shi 史少锋
>>>>>>>>> --
>>>>>>>>> Best regards,
>>>>>>>>> Shaofeng Shi 史少锋
>>>>>>> --
>>>>>>> Best regards,
>>>>>>> Shaofeng Shi 史少锋
>>>>> --
>>>>> Best regards,
>>>>> Shaofeng Shi 史少锋
>>> --
>>> Best regards,
>>> Shaofeng Shi 史少锋
>> --
>> Best regards,
>> Shaofeng Shi 史少锋

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by Alexander Sterligov <>.
I forgot there was one more issue with S3 -
jira/browse/KYLIN-2740.
The global dictionary in 2.0 doesn't work out of the box. I patched Kylin as
described in the ticket.

On Sun, Aug 13, 2017 at 4:24 AM, ShaoFeng Shi <> wrote:

> Nice. The issue of writing HFiles to S3 needs more
> investigation. Please open a Kylin JIRA for tracking. We will update there
> if there are any findings.

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by ShaoFeng Shi <>.
Nice. The issue of writing HFiles to S3 needs more
investigation. Please open a Kylin JIRA for tracking. We will update there
if there are any findings.

2017-08-12 23:52 GMT+08:00 Alexander Sterligov <>:

> Query performance is pretty much the same as on the slides about Kylin. I have
> a high bucket cache hit rate (>90%), so data is almost always read from local
> disk. For some other use cases it might be different.

Best regards,

Shaofeng Shi 史少锋
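
The symptom discussed in this thread (an HFile output directory that holds only _temporary and _SUCCESS, so that the bulk load finds no files in column family subdirectories) can be checked mechanically. The following is a minimal sketch, not part of Kylin or HBase: the function names, the bucket, the attempt folder and the column family name "F1" are all hypothetical, and the path listing is assumed to come from whatever tool you use to list the S3 prefix.

```python
from collections import defaultdict


def summarize_hfile_output(output_dir, paths):
    """Split a path listing into committed HFiles and leftovers in _temporary.

    A file counts as committed only if it sits directly inside a
    <output_dir>/<column family>/ subdirectory, which is the layout the
    bulk load warning in this thread asks about.
    """
    prefix = output_dir.rstrip("/") + "/"
    committed = defaultdict(list)
    uncommitted = []
    for path in paths:
        if not path.startswith(prefix):
            continue  # not under the output directory at all
        parts = path[len(prefix):].split("/")
        if parts[0] == "_temporary":
            uncommitted.append(path)  # task output that was never moved
        elif len(parts) == 2 and not parts[0].startswith("_"):
            committed[parts[0]].append(parts[1])
        # anything else (for example the _SUCCESS marker) is ignored
    return dict(committed), uncommitted


def looks_uncommitted(output_dir, paths):
    """True when files exist only under _temporary, as reported in this thread."""
    committed, uncommitted = summarize_hfile_output(output_dir, paths)
    return not committed and bool(uncommitted)


if __name__ == "__main__":
    # Hypothetical listing that mimics the reported state of the output folder.
    out = "s3://example-bucket/kylin_metadata/kylin-job/example_table/hfile"
    listing = [
        out + "/_SUCCESS",
        out + "/_temporary/1/attempt_0001/F1/part-0",
    ]
    print(looks_uncommitted(out, listing))
```

Running such a check right after the "Convert Cuboid Data to HFile" step would show whether the job output was committed before the bulk load starts, which narrows the problem down to the output commit rather than to the load itself.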

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by Alexander Sterligov <>.
Query performance is pretty much the same as on the slides about Kylin. I have a
high bucket cache hit rate (>90%), so data is almost always read from local disk.
For some other use cases it might be different.

On Aug 12, 2017, at 17:59, "ShaoFeng Shi" <> wrote:

Cool; how about the query performance with data on s3?

2017-08-11 23:27 GMT+08:00 Alexander Sterligov <>:

> Yes, that's the only one for now.
> On Fri, Aug 11, 2017 at 6:23 PM, ShaoFeng Shi <>
> wrote:
>> No need to add them, I think, because I see they are already in the
>> configuration of that step.
>> Is this the only issue you see with Kylin on EMR+S3?
>> [image: inline image 1]
>> 2017-08-11 20:26 GMT+08:00 Alexander Sterligov <>:
>>> What if we add the direct output settings to kylin_job_conf.xml
>>> and kylin_job_conf_inmem.xml?
>>> hbase.zookeeper.quorum, for example, doesn't work if it is not specified
>>> in these configs.
>>> On Fri, Aug 11, 2017 at 3:13 PM, ShaoFeng Shi <>
>>> wrote:
>>>> EMR enables direct output in mapred-site.xml, but in this step these
>>>> settings don't seem to take effect (although the job's configuration shows
>>>> they are there). I disabled direct output, but the behavior did not
>>>> change. I searched around but found nothing. I need to drop the EMR
>>>> cluster now and may get back to it later.
>>>> If you have any ideas or findings, please share them. We'd like Kylin to
>>>> have better support for the cloud.
>>>> Thanks for your feedback!
>>>> 2017-08-11 19:19 GMT+08:00 Alexander Sterligov <>:
>>>>> Any ideas how to fix that?
>>>>> On Fri, Aug 11, 2017 at 2:16 PM, ShaoFeng Shi <>
>>>>> wrote:
>>>>>> I got the same problem as you:
>>>>>> 2017-08-11 08:44:16,342 WARN  [Job 2c86b4b6-7639-4a97-ba63-63c9dca095f6-2255] mapreduce.LoadIncrementalHFiles:422 : Bulk load operation did not find any files to load in directory s3://privatekeybucket-anac5h41523l/kylin/kylin_default_instance/kylin-2c86b4b6-7639-4a97-ba63-63c9dca095f6/kylin_sales_cube_clone3/hfile.  Does it contain files in subdirectories that correspond to column family names?
>>>>>> In the S3 view, I see that the files exist in the "_temporary" folder; it
>>>>>> seems they were not moved to the target folder on completion. It seems EMR
>>>>>> tries to write directly to the output path, but actually does not.
>>>>>> 2017-08-11 16:34 GMT+08:00 Alexander Sterligov <>:
>>>>>>> No, defaultFs is hdfs.
>>>>>>> I’ve seen such behavior when I set the working dir to S3 but didn’t set
>>>>>>> cluster-fs at all. Maybe you have a typo in the name of the property. I
>>>>>>> used the old one, «kylin.hbase.cluster.fs».
>>>>>>> When both working-dir and cluster-fs were set to S3, I got the _temporary
>>>>>>> dir of the convert job on S3, but no HFiles. I also saw the correct output
>>>>>>> path for the job in the log. But I didn’t check whether the job creates
>>>>>>> temporary files in S3 and then copies the results to HDFS. I hardly
>>>>>>> believe that happens.
>>>>>>> Do you see the proper arguments for the step in the log?
>>>>>>> On Aug 11, 2017, at 11:17, ShaoFeng Shi <> wrote:
>>>>>>> Hi Alexander,
>>>>>>> That makes sense. Using S3 for cube build and storage is required
>>>>>>> for a cloud Hadoop environment.
>>>>>>> I tried to reproduce this problem. I created an EMR cluster with S3 as HBase
>>>>>>> storage, in, I set "kylin.env.hdfs-working-dir"
>>>>>>> and "" to the S3 bucket. But in the "Convert
>>>>>>> Cuboid Data to HFile" step, Kylin still writes to local HDFS. Did you
>>>>>>> modify core-site.xml to make S3 the default FS?
>>>>>>> 2017-08-10 22:53 GMT+08:00 Alexander Sterligov <>:
>>>>>>>> Yes, I worked around this problem that way and it works.
>>>>>>>> One problem with this solution is that I have to use a pretty large
>>>>>>>> HDFS and it's expensive. I also have to garbage-collect it manually,
>>>>>>>> because the data is not moved to S3, but copied. The Kylin cleanup job
>>>>>>>> doesn't work for it, because the main metadata folder is on S3. So it
>>>>>>>> would be really nice to put everything on S3.
>>>>>>>> Another problem is that I had to raise the HBase RPC timeout, because
>>>>>>>> bulk loading from HDFS takes a long time. That was not trivial. 3 minutes
>>>>>>>> works well, but with the drawback that queries or metadata writes hang
>>>>>>>> for 3 minutes if something bad happens. But that's a rare event.
>>>>>>>> On Aug 10, 2017, at 17:42, "ShaoFeng Shi" <> wrote:
>>>>>>>>> How about leaving "kylin.hbase.cluster.fs" empty? This
>>>>>>>>> property is for two-cluster deployments (one Hadoop cluster for cube
>>>>>>>>> build, the other for query).
>>>>>>>>> When it is empty, the HFile will be written to the default FS (HDFS on
>>>>>>>>> EMR) and then loaded into HBase. I'm not sure whether EMR HBase (using S3 as
>>>>>>>>> storage) can bulk load files from HDFS or not. If it can, that would be
>>>>>>>>> great, as the write performance of HDFS would be better than S3.
>>>>>>>>> 2017-08-10 22:29 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>>> I also thought about it, but no, it's not consistency.
>>>>>>>>>> Consistent view is enabled. I use the same S3 for my own MapReduce
>>>>>>>>>> jobs and it's OK.
>>>>>>>>>> I also checked whether it lost consistency (emrfs diff). No problems.
>>>>>>>>>> With an S3 inconsistency, files disappear right after they
>>>>>>>>>> are written and reappear some time later. The HFiles didn't appear after a day,
>>>>>>>>>> but _temporary is there.
>>>>>>>>>> It's 100% reproducible. I think I'll investigate this problem by
>>>>>>>>>> running the conversion job manually.
>>>>>>>>>> On Aug 10, 2017, at 17:18, "ShaoFeng Shi" <> wrote:
>>>>>>>>>>> Did you enable Consistent View? This article explains the
>>>>>>>>>>> challenges of using S3 directly for ETL processes:
>>>>>>>>>>> y-when-using-amazon-s3-and-amazon-elastic-mapreduce-for-etl-
>>>>>>>>>>> workflows/
>>>>>>>>>>> 2017-08-09 18:19 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>>>>> Yes, it's empty. Also I see these messages in the log:
>>>>>>>>>>>> 2017-08-09 09:02:35,947 WARN  [Job 1e436685-7102-4621-a4cb-6472b866126d-7608] mapreduce.LoadIncrementalHFiles:234 : Skipping non-directory s3://joom.emr.fs/home/production/bi/kylin/kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/main_event_1_main/hfile/_SUCCESS
>>>>>>>>>>>> 2017-08-09 09:02:36,009 WARN  [Job 1e436685-7102-4621-a4cb-6472b866126d-7608] mapreduce.LoadIncrementalHFiles:252 : Skipping non-file FileStatusExt{path=s3://joom.emr.fs/home/production/bi/kylin/kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/main_event_1_main/hfile/_temporary/1; isDirectory=true; modification_time=0; access_time=0; owner=; group=; permission=rwxrwxrwx; isSymlink=false}
>>>>>>>>>>>> 2017-08-09 09:02:36,014 WARN  [Job 1e436685-7102-4621-a4cb-6472b866126d-7608] mapreduce.LoadIncrementalHFiles:422 : Bulk load operation did not find any files to load in directory s3://joom.emr.fs/home/production/bi/kylin/kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/main_event_1_main/hfile. Does it contain files in subdirectories that correspond to column family names?
>>>>>>>>>>>> On Wed, Aug 9, 2017 at 1:15 PM, ShaoFeng Shi <> wrote:
>>>>>>>>>>>>> The HFiles are moved to the HBase data folder when the bulk load
>>>>>>>>>>>>> finishes. Did you check whether the HTable has data?
>>>>>>>>>>>>> 2017-08-09 17:54 GMT+08:00 Alexander Sterligov <>:
>>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>> I set kylin.hbase.cluster.fs to the S3 bucket where HBase lives.
>>>>>>>>>>>>>> Step "Convert Cuboid Data to HFile" finished without errors.
>>>>>>>>>>>>>> Statistics at the end of the job said that it had written lots of data to
>>>>>>>>>>>>>> S3.
>>>>>>>>>>>>>> But there are no HFiles in the kylin_metadata folder
>>>>>>>>>>>>>> (kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/<table
>>>>>>>>>>>>>> name>/hfile), only a _temporary folder and a _SUCCESS file.
>>>>>>>>>>>>>> _temporary contains HFiles inside attempt folders. It looks
>>>>>>>>>>>>>> like they were not copied from _temporary to the result dir. But there are no
>>>>>>>>>>>>>> errors in either the Kylin log or the reducers' logs.
>>>>>>>>>>>>>> Then loading the empty HFiles produces empty segments.
>>>>>>>>>>>>>> Is that a bug, or am I doing something wrong?
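
The symptom reported in the quoted post (a _temporary folder and a _SUCCESS file, but no HFiles in the output directory) can be checked mechanically. This is only a minimal sketch, assuming you already have a listing of object keys under the job's hfile/ output path; the function name split_committed, the sample keys, the column family name F1 and the attempt folder name are all hypothetical and not taken from Kylin or EMR.

```python
def split_committed(keys, output_prefix):
    """Split keys under output_prefix into committed and uncommitted files.

    Files under <output_prefix>/_temporary/ were written by task attempts
    but never moved into place; marker files such as _SUCCESS are ignored.
    """
    prefix = output_prefix.rstrip("/") + "/"
    committed, uncommitted = [], []
    for key in keys:
        if not key.startswith(prefix):
            continue  # not part of this job's output
        relative = key[len(prefix):]
        if not relative or relative.endswith("/"):
            continue  # directory placeholder, not a file
        first = relative.split("/", 1)[0]
        if first == "_temporary":
            uncommitted.append(key)
        elif first.startswith("_"):
            continue  # _SUCCESS and similar markers
        else:
            committed.append(key)
    return committed, uncommitted


# Hypothetical listing that mimics the layout described in the thread.
keys = [
    "kylin_metadata/kylin-job/table/hfile/_SUCCESS",
    "kylin_metadata/kylin-job/table/hfile/_temporary/1/attempt_0/F1/part-0",
]
committed, uncommitted = split_committed(keys, "kylin_metadata/kylin-job/table/hfile")
print(len(committed), len(uncommitted))
```

An empty committed list together with a non-empty uncommitted list matches the situation in this thread, where the bulk load finds nothing to load.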

Best regards,

Shaofeng Shi 史少锋

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by ShaoFeng Shi <>.
Cool. How is the query performance with data on S3?

2017-08-11 23:27 GMT+08:00 Alexander Sterligov <>:

> Yes, that's the only one for now.

Best regards,

Shaofeng Shi 史少锋

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by Alexander Sterligov <>.
Yes, that's the only one for now.

On Fri, Aug 11, 2017 at 6:23 PM, ShaoFeng Shi <> wrote:

> I don't think there is a need to add them, because I see they are already in the configuration of
> that step.
> Is this the only issue you see with Kylin on EMR+S3?
> [image: inline image 1]

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by ShaoFeng Shi <>.
I don't think there is a need to add them, because I see they are already in the configuration of
that step.

Is this the only issue you see with Kylin on EMR+S3?

[image: inline image 1]

2017-08-11 20:26 GMT+08:00 Alexander Sterligov <>:

> What if we add the direct output settings to kylin_job_conf.xml
> and kylin_job_conf_inmem.xml?
> hbase.zookeeper.quorum, for example, doesn't work if it is not specified in these
> configs.

Best regards,

Shaofeng Shi 史少锋

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by Alexander Sterligov <>.
What if we add the direct output settings to kylin_job_conf.xml
and kylin_job_conf_inmem.xml?

hbase.zookeeper.quorum, for example, doesn't work if it is not specified in these configs.
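
To illustrate what adding a setting to these job configs could look like, here is a minimal sketch that renders Hadoop-style <property> entries to paste inside the <configuration> element of kylin_job_conf.xml and kylin_job_conf_inmem.xml. The helper name render_properties and the ZooKeeper host names are hypothetical. The EMR direct output property names are not given in this thread, so they are left out here and would have to be copied from the cluster's own mapred-site.xml.

```python
import xml.etree.ElementTree as ET


def render_properties(props):
    """Render a dict of Hadoop settings as <property> XML fragments."""
    fragments = []
    for name, value in props.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        # encoding="unicode" returns a str without an XML declaration
        fragments.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(fragments)


# Hypothetical quorum hosts; replace with the real ZooKeeper nodes.
snippet = render_properties(
    {"hbase.zookeeper.quorum": "zk1.example.com,zk2.example.com"}
)
print(snippet)
```

Generating the fragment once and pasting it into both files keeps the two job configs from drifting apart.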

On Fri, Aug 11, 2017 at 3:13 PM, ShaoFeng Shi <> wrote:

> EMR enables direct output in mapred-site.xml, but in this step these
> settings don't seem to take effect (although the job's configuration shows
> they are there). I disabled direct output, but the behavior did not
> change. I searched but found nothing. I need to drop the EMR cluster now, and may
> get back to it later.
> If you have any ideas or findings, please share them. We'd like Kylin
> to have better support for the cloud.
> Thanks for your feedback!
> 2017-08-11 19:19 GMT+08:00 Alexander Sterligov <>:
>> Any ideas how to fix that?
>> On Fri, Aug 11, 2017 at 2:16 PM, ShaoFeng Shi <>
>> wrote:
>>> I got the same problem as you:
>>> 2017-08-11 08:44:16,342 WARN  [Job 2c86b4b6-7639-4a97-ba63-63c9dca095f6-2255] mapreduce.LoadIncrementalHFiles:422 : Bulk load operation did not find any files to load in directory s3://privatekeybucket-anac5h41523l/kylin/kylin_default_instance/kylin-2c86b4b6-7639-4a97-ba63-63c9dca095f6/kylin_sales_cube_clone3/hfile.  Does it contain files in subdirectories that correspond to column family names?
>>> In the S3 view, I see that the files exist in the "_temporary" folder; it seems they were not
>>> moved to the target folder on completion. It seems EMR tries to write directly to the
>>> output path, but actually does not.
>>> 2017-08-11 16:34 GMT+08:00 Alexander Sterligov <>:
>>>> No, defaultFs is HDFS.
>>>> I’ve seen such behavior when I set the working dir to S3 but didn’t set
>>>> cluster-fs at all. Maybe you have a typo in the name of the property. I
>>>> used the old one, «kylin.hbase.cluster.fs».
>>>> When both working-dir and cluster-fs were set to S3, I got the _temporary
>>>> dir of the convert job in S3, but no HFiles. I also saw the correct output path for
>>>> the job in the log. I didn’t check whether the job creates temporary files in
>>>> S3 and then copies the results to HDFS, but I doubt that happens.
>>>> Do you see proper arguments for the step in the log?
>>>>>>> 10 авг. 2017 г. 17:18 пользователь "ShaoFeng Shi" <
>>>>>>>> написал:
>>>>>>> Did you enable the Consistent View? This article explains the
>>>>>>>> challenge when using S3 directly for ETL process:
>>>>>>>> y-when-using-amazon-s3-and-amazon-elastic-mapreduce-for-etl-
>>>>>>>> workflows/
>>>>>>>> 2017-08-09 18:19 GMT+08:00 Alexander Sterligov <
>>>>>>>> >:
>>>>>>>>> Yes, it's empty. Also I see this message in the log:
>>>>>>>>> 2017-08-09 09:02:35,947 WARN  [Job 1e436685-7102-4621-a4cb-6472b866126d-7608]
>>>>>>>>> mapreduce.LoadIncrementalHFiles:234 : Skipping non-directory
>>>>>>>>> s3://joom.emr.fs/home/production/bi/kylin/kylin_metadata/kyl
>>>>>>>>> in-1e436685-7102-4621-a4cb-6472b866126d
>>>>>>>>> /main_event_1_main/hfile/_SUCCESS
>>>>>>>>> 2017-08-09 09:02:36,009 WARN  [Job 1e436685-7102-4621-a4cb-6472b866126d-7608]
>>>>>>>>> mapreduce.LoadIncrementalHFiles:252 : Skipping non-file
>>>>>>>>> FileStatusExt{path=s3://joom.emr.fs/home/production/bi/kylin
>>>>>>>>> /kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/m
>>>>>>>>> ain_event_1_main/hfile/_temporary/1; isDirectory=true;
>>>>>>>>> modification_time=0; access_time=0; owner=; group=; permission=rwxrwxrwx;
>>>>>>>>> isSymlink=false}
>>>>>>>>> 2017-08-09 09:02:36,014 WARN  [Job 1e436685-7102-4621-a4cb-6472b866126d-7608]
>>>>>>>>> mapreduce.LoadIncrementalHFiles:422 : Bulk load operation did not
>>>>>>>>> find any files to load in directory s3://joom.emr.fs/home/producti
>>>>>>>>> on/bi/kylin/kylin_metadata/kylin-1e436685-7102-4621-a4cb-647
>>>>>>>>> 2b866126d/main_event_1_main/hfile.  Does it contain files in
>>>>>>>>> subdirectories that correspond to column family names?
>>>>>>>>> On Wed, Aug 9, 2017 at 1:15 PM, ShaoFeng Shi <
>>>>>>>>>> wrote:
>>>>>>>>>> The HFile will be moved to HBase data folder when bulk load
>>>>>>>>>> finished; Did you check whether the HTable has data?
>>>>>>>>>> 2017-08-09 17:54 GMT+08:00 Alexander Sterligov <
>>>>>>>>>>> Hi!
>>>>>>>>>>> I set kylin.hbase.cluster.fs to s3 bucket where hbase lives.
>>>>>>>>>>> Step "Convert Cuboid Data to HFile" finished without errors.
>>>>>>>>>>> Statistics at the end of the job said that it has written lot's of data to
>>>>>>>>>>> s3.
>>>>>>>>>>> But there is no hfiles in kylin_metadata folder (kylin_metadata
>>>>>>>>>>> /kylin-1e436685-7102-4621-a4cb-6472b866126d/<table
>>>>>>>>>>> name>/hfile), but only _temporary folder and _SUCCESS file.
>>>>>>>>>>> _temporary contains hfiles inside attempt folders. it looks like
>>>>>>>>>>> there were not copied from _temporary to result dir. But there is no errors
>>>>>>>>>>> neither in kylin log, nor in reducers' logs.
>>>>>>>>>>> Then loading empty hfiles produces empty segments.
>>>>>>>>>>> Is that a bug or I'm doing something wrong?
>>>>>>>>>> --
>>>>>>>>>> Best regards,
>>>>>>>>>> Shaofeng Shi 史少锋
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>> Shaofeng Shi 史少锋
>>>>>> --
>>>>>> Best regards,
>>>>>> Shaofeng Shi 史少锋
>>>> --
>>>> Best regards,
>>>> Shaofeng Shi 史少锋
>>> --
>>> Best regards,
>>> Shaofeng Shi 史少锋
> --
> Best regards,
> Shaofeng Shi 史少锋

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by ShaoFeng Shi <>.
EMR enables direct output in mapred-site.xml, but in this step those
settings don't seem to take effect (although the job's configuration shows
they are there). I disabled direct output, but the behavior did not
change. I searched for a cause but found nothing. I need to drop the EMR
cluster now and may come back to it later.

If you have any ideas or findings, please share them. We'd like Kylin to
have better support for the cloud.

Thanks for your feedback!

2017-08-11 19:19 GMT+08:00 Alexander Sterligov <>:

> Any ideas how to fix that?

Best regards,

Shaofeng Shi 史少锋

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by Alexander Sterligov <>.
Any ideas how to fix that?

On Fri, Aug 11, 2017 at 2:16 PM, ShaoFeng Shi <> wrote:

> I got the same problem as you:
> 2017-08-11 08:44:16,342 WARN  [Job 2c86b4b6-7639-4a97-ba63-63c9dca095f6-2255] mapreduce.LoadIncrementalHFiles:422 : Bulk load operation did not find any files to load in directory s3://privatekeybucket-anac5h41523l/kylin/kylin_default_instance/kylin-2c86b4b6-7639-4a97-ba63-63c9dca095f6/kylin_sales_cube_clone3/hfile.  Does it contain files in subdirectories that correspond to column family names?
> In the S3 view, I see the files exist in the "_temporary" folder; it seems
> they were not moved to the target folder on completion. It seems EMR tries
> to write directly to the output path, but actually does not.

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by ShaoFeng Shi <>.
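I got the same problem as you:

2017-08-11 08:44:16,342 WARN  [Job 2c86b4b6-7639-4a97-ba63-63c9dca095f6-2255] mapreduce.LoadIncrementalHFiles:422 : Bulk load operation did not find any files to load in directory s3://privatekeybucket-anac5h41523l/kylin/kylin_default_instance/kylin-2c86b4b6-7639-4a97-ba63-63c9dca095f6/kylin_sales_cube_clone3/hfile.  Does it contain files in subdirectories that correspond to column family names?

In the S3 view, I see the files exist in the "_temporary" folder; it seems
they were not moved to the target folder on completion. It seems EMR tries
to write directly to the output path, but actually does not.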
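
To make the symptom easier to check on another cluster, here is a minimal
sketch in Python. It is not HBase or Kylin code: the function name, the
report keys and the sample keys are all hypothetical. It assumes that
committed HFiles sit at <output dir>/<column family>/<file>, as the bulk
load warning suggests, and that anything still under _temporary was never
moved to the target folder.

```python
def classify_hfile_output(keys, output_dir):
    """Group object keys under output_dir by where they ended up.

    Hypothetical helper for illustration only; it works on a plain list
    of key strings, e.g. copied from an S3 listing.
    """
    prefix = output_dir.rstrip("/") + "/"
    report = {"committed": [], "temporary": [], "markers": [], "other": []}
    for key in keys:
        if not key.startswith(prefix) or key.endswith("/"):
            continue  # outside this job's output, or a directory placeholder
        parts = key[len(prefix):].split("/")
        if parts[0] == "_temporary":
            # Task attempt output that was never moved to the target folder
            report["temporary"].append(key)
        elif len(parts) == 1:
            # Plain file directly under the output dir, e.g. _SUCCESS
            report["markers"].append(key)
        elif len(parts) == 2 and not parts[0].startswith("_"):
            # <output_dir>/<column family>/<hfile>
            report["committed"].append(key)
        else:
            report["other"].append(key)
    return report


# Made-up keys that mimic the layout described in this thread.
sample_keys = [
    "kylin/job-1/cube/hfile/_SUCCESS",
    "kylin/job-1/cube/hfile/_temporary/1/attempt_0/F1/abc123",
]
sample_report = classify_hfile_output(sample_keys, "kylin/job-1/cube/hfile")
print(len(sample_report["committed"]), len(sample_report["temporary"]))
```

With a listing like the one described here, "committed" stays empty while
"temporary" holds the HFiles, which matches the bulk load finding nothing
to load.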

2017-08-11 16:34 GMT+08:00 Alexander Sterligov <>:

> No, defaultFs is hdfs.
> I've seen such behavior when I set the working dir to s3 but didn't set
> cluster-fs at all. Maybe you have a typo in the name of the property. I
> used the old one, "kylin.hbase.cluster.fs".
> When both working-dir and cluster-fs were set to s3, I got the _temporary
> dir of the convert job on s3, but no hfiles. I also saw the correct output
> path for the job in the log. I didn't check whether the job creates
> temporary files in s3 and then copies the results to hdfs, but I can
> hardly believe that happens.
> Do you see proper arguments for the step in the log?

Best regards,

Shaofeng Shi 史少锋

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by Alexander Sterligov <>.
No, defaultFs is hdfs.

I’ve seen such behavior when I set the working dir to s3 but didn’t set cluster-fs at all. Maybe you have a typo in the name of the property. I used the old one, "kylin.hbase.cluster.fs".

When both working-dir and cluster-fs were set to s3, I got the _temporary dir of the convert job on s3, but no hfiles. I also saw the correct output path for the job in the log. I didn’t check whether the job creates temporary files in s3 and then copies the results to hdfs, but I can hardly believe that happens.

Do you see proper arguments for the step in the log?
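
To rule out a mistyped property name, a small check like the following
might help. It is only a sketch: the two helper functions are
hypothetical, the parser handles only simple key=value lines, the bucket
name is a placeholder, and the property names are the two mentioned in
this thread (newer Kylin versions may use a different name for the second
one).

```python
from urllib.parse import urlparse


def read_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def storage_schemes(props, names=("kylin.env.hdfs-working-dir",
                                  "kylin.hbase.cluster.fs")):
    """Return the URI scheme each property points at.

    None means the property is missing or empty, which is what a typo in
    the property name would look like.
    """
    result = {}
    for name in names:
        value = props.get(name)
        result[name] = urlparse(value).scheme if value else None
    return result


# Placeholder values, not taken from any real cluster.
sample = """
kylin.env.hdfs-working-dir=s3://example-bucket/kylin
kylin.hbase.cluster.fs=s3://example-bucket
"""
sample_schemes = storage_schemes(read_properties(sample))
print(sample_schemes)
```

If both entries report "s3", the settings are at least being read under
the expected names; a None entry points at a missing or misspelled
property.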

> On 11 Aug 2017, at 11:17, ShaoFeng Shi <> wrote:
> Hi Alexander,
> That makes sense. Using S3 for cube build and storage is required for a cloud Hadoop environment.
> I tried to reproduce this problem. I created an EMR cluster with S3 as HBase storage, in, I set "kylin.env.hdfs-working-dir" and "" to the S3 bucket. But in the "Convert Cuboid Data to HFile" step, Kylin still writes to local HDFS. Did you modify core-site.xml to make S3 the default FS?

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by ShaoFeng Shi <>.
Hi Alexander,

That makes sense. Using S3 for cube build and storage is required for a
cloud Hadoop environment.

I tried to reproduce this problem. I created an EMR cluster with S3 as HBase
storage, in, I set "kylin.env.hdfs-working-dir"
and "" to the S3 bucket. But in the "Convert
Cuboid Data to HFile" step, Kylin still writes to local HDFS. Did you
modify core-site.xml to make S3 the default FS?

2017-08-10 22:53 GMT+08:00 Alexander Sterligov <>:

> Yes, I worked around the problem that way and it works.
> One problem with this solution is that I have to use a pretty large hdfs,
> and it's expensive. I also have to garbage collect it manually, because
> the data is not moved to s3 but copied. The Kylin cleanup job doesn't work
> for it, because the main metadata folder is on s3. So it would be really
> nice to put everything on s3.
> Another problem is that I had to raise the hbase rpc timeout, because bulk
> loading from hdfs takes a long time. That was not trivial. 3 minutes works
> well, but with the drawback that queries or metadata writes hang for 3
> minutes if something bad happens. But that's a rare event.
>>>>> on/bi/kylin/kylin_metadata/kylin-1e436685-7102-4621-a4cb-647
>>>>> 2b866126d/main_event_1_main/hfile.  Does it contain files in
>>>>> subdirectories that correspond to column family names?
>>>>> On Wed, Aug 9, 2017 at 1:15 PM, ShaoFeng Shi <>
>>>>> wrote:
>>>>>> The HFile will be moved to HBase data folder when bulk load finished;
>>>>>> Did you check whether the HTable has data?
>>>>>> 2017-08-09 17:54 GMT+08:00 Alexander Sterligov <>:
>>>>>>> Hi!
>>>>>>> I set kylin.hbase.cluster.fs to s3 bucket where hbase lives.
>>>>>>> Step "Convert Cuboid Data to HFile" finished without errors.
>>>>>>> Statistics at the end of the job said that it has written lot's of data to
>>>>>>> s3.
>>>>>>> But there is no hfiles in kylin_metadata folder (kylin_metadata
>>>>>>> /kylin-1e436685-7102-4621-a4cb-6472b866126d/<table name>/hfile),
>>>>>>> but only _temporary folder and _SUCCESS file.
>>>>>>> _temporary contains hfiles inside attempt folders. it looks like
>>>>>>> there were not copied from _temporary to result dir. But there is no errors
>>>>>>> neither in kylin log, nor in reducers' logs.
>>>>>>> Then loading empty hfiles produces empty segments.
>>>>>>> Is that a bug or I'm doing something wrong?
>>>>>> --
>>>>>> Best regards,
>>>>>> Shaofeng Shi 史少锋
>>>> --
>>>> Best regards,
>>>> Shaofeng Shi 史少锋
>> --
>> Best regards,
>> Shaofeng Shi 史少锋

Best regards,

Shaofeng Shi 史少锋

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by Alexander Sterligov <>.
Yes, I worked around the problem that way, and it works.

One problem with this solution is that I have to use a pretty large HDFS, and
it's expensive. I also have to garbage collect it manually, because the data is
copied to S3 rather than moved. The Kylin cleanup job doesn't work for it, because
the main metadata folder is on S3. So it would be really nice to put everything
on S3.
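
The manual garbage collection mentioned above could start from a selection step like the following. This is only a sketch under assumptions: the directory listing is passed in as plain (name, modification time) pairs, the 7-day retention period is arbitrary, and the name pattern is taken from the job folder quoted in this thread (kylin-<job uuid>). It only selects candidates and deletes nothing.

```python
import re
from datetime import datetime, timedelta

# Job folders in this thread look like kylin-1e436685-7102-4621-a4cb-6472b866126d.
JOB_DIR = re.compile(
    r"^kylin-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)

def stale_job_dirs(entries, now, max_age):
    """Return job folder names older than max_age.

    entries: iterable of (folder name, last modification datetime),
    for example collected from a listing of the HDFS working directory.
    """
    return sorted(
        name
        for name, mtime in entries
        if JOB_DIR.match(name) and now - mtime > max_age
    )

# Hypothetical listing used for illustration only.
entries = [
    ("kylin-1e436685-7102-4621-a4cb-6472b866126d", datetime(2017, 8, 1)),
    ("kylin-00000000-0000-0000-0000-000000000000", datetime(2017, 8, 9)),
    ("kylin_metadata", datetime(2017, 1, 1)),
]
candidates = stale_job_dirs(entries, now=datetime(2017, 8, 10), max_age=timedelta(days=7))
print(candidates)
```

Before deleting anything, the candidates should be checked against running jobs, since a folder that is still being written would also match the name pattern.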

Another problem is that I had to raise the HBase RPC timeout, because bulk
loading from HDFS takes a long time. That was not trivial. 3 minutes works well,
but with the drawback that queries or metadata writes hang for 3 minutes if
something bad happens. But that's a rare event.
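
For reference, hbase.rpc.timeout is expressed in milliseconds, so 3 minutes corresponds to 180000. The sketch below only builds the matching property fragment in the usual Hadoop-style XML layout; the helper names are made up for illustration, and the right value depends on how long the bulk load takes on a given cluster.

```python
import xml.etree.ElementTree as ET

def minutes_to_ms(minutes):
    """Convert minutes to the millisecond value HBase expects."""
    return int(minutes * 60 * 1000)

def property_element(name, value):
    """Build a <property> element in the Hadoop-style configuration layout."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = str(value)
    return prop

# 3 minutes, as described in the message above.
fragment = ET.tostring(
    property_element("hbase.rpc.timeout", minutes_to_ms(3)),
    encoding="unicode",
)
print(fragment)
```

The printed fragment would go inside the <configuration> element of hbase-site.xml.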

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by ShaoFeng Shi <>.
How about leaving "kylin.hbase.cluster.fs" empty? This property is for a
two-cluster deployment (one Hadoop cluster for cube build, the other for query).

When it is empty, the HFiles are written to the default FS (HDFS in EMR) and
then loaded into HBase. I'm not sure whether EMR HBase (using S3 as storage)
can bulk load files from HDFS or not. If it can, that would be great, as the
write performance of HDFS would be better than S3.
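
The fallback described above can be modelled in a few lines. This is a simplified illustration of the described behaviour, not Kylin's code; the function name and the example URIs are hypothetical.

```python
def resolve_hfile_fs(kylin_props, default_fs):
    """Pick the file system the HFiles are written to.

    An empty or missing kylin.hbase.cluster.fs falls back to the
    cluster's default file system.
    """
    cluster_fs = kylin_props.get("kylin.hbase.cluster.fs", "").strip()
    return cluster_fs or default_fs

# Hypothetical URIs for illustration.
DEFAULT_FS = "hdfs://namenode:8020"
print(resolve_hfile_fs({}, DEFAULT_FS))
print(resolve_hfile_fs({"kylin.hbase.cluster.fs": "s3://my-hbase-bucket"}, DEFAULT_FS))
```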

Best regards,

Shaofeng Shi 史少锋

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by Alexander Sterligov <>.
I also thought about it, but no, it's not consistency.

Consistent view is enabled. I use the same S3 for my own MapReduce jobs and
it's OK.

I also checked whether it lost consistency (emrfs diff). No problems.

With S3 inconsistency, files disappear right after they are
written and appear some time later. The HFiles didn't appear after a day, but
_temporary is there.

It's 100% reproducible. I think I'll investigate this problem by running the
conversion job manually.

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by ShaoFeng Shi <>.
Did you enable Consistent View? This article explains the challenges
of using S3 directly for ETL processes:

Best regards,

Shaofeng Shi 史少锋

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by Alexander Sterligov <>.
Yes, it's empty. Also I see this message in the log:

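2017-08-09 09:02:35,947 WARN  [Job 1e436685-7102-4621-a4cb-6472b866126d-7608] mapreduce.LoadIncrementalHFiles:234 : Skipping non-directory s3://joom.emr.fs/home/production/bi/kylin/kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/main_event_1_main/hfile/_SUCCESS
2017-08-09 09:02:36,009 WARN  [Job 1e436685-7102-4621-a4cb-6472b866126d-7608] mapreduce.LoadIncrementalHFiles:252 : Skipping non-file FileStatusExt{path=s3://joom.emr.fs/home/production/bi/kylin/kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/main_event_1_main/hfile/_temporary/1; isDirectory=true; modification_time=0; access_time=0; owner=; group=; permission=rwxrwxrwx; isSymlink=false}
2017-08-09 09:02:36,014 WARN  [Job 1e436685-7102-4621-a4cb-6472b866126d-7608] mapreduce.LoadIncrementalHFiles:422 : Bulk load operation did not find any files to load in directory s3://joom.emr.fs/home/production/bi/kylin/kylin_metadata/kylin-1e436685-7102-4621-a4cb-6472b866126d/main_event_1_main/hfile.  Does it contain files in subdirectories that correspond to column family names?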
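
The warnings suggest that the bulk load expects the layout <hfile dir>/<column family>/<hfile>: top-level entries that are not directories are skipped, and inside each directory anything that is not a plain file is skipped. The sketch below is a simplified model inferred from these log messages, not the HBase implementation; the listing format, the function names, the bucket path and the family name F1 are all made up for illustration.

```python
def children(listing, parent):
    """Direct children of parent in a {path: "file" | "dir"} listing."""
    prefix = parent.rstrip("/") + "/"
    return sorted(
        p for p in listing
        if p.startswith(prefix) and "/" not in p[len(prefix):]
    )

def discover_hfiles(listing, root):
    """Return (hfiles found, warnings) for a two-level family/hfile scan."""
    found, warnings = [], []
    for family_dir in children(listing, root):
        if listing[family_dir] != "dir":
            warnings.append(f"Skipping non-directory {family_dir}")
            continue
        for entry in children(listing, family_dir):
            if listing[entry] != "file":
                warnings.append(f"Skipping non-file {entry}")
                continue
            found.append(entry)
    return found, warnings

ROOT = "s3://bucket/job/hfile"  # hypothetical path

# Output left under _temporary, as reported in this thread.
uncommitted = {
    ROOT + "/_SUCCESS": "file",
    ROOT + "/_temporary": "dir",
    ROOT + "/_temporary/1": "dir",
    ROOT + "/_temporary/1/attempt_0": "dir",
    ROOT + "/_temporary/1/attempt_0/F1": "dir",
    ROOT + "/_temporary/1/attempt_0/F1/hfile_a": "file",
}

# Layout after the attempt output has been moved into place.
committed = {
    ROOT + "/_SUCCESS": "file",
    ROOT + "/F1": "dir",
    ROOT + "/F1/hfile_a": "file",
}

print(discover_hfiles(uncommitted, ROOT))
print(discover_hfiles(committed, ROOT))
```

In the uncommitted layout the scan treats _temporary as if it were a column family folder, finds only the directory "1" inside it, and ends up with nothing to load, which matches the three warnings above.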

Re: HFile is empty if kylin.hbase.cluster.fs is set to s3

Posted by ShaoFeng Shi <>.
The HFiles are moved to the HBase data folder when the bulk load finishes. Did
you check whether the HTable has data?

Best regards,

Shaofeng Shi 史少锋