Posted to dev@kylin.apache.org by ShaoFeng Shi <sh...@apache.org> on 2017/01/20 06:04:30 UTC

org.apache.kylin.dict.CachedTreeMap use a couple classes from org.apache.hadoop.fs

Hi Yerui,

I noticed that CachedTreeMap.java uses a couple of classes from the
org.apache.hadoop.fs package, and that you left the comment "TODO Depends on
HDFS for now, ideally just depends on storage interface".

This now affects cube building with Spark: some of these classes, such as
org.apache.hadoop.fs.Path, are not serializable, while Spark relies heavily
on Java serialization. Building a cube with a bitmap measure fails with the
error below. Could these fields be changed to ordinary serializable classes
such as String? Thanks!

Caused by: java.io.NotSerializableException: org.apache.hadoop.fs.Path
Serialization stack:
- object not serializable (class: org.apache.hadoop.fs.Path, value:
hdfs:/kylin/kylin_default_instance/resources/GlobalDict/dict/DEFAULT.TEST_KYLIN_FACT/TEST_COUNT_DISTINCT_BITMAP)
- writeObject data (class: java.util.TreeMap)
- object (class org.apache.kylin.dict.CachedTreeMap, {=null})
- field (class: org.apache.kylin.dict.AppendTrieDictionary, name:
dictSliceMap, type: class java.util.TreeMap)
- object (class org.apache.kylin.dict.AppendTrieDictionary,
AppendTrieDictionary(hdfs:///kylin/kylin_default_instance/resources/GlobalDict/dict/DEFAULT.TEST_KYLIN_FACT/TEST_COUNT_DISTINCT_BITMAP/))
- writeObject data (class: java.util.HashMap)
- object (class java.util.HashMap,
{DEFAULT.TEST_KYLIN_FACT.LSTG_SITE_ID=org.apache.kylin.dict.TrieDictionaryForest@f30773fa,
DEFAULT.TEST_CATEGORY_GROUPINGS.CATEG_LVL2_NAME=org.apache.kylin.dict.TrieDictionaryForest@18259639,
DEFAULT.TEST_CATEGORY_GROUPINGS.META_CATEG_NAME=org.apache.kylin.dict.TrieDictionaryForest@44184626,
BUYER_ACCOUNT:DEFAULT.TEST_ACCOUNT.ACCOUNT_SELLER_LEVEL=org.apache.kylin.dict.TrieDictionaryForest@879f6439,
SELLER_ACCOUNT:DEFAULT.TEST_ACCOUNT.ACCOUNT_SELLER_LEVEL=org.apache.kylin.dict.TrieDictionaryForest@879f6439,
BUYER_ACCOUNT:DEFAULT.TEST_ACCOUNT.ACCOUNT_BUYER_LEVEL=org.apache.kylin.dict.TrieDictionaryForest@879f6439,
SELLER_ACCOUNT:DEFAULT.TEST_ACCOUNT.ACCOUNT_BUYER_LEVEL=org.apache.kylin.dict.TrieDictionaryForest@879f6439,
DEFAULT.TEST_KYLIN_FACT.TRANS_ID=org.apache.kylin.dict.TrieDictionaryForest@93b5aa11,
DEFAULT.TEST_CATEGORY_GROUPINGS.CATEG_LVL3_NAME=org.apache.kylin.dict.TrieDictionaryForest@a494947b,
SELLER_COUNTRY:DEFAULT.TEST_COUNTRY.NAME=org.apache.kylin.dict.TrieDictionaryForest@b3559b4c,
BUYER_COUNTRY:DEFAULT.TEST_COUNTRY.NAME=org.apache.kylin.dict.TrieDictionaryForest@b3559b4c,
SELLER_ACCOUNT:DEFAULT.TEST_ACCOUNT.ACCOUNT_COUNTRY=org.apache.kylin.dict.TrieDictionaryForest@410216c0,
BUYER_ACCOUNT:DEFAULT.TEST_ACCOUNT.ACCOUNT_COUNTRY=org.apache.kylin.dict.TrieDictionaryForest@410216c0,
DEFAULT.TEST_KYLIN_FACT.PRICE=org.apache.kylin.dict.TrieDictionaryForest@89f144c6,
DEFAULT.TEST_KYLIN_FACT.TEST_COUNT_DISTINCT_BITMAP=AppendTrieDictionary(hdfs:///kylin/kylin_default_instance/resources/GlobalDict/dict/DEFAULT.TEST_KYLIN_FACT/TEST_COUNT_DISTINCT_BITMAP/),
DEFAULT.TEST_KYLIN_FACT.LEAF_CATEG_ID=org.apache.kylin.dict.TrieDictionaryForest@25e701d0,
DEFAULT.TEST_KYLIN_FACT.SLR_SEGMENT_CD=org.apache.kylin.dict.TrieDictionaryForest@dcfc7d11,
DEFAULT.TEST_KYLIN_FACT.CAL_DT=DateStrDictionary [pattern=yyyy-MM-dd,
baseId=0]})
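
The failure mode in the trace above can be reproduced without Hadoop or Spark. The following is only a minimal sketch: FakePath is a hypothetical stand-in for a non-serializable class such as org.apache.hadoop.fs.Path, and PathHolder and StringHolder are illustrative names, not Kylin classes. It shows that Java serialization rejects an object graph containing a non-serializable field, and that storing the location as a String avoids the problem.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class PathSerializationSketch {

    // Hypothetical stand-in for a class that does not implement Serializable.
    static final class FakePath {
        final String uri;
        FakePath(String uri) { this.uri = uri; }
    }

    // Keeps the location as the non-serializable type: writeObject will fail.
    static final class PathHolder implements Serializable {
        private static final long serialVersionUID = 1L;
        final FakePath baseDir;
        PathHolder(FakePath baseDir) { this.baseDir = baseDir; }
    }

    // Keeps the location as a String and builds the path object on demand.
    static final class StringHolder implements Serializable {
        private static final long serialVersionUID = 1L;
        final String baseDir;
        StringHolder(String baseDir) { this.baseDir = baseDir; }
        FakePath basePath() { return new FakePath(baseDir); }
    }

    // Returns false when Java serialization rejects the object graph.
    static boolean canSerialize(Object value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new PathHolder(new FakePath("hdfs:///tmp/dict"))));
        System.out.println(canSerialize(new StringHolder("hdfs:///tmp/dict")));
    }
}
```

Whether a String field is enough for CachedTreeMap depends on how the class uses its path fields, which this sketch does not model.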



-- 
Best regards,

Shaofeng Shi 史少锋

Re: org.apache.kylin.dict.CachedTreeMap use a couple classes from org.apache.hadoop.fs

Posted by ShaoFeng Shi <sh...@apache.org>.
Yerui,

You're right, serializing the dictionary isn't a good idea. I will try to
initialize these big objects inside the executors instead of transferring
them from the driver. I will get back to you if I run into problems. Thanks!
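
One way to picture this approach, as a sketch only: ship a small serializable reference that carries just the location, and let each JVM build the large object on first use. All names here (BigDictionary, DictionaryRef, the static cache) are hypothetical and are not Kylin or Spark APIs. The load step is a placeholder for whatever actually reads the dictionary from storage.

```java
import java.io.Serializable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class ExecutorSideDictionarySketch {

    // Hypothetical stand-in for a large object that should not be shipped.
    static final class BigDictionary {
        final String location;
        BigDictionary(String location) { this.location = location; }
    }

    // Counts how often the expensive load ran in this JVM (for illustration).
    static final AtomicInteger LOADS = new AtomicInteger();

    // Per-JVM cache: static state is never part of a serialized object.
    private static final Map<String, BigDictionary> CACHE = new ConcurrentHashMap<>();

    // The only thing that would travel from driver to executor: a location string.
    static final class DictionaryRef implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String location;

        DictionaryRef(String location) { this.location = location; }

        // Loads the dictionary at most once per location in the current JVM.
        BigDictionary get() {
            return CACHE.computeIfAbsent(location, loc -> {
                LOADS.incrementAndGet();
                return new BigDictionary(loc); // placeholder for a real load
            });
        }
    }

    public static void main(String[] args) {
        DictionaryRef a = new DictionaryRef("hdfs:///tmp/dict");
        DictionaryRef b = new DictionaryRef("hdfs:///tmp/dict");
        System.out.println(a.get() == b.get());
    }
}
```

With this shape, tasks running in the same JVM share one loaded instance, and the serialized payload stays a few bytes regardless of dictionary size.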

2017-01-25 15:56 GMT+08:00 Yerui Sun <su...@gmail.com>:

> Hi Shaofeng,
>    Sorry for my slow response.
>    There are indeed some serialization issues in the Spark context. Here
> are my thoughts:
>    * Some fields are initialized in the constructor, which means they do
> not need to be serialized; we can mark these fields 'transient'.
>    * Do we really need to serialize CachedTreeMap to the Spark executor
> tasks? Having each task initialize its own CachedTreeMap instance is
> another option.
>
>    Please feel free to change the code if you really need to serialize
> CachedTreeMap, and let me know if there is anything I can help with.


-- 
Best regards,

Shaofeng Shi 史少锋

Re: org.apache.kylin.dict.CachedTreeMap use a couple classes from org.apache.hadoop.fs

Posted by Yerui Sun <su...@gmail.com>.
Hi Shaofeng,
   Sorry for my slow response.
   There are indeed some serialization issues in the Spark context. Here are my thoughts:
   * Some fields are initialized in the constructor, which means they do not need to be serialized; we can mark these fields 'transient'.
   * Do we really need to serialize CachedTreeMap to the Spark executor tasks? Having each task initialize its own CachedTreeMap instance is another option.

   Please feel free to change the code if you really need to serialize CachedTreeMap, and let me know if there is anything I can help with.
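
A minimal sketch of the 'transient' option, with one caveat worth keeping in mind: Java deserialization does not run the constructor of a Serializable class, so a transient field set in the constructor comes back as null unless it is rebuilt, for example in readObject. FakePath and SliceIndex are hypothetical names used for illustration and do not reflect the actual CachedTreeMap fields.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class TransientFieldSketch {

    // Hypothetical stand-in for a non-serializable class such as a file system path.
    static final class FakePath {
        final String uri;
        FakePath(String uri) { this.uri = uri; }
    }

    static final class SliceIndex implements Serializable {
        private static final long serialVersionUID = 1L;

        private final String baseDir;          // serialized as plain text
        private transient FakePath basePath;   // skipped by serialization

        SliceIndex(String baseDir) {
            this.baseDir = baseDir;
            this.basePath = new FakePath(baseDir);
        }

        // Called by Java deserialization; rebuilds the transient field.
        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            this.basePath = new FakePath(baseDir);
        }

        String basePathUri() { return basePath.uri; }
    }

    // Serializes and deserializes a value in memory, as a transfer between JVMs would.
    static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SliceIndex copy = roundTrip(new SliceIndex("hdfs:///tmp/dict"));
        System.out.println(copy.basePathUri());
    }
}
```

Without the readObject method, basePathUri() on the deserialized copy would throw a NullPointerException, which is why transient alone is usually not sufficient.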
