You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by 开心延年 <mu...@qq.com> on 2016/02/22 15:03:36 UTC
回复：回复： a new FileFormat 5x~100x faster than parquet

1.ya100 is not only the invert index  ,but also include the TOP N sort lazy read,also include label .
2.on the attache file is our test on ya100 and parquet
3.you are right ,the load into ya100 is longer then parquet ,but the query time is very fast.
4. the code is already available ，but i`m sorry ya100 and ydb is a          Commercial product now by our company (called ycloud),but i think if apache like it ,the source code is not a problem.

our invert index indeed is apache lucene-but a quite different useage, such as solr or es.
we can make a test on ya100(we support java jar),it is really very faster than  parquet          especially                     on sort 、group by 、filter and so on.


provide more information? what any thing other infomation do you want ,i can provide it  tomorrow.


------------------ 原始邮件 ------------------
发件人: "Gavin Yue";<yu...@gmail.com>;
发送时间: 2016年2月22日(星期一) 晚上9:33
收件人: "开心延年"<mu...@qq.com>; 
抄送: "Akhil Das"<ak...@sigmoidanalytics.com>; "user"<us...@spark.apache.org>; "dev"<de...@spark.apache.org>; 
主题: Re: 回复： a new FileFormat 5x~100x faster than parquet




I recommend you provide more information. Using inverted index certainly speed up the query time if hitting the index, but it would take longer to create and insert.  


Is the source code not available at this moment? 


Thanks 
Gavin 

On Feb 22, 2016, at 20:27, 开心延年 <mu...@qq.com> wrote:


if apache enjoy this project , of course we provider the source code .

BUt if apache dislike the porject , we had continue to improve the project by myself .

ya100 and ydb  max process data is 180billions rows data per day for neary realtime import .

because of index ,we make the search 10 secondes return in 1800billion (10days) rows data.




------------------ 原始邮件 ------------------
发件人: "Akhil Das";<ak...@sigmoidanalytics.com>;
发送时间: 2016年2月22日(星期一) 晚上8:42
收件人: "开心延年"<mu...@qq.com>; 
抄送: "user"<us...@spark.apache.org>; "dev"<de...@spark.apache.org>; 
主题: Re: a new FileFormat 5x~100x faster than parquet



Would be good to see the source code and the documentation in English.


ThanksBest Regards



 
On Mon, Feb 22, 2016 at 4:44 PM, 开心延年 <mu...@qq.com> wrote:
Ya100 is a FileFormat 5x~100x  faster than parquet。
we can get ya100 from this link https://github.com/ycloudnet/ya100/tree/master/v1.0.8


<A2...@C4EFD745.B6FECA56>

 
1.we used the inverted index，so we skip the rows that we does need.

  for example  the trade log search SQL

     

        select 
 
（1）phonenum,usernick,ydb_grade,ydb_age,ydb_blood,ydb_zhiye,ydb_earn,ydb_day,amtlong  
 
        from spark_txt where   
 
（2）tradeid=' 2014012213870282671'
 
        limit 10;  
 





     this sql is compose by two part

     (1)the part 1 is return the result which has 9 columns

     (2) the part 2 is the filter condition ,filter by tradeid

      


      let guess which plan is faster

     plan A :first read all the 9 columns result then filter by tradeid

     plan B: first filter by tradeid ,then we read the match 9 columns result.




    Ya100 choose plan B




     contrast  performance Ya100`index with parquet



<60...@C4EFD745.B6FECA56>




 2.TOP N sort ,the non sort column we doesn`t read it until the last 


  for example  we sort by the logtime

 

    select 
 
（1）phonenum,usernick,ydb_grade,ydb_age,ydb_blood,ydb_zhiye,ydb_earn,ydb_day,amtlong  
 
    from spark_txt 
 
（2）order by logtime desc 
 
    limit 10;  
 


  this sql is compose by two part
     (1)the part 1 is return the result which has 9 columns

     (2) the part 2 is the column need to sort

      


      let guess which plan is faster
     plan A :first read all the 9 columns result then sort by logtime
     plan B: first sort by logtime  ,then we read the match 9 columns result.




       Ya100 choose plan B




     contrast  performance Ya100`lazy read with parquet
<88...@C4EFD745.B6FECA56>

3.we used label instead of the original value for grouping and sorting

<C2...@C4EFD745.B6FECA56>

1).General situation,the data has a lot of repeat value,for exampe the sex file ,the age field .
2).if we store the original value ,that will weast a lot of storage.
so we make a small modify at original  value, Additional add a new filed called label.
make a unique value sort by fields, and then gave each term a unique  Number from begin to end  .
3).we use number value(we called label) instead of original  value.lable is store by fixed length. the file could be read by random read.
4).the label`s order is the same with dictionary  order .so if we do some calculation like order by or group by only need to  read the label. we don`t need to read the original value.
5).some field like sex field ,only have 2 different values.so we only use 2 bits(not 2 bytes) to store the label, it will save a lot of Disk io.
 when we finish all of the calculation, we translate label to original  value by a dictionary.
6)if a lots of rows have the same original value ,the original value we only store once,only read once.
Solve the problem:
1)ya100`s data is quite big we don`t have enough memory to load all Values to memory.
2)on realtime mode ,data is change Frequent , The cache is invalidated Frequent by append or update. build Cache will take a lot of times and io;
3)the Original value  is a string type.  whene sorting or grouping ,thed string value need a lot of memory and need lot of cpu time to calculate hashcode \compare \equals ,But label is number  is fast.
4)the label is number ,it`s type maybe short ,or maybe a byte ,or may be integer whitch depending on the max number of the label.

two-phase search
Original:
1)group by order by use original value,the real value may be is a string type,may be more larger ,the real value maybe  need a lot of io 
2)compare by string is slowly then compare by integer
Our improve:
1)we split one search into multy-phase search
2)the first search we only search the field that use for order by ,group by
3)the first search we doesn`t need to read the original value(the real value),we only need to read the docid and label for order by group by.
4)when we finish all the order by and group by ,may be we only need to return Top n records .so we start next to search to get the Top n records original value.
Solve the problem:
5)reduce io ,read original take a lot of disk io
6)reduce network io (for merger)
7)most of the field has repeated value, the repeated only need to read once
the group by filed only need to read the origin once by label where display to user.
8)most of the search only need to display on Top n (n<=100) results, so most of the original value could be skip.
<D3...@C4EFD745.B6FECA56>


<66...@C4EFD745.B6FECA56>


<F2...@C4EFD745.B6FECA56>

<2C...@C4EFD745.B6FECA56>


 
How to install ya100 
 
 
1)Add the depend jar 
 
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/data/ycloud/spark_ydb/ya100.jar
 
You can get the ya100 jar from this link



 
https://github.com/ycloudnet/ya100/blob/master/v1.0.8/ya100-1.0.8.jar



 
 
 
2)Start thriftserver 
 
./start-thriftserver.sh     --master yarn-client  --executor-memory 990m --executor-cores 2 --num-executors 16
 
 
 
Our suggest spark version is 6 
 
3)Config the ya100 function 
 
create  function Yfilter as 'cn.net.ycloud.ydb.handle.fun.Yfilter';   
 
create  function Ytop10000 as 'cn.net.ycloud.ydb.handle.fun.Ytop10000';  
 
create  function Ycombine as 'cn.net.ycloud.ydb.handle.fun.Ycombine'; 
 
create  function Ycount as 'cn.net.ycloud.ydb.handle.fun.Ycount';  
 
create  function Ymax as 'cn.net.ycloud.ydb.handle.fun.Ymax'; 
 
create  function Ymin as 'cn.net.ycloud.ydb.handle.fun.Ymin';   
 
create  function Yavg as 'cn.net.ycloud.ydb.handle.fun.Yavg';    
 
create  function Ysum as 'cn.net.ycloud.ydb.handle.fun.Ysum'; 
 
create  function Ymaxstring as 'cn.net.ycloud.ydb.handle.fun.Ymaxstring'; 
 
create  function Yminstring as 'cn.net.ycloud.ydb.handle.fun.Yminstring‘; 
 
 
YA100 support data type 
 <30...@C4EFD745.B6FECA56>

 
How to create ya100 table on spark Sql 
 
CREATE external  table spark_ya100(  
 
 phonenum bigint, usernick string, ydb_sex string,  ydb_province string, ydb_grade string, ydb_age string, ydb_blood string, ydb_zhiye string, ydb_earn string, ydb_prefer string, ydb_consume string, ydb_day string, amtdouble double,amtlong int, content string, ydbpartion string, ya100_pipe string 
 
) partitioned by (dt string) 
 
STORED BY 'cn.net.ycloud.ydb.handle.Ya100StorageHandler' 
 
LOCATION  '/data/ydb/shu_ya100' 
 
TBLPROPERTIES( 
 
"ya100.handler.table.name"="ydb_example_shu", 
 
"ya100.handler.schema"="phonenum long,usernick string,ydb_sex string,ydb_province string,ydb_grade string,ydb_age string,ydb_blood string,ydb_zhiye string,ydb_earn string,ydb_prefer string,ydb_consume string,ydb_day string, amtdouble double,amtlong int,content textcjk,ydbpartion string,ya100_pipe string" 
 
)  
 
//注，ydbpartion为ydb的分区字段与ya100_pipe 为ya100的管道字段，必须要都要创建，否则有些功能会被限制使用。
 
ya100.handler.table.name为预留的key,不同的表之间要区分开来
 
ya100.handler.schema 为内置的索引的数据类型，名字要与spark的字段名称一致。 
 

 
How to Import data into ya100 table


 

1)Set up import params 
 
set hive.mapred.supports.subdirectories=true; 
 
set mapred.min.split.size=2147483648;
 
注： 
 
ya100因为生成的最终不是一个文件，而是一个目录，这个目录下会有索引，所以要通过subdirectories参数设置spark支持子目录，否则向表里导入数据会报错失败。 
 
set mapred.min.split.size=2147483648 是为了控制map的个数，防止生成的索引文件数量太多。 
 
 
 
2)Begin import  
 
insert into table spark_ya100 partition (dt='1200million')
 
 select phonenum,usernick,ydb_sex,ydb_province,ydb_grade,ydb_age,ydb_blood,ydb_zhiye,ydb_earn,ydb_prefer,ydb_consume,ydb_day,amtdouble,amtlong,content,2,1 
 
from spark_txt where dt='1200million'; 
 
 
How to use ya100 to filter data 
 
 
1),basic example 1 
 
set ya100.spark.filter.ydb_example_shu=ydb_sex='女' or ydb_province='辽宁' or ydb_day>='20151217'; 
 
有些时候，会遇到转义的问题，这里的value值，可以进行urlencode处理。
 
如：
 
set ya100.spark.filter.ydb_example_shu=ydbpartion%3D%2720151110%27+and+%28ydb_sex%3D%27%E5%A5%B3%27+or+ydb_province%3D%27%E8%BE%BD%E5%AE%81%27+or+ydb_day%3E%3D%2720151217%27%29;
 
 
 
2)Filter example 2 
 
set ya100.spark.filter.ydb_example_shu=ydb_sex='女' or ydb_province='辽宁' or ydb_day>='20151217'; 
 
select * from spark_ya100 where dt=‘1200million’ limit 10; 
 
 
 
 
 
 
Other filter example use case  
 
1)equal 
 
qq=‘165162897’
 
2)Support in 
 
如：indexnum in (1,2,3) 
 
3)>,<,>=,<=, 
 
clickcount >=10 and clickcount <=11 
 
4) range 
 
indexnum like '({0 TO 11}) '      不包含边界值
 
indexnum like '([10 TO 11] ) '    包含边界值
 
5) Unequal 
 
label<>'l_14' and label<>'l_15'  
 
6) 
 
indexnum='1' or indexnum='2' or (clickcount >=5 and clickcount <=11) 
 
 
 

<EE...@C4EFD745.B6FECA56>

 
How to use ya100 to make a top N sort 
 
set ya100.spark.top10000.ydb_example_shu=ydb_age desc limit 10; 
 
set ya100.spark.top10000.ydb_example_shu=ydb_sex desc,ydb_province limit 100； 
 
set ya100.spark.top10000.ydb_example_shu=* limit 10; 
 
 
 
注意，当前版本limit的最大值为10000 （每个索引最大返回1000） 
 
列的名字 写*表示 不需要排序，但只返回前N条就可以了 
 
 
 
SQL example  
 
 
 
set ya100.spark.top10000.ydb_example_shu=phonenum desc limit 10; 
 
select * from spark_ya100 where dt='1200million' order by phonenum desc limit 10; 
 
 
 

 
How to make group by or stat by ya100 
 
Example 1 
 
set ya100.spark.combine.ydb_example_shu=*,ydb_age; 
 
select ydb_province,Ycount('*',ya100_pipe), Ycount('ydb_age',ya100_pipe), Ymaxstring('ydb_age',ya100_pipe), Yminstring('ydb_age',ya100_pipe) 
 
 from spark_ya100  group by ydb_province limit 10 
 
  
 
Example 2 
 
set ya100.spark.combine.ydb_example_shu=amtlong; 
 
select ydb_sex, ydb_province,Ysum('amtlong',ya100_pipe) as cnt from spark_ya100  group by ydb_sex, ydb_province  limit 10 
 
 
 

 
Other SQL EXAMPLE


 
•EXAMPLE COUNT(*) 
 
set ya100.spark.filter.ydb_example_shu=phonenum='13870282671'  and usernick='江峻熙'; 
 
set ya100.spark.combine.ydb_example_shu=*; 
 
set ya100.spark.top10000.ydb_example_shu=; 
 
select Ycount('*',ya100_pipe) from spark_ya100 where dt='100million' limit 10; 
 •TOP N SORT EXAMPLE
 •set ya100.spark.filter.ydb_example_shu= amtlong like '([1090 TO 1100] )' and amtdouble like '([1090 TO 1100] )'; 
 
set ya100.spark.combine.ydb_example_shu=; 
 
set ya100.spark.top10000.ydb_example_shu=phonenum desc limit 10; 
 
select * from spark_ya100 where dt='100million' order by phonenum desc limit 10;
 
SUM(amtlong)
 
set ya100.spark.filter.ydb_example_shu= amtlong like '([1090 TO 1100] )' and amtdouble like '([1090 TO 1100] )'; 
 
set ya100.spark.combine.ydb_example_shu=amtlong; 
 
set ya100.spark.top10000.ydb_example_shu=; 
 
select Ycount(‘amtlong’,ya100_pipe) ,Ysum(‘amtlong’,ya100_pipe) ,Yavg(‘amtlong’,ya100_pipe), Ymax(‘amtlong’,ya100_pipe), Ymin(‘amtlong’,ya100_pipe)  from spark_ya100 where dt=‘100million’  limit 10; 
 
 •Top n sort 
 
set ya100.spark.filter.ydb_example_shu= amtlong like '([1090 TO 1100] )' and amtdouble like '([1090 TO 1100] )'; 
 
set ya100.spark.combine.ydb_example_shu=; 
 
set ya100.spark.top10000.ydb_example_shu= ydb_age desc, ydb_sex; 
 
select ydb_sex, ydb_age from spark_ya100 where dt='100million'  order  by  ydb_age desc, ydb_sex  limit 10;  
 •Group by 
 
set ya100.spark.filter.ydb_example_shu=; 
 
set ya100.spark.combine.ydb_example_shu=ydb_age,ydb_blood,ydb_zhiye,ydb_earn,ydb_prefer,ydb_consume,ydb_day,amtlong; 
 
set ya100.spark.top10000.ydb_example_shu=; 
 
select ydb_sex,ydb_province,ydb_grade, Ymaxstring(‘ydb_age’,ya100_pipe),Ymaxstring(‘ydb_blood’,ya100_pipe),Ymaxstring(‘ydb_zhiye’,ya100_pipe),Ymaxstring(‘ydb_earn’,ya100_pipe),Ymaxstring(‘ydb_prefer’,ya100_pipe),Ymaxstring(‘ydb_consume’,ya100_pipe),Ymaxstring(‘ydb_day’,ya100_pipe),Ysum(‘amtlong’,ya100_pipe) as cnt from spark_ya100 where dt=‘100million’  group by ydb_sex,ydb_province,ydb_grade order by cnt desc limit 10 
 

 
Connect Ya100 to ydb  make real time data 
CREATE external  table spark_ydb(
 phonenum bigint, usernick string, ydb_sex string,  ydb_province string, ydb_grade string, ydb_age string, ydb_blood string, ydb_zhiye string, ydb_earn string, ydb_prefer string, ydb_consume string, ydb_day string, amtdouble double,amtlong int, content string, ydbpartion string, ya100_pipe string
)
STORED BY 'cn.net.ycloud.ydb.handle.Ya100StorageHandler'
LOCATION  '/data/ydb/shu_ydb'
TBLPROPERTIES(
"ya100.handler.table.name"="ydb_example_shu",
"ya100.handler.master"="101.200.130.48:8080",
"ya100.handler.columns.mapping"="phonenum,usernick,ydb_sex,ydb_province,ydb_grade,ydb_age,ydb_blood,ydb_zhiye,ydb_earn,ydb_prefer,ydb_consume,ydb_day,amtdouble,amtlong,content,ydbpartion,ya100_pipe")

注

ydbpartion为ydb的分区字段，查询的时候必须执行ydbpartion,而spark表则没有分区字段 
 
ya100_pipe 为ya100的管道字段，必须要都要创建，否则有些功能会被限制使用。
 
ya100.handler.table.name为ydb系统里的表名字
 
ya100.handler.columns.mapping 为ydb系统内的表与spark的表之间的映射。 
 
 
 

<5C...@C4EFD745.B6FECA56>
<CD...@C4EFD745.B6FECA56>

 
 <10...@C4EFD745.B6FECA56>
回复： 回复： a new FileFormat 5x~100x faster than parquet

回复：回复： a new FileFormat 5x~100x faster than parquet