You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by 开心延年 <mu...@qq.com> on 2016/02/22 15:03:36 UTC
回复: 回复: a new FileFormat 5x~100x faster than parquet
1.ya100 is not only the invert index ,but also include the TOP N sort lazy read,also include label .
2.on the attache file is our test on ya100 and parquet
3.you are right ,the load into ya100 is longer then parquet ,but the query time is very fast.
4. the code is already available ,but i`m sorry ya100 and ydb is a Commercial product now by our company (called ycloud),but i think if apache like it ,the source code is not a problem.
our invert index indeed is apache lucene-but a quite different useage, such as solr or es.
we can make a test on ya100(we support java jar),it is really very faster than parquet especially on sort 、group by 、filter and so on.
provide more information? what any thing other infomation do you want ,i can provide it tomorrow.
------------------ 原始邮件 ------------------
发件人: "Gavin Yue";<yu...@gmail.com>;
发送时间: 2016年2月22日(星期一) 晚上9:33
收件人: "开心延年"<mu...@qq.com>;
抄送: "Akhil Das"<ak...@sigmoidanalytics.com>; "user"<us...@spark.apache.org>; "dev"<de...@spark.apache.org>;
主题: Re: 回复: a new FileFormat 5x~100x faster than parquet
I recommend you provide more information. Using inverted index certainly speed up the query time if hitting the index, but it would take longer to create and insert.
Is the source code not available at this moment?
Thanks
Gavin
On Feb 22, 2016, at 20:27, 开心延年 <mu...@qq.com> wrote:
if apache enjoy this project , of course we provider the source code .
BUt if apache dislike the porject , we had continue to improve the project by myself .
ya100 and ydb max process data is 180billions rows data per day for neary realtime import .
because of index ,we make the search 10 secondes return in 1800billion (10days) rows data.
------------------ 原始邮件 ------------------
发件人: "Akhil Das";<ak...@sigmoidanalytics.com>;
发送时间: 2016年2月22日(星期一) 晚上8:42
收件人: "开心延年"<mu...@qq.com>;
抄送: "user"<us...@spark.apache.org>; "dev"<de...@spark.apache.org>;
主题: Re: a new FileFormat 5x~100x faster than parquet
Would be good to see the source code and the documentation in English.
ThanksBest Regards
On Mon, Feb 22, 2016 at 4:44 PM, 开心延年 <mu...@qq.com> wrote:
Ya100 is a FileFormat 5x~100x faster than parquet。
we can get ya100 from this link https://github.com/ycloudnet/ya100/tree/master/v1.0.8
<A2...@C4EFD745.B6FECA56>
1.we used the inverted index,so we skip the rows that we does need.
for example the trade log search SQL
select
(1)phonenum,usernick,ydb_grade,ydb_age,ydb_blood,ydb_zhiye,ydb_earn,ydb_day,amtlong
from spark_txt where
(2)tradeid=' 2014012213870282671'
limit 10;
this sql is compose by two part
(1)the part 1 is return the result which has 9 columns
(2) the part 2 is the filter condition ,filter by tradeid
let guess which plan is faster
plan A :first read all the 9 columns result then filter by tradeid
plan B: first filter by tradeid ,then we read the match 9 columns result.
Ya100 choose plan B
contrast performance Ya100`index with parquet
<60...@C4EFD745.B6FECA56>
2.TOP N sort ,the non sort column we doesn`t read it until the last
for example we sort by the logtime
select
(1)phonenum,usernick,ydb_grade,ydb_age,ydb_blood,ydb_zhiye,ydb_earn,ydb_day,amtlong
from spark_txt
(2)order by logtime desc
limit 10;
this sql is compose by two part
(1)the part 1 is return the result which has 9 columns
(2) the part 2 is the column need to sort
let guess which plan is faster
plan A :first read all the 9 columns result then sort by logtime
plan B: first sort by logtime ,then we read the match 9 columns result.
Ya100 choose plan B
contrast performance Ya100`lazy read with parquet
<88...@C4EFD745.B6FECA56>
3.we used label instead of the original value for grouping and sorting
<C2...@C4EFD745.B6FECA56>
1).General situation,the data has a lot of repeat value,for exampe the sex file ,the age field .
2).if we store the original value ,that will weast a lot of storage.
so we make a small modify at original value, Additional add a new filed called label.
make a unique value sort by fields, and then gave each term a unique Number from begin to end .
3).we use number value(we called label) instead of original value.lable is store by fixed length. the file could be read by random read.
4).the label`s order is the same with dictionary order .so if we do some calculation like order by or group by only need to read the label. we don`t need to read the original value.
5).some field like sex field ,only have 2 different values.so we only use 2 bits(not 2 bytes) to store the label, it will save a lot of Disk io.
when we finish all of the calculation, we translate label to original value by a dictionary.
6)if a lots of rows have the same original value ,the original value we only store once,only read once.
Solve the problem:
1)ya100`s data is quite big we don`t have enough memory to load all Values to memory.
2)on realtime mode ,data is change Frequent , The cache is invalidated Frequent by append or update. build Cache will take a lot of times and io;
3)the Original value is a string type. whene sorting or grouping ,thed string value need a lot of memory and need lot of cpu time to calculate hashcode \compare \equals ,But label is number is fast.
4)the label is number ,it`s type maybe short ,or maybe a byte ,or may be integer whitch depending on the max number of the label.
two-phase search
Original:
1)group by order by use original value,the real value may be is a string type,may be more larger ,the real value maybe need a lot of io
2)compare by string is slowly then compare by integer
Our improve:
1)we split one search into multy-phase search
2)the first search we only search the field that use for order by ,group by
3)the first search we doesn`t need to read the original value(the real value),we only need to read the docid and label for order by group by.
4)when we finish all the order by and group by ,may be we only need to return Top n records .so we start next to search to get the Top n records original value.
Solve the problem:
5)reduce io ,read original take a lot of disk io
6)reduce network io (for merger)
7)most of the field has repeated value, the repeated only need to read once
the group by filed only need to read the origin once by label where display to user.
8)most of the search only need to display on Top n (n<=100) results, so most of the original value could be skip.
<D3...@C4EFD745.B6FECA56>
<66...@C4EFD745.B6FECA56>
<F2...@C4EFD745.B6FECA56>
<2C...@C4EFD745.B6FECA56>
How to install ya100
1)Add the depend jar
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/data/ycloud/spark_ydb/ya100.jar
You can get the ya100 jar from this link
https://github.com/ycloudnet/ya100/blob/master/v1.0.8/ya100-1.0.8.jar
2)Start thriftserver
./start-thriftserver.sh --master yarn-client --executor-memory 990m --executor-cores 2 --num-executors 16
Our suggest spark version is 6
3)Config the ya100 function
create function Yfilter as 'cn.net.ycloud.ydb.handle.fun.Yfilter';
create function Ytop10000 as 'cn.net.ycloud.ydb.handle.fun.Ytop10000';
create function Ycombine as 'cn.net.ycloud.ydb.handle.fun.Ycombine';
create function Ycount as 'cn.net.ycloud.ydb.handle.fun.Ycount';
create function Ymax as 'cn.net.ycloud.ydb.handle.fun.Ymax';
create function Ymin as 'cn.net.ycloud.ydb.handle.fun.Ymin';
create function Yavg as 'cn.net.ycloud.ydb.handle.fun.Yavg';
create function Ysum as 'cn.net.ycloud.ydb.handle.fun.Ysum';
create function Ymaxstring as 'cn.net.ycloud.ydb.handle.fun.Ymaxstring';
create function Yminstring as 'cn.net.ycloud.ydb.handle.fun.Yminstring‘;
YA100 support data type
<30...@C4EFD745.B6FECA56>
How to create ya100 table on spark Sql
CREATE external table spark_ya100(
phonenum bigint, usernick string, ydb_sex string, ydb_province string, ydb_grade string, ydb_age string, ydb_blood string, ydb_zhiye string, ydb_earn string, ydb_prefer string, ydb_consume string, ydb_day string, amtdouble double,amtlong int, content string, ydbpartion string, ya100_pipe string
) partitioned by (dt string)
STORED BY 'cn.net.ycloud.ydb.handle.Ya100StorageHandler'
LOCATION '/data/ydb/shu_ya100'
TBLPROPERTIES(
"ya100.handler.table.name"="ydb_example_shu",
"ya100.handler.schema"="phonenum long,usernick string,ydb_sex string,ydb_province string,ydb_grade string,ydb_age string,ydb_blood string,ydb_zhiye string,ydb_earn string,ydb_prefer string,ydb_consume string,ydb_day string, amtdouble double,amtlong int,content textcjk,ydbpartion string,ya100_pipe string"
)
//注,ydbpartion为ydb的分区字段与ya100_pipe 为ya100的管道字段,必须要都要创建,否则有些功能会被限制使用。
ya100.handler.table.name为预留的key,不同的表之间要区分开来
ya100.handler.schema 为内置的索引的数据类型,名字要与spark的字段名称一致。
How to Import data into ya100 table
1)Set up import params
set hive.mapred.supports.subdirectories=true;
set mapred.min.split.size=2147483648;
注:
ya100因为生成的最终不是一个文件,而是一个目录,这个目录下会有索引,所以要通过subdirectories参数设置spark支持子目录,否则向表里导入数据会报错失败。
set mapred.min.split.size=2147483648 是为了控制map的个数,防止生成的索引文件数量太多。
2)Begin import
insert into table spark_ya100 partition (dt='1200million')
select phonenum,usernick,ydb_sex,ydb_province,ydb_grade,ydb_age,ydb_blood,ydb_zhiye,ydb_earn,ydb_prefer,ydb_consume,ydb_day,amtdouble,amtlong,content,2,1
from spark_txt where dt='1200million';
How to use ya100 to filter data
1),basic example 1
set ya100.spark.filter.ydb_example_shu=ydb_sex='女' or ydb_province='辽宁' or ydb_day>='20151217';
有些时候,会遇到转义的问题,这里的value值,可以进行urlencode处理。
如:
set ya100.spark.filter.ydb_example_shu=ydbpartion%3D%2720151110%27+and+%28ydb_sex%3D%27%E5%A5%B3%27+or+ydb_province%3D%27%E8%BE%BD%E5%AE%81%27+or+ydb_day%3E%3D%2720151217%27%29;
2)Filter example 2
set ya100.spark.filter.ydb_example_shu=ydb_sex='女' or ydb_province='辽宁' or ydb_day>='20151217';
select * from spark_ya100 where dt=‘1200million’ limit 10;
Other filter example use case
1)equal
qq=‘165162897’
2)Support in
如:indexnum in (1,2,3)
3)>,<,>=,<=,
clickcount >=10 and clickcount <=11
4) range
indexnum like '({0 TO 11}) ' 不包含边界值
indexnum like '([10 TO 11] ) ' 包含边界值
5) Unequal
label<>'l_14' and label<>'l_15'
6)
indexnum='1' or indexnum='2' or (clickcount >=5 and clickcount <=11)
<EE...@C4EFD745.B6FECA56>
How to use ya100 to make a top N sort
set ya100.spark.top10000.ydb_example_shu=ydb_age desc limit 10;
set ya100.spark.top10000.ydb_example_shu=ydb_sex desc,ydb_province limit 100;
set ya100.spark.top10000.ydb_example_shu=* limit 10;
注意,当前版本limit的最大值为10000 (每个索引最大返回1000)
列的名字 写*表示 不需要排序,但只返回前N条就可以了
SQL example
set ya100.spark.top10000.ydb_example_shu=phonenum desc limit 10;
select * from spark_ya100 where dt='1200million' order by phonenum desc limit 10;
How to make group by or stat by ya100
Example 1
set ya100.spark.combine.ydb_example_shu=*,ydb_age;
select ydb_province,Ycount('*',ya100_pipe), Ycount('ydb_age',ya100_pipe), Ymaxstring('ydb_age',ya100_pipe), Yminstring('ydb_age',ya100_pipe)
from spark_ya100 group by ydb_province limit 10
Example 2
set ya100.spark.combine.ydb_example_shu=amtlong;
select ydb_sex, ydb_province,Ysum('amtlong',ya100_pipe) as cnt from spark_ya100 group by ydb_sex, ydb_province limit 10
Other SQL EXAMPLE
•EXAMPLE COUNT(*)
set ya100.spark.filter.ydb_example_shu=phonenum='13870282671' and usernick='江峻熙';
set ya100.spark.combine.ydb_example_shu=*;
set ya100.spark.top10000.ydb_example_shu=;
select Ycount('*',ya100_pipe) from spark_ya100 where dt='100million' limit 10;
•TOP N SORT EXAMPLE
•set ya100.spark.filter.ydb_example_shu= amtlong like '([1090 TO 1100] )' and amtdouble like '([1090 TO 1100] )';
set ya100.spark.combine.ydb_example_shu=;
set ya100.spark.top10000.ydb_example_shu=phonenum desc limit 10;
select * from spark_ya100 where dt='100million' order by phonenum desc limit 10;
SUM(amtlong)
set ya100.spark.filter.ydb_example_shu= amtlong like '([1090 TO 1100] )' and amtdouble like '([1090 TO 1100] )';
set ya100.spark.combine.ydb_example_shu=amtlong;
set ya100.spark.top10000.ydb_example_shu=;
select Ycount(‘amtlong’,ya100_pipe) ,Ysum(‘amtlong’,ya100_pipe) ,Yavg(‘amtlong’,ya100_pipe), Ymax(‘amtlong’,ya100_pipe), Ymin(‘amtlong’,ya100_pipe) from spark_ya100 where dt=‘100million’ limit 10;
•Top n sort
set ya100.spark.filter.ydb_example_shu= amtlong like '([1090 TO 1100] )' and amtdouble like '([1090 TO 1100] )';
set ya100.spark.combine.ydb_example_shu=;
set ya100.spark.top10000.ydb_example_shu= ydb_age desc, ydb_sex;
select ydb_sex, ydb_age from spark_ya100 where dt='100million' order by ydb_age desc, ydb_sex limit 10;
•Group by
set ya100.spark.filter.ydb_example_shu=;
set ya100.spark.combine.ydb_example_shu=ydb_age,ydb_blood,ydb_zhiye,ydb_earn,ydb_prefer,ydb_consume,ydb_day,amtlong;
set ya100.spark.top10000.ydb_example_shu=;
select ydb_sex,ydb_province,ydb_grade, Ymaxstring(‘ydb_age’,ya100_pipe),Ymaxstring(‘ydb_blood’,ya100_pipe),Ymaxstring(‘ydb_zhiye’,ya100_pipe),Ymaxstring(‘ydb_earn’,ya100_pipe),Ymaxstring(‘ydb_prefer’,ya100_pipe),Ymaxstring(‘ydb_consume’,ya100_pipe),Ymaxstring(‘ydb_day’,ya100_pipe),Ysum(‘amtlong’,ya100_pipe) as cnt from spark_ya100 where dt=‘100million’ group by ydb_sex,ydb_province,ydb_grade order by cnt desc limit 10
Connect Ya100 to ydb make real time data
CREATE external table spark_ydb(
phonenum bigint, usernick string, ydb_sex string, ydb_province string, ydb_grade string, ydb_age string, ydb_blood string, ydb_zhiye string, ydb_earn string, ydb_prefer string, ydb_consume string, ydb_day string, amtdouble double,amtlong int, content string, ydbpartion string, ya100_pipe string
)
STORED BY 'cn.net.ycloud.ydb.handle.Ya100StorageHandler'
LOCATION '/data/ydb/shu_ydb'
TBLPROPERTIES(
"ya100.handler.table.name"="ydb_example_shu",
"ya100.handler.master"="101.200.130.48:8080",
"ya100.handler.columns.mapping"="phonenum,usernick,ydb_sex,ydb_province,ydb_grade,ydb_age,ydb_blood,ydb_zhiye,ydb_earn,ydb_prefer,ydb_consume,ydb_day,amtdouble,amtlong,content,ydbpartion,ya100_pipe")
注
ydbpartion为ydb的分区字段,查询的时候必须执行ydbpartion,而spark表则没有分区字段
ya100_pipe 为ya100的管道字段,必须要都要创建,否则有些功能会被限制使用。
ya100.handler.table.name为ydb系统里的表名字
ya100.handler.columns.mapping 为ydb系统内的表与spark的表之间的映射。
<5C...@C4EFD745.B6FECA56>
<CD...@C4EFD745.B6FECA56>
<10...@C4EFD745.B6FECA56>