Posted to general@lucene.apache.org by "yannianmu (母延年)" <ya...@tencent.com> on 2015/01/29 12:59:09 UTC

Our Optimize Suggestions on lucene 3.5

Dear Lucene dev

    We are from the Hermes team. Hermes is a project based on Lucene 3.5 and Solr 3.5.

Hermes ingests 100 billion documents per day, about 2000 billion documents in total (two months of data). Our single cluster index is now over 200 TB, and the total size is 600 TB. We use Lucene to speed up a big data warehouse and reduce analysis response time, for example filters such as age=32 and keywords like 'lucene', or aggregations such as count, sum, order by and group by.



    Hermes can filter data out of 1000 billion documents in about 1 second. An order by over 10 billion documents takes about 10 s, a group by over 10 billion documents takes about 15 s, and sum/avg/max/min statistics over 10 billion documents take about 30 s.

For these purposes we made many improvements on top of Lucene and Solr. Lucene has changed a lot since version 4.10, and so has the code base, so we do not plan to commit our code back to Lucene. We only want to introduce our improvements based on Lucene 3.5 and explain how Hermes can process 100 billion documents per day on 32 physical machines. We think it may be helpful for people who have similar requirements.


First level index (tii): loading on demand

Original:

1. The .tii file is loaded into RAM by TermInfosReaderIndex.

2. That makes the first open of an index quite slow.

3. The index has to be kept open persistently: once it is opened, it is never closed.

4. This limits the number of indexes we can have; with thousands of indexes it becomes impossible.

Our improvements:

1. Load on demand: not all fields need to be loaded into memory.

2. We changed getIndexOffset (a binary search) to work on disk instead of in memory, and we use an LRU cache to speed it up.

3. Doing getIndexOffset on disk saves a lot of memory and reduces the time needed to open an index.

4. Hermes often opens different indexes for different businesses; when an index has not been used for a while, we close it (managed by LRU; see the sketch at the end of this section).

5. This way one physical machine can store more than 100000 indexes.

Problems solved:

1. Hermes needs to store more than 1000 billion documents; we do not have enough memory to hold all the tii files.

2. We have more than 100000 indexes; if all of them were kept open, that would use up a huge number of file descriptors, which the file system will not allow.
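
As an illustration only (not the actual Hermes code), below is a minimal sketch of the LRU policy we rely on, both for caching getIndexOffset lookups and for closing indexes that have not been used recently. It is plain Java on top of java.util.LinkedHashMap; the eviction listener is a hypothetical hook where the caller would close the evicted reader.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Simple LRU cache: the least recently used entry is evicted once capacity
    // is exceeded. An eviction hook lets the caller close the evicted resource
    // (for example an index reader that is no longer hot).
    public class LruCache<K, V> extends LinkedHashMap<K, V> {
      public interface EvictionListener<K, V> { void onEvict(K key, V value); }

      private final int capacity;
      private final EvictionListener<K, V> listener;

      public LruCache(int capacity, EvictionListener<K, V> listener) {
        super(16, 0.75f, true);            // accessOrder = true gives LRU behaviour
        this.capacity = capacity;
        this.listener = listener;
      }

      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        if (size() > capacity) {
          listener.onEvict(eldest.getKey(), eldest.getValue());  // e.g. close the reader
          return true;                     // drop the coldest entry
        }
        return false;
      }
    }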



Build index on HDFS

1. We modified the Lucene 3.5 code in 2013 so that we can build indexes directly on HDFS (Lucene has supported HDFS since 4.0).

2. All offline data is built by MapReduce on HDFS.

3. We moved all realtime indexes from local disk to HDFS.

4. We can ignore disk failures because the index is on HDFS.

5. We can move a process from one machine to another because the index is on HDFS.

6. We can quickly recover an index when a disk failure happens.

7. We do not need to copy data when a machine breaks (the index is so big that moving it would take many hours); the process can quickly move to another machine, driven by a ZooKeeper heartbeat.

8. Everybody knows that an index on HDFS is slower than on a local file system, but why? On a local file system the OS does a lot of optimization and uses a lot of cache to speed up random access, so we need the same kind of optimization on HDFS. That is why people often say an HDFS index is slow: the real reason is that it has not been optimized.

9. We split the HDFS file into fixed-length blocks, 1 KB per block, and use an LRU cache to cache them; the tii file and frequently used terms are served much faster this way (a minimal sketch follows this list).

10. Some HDFS files do not need to be closed immediately; we keep them in an LRU cache to reduce how often files are opened.
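
The following is a minimal sketch (our illustration; RandomAccessSource is a hypothetical stand-in for an HDFS positioned-read API) of the 1 KB block cache from point 9: random reads are served from cached fixed-size blocks, and cold blocks are evicted by LRU.

    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Caches fixed-size 1 KB blocks of a remote (HDFS) file so that repeated
    // random reads of hot regions (the tii index, frequent terms) hit memory.
    public class BlockCachedInput {
      private static final int BLOCK_SIZE = 1024;
      private final RandomAccessSource source;
      private final Map<Long, byte[]> cache;

      public interface RandomAccessSource {
        void readFully(long position, byte[] buffer, int offset, int length) throws IOException;
      }

      public BlockCachedInput(RandomAccessSource source, final int maxBlocks) {
        this.source = source;
        this.cache = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
          @Override protected boolean removeEldestEntry(Map.Entry<Long, byte[]> e) {
            return size() > maxBlocks;     // LRU eviction of cold blocks
          }
        };
      }

      public byte readByte(long position) throws IOException {
        long blockId = position / BLOCK_SIZE;
        byte[] block = cache.get(blockId);
        if (block == null) {
          block = new byte[BLOCK_SIZE];    // end-of-file handling omitted for brevity
          source.readFully(blockId * BLOCK_SIZE, block, 0, BLOCK_SIZE);
          cache.put(blockId, block);
        }
        return block[(int) (position % BLOCK_SIZE)];
      }
    }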



Improve Solr so that one core can dynamically serve many indexes

Original:

1. A Solr core (one process) only serves the 1~N indexes fixed in the Solr config.

Our improvements:

1. Use partitions like Oracle or Hadoop Hive: instead of building one big index, build many indexes partitioned by day (or by month, year, or another partition key).

2. Create tables dynamically for new businesses.

Problems solved:

1. Avoids an index growing beyond Integer.MAX_VALUE documents and the docid overflowing.

2. Sometimes a search does not need to cover all of the data; it may only need the most recent 3 days.



Label mark technology for doc values

Original:

1. group by, sort, sum, max, min, avg: those aggregation methods need to read the original values from the tis file.

2. FieldCacheImpl loads all the term values into memory for the Solr fieldValueCache, even if we only aggregate one record.

3. The first search is quite slow because it has to build the fieldValueCache and load all the term values into memory.

Our improvements:

1. In the typical case the data has a lot of repeated values, for example the sex field or the age field.

2. Storing the original values would waste a lot of storage,
so we made a small modification to TermInfosWriter and added a new field called termNumber.
TermInfosWriter already writes the unique terms in sorted order, so we give each term a sequential number from beginning to end (much like Solr's UnInvertedField).

3. We use the termNum (we call it a label) instead of the Term. We store the termNum (label) in a file called doctotm. The doctotm file is ordered by docid and labels are stored with a fixed length, so the file can be read with random access (like fdx, which also uses fixed-length entries) and does not need to be fully loaded into memory.

4. The labels have the same order as the terms, so for calculations like order by or group by we only read the labels; we do not need to read the original values (see the sketch after this list).

5. Some fields, like the sex field, have only 2 distinct values, so we only use 2 bits (not 2 bytes) to store the label, which saves a lot of disk IO.

6. When all of the calculation is finished, we translate the labels back to Terms through a dictionary.

7. If many rows share the same original value, that value is stored only once and read only once.
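
Below is a small sketch of the label idea, written only to illustrate the technique (it is not the Hermes doctotm code): every distinct term gets an ordinal in sorted term order, each document stores one fixed-width ordinal, and the dictionary converts ordinals back to terms only when results are displayed.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Assigns each distinct term a label (ordinal) in sorted term order and keeps
    // one label per document. Sorting and grouping then compare integers; the
    // dictionary maps labels back to terms only for display.
    public class LabelColumn {
      private final SortedMap<String, Integer> termToLabel = new TreeMap<>();
      private final List<String> docTerms = new ArrayList<>();
      private String[] labelToTerm;        // dictionary, built by seal()

      public void addDocument(String termValue) {
        termToLabel.put(termValue, 0);     // real label assigned in seal()
        docTerms.add(termValue);
      }

      // Freeze the dictionary: labels follow sorted term order, like the terms
      // written by TermInfosWriter. Returns the per-document label column.
      public int[] seal() {
        labelToTerm = new String[termToLabel.size()];
        int next = 0;
        for (String term : termToLabel.keySet()) {
          termToLabel.put(term, next);
          labelToTerm[next++] = term;
        }
        int[] docToLabel = new int[docTerms.size()];   // the "doctotm" column, ordered by docid
        for (int doc = 0; doc < docTerms.size(); doc++) {
          docToLabel[doc] = termToLabel.get(docTerms.get(doc));
        }
        return docToLabel;
      }

      public String term(int label) { return labelToTerm[label]; }
    }

Sorting documents by docToLabel[doc] gives the same order as sorting by the term itself, because labels are assigned in sorted term order.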

Problems solved:

1. Hermes's data is so big that we do not have enough memory to load all values into memory the way Lucene's FieldCacheImpl or Solr's UnInvertedField do.

2. In realtime mode the data changes frequently and the cache is invalidated frequently by appends and updates; rebuilding FieldCacheImpl takes a lot of time and IO.

3. The original value is a Lucene Term, which is a string. When sorting or grouping, string values need a lot of memory and a lot of CPU time for hashCode/compare/equals, while a label is a number and is fast.

4. The label is a number; its type may be short, byte or integer, depending on the maximum label value.

5. Reading the original values needs a lot of IO because it has to iterate the tis file, even when we only need to read a single document.

6. It also removes the long delay when FieldCacheImpl is built for the first time.



two-phase search

Original:

1. group by and order by use the original values; the real value may be a string and may be fairly large, so reading it needs a lot of IO on the tis and frq files.

2. Comparing strings is slower than comparing integers.

Our improvements:

1. We split one search into a multi-phase search.

2. In the first phase we only search the fields used for order by and group by.

3. In the first phase we do not need to read the original values (the real values); we only read the docid and the label (see <Label mark technology for doc values>) for order by and group by.

4. When all the order by and group by work is finished, we usually only need to return the top N records, so we run a second search to fetch the original values of just those top N records (a sketch follows this list).
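
A minimal sketch of the two-phase idea (our illustration; LabelReader and StoredFieldReader are hypothetical interfaces standing in for the label column and the stored-field reader): phase one orders candidate documents by their integer labels, and phase two reads the original values only for the top N survivors.

    import java.util.Arrays;
    import java.util.Comparator;

    // Two-phase search sketch: phase one sorts hits by their integer label only,
    // phase two resolves original string values for just the top N survivors.
    public class TwoPhaseSearch {
      public interface LabelReader { int label(int docId); }          // cheap fixed-width read
      public interface StoredFieldReader { String value(int docId); } // expensive original value

      public static String[] topN(int[] hits, int n, LabelReader labels, StoredFieldReader stored) {
        // Phase 1: order candidate docs by label (integer compare, no string IO).
        Integer[] docs = Arrays.stream(hits).boxed().toArray(Integer[]::new);
        Arrays.sort(docs, Comparator.comparingInt(labels::label));

        // Phase 2: only the top N documents pay the cost of reading the original value.
        int limit = Math.min(n, docs.length);
        String[] result = new String[limit];
        for (int i = 0; i < limit; i++) {
          result[i] = stored.value(docs[i]);
        }
        return result;
      }
    }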

Problems solved:

1. Reduces IO: reading original values takes a lot of disk IO.

2. Reduces network IO (for the merger).

3. Most fields have repeated values; a repeated value only needs to be read once,

and the group by field only needs its original value read once per label, when the result is displayed to the user.

4. Most searches only need to display the top N (n<=100) results, so with two-phase search most original values can be skipped.



multi-phase indexing

1. Hermes does not update the index document by document; it indexes in batches.

2. The index is split into four areas, called doclist => buffer index => ram index => diskIndex/hdfsIndex (sketched below).

3. The doclist area only stores the SolrInputDocuments for a batch update or append.

4. The buffer index is a RAMDirectory used to merge the doclist into an index.

5. The ram index is also a RAMDirectory, but it is bigger than the buffer index and it can be searched by the user.

6. The disk/hdfs index is the persistent store used for the big index.

7. We also use a WAL called binlog (like the MySQL binlog) for recovery.

(diagram of the indexing pipeline attached to the original mail)
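
As a rough illustration of the four areas (not the real Hermes classes; the thresholds and the plain lists are stand-ins for the SolrInputDocument buffer, the RAMDirectory areas and the disk/HDFS index), the flow could look like this:

    import java.util.ArrayList;
    import java.util.List;

    // Documents accumulate in the doclist, are merged into a small buffer index,
    // promoted to a searchable RAM index, and finally flushed to the persistent
    // disk/HDFS index. The thresholds here are illustrative only.
    public class TieredIndexPipeline<DOC> {
      private final List<DOC> doclist = new ArrayList<>();
      private final List<DOC> bufferIndex = new ArrayList<>();
      private final List<DOC> ramIndex = new ArrayList<>();
      private final List<DOC> persistentIndex = new ArrayList<>();

      private static final int DOCLIST_LIMIT = 10_000;
      private static final int BUFFER_LIMIT = 100_000;
      private static final int RAM_LIMIT = 1_000_000;

      public synchronized void add(DOC doc) {
        doclist.add(doc);                     // cheapest area: raw documents only
        if (doclist.size() >= DOCLIST_LIMIT) {
          bufferIndex.addAll(doclist);        // merge the doclist into the buffer index
          doclist.clear();
        }
        if (bufferIndex.size() >= BUFFER_LIMIT) {
          ramIndex.addAll(bufferIndex);       // promote: the ram index is user-searchable
          bufferIndex.clear();
        }
        if (ramIndex.size() >= RAM_LIMIT) {
          persistentIndex.addAll(ramIndex);   // flush to the disk/HDFS index
          ramIndex.clear();
        }
      }

      // Only the ram index and the persistent index are visible to searches.
      public synchronized List<DOC> searchableDocs() {
        List<DOC> visible = new ArrayList<>(ramIndex);
        visible.addAll(persistentIndex);
        return visible;
      }
    }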



two-phase commit for update

1. We do not update records one by one like Solr does (Solr searches by term, finds the document, deletes it, and then appends a new one); one by one is slow.

2. We need atomic-increment fields, which Solr cannot support; Solr only supports replacing a field value.
An atomic-increment field needs to read the last value first and then increase it.

3. Hermes uses pre-mark delete plus batch commit to update a document (see the sketch at the end of this section).

4. If a document is in the pre-marked state, it can still be searched by the user until we commit.
We modified SegmentReader and split deletedDocs into 3 parts: one is called deletedDocstmp, which holds the pre-marks (pending deletes), another is called deletedDocs_forsearch, which is used by index search, and the last one is still called deletedDocs.

5. When we want to pending-delete a document, we set its bit in deletedDocstmp (an OpenBitSet) to mark it as pending delete,

and then we append the new value to the doclist area (buffer area).

Pending delete means the user can still search the old value;

the buffer area means the user cannot search the new value yet.

But when we commit (in batch),

the old value is really dropped, and the whole buffer area is flushed to the ram area (which can be searched).

6. We call the pending delete a "virtual delete"; after the commit we call it a "physical delete".

7. Hermes usually virtual-deletes a lot of documents and then commits once, which performs much better than one-by-one updates.

8. We also use a lot of caching to speed up the atomic-increment fields.
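
Here is a minimal sketch of the pre-mark / batch-commit bookkeeping, using java.util.BitSet in place of Lucene's OpenBitSet; it only illustrates the idea and is not the modified SegmentReader.

    import java.util.BitSet;

    // Deletes are first only "virtual", so readers still see the old document;
    // a single batch commit turns them into physical deletes.
    public class TwoPhaseDeletes {
      private final BitSet pendingDeletes = new BitSet();   // deletedDocstmp: virtual deletes
      private final BitSet deletesForSearch = new BitSet(); // deletedDocs_forsearch: what queries see
      private final BitSet deletedDocs = new BitSet();      // committed (physical) deletes

      // Mark a document as pending delete; searches still return it.
      public synchronized void virtualDelete(int docId) {
        pendingDeletes.set(docId);
      }

      public synchronized boolean isVisibleToSearch(int docId) {
        return !deletesForSearch.get(docId);
      }

      // Batch commit: all pending deletes become physical and invisible to search.
      public synchronized void commit() {
        deletedDocs.or(pendingDeletes);
        deletesForSearch.or(pendingDeletes);
        pendingDeletes.clear();
      }
    }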



Term data skew

Original:

1. Lucene uses an inverted index to store terms and their doclists.

2. Some fields, like sex, have only two values, male and female, so "male" alone will cover about 50% of the doclist.

3. Solr uses the filter cache to cache each FQ; the FQ is an OpenBitSet that stores the doclist.

4. The first time an FQ is used (not yet cached), it has to read a very long doclist to build the OpenBitSet, which takes a lot of disk IO.

5. Most of the time we only need the top N docs and do not care about score sorting.

Our improvements:

1. We usually intersect the skewed clause with the other FQ clauses and use the skip lists to skip docids that are not needed (we drive the query through the advance method), as sketched after this list.

2. We do not cache an OpenBitSet per FQ; instead we cache blocks of the frq file in memory to speed up the regions that are read often.

3. Our index is quite big; caching the FQ (OpenBitSet) would take a lot of memory.

4. We modified IndexSearcher to support a real top-N search that ignores score sorting.
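
A minimal sketch of the skip-driven intersection (our illustration; DocIterator is a stand-in for a postings iterator with skip support, similar in spirit to advance on a doclist): the selective clause drives the loop and the skewed clause is only asked to advance, so most of its huge doclist is never read and no bitset is materialized.

    // Leapfrog intersection of a rare (selective) clause and a skewed clause.
    public class LeapfrogIntersection {
      public static final int NO_MORE_DOCS = Integer.MAX_VALUE;

      public interface DocIterator {
        int nextDoc();           // next docid, or NO_MORE_DOCS
        int advance(int target); // first docid >= target, or NO_MORE_DOCS
      }

      public static void intersect(DocIterator rare, DocIterator skewed,
                                   java.util.function.IntConsumer collector) {
        int doc = rare.nextDoc();
        while (doc != NO_MORE_DOCS) {
          int other = skewed.advance(doc);   // jump through the skewed doclist using skip data
          if (other == doc) {
            collector.accept(doc);           // both clauses match this document
            doc = rare.nextDoc();
          } else {
            doc = rare.advance(other);       // leapfrog: let the rare clause catch up
          }
        }
      }
    }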

Problems solved:

1. Data skew takes a lot of disk IO to read doclists that are not needed.

2. A 2000-billion-document index is too big; an FQ cache (filter cache) of OpenBitSets takes too much memory.

3. Most searches only need the top N results and do not need score sorting, so we need to speed up that case.





Block-Buffer-Cache

OpenBitSet and fieldValueCache need to allocate a big long[] or int[] array. This pattern shows up in many caches, such as UnInvertedField, FieldCacheImpl, filterQueryCache and so on. Most of the time most of the elements are zero (empty).

Original:

1. We create the big array directly, and when we no longer need it we drop it for the JVM GC to collect.

Our improvements:

1. We split the big array into fixed-length blocks; each block is a small array with a fixed length of 1024 (see the sketch after this list).

2. If a block's elements are almost all zero, we use a HashMap instead of an array.

3. If a block has no non-zero values at all, we do not create the block array; we just use null instead.

4. When a block is no longer in use, we return the array to a buffer and reuse it next time.
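
A small sketch of the blocked array idea (illustration only; the real implementation also swaps nearly-empty blocks for hash maps and recycles blocks through a buffer, which is omitted here):

    // A logical long[] split into 1024-entry blocks. Blocks that are entirely
    // zero stay null, so sparse caches only pay for regions that hold data.
    public class BlockedLongArray {
      private static final int BLOCK_SIZE = 1024;
      private final long[][] blocks;

      public BlockedLongArray(long capacity) {
        this.blocks = new long[(int) ((capacity + BLOCK_SIZE - 1) / BLOCK_SIZE)][];
      }

      public void set(long index, long value) {
        int b = (int) (index / BLOCK_SIZE);
        if (blocks[b] == null) {
          if (value == 0) return;            // keep all-zero blocks as null
          blocks[b] = new long[BLOCK_SIZE];  // allocate lazily on the first non-zero write
        }
        blocks[b][(int) (index % BLOCK_SIZE)] = value;
      }

      public long get(long index) {
        long[] block = blocks[(int) (index / BLOCK_SIZE)];
        return block == null ? 0L : block[(int) (index % BLOCK_SIZE)];
      }
    }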

Problems solved:

1. Saves memory.

2. Reduces the JVM garbage collection, which takes a lot of CPU.





WeakHashMap, HashMap and synchronized problems

1. FieldCacheImpl uses a WeakHashMap to manage the field value cache, and it has a memory leak bug.

2. SolrInputDocument uses a lot of HashMap/LinkedHashMap instances for fields, which wastes a lot of memory.

3. AttributeSource uses a WeakHashMap to cache class implementations and uses a global synchronized block, which reduces performance.

4. AttributeSource is a base class and NumericField extends it, so a lot of HashMaps are created that NumericField never uses.

5. Because of all this, the JVM GC carries a lot of burden for HashMaps that are never used.

Our improvements:

1. WeakHashMap is not fast; we use SoftReference instead (a small sketch follows below).

2. Reuse NumericField to avoid creating AttributeSource instances frequently.

3. Do not use a global synchronized block.

When we finished these optimizations, our ingest speed went up from 20000 docs/s to 60000 docs/s (1 KB per document).
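
As an illustration of the SoftReference replacement (not the actual patch; it also uses Java 8 ConcurrentHashMap and lambdas, which Lucene 3.5-era code would not have), a cache along these lines avoids both the WeakHashMap behaviour and the global lock:

    import java.lang.ref.SoftReference;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    // Entries survive until memory pressure clears them, and lookups do not
    // contend on a global synchronized block.
    public class SoftRefCache<K, V> {
      private final Map<K, SoftReference<V>> map = new ConcurrentHashMap<>();

      public V get(K key, Function<K, V> loader) {
        SoftReference<V> ref = map.get(key);
        V value = (ref != null) ? ref.get() : null;
        if (value == null) {                       // missing, or cleared by the GC
          value = loader.apply(key);               // duplicate loads are tolerated
          map.put(key, new SoftReference<>(value));
        }
        return value;
      }
    }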





Other GC optimizations

1. Reuse byte[] arrays in the input buffers and output buffers (a simple pool sketch follows this list).

2. Reuse byte[] arrays in the RAMFile.

3. Remove some finalize methods that are not necessary.

4. Use StringHelper.intern to reuse the field names in SolrInputDocument.
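
A simple sketch of the byte[] reuse idea (illustration only; the block size and pool limit are made-up parameters, and the class name is ours):

    import java.util.ArrayDeque;

    // Pools fixed-size byte[] buffers (input/output buffers, RAMFile blocks)
    // so short-lived arrays are reused instead of being handed to the GC.
    public class ReusableByteArrayPool {
      private final int blockSize;
      private final int maxPooled;
      private final ArrayDeque<byte[]> free = new ArrayDeque<>();

      public ReusableByteArrayPool(int blockSize, int maxPooled) {
        this.blockSize = blockSize;
        this.maxPooled = maxPooled;
      }

      public synchronized byte[] acquire() {
        byte[] block = free.poll();
        return block != null ? block : new byte[blockSize];
      }

      public synchronized void release(byte[] block) {
        if (block.length == blockSize && free.size() < maxPooled) {
          free.push(block);                // keep it for reuse instead of discarding
        }
      }
    }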



Directory optimization

1. An index commit does not need to sync all the files.

2. We use a block cache on top of FSDirectory and our HDFS directory to speed up reads.

3. We close indexes or index files that are not used often, and we limit the maximum number of indexes allowed to stay open; the block cache is managed by LRU.



network optimization

1. We optimized the thread pool in the SearchHandler class; sometimes a keep-alive connection is not needed, and we increased the timeout for large indexes.

2. We removed Jetty and wrote the socket layer ourselves; importing data through Jetty was not fast enough.

3. We changed data import from push mode to pull mode, similar to Apache Storm.



append mode optimization

1. In append mode we do not store the field values in the fdt file; keeping them there would cost a lot of IO during index merges, and it is not needed.

2. We store the field data in a separate file in the Hadoop SequenceFile format and compress it with LZO to save IO.

3. We keep a pointer that maps each docid to its position in the SequenceFile, as sketched below.
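
A minimal sketch of the docid-to-SequenceFile pointer (our illustration; packing the block id and in-block offset into one long is an assumption, not necessarily the Hermes layout):

    // One long per document packs the SequenceFile block id and the offset inside
    // that block, so stored fields never go through the fdt file and never have
    // to be rewritten on index merges.
    public class DocPointerTable {
      private final long[] pointers;       // index = docid

      public DocPointerTable(int maxDoc) {
        this.pointers = new long[maxDoc];
      }

      public void set(int docId, int blockId, int offsetInBlock) {
        pointers[docId] = ((long) blockId << 32) | (offsetInBlock & 0xFFFFFFFFL);
      }

      public int blockId(int docId) { return (int) (pointers[docId] >>> 32); }
      public int offset(int docId)  { return (int) pointers[docId]; }
    }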



non-tokenized field optimization

1. For non-tokenized fields we do not store the field value in the fdt file.

2. We read the field value back from the label (see <<Label mark technology for doc values>>).

3. Most fields have many duplicate values, so this reduces the index file size.



multi-level merger servers

1. Solr can only use one shard to act as the merger server.

2. We use multiple levels of merger servers to merge the results of all shards.

3. Shards on the same machine are merged first, with higher priority, by a merger server on that same machine.

Solr's merge looks like this:

(diagram attached to the original mail)

Hermes's merge looks like this:

(diagram attached to the original mail)

other optimizations

1. Hermes supports SQL.

2. It supports union SQL across different tables.

3. It supports view tables.







Finally

Hermes's SQL looks like this:

- select higo_uuid,thedate,ddwuid,dwinserttime,ddwlocaltime,dwappid,dwinituserdef1,dwclientip,sclientipv6,dwserviceip,dwlocaiip,dwclientversion,dwcmd,dwsubcmd,dwerrid,dwuserdef1,dwuserdef2,dwuserdef3,dwuserdef4,cloglevel,szlogstr from sngsearch06,sngsearch09,sngsearch12 where thedate in ('20140917') and ddwuin=5713 limit 0,20

- select thedate,ddwuin,dwinserttime,ddwlocaltime from sngsearch12 where thedate in ('20140921') and ddwuin=5713 order by ddwlocaltime desc limit 0,10

- select count(*),count(ddwuid) from sngsearch03 where thedate=20140921 limit 0,100

- select sum(acnt),average(acnt),max(acnt),min(acnt) from sngsearch03 where thedate=20140921 limit 0,100

- select thedate,ddwuid,sum(acnt),count(*) from sngsearch18 where thedate in (20140908) and ddwuid=7823 group by thedate,ddwuid limit 0,100;

- select count(*) from guangdiantong where thedate ='20141010' limit 0,100

- select freqtype,fspenttime,fmodname,yyyymmddhhmmss,hermestime,freqid from guangdiantong where thedate ='20141010' limit 0,100

- select freqtype,fspenttime,fmodname,yyyymmddhhmmss,hermestime,freqid from guangdiantong where thedate ='20141010' order by yyyymmddhhmmss desc limit 0,10

- select miniute1,count(*) from guangdiantong where thedate ='20141010' group by miniute1 limit 0,100

- select miniute5,count(*) from guangdiantong where thedate ='20141010' group by miniute5 limit 0,100

- select hour,miniute15,count(*) from guangdiantong where thedate ='20141010' group by hour,miniute15 order by miniute15 desc limit 0,100

- select hour,count(*),sum(fspenttime),average(fspenttime),average(ferrorcode) from guangdiantong where thedate ='20141010' and freqtype=1 group by hour limit 0,100

- select freqtype,count(*),sum(fspenttime),average(fspenttime) from guangdiantong where thedate ='20141010' and (freqtype>=10000 and freqtype<=10100) group by freqtype limit 0,100

- select freqtype,count(*),sum(fspenttime),average(fspenttime) from guangdiantong where thedate ='20141010' and (freqtype>=10000 and freqtype<=10100) group by freqtype order by average(fspenttime) desc limit 0,100

- select hour,miniute15,count(*),sum(fspenttime),average(fspenttime) from guangdiantong where thedate ='20141010' group by hour,miniute15 order by miniute15 desc limit 0,100

- select thedate,yyyymmddhhmmss,miniute1,miniute5,miniute15,hour,hermestime,freqtype,freqname,freqid,fuid,fappid,fmodname,factionname,ferrorcode,ferrormsg,foperateret,ferrortype,fcreatetime,fspenttime,fserverip,fversion from guangdiantong where thedate ='20141010' order by yyyymmddhhmmss desc limit 0,100


________________________________
yannianmu(母延年)

Re: Our Optimize Suggestions on lucene 3.5(Internet mail)

Posted by "yannianmu (母延年)" <ya...@tencent.com>.
add attachment
________________________________
yannianmu(母延年)


Re: Re: Our Optimize Suggestions on lucene 3.5(Internet mail)

Posted by "david.w.smiley@gmail.com" <da...@gmail.com>.
In Solr, all you need to do is declare that your string field has docValues
(docValues=“true” in the schema on the field or field type) and the
behavior I described (sort by ord) is what will happen when you sort on
this field.  Note that there is a segment local ord to global ord mapping
if the segment isn’t optimized.

Look at Solr’s StrField.getSortField(…) and start following who calls who.
You could set a break-point in your debugger to see.  In 4.x, this will
take you to SortField (a Lucene construct), and you will see references to
the FieldCache.  The FieldCache API in late 4.x releases will redirect to
DocValues if it is present.  Note that in trunk/5.x, the FieldCache API is
superseded by the DocValues API directly.

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley

On Sun, Mar 1, 2015 at 9:45 PM, yannianmu(母延年) <ya...@tencent.com>
wrote:

>   Is there anybody who could help me with an example of how to use
> addSortedField to sort by ord, just like what david.w.smiley said
> ('e.g. sort by ord not string value')?
> I was reading the source code of lucene 4.10.3, but I do not understand it
> well in this place.
> Thanks, anybody.

Re: Re: Our Optimize Suggestions on lucene 3.5(Internet mail)

Posted by "yannianmu (母延年)" <ya...@tencent.com>.
Could anybody help me with an example of how to use addSortedField to sort by ord, just like what david.w.smiley said ('e.g. sort by ord not string value')?
I have been reading the source code of Lucene 4.10.3, but I do not understand it well in this area.
Thanks to anybody who can help.
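
For reference, a minimal sketch of sort-by-ord with the stock Lucene 4.10.3 API (the field name "city", the sample values, and the class name are only illustrative, not from Hermes): indexing a SortedDocValuesField builds a per-segment term dictionary, which the codec receives through DocValuesConsumer.addSortedField, and at search time SortField.Type.STRING compares the per-document ordinals instead of the string values.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class SortByOrdSketch {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    // version constant matching the 4.10.3 release
    IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_4_10_3,
        new StandardAnalyzer(Version.LUCENE_4_10_3));
    IndexWriter writer = new IndexWriter(dir, cfg);
    for (String city : new String[] {"shenzhen", "beijing", "shanghai"}) {
      Document doc = new Document();
      doc.add(new StringField("city", city, Field.Store.YES));
      // SortedDocValuesField writes a per-segment term dictionary; each document
      // stores only the small ordinal of its term (the codec gets the dictionary
      // through DocValuesConsumer.addSortedField).
      doc.add(new SortedDocValuesField("city", new BytesRef(city)));
      writer.addDocument(doc);
    }
    writer.close();

    DirectoryReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    // Type.STRING compares the ordinals of the doc values, not the strings,
    // so the sort does not have to materialize every term.
    Sort sort = new Sort(new SortField("city", SortField.Type.STRING));
    TopDocs top = searcher.search(new MatchAllDocsQuery(), 10, sort);
    for (ScoreDoc sd : top.scoreDocs) {
      System.out.println(searcher.doc(sd.doc).get("city"));
    }
    reader.close();
  }
}

Ordinals are only comparable inside one segment, so cross-segment merging still looks up the actual term at segment boundaries; within a segment the comparisons stay integer-only.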




________________________________
yannianmu(母延年)


Re: Re: Our Optimize Suggestions on lucene 3.5(Internet mail)

Posted by "yannianmu (母延年)" <ya...@tencent.com>.
Hi david.w.smiley,
Of course, I would like to share my code.
Three years ago I shared my mdrill code on the Alibaba GitHub (https://github.com/alibaba/mdrill & https://github.com/alibaba/jstorm), but mdrill only supports 1 billion documents per day. Its data is offline (not realtime import), and the data is not stored on HDFS.
After I joined Tencent, I stopped updating mdrill (I am sorry for this) and joined a project called Hermes, which processes 100 billion documents per day with near-realtime data import.
I will apply to open-source the Hermes project this year. If my company and my boss accept it, I will clean up my code and publish it on http://data.qq.com/ or GitHub.
After that I will make patches for Lucene.
I am sorry for my poor English; these days I will study how to submit a patch through the open-source process, and I will try that.
If you could help me by giving me some links where I can study the rules, I would be very grateful.


________________________________
yannianmu(母延年)



Re: Our Optimize Suggestions on lucene 3.5

Posted by "david.w.smiley@gmail.com" <da...@gmail.com>.
Wow.  This is a lot to digest.
p.s. the right list for this is dev@, not general@ or commits@.

One or two of these optimizations seemed redundant with what Lucene/Solr
does in the latest release (e.g. sort by ord not string value) but I may
have misunderstood what you said.  For the most part, this all looks new.
I’m not sure how familiar you are with the open-source process at the ASF
and Lucene in particular.  The set of optimizations here need to each
become a set of JIRA issues at an appropriate scope, submitted to either
Lucene or Solr as appropriate.  Hopefully you are willing to submit patches
for each, especially with tests, and ideally targeted for the trunk branch
(not 4x!).  Frankly if you don’t, these issues will just be wish-list like
and they will have a low chance of getting done.  And just to set realistic
expectations, even if you do supply the code and do everything you can, the
issue in question will require a committer's attention, plus the approach has
to seem reasonable to the committer.  Smaller, more focused patches have an
easier time than big ones because we (committers) will have an easier time
digesting what’s in the patch.

So with that said, *thanks* for sharing the information you have here, even
if you choose not to share code. It’s useful to know where the bottlenecks
are and ways to solve them.  Maybe you’d like to speak about this search
architecture at the next Lucene/Solr Revolution conference.

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley


RE: this is a BUG?

Posted by Will Martin <wm...@gmail.com>.
Myn, please let us know the issue number once it is created, so we can share it with other projects that might be better suited for this infrastructure piece.

fwiw

-----Original Message-----
From: david.w.smiley@gmail.com [mailto:david.w.smiley@gmail.com] 
Sent: Thursday, June 11, 2015 11:27 AM
To: general@lucene.apache.org
Subject: Re: this is a BUG?

myn,
Please file a JIRA issue http://issues.apache.org/jira/browse/SOLR


Re: this is a BUG?

Posted by "david.w.smiley@gmail.com" <da...@gmail.com>.
myn,
Please file a JIRA issue http://issues.apache.org/jira/browse/SOLR


Re:this is a BUG?

Posted by myn <my...@163.com>.

I also think that is not a high-performance implementation of HdfsDirectory, because reading and writing directly on hdfs is slower than the local filesystem.

Why not supply a cache on top of hdfs, so we can get local-filesystem speed? The cache could store on local disk: we split the HDFS file into fixed-length blocks and keep them on local disk, evicted by LRU. (A rough sketch of the idea follows.)

We use hdfs for data reliability and the local filesystem for performance; that is how hermes uses it, and that is our suggestion.
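
A minimal sketch of that read-through cache (our illustration; the class name, block size, and layout are assumptions, and LRU eviction of the local files is left out): the first read of a block copies it from HDFS to a local cache file, and later reads hit local disk.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalDiskBlockCache {
    private static final int BLOCK_SIZE = 64 * 1024;   // fixed block length (assumption)

    private final FileSystem fs;
    private final File cacheDir;

    public LocalDiskBlockCache(Configuration conf, File cacheDir) throws IOException {
        this.fs = FileSystem.get(conf);
        this.cacheDir = cacheDir;
    }

    /** Read one fixed-length block of an HDFS file through the local-disk cache. */
    public byte[] readBlock(Path hdfsFile, long blockIndex) throws IOException {
        long fileLen = fs.getFileStatus(hdfsFile).getLen();
        int len = (int) Math.min(BLOCK_SIZE, fileLen - blockIndex * BLOCK_SIZE);
        byte[] block = new byte[len];

        File local = new File(cacheDir, hdfsFile.getName() + "." + blockIndex);
        if (local.exists()) {                           // cache hit: read from local disk
            RandomAccessFile raf = new RandomAccessFile(local, "r");
            try {
                raf.readFully(block);
            } finally {
                raf.close();
            }
            return block;
        }

        FSDataInputStream in = fs.open(hdfsFile);       // cache miss: read once from HDFS
        try {
            in.readFully(blockIndex * BLOCK_SIZE, block, 0, len);
        } finally {
            in.close();
        }
        RandomAccessFile out = new RandomAccessFile(local, "rw");
        try {
            out.write(block);                           // keep the block on local disk
        } finally {
            out.close();
        }
        return block;
    }
}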






At 2015-06-08 20:28:34, "myn" <my...@163.com> wrote:



In the SOLR package org.apache.solr.store.blockcache.CustomBufferedIndexInput, the slice method calls BufferedIndexInput.wrap(sliceDescription, this, offset, length);

I have changed my lucene version from lucene 3.5 to lucene 5.1.

In my test, building an index on hdfs was quite slow.

I found that when we use docvalues, the directory often calls the slice method to clone the field input:

  @Override
  public IndexInput slice(String sliceDescription, long offset, long length) throws IOException {
      // wrap() creates the slice with the default 1024-byte buffer
      return BufferedIndexInput.wrap(sliceDescription, this, offset, length);
  }

but the default buffer size is 1024, not the buffer size I set; so I fixed it as below, and then building the index went faster:

  @Override
  public IndexInput slice(String sliceDescription, long offset, long length) throws IOException {
      SlicedIndexInput rtn = new SlicedIndexInput(sliceDescription, this, offset, length);
      rtn.setBufferSize(this.bufferSize);   // propagate the configured buffer size to the slice
      return rtn; // instead of BufferedIndexInput.wrap(sliceDescription, this, offset, length);
  }







Re:this is a BUG?

Posted by myn <my...@163.com>.

I also think that is not a high-performance implements on HdfsDirectory,because direct read /write on hdfs is slower then local filesystem.

why we not supply a Cache on hdfs,so that`can imporve speed by local filesystem.  the cache could Store in local disk,we split HDFS file into bolcks(fix length), and store in local disk by LRU.

we used hdfs for Data reliability,and we used local file system for high-performance that`s how hermes used it ,that what is our suggest.






At 2015-06-08 20:28:34, "myn" <my...@163.com> wrote:



SOLR package org.apache.solr.store.blockcache.CustomBufferedIndexInput.since method BufferedIndexInput.wrap(sliceDescription, this, offset, length);

I have  change my lucene version from lucene3.5 to lucene5.1

on my test build index on hdfs ,that quit slow.

I found when we use docvalues, the direcrory onten call since methos for field input clone;
  @Override
  public IndexInput slice(String sliceDescription, long offset, long length) throws IOException {
      return BufferedIndexInput.wrap(sliceDescription, this, offset, length);
  }

but defaut buffer size is 1024,  is not the buffer my set; so I fix it like below then build index go faster;


  @Override
  public IndexInput slice(String sliceDescription, long offset, long length) throws IOException {
    // propagate the configured buffer size instead of falling back to the 1024-byte default
    SlicedIndexInput rtn = new SlicedIndexInput(sliceDescription, this, offset, length);
    rtn.setBufferSize(this.bufferSize);
    return rtn; // was: return BufferedIndexInput.wrap(sliceDescription, this, offset, length);
  }













this is a BUG?

Posted by myn <my...@163.com>.

This is about the slice method of org.apache.solr.store.blockcache.CustomBufferedIndexInput in Solr, which calls BufferedIndexInput.wrap(sliceDescription, this, offset, length).

I have changed my Lucene version from 3.5 to 5.1.

In my tests, building an index on HDFS was quite slow.

I found that when we use docvalues, the directory often calls the slice method to clone a field's input:

  @Override
  public IndexInput slice(String sliceDescription, long offset, long length) throws IOException {
    return BufferedIndexInput.wrap(sliceDescription, this, offset, length);
  }

But the resulting slice uses the default buffer size of 1024 bytes, not the buffer size I configured, so I fixed it as below; after that, building the index went faster:


  @Override
  public IndexInput slice(String sliceDescription, long offset, long length) throws IOException {
    // propagate the configured buffer size instead of falling back to the 1024-byte default
    SlicedIndexInput rtn = new SlicedIndexInput(sliceDescription, this, offset, length);
    rtn.setBufferSize(this.bufferSize);
    return rtn; // was: return BufferedIndexInput.wrap(sliceDescription, this, offset, length);
  }
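
To illustrate why the buffer size matters, here is a small, self-contained Java toy (not Lucene or Hermes code; the file-path argument and the 1 KB / 64 KB sizes are arbitrary assumptions) that reads the same file through two buffer sizes and prints the elapsed times; on a high-latency filesystem such as HDFS the gap is far larger than on a locally cached file:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferSizeDemo {

    // Read the whole stream in small chunks and return the elapsed time in milliseconds.
    static long timeRead(InputStream in) throws IOException {
        long start = System.nanoTime();
        byte[] chunk = new byte[512];
        while (in.read(chunk) != -1) {
            // consume
        }
        in.close();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws IOException {
        String path = args[0]; // any reasonably large file
        long small = timeRead(new BufferedInputStream(new FileInputStream(path), 1024));
        long large = timeRead(new BufferedInputStream(new FileInputStream(path), 64 * 1024));
        System.out.println("1 KB buffer:  " + small + " ms");
        System.out.println("64 KB buffer: " + large + " ms");
    }
}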










Re:Our Optimize Suggestions on lucene 3.5

Posted by myn <my...@163.com>.


add attachment







RE: Re: Our Optimize Suggestions on lucene 3.5(Internet mail)

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

The “memory leak” you describe is actually 2 different things:

a) Lucene prevents this special case of weak refs: Whenever an IndexReader is closed, it notifies the FieldCache, which expunges the uninverted values (it removes the weak reader keys from the maps). If you keep the IndexReader open – as I said before – the field cache is not intended to be cleaned up, because the uninverted stuff will for sure be used for later queries. This is by design. On the other hand, if Solr sits on already closed index readers, it's a bug in Solr, not in Lucene's FieldCache.

b) If you want your field cache entries expunged while readers are open (e.g., on memory pressure), you can plug in your own implementation that uses soft references on the values, too. This got easier with Lucene 4 and especially Lucene 5, because you can just wrap your reader with your own UninvertingReader implementation that emulates the DocValues APIs, backed by a cache using soft references. Sorting code in Lucene no longer uses FieldCache directly; it does all this through the DocValues APIs, which can be replaced without modifying Lucene.

Nevertheless, Lucene 5 removes FieldCache from core packages, because users should use DocValues – so this is all no issue anymore, unless you use the old stuff or don’t correctly configure DocValues in Solr 5. The old impl is still available as an optional package, but this one should only be used during some index upgrading step: You can use UninvertingReader to wrap your input index without doc values and merge that into a new one that has doc values. UninvertingReader adds the docvalues while merging (because they are emulated and merging will pick the emulated docvalues from FieldCache).
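A minimal sketch of the wrapping described above, assuming the UninvertingReader shipped in Lucene 5.x's misc module (org.apache.lucene.uninverting); the "sex" field and the Type.SORTED mapping are illustrative assumptions, not code from this thread:

import java.io.IOException;
import java.util.Collections;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.uninverting.UninvertingReader;
import org.apache.lucene.uninverting.UninvertingReader.Type;

public class UninvertingSketch {
  // Expose the indexed (but not doc-valued) "sex" field through the DocValues
  // API by uninverting it on the fly; sorting/grouping code written against
  // DocValues then works without touching FieldCache.
  static DirectoryReader openWrapped(Directory dir) throws IOException {
    DirectoryReader plain = DirectoryReader.open(dir);
    return UninvertingReader.wrap(plain, Collections.singletonMap("sex", Type.SORTED));
  }
}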

So this is no longer an issue.

Uwe

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

From: yannianmu(母延年) [mailto:yannianmu@tencent.com] 
Sent: Saturday, January 31, 2015 5:27 AM
To: dev@lucene.apache.org
Cc: uwe@thetaphi.de
Subject: Re: Re: Our Optimize Suggestions on lucene 3.5(Internet mail)

 

Hi Uwe,

  But as a matter of fact, FieldCacheImpl really does have a memory leak.

WeakHashMap releases the memory of a value only when expungeStaleEntries is called,

and expungeStaleEntries only helps once the corresponding weak key has been garbage collected.

First, let us look at the key type in FieldCacheImpl:

final Object readerKey = reader.getCoreCacheKey();

It is an IndexReader (SegmentReader) object. If the reader is still referenced, the key cannot be garbage collected, so the value will never be released by expungeStaleEntries.

But as we know, the IndexReader (SegmentReader) object is usually cached in the Solr core and kept alive.

You may not have hit the memory leak because your index is not big enough, but we often hit it once an index grows past 10,000,000 documents. So we use weak values instead of weak keys, and because we still want to cache entries for a while, we use soft references.

Lucene's design discourages closing readers frequently, so we usually keep the reader open persistently.
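A tiny standalone illustration of that point in plain Java; the Object key below stands in for reader.getCoreCacheKey(), and the array for a large uninverted value:

import java.util.Map;
import java.util.WeakHashMap;

public class WeakKeyDemo {
  public static void main(String[] args) {
    Map<Object, int[]> cache = new WeakHashMap<Object, int[]>();
    Object readerKey = new Object();                // stands in for reader.getCoreCacheKey()
    cache.put(readerKey, new int[8 * 1024 * 1024]); // a large uninverted value

    System.gc();
    // The key is still strongly reachable, so the entry is never expunged
    // and the large value stays resident.
    System.out.println(cache.size());               // prints 1

    readerKey = null;                               // only now can the weak key be collected
    System.gc();
    System.out.println(cache.size());               // typically prints 0 once the key is gone
  }
}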

 

 

From: Uwe Schindler [mailto:uwe@thetaphi.de]
Sent: 2015-01-30 18:47
To: dev@lucene.apache.org
Subject: RE: Re: Our Optimize Suggestions on lucene 3.5(Internet mail)

 

Hi yannianmu,

What you propose here is a so-called „soft reference cache“. That’s something completely different. In fact FieldCache was never a “cache” because it was never able to evict entries on memory pressure. The weak map has a different reason (please note, it is “weak keys”, not “weak values” as in your implementation). Weak maps are not useful for caches at all. They are useful to decouple object instances from each other.

The weak map in FieldCacheImpl had the reason to prevent memory leaks if you open multiple IndexReaders on the same segments. If the last one is closed and garbage collected, the corresponding uninverted field should disappear. And this works correctly. But this does not mean that uninverted fields are removed on memory pressure: once loaded the uninverted stuff keeps alive until all referring readers are closed – this is the idea behind the design, so there is no memory leak! If you want a cache that discards the cached entries on memory pressure, implement your own field”cache” (in fact a real “cache” like you did).

Uwe

P.S.: FieldCache was a bad name, because it was not a “cache”. This is why it should be used as “UninvertingReader” now.

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

From: yannianmu(母延年) [mailto:yannianmu@tencent.com] 
Sent: Friday, January 30, 2015 5:47 AM
To: Robert Muir; dev
Subject: Re: Re: Our Optimize Suggestions on lucene 3.5

 

WeakHashMap may cause a memory leak problem.

 

we use SoftReference instead of it, like this:

 

 

import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// "Entry" is FieldCacheImpl's per-field cache key type; everything else is plain Java.
public static class SoftLinkMap {
  private static int SORT_CACHE_SIZE = 1024;
  private static float LOADFACTOR = 0.75f;

  // Access-ordered LinkedHashMap acts as an LRU over the reader core-cache keys;
  // the per-reader inner maps are held through SoftReferences so they can be
  // reclaimed under memory pressure.
  final Map<Object, SoftReference<Map<Entry, Object>>> readerCache_lru =
      new LinkedHashMap<Object, SoftReference<Map<Entry, Object>>>(
          (int) Math.ceil(SORT_CACHE_SIZE / LOADFACTOR) + 1, LOADFACTOR, true) {
        @Override
        protected boolean removeEldestEntry(
            Map.Entry<Object, SoftReference<Map<Entry, Object>>> eldest) {
          return size() > SORT_CACHE_SIZE;
        }
      };

  public void remove(Object key) {
    readerCache_lru.remove(key);
  }

  public Map<Entry, Object> get(Object key) {
    SoftReference<Map<Entry, Object>> w = readerCache_lru.get(key);
    if (w == null) {
      return null;
    }
    return w.get(); // may be null if the referent was already collected
  }

  public void put(Object key, Map<Entry, Object> value) {
    readerCache_lru.put(key, new SoftReference<Map<Entry, Object>>(value));
  }

  // Snapshot view that skips entries whose soft referents have been collected.
  public Set<java.util.Map.Entry<Object, Map<Entry, Object>>> entrySet() {
    HashMap<Object, Map<Entry, Object>> rtn = new HashMap<Object, Map<Entry, Object>>();
    for (java.util.Map.Entry<Object, SoftReference<Map<Entry, Object>>> e : readerCache_lru.entrySet()) {
      Map<Entry, Object> v = e.getValue().get();
      if (v != null) {
        rtn.put(e.getKey(), v);
      }
    }
    return rtn.entrySet();
  }
}

final SoftLinkMap readerCache = new SoftLinkMap();
// was: final Map<Object,Map<Entry,Object>> readerCache = new WeakHashMap<Object,Map<Entry,Object>>();
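A hypothetical lookup path for the wrapper above, mirroring how FieldCacheImpl consults its per-reader map (Entry is FieldCacheImpl's per-field key type, readerCache is the SoftLinkMap field above, and the method name is illustrative):

// Lives alongside the SoftLinkMap inside a FieldCacheImpl-style cache.
Map<Entry, Object> innerCacheFor(org.apache.lucene.index.IndexReader reader) {
  Object readerKey = reader.getCoreCacheKey();
  Map<Entry, Object> innerCache = readerCache.get(readerKey);
  if (innerCache == null) {
    innerCache = new HashMap<Entry, Object>();
    readerCache.put(readerKey, innerCache);
  }
  return innerCache;
}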

    

 

  _____  

yannianmu(母延年)

 

From: Robert Muir <ma...@gmail.com> 

Date: 2015-01-30 12:03

To: dev@lucene.apache.org

Subject: Re: Our Optimize Suggestions on lucene 3.5

I am not sure this is the case. Actually, FieldCacheImpl still works as before and still has a weak hashmap.

However, I think the weak map is unnecessary. Reader close listeners already ensure purging from the map, so I don't think the weak map serves any purpose today. The only possible advantage it has is to allow you to GC fieldcaches when you are already leaking readers... it could just be a regular map IMO.
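A minimal sketch of the close-listener purging mentioned above, assuming the Lucene 4/5 IndexReader API; the cache map is a hypothetical stand-in for FieldCacheImpl's per-reader map:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.lucene.index.IndexReader;

public class CloseListenerSketch {
  static final Map<Object, Object> cache = new ConcurrentHashMap<Object, Object>();

  // Purge the per-reader entry when the reader closes, which is what makes
  // the weak keys largely redundant.
  static void register(IndexReader reader) {
    reader.addReaderClosedListener(new IndexReader.ReaderClosedListener() {
      @Override
      public void onClose(IndexReader closed) {
        cache.remove(closed.getCoreCacheKey());
      }
    });
  }
}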

 

On Thu, Jan 29, 2015 at 9:35 AM, Uwe Schindler <uw...@thetaphi.de> wrote:

Hi,

parts of your suggestions are already done in Lucene 4+. For one part I can tell you:


weakhashmap,hashmap , synchronized problem


1. FieldCacheImpl use weakhashmap to manage field value cache,it has memory leak BUG.

2. sorlInputDocunent use a lot of hashmap,linkhashmap for field,that weast a lot of memory

3. AttributeSource use weakhashmap to cache class impl,and use a global synchronized reduce performance

4. AttributeSource is a base class , NumbericField extends AttributeSource,but they create a lot of hashmap,but NumbericField never use it .

5. all of this ,JVM GC take a lot of burder for the never used hashmap.

All Lucene items no longer apply:

1. FieldCache is gone and is no longer supported in Lucene 5. You should use the new DocValues index format for that (column-based storage, optimized for sorting and numerics); a sketch follows this list. You can still use Lucene's UninvertingReader, but this one has no weak maps anymore because it is no cache.

2. No idea about that one - it's unrelated to Lucene

3. AttributeSource no longer uses this; since Lucene 4.8 it uses Java 7's java.lang.ClassValue to attach the implementation class to the interface. No concurrency problems anymore. It also uses MethodHandles to invoke the attribute classes.

4. NumericField no longer exists, the base class does not use AttributeSource. All field instances now automatically reuse the inner TokenStream instances across fields, too!

5. See above
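Below is a minimal, self-contained sketch of what point 1 looks like in practice, assuming Lucene 5.x APIs; the "sex" and "age" fields are illustrative:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;

public class DocValuesSketch {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();
    IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

    Document doc = new Document();
    doc.add(new StringField("sex", "male", Store.NO));              // inverted, for filtering
    doc.add(new SortedDocValuesField("sex", new BytesRef("male"))); // column store, for sort/group
    doc.add(new NumericDocValuesField("age", 32));                  // column store, numeric stats
    w.addDocument(doc);
    w.close();

    // Sorting reads the DocValues column directly; no FieldCache uninversion.
    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
    TopDocs top = searcher.search(new MatchAllDocsQuery(), 10,
        new Sort(new SortField("age", SortField.Type.LONG)));
    System.out.println(top.totalHits);
  }
}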

In addition, Lucene has much better memory use, because terms are no longer UTF-16 strings and are in large shared byte arrays. So a lot of those other "optimizations" are handled in a different way in Lucene 4 and Lucene 5 (coming out the next few days).

Uwe

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

From: yannianmu(?ꉄ?N) [mailto:yannianmu@tencent.com] 
Sent: Thursday, January 29, 2015 12:59 PM
To: general; dev; commits
Subject: Our Optimize Suggestions on lucene 3.5

 

 

 

Dear Lucene dev

    We are from the the Hermes team. Hermes is a project base on lucene 3.5 and solr 3.5.

Hermes process 100 billions documents per day,2000 billions document for total days (two month). Nowadays our single cluster index size is over then 200Tb,total size is 600T. We use lucene for the big data warehouse  speed up .reduce the analysis response time, for example filter like this age=32 and keywords like 'lucene'  or do some thing like count ,sum,order by group by and so on.

 

    Hermes could filter a data form 1000billions in 1 secondes.10billions data`s order by taken 10s,10billions data`s group by thaken 15 s,10 billions days`s sum,avg,max,min stat taken 30 s

For those purpose,We made lots of improve base on lucene and solr , nowadays lucene has change so much since version 4.10, the coding has change so much.so we don`t want to commit our code to lucene .only to introduce our imporve base on luene 3.5,and introduce how hermes can process 100billions documents per day on 32 Physical Machines.we think it may be helpfull for some people who have the similary sense .

 

 


 


First level index(tii)?CLoading by Demand


Original:

1. .tii file is load to ram by TermInfosReaderIndex

2. that may quite slowly by first open Index

3. the index need open by Persistence,once open it ,nevel close it.

4. this cause will limit the number of the index.when we have thouthand of index,that will Impossible.

Our improve:

1. Loading by Demand,not all fields need to load into memory 

2. we modify the method getIndexOffset(dichotomy) on disk, not on memory,but we use lru cache to speed up it.

3. getIndexOffset on disk can save lots of memory,and can reduce times when open a index

4. hermes often open different index for dirrerent Business; when the index is not often to used ,we will to close it.(manage by lru)

5. such this my 1 Physical Machine can store over then 100000 number of index.

Solve the problem:

1. hermes need to store over then 1000billons documents,we have not enough memory to store the tii file

2. we have over then 100000 number of index,if all is opend ,that will weast lots of file descriptor,the file system will not allow.

 


Build index on Hdfs


1. We modifyed lucene 3.5 code at 2013.so that we can build index direct on hdfs.(lucene has support hdfs since 4.0)

2. All the offline data is build by mapreduce on hdfs.

3. we move all the realtime index from local disk to hdfs 

4. we can ignore disk failure because of index on hdfs

5. we can move process from on machine to another machine on hdfs

6. we can quick recover index when a disk failure happend .

7. we does need recover data when a machine is broker(the Index is so big move need lots of hours),the process can quick move to other machine by zookeeper heartbeat.

8. all we know index on hdfs is slower then local file system,but why ? local file system the OS make so many optimization, use lots cache to speed up random access. so we also need a optimization on hdfs.that is why some body often said that hdfs index is so slow the reason is that you didn`t optimize it .

9. we split the hdfs file into fix length block,1kb per block.and then use a lru cache to cache it ,the tii file and some frequent terms will speed up.

10. some hdfs file does`t need to close Immediately we make a lru cache to cache it ,to reduce the frequent of open file.

 


Improve solr, so that one core can dynamic process multy index.


Original:

1. a solr core(one process) only process 1~N index by solr config

Our improve:

2. use a partion like oracle or hadoop hive.not build only one big index,instand build lots of index by day(month,year,or other partion)

3. dynamic create table for dynamic businiss

Solve the problem:

1. to solve the index is to big over then Interger.maxvalue, docid overflow

2. some times the searcher not need to search all of the data ,may be only need recent 3 days.

 


Label mark technology for doc values


Original:

1. group by,sort,sum,max,min ,avg those stats method need to read Original from tis file

2. FieldCacheImpl load all the term values into memory for solr fieldValueCache,Even if i only stat one record .

3. first time search is quite slowly because of to build the fieldValueCache and load all the term values into memory

Our improve:

1. General situation,the data has a lot of repeat value,for exampe the sex file ,the age field .

2. if we store the original value ,that will weast a lot of storage.
so we make a small modify at TermInfosWriter, Additional add a new filed called termNumber.
make a unique term sort by term through TermInfosWriter, and then gave each term a unique  Number from begin to end  (mutch like solr UnInvertedField). 

3. we use termNum(we called label) instead of Term.we store termNum(label) into a file called doctotm. the doctotm file is order by docid,lable is store by fixed length. the file could be read by random read(like fdx it store by fixed length),the file doesn`t need load all into memory.

4. the label`s order is the same with terms order .so if we do some calculation like order by or group by only read the label. we don`t need to read the original value.

5. some field like sex field ,only have 2 different values.so we only use 2 bits(not 2 bytes) to store the label, it will save a lot of Disk io.

6. when we finish all of the calculation, we translate label to Term by a dictionary.

7. if a lots of rows have the same original value ,the original value we only store once,onley read once.

Solve the problem:

1. Hermes`s data is quite big we don`t have enough memory to load all Values to memory like lucene FieldCacheImpl or solr UnInvertedField.

2. on realtime mode ,data is change Frequent , The cache is invalidated Frequent by append or update. build FieldCacheImpl will take a lot of times and io;

3. the Original value is lucene Term. it is a string type.  whene sortring or grouping ,thed string value need a lot of memory and need lot of cpu time to calculate hashcode \\compare <file:///\\compare>  \\equals <file:///\\equals>  ,But label is number  is fast.

4. the label is number ,it`s type mabbe short ,or maybe byte ,or may be integer whitch depending on the max number of the label.

5. read the original value will need lot of io, need iterate tis file.even though we just need to read only docunent.

6. Solve take a lot of time when first build FieldCacheImpl.

 

 


two-phase search


Original:

1. group by order by use original value,the real value may be is a string type,may be more larger ,the real value maybe  need a lot of io  because of to read tis,frq file

2. compare by string is slowly then compare by integer

Our improve:

1. we split one search into multy-phase search

2. the first search we only search the field that use for order by ,group by 

3. the first search we doesn`t need to read the original value(the real value),we only need to read the docid and label(see < Label mark technology for doc values>) for order by group by.

4. when we finish all the order by and group by ,may be we only need to return Top n records .so we start next to search to get the Top n records original value.

Solve the problem:

1. reduce io ,read original take a lot of disk io

2. reduce network io (for merger)

3. most of the field has repeated value, the repeated only need to read once

the group by filed only need to read the origina once by label whene display to user.

4. most of the search only need to display on Top n (n<=100) results, so use to phrase search some original value could be skip.

 

 


multy-phase indexing


1. hermes doesn`t update index one by one,it use batch index

2. the index area is split into four area ,they are called doclist=>buffer index=>ram index=>diskIndex/hdfsIndex

3. doclist only store the solrinputdocument for the batch update or append

4. buffer index is a ramdirectory ,use for merge doclist to index.

5. ram index is also a ramdirector ,but it is biger then buffer index, it can be search by the user.

6. disk/hdfs index is Persistence store use for big index

7. we also use wal called binlog(like mysql binlog) for recover

 

 

 


two-phase commit for update


1. we doesn`t update record once by once like solr(solr is search by term,found the document,delete it,and then append a new one),one by one is slowly.

2. we need Atomic inc field ,solr that can`t support ,solr only support replace field value.
Atomic inc field need to read the last value first ,and then increace it`s value.

3. hermes use pre mark delete,batch commit to update a document.

4. if a document is state is premark ,it also could be search by the user,unil we commit it.
we modify SegmentReader ,split deletedDocs into to 3 part. one part is called deletedDocstmp whitch is for pre mark (pending delete),another one is called deletedDocs_forsearch which is for index search, another is also call deletedDocs 

5. once we want to pending delete a document,we operate deletedDocstmp (a openbitset)to mark one document is pending delete.

and then we append our new value to doclist area(buffer area)

the pending delete means user also could search the old value.

the buffer area means user couldn`t search the new value.

but when we commit it(batch)

the old value is realy droped,and flush all the buffer area to Ram area(ram area can be search)

6. the pending delete we called visual delete,after commit it we called physics delete

7. hermes ofthen visula delete a lots of document ,and then commit once ,to improve up the Performance one by one 

8. also we use a lot of cache to speed up the atomic inc field.

 

 

 


Term data skew


Original:

1. lucene use inverted index to store term and doclist.

2. some filed like sex  has only to value male or female, so male while have 50% of doclist.

3. solr use filter cache to cache the FQ,FQ is a openbitset which store the doclist.

4. when the firest time to use FQ(not cached),it will read a lot of doclist to build openbitset ,take a lot of disk io.

5. most of the time we only need the TOP n doclist,we dosn`t care about the score sort.

 

Our improve:

1. we often combination other fq,to use the skip doclist to skip the docid that not used( we may to seed the query methord called advance) 

2. we does`n cache the openbitset by FQ ,we cache the frq files block into memeory, to speed up the place often read.

3. our index is quite big ,if we cache the FQ(openbitset),that will take a lots of memory

4. we modify the indexSearch  to support real Top N search and ignore the doc score sort

 

Solve the problem:

1. data skew take a lot of disk io to read not necessary doclist.

2. 2000billions index is to big,the FQ cache (filter cache) user openbitset take a lot of memor

3. most of the search ,only need the top N result ,doesn`t need score sort,we need to speed up the search time

 

 

 


Block-Buffer-Cache


Openbitset,fieldvalueCache need to malloc a big long[] or int[] array. it is ofen seen by lots of cache ,such as UnInvertedField,fieldCacheImpl,filterQueryCache and so on. most of time  much of the elements is zero(empty),

Original:

1. we create the big array directly,when we doesn`t neet we drop it to JVM GC

Our improve:

1. we split the big arry into fix length block,witch block is a small array,but fix 1024 length .

2. if a block `s element is almost empty(element is zero),we use hashmap to instead of array

3. if a block `s non zero value is empty(length=0),we couldn`t create this block arrry only use a null to instead of array

4. when the block is not to use ,we collectoion the array to buffer ,next time we reuse it

Solve the problem:

1. save memory

2. reduce the jvm Garbage collection take a lot of cpu resource.

 

 


weakhashmap,hashmap , synchronized problem


1. FieldCacheImpl use weakhashmap to manage field value cache,it has memory leak BUG.

2. sorlInputDocunent use a lot of hashmap,linkhashmap for field,that weast a lot of memory

3. AttributeSource use weakhashmap to cache class impl,and use a global synchronized reduce performance

4. AttributeSource is a base class , NumbericField extends AttributeSource,but they create a lot of hashmap,but NumbericField never use it .

5. all of this ,JVM GC take a lot of burder for the never used hashmap.

 

Our improve:

1. weakhashmap is not high performance ,we use softReferance instead of it 

2. reuse NumbericField avoid create AttributeSource frequent

3. not use global synchronized

 

when we finish this optimization our process,speed up from 20000/s to 60000/s (1k per document).

 

 

 


Other GC optimization


1. reuse byte[] arry in the inputbuffer ,outpuer buffer .

2. reuse byte[] arry in the RAMfile

3. remove some finallze method, the not necessary.

4. use StringHelper.intern to reuse the field name in solrinputdocument

 

 


Directory optimization


1. index commit doesn`t neet sync all the field

2. we use a block cache on top of FsDriectory and hdfsDirectory to speed up read sppedn 

3. we close index or index file that not often to used.also we limit the index that allow max open;block cache is manager by LRU

 

 


network optimization


1. optimization ThreadPool in searchHandle class ,some times does`t need keep alive connection,and increate the timeout time for large Index.

2. remove jetty ,we write socket by myself ,jetty import data is not high performance

3. we change the data import form push mode to pull mode with like apache storm.

 


append mode,optimization


1. append mode we doesn`t store the field value to fdt file.that will take a lot of io on index merger, but it is doesn`t need.

2. we store the field data to a single file ,the files format is hadoop sequence file ,we use LZO compress to save io

3. we make a pointer to point docid to sequencefile

 

 


non tokenizer field optimization


1. non tokenizer field we doesn`t store the field value to fdt field.

2. we read the field value from label (see  <<Label mark technology for doc values>>)

3. most of the field has duplicate value?Cthis can reduce the index file size

 

 


multi level of merger server


1. solr can only use on shard to act as a merger server .

2. we use multi level of merger server to merge all shards result

3. shard on the same mathine have the high priority to merger by the same mathine merger server.

solr`s merger is like this

 

 

hermes`s merger is like this

 


other optimize


1. hermes support Sql .

2. support union Sql from different tables;

3. support view table

 

 

 

 

 


finallze


Hermes`sql may be like this

• select higo_uuid,thedate,ddwuid,dwinserttime,ddwlocaltime,dwappid,dwinituserdef1,dwclientip,sclientipv6,dwserviceip,dwlocaiip,dwclientversion,dwcmd,dwsubcmd,dwerrid,dwuserdef1,dwuserdef2,dwuserdef3,dwuserdef4,cloglevel,szlogstr from sngsearch06,sngsearch09,sngsearch12 where thedate in ('20140917') and ddwuin=5713 limit 0,20

• select thedate,ddwuin,dwinserttime,ddwlocaltime from sngsearch12 where thedate in ('20140921') and ddwuin=5713 order by ddwlocaltime desc limit 0,10

• select count(*),count(ddwuid) from sngsearch03 where thedate=20140921 limit 0,100

• select sum(acnt),average(acnt),max(acnt),min(acnt) from sngsearch03 where thedate=20140921 limit 0,100

• select thedate,ddwuid,sum(acnt),count(*) from sngsearch18 where thedate in (20140908) and ddwuid=7823 group by thedate,ddwuid limit 0,100;

• select count(*) from guangdiantong where thedate ='20141010' limit 0,100

• select freqtype,fspenttime,fmodname,yyyymmddhhmmss,hermestime,freqid from guangdiantong where thedate ='20141010' limit 0,100

• select freqtype,fspenttime,fmodname,yyyymmddhhmmss,hermestime,freqid from guangdiantong where thedate ='20141010' order by yyyymmddhhmmss desc limit 0,10

• select miniute1,count(*) from guangdiantong where thedate ='20141010' group by miniute1 limit 0,100

• select miniute5,count(*) from guangdiantong where thedate ='20141010' group by miniute5 limit 0,100

• select hour,miniute15,count(*) from guangdiantong where thedate ='20141010' group by hour,miniute15 order by miniute15 desc limit 0,100

• select hour,count(*),sum(fspenttime),average(fspenttime),average(ferrorcode) from guangdiantong where thedate ='20141010' and freqtype=1 group by hour limit 0,100

• select freqtype,count(*),sum(fspenttime),average(fspenttime) from guangdiantong where thedate ='20141010' and (freqtype>=10000 and freqtype<=10100) group by freqtype limit 0,100

• select freqtype,count(*),sum(fspenttime),average(fspenttime) from guangdiantong where thedate ='20141010' and (freqtype>=10000 and freqtype<=10100) group by freqtype order by average(fspenttime) desc limit 0,100

• select hour,miniute15,count(*),sum(fspenttime),average(fspenttime) from guangdiantong where thedate ='20141010' group by hour,miniute15 order by miniute15 desc limit 0,100

• select thedate,yyyymmddhhmmss,miniute1,miniute5,miniute15,hour,hermestime,freqtype,freqname,freqid,fuid,fappid,fmodname,factionname,ferrorcode,ferrormsg,foperateret,ferrortype,fcreatetime,fspenttime,fserverip,fversion from guangdiantong where thedate ='20141010' order by yyyymmddhhmmss desc limit 0,100

 

 

  _____  

yannianmu(母延年)

 


RE: Re: Our Optimize Suggestions on lucene 3.5 (Internet mail)

Posted by "yannianmu (母延年)" <ya...@tencent.com>.
Hi Uwe,
  But as a matter of fact, it is true that FieldCacheImpl has a memory leak;
a WeakHashMap releases the values` memory only when expungeStaleEntries is called,
and expungeStaleEntries only works once the weak keys have been garbage collected.

first, let us look at the key type in FieldCacheImpl:
final Object readerKey = reader.getCoreCacheKey();
it is an IndexReader (SegmentReader) object. If the reader is still referenced, the key cannot be garbage collected, and the value will never be released by expungeStaleEntries.
but as we know, the IndexReader (SegmentReader) object is often cached in a Solr core and kept alive.
you did not hit the memory leak because your index size is not big enough, but we often hit it when our index size is over 10000000 documents; so we hold the values weakly instead of the keys, and because we still want to cache them for a while we use soft references rather than weak ones.
lucene`s design does not encourage closing readers often; we usually keep the reader open persistently.
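A tiny standalone illustration of the argument above (demo code only, not Hermes or Lucene code; readerKey merely stands in for a SegmentReader that a Solr core keeps alive): the WeakHashMap entry survives every GC for as long as the key is strongly referenced, no matter how large the value is.

import java.util.Map;
import java.util.WeakHashMap;

public class WeakKeyDemo {
    static Object readerKey = new Object();   // stands in for a SegmentReader held by a Solr core

    public static void main(String[] args) {
        Map<Object, long[]> fieldCache = new WeakHashMap<Object, long[]>();
        fieldCache.put(readerKey, new long[50000000]);   // a large "uninverted field" (~400 MB)

        System.gc();
        System.out.println(fieldCache.size());   // still 1: the key is strongly referenced, nothing is freed

        readerKey = null;                        // only when the reader becomes unreachable...
        System.gc();
        System.out.println(fieldCache.size());   // ...can the entry go away (typically prints 0)
    }
}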



RE: Re: Our Optimize Suggestions on lucene 3.5

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi yannianmu,

What you propose here is a so-called „soft reference cache“. That’s something completely different. In fact FieldCache was never a “cache” because it was never able to evict entries on memory pressure. The weak map has a different reason (please note, it is “weak keys” not “weak values” as you do in your implementation). Weak maps are not useful for caches at all. They are useful to decouple two object instances from each other.

The weak map in FieldCacheImpl had the reason to prevent memory leaks if you open multiple IndexReaders on the same segments. If the last one is closed and garbage collected, the corresponding uninverted field should disappear. And this works correctly. But this does not mean that uninverted fields are removed on memory pressure: once loaded the uninverted stuff keeps alive until all referring readers are closed – this is the idea behind the design, so there is no memory leak! If you want a cache that discards the cached entries on memory pressure, implement your own field”cache” (in fact a real “cache” like you did).

Uwe

P.S.: FieldCache was a bad name, because it was no “cache”. This is why it should be used as “UninvertingReader” now.

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

 <http://www.thetaphi.de/> http://www.thetaphi.de

eMail: uwe@thetaphi.de

 



Re: Re: Our Optimize Suggestions on lucene 3.5

Posted by "yannianmu (母延年)" <ya...@tencent.com>.
WeakHashMap may cause a memory leak problem.

we use SoftReference instead of it, like this:


  // SoftLinkMap replaces FieldCacheImpl`s WeakHashMap: a bounded, access-ordered LRU map
  // whose values are held through SoftReferences, so entries can be dropped both by LRU
  // and by the GC under memory pressure. Entry here is FieldCacheImpl.Entry; the types used
  // are java.lang.ref.SoftReference and java.util.LinkedHashMap/HashMap/Map/Set.
  // Callers are expected to synchronize externally, as FieldCacheImpl already does on readerCache.
  public static class SoftLinkMap {
    private static final int SORT_CACHE_SIZE = 1024;
    private static final float LOADFACTOR = 0.75f;

    final Map<Object, SoftReference<Map<Entry, Object>>> readerCache_lru =
        new LinkedHashMap<Object, SoftReference<Map<Entry, Object>>>(
            (int) Math.ceil(SORT_CACHE_SIZE / LOADFACTOR) + 1, LOADFACTOR, true) {
          @Override
          protected boolean removeEldestEntry(Map.Entry<Object, SoftReference<Map<Entry, Object>>> eldest) {
            return size() > SORT_CACHE_SIZE;   // evict the least recently used entry
          }
        };

    public void remove(Object key) {
      readerCache_lru.remove(key);
    }

    public Map<Entry, Object> get(Object key) {
      SoftReference<Map<Entry, Object>> w = readerCache_lru.get(key);
      if (w == null) {
        return null;
      }
      return w.get();   // may be null if the GC has already cleared the soft reference
    }

    public void put(Object key, Map<Entry, Object> value) {
      readerCache_lru.put(key, new SoftReference<Map<Entry, Object>>(value));
    }

    // snapshot that skips entries whose soft references were already cleared
    public Set<java.util.Map.Entry<Object, Map<Entry, Object>>> entrySet() {
      HashMap<Object, Map<Entry, Object>> rtn = new HashMap<Object, Map<Entry, Object>>();
      for (java.util.Map.Entry<Object, SoftReference<Map<Entry, Object>>> e : readerCache_lru.entrySet()) {
        Map<Entry, Object> v = e.getValue().get();
        if (v != null) {
          rtn.put(e.getKey(), v);
        }
      }
      return rtn.entrySet();
    }
  }

  final SoftLinkMap readerCache = new SoftLinkMap();
  // previously: final Map<Object,Map<Entry,Object>> readerCache = new WeakHashMap<Object,Map<Entry,Object>>();


________________________________
yannianmu(母延年)

From: Robert Muir<ma...@gmail.com>
Date: 2015-01-30 12:03
To: dev@lucene.apache.org<ma...@lucene.apache.org>
Subject: Re: Our Optimize Suggestions on lucene 3.5
I am not sure this is the case. Actually, FieldCacheImpl still works as before and has a weak hashmap still.

However, i think the weak map is unnecessary. reader close listeners already ensure purging from the map, so I don't think the weak map serves any purpose today. The only possible advantage it has is to allow you to GC fieldcaches when you are already leaking readers... it could just be a regular map IMO.

On Thu, Jan 29, 2015 at 9:35 AM, Uwe Schindler <uw...@thetaphi.de>> wrote:
Hi,
parts of your suggestions are already done in Lucene 4+. For one part I can tell you:
weakhashmap,hashmap , synchronized problem

1. FieldCacheImpl use weakhashmap to manage field value cache,it has memory leak BUG.

2. sorlInputDocunent use a lot of hashmap,linkhashmap for field,that weast a lot of memory

3. AttributeSource use weakhashmap to cache class impl,and use a global synchronized reduce performance

4. AttributeSource is a base class , NumbericField extends AttributeSource,but they create a lot of hashmap,but NumbericField never use it .

5. all of this ,JVM GC take a lot of burder for the never used hashmap.
All Lucene items no longer apply:

1.       FieldCache is gone and is no longer supported in Lucene 5. You should use the new DocValues index format for that (column based storage, optimized for sorting, numeric). You can still use Lucene's UninvertingReader, but this one has no weak maps anymore because it is no cache.

2.       No idea about that one - its unrelated to Lucene

3.       AttributeSource no longer uses this, since Lucene 4.8 it uses Java 7's java.lang.ClassValue to attach the implementation class to the interface. No concurrency problems anymore. It also uses MethodHandles to invoke the attribute classes.

4.       NumericField no longer exists, the base class does not use AttributeSource. All field instances now automatically reuse the inner TokenStream instances across fields, too!

5.       See above
In addition, Lucene has much better memory use, because terms are no longer UTF-16 strings and are in large shared byte arrays. So a lot of those other "optimizations" are handled in a different way in Lucene 4 and Lucene 5 (coming out the next few days).
Uwe
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de<http://www.thetaphi.de/>
eMail: uwe@thetaphi.de<ma...@thetaphi.de>

From: yannianmu(?ꉄ?N) [mailto:yannianmu@tencent.com<ma...@tencent.com>]
Sent: Thursday, January 29, 2015 12:59 PM
To: general; dev; commits
Subject: Our Optimize Suggestions on lucene 3.5






Dear Lucene dev

    We are from the the Hermes team. Hermes is a project base on lucene 3.5 and solr 3.5.

Hermes process 100 billions documents per day,2000 billions document for total days (two month). Nowadays our single cluster index size is over then 200Tb,total size is 600T. We use lucene for the big data warehouse  speed up .reduce the analysis response time, for example filter like this age=32 and keywords like 'lucene'  or do some thing like count ,sum,order by group by and so on.



    Hermes could filter a data form 1000billions in 1 secondes.10billions data`s order by taken 10s,10billions data`s group by thaken 15 s,10 billions days`s sum,avg,max,min stat taken 30 s

For those purpose,We made lots of improve base on lucene and solr , nowadays lucene has change so much since version 4.10, the coding has change so much.so we don`t want to commit our code to lucene .only to introduce our imporve base on luene 3.5,and introduce how hermes can process 100billions documents per day on 32 Physical Machines.we think it may be helpfull for some people who have the similary sense .






First level index(tii)?CLoading by Demand

Original:

1. .tii file is load to ram by TermInfosReaderIndex

2. that may quite slowly by first open Index

3. the index need open by Persistence,once open it ,nevel close it.

4. this cause will limit the number of the index.when we have thouthand of index,that will Impossible.

Our improve:

1. Loading by Demand,not all fields need to load into memory

2. we modify the method getIndexOffset(dichotomy) on disk, not on memory,but we use lru cache to speed up it.

3. getIndexOffset on disk can save lots of memory,and can reduce times when open a index

4. hermes often open different index for dirrerent Business; when the index is not often to used ,we will to close it.(manage by lru)

5. such this my 1 Physical Machine can store over then 100000 number of index.

Solve the problem:

1. hermes need to store over then 1000billons documents,we have not enough memory to store the tii file

2. we have over then 100000 number of index,if all is opend ,that will weast lots of file descriptor,the file system will not allow.



Build index on Hdfs

1. We modifyed lucene 3.5 code at 2013.so that we can build index direct on hdfs.(lucene has support hdfs since 4.0)

2. All the offline data is build by mapreduce on hdfs.

3. we move all the realtime index from local disk to hdfs

4. we can ignore disk failure because of index on hdfs

5. we can move process from on machine to another machine on hdfs

6. we can quick recover index when a disk failure happend .

7. we does need recover data when a machine is broker(the Index is so big move need lots of hours),the process can quick move to other machine by zookeeper heartbeat.

8. all we know index on hdfs is slower then local file system,but why ? local file system the OS make so many optimization, use lots cache to speed up random access. so we also need a optimization on hdfs.that is why some body often said that hdfs index is so slow the reason is that you didn`t optimize it .

9. we split the hdfs file into fix length block,1kb per block.and then use a lru cache to cache it ,the tii file and some frequent terms will speed up.

10. some hdfs file does`t need to close Immediately we make a lru cache to cache it ,to reduce the frequent of open file.



Improve solr, so that one core can dynamic process multy index.

Original:

1. a solr core(one process) only process 1~N index by solr config

Our improve:

2. use a partion like oracle or hadoop hive.not build only one big index,instand build lots of index by day(month,year,or other partion)

3. dynamic create table for dynamic businiss

Solve the problem:

1. to solve the index is to big over then Interger.maxvalue, docid overflow

2. some times the searcher not need to search all of the data ,may be only need recent 3 days.



Label mark technology for doc values

Original:

1. group by, sort, sum, max, min and avg all need to read the original values from the tis file.

2. FieldCacheImpl loads all the term values into memory (as Solr's fieldValueCache does), even if we only want statistics over one record.

3. The first search is quite slow because it has to build the field value cache and load every term value into memory.

Our improvement:

1. In the common case the data contains many repeated values, for example the sex field or the age field.

2. Storing the original value for every document would waste a lot of storage. So we made a small change to TermInfosWriter and added a new field called termNumber: the terms are already unique and sorted when they pass through TermInfosWriter, so each term is given a consecutive number from the first term to the last (much like Solr's UnInvertedField).

3. We use the termNumber (we call it the label) instead of the Term. The labels are stored in a file called doctotm, ordered by docid, each label with a fixed length. Like the fdx file, the fixed-length layout allows random reads, so the file never needs to be loaded completely into memory.

4. The label order is the same as the term order, so calculations such as order by or group by only read the labels; the original values are not needed.

5. A field like sex has only 2 distinct values, so we use only 2 bits (not 2 bytes) per document to store the label, which saves a lot of disk IO (see the packing sketch below).

6. When all of the calculation is finished, the labels of the returned rows are translated back to Terms through a dictionary.

7. If many rows share the same original value, that value is stored once and read once.

Problems solved:

1. Hermes's data is so big that we cannot load all values into memory the way Lucene's FieldCacheImpl or Solr's UnInvertedField do.

2. In realtime mode the data changes frequently, and every append or update invalidates the cache; rebuilding FieldCacheImpl costs a lot of time and IO.

3. The original value is a Lucene Term, i.e. a string. Sorting or grouping on strings needs a lot of memory and CPU time for hashCode, compare and equals, while the label is a number and is fast.

4. The label is a number whose type may be byte, short or integer, depending on how many distinct labels the field has.

5. Reading the original values costs a lot of IO and requires iterating the tis file, even when we only need a single document.

6. It avoids the long build time of the first FieldCacheImpl construction.
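
A minimal sketch of the fixed-width label storage, assuming each docid is written once and bitsPerValue is below 64; this is not the real doctotm format, only the bit-packing idea (2 bits per document for a 2-value field such as sex).

    // Illustrative only: per-document labels (term ordinals) packed with a fixed bit width.
    class PackedLabels {
        private final long[] blocks;
        private final int bitsPerValue;   // must be < 64

        PackedLabels(int numDocs, int bitsPerValue) {
            this.bitsPerValue = bitsPerValue;
            this.blocks = new long[(int) (((long) numDocs * bitsPerValue + 63) / 64)];
        }

        // assumes each docId is set exactly once (slots start at zero)
        void set(int docId, int label) {
            long bitPos = (long) docId * bitsPerValue;
            int idx = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            blocks[idx] |= ((long) label) << shift;
            int spill = shift + bitsPerValue - 64;
            if (spill > 0) {                              // value crosses a long boundary
                blocks[idx + 1] |= ((long) label) >>> (bitsPerValue - spill);
            }
        }

        int get(int docId) {
            long bitPos = (long) docId * bitsPerValue;
            int idx = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long value = blocks[idx] >>> shift;
            int spill = shift + bitsPerValue - 64;
            if (spill > 0) {
                value |= blocks[idx + 1] << (bitsPerValue - spill);
            }
            return (int) (value & ((1L << bitsPerValue) - 1));
        }
    }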





two-phase search

Original:

1. group by and order by work on the original values; the real value may be a string, may be large, and reading it costs a lot of IO on the tis and frq files.

2. Comparing strings is slower than comparing integers.

Our improvement:

1. We split one search into multiple phases (see the sketch below).

2. In the first phase we only search the fields used for order by and group by.

3. In the first phase we do not read the original values at all; we only read the docid and the label (see <Label mark technology for doc values>) for the order by and group by.

4. After the order by and group by are finished we usually only need the top N records, so a second phase fetches the original values for just those top N records.

Problems solved:

1. Less disk IO: reading the original values is expensive.

2. Less network IO (for the merger).

3. Most fields have repeated values, and each repeated value only has to be read once: when the results are displayed, the group-by field reads each original value once per label.

4. Most searches only display the top N (N <= 100) results, so with two-phase search most original values are never read at all.
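
A minimal sketch of the two phases, with illustrative interfaces standing in for the label store and the term dictionary; this is not the Hermes query code, only the shape of it.

    import java.util.Arrays;
    import java.util.Comparator;

    // Illustrative only: phase one orders by the numeric label, phase two resolves
    // the original term text for just the top N hits.
    class TwoPhaseSearch {
        interface LabelReader    { int label(int docId); }      // e.g. backed by PackedLabels
        interface TermDictionary { String term(int label); }    // label -> original value

        static String[] topNValues(Integer[] candidateDocs, int n,
                                   LabelReader labels, TermDictionary dict) {
            // phase one: cheap integer comparisons only, no original values are read
            Arrays.sort(candidateDocs, Comparator.comparingInt(labels::label));

            // phase two: fetch original values for only the first n documents
            int limit = Math.min(n, candidateDocs.length);
            String[] result = new String[limit];
            for (int i = 0; i < limit; i++) {
                result[i] = dict.term(labels.label(candidateDocs[i]));
            }
            return result;
        }
    }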





multi-phase indexing

1. Hermes does not update the index document by document; it indexes in batches.

2. The index is split into four areas, called doclist => buffer index => ram index => disk/HDFS index (a sketch follows below).

3. The doclist only stores the SolrInputDocuments of a batch of updates or appends.

4. The buffer index is a RAMDirectory used to merge the doclist into an index.

5. The ram index is also a RAMDirectory, but bigger than the buffer index, and it can be searched by users.

6. The disk/HDFS index is the persistent store for the big index.

7. We also use a WAL called the binlog (like the MySQL binlog) for recovery.

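
A minimal sketch of the batch write path; the thresholds and all names are assumptions for illustration, and the actual merging into the RAMDirectory and disk/HDFS index is only hinted at in the comments.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: documents move through the four areas in batches, never one by one.
    class TieredIndexer {
        private final List<Object> docList = new ArrayList<>();  // raw pending documents
        private int bufferedDocs = 0;                             // docs held in the buffer RAM index
        private static final int DOCLIST_LIMIT = 10_000;          // assumed batch size
        private static final int BUFFER_LIMIT  = 200_000;         // assumed buffer size

        void add(Object solrInputDocument) {
            appendToBinlog(solrInputDocument);       // WAL first, so a crash can be replayed
            docList.add(solrInputDocument);
            if (docList.size() >= DOCLIST_LIMIT) {
                flushDoclistToBuffer();              // doclist -> buffer RAMDirectory (not searchable)
            }
        }

        private void flushDoclistToBuffer() {
            bufferedDocs += docList.size();
            docList.clear();
            if (bufferedDocs >= BUFFER_LIMIT) {
                mergeBufferIntoRamIndex();           // buffer -> searchable RAM index
                bufferedDocs = 0;
            }
        }

        private void mergeBufferIntoRamIndex() { /* merge, then periodically RAM -> disk/HDFS */ }
        private void appendToBinlog(Object doc) { /* write-ahead log entry */ }
    }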



two-phase commit for update

1. We do not update records one by one the way Solr does (search by term, find the document, delete it, then append a new one); one-by-one updates are slow.

2. We need atomic increment fields, which Solr cannot do; Solr can only replace a field value. An atomic increment has to read the last value first and then increase it.

3. Hermes updates a document with a pre-mark delete plus a batch commit.

4. A document in the pre-mark state can still be found by searches until we commit. We modified SegmentReader and split deletedDocs into 3 parts: deletedDocstmp for the pre-marks (pending deletes), deletedDocs_forsearch used by index search, and the normal deletedDocs.

5. To pending-delete a document we set its bit in deletedDocstmp (an OpenBitSet) and append the new version of the document to the doclist (buffer) area. Pending delete means users can still search the old value; the buffer area means users cannot yet search the new value. When we commit the batch, the old values are really dropped and the whole buffer area is flushed to the ram area, which is searchable (see the sketch below).

6. We call the pending delete a visual delete; after the commit it becomes a physical delete.

7. Hermes usually visual-deletes a large number of documents and then commits once, which is much faster than committing one by one.

8. We also use a lot of caching to speed up the atomic increment fields.
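
A minimal sketch of the visual-delete / batch-commit idea using java.util.BitSet in place of Lucene's OpenBitSet; the real code keeps three bitsets inside SegmentReader, this only shows the state transition.

    import java.util.BitSet;

    // Illustrative only: pending deletes stay searchable until the batch commit.
    class PendingDeletes {
        private final BitSet pendingDelete = new BitSet();  // deletedDocstmp in the text above
        private final BitSet deleted = new BitSet();         // real deletedDocs

        void visualDelete(int docId) {
            pendingDelete.set(docId);        // old value is still visible to searches
        }

        boolean isSearchable(int docId) {
            return !deleted.get(docId);      // pending deletes are still returned
        }

        synchronized void commitBatch() {
            deleted.or(pendingDelete);       // physical delete: drop all pending docs at once
            pendingDelete.clear();
            // at this point the buffered new versions are flushed to the searchable RAM index
        }
    }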







Term data skew

Original:

1. Lucene uses an inverted index to store each term and its doclist.

2. A field like sex has only two values, male and female, so "male" alone may cover 50% of all documents.

3. Solr uses the filter cache to cache an FQ; the FQ is an OpenBitSet over the doclist.

4. The first time an FQ is used (not yet cached) it reads a huge doclist to build the OpenBitSet, which costs a lot of disk IO.

5. Most of the time we only need the top N documents and do not care about score ordering.



Our improvement:

1. We usually combine the skewed term with the other FQs and use the skip lists to jump over docids that cannot match (we drive the query with the advance method; see the sketch below).

2. We do not cache an OpenBitSet per FQ; instead we cache blocks of the frq files in memory to speed up the frequently read places.

3. Our index is very large; caching an OpenBitSet per FQ would take far too much memory.

4. We modified IndexSearcher to support a real top-N search that ignores score sorting.



Problems solved:

1. Data skew no longer costs a lot of disk IO reading doclist entries that are never needed.

2. With 2000 billion documents the index is too big for the OpenBitSet-based FQ (filter) cache; it would take too much memory.

3. Most searches only need the top N results and no score sort, and we need them to be fast.
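
A minimal sketch of driving the intersection from the rarer iterator and skipping through the skewed doclist with advance(); the iterator interface only mirrors the shape of Lucene's DocIdSetIterator and is not the real class.

    // Illustrative only: leapfrog intersection that collects at most topN hits, no scoring.
    class Intersection {
        interface DocIterator {
            int NO_MORE_DOCS = Integer.MAX_VALUE;
            int nextDoc();               // next matching doc id, or NO_MORE_DOCS
            int advance(int target);     // first matching doc id >= target
        }

        static int[] topN(DocIterator rare, DocIterator skewed, int topN) {
            int[] hits = new int[topN];
            int count = 0;
            int doc = rare.nextDoc();
            while (doc != DocIterator.NO_MORE_DOCS && count < topN) {
                int other = skewed.advance(doc);      // skip, never scan the huge list linearly
                if (other == doc) {
                    hits[count++] = doc;
                    doc = rare.nextDoc();
                } else {
                    doc = rare.advance(other);        // leapfrog to the larger position
                }
            }
            return java.util.Arrays.copyOf(hits, count);
        }
    }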







Block-Buffer-Cache

OpenBitSet and the field value caches need to allocate big long[] or int[] arrays. This pattern shows up in many caches, such as UnInvertedField, FieldCacheImpl and the filter query cache, and most of the time most of the elements are zero (empty).

Original:

1. We allocated the big array directly, and when it was no longer needed we dropped it for the JVM GC to collect.

Our improvement:

1. We split the big array into fixed-length blocks, each block a small array of 1024 elements (see the sketch below).

2. If a block is almost empty (nearly all elements are zero), we use a hashmap instead of an array for it.

3. If a block has no non-zero values at all, we do not allocate it; a null stands in for the array.

4. When a block is no longer used we return the array to a buffer pool and reuse it next time.

Problems solved:

1. It saves memory.

2. It avoids the large amount of CPU the JVM garbage collector would otherwise spend on these arrays.
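
A minimal sketch of the blocked array with null (all-zero) blocks and a reuse pool; the hashmap fallback for nearly-empty blocks is left out to keep it short, and all names are illustrative.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Illustrative only: a big logical long[] split into 1024-element blocks;
    // untouched blocks stay null, released blocks go back to a pool instead of the GC.
    class BlockedLongArray {
        private static final int BLOCK_SIZE = 1024;
        private static final Deque<long[]> POOL = new ArrayDeque<>();

        private final long[][] blocks;

        BlockedLongArray(long capacity) {
            blocks = new long[(int) ((capacity + BLOCK_SIZE - 1) / BLOCK_SIZE)][];
        }

        long get(long index) {
            long[] block = blocks[(int) (index / BLOCK_SIZE)];
            return block == null ? 0L : block[(int) (index % BLOCK_SIZE)];  // null block == all zero
        }

        void set(long index, long value) {
            int b = (int) (index / BLOCK_SIZE);
            if (blocks[b] == null) {
                if (value == 0L) return;                 // stay sparse, nothing to store
                long[] reused;
                synchronized (POOL) { reused = POOL.poll(); }
                if (reused != null) java.util.Arrays.fill(reused, 0L);
                blocks[b] = reused != null ? reused : new long[BLOCK_SIZE];
            }
            blocks[b][(int) (index % BLOCK_SIZE)] = value;
        }

        void release() {                                 // give blocks back for reuse
            synchronized (POOL) {
                for (long[] block : blocks) {
                    if (block != null) POOL.push(block);
                }
            }
        }
    }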





WeakHashMap, HashMap and synchronized problems

1. FieldCacheImpl uses a WeakHashMap to manage the field value cache; it can leak memory.

2. SolrInputDocument uses a lot of HashMap/LinkedHashMap instances for its fields, which wastes a lot of memory.

3. AttributeSource uses a WeakHashMap to cache attribute implementation classes, guarded by a global synchronized block, which hurts performance.

4. AttributeSource is a base class and NumericField extends it, so every NumericField creates those hashmaps even though it never uses them.

5. All of this puts a heavy burden on the JVM GC for hashmaps that are never used.



Our improvement:

1. WeakHashMap does not perform well; we use SoftReference-based caches instead (see the sketch below).

2. We reuse NumericField instances to avoid creating AttributeSource objects frequently.

3. We do not use a global synchronized block.



After this optimization our ingest rate went from 20,000 documents/s to 60,000 documents/s (at about 1 KB per document).
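
A minimal sketch of a SoftReference-based cache without a global synchronized block; it only shows the replacement pattern, not the actual Hermes field cache.

    import java.lang.ref.SoftReference;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Illustrative only: entries survive until the JVM is actually short of memory,
    // instead of disappearing at the next GC cycle as with weak references.
    class SoftCache<K, V> {
        private final ConcurrentMap<K, SoftReference<V>> map = new ConcurrentHashMap<>();

        V get(K key) {
            SoftReference<V> ref = map.get(key);
            V value = ref == null ? null : ref.get();
            if (ref != null && value == null) {
                map.remove(key, ref);       // value was collected, clean up the entry
            }
            return value;
        }

        void put(K key, V value) {
            map.put(key, new SoftReference<>(value));   // no global synchronized block needed
        }
    }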







Other GC optimization

1. Reuse the byte[] arrays in the input and output buffers.

2. Reuse the byte[] arrays in RAMFile.

3. Remove some finalize methods that are not necessary.

4. Use StringHelper.intern to reuse the field names in SolrInputDocument.





Directory optimization

1. An index commit does not need to sync every file.

2. We use a block cache on top of FSDirectory and HdfsDirectory to speed up reads.

3. We close indexes and index files that are rarely used, and we limit how many indexes may be open at once; the block cache is managed by an LRU policy.





network optimization

1. We tuned the thread pool in the SearchHandler class: connections do not always need keep-alive, and we increased the timeout for large indexes.

2. We removed Jetty and wrote the socket layer ourselves; importing data through Jetty was not fast enough.

3. We changed the data import from push mode to pull mode, similar to Apache Storm.



Append mode optimization

1. In append mode we do not store the field values in the fdt file; that would cost a lot of IO during index merges and it is not needed.

2. We store the field data in a separate file in Hadoop SequenceFile format, LZO-compressed to save IO.

3. We keep a pointer from each docid into the sequence file (a sketch of such a pointer follows below).
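
A minimal sketch of the per-document pointer into the external data file; the 12-byte layout and the names are assumptions for illustration only.

    import java.nio.ByteBuffer;

    // Illustrative only: a fixed-size pointer per document, so the stored record can be
    // located by docId without keeping the values in the fdt file.
    class DocPointer {
        final int fileId;     // which sequence file holds the record
        final long offset;    // byte offset of the record inside that file

        DocPointer(int fileId, long offset) {
            this.fileId = fileId;
            this.offset = offset;
        }

        // pack into 12 bytes so an array of pointers can be addressed by docId directly
        void writeTo(ByteBuffer out) {
            out.putInt(fileId).putLong(offset);
        }

        static DocPointer readFrom(ByteBuffer in) {
            return new DocPointer(in.getInt(), in.getLong());
        }
    }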





Non-tokenized field optimization

1. For non-tokenized fields we do not store the field value in the fdt file.

2. We read the field value from the label instead (see <<Label mark technology for doc values>>).

3. Most of these fields have many duplicate values, so this reduces the index size considerably.





multi level of merger server

1. Solr can only use one shard to act as the merge server.

2. We use multiple levels of merge servers to merge the results of all shards (see the sketch below).

3. Shards on the same machine are merged first, with priority, by the merge server on that machine.

Solr's merge path: every shard sends its result directly to the single merge server (diagram omitted).

Hermes's merge path: shards are merged locally on each machine first, and only those partial results are merged at the next level (diagram omitted).
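
A minimal sketch of the two-level merge for a group-by count: the local merge runs next to the shards on each machine, and only the per-machine partial results cross the network to the top-level merger. All names are illustrative.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative only: merging group-by counts in two levels.
    class TwoLevelMerge {
        // merge the group-by counts from several shards into one partial result
        static Map<String, Long> mergeShards(List<Map<String, Long>> shardResults) {
            Map<String, Long> merged = new HashMap<>();
            for (Map<String, Long> shard : shardResults) {
                for (Map.Entry<String, Long> e : shard.entrySet()) {
                    merged.merge(e.getKey(), e.getValue(), Long::sum);
                }
            }
            return merged;
        }

        // level 1: one local merge per machine; level 2: one global merge over the partials
        static Map<String, Long> mergeCluster(List<List<Map<String, Long>>> perMachineShards) {
            List<Map<String, Long>> machinePartials = new ArrayList<>();
            for (List<Map<String, Long>> machine : perMachineShards) {
                machinePartials.add(mergeShards(machine));   // runs next to the shards
            }
            return mergeShards(machinePartials);             // only partials cross the network
        }
    }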

Other optimizations

1. Hermes supports SQL.

2. It supports SQL unions across different tables.

3. It supports view tables.











Finally

Hermes SQL looks like this:

• select higo_uuid,thedate,ddwuid,dwinserttime,ddwlocaltime,dwappid,dwinituserdef1,dwclientip,sclientipv6,dwserviceip,dwlocaiip,dwclientversion,dwcmd,dwsubcmd,dwerrid,dwuserdef1,dwuserdef2,dwuserdef3,dwuserdef4,cloglevel,szlogstr from sngsearch06,sngsearch09,sngsearch12 where thedate in ('20140917') and ddwuin=5713 limit 0,20



• select thedate,ddwuin,dwinserttime,ddwlocaltime from sngsearch12 where thedate in ('20140921') and ddwuin=5713 order by ddwlocaltime desc  limit 0,10

• select count(*),count(ddwuid) from sngsearch03 where thedate=20140921 limit 0,100

• select sum(acnt),average(acnt),max(acnt),min(acnt) from sngsearch03 where thedate=20140921  limit 0,100

• select thedate,ddwuid,sum(acnt),count(*) from sngsearch18 where thedate in (20140908) and ddwuid=7823 group by thedate,ddwuid limit 0,100;

• select count(*) from guangdiantong where thedate ='20141010' limit 0,100

• select freqtype,fspenttime,fmodname,yyyymmddhhmmss,hermestime,freqid from guangdiantong where thedate ='20141010' limit 0,100

• select freqtype,fspenttime,fmodname,yyyymmddhhmmss,hermestime,freqid from guangdiantong where thedate ='20141010' order by yyyymmddhhmmss desc  limit 0,10


• select miniute1,count(*) from guangdiantong where thedate ='20141010' group by miniute1 limit 0,100

• select miniute5,count(*) from guangdiantong where thedate ='20141010' group by miniute5 limit 0,100

• select hour,miniute15,count(*) from guangdiantong where thedate ='20141010' group by hour,miniute15 order by miniute15 desc limit 0,100

• select hour,count(*),sum(fspenttime),average(fspenttime),average(ferrorcode) from guangdiantong where thedate ='20141010' and freqtype=1  group by hour limit 0,100

• select freqtype,count(*),sum(fspenttime),average(fspenttime) from guangdiantong where thedate ='20141010' and (freqtype>=10000 and freqtype<=10100) group by freqtype limit 0,100

• select freqtype,count(*),sum(fspenttime),average(fspenttime) from guangdiantong where thedate ='20141010' and (freqtype>=10000 and freqtype<=10100) group by freqtype order by average(fspenttime) desc limit 0,100


• select hour,miniute15,count(*),sum(fspenttime),average(fspenttime) from guangdiantong where thedate ='20141010' group by hour,miniute15 order by miniute15 desc limit 0,100


• select thedate,yyyymmddhhmmss,miniute1,miniute5,miniute15,hour,hermestime,freqtype,freqname,freqid,fuid,fappid,fmodname,factionname,ferrorcode,ferrormsg,foperateret,ferrortype,fcreatetime,fspenttime,fserverip,fversion from guangdiantong where thedate ='20141010' order by yyyymmddhhmmss desc limit 0,100



________________________________
yannianmu(母延年)


Re: Our Optimize Suggestions on lucene 3.5

Posted by Robert Muir <rc...@gmail.com>.
I think this is all fine. Because things are keyed on core-reader and
there are already core listeners installed to purge when the ref count
for a core drops to zero.

honestly if you change the map to a regular one, all tests pass.



RE: Our Optimize Suggestions on lucene 3.5

Posted by Uwe Schindler <uw...@thetaphi.de>.
Sorry Robert – you’re right,

 

I had the impression that we changed that already. In fact, the WeakHashMap is needed, because multiple readers (especially Slow ones) can share the same uninverted fields. In the ideal world, we should change the whole stuff and remove FieldCacheImpl completely and let the field maps stay directly on the UninvertingReader as regular member fields. The only problem with this is: if you have multiple UninvertigReaders, all of them have separate uninverted instances. But this is a bug already if you do this.

 

Uwe

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 


Re: Our Optimize Suggestions on lucene 3.5

Posted by Robert Muir <rc...@gmail.com>.
I am not sure this is the case. Actually, FieldCacheImpl still works as
before and has a weak hashmap still.

However, i think the weak map is unnecessary. reader close listeners
already ensure purging from the map, so I don't think the weak map serves
any purpose today. The only possible advantage it has is to allow you to GC
fieldcaches when you are already leaking readers... it could just be a
regular map IMO.

On Thu, Jan 29, 2015 at 9:35 AM, Uwe Schindler <uw...@thetaphi.de> wrote:

> Hi,
>
> parts of your suggestions are already done in Lucene 4+. For one part I
> can tell you:
> weakhashmap,hashmap , synchronized problem
>
> 1. FieldCacheImpl use weakhashmap to manage field value cache,it has memory
> leak BUG.
>
> 2. sorlInputDocunent use a lot of hashmap,linkhashmap for field,that
> weast a lot of memory
>
> 3. AttributeSource use weakhashmap to cache class impl,and use a global
> synchronized reduce performance
>
> 4. AttributeSource is a base class , NumbericField extends AttributeSource,but
> they create a lot of hashmap,but NumbericField never use it .
>
> 5. all of this ,JVM GC take a lot of burder for the never used hashmap.
>
> All Lucene items no longer apply:
>
> 1.       FieldCache is gone and is no longer supported in Lucene 5. You
> should use the new DocValues index format for that (column-based storage,
> optimized for sorting, numeric). You can still use Lucene’s
> UninvertingReader, but this one has no weak maps anymore because it is not
> a cache.
>
> 2.       No idea about that one - it's unrelated to Lucene
>
> 3.       AttributeSource no longer uses this, since Lucene 4.8 it uses
> Java 7’s java.lang.ClassValue to attach the implementation class to the
> interface. No concurrency problems anymore. It also uses MethodHandles to
> invoke the attribute classes.
>
> 4.       NumericField no longer exists, the base class does not use
> AttributeSource. All field instances now automatically reuse the inner
> TokenStream instances across fields, too!
>
> 5.       See above
>
> In addition, Lucene has much better memory use, because terms are no
> longer UTF-16 strings and are in large shared byte arrays. So a lot of
> those other “optimizations” are handled in a different way in Lucene 4 and
> Lucene 5 (coming out the next few days).
>
> Uwe
>
> -----
>
> Uwe Schindler
>
> H.-H.-Meier-Allee 63, D-28213 Bremen
>
> http://www.thetaphi.de
>
> eMail: uwe@thetaphi.de
>

Re: RE: Our Optimize Suggestions on lucene 3.5(Internet mail)

Posted by "yannianmu (母延年)" <ya...@tencent.com>.
I have just read the DocValues source code from 4.10.3.
That is exactly what we wanted. Lucene always surprises us, thank you.
The code below is awesome:
it supports random-access reads (it does not load all values into memory),
and SortedDocValues is implemented similarly to our approach; the code looks beautiful.

LongValues getNumeric(NumericEntry entry) throws IOException {
    RandomAccessInput slice = this.data.randomAccessSlice(entry.offset, entry.endOffset - entry.offset);
    switch (entry.format) {
      case DELTA_COMPRESSED:
        final long delta = entry.minValue;
        final LongValues values = DirectReader.getInstance(slice, entry.bitsPerValue);
        return new LongValues() {
          @Override
          public long get(long id) {
            return delta + values.get(id);
          }
        };
      case GCD_COMPRESSED:
        final long min = entry.minValue;
        final long mult = entry.gcd;
        final LongValues quotientReader = DirectReader.getInstance(slice, entry.bitsPerValue);
        return new LongValues() {
          @Override
          public long get(long id) {
            return min + mult * quotientReader.get(id);
          }
        };
      case TABLE_COMPRESSED:
        final long table[] = entry.table;
        final LongValues ords = DirectReader.getInstance(slice, entry.bitsPerValue);
        return new LongValues() {
          @Override
          public long get(long id) {
            return table[(int) ords.get(id)];
          }
        };
      default:
        throw new AssertionError();
    }
  }
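
As a usage sketch only (assuming the 4.10.x AtomicReader/SortedDocValues API; the
OrdinalGroupBy class and the field name are made up for the example), the SortedDocValues
ordinals play the same role as our label: group and compare by ordinal first, and translate
ordinals back to terms only when presenting the result:

import java.io.IOException;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.util.BytesRef;

public final class OrdinalGroupBy {
  /** Counts documents per distinct value of a sorted-doc-values field, using ordinals as labels. */
  public static void countByOrdinal(AtomicReader reader, String field) throws IOException {
    SortedDocValues dv = reader.getSortedDocValues(field);
    if (dv == null) {
      return; // this segment has no doc values for the field
    }
    int[] counts = new int[dv.getValueCount()];
    for (int doc = 0; doc < reader.maxDoc(); doc++) {  // deleted docs ignored for brevity
      int ord = dv.getOrd(doc);                        // small integer "label"
      if (ord >= 0) {
        counts[ord]++;
      }
    }
    // translate labels back to terms only once, when presenting the result
    BytesRef term = new BytesRef();
    for (int ord = 0; ord < counts.length; ord++) {
      dv.lookupOrd(ord, term);
      System.out.println(term.utf8ToString() + " -> " + counts[ord]);
    }
  }
}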



________________________________
yannianmu(母延年)

From: Uwe Schindler<ma...@thetaphi.de>
Date: 2015-01-29 22:35
To: dev@lucene.apache.org<ma...@lucene.apache.org>
Subject: RE: Our Optimize Suggestions on lucene 3.5(Internet mail)
Hi,
parts of your suggestions are already done in Lucene 4+. For one part I can tell you:
weakhashmap, hashmap, synchronized problems

1. FieldCacheImpl uses a WeakHashMap to manage the field value cache; it has a memory leak bug.

2. SolrInputDocument uses a lot of HashMap/LinkedHashMap instances for its fields, which wastes a lot of memory.

3. AttributeSource uses a WeakHashMap to cache class implementations, and a global synchronized block that reduces performance.

4. AttributeSource is a base class; NumericField extends AttributeSource, so a lot of HashMaps are created, but NumericField never uses them.

5. Because of all this, the JVM GC carries a heavy burden for the never-used HashMaps.
All Lucene items no longer apply:

1.       FieldCache is gone and is no longer supported in Lucene 5. You should use the new DocValues index format for that (column-based storage, optimized for sorting, numeric). You can still use Lucene’s UninvertingReader, but this one has no weak maps anymore because it is not a cache.

2.       No idea about that one - it's unrelated to Lucene

3.       AttributeSource no longer uses this, since Lucene 4.8 it uses Java 7’s java.lang.ClassValue to attach the implementation class to the interface. No concurrency problems anymore. It also uses MethodHandles to invoke the attribute classes.

4.       NumericField no longer exists, the base class does not use AttributeSource. All field instances now automatically reuse the inner TokenStream instances across fields, too!

5.       See above
In addition, Lucene has much better memory use, because terms are no longer UTF-16 strings and are in large shared byte arrays. So a lot of those other “optimizations” are handled in a different way in Lucene 4 and Lucene 5 (coming out the next few days).
Uwe
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de<http://www.thetaphi.de/>
eMail: uwe@thetaphi.de


RE: Our Optimize Suggestions on lucene 3.5

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

parts of your suggestions are already done in Lucene 4+. For one part I can tell you:


weakhashmap, hashmap, synchronized problems


1. FieldCacheImpl uses a WeakHashMap to manage the field value cache; it has a memory leak bug.

2. SolrInputDocument uses a lot of HashMap/LinkedHashMap instances for its fields, which wastes a lot of memory.

3. AttributeSource uses a WeakHashMap to cache class implementations, and a global synchronized block that reduces performance.

4. AttributeSource is a base class; NumericField extends AttributeSource, so a lot of HashMaps are created, but NumericField never uses them.

5. Because of all this, the JVM GC carries a heavy burden for the never-used HashMaps.

All Lucene items no longer apply:

1.       FieldCache is gone and is no longer supported in Lucene 5. You should use the new DocValues index format for that (column-based storage, optimized for sorting, numeric; see the first sketch after this list). You can still use Lucene’s UninvertingReader, but this one has no weak maps anymore because it is not a cache.

2.       No idea about that one - it's unrelated to Lucene

3.       AttributeSource no longer uses this, since Lucene 4.8 it uses Java 7’s java.lang.ClassValue to attach the implementation class to the interface (see the second sketch after this list). No concurrency problems anymore. It also uses MethodHandles to invoke the attribute classes.

4.       NumericField no longer exists, the base class does not use AttributeSource. All field instances now automatically reuse the inner TokenStream instances across fields, too!

5.       See above
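
To make item 1 concrete, here is a minimal, self-contained sketch in the Lucene 5.x style; the Lucene classes are real, but the DocValuesSortExample class, the field names, the values and the RAMDirectory are chosen only for illustration:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public final class DocValuesSortExample {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      for (long acnt : new long[] {42, 7, 99}) {
        Document doc = new Document();
        doc.add(new StringField("id", "doc-" + acnt, Field.Store.YES));
        // column-stride value used for sorting/aggregations; no FieldCache needed
        doc.add(new NumericDocValuesField("acnt", acnt));
        writer.addDocument(doc);
      }
    }
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      // sort on the doc values column without un-inverting the index
      TopDocs top = searcher.search(new MatchAllDocsQuery(), 10,
          new Sort(new SortField("acnt", SortField.Type.LONG)));
      System.out.println("hits: " + top.totalHits);
    }
  }
}

And for item 3, a small sketch of the java.lang.ClassValue mechanism itself (the MethodCountCache class is made up for the example): computeValue runs at most once per class and later lookups are lock-free, which is why no weak map or global synchronization is needed:

/** Hypothetical per-class cache built on java.lang.ClassValue (Java 7+). */
final class MethodCountCache extends ClassValue<Integer> {
  @Override
  protected Integer computeValue(Class<?> type) {
    // expensive per-class work happens here, at most once per class
    return type.getDeclaredMethods().length;
  }
}

// usage:
// MethodCountCache cache = new MethodCountCache();
// int n = cache.get(String.class);   // computed once, then served from the cache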

In addition, Lucene has much better memory use, because terms are no longer UTF-16 strings and are in large shared byte arrays. So a lot of those other “optimizations” are handled in a different way in Lucene 4 and Lucene 5 (coming out the next few days).

Uwe

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de <http://www.thetaphi.de/> 

eMail: uwe@thetaphi.de

 

From: yannianmu(母延年) [mailto:yannianmu@tencent.com] 
Sent: Thursday, January 29, 2015 12:59 PM
To: general; dev; commits
Subject: Our Optimize Suggestions on lucene 3.5

 

 

 

Dear Lucene dev

    We are from the the Hermes team. Hermes is a project base on lucene 3.5 and solr 3.5.

Hermes process 100 billions documents per day,2000 billions document for total days (two month). Nowadays our single cluster index size is over then 200Tb,total size is 600T. We use lucene for the big data warehouse  speed up .reduce the analysis response time, for example filter like this age=32 and keywords like 'lucene'  or do some thing like count ,sum,order by group by and so on.

 

    Hermes could filter a data form 1000billions in 1 secondes.10billions data`s order by taken 10s,10billions data`s group by thaken 15 s,10 billions days`s sum,avg,max,min stat taken 30 s

For those purpose,We made lots of improve base on lucene and solr , nowadays lucene has change so much since version 4.10, the coding has change so much.so we don`t want to commit our code to lucene .only to introduce our imporve base on luene 3.5,and introduce how hermes can process 100billions documents per day on 32 Physical Machines.we think it may be helpfull for some people who have the similary sense .

 

 


 


First level index(tii),Loading by Demand


Original:

1. .tii file is load to ram by TermInfosReaderIndex

2. that may quite slowly by first open Index

3. the index need open by Persistence,once open it ,nevel close it.

4. this cause will limit the number of the index.when we have thouthand of index,that will Impossible.

Our improve:

1. Loading by Demand,not all fields need to load into memory 

2. we modify the method getIndexOffset(dichotomy) on disk, not on memory,but we use lru cache to speed up it.

3. getIndexOffset on disk can save lots of memory,and can reduce times when open a index

4. hermes often open different index for dirrerent Business; when the index is not often to used ,we will to close it.(manage by lru)

5. such this my 1 Physical Machine can store over then 100000 number of index.

Solve the problem:

1. hermes need to store over then 1000billons documents,we have not enough memory to store the tii file

2. we have over then 100000 number of index,if all is opend ,that will weast lots of file descriptor,the file system will not allow.

 


Build index on Hdfs


1. We modifyed lucene 3.5 code at 2013.so that we can build index direct on hdfs.(lucene has support hdfs since 4.0)

2. All the offline data is build by mapreduce on hdfs.

3. we move all the realtime index from local disk to hdfs 

4. we can ignore disk failure because of index on hdfs

5. we can move process from on machine to another machine on hdfs

6. we can quick recover index when a disk failure happend .

7. we does need recover data when a machine is broker(the Index is so big move need lots of hours),the process can quick move to other machine by zookeeper heartbeat.

8. all we know index on hdfs is slower then local file system,but why ? local file system the OS make so many optimization, use lots cache to speed up random access. so we also need a optimization on hdfs.that is why some body often said that hdfs index is so slow the reason is that you didn`t optimize it .

9. we split the hdfs file into fix length block,1kb per block.and then use a lru cache to cache it ,the tii file and some frequent terms will speed up.

10. some hdfs file does`t need to close Immediately we make a lru cache to cache it ,to reduce the frequent of open file.

 


Improve solr, so that one core can dynamic process multy index.


Original:

1. a solr core(one process) only process 1~N index by solr config

Our improve:

2. use a partion like oracle or hadoop hive.not build only one big index,instand build lots of index by day(month,year,or other partion)

3. dynamic create table for dynamic businiss

Solve the problem:

1. to solve the index is to big over then Interger.maxvalue, docid overflow

2. some times the searcher not need to search all of the data ,may be only need recent 3 days.

 


Label mark technology for doc values


Original:

1. group by,sort,sum,max,min ,avg those stats method need to read Original from tis file

2. FieldCacheImpl load all the term values into memory for solr fieldValueCache,Even if i only stat one record .

3. first time search is quite slowly because of to build the fieldValueCache and load all the term values into memory

Our improve:

1. General situation,the data has a lot of repeat value,for exampe the sex file ,the age field .

2. if we store the original value ,that will weast a lot of storage.
so we make a small modify at TermInfosWriter, Additional add a new filed called termNumber.
make a unique term sort by term through TermInfosWriter, and then gave each term a unique  Number from begin to end  (mutch like solr UnInvertedField). 

3. we use termNum(we called label) instead of Term.we store termNum(label) into a file called doctotm. the doctotm file is order by docid,lable is store by fixed length. the file could be read by random read(like fdx it store by fixed length),the file doesn`t need load all into memory.

4. the label`s order is the same with terms order .so if we do some calculation like order by or group by only read the label. we don`t need to read the original value.

5. some field like sex field ,only have 2 different values.so we only use 2 bits(not 2 bytes) to store the label, it will save a lot of Disk io.

6. when we finish all of the calculation, we translate label to Term by a dictionary.

7. if a lots of rows have the same original value ,the original value we only store once,onley read once.

Solve the problem:

1. Hermes`s data is quite big we don`t have enough memory to load all Values to memory like lucene FieldCacheImpl or solr UnInvertedField.

2. on realtime mode ,data is change Frequent , The cache is invalidated Frequent by append or update. build FieldCacheImpl will take a lot of times and io;

3. the Original value is lucene Term. it is a string type.  whene sortring or grouping ,thed string value need a lot of memory and need lot of cpu time to calculate hashcode \compare \equals ,But label is number  is fast.

4. the label is number ,it`s type mabbe short ,or maybe byte ,or may be integer whitch depending on the max number of the label.

5. read the original value will need lot of io, need iterate tis file.even though we just need to read only docunent.

6. Solve take a lot of time when first build FieldCacheImpl.

 

 


two-phase search


Original:

1. group by order by use original value,the real value may be is a string type,may be more larger ,the real value maybe  need a lot of io  because of to read tis,frq file

2. compare by string is slowly then compare by integer

Our improve:

1. we split one search into multy-phase search

2. the first search we only search the field that use for order by ,group by 

3. the first search we doesn`t need to read the original value(the real value),we only need to read the docid and label(see < Label mark technology for doc values>) for order by group by.

4. when we finish all the order by and group by ,may be we only need to return Top n records .so we start next to search to get the Top n records original value.

Solve the problem:

1. reduce io ,read original take a lot of disk io

2. reduce network io (for merger)

3. most of the field has repeated value, the repeated only need to read once

the group by filed only need to read the origina once by label whene display to user.

4. most of the search only need to display on Top n (n<=100) results, so use to phrase search some original value could be skip.

 

 


multy-phase indexing


1. hermes doesn`t update index one by one,it use batch index

2. the index area is split into four area ,they are called doclist=>buffer index=>ram index=>diskIndex/hdfsIndex

3. doclist only store the solrinputdocument for the batch update or append

4. buffer index is a ramdirectory ,use for merge doclist to index.

5. ram index is also a ramdirector ,but it is biger then buffer index, it can be search by the user.

6. disk/hdfs index is Persistence store use for big index

7. we also use wal called binlog(like mysql binlog) for recover



 

 


two-phase commit for update


1. we doesn`t update record once by once like solr(solr is search by term,found the document,delete it,and then append a new one),one by one is slowly.

2. we need Atomic inc field ,solr that can`t support ,solr only support replace field value.
Atomic inc field need to read the last value first ,and then increace it`s value.

3. hermes use pre mark delete,batch commit to update a document.

4. if a document is state is premark ,it also could be search by the user,unil we commit it.
we modify SegmentReader ,split deletedDocs into to 3 part. one part is called deletedDocstmp whitch is for pre mark (pending delete),another one is called deletedDocs_forsearch which is for index search, another is also call deletedDocs 

5. once we want to pending delete a document,we operate deletedDocstmp (a openbitset)to mark one document is pending delete.

and then we append our new value to doclist area(buffer area)

the pending delete means user also could search the old value.

the buffer area means user couldn`t search the new value.

but when we commit it(batch)

the old value is realy droped,and flush all the buffer area to Ram area(ram area can be search)

6. the pending delete we called visual delete,after commit it we called physics delete

7. hermes ofthen visula delete a lots of document ,and then commit once ,to improve up the Performance one by one 

8. also we use a lot of cache to speed up the atomic inc field.

 

 

 


Term data skew


Original:

1. lucene use inverted index to store term and doclist.

2. some filed like sex  has only to value male or female, so male while have 50% of doclist.

3. solr use filter cache to cache the FQ,FQ is a openbitset which store the doclist.

4. when the firest time to use FQ(not cached),it will read a lot of doclist to build openbitset ,take a lot of disk io.

5. most of the time we only need the TOP n doclist,we dosn`t care about the score sort.

 

Our improve:

1. we often combination other fq,to use the skip doclist to skip the docid that not used( we may to seed the query methord called advance) 

2. we does`n cache the openbitset by FQ ,we cache the frq files block into memeory, to speed up the place often read.

3. our index is quite big ,if we cache the FQ(openbitset),that will take a lots of memory

4. we modify the indexSearch  to support real Top N search and ignore the doc score sort

 

Solve the problem:

1. data skew take a lot of disk io to read not necessary doclist.

2. 2000billions index is to big,the FQ cache (filter cache) user openbitset take a lot of memor

3. most of the search ,only need the top N result ,doesn`t need score sort,we need to speed up the search time

 

 

 


Block-Buffer-Cache


Openbitset,fieldvalueCache need to malloc a big long[] or int[] array. it is ofen seen by lots of cache ,such as UnInvertedField,fieldCacheImpl,filterQueryCache and so on. most of time  much of the elements is zero(empty),

Original:

1. we create the big array directly,when we doesn`t neet we drop it to JVM GC

Our improve:

1. we split the big arry into fix length block,witch block is a small array,but fix 1024 length .

2. if a block `s element is almost empty(element is zero),we use hashmap to instead of array

3. if a block `s non zero value is empty(length=0),we couldn`t create this block arrry only use a null to instead of array

4. when the block is not to use ,we collectoion the array to buffer ,next time we reuse it

Solve the problem:

1. save memory

2. reduce the jvm Garbage collection take a lot of cpu resource.

 

 


weakhashmap,hashmap , synchronized problem


1. FieldCacheImpl use weakhashmap to manage field value cache,it has memory leak BUG.

2. sorlInputDocunent use a lot of hashmap,linkhashmap for field,that weast a lot of memory

3. AttributeSource use weakhashmap to cache class impl,and use a global synchronized reduce performance

4. AttributeSource is a base class , NumbericField extends AttributeSource,but they create a lot of hashmap,but NumbericField never use it .

5. all of this ,JVM GC take a lot of burder for the never used hashmap.

 

Our improve:

1. weakhashmap is not high performance ,we use softReferance instead of it 

2. reuse NumbericField avoid create AttributeSource frequent

3. not use global synchronized

 

when we finish this optimization our process,speed up from 20000/s to 60000/s (1k per document).

 

 

 


Other GC optimization


1. reuse byte[] arry in the inputbuffer ,outpuer buffer .

2. reuse byte[] arry in the RAMfile

3. remove some finallze method, the not necessary.

4. use StringHelper.intern to reuse the field name in solrinputdocument

 

 


Directory optimization


1. index commit doesn`t neet sync all the field

2. we use a block cache on top of FsDriectory and hdfsDirectory to speed up read sppedn 

3. we close index or index file that not often to used.also we limit the index that allow max open;block cache is manager by LRU

 

 


network optimization


1. optimization ThreadPool in searchHandle class ,some times does`t need keep alive connection,and increate the timeout time for large Index.

2. remove jetty ,we write socket by myself ,jetty import data is not high performance

3. we change the data import form push mode to pull mode with like apache storm.

 


append mode,optimization


1. append mode we doesn`t store the field value to fdt file.that will take a lot of io on index merger, but it is doesn`t need.

2. we store the field data to a single file ,the files format is hadoop sequence file ,we use LZO compress to save io

3. we make a pointer to point docid to sequencefile

 

 


non tokenizer field optimization


1. non tokenizer field we doesn`t store the field value to fdt field.

2. we read the field value from label (see  <<Label mark technology for doc values>>)

3. most of the field has duplicate value,this can reduce the index file size

 

 


multi level of merger server


1. solr can only use on shard to act as a merger server .

2. we use multi level of merger server to merge all shards result

3. shard on the same mathine have the high priority to merger by the same mathine merger server.

Solr's merger looks like this (figure omitted in this plain-text archive).

Hermes's merger looks like this (figure omitted in this plain-text archive): shard results are merged on each machine first and then across machines, as sketched below.
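
An illustrative sketch of the two-level merge (plain Java maps standing in for shard responses; not the Hermes merger code): shards on one machine are combined by a local merger first, so the top-level merger only sees one partial result per machine.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoLevelMerge {
    /** Merge group-by counts from several shard results into one partial result. */
    static Map<String, Long> merge(List<Map<String, Long>> results) {
        Map<String, Long> merged = new HashMap<String, Long>();
        for (Map<String, Long> result : results) {
            for (Map.Entry<String, Long> e : result.entrySet()) {
                Long prev = merged.get(e.getKey());
                merged.put(e.getKey(), prev == null ? e.getValue() : prev + e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> shard1 = new HashMap<String, Long>();
        shard1.put("male", 10L);
        shard1.put("female", 7L);
        Map<String, Long> shard2 = new HashMap<String, Long>();
        shard2.put("male", 3L);

        // Level 1: shards on the same machine are merged by the local merger.
        Map<String, Long> machineA = merge(Arrays.asList(shard1, shard2));
        // Level 2: the top-level merger combines one partial result per machine.
        System.out.println(merge(Arrays.asList(machineA)));
    }
}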




other optimizations


1. Hermes supports SQL.

2. Supports union SQL across different tables.

3. Supports view tables.


finally


Hermes SQL may look like this:

- select higo_uuid,thedate,ddwuid,dwinserttime,ddwlocaltime,dwappid,dwinituserdef1,dwclientip,sclientipv6,dwserviceip,dwlocaiip,dwclientversion,dwcmd,dwsubcmd,dwerrid,dwuserdef1,dwuserdef2,dwuserdef3,dwuserdef4,cloglevel,szlogstr from sngsearch06,sngsearch09,sngsearch12 where thedate in ('20140917') and ddwuin=5713 limit 0,20

- select thedate,ddwuin,dwinserttime,ddwlocaltime from sngsearch12 where thedate in ('20140921') and ddwuin=5713 order by ddwlocaltime desc limit 0,10

- select count(*),count(ddwuid) from sngsearch03 where thedate=20140921 limit 0,100

- select sum(acnt),average(acnt),max(acnt),min(acnt) from sngsearch03 where thedate=20140921 limit 0,100

- select thedate,ddwuid,sum(acnt),count(*) from sngsearch18 where thedate in (20140908) and ddwuid=7823 group by thedate,ddwuid limit 0,100;

- select count(*) from guangdiantong where thedate ='20141010' limit 0,100

- select freqtype,fspenttime,fmodname,yyyymmddhhmmss,hermestime,freqid from guangdiantong where thedate ='20141010' limit 0,100

- select freqtype,fspenttime,fmodname,yyyymmddhhmmss,hermestime,freqid from guangdiantong where thedate ='20141010' order by yyyymmddhhmmss desc limit 0,10

- select miniute1,count(*) from guangdiantong where thedate ='20141010' group by miniute1 limit 0,100

- select miniute5,count(*) from guangdiantong where thedate ='20141010' group by miniute5 limit 0,100

- select hour,miniute15,count(*) from guangdiantong where thedate ='20141010' group by hour,miniute15 order by miniute15 desc limit 0,100

- select hour,count(*),sum(fspenttime),average(fspenttime),average(ferrorcode) from guangdiantong where thedate ='20141010' and freqtype=1 group by hour limit 0,100

- select freqtype,count(*),sum(fspenttime),average(fspenttime) from guangdiantong where thedate ='20141010' and (freqtype>=10000 and freqtype<=10100) group by freqtype limit 0,100

- select freqtype,count(*),sum(fspenttime),average(fspenttime) from guangdiantong where thedate ='20141010' and (freqtype>=10000 and freqtype<=10100) group by freqtype order by average(fspenttime) desc limit 0,100

- select hour,miniute15,count(*),sum(fspenttime),average(fspenttime) from guangdiantong where thedate ='20141010' group by hour,miniute15 order by miniute15 desc limit 0,100

- select thedate,yyyymmddhhmmss,miniute1,miniute5,miniute15,hour,hermestime,freqtype,freqname,freqid,fuid,fappid,fmodname,factionname,ferrorcode,ferrormsg,foperateret,ferrortype,fcreatetime,fspenttime,fserverip,fversion from guangdiantong where thedate ='20141010' order by yyyymmddhhmmss desc limit 0,100

 

 

  _____  

yannianmu(母延年)


Re: Our Optimize Suggestions on lucene 3.5 (Internet mail)

Posted by "yannianmu (母延年)" <ya...@tencent.com>.
add attachment
________________________________
yannianmu(母延年)

From: yannianmu(母延年)<ma...@tencent.com>
Sent: 2015-01-29 19:59
To: general<ma...@lucene.apache.org>; dev<ma...@lucene.apache.org>; commits<ma...@lucene.apache.org>
Subject: Our Optimize Suggestions on lucene 3.5 (Internet mail)
