Posted to dev@lucene.apache.org by "Ning Li (JIRA)" <ji...@apache.org> on 2007/10/26 03:31:50 UTC

[jira] Created: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Optional Buffer Pool to Improve Search Performance
--------------------------------------------------

                 Key: LUCENE-1035
                 URL: https://issues.apache.org/jira/browse/LUCENE-1035
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Store
            Reporter: Ning Li


An index in RAMDirectory performs better than one in FSDirectory.
But many indexes cannot fit in memory, or applications cannot afford to
spend that much memory on the index. On the other hand, because of locality,
a reasonably sized buffer pool may provide a good improvement over FSDirectory.

This issue aims at providing such an optional buffer pool layer. In cases
where it fits, i.e. a reasonable hit ratio can be achieved, it should provide
a good improvement over FSDirectory.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "Ning Li (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574782#action_12574782 ] 

Ning Li commented on LUCENE-1035:
---------------------------------

> It looks like this was never fully done. I wonder if this should be closed, esp. since Ning might be working on slightly different problems now.

Sorry for the delay. I'll spend some time later this week or early next week to update and make it a contrib patch.

> Optional Buffer Pool to Improve Search Performance
> --------------------------------------------------
>
>                 Key: LUCENE-1035
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1035
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>            Reporter: Ning Li
>         Attachments: LUCENE-1035.patch
>
>
> Index in RAMDirectory provides better performance over that in FSDirectory.
> But many indexes cannot fit in memory or applications cannot afford to
> spend that much memory on index. On the other hand, because of locality,
> a reasonably sized buffer pool may provide good improvement over FSDirectory.
> This issue aims at providing such an optional buffer pool layer. In cases
> where it fits, i.e. a reasonable hit ratio can be achieved, it should provide
> a good improvement over FSDirectory.



[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537827 ] 

Doug Cutting commented on LUCENE-1035:
--------------------------------------

Were the tests run using the same set of queries they were warmed for?  If so, an interesting benchmark might be to, e.g., start with 200 queries, then warm things with the first 100 and use the second for the benchmark.  Ideally you'd start with a log of real queries, but those are hard to obtain.  Over ten years ago I released a 1M query log from Excite, which I still see people reference in papers, so it must be out there somewhere.  It would be better than nothing for these kinds of benchmarks.  Or perhaps we can obtain a copy of the more-recent AOL query log?  Otherwise you've only demonstrated an improvement when queries are frequently repeated.  There are better ways to optimize for that, e.g., by caching hit lists, no?
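
The warm-up/benchmark split suggested above could be sketched as follows. This is only an illustration under stated assumptions: the class `QuerySplit` and its `split` method are hypothetical names, queries are plain strings, and the log is assumed not to repeat queries across the two halves.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Lucene or the attached patch.
final class QuerySplit {
    final List<String> warmUp;    // queries used only to warm caches
    final List<String> measured;  // queries that are actually timed

    private QuerySplit(List<String> warmUp, List<String> measured) {
        this.warmUp = warmUp;
        this.measured = measured;
    }

    // The first half warms things up, the second half is benchmarked,
    // so no timed query was issued during warm-up.
    static QuerySplit split(List<String> queries) {
        int mid = queries.size() / 2;
        return new QuerySplit(
            new ArrayList<>(queries.subList(0, mid)),
            new ArrayList<>(queries.subList(mid, queries.size())));
    }
}
```

With 200 queries this yields 100 warm-up queries and 100 measured queries, matching the numbers in the suggestion.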



[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537977 ] 

Yonik Seeley commented on LUCENE-1035:
--------------------------------------

A couple of random thoughts:
- previous tests showed that vint decoding was often the bottleneck, but these tests seem to indicate otherwise (that the bottleneck is the system call to move data from the OS FS cache to userspace?)... perhaps this is due to the fact that all queries are "AND" and match a max of 1000 docs?  The recent addition of multi-level skipping perhaps removes the vint decoding bottleneck for these types of queries that match few documents.
- most lucene use cases store much more than just the document id... that would really affect locality.
- It seems like a simple LRU cache could really be blown out of the water by certain types of queries (retrieve a lot of stored fields, or do an expanding term query) that would force out all previously cached hotspots.  Most OS level caching has protection against this (multi-level LRU or whatever).  But if our user-level LRU cache fails, we've also messed up the OS level cache since we've been hiding page hits from it.
- I'd like to see single term queries, "OR" queries, and queries across multiple fields (also a common use case) that match more documents tested also.
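
To make the "multi-level LRU" point concrete, here is a minimal sketch of a two-segment LRU that resists one-off scans. It is not taken from the patch or from any OS; the class name `SegmentedLru` and both capacities are assumptions chosen for illustration. New keys enter a probation segment and are promoted to a protected segment only on a second access, so a long scan of keys touched once can evict only other probation entries.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;

// Hypothetical two-level (segmented) LRU; not the patch's BufferPoolLRU.
final class SegmentedLru<K> {
    private final int probationCap;
    private final int protectedCap;
    // accessOrder=true: iteration starts at the least recently used key.
    private final LinkedHashMap<K, Boolean> probation = new LinkedHashMap<>(16, 0.75f, true);
    private final LinkedHashMap<K, Boolean> protectedSeg = new LinkedHashMap<>(16, 0.75f, true);

    SegmentedLru(int probationCap, int protectedCap) {
        this.probationCap = probationCap;
        this.protectedCap = protectedCap;
    }

    // Returns true on a hit. A miss admits the key to probation only.
    boolean access(K key) {
        if (protectedSeg.get(key) != null) {      // get() refreshes recency
            return true;
        }
        if (probation.remove(key) != null) {      // second touch: promote
            protectedSeg.put(key, Boolean.TRUE);
            if (protectedSeg.size() > protectedCap) {
                // Demote the coldest protected key back to probation.
                probation.put(removeEldest(protectedSeg), Boolean.TRUE);
                trimProbation();
            }
            return true;
        }
        probation.put(key, Boolean.TRUE);
        trimProbation();
        return false;
    }

    boolean contains(K key) {
        return probation.containsKey(key) || protectedSeg.containsKey(key);
    }

    private void trimProbation() {
        while (probation.size() > probationCap) {
            removeEldest(probation);
        }
    }

    private static <T> T removeEldest(LinkedHashMap<T, Boolean> map) {
        Iterator<T> it = map.keySet().iterator();
        T eldest = it.next();
        it.remove();
        return eldest;
    }
}
```

A plain LRU of the same total size would lose its hot entries after a scan longer than its capacity; here the hot entries survive in the protected segment.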






[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "Eks Dev (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538065 ] 

Eks Dev commented on LUCENE-1035:
---------------------------------


did you compare it against MMAP? I



[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "Ning Li (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537972 ] 

Ning Li commented on LUCENE-1035:
---------------------------------

> I don't think this is any better than the NIOFileCache directory I had already submitted.

Are you referring to LUCENE-414? I just read it and yes, it's similar to the MemoryLRUCache part. Hopefully this is more general, not just for NioFile.

> It was not really approved because the community felt that it did not offer much over the standard OS file system cache.

Well, it shows it has its value in cases where you can achieve a reasonable hit ratio, right? This is optional. An application can test with its query log first to see the hit ratio and then decide whether to use a buffer pool and if so, how large.
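
One way an application might estimate the hit ratio before committing to a pool size is to replay a trace of block accesses through a simulated LRU. The sketch below assumes the application can record such a trace (for example, file offset divided by buffer size for each read while replaying its query log); the class `HitRatioEstimator` and the trace format are hypothetical and not part of the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical estimator; the block trace format is an assumption.
final class HitRatioEstimator {
    // Replays block accesses through an LRU holding `capacity` buffers
    // and returns hits divided by total accesses.
    static double hitRatio(long[] blockTrace, final int capacity) {
        if (blockTrace.length == 0) {
            return 0.0;
        }
        LinkedHashMap<Long, Boolean> lru =
            new LinkedHashMap<Long, Boolean>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Long, Boolean> eldest) {
                    return size() > capacity;   // evict the least recently used block
                }
            };
        long hits = 0;
        for (long block : blockTrace) {
            if (lru.get(block) != null) {
                hits++;                          // get() also refreshes recency
            } else {
                lru.put(block, Boolean.TRUE);
            }
        }
        return (double) hits / blockTrace.length;
    }
}
```

Running this for several capacities gives a hit-ratio curve from which a pool size can be picked, or which shows that no reasonable size helps.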



[jira] Updated: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "Ning Li (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Li updated LUCENE-1035:
----------------------------

    Attachment: LUCENE-1035.contrib.patch

Redone as a contrib package. Creating BufferPooledDirectory with your customized file name filter for readers allows you to decide which files you want to use the caching layer for.

The package includes some tests. I also modified and tested the core tests with the caching layer in a private setting and all tests passed.
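
As an illustration of the file name filter idea, the sketch below selects only term dictionary and postings files for caching. The filter type that BufferPooledDirectory actually accepts is not shown in this thread, so `java.io.FilenameFilter` is used here purely as an assumption, and `PostingsOnlyFilter` is a hypothetical name. The extensions are those of the non-compound index files of that era (.tis/.tii for the term dictionary, .frq/.prx for postings).

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical filter; the real filter interface in the contrib patch may differ.
final class PostingsOnlyFilter implements FilenameFilter {
    // Term dictionary and postings files; stored fields (.fdt/.fdx) are left out.
    private static final Set<String> CACHED =
        new HashSet<>(Arrays.asList("tis", "tii", "frq", "prx"));

    @Override
    public boolean accept(File dir, String name) {
        int dot = name.lastIndexOf('.');
        return dot >= 0 && CACHED.contains(name.substring(dot + 1));
    }
}
```

Leaving stored fields out keeps large one-off reads from pushing hot postings blocks out of the pool.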



[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "Mike Klaas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538126 ] 

Mike Klaas commented on LUCENE-1035:
------------------------------------

> Query set with average 590K results, retrieving docids for the first 5K

That seems like quite a few docs to retrieve--any particular reason why?  (It would be good to know if the speedup is occurring in the query phase or doc retrieval).  This would also explain why VInt decoding is not the bottleneck (it wouldn't be much-used for retrieving stored fields).

I echo Hoss' comment--proximity searching is important even if it isn't used much _directly_ by users.



[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "Eks Dev (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574759#action_12574759 ] 

Eks Dev commented on LUCENE-1035:
---------------------------------

Robert,
you said: 
....We actually have a multiplexing directory that (depending on file type and size), either opens the file purely in memory, uses a cached file, or lets the OS do the caching. Works really well...

Did you create a patch somewhere, or is this your internal work?

I have a case where this could come in very handy: I plan to use MMAP for postings & co... but FSDirectory for stored fields, as they could easily blow the size... The possibility to select on file type/size makes the MMAP use case much, much closer to many users... one Directory implementation that allows users to select a strategy is indeed perfect: LRU, FSDirectory, MMAP, RAM or whatnot.
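
A per-file strategy choice of the kind described here might look like the sketch below. It is not Robert's multiplexing directory and not part of the patch; `OpenStrategy`, `StrategyChooser` and the size threshold are all hypothetical. It only shows the decision, not how each strategy would open the file.

```java
import java.util.Locale;

// Hypothetical names throughout; this only models the selection logic.
enum OpenStrategy { RAM, MMAP, FS }

final class StrategyChooser {
    // Assumed threshold: files at or below this size are held fully in memory.
    private final long ramLimitBytes;

    StrategyChooser(long ramLimitBytes) {
        this.ramLimitBytes = ramLimitBytes;
    }

    OpenStrategy choose(String fileName, long sizeBytes) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        // Stored fields (.fdt/.fdx) can grow very large: use plain file access.
        if (ext.equals("fdt") || ext.equals("fdx")) {
            return OpenStrategy.FS;
        }
        // Small files are cheap to keep entirely in memory.
        if (sizeBytes <= ramLimitBytes) {
            return OpenStrategy.RAM;
        }
        // Everything else (postings and similar) is memory-mapped.
        return OpenStrategy.MMAP;
    }
}
```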




[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "Ning Li (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537978 ] 

Ning Li commented on LUCENE-1035:
---------------------------------

> Were the tests run using the same set of queries they were warmed for?

Yes, the same set of queries were used. The warm-up and the real run are two separate runs, which means the file system cache is warmed, but not the buffer pool.

Yes, it'd be much better if a real query log could be obtained. I'll take a look at the AOL query log. I used to have an intranet query log which has a lot of term locality. That's why I think this could provide a good improvement.

> There are better ways to optimize for that, e.g., by caching hit lists, no?

That's useful, but it only helps for exact query matches. If there are a lot of shared query terms but no exact query matches, caching hit lists won't help. This is sort of like caching posting lists, but simpler.
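
The difference in cache granularity can be shown with a toy count. The sketch below is an illustration only: `CacheGranularity` is a hypothetical class, queries are space-separated terms, and a "hit" simply means the key was seen before. It models neither hit lists nor the buffer pool itself, just the keys they would be looked up by.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model: counts repeated keys at query level versus term level.
final class CacheGranularity {
    // Hits when the whole query string is the cache key (hit-list caching).
    static int queryLevelHits(String[] queries) {
        Set<String> seen = new HashSet<>();
        int hits = 0;
        for (String q : queries) {
            if (!seen.add(q)) {
                hits++;
            }
        }
        return hits;
    }

    // Hits when each term is a cache key (closer to caching posting data).
    static int termLevelHits(String[] queries) {
        Set<String> seen = new HashSet<>();
        int hits = 0;
        for (String q : queries) {
            for (String term : q.split(" ")) {
                if (!seen.add(term)) {
                    hits++;
                }
            }
        }
        return hits;
    }
}
```

For the queries "apache lucene", "apache solr" and "lucene index", no whole query repeats, yet two term lookups repeat.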



[jira] Updated: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "Ning Li (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Li updated LUCENE-1035:
----------------------------

    Lucene Fields: [Patch Available]  (was: [New])



[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538576 ] 

Doug Cutting commented on LUCENE-1035:
--------------------------------------

Ning, I didn't mean to sound negative about this.  Your benchmarks do show that in some situations this can provide significant speedup.  The question is whether such situations are common enough to warrant adding this to the core.  A way around that might be to layer it on top of FSDirectory and add it to contrib.



[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "Ning Li (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538112 ] 

Ning Li commented on LUCENE-1035:
---------------------------------

> I'll change to "OR" queries and see what happens.

  Query set with average 590K results, retrieving docids for the first 5K
  Buffer Pool Size    Hit Ratio    Queries per second
     0                 N/A             1.9
     16M               53%             1.9
     32M               68%             2.0
     64M               90%             2.3
     128M/256M/512M    99%             2.3

As Yonik pointed out, in the previous "AND" tests, the bottleneck is the system call to move data from the file system cache to userspace. Here in the "OR" tests, far fewer such calls are made, so the speedup is less significant. Wish I could get a real query workload for this dataset.

> Actually, phrase queries would be really interesting too since they hit the term positions.

Phrase queries are rare and term distribution is highly skewed according to the following study on the Excite query log:
Spink, Amanda and Xu, Jack L. (2000)   "Selected results from a large study of Web searching: the Excite study".  Information Research, 6(1) Available at: http://InformationR.net/ir/6-1/paper90.html

"4. Phrase Searching: Phrases (terms enclosed by quotation marks) were seldom, while only 1 in 16 queries contained a phrase - but correctly used.
5. Search Terms: Distribution: Jansen, et al., (2000) report the distribution of the frequency of use of terms in queries as highly skewed."

I didn't find a good one on the AOL query log. In any case, this buffer pool is not intended for general-purpose use. I mentioned RAMDirectory earlier. This is more like an alternative to RAMDirectory (that's why it's per directory): you want persistent storage for the index, yet it's not too big and you want RAMDirectory search performance. In addition, the entire index doesn't have to fit into memory, as long as the most queried part does. Hopefully, this benefits a subset of Lucene use cases.

> did you compare it against MMAP? I

The index I experimented on didn't fit in memory...




[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12569652#action_12569652 ] 

Otis Gospodnetic commented on LUCENE-1035:
------------------------------------------

It looks like this was never fully done.  I wonder if this should be closed, esp. since Ning might be working on slightly different problems now.




[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "robert engels (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537805 ] 

robert engels commented on LUCENE-1035:
---------------------------------------

I don't think this is any better than the NIOFileCache directory I had already submitted.

It was not really approved because the community felt that it did not offer much over the standard OS file system cache.

My tests showed it was better, but I think this would fall into the same problem.



[jira] Updated: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "Ning Li (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Li updated LUCENE-1035:
----------------------------

    Attachment: LUCENE-1035.patch

Coding Changes
--------------
New classes are localized to the store package, and so are most of the changes.
  - Two new interfaces: BareInput and BufferPool.
  - BareInput takes a subset of IndexInput's methods such as readBytes
    (IndexInput now implements BareInput).
  - BufferPoolLRU is a simple implementation of BufferPool interface.
    It uses a doubly linked list for the LRU algorithm.
  - BufferPooledIndexInput is a subclass of BufferedIndexInput. It takes
    a BareInput and a BufferPool. For BufferedIndexInput's readInternal,
    it will read from the BufferPool, and BufferPool will read from its
    cache if it's a hit and read from BareInput if it's a miss.
  - A FSDirectory object can optionally be created with a BufferPool with
    its size specified by a buffer size and number of buffers. BufferPool
    is shared among IndexInput of read-only files in the directory.
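
As a rough illustration of the hit/miss path described above (this is not the
code in the attached patch), the sketch below keeps fixed-size blocks in an
LRU map. BlockSource is a hypothetical stand-in for BareInput, and
LruBufferPoolSketch, readByte and hitRatio are made-up names. It uses
java.util.LinkedHashMap in access order, which maintains a doubly linked list
internally, rather than a hand-written list.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the patch's BareInput: a positional block read. */
interface BlockSource {
    /** Fills buf with the bytes that start at fileOffset. */
    void readBlock(long fileOffset, byte[] buf);
}

/** Fixed-capacity LRU pool of equally sized buffers, keyed by block number. */
final class LruBufferPoolSketch {
    private final int bufferSize;
    private final BlockSource source;
    private final Map<Long, byte[]> cache;
    private long hits;
    private long misses;

    LruBufferPoolSketch(int bufferSize, final int numBuffers, BlockSource source) {
        this.bufferSize = bufferSize;
        this.source = source;
        // accessOrder = true: iteration runs from least to most recently used.
        this.cache = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > numBuffers; // evict the LRU block when over capacity
            }
        };
    }

    /** Reads one byte at pos, going to the underlying source only on a miss. */
    synchronized byte readByte(long pos) {
        long block = pos / bufferSize;
        byte[] buf = cache.get(block);
        if (buf == null) {
            misses++;
            buf = new byte[bufferSize];
            source.readBlock(block * bufferSize, buf);
            cache.put(block, buf);
        } else {
            hits++;
        }
        return buf[(int) (pos % bufferSize)];
    }

    /** Fraction of reads served from the pool so far. */
    synchronized double hitRatio() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }
}
```

The total pool size in this sketch is bufferSize * numBuffers bytes, matching
the "buffer size and number of buffers" parameters mentioned above. A single
lock around the pool is the simplest choice and is only an assumption here.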

Unit tests
  - TestBufferPoolLRU.java is added.
  - Minor changes were made to _TestHelper.java and TestCompoundFile.java
    because they made specific assumptions about the type of IndexInput
    returned by FSDirectory.openInput.
  - All unit tests pass when I switch to always use a BufferPool.


Performance Results
-------------------
I ran some experiments using the enwiki dataset. The experiments were run on
a dual 2.0 GHz Intel Xeon server running Linux. The dataset has about 3.5M
documents and the index built from it is more than 3 GB. The only stored field
is a unique docid which is retrieved for each query result. All queries are
two-term AND queries generated from the dictionary. The first set of queries
returns between 1 and 1000 results with an average of 40. The second set
returns between 1 and 3000 with an average of 560. All tests were run warm.

1 Query set with average 40 results
  Buffer Pool Size    Hit Ratio    Queries per second
      0                 N/A            230
      16M               55%            250
      32M               63%            282
      64M               73%            345
      128M              85%            476
      256M              95%            672
      512M              98%            685

2 Query set with average 560 results
  Buffer Pool Size    Hit Ratio    Queries per second
      0                 N/A             27
      16M               56%             29
      32M               70%             37
      64M               89%             55
      128M              97%             67
      256M              98%             71
      512M              99%             72

Of course, if the tests are run cold, if the queried portion of the index
is significantly larger than the file system cache, or if there is a lot of
pre-processing of the queries and/or post-processing of the results, the
speedup will be smaller. But where it applies, i.e. where a reasonable hit
ratio can be achieved, it should provide a good improvement.
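
To read the two tables as speedups, each queries-per-second figure can be
divided by the no-pool baseline (230 and 27 respectively). The small program
below does just that; SpeedupSketch and speedup are names made up for this
illustration, and the numbers are copied from the tables above.

```java
final class SpeedupSketch {
    /** Throughput with the pool divided by throughput without it. */
    static double speedup(double baselineQps, double pooledQps) {
        return pooledQps / baselineQps;
    }

    public static void main(String[] args) {
        String[] poolSizes = {"16M", "32M", "64M", "128M", "256M", "512M"};
        double[] qpsAvg40 = {250, 282, 345, 476, 672, 685};  // baseline: 230
        double[] qpsAvg560 = {29, 37, 55, 67, 71, 72};       // baseline: 27
        for (int i = 0; i < poolSizes.length; i++) {
            System.out.printf("%-5s  %.2fx  %.2fx%n",
                    poolSizes[i],
                    speedup(230, qpsAvg40[i]),
                    speedup(27, qpsAvg560[i]));
        }
    }
}
```

With a 512M pool this works out to about 3.0x for the first query set
(685 / 230) and about 2.7x for the second (72 / 27).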




[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537980 ] 

Yonik Seeley commented on LUCENE-1035:
--------------------------------------

Also, in addition to some kind of protection against the LRU cache being busted by a single query, perhaps the ability to not cover parts of the index (like stored fields) would also help.



[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "Ning Li (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538129 ] 

Ning Li commented on LUCENE-1035:
---------------------------------

> That seems like quite a few docs to retrieve--any particular reason why?

I was guessing most applications won't want all 590K results, no? Lucene is used in so many different ways that there is no single representative use case.

> I echo Hoss' comment--proximity searching is important even if it isn't used much directly by users.

Hmm, I agree with you and Hoss, especially in applications where proximity is used to rank results of OR queries.



[jira] Updated: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "Ning Li (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Li updated LUCENE-1035:
----------------------------

    Summary: Optional Buffer Pool to Improve Search Performance  (was: ptional Buffer Pool to Improve Search Performance)



[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "Ning Li (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538638 ] 

Ning Li commented on LUCENE-1035:
---------------------------------

> The question is whether such situations are common enough to warrant adding this to the core.

Agree.

> A way around that might be to layer it on top of FSDirectory and add it to contrib.

I'd be happy to do that. I'll also include the following in the javadoc which hopefully is a fair assessment:

"When will a buffer pool help:
  - When an index is significantly larger than the file system cache, the hit ratio of a buffer pool is probably low, which means an insignificant performance improvement.
  - When an index is about the size of the file system cache or smaller, a buffer pool with a good enough hit ratio will help if the IO system calls are the bottleneck. An example is if you have many "AND" queries, which cause a lot of large skips."



[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537998 ] 

Yonik Seeley commented on LUCENE-1035:
--------------------------------------

Actually, phrase queries would be really interesting too since they hit the term positions.



[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538118 ] 

Hoss Man commented on LUCENE-1035:
----------------------------------

>> 4. Phase Searching: Phrases (terms enclosed by quotation marks) were seldom, while only 1 in 16 queries contained a phrase

quoted phrases in raw user input may be rare, but that doesn't mean PhraseQueries are as rare ... apps may artificially create a sloppy PhraseQuery containing all of the individual words in the user's raw query string to help identify matches where the input words all appear close together (i may be biased in assuming this is common, since it's something i do a lot of personally)



[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "Ning Li (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537995 ] 

Ning Li commented on LUCENE-1035:
---------------------------------

> most lucene usecases store much more than just the document id... that would really affect locality.

In the experiments, I was simulating the (Google) paradigm where you retrieve just the docids and go to document servers for other things. If stored fields almost always negatively affect locality, we can make the buffer pool sit only on data/files for which we expect good locality (say posting lists), but not others.

> It seems like a simple LRU cache could really be blown out of the water by certain types of queries (retrieve a lot of stored fields, or do an expanding term query) that would force out all previously cached hotspots. Most OS level caching has protection against this (multi-level LRU or whatever). But if our user-level LRU cache fails, we've also messed up the OS level cache since we've been hiding page hits from it.

That's a good point. We can improve the algorithm but hopefully still keep it simple and general. This buffer pool is not a fit-all solution. But hopefully it will benefit a number of use cases. That's why I say "optional". :)
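
One simple way to get the kind of protection discussed above is a two-segment
("segmented") LRU: a block enters a probation segment on first use and is
promoted to a protected segment only when it is touched again, so a one-off
scan can push out probation entries but not the established hot ones. The
sketch below is only an illustration of that idea under assumed names
(SegmentedLruSketch, probationCap, protectedCap); it is not part of the patch
and assumes non-null values.

```java
import java.util.LinkedHashMap;

/** Two-segment LRU: entries must be hit twice before they are protected. */
final class SegmentedLruSketch<K, V> {
    private final int probationCap;
    private final int protectedCap;
    // Both maps use access order, so the first key in iteration is the LRU one.
    private final LinkedHashMap<K, V> probation = new LinkedHashMap<>(16, 0.75f, true);
    private final LinkedHashMap<K, V> protectedSeg = new LinkedHashMap<>(16, 0.75f, true);

    SegmentedLruSketch(int probationCap, int protectedCap) {
        this.probationCap = probationCap;
        this.protectedCap = protectedCap;
    }

    /** Returns the cached value, or null on a miss. */
    V get(K key) {
        V v = protectedSeg.get(key); // a hit here refreshes its recency
        if (v != null) {
            return v;
        }
        v = probation.remove(key);
        if (v == null) {
            return null; // miss in both segments
        }
        // Second touch: promote from probation to protected.
        protectedSeg.put(key, v);
        if (protectedSeg.size() > protectedCap) {
            // Demote the least recently used protected entry instead of dropping it.
            K demoted = protectedSeg.keySet().iterator().next();
            insertProbation(demoted, protectedSeg.remove(demoted));
        }
        return v;
    }

    /** Caches a value; new keys always start in the probation segment. */
    void put(K key, V value) {
        if (protectedSeg.containsKey(key)) {
            protectedSeg.put(key, value);
            return;
        }
        insertProbation(key, value);
    }

    boolean contains(K key) {
        return probation.containsKey(key) || protectedSeg.containsKey(key);
    }

    private void insertProbation(K key, V value) {
        probation.put(key, value);
        if (probation.size() > probationCap) {
            // Evict the least recently used probation entry.
            probation.remove(probation.keySet().iterator().next());
        }
    }
}
```

With this layout, a query that streams through many blocks once only churns
the probation segment. How to split the capacity between the two segments is
a tuning question this sketch does not try to answer.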

> I'd like to see single term queries, "OR" queries, and queries across multiple fields (also a common usecase) that match more documents tested also.

I'll change to "OR" queries and see what happens. The dataset is enwiki with four fields: docid, date (optional), title and body. Most terms are from title and body.




[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

Posted by "robert engels (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538582 ] 

robert engels commented on LUCENE-1035:
---------------------------------------

Again, see my previous code in issue 414.  That it only works with NioFile is not really a limitation; it can easily work with any underlying "file". This is just an implementation detail.

This code is already implemented as a layer on top of FS directory, so the caller can decide to use an original FS directory or a caching one.

We actually have a multiplexing directory that (depending on file type and size), either opens the file purely in memory, uses a cached file, or lets the OS do the caching. Works really well.
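
The actual selection rules of that directory are not given in this thread, so
the following is only a guess at what such a per-file decision could look
like. The names (CachingStrategy, MultiplexPolicySketch, inMemoryMaxBytes) and
the rules themselves are assumptions: stored-field files (.fdt/.fdx) are left
to the OS cache, small files are held fully in memory, and everything else
goes through the user-level cache.

```java
/** The three options mentioned above, as a hypothetical enum. */
enum CachingStrategy { IN_MEMORY, POOLED, OS_CACHE }

/** Picks a caching strategy per file from its name and length. */
final class MultiplexPolicySketch {
    private final long inMemoryMaxBytes; // assumed threshold, chosen by the caller

    MultiplexPolicySketch(long inMemoryMaxBytes) {
        this.inMemoryMaxBytes = inMemoryMaxBytes;
    }

    CachingStrategy choose(String fileName, long lengthBytes) {
        // Stored fields tend to have poor locality, so do not let them
        // compete for the user-level cache (an assumption of this sketch).
        if (fileName.endsWith(".fdt") || fileName.endsWith(".fdx")) {
            return CachingStrategy.OS_CACHE;
        }
        // Small files are cheap to hold entirely in memory.
        if (lengthBytes <= inMemoryMaxBytes) {
            return CachingStrategy.IN_MEMORY;
        }
        // Everything else (e.g. large postings files) uses the pooled cache.
        return CachingStrategy.POOLED;
    }
}
```

A directory wrapper could call choose() when a file is opened and return the
matching kind of IndexInput. Compound (.cfs) files would need extra handling
that is left out here.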


