Posted to general@lucene.apache.org by Stefan Wachter <St...@gmx.de> on 2005/08/23 17:16:45 UTC

Little improvement for SimpleHTMLEncoder

Hi all!

The SimpleHTMLEncoder could be improved slightly: all characters with
code >= 128 should be encoded as character entities. The reason is that
the encoder does not know the encoding used for the response, so it is
safer to encode all characters beyond ASCII as character entities. Could
someone review this proposal and, if it is sound, commit it?

Greetings!
--Stefan


        default:
          if (c < 128) {
            result.append(c);
          } else {
            // encode non-ASCII characters as numeric character entities
            result.append("&#").append((int)c).append(";");
          }
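For reference, here is a self-contained sketch of what the full method would look like with this change. Everything except the default branch (the escape cases, the class and method names) is an assumption for illustration, not the actual SimpleHTMLEncoder source.

```java
// Illustrative sketch only -- not the real SimpleHTMLEncoder.
public class EntityEncoderSketch {
    public static String encodeText(String text) {
        StringBuffer result = new StringBuffer();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  result.append("&lt;");   break;
                case '>':  result.append("&gt;");   break;
                case '&':  result.append("&amp;");  break;
                case '"':  result.append("&quot;"); break;
                default:
                    if (c < 128) {
                        result.append(c);
                    } else {
                        // non-ASCII: emit a numeric character entity,
                        // which is safe under any response encoding
                        result.append("&#").append((int) c).append(";");
                    }
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // 'ü' has code 252, so it becomes &#252; in the output
        System.out.println(encodeText("Schlüssel & Taste"));
        // prints: Schl&#252;ssel &amp; Taste
    }
}
```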



Re: Performance problem

Posted by WolfgangTäger <wt...@epo.org>.
Hello Erik,

I tried i++ and the performance is similar.

Maybe the problem is linked to the sorting of the results, because the 
required time increases with the number of hits:

Query: DE:Taste   => Hits 2k    => retrieve first 2000: 3.9 sec
Query: needle     => Hits 9.5k  => retrieve first 2000: 15 sec
Query: connection => Hits 78k   => retrieve first 2000: 10.1 sec
Query: product    => Hits 81k   => retrieve first 2000: 18.3 sec
 
Wolfgang
 


Re: Performance problem

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Aug 24, 2005, at 3:32 AM, WolfgangTäger wrote:

> Dear all,
>
> we are using Lucene to store 10 million bilingual sentence pairs for doing
> some natural language processing with them. Each document contains a
> sentence, its translation, and a topical code. We want to select sentences
> containing certain words and do statistics over the topical codes in order
> to detect translations which depend on the topic (like key => Taste
> (topic: input devices), key => Schlüssel (topic: cryptography)).
>
> While the search is carried out in a reasonably short time (about
> 500..800ms), we have a performance problem with actually retrieving the
> documents with code like:
>
> for (int i = nrhits - 1; i >= 0; i--) {
>         Document hitDoc = hits.doc(i);
>         String code = hitDoc.get("code");
>         ... statistics
> }
>
> Even when restricting nrhits to 2000, we have to wait 10..20 seconds just
> for the retrieval. Since the documents are so short, we would have expected
> a quicker retrieval. BTW, the loop was done in reverse order in the hope of
> accelerating the retrieval.

How many documents are you trying to retrieve?  I think you'll have
much better luck if you walk the documents in ascending Hits order
rather than backwards, as Hits caches documents with the presumption
that you'll move forward through them.  I'd be curious to see how much
(or whether) moving forward through Hits helps.

     Erik


Re: Performance problem

Posted by Paul Elschot <pa...@xs4all.nl>.
On Wednesday 24 August 2005 09:32, WolfgangTäger wrote:
> Dear all,
>
> we are using Lucene to store 10 million bilingual sentence pairs for doing
> some natural language processing with them. Each document contains a
> sentence, its translation, and a topical code. We want to select sentences
> containing certain words and do statistics over the topical codes in order
> to detect translations which depend on the topic (like key => Taste
> (topic: input devices), key => Schlüssel (topic: cryptography)).
>
> While the search is carried out in a reasonably short time (about
> 500..800ms), we have a performance problem with actually retrieving the
> documents with code like:
>
> for (int i = nrhits - 1; i >= 0; i--) {
>         Document hitDoc = hits.doc(i);
>         String code = hitDoc.get("code");
>         ... statistics
> }
>
> Even when restricting nrhits to 2000, we have to wait 10..20 seconds just
> for the retrieval. Since the documents are so short, we would have expected
> a quicker retrieval. BTW, the loop was done in reverse order in the hope of
> accelerating the retrieval.
>
> We are using Lucene 1.4.3 (Java version) on a Windows PC.
>
> Would you recommend using the C version? I suppose it is stable and we can
> reuse the database? Any other suggestions?

For so much retrieval, it's better to roll your own:
use the low-level search API Searcher.search(Query, HitCollector) to collect
all the hits by doc number, keeping the scores if you need them.
Then sort these doc numbers (they are normally not far from sorted after
collecting), and retrieve all docs in that sorted order via
IndexReader.document(int).
That way, with a bit of luck, the disk head never needs to change direction
during retrieval, and prefetches by the operating system (if any) stand a much
better chance of actually being used.
In case you don't have the IndexReader around, open it explicitly
and construct your Searcher from it.
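The collect-then-sort idea can be sketched in plain Java. The HitCollector interface below is a minimal stand-in for Lucene's org.apache.lucene.search.HitCollector so the sketch runs without the Lucene jar; against a real index you would pass the anonymous collector to Searcher.search(Query, HitCollector) and call IndexReader.document(docId) inside the final loop.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedRetrievalSketch {
    // Minimal stand-in for Lucene's HitCollector callback interface.
    interface HitCollector {
        void collect(int doc, float score);
    }

    // Gather doc numbers as the searcher reports them, then sort them so
    // retrieval sweeps the index in one direction (disk-friendly order).
    public static List collectAndSort(int[] callbackOrder) {
        final List docIds = new ArrayList();
        HitCollector collector = new HitCollector() {
            public void collect(int doc, float score) {
                docIds.add(new Integer(doc));
            }
        };
        for (int i = 0; i < callbackOrder.length; i++) {
            collector.collect(callbackOrder[i], 1.0f);
        }
        Collections.sort(docIds); // nearly sorted already, so this is cheap
        return docIds;
    }

    public static void main(String[] args) {
        // With a real index: reader.document(docId) in a loop over this list.
        System.out.println(collectAndSort(new int[] {3, 1, 7, 5, 12, 9}));
        // prints: [1, 3, 5, 7, 9, 12]
    }
}
```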

Regards,
Paul Elschot


Re: Performance problem, Search within search for TopDocs ?

Posted by WolfgangTäger <wt...@epo.org>.
Paul,

the point with QueryFilter and FilteredQuery is that they expect a Query
and not a TopDocs.

Wolfgang
 

Re: Performance problem, Search within search for TopDocs ?

Posted by Paul Elschot <pa...@xs4all.nl>.
On Thursday 25 August 2005 09:14, WolfgangTäger wrote:
> Hello Paul, hello Doug,
> 
> many thanks for your help !!
> 
> @Paul: I didn't try your method, Paul, because I feared that retrieving
> all hits would not solve my problem: the number of hits may be many times
> higher than the 2000 I wanted for my statistics.
>
>
> @Doug: I tried your method, which is extremely fast!
>
> The first run is still very slow (many seconds, but this is not a problem
> for me); measuring the following ones sometimes gives 0ms using
> currentTimeMillis()!

The method I gave tries to optimize the first reading from disk.
I'd expect FieldCache to use it.

> 
> 
> Now I have one more question: I want to use the TopDocs result 
> 
> TopDocs hits = searcher.search(query, (Filter)null, 2000);
> 
> as a filter for following queries like
> 
> DE:Schlüssel            limited to the 2000 TopDocs for the query "key"
> 
> AFAIK, the recommended way for search within a search is:
> Combine the previous query with the current query using BooleanQuery, 
> wherein the previous query is marked as required.
> 
> Can this be done in this case? My query does not contain the information
> that I want at most 2000 results.
> Note: I do not want to limit the combined query to 2000; rather, I
> typically expect fewer results by first restricting the "key" query to
> 2000 and then looking in those results for DE:"Schlüssel".

Have a look at QueryFilter and FilteredQuery, they fit nicely here.
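To make the suggestion concrete, here is a sketch of the combination. This fragment needs the Lucene 1.4 jar and an already-open Searcher, so it is not self-contained; the field and term names are taken from the thread, and note that QueryFilter restricts to all documents matching the first query, not only its top 2000.

```java
// Fragment only -- assumes an existing Searcher named "searcher".
Query keyQuery = new TermQuery(new Term("DE", "key"));
Query deQuery  = new TermQuery(new Term("DE", "schlüssel"));

// QueryFilter caches the bit set of documents matching keyQuery;
// FilteredQuery then scores deQuery only within that set.
Hits hits = searcher.search(new FilteredQuery(deQuery, new QueryFilter(keyQuery)));
```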

Regards,
Paul Elschot


Re: Performance problem, Search within search for TopDocs ?

Posted by WolfgangTäger <wt...@epo.org>.
Hello Paul, hello Doug,

many thanks for your help !!

@Paul: I didn't try your method, Paul, because I feared that retrieving
all hits would not solve my problem: the number of hits may be many times
higher than the 2000 I wanted for my statistics.


@Doug: I tried your method, which is extremely fast!

The first run is still very slow (many seconds, but this is not a problem
for me); measuring the following ones sometimes gives 0ms using
currentTimeMillis()!


Now I have one more question: I want to use the TopDocs result 

TopDocs hits = searcher.search(query, (Filter)null, 2000);

as a filter for following queries like

DE:Schlüssel            limited to the 2000 TopDocs for the query "key"

AFAIK, the recommended way for search within a search is:
Combine the previous query with the current query using BooleanQuery, 
wherein the previous query is marked as required.

Can this be done in this case? My query does not contain the information
that I want at most 2000 results.
Note: I do not want to limit the combined query to 2000; rather, I
typically expect fewer results by first restricting the "key" query to
2000 and then looking in those results for DE:"Schlüssel".

Wolfgang

 





Re: Performance problem

Posted by WolfgangTäger <wt...@epo.org>.
Hello again,

Since my code field does not contain strings, I have a little problem with
using getStrings.


I probably have to use

FieldCache.StringIndex codes = 
FieldCache.DEFAULT.getStringIndex(indexReader, "code");

I however do not understand how to find the code of index i in the 
results.

I tried something like 
        codes.lookup[codes.order[hits.scoreDocs[i].doc]]

Is this correct ?

Wolfgang




Re: Performance problem

Posted by Doug Cutting <cu...@apache.org>.
WolfgangTäger wrote:
> While the search is carried out in a reasonably short time (about
> 500..800ms), we have a performance problem with actually retrieving the
> documents with code like:
>
> for (int i = nrhits - 1; i >= 0; i--) {
>         Document hitDoc = hits.doc(i);
>         String code = hitDoc.get("code");
>         ... statistics
> }

If you have enough RAM, a FieldCache would make this very fast.

TopDocs hits = searcher.search(query, (Filter)null, 2000);
String[] codes = FieldCache.DEFAULT.getStrings(indexReader, "code");
for (int i = 0; i < hits.scoreDocs.length; i++) {
   String code = codes[hits.scoreDocs[i].doc];
   ...
}

Doug

Performance problem

Posted by WolfgangTäger <wt...@epo.org>.
Dear all,

we are using Lucene to store 10 million bilingual sentence pairs for doing
some natural language processing with them. Each document contains a
sentence, its translation, and a topical code. We want to select sentences
containing certain words and do statistics over the topical codes in order
to detect translations which depend on the topic (like key => Taste
(topic: input devices), key => Schlüssel (topic: cryptography)).

While the search is carried out in a reasonably short time (about
500..800ms), we have a performance problem with actually retrieving the
documents with code like:

for (int i = nrhits - 1; i >= 0; i--) {
        Document hitDoc = hits.doc(i);
        String code = hitDoc.get("code");
        ... statistics
}

Even when restricting nrhits to 2000, we have to wait 10..20 seconds just
for the retrieval. Since the documents are so short, we would have expected
a quicker retrieval. BTW, the loop was done in reverse order in the hope of
accelerating the retrieval.

We are using Lucene 1.4.3 (Java version) on a Windows PC.

Would you recommend using the C version? I suppose it is stable and we can
reuse the database? Any other suggestions?

Thanks for your help!

Wolfgang

 
 
 

Re: Little improvement for SimpleHTMLEncoder

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Could you add this to a Bugzilla issue so it doesn't get lost in the
never-ending pile of e-mail that we all have?

Thanks,
     Erik

On Aug 23, 2005, at 11:16 AM, Stefan Wachter wrote:

> Hi all!
>
> The SimpleHTMLEncoder could be improved slightly: all characters with
> code >= 128 should be encoded as character entities. The reason is that
> the encoder does not know the encoding used for the response, so it is
> safer to encode all characters beyond ASCII as character entities. Could
> someone review this proposal and, if it is sound, commit it?
>
> Greetings!
> --Stefan
>
>
>        default:
>          if (c < 128) {
>            result.append(c);
>          } else {
>            result.append("&#").append((int)c).append(";");
>          }
>
>