You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Marcel Stör <ma...@frightanic.com> on 2003/02/05 16:01:41 UTC

Too few search results

Hi all,

My Lucene IndexSearcher returns too few hits when I use some extended
query syntaxt. I'll give examples of my query/hits pairs at the bottom.
I'm indexing a database table:

<CODE>
//creating a full index
DBConnection conn = null;
try {
    Analyzer analyzer = new StandardAnalyzer();
    IndexWriter writer = new IndexWriter(indexFile, analyzer, true);
    conn =
DBConnectionFactory.getDBConnetion(DBConnectionFactory.ORACLE);
    String query = "select id, category, problem, solution from
kedb.solution";
    conn.open(cred);
    ResultSet rs = conn.execute(query);
    while (rs.next()) {
        Document doc = new Document();
        doc.add(Field.Text("id", rs.getString(1)));
        doc.add(Field.Text("category", rs.getString(2)));
        doc.add(Field.Text("problem", rs.getString(3)));
        doc.add(Field.UnStored("solution", rs.getString(4)));
        writer.addDocument(doc);
    }
    writer.close();
    conn.close();
} catch (Exception ex) {
    ex.printStackTrace();
} finally {
    conn.close();
}

//searching the index
Analyzer analyzer = new StandardAnalyzer();
Searcher searcher = new IndexSearcher(IndexReader.open(indexFile));
Query q = QueryParser.parse(query, "problem", analyzer);
Hits  hits = searcher.search(q);
for (int i = 0; i < hits.length(); i++) {
    if (0 == cat.compareTo("") || 0 ==
hits.doc(i).get("category").compareTo(cat)) {
        ids.add(hits.doc(i).get("id"));
        cats.add(hits.doc(i).get("category"));
        probs.add(hits.doc(i).get("problem"));
    }
}
</CODE>

This is sample data from the table. Right, it's German and not English!
I tried GermanAnalyzer instead of StandardAnalyzer but that made the
search results even worse.

ID;CATEGORY;PROBLEM;SOLUTION;
1;2;Irgendetwas mit dem Novell Server stimmt nicht;Server neu booten;
2;15;User kann sich nicht mehr am Novell anmelden. Kabel ist
eingesteckt. Neubooten nützt nichts. Baum wird nicht gefunden.;Baum
manuell suchen über Button auf Login-Maske.;
3;9;Makros in Word funktionieren nicht;Optionen geändert (Medium Level).
Extras - Optionen - Makros;
4;5;Scanner funktioniert nicht. Gerät erscheint nicht in der Liste der
USB Geräte.;Neusten Treiber installieren.
VORSICHT: NT 4.0 unterstützt kein USB!!!;
5;16;Maus spinnt;Reinigen/Auswechseln;
6;16;Eheprobleme;Adresse einer Beratungsstelle vermitteln;
7;11;Browser bringt imme eine Sexseite als Startseite;Extras - Optionen
- Startseite festlegen;


  Query             /hits       /expected
1 Novell            /2          /2
2 Nove*             /0          /2
3 Novel~            /0          /2
4 N?vell            /0          /2
5 +Scanner +Baum    /0          /0
6 Scanner -USB      /0          /0
7 usb AND liste     /1          /1
8 Mikros~           /1          /1
9 Ehe*              /1          /0
10 "Word Optionen"~10/1          /0
11 "Browser Sexseite"~10/1       /1


Boolean operators never seem to be a problem (ok, it's the easiest to
implement ;-)). Fuzzy searches, wildcard searches, plus proximity
searches just don't return enough hits. Very surprising are number 10 &
11. At 10 the proximity fails but at 11 everything is fine!

If you have any tips as for how to improve the search process please let
me know.

Regards,
Marcel


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Too few search results

Posted by sim luc <si...@gmx.net>.
hi,
i just read an interesting bit of documentation in the api under
IndexSearcher.search(Query, Filter, HitCollector). it says:
"Applications should only use this if they need all of the matching
documents. The high-level search API (Searcher.search(Query))
is usually more efficient, as it skips non-high-scoring hits."
i don't know if the hits you are missing are "non-high-scoring",
but it might be worth checking.

> Hi all,
> 
> My Lucene IndexSearcher returns too few hits when I use some extended
> query syntaxt. I'll give examples of my query/hits pairs at the bottom.
> I'm indexing a database table:
> 
> <CODE>
> //creating a full index
> DBConnection conn = null;
> try {
>     Analyzer analyzer = new StandardAnalyzer();
>     IndexWriter writer = new IndexWriter(indexFile, analyzer, true);
>     conn =
> DBConnectionFactory.getDBConnetion(DBConnectionFactory.ORACLE);
>     String query = "select id, category, problem, solution from
> kedb.solution";
>     conn.open(cred);
>     ResultSet rs = conn.execute(query);
>     while (rs.next()) {
>         Document doc = new Document();
>         doc.add(Field.Text("id", rs.getString(1)));
>         doc.add(Field.Text("category", rs.getString(2)));
>         doc.add(Field.Text("problem", rs.getString(3)));
>         doc.add(Field.UnStored("solution", rs.getString(4)));
>         writer.addDocument(doc);
>     }
>     writer.close();
>     conn.close();
> } catch (Exception ex) {
>     ex.printStackTrace();
> } finally {
>     conn.close();
> }
> 
> //searching the index
> Analyzer analyzer = new StandardAnalyzer();
> Searcher searcher = new IndexSearcher(IndexReader.open(indexFile));
> Query q = QueryParser.parse(query, "problem", analyzer);
> Hits  hits = searcher.search(q);
> for (int i = 0; i < hits.length(); i++) {
>     if (0 == cat.compareTo("") || 0 ==
> hits.doc(i).get("category").compareTo(cat)) {
>         ids.add(hits.doc(i).get("id"));
>         cats.add(hits.doc(i).get("category"));
>         probs.add(hits.doc(i).get("problem"));
>     }
> }
> </CODE>
> 
> This is sample data from the table. Right, it's German and not English!
> I tried GermanAnalyzer instead of StandardAnalyzer but that made the
> search results even worse.
> 
> ID;CATEGORY;PROBLEM;SOLUTION;
> 1;2;Irgendetwas mit dem Novell Server stimmt nicht;Server neu booten;
> 2;15;User kann sich nicht mehr am Novell anmelden. Kabel ist
> eingesteckt. Neubooten nützt nichts. Baum wird nicht gefunden.;Baum
> manuell suchen über Button auf Login-Maske.;
> 3;9;Makros in Word funktionieren nicht;Optionen geändert (Medium 
> Level).
> Extras - Optionen - Makros;
> 4;5;Scanner funktioniert nicht. Gerät erscheint nicht in der Liste der
> USB Geräte.;Neusten Treiber installieren.
> VORSICHT: NT 4.0 unterstützt kein USB!!!;
> 5;16;Maus spinnt;Reinigen/Auswechseln;
> 6;16;Eheprobleme;Adresse einer Beratungsstelle vermitteln;
> 7;11;Browser bringt imme eine Sexseite als Startseite;Extras - Optionen
> - Startseite festlegen;
> 
> 
>   Query             /hits       /expected
> 1 Novell            /2          /2
> 2 Nove*             /0          /2
> 3 Novel~            /0          /2
> 4 N?vell            /0          /2
> 5 +Scanner +Baum    /0          /0
> 6 Scanner -USB      /0          /0
> 7 usb AND liste     /1          /1
> 8 Mikros~           /1          /1
> 9 Ehe*              /1          /0
> 10 "Word Optionen"~10/1          /0
> 11 "Browser Sexseite"~10/1       /1
> 
> 
> Boolean operators never seem to be a problem (ok, it's the easiest to
> implement ;-)). Fuzzy searches, wildcard searches, plus proximity
> searches just don't return enough hits. Very surprising are number 10 &
> 11. At 10 the proximity fails but at 11 everything is fine!
> 
> If you have any tips as for how to improve the search process please let
> me know.
> 
> Regards,
> Marcel
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 

-- 
+++ GMX - Mail, Messaging & more  http://www.gmx.net +++
NEU: Mit GMX ins Internet. Rund um die Uhr für 1 ct/ Min. surfen!


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Too few search results

Posted by Otis Gospodnetic <ot...@yahoo.com>.
This is really a FAQ (prefix queries and case sensitivity).
StandardAnalyzer converts all tokens to lower case before indexing, no?
You are using upper case when using prefix queries.
See?

Regarding your indexing snippet - Exception could be SQLException, so
couldn't your 'conn' be null, therefore causing NPE in the finally
block?
And what happens with your writer.close() if you get an Exception?  You
could put writer.close() in finally block, too, maybe, checking for
null, etc.

Otis


--- Marcel_St�r <ma...@frightanic.com> wrote:
> Hi all,
> 
> My Lucene IndexSearcher returns too few hits when I use some extended
> query syntaxt. I'll give examples of my query/hits pairs at the
> bottom.
> I'm indexing a database table:
> 
> <CODE>
> //creating a full index
> DBConnection conn = null;
> try {
>     Analyzer analyzer = new StandardAnalyzer();
>     IndexWriter writer = new IndexWriter(indexFile, analyzer, true);
>     conn =
> DBConnectionFactory.getDBConnetion(DBConnectionFactory.ORACLE);
>     String query = "select id, category, problem, solution from
> kedb.solution";
>     conn.open(cred);
>     ResultSet rs = conn.execute(query);
>     while (rs.next()) {
>         Document doc = new Document();
>         doc.add(Field.Text("id", rs.getString(1)));
>         doc.add(Field.Text("category", rs.getString(2)));
>         doc.add(Field.Text("problem", rs.getString(3)));
>         doc.add(Field.UnStored("solution", rs.getString(4)));
>         writer.addDocument(doc);
>     }
>     writer.close();
>     conn.close();
> } catch (Exception ex) {
>     ex.printStackTrace();
> } finally {
>     conn.close();
> }
> 
> //searching the index
> Analyzer analyzer = new StandardAnalyzer();
> Searcher searcher = new IndexSearcher(IndexReader.open(indexFile));
> Query q = QueryParser.parse(query, "problem", analyzer);
> Hits  hits = searcher.search(q);
> for (int i = 0; i < hits.length(); i++) {
>     if (0 == cat.compareTo("") || 0 ==
> hits.doc(i).get("category").compareTo(cat)) {
>         ids.add(hits.doc(i).get("id"));
>         cats.add(hits.doc(i).get("category"));
>         probs.add(hits.doc(i).get("problem"));
>     }
> }
> </CODE>
> 
> This is sample data from the table. Right, it's German and not
> English!
> I tried GermanAnalyzer instead of StandardAnalyzer but that made the
> search results even worse.
> 
> ID;CATEGORY;PROBLEM;SOLUTION;
> 1;2;Irgendetwas mit dem Novell Server stimmt nicht;Server neu booten;
> 2;15;User kann sich nicht mehr am Novell anmelden. Kabel ist
> eingesteckt. Neubooten n�tzt nichts. Baum wird nicht gefunden.;Baum
> manuell suchen �ber Button auf Login-Maske.;
> 3;9;Makros in Word funktionieren nicht;Optionen ge�ndert (Medium
> Level).
> Extras - Optionen - Makros;
> 4;5;Scanner funktioniert nicht. Ger�t erscheint nicht in der Liste
> der
> USB Ger�te.;Neusten Treiber installieren.
> VORSICHT: NT 4.0 unterst�tzt kein USB!!!;
> 5;16;Maus spinnt;Reinigen/Auswechseln;
> 6;16;Eheprobleme;Adresse einer Beratungsstelle vermitteln;
> 7;11;Browser bringt imme eine Sexseite als Startseite;Extras -
> Optionen
> - Startseite festlegen;
> 
> 
>   Query             /hits       /expected
> 1 Novell            /2          /2
> 2 Nove*             /0          /2
> 3 Novel~            /0          /2
> 4 N?vell            /0          /2
> 5 +Scanner +Baum    /0          /0
> 6 Scanner -USB      /0          /0
> 7 usb AND liste     /1          /1
> 8 Mikros~           /1          /1
> 9 Ehe*              /1          /0
> 10 "Word Optionen"~10/1          /0
> 11 "Browser Sexseite"~10/1       /1
> 
> 
> Boolean operators never seem to be a problem (ok, it's the easiest to
> implement ;-)). Fuzzy searches, wildcard searches, plus proximity
> searches just don't return enough hits. Very surprising are number 10
> &
> 11. At 10 the proximity fails but at 11 everything is fine!
> 
> If you have any tips as for how to improve the search process please
> let
> me know.
> 
> Regards,
> Marcel
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


__________________________________________________
Do you Yahoo!?
Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
http://mailplus.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org