Posted to java-user@lucene.apache.org by Jerry Jalenak <Je...@LABONE.com> on 2005/01/19 21:16:01 UTC

[newbie] Confused about PrefixQuery

All,

I'm investigating the use of Lucene as a search engine, and have been doing
some 'proof-of-concept' coding today.  I'm indexing about 650 text files,
and then searching against them using QueryParser.  Here's the indexing code
snippet:

<snip>
public static void Result(IndexWriter indexWriter, File file)
	throws FileNotFoundException
{
	Document document = null;
	String content = "";
		
	BufferedReader br = new BufferedReader(new FileReader(file));
	boolean EOF = false;
		
	try
	{
		while(!EOF)
		{
			String s = (String) br.readLine();
			if (null == s)
			{
				EOF = true;
			}
			else
			{
				if (!"".equals(s) && "CC>".equals(s.substring(0, 3)))
				{
					document = new Document();

					document.add(Field.Text("account", s.substring(3, 7)));
					document.add(Field.Keyword("created",
						s.substring(s.indexOf("DC>") + 3, s.indexOf("DC>") + 11)));

					content = new String();
				}
				else if (!"".equals(s) && "AN>".equals(s.substring(0, 3)))
				{
					document.add(Field.Keyword("lastname", s.substring(3, 28).trim().toLowerCase()));
					document.add(Field.Keyword("firstname", s.substring(28, 43).trim().toLowerCase()));
					document.add(Field.Text("name",
						s.substring(28, 43).trim() + " " + s.substring(3, 28).trim()));
					document.add(Field.Keyword("controlnumber", s.substring(44, 52)));
					document.add(Field.Keyword("status", s.substring(52, 53).trim()));
					document.add(Field.Keyword("ssn", s.substring(53, 62)));
					document.add(Field.Keyword("dob", s.substring(62, 70)));
					document.add(Field.Keyword("collected", s.substring(137, 145)));
				}
				else if (!"".equals(s) && "<FF".equals(s.substring(0, 3)))
				{
					document.add(Field.UnStored("content", content));
					indexWriter.addDocument(document);
				}
				else
				{
					content = content + s + "\n";
				}
			}
		}
		br.close();
	}
	catch(IOException ioe)
	{
		System.out.println(ioe.getClass() + " caught with message " + ioe.getMessage());
	}
}
</snip>
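
The fixed-width extraction in the AN> branch can be tried in isolation. A
minimal sketch, assuming a made-up line layout: the helper class AnLineFields
and the sample line are illustrations only and do not come from the real
files; only the offsets (3-28 and 28-43) are taken from the indexer above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the indexer: pulls the name columns
// out of an AN> line using the same offsets as the indexing code.
class AnLineFields
{
	static Map<String, String> parse(String s)
	{
		Map<String, String> fields = new LinkedHashMap<String, String>();
		// Columns 3-28 hold the last name, columns 28-43 the first name.
		fields.put("lastname", s.substring(3, 28).trim().toLowerCase());
		fields.put("firstname", s.substring(28, 43).trim().toLowerCase());
		fields.put("name", s.substring(28, 43).trim() + " " + s.substring(3, 28).trim());
		return fields;
	}

	// Builds a made-up AN> line: "AN>", a 25-column last name, a 15-column first name.
	static String sampleLine(String last, String first)
	{
		return "AN>" + String.format("%-25s%-15s", last, first);
	}

	public static void main(String[] args)
	{
		// Prints {lastname=martin, firstname=lois, name=LOIS MARTIN}
		System.out.println(parse(sampleLine("MARTIN", "LOIS")));
	}
}
```

The lowercased lastname and firstname values are what go into the Keyword
fields, so a query against them has to use lowercase as well.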

The text files have two control lines at the beginning of them - CC> and
AN>.  I extract particular fields from these lines and add them to my
document.  Everything (I think) indexes correctly.  When I search against
this index, though, I get some weird results, especially when using an '*'
at the end of my criteria.  Here's the search code snippet:

<snip>
public static void main(String[] args)
{
	try
	{
		Searcher searcher = new IndexSearcher("c:\\ResultIndex");
		Analyzer analyzer = new StandardAnalyzer();
		
		BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
		while(true)
		{
			System.out.println("Query: ");
			String s = br.readLine();
			if (null == s)
			{
				break;
			}
			else
			{
				Query query = QueryParser.parse(s, "content", analyzer);
				System.out.println("Searching for: " + query.toString("content"));

				Hits hits = searcher.search(query);
				System.out.println("... Found " + hits.length() + " matching documents");
				System.out.println("");

				for (int i = 0; i < hits.length(); i++)
				{
					Document document = hits.doc(i);
					System.out.println("Hit " + i + ": Specimen = " + document.get("controlnumber") +
							", Account = " + document.get("account") +
							", Status = " + document.get("status") +
							", Name = " + document.get("name") +
							", SSN = " + document.get("ssn") +
							", DOB = " + document.get("dob") +
							", Collected = " + document.get("collected") +
							", Created = " + document.get("created"));
					//System.out.println(document.get("content"));
				}
			}
		}
	}
	catch(Exception e)
	{
		System.out.println(e.getClass() + " caught with message " +
e.getMessage());
	}
}
</snip>

When I run this using a criteria string of 

	lastname:mar*

I get back the following:

Query: 
lastname:mar*
Searching for: lastname:mar*
... Found 9 matching documents

Hit 0: Specimen = 40062720, Account = 0001, Status = N, Name = LOIS MARTIN, SSN = 536628498, DOB = 19010101, Collected = 20050118, Created = 20050119
Hit 1: Specimen = 38843845, Account = 4NEK, Status = N, Name = RENEE CAPPETTA, SSN = 585132901, DOB = 19010101, Collected = 20050117, Created = 20050119
Hit 2: Specimen = 39894441, Account = 3384, Status = N, Name = LINDA CANTU, SSN = 453539817, DOB = 19010101, Collected = 20050118, Created = 20050119
Hit 3: Specimen = 39894441, Account = 3384, Status = N, Name = LINDA CANTU, SSN = 453539817, DOB = 19010101, Collected = 20050118, Created = 20050119
Hit 4: Specimen = 38247027, Account = 23SQ, Status = N, Name = ROBERT BASTOW, SSN = 528960058, DOB = 19010101, Collected = 20050118, Created = 20050119
Hit 5: Specimen = 38247027, Account = 23SQ, Status = N, Name = ROBERT BASTOW, SSN = 528960058, DOB = 19010101, Collected = 20050118, Created = 20050119
Hit 6: Specimen = 38247027, Account = 23SQ, Status = N, Name = ROBERT BASTOW, SSN = 528960058, DOB = 19010101, Collected = 20050118, Created = 20050119
Hit 7: Specimen = 38247027, Account = 23SQ, Status = N, Name = ROBERT BASTOW, SSN = 528960058, DOB = 19010101, Collected = 20050118, Created = 20050119
Hit 8: Specimen = 38247027, Account = 23SQ, Status = N, Name = ROBERT BASTOW, SSN = 528960058, DOB = 19010101, Collected = 20050118, Created = 20050119

I'm at a loss to explain why I'm getting hits 1 - 8 - the lastnames don't
start with mar!  I suspect it is due to an incorrect use of Field.Keyword vs
Field.Text in the indexer, but I can't seem to figure it out...
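
One possibility the indexer leaves open: the Document is only re-created on a
CC> line, so if a file held more than one AN> line before the next CC>, the
later lastname values would be added to the same Document. The sketch below
is plain Java rather than Lucene. The class MultiValuedDoc and the value
"martinez" are made up for illustration, and it assumes that Document.get
returns the first value added for a field. Under those assumptions it shows
how such a document could match mar* while displaying a different name:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model, not Lucene: a document holding several values for one field.
class MultiValuedDoc
{
	private final List<String[]> fields = new ArrayList<String[]>();

	void add(String name, String value)
	{
		fields.add(new String[] { name, value });
	}

	// Assumed to mirror Document.get(): the first value added for the field.
	String get(String name)
	{
		for (String[] f : fields)
		{
			if (f[0].equals(name)) return f[1];
		}
		return null;
	}

	// Models a prefix query on an untokenized field: any stored value may match.
	boolean matchesPrefix(String name, String prefix)
	{
		for (String[] f : fields)
		{
			if (f[0].equals(name) && f[1].startsWith(prefix)) return true;
		}
		return false;
	}

	public static void main(String[] args)
	{
		MultiValuedDoc doc = new MultiValuedDoc();
		doc.add("lastname", "cappetta");  // value from the first AN> line
		doc.add("lastname", "martinez");  // hypothetical later AN> line, same Document
		System.out.println(doc.get("lastname"));                  // prints cappetta
		System.out.println(doc.matchesPrefix("lastname", "mar")); // prints true
	}
}
```

If that is what is happening, creating the Document in the AN> branch (or
right after addDocument) instead of only in the CC> branch would be the thing
to try; this is a guess until checked against the real files.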

Thanks.

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS  66219
(913) 577-1496

jerry.jalenak@labone.com



