Posted to java-user@lucene.apache.org by "Walker, Keith 1" <ke...@lmco.com> on 2007/03/09 18:14:38 UTC

Words not found, large file indexing

I'm having problems with queries not returning a hit when a document
does in fact have those terms.  (I'm not worried about the ranking, just
whether or not it's a hit.)

Is anything wrong with the query syntax (see below)?  Also, words that
appear in the document's own index (not the Lucene index) seemed less
likely to be recognized.  I'm also wondering whether anyone has run into
problems with large files: the one I'm using is 161MB as a PDF, but boils
down to 472KB as text.  The smaller file had no problems.

Thanks for any advice,
Keith

Here are some of my test results on 2 different documents, with the test
code below.
Document 1: http://usability.gov/pdfs/guidelines_book.pdf (161MB, 472KB as extracted text)

query	location of words in document (src: Acrobat)	Test 2 result
+content:("Research-based")	310 instances	positive
+content:("Organize Information Clearly")	4 instances	positive
+content:("partitioning")	3 instances	negative
+content:("distinguishing required")	1 instance in index	negative
+content:("evaluators")	14 instances	negative
+content:("distinguishing required" AND "evaluators")	(see above)	negative
+content:("partitioning" AND "evaluators")	(see above)	negative

Document 2: automatic_format_identification.pdf (566KB, 53KB as text)  v. 1 (not the latest)

query	location of words in document (src: Acrobat)	Test 2 result
+content:("tentative")	several instances	positive
+content:("tentative hits")	several instances	positive
+content:("tentative" AND "hits")	several instances	positive
+content:("tentative hits" AND "identification")	several instances	positive


public static void testLuceneIndexing() throws EraException, IOException, ParseException {
	File indexDir = new File("D:/kcw/test_data/gate_test/huge_files/index");
	String filename = "D:/kcw/test_data/gate_test/huge_files/hhs.txt";
	File file = new File(filename);
	if (indexDir.exists()) {
		deleteDirectory(indexDir);
	}
	IndexWriter writer = new IndexWriter(indexDir, new SimpleAnalyzer(), true);
	Document doc = new Document();
	doc.add(Field.Text("content", new FileReader(file)));
	doc.add(Field.Keyword("filename", file.getCanonicalPath()));
	System.out.println("before addDocument()");
	long start = System.currentTimeMillis();
	writer.addDocument(doc);
	System.out.println("# docs indexed: " + writer.docCount());
	writer.optimize();
	writer.close();
	System.out.println("Done indexing.  Duration(ms): " + (System.currentTimeMillis() - start));

	IndexSearcher search = new IndexSearcher(indexDir.getCanonicalPath());

	Query luceneQuery = null;

	luceneQuery = QueryParser.parse("+content:(\"Research-based\")", "body", new SimpleAnalyzer());
	System.out.println("Query= " + luceneQuery.toString("body"));

	Hits hits = search.search(luceneQuery);
	int resultLength = hits.length();
	System.out.println("hit result = " + resultLength);
}


RE: Words not found, large file indexing

Posted by "Walker, Keith 1" <ke...@lmco.com>.
That worked for me too.  Thanks! 

-----Original Message-----
From: Steffen Heinrich [mailto:lucene-users@atablis.com] 
Sent: Friday, March 09, 2007 1:39 PM
To: java-user@lucene.apache.org
Subject: Re: Words not found, large file indexing

Hello Chris,

this is incredible!
I'm new to Lucene and just subscribed to the list because of this very
phenomenon. Keith's problem was also my problem.

Your mail was the first one I read and is _exactly_ what I needed to
know.

Thanks a lotta!

Cheers, Steffen

On 9 Mar 2007 at 9:25, Chris Hostetter wrote:

>
> are you perhaps exceeding this...
>
> http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)
>

--
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ --
-- steffen heinrich, berlin, germany --


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org




Re: Words not found, large file indexing

Posted by Steffen Heinrich <lu...@atablis.com>.
Hello Chris,

this is incredible!
I'm new to Lucene and just subscribed to the list because of this very
phenomenon. Keith's problem was also my problem.

Your mail was the first one I read and is _exactly_ what I needed to 
know.

Thanks a lotta!

Cheers, Steffen

On 9 Mar 2007 at 9:25, Chris Hostetter wrote:

>
> are you perhaps exceeding this...
>
> http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)
>

-- 
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ --
-- steffen heinrich, berlin, germany --




Re: Words not found, large file indexing

Posted by Chris Hostetter <ho...@fucit.org>.
Are you perhaps exceeding this...

http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)





-Hoss
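
For anyone hitting the same symptom, here is a rough, self-contained sketch of why a field-length cap would produce the pattern in Keith's results: terms near the start of a long document are found, terms near the end (such as those in a back-of-book index) are not. It does not use Lucene at all. The tokenizer only approximates SimpleAnalyzer (lowercased runs of letters), and the 10,000-term figure is an assumption about the default limit in Lucene releases of this era, so check the Javadoc linked above for your version. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MaxFieldLengthSketch {

    // ASSUMPTION: the default cap on indexed terms per field (verify for your Lucene version).
    static final int ASSUMED_DEFAULT_LIMIT = 10_000;

    /** Splits text into lowercased runs of letters, roughly what SimpleAnalyzer produces. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Returns the distinct terms among the first maxFieldLength tokens; later tokens are dropped. */
    static Set<String> indexedTerms(String text, int maxFieldLength) {
        List<String> tokens = tokenize(text);
        int kept = Math.min(tokens.size(), maxFieldLength);
        return new HashSet<>(tokens.subList(0, kept));
    }

    public static void main(String[] args) {
        // A synthetic "large document": one early phrase, lots of filler, one late term.
        StringBuilder doc = new StringBuilder("Research-based guidelines ");
        for (int i = 0; i < 12_000; i++) {
            doc.append("filler ");
        }
        doc.append("evaluators");
        String text = doc.toString();

        Set<String> capped = indexedTerms(text, ASSUMED_DEFAULT_LIMIT);
        Set<String> uncapped = indexedTerms(text, Integer.MAX_VALUE);

        System.out.println("tokens in document:              " + tokenize(text).size());
        System.out.println("'research' found with cap:       " + capped.contains("research"));
        System.out.println("'evaluators' found with cap:     " + capped.contains("evaluators"));
        System.out.println("'evaluators' found without cap:  " + uncapped.contains("evaluators"));
    }
}
```

Running tokenize() over the extracted text file gives a quick way to see whether a document is likely to exceed the limit. If it does, the fix suggested in this thread is to raise the limit with the setMaxFieldLength(int) method from the link above, on the IndexWriter before calling addDocument().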

