Posted to java-user@lucene.apache.org by "Walker, Keith 1" <ke...@lmco.com> on 2007/03/09 18:14:38 UTC
Words not found, large file indexing
I'm having problems with queries not returning a hit when a document
does in fact have those terms. (I'm not worried about the ranking, just
whether or not it's a hit.)
Is anything wrong with the query syntax? (see below) Also, words in the
document's index (not the Lucene index) seemed less likely to be
recognized. I'm also wondering if anyone's run into problems with
large files, since the one I'm using is 161MB, but boils down to 472KB
as text. The smaller file had no problems.
Thanks for any advice,
Keith
Here are some of my test results on 2 different documents, with the test
code below.
query                                                   location of words in document (src: Acrobat)  Test 2

http://usability.gov/pdfs/guidelines_book.pdf (161MB, 472KB as extracted text)
+content:("Research-based")                             310 instances                                 positive
+content:("Organize Information Clearly")               4 instances                                   positive
+content:("partitioning")                               3 instances                                   negative
+content:("distinguishing required")                    1 instance in index                           negative
+content:("evaluators")                                 14 instances                                  negative
+content:("distinguishing required" AND "evaluators")   (see above)                                   negative
+content:("partitioning" AND "evaluators")              (see above)                                   negative

automatic_format_identification.pdf (566KB, 53KB as text) v. 1 (not the latest)
+content:("tentative")                                  several instances                             positive
+content:("tentative hits")                             several instances                             positive
+content:("tentative" AND "hits")                       several instances                             positive
+content:("tentative hits" AND "identification")        several instances                             positive
public static void testLuceneIndexing() throws EraException,
        IOException, ParseException {
    File indexDir = new File("D:/kcw/test_data/gate_test/huge_files/index");
    String filename = "D:/kcw/test_data/gate_test/huge_files/hhs.txt";
    File file = new File(filename);
    if (indexDir.exists()) {
        deleteDirectory(indexDir);
    }
    IndexWriter writer = new IndexWriter(indexDir, new SimpleAnalyzer(), true);
    Document doc = new Document();
    doc.add(Field.Text("content", new FileReader(file)));
    doc.add(Field.Keyword("filename", file.getCanonicalPath()));
    System.out.println("before addDocument()");
    long start = System.currentTimeMillis();
    writer.addDocument(doc);
    System.out.println("# docs indexed: " + writer.docCount());
    writer.optimize();
    writer.close();
    System.out.println("Done indexing. Duration(ms): "
            + (System.currentTimeMillis() - start));

    IndexSearcher search = new IndexSearcher(indexDir.getCanonicalPath());
    Query luceneQuery = null;
    luceneQuery = QueryParser.parse("+content:(\"Research-based\")", "body",
            new SimpleAnalyzer());
    System.out.println("Query= " + luceneQuery.toString("body"));
    Hits hits = search.search(luceneQuery);
    int resultLength = hits.length();
    System.out.println("hit result = " + resultLength);
}
RE: Words not found, large file indexing
Posted by "Walker, Keith 1" <ke...@lmco.com>.
That worked for me too. Thanks!
-----Original Message-----
From: Steffen Heinrich [mailto:lucene-users@atablis.com]
Sent: Friday, March 09, 2007 1:39 PM
To: java-user@lucene.apache.org
Subject: Re: Words not found, large file indexing
Hello Chris,
this is incredible!
I'm new to Lucene and just subscribed to the list because of this very
phenomenon. Keith's problem was also my problem.
Your mail was the first one I read and is _exactly_ what I needed to
know.
Thanks a lotta!
Cheers, Steffen
On 9 Mar 2007 at 9:25, Chris Hostetter wrote:
>
> are you perhaps exceeding this...
>
> http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)
>
--
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ --
-- steffen heinrich, berlin, germany --
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Words not found, large file indexing
Posted by Steffen Heinrich <lu...@atablis.com>.
Hello Chris,
this is incredible!
I'm new to Lucene and just subscribed to the list because of this very
phenomenon. Keith's problem was also my problem.
Your mail was the first one I read and is _exactly_ what I needed to
know.
Thanks a lotta!
Cheers, Steffen
On 9 Mar 2007 at 9:25, Chris Hostetter wrote:
>
> are you perhaps exceeding this...
>
> http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)
>
--
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ --
-- steffen heinrich, berlin, germany --
Re: Words not found, large file indexing
Posted by Chris Hostetter <ho...@fucit.org>.
are you perhaps exceeding this...
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)
-Hoss
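
For readers hitting the same symptom: IndexWriter stops indexing a field
after a maximum number of terms (10,000 by default in the Lucene releases
of this period), so words that appear late in a long document never reach
the index, while words near the start are found. The sketch below is not
Lucene code. It is a stdlib-only imitation of that cap, using a
hypothetical class FieldLengthDemo and a letters-only, lowercasing
tokenizer that only roughly approximates SimpleAnalyzer. In the test code
from the original post, the fix suggested by the Javadoc link above would
presumably be a call to writer.setMaxFieldLength(...) with a larger value
before writer.addDocument(doc).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FieldLengthDemo {

    // Rough stand-in for SimpleAnalyzer (an assumption, not the real class):
    // split on anything that is not a letter and lowercase each token.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Keep only the first maxFieldLength tokens, the way a capped indexer
    // would; everything after the cap is silently dropped.
    public static Set<String> indexedTerms(String text, int maxFieldLength) {
        List<String> tokens = tokenize(text);
        int limit = Math.min(tokens.size(), maxFieldLength);
        return new HashSet<>(tokens.subList(0, limit));
    }

    public static void main(String[] args) {
        // Early terms, then 10,000 filler tokens, then one late term.
        StringBuilder text = new StringBuilder("Research-based guidelines ");
        for (int i = 0; i < 10_000; i++) {
            text.append("filler ");
        }
        text.append("evaluators");

        Set<String> capped = indexedTerms(text.toString(), 10_000);
        Set<String> uncapped = indexedTerms(text.toString(), Integer.MAX_VALUE);

        System.out.println("capped, 'research':     " + capped.contains("research"));
        System.out.println("capped, 'evaluators':   " + capped.contains("evaluators"));
        System.out.println("uncapped, 'evaluators': " + uncapped.contains("evaluators"));
    }
}
```

With the cap at 10,000 the early term is found and the late one is not,
which matches the pattern in Keith's results: terms from the front of the
large document hit, terms from the body and the back-of-book index miss.
Raising the cap makes the late term visible again.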