You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Amin Mohammed-Coleman <am...@gmail.com> on 2009/01/01 22:11:00 UTC

Fwd: Search Problem

>
> Hi
>
> I have created a RTFHandler which takes a RTF file and creates a  
> lucene Document which is indexed.  The RTFHandler looks like  
> something like this:
>
> if (bodyText != null) {
> 			Document document = new Document();
> 			Field field = new Field(MetaDataEnum.BODY.getDescription(),  
> bodyText.trim(), Field.Store.YES, Field.Index.ANALYZED);
> 			document.add(field);
> 			
> 		
> }
>
> I am using Java Built in RTF text extraction.  When I run my test to  
> verify that the document contains text that I expect this works  
> fine.  I get the following when I print the document:
>
> Document<stored/uncompressed,indexed,tokenized<body:This is a test  
> rtf document that will be indexed.
>
> Amin Mohammed-Coleman> stored/ 
> uncompressed,indexed<path:rtfDocumentToIndex.rtf> stored/ 
> uncompressed,indexed<name:rtfDocumentToIndex.rtf> stored/ 
> uncompressed,indexed<type:RTF_INDEXER> stored/ 
> uncompressed,indexed<summary:This is a >>
>
>
> The problem is when I use the following to search I get no result:
>
> 	MultiSearcher multiSearcher = new MultiSearcher(new Searchable[]  
> {rtfIndexSearcher});
> 			Term t = new Term("body", "Amin");
> 			TermQuery termQuery = new TermQuery(t);
> 			TopDocs topDocs = multiSearcher.search(termQuery, 1);
> 			System.out.println(topDocs.totalHits);
> 			multiSearcher.close();
>
> RftIndexSearcher is configured with the directory that holds rtf  
> documents.  I have used Luke to look at the document and what I am  
> finding in the overview tab is the following for the document:
>
> 1	body	test
> 1	id	1234
> 1	name	rtfDocumentToIndex.rtf
> 1	path	rtfDocumentToIndex.rtf
> 1	summary	This is a
> 1	type	RTF_INDEXER
> 1	body	rtf
>
>
> However on the Document tab I am getting (in the body field):
>
> This is a test rtf document that will be indexed.
>
> Amin Mohammed-Coleman
>
>
> I would expect to get a hit using "Amin" or even "document".  I am  
> not sure whether the
> line:
> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>
> is incorrect as I am not too sure of the meaning of "Finds the top n  
> hits for query." for search (Query query, int n) according to java  
> docs.
>
> I would be grateful if someone may be able to advise on what I may  
> be doing wrong.  I am using Lucene 2.4.0
>
>
> Cheers
> Amin
>
>
>
>

Re: Search Problem

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Hi

I am currently doing this as the indexer will be called from a upload  
action. There is no bulk file processing functioaliry at the moment.


Cheers

Sent from my iPhone

On 3 Jan 2009, at 13:48, Shashi Kant <sh...@yahoo.com> wrote:

> Amin,
>
> Are you calling Close & Optimize after every addDocument?
>
> I would suggest something like this
> try
> {
>      while //this could be your looping through a data reader for  
> example
>       {
>            indexWriter.addDocument(document);
>       }
> }
>
> finally
> {
>  commitAndOptimise()
> }
>
>
> HTH
>
> Shashi
>
>
> ----- Original Message ----
> From: Amin Mohammed-Coleman <am...@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Saturday, January 3, 2009 4:02:52 AM
> Subject: Re: Search Problem
>
>
> Hi again!
>
> I think I may have found the problem but I was wondering if you  
> could verify:
>
> I have the following for my indexer:
>
> public void add(Document document) {
>        IndexWriter indexWriter =  
> IndexWriterFactory.createIndexWriter(getDirectory(), getAnalyzer());
>        try {
>            indexWriter.addDocument(document);
>            LOGGER.debug("Added Document:" + document + " to index");
>            commitAndOptimise(indexWriter);
>        } catch (CorruptIndexException e) {
>            throw new IllegalStateException(e);
>        } catch (IOException e) {
>            throw new IllegalStateException(e);
>        }
>    }
>
> the commitAndOptimise(indexWriter) looks like this:
>
> private void commitAndOptimise(IndexWriter indexWriter) throws  
> CorruptIndexException,IOException {
>        LOGGER.debug("Committing document and closing index writer");
>        indexWriter.optimize();
>        indexWriter.commit();
>        indexWriter.close();
>    }
>
> It seems as though if I comment out optimize then the overview tab  
> in Luke  for the rtf document looks like:
>
> 5    id    1234
> 3    body    document
> 3    body    body
> 1    body    test
> 1    body    rtf
> 1    name    rtfDocumentToIndex.rtf
> 1    body    new
> 1    path    rtfDocumentToIndex.rtf
> 1    summary    This is a
> 1    type    RTF_INDEXER
> 1    body    content
>
>
> This is more what I expected although "Amin Mohammed-Coleman" hasn't  
> been stored in the index.  Should I not be using  
> indexWriter.optimize() ?
>
> I tried using the search function in luke and got the following  
> results:
> body:test ---> returns result
> body:document ---> no result
> body:content ---> no result
> body:rtf ----> returns result
>
>
> Thanks again...sorry to be sending so many emails about this. I am  
> in the process of designing and developing a prototype of a document  
> and domain indexing/searching component and I would like to demo to  
> the rest of my team.
>
>
> Cheers
> Amin
>
>
>
> On 3 Jan 2009, at 01:23, Erick Erickson wrote:
>
>> Well, your query results are consistent with what Luke is
>> reporting. So I'd go back and test your assumptions. I
>> suspect that you're not indexing what you think you are.
>>
>> For your test document, I'd just print out what you're indexing
>> and the field it's going into. *for each field*. that is, every  
>> time you
>> do a document.add(<field of some kind>), print out that data. I'm
>> pretty sure you'll find that you're not getting what you expect. For
>> instance, the call to:
>>
>> MetaDataEnum.BODY.getDescription()
>>
>> may be returning some nonsense. Or
>> bodyText.trim()
>>
>> isn't doing what you expect.
>>
>> Lucene is used by many folks, and errors of the magnitude you're
>> experiencing would be seen by many people and the user list would
>> be flooded with complaints if it were a Lucene issue at root. That
>> leaves the code you wrote as the most likely culprit. So try a very  
>> simple
>> test case with lots of debugging println's. I'm pretty sure you'll
>> find the underlying issue with some of your assumptions pretty  
>> quickly.
>>
>> Sorry I can't be more specific, but we'd have to see all of your code
>> and the test cases to do that....
>>
>> Best
>> Erick
>>
>> On Fri, Jan 2, 2009 at 6:13 PM, Amin Mohammed-Coleman <aminmc@gmail.com 
>> >wrote:
>>
>>> Hi Erick
>>>
>>> Thanks for your reply.
>>>
>>> I have used luke to inspect the document and I am some what  
>>> confused.  For
>>> example when I view the index using the overview tab of Luke I get  
>>> the
>>> following:
>>>
>>> 1       body    test
>>> 1       id      1234
>>> 1       name    rtfDocumentToIndex.rtf
>>> 1       path    rtfDocumentToIndex.rtf
>>> 1       summary This is a
>>> 1       type    RTF_INDEXER
>>> 1       body    rtf
>>>
>>>
>>> However when I view the document in the Document tab I get the  
>>> full text
>>> that was extracted from the rft document (field:body) which is:
>>>
>>> This is a test rtf document that will be indexed.
>>> Amin Mohammed-Coleman
>>>
>>> I am using the StandardAnaylzer therefore I wouldnt expect the words
>>> document, indexed, Amin Mohammed-Coleman to be removed.
>>>
>>> I have referenced the Lucene In Action book and I can't see what I  
>>> may be
>>> doing wrong.  I would be happy to provide a testcase should it be  
>>> required.
>>> When adding the body field to the document I am doing:
>>>
>>>      Document document = new Document();
>>>                      Field field = new
>>> Field(FieldNameEnum.BODY.getDescription(), bodyText.trim(),  
>>> Field.Store.YES,
>>> Field.Index.ANALYZED);
>>>                      document.add(field);
>>>
>>>
>>>
>>> When I run the search code the string "test" is the only word that  
>>> returns
>>> a result (TopDocs), whereas the others do not (e.g. "amin",  
>>> "document",
>>> "indexed").
>>>
>>> Thanks again for your help and advice.
>>>
>>>
>>> Cheers
>>> Amin
>>>
>>>
>>>
>>>
>>> On 2 Jan 2009, at 21:20, Erick Erickson wrote:
>>>
>>> Casing is usually handled by the analyzer. Since you construct
>>>> the term query programmatically, it doesn't go through
>>>> any analyzers, thus is not converted into lower case for
>>>> searching as was done automatically for you when you
>>>> indexed using StandardAnalyzer.
>>>>
>>>> As for why you aren't getting hits, it's unclear to me. But
>>>> what I'd do is get a copy of Luke and examine your index
>>>> to see what's *really* there. This will often give you clues,
>>>> usually pointing to some kind of analyzer behavior that you
>>>> weren't expecting.
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Fri, Jan 2, 2009 at 6:39 AM, Amin Mohammed-Coleman <aminmc@gmail.com
>>>>> wrote:
>>>>
>>>> Hi
>>>>>
>>>>> I have tried this and it doesn't work.  I don't understand why  
>>>>> using
>>>>> "amin"
>>>>> instead of "Amin" would work, is it not case insensitive?
>>>>>
>>>>> I tried "test" for field "body" and this works.  Any other terms  
>>>>> don't
>>>>> work
>>>>> for example:
>>>>>
>>>>> "document"
>>>>> "indexed"
>>>>>
>>>>> these are tokens that were extracted when creating the lucene  
>>>>> document.
>>>>>
>>>>>
>>>>> Thanks for your reply.
>>>>>
>>>>> Cheers
>>>>>
>>>>> Amin
>>>>>
>>>>>
>>>>> On 2 Jan 2009, at 10:36, Chris Lu wrote:
>>>>>
>>>>> Basically Lucene stores analyzed tokens, and looks up for the  
>>>>> matches
>>>>>
>>>>>> based
>>>>>> on the tokens.
>>>>>> "Amin" after StandardAnalyzer is "amin", so you need to use new
>>>>>> Term("body",
>>>>>> "amin"), instead of new Term("body", "Amin"), to search.
>>>>>>
>>>>>> --
>>>>>> Chris Lu
>>>>>> -------------------------
>>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>>> site: http://www.dbsight.net
>>>>>> demo: http://search.dbsight.com
>>>>>> Lucene Database Search in 3 minutes:
>>>>>>
>>>>>>
>>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>>> DBSight customer, a shopping comparison site, (anonymous per  
>>>>>> request)
>>>>>> got
>>>>>> 2.6 Million Euro funding!
>>>>>>
>>>>>> On Thu, Jan 1, 2009 at 11:30 PM, Amin Mohammed-Coleman <
>>>>>> aminmc@gmail.com
>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>
>>>>>> Hi
>>>>>>
>>>>>>>
>>>>>>> Sorry I was using the StandardAnalyzer in this instance.
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2 Jan 2009, at 00:55, Chris Lu wrote:
>>>>>>>
>>>>>>> You need to let us know the analyzer you are using.
>>>>>>>
>>>>>>> -- Chris Lu
>>>>>>>> -------------------------
>>>>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>>>>> site: http://www.dbsight.net
>>>>>>>> demo: http://search.dbsight.com
>>>>>>>> Lucene Database Search in 3 minutes:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>>>>> DBSight customer, a shopping comparison site, (anonymous per  
>>>>>>>> request)
>>>>>>>> got
>>>>>>>> 2.6 Million Euro funding!
>>>>>>>>
>>>>>>>> On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <
>>>>>>>> aminmc@gmail.com
>>>>>>>>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I have created a RTFHandler which takes a RTF file and  
>>>>>>>>>> creates a
>>>>>>>>>> lucene
>>>>>>>>>> Document which is indexed.  The RTFHandler looks like  
>>>>>>>>>> something like
>>>>>>>>>> this:
>>>>>>>>>>
>>>>>>>>>> if (bodyText != null) {
>>>>>>>>>>                Document document = new Document();
>>>>>>>>>>                Field field = new
>>>>>>>>>> Field(MetaDataEnum.BODY.getDescription(), bodyText.trim(),
>>>>>>>>>> Field.Store.YES,
>>>>>>>>>> Field.Index.ANALYZED);
>>>>>>>>>>                document.add(field);
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> I am using Java Built in RTF text extraction.  When I run  
>>>>>>>>>> my test to
>>>>>>>>>> verify that the document contains text that I expect this  
>>>>>>>>>> works
>>>>>>>>>> fine.
>>>>>>>>>> I
>>>>>>>>>> get
>>>>>>>>>> the following when I print the document:
>>>>>>>>>>
>>>>>>>>>> Document<stored/uncompressed,indexed,tokenized<body:This is  
>>>>>>>>>> a test
>>>>>>>>>> rtf
>>>>>>>>>> document that will be indexed.
>>>>>>>>>>
>>>>>>>>>> Amin Mohammed-Coleman>
>>>>>>>>>> stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>>>>>>>>>> stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>>>>>>>>>> stored/uncompressed,indexed<type:RTF_INDEXER>
>>>>>>>>>> stored/uncompressed,indexed<summary:This is a >>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The problem is when I use the following to search I get no  
>>>>>>>>>> result:
>>>>>>>>>>
>>>>>>>>>> MultiSearcher multiSearcher = new MultiSearcher(new  
>>>>>>>>>> Searchable[]
>>>>>>>>>> {rtfIndexSearcher});
>>>>>>>>>>                Term t = new Term("body", "Amin");
>>>>>>>>>>                TermQuery termQuery = new TermQuery(t);
>>>>>>>>>>                TopDocs topDocs =  
>>>>>>>>>> multiSearcher.search(termQuery,
>>>>>>>>>> 1);
>>>>>>>>>>                System.out.println(topDocs.totalHits);
>>>>>>>>>>                multiSearcher.close();
>>>>>>>>>>
>>>>>>>>>> RftIndexSearcher is configured with the directory that  
>>>>>>>>>> holds rtf
>>>>>>>>>> documents.  I have used Luke to look at the document and  
>>>>>>>>>> what I am
>>>>>>>>>> finding
>>>>>>>>>> in the overview tab is the following for the document:
>>>>>>>>>>
>>>>>>>>>> 1       body    test
>>>>>>>>>> 1       id      1234
>>>>>>>>>> 1       name    rtfDocumentToIndex.rtf
>>>>>>>>>> 1       path    rtfDocumentToIndex.rtf
>>>>>>>>>> 1       summary This is a
>>>>>>>>>> 1       type    RTF_INDEXER
>>>>>>>>>> 1       body    rtf
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> However on the Document tab I am getting (in the body field):
>>>>>>>>>>
>>>>>>>>>> This is a test rtf document that will be indexed.
>>>>>>>>>>
>>>>>>>>>> Amin Mohammed-Coleman
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I would expect to get a hit using "Amin" or even  
>>>>>>>>>> "document".  I am
>>>>>>>>>> not
>>>>>>>>>> sure whether the
>>>>>>>>>> line:
>>>>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>>>>
>>>>>>>>>> is incorrect as I am not too sure of the meaning of "Finds  
>>>>>>>>>> the top n
>>>>>>>>>> hits
>>>>>>>>>> for query." for search (Query query, int n) according to  
>>>>>>>>>> java docs.
>>>>>>>>>>
>>>>>>>>>> I would be grateful if someone may be able to advise on  
>>>>>>>>>> what I may
>>>>>>>>>> be
>>>>>>>>>> doing wrong.  I am using Lucene 2.4.0
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>> Amin
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --- 
>>>>>>>>> --- 
>>>>>>>>> --- 
>>>>>>>>> ------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user- 
>>>>>>> help@lucene.apache.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>> --- 
>>>>> ------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>>
>>>>>
>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Search Problem

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Hi

Please find attached a standalone test (inner classes for rtfHandler,  
indexing, etc) that shows search not returning expected results.  I am  
using Lucene 2.4.

Thanks again for the help!

Cheers
Amin

Re: Search Problem

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Hi

I have uploaded to google docs:

url: http://docs.google.com/Doc?id=d77xf5q_0n6hb38fx

Hope this works.


Cheers
Amin
On 3 Jan 2009, at 19:53, Grant Ingersoll wrote:

> The mailing list often strips attachments (in fact, I'm surprised  
> your earlier ones made it through).  Perhaps you can put them up  
> somewhere for download.
>
>
> On Jan 3, 2009, at 1:07 PM, Amin Mohammed-Coleman wrote:
>
>> Hi again
>>
>> Sorry I didn't include the WorkItem class!  Here is the final test  
>> case.  Apologies!
>> On 3 Jan 2009, at 14:02, Grant Ingersoll wrote:
>>
>>> You shouldn't need to call close and optimize after each document.
>>>
>>> You also don't need the commit if you are going to immediately  
>>> close.
>>>
>>> Also, can you send a standalone test that shows the RTF  
>>> extraction, the document creation and the indexing code that  
>>> demonstrates your issue.
>>>
>>> FWIW, and as a complete aside to save you some time after you get  
>>> this figured out, instead of re-inventing RTF extraction and PDF  
>>> extraction (as you appear to be doing), have a look at Tika (http://lucene.apache.org/tika 
>>> )
>>>
>>> On Jan 3, 2009, at 8:48 AM, Shashi Kant wrote:
>>>
>>>> Amin,
>>>>
>>>> Are you calling Close & Optimize after every addDocument?
>>>>
>>>> I would suggest something like this
>>>> try
>>>> {
>>>>   while //this could be your looping through a data reader for  
>>>> example
>>>>    {
>>>>         indexWriter.addDocument(document);
>>>>    }
>>>> }
>>>>
>>>> finally
>>>> {
>>>> commitAndOptimise()
>>>> }
>>>>
>>>>
>>>> HTH
>>>>
>>>> Shashi
>>>>
>>>>
>>>> ----- Original Message ----
>>>> From: Amin Mohammed-Coleman <am...@gmail.com>
>>>> To: java-user@lucene.apache.org
>>>> Sent: Saturday, January 3, 2009 4:02:52 AM
>>>> Subject: Re: Search Problem
>>>>
>>>>
>>>> Hi again!
>>>>
>>>> I think I may have found the problem but I was wondering if you  
>>>> could verify:
>>>>
>>>> I have the following for my indexer:
>>>>
>>>> public void add(Document document) {
>>>>     IndexWriter indexWriter =  
>>>> IndexWriterFactory.createIndexWriter(getDirectory(),  
>>>> getAnalyzer());
>>>>     try {
>>>>         indexWriter.addDocument(document);
>>>>         LOGGER.debug("Added Document:" + document + " to index");
>>>>         commitAndOptimise(indexWriter);
>>>>     } catch (CorruptIndexException e) {
>>>>         throw new IllegalStateException(e);
>>>>     } catch (IOException e) {
>>>>         throw new IllegalStateException(e);
>>>>     }
>>>> }
>>>>
>>>> the commitAndOptimise(indexWriter) looks like this:
>>>>
>>>> private void commitAndOptimise(IndexWriter indexWriter) throws  
>>>> CorruptIndexException,IOException {
>>>>     LOGGER.debug("Committing document and closing index writer");
>>>>     indexWriter.optimize();
>>>>     indexWriter.commit();
>>>>     indexWriter.close();
>>>> }
>>>>
>>>> It seems as though if I comment out optimize then the overview  
>>>> tab in Luke  for the rtf document looks like:
>>>>
>>>> 5    id    1234
>>>> 3    body    document
>>>> 3    body    body
>>>> 1    body    test
>>>> 1    body    rtf
>>>> 1    name    rtfDocumentToIndex.rtf
>>>> 1    body    new
>>>> 1    path    rtfDocumentToIndex.rtf
>>>> 1    summary    This is a
>>>> 1    type    RTF_INDEXER
>>>> 1    body    content
>>>>
>>>>
>>>> This is more what I expected although "Amin Mohammed-Coleman"  
>>>> hasn't been stored in the index.  Should I not be using  
>>>> indexWriter.optimize() ?
>>>>
>>>> I tried using the search function in luke and got the following  
>>>> results:
>>>> body:test ---> returns result
>>>> body:document ---> no result
>>>> body:content ---> no result
>>>> body:rtf ----> returns result
>>>>
>>>>
>>>> Thanks again...sorry to be sending so many emails about this. I  
>>>> am in the process of designing and developing a prototype of a  
>>>> document and domain indexing/searching component and I would like  
>>>> to demo to the rest of my team.
>>>>
>>>>
>>>> Cheers
>>>> Amin
>>>>
>>>>
>>>>
>>>> On 3 Jan 2009, at 01:23, Erick Erickson wrote:
>>>>
>>>>> Well, your query results are consistent with what Luke is
>>>>> reporting. So I'd go back and test your assumptions. I
>>>>> suspect that you're not indexing what you think you are.
>>>>>
>>>>> For your test document, I'd just print out what you're indexing
>>>>> and the field it's going into. *for each field*. that is, every  
>>>>> time you
>>>>> do a document.add(<field of some kind>), print out that data. I'm
>>>>> pretty sure you'll find that you're not getting what you expect.  
>>>>> For
>>>>> instance, the call to:
>>>>>
>>>>> MetaDataEnum.BODY.getDescription()
>>>>>
>>>>> may be returning some nonsense. Or
>>>>> bodyText.trim()
>>>>>
>>>>> isn't doing what you expect.
>>>>>
>>>>> Lucene is used by many folks, and errors of the magnitude you're
>>>>> experiencing would be seen by many people and the user list would
>>>>> be flooded with complaints if it were a Lucene issue at root. That
>>>>> leaves the code you wrote as the most likely culprit. So try a  
>>>>> very simple
>>>>> test case with lots of debugging println's. I'm pretty sure you'll
>>>>> find the underlying issue with some of your assumptions pretty  
>>>>> quickly.
>>>>>
>>>>> Sorry I can't be more specific, but we'd have to see all of your  
>>>>> code
>>>>> and the test cases to do that....
>>>>>
>>>>> Best
>>>>> Erick
>>>>>
>>>>> On Fri, Jan 2, 2009 at 6:13 PM, Amin Mohammed-Coleman <aminmc@gmail.com 
>>>>> >wrote:
>>>>>
>>>>>> Hi Erick
>>>>>>
>>>>>> Thanks for your reply.
>>>>>>
>>>>>> I have used luke to inspect the document and I am some what  
>>>>>> confused.  For
>>>>>> example when I view the index using the overview tab of Luke I  
>>>>>> get the
>>>>>> following:
>>>>>>
>>>>>> 1       body    test
>>>>>> 1       id      1234
>>>>>> 1       name    rtfDocumentToIndex.rtf
>>>>>> 1       path    rtfDocumentToIndex.rtf
>>>>>> 1       summary This is a
>>>>>> 1       type    RTF_INDEXER
>>>>>> 1       body    rtf
>>>>>>
>>>>>>
>>>>>> However when I view the document in the Document tab I get the  
>>>>>> full text
>>>>>> that was extracted from the rft document (field:body) which is:
>>>>>>
>>>>>> This is a test rtf document that will be indexed.
>>>>>> Amin Mohammed-Coleman
>>>>>>
>>>>>> I am using the StandardAnaylzer therefore I wouldnt expect the  
>>>>>> words
>>>>>> document, indexed, Amin Mohammed-Coleman to be removed.
>>>>>>
>>>>>> I have referenced the Lucene In Action book and I can't see  
>>>>>> what I may be
>>>>>> doing wrong.  I would be happy to provide a testcase should it  
>>>>>> be required.
>>>>>> When adding the body field to the document I am doing:
>>>>>>
>>>>>>   Document document = new Document();
>>>>>>                   Field field = new
>>>>>> Field(FieldNameEnum.BODY.getDescription(), bodyText.trim(),  
>>>>>> Field.Store.YES,
>>>>>> Field.Index.ANALYZED);
>>>>>>                   document.add(field);
>>>>>>
>>>>>>
>>>>>>
>>>>>> When I run the search code the string "test" is the only word  
>>>>>> that returns
>>>>>> a result (TopDocs), whereas the others do not (e.g. "amin",  
>>>>>> "document",
>>>>>> "indexed").
>>>>>>
>>>>>> Thanks again for your help and advice.
>>>>>>
>>>>>>
>>>>>> Cheers
>>>>>> Amin
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2 Jan 2009, at 21:20, Erick Erickson wrote:
>>>>>>
>>>>>> Casing is usually handled by the analyzer. Since you construct
>>>>>>> the term query programmatically, it doesn't go through
>>>>>>> any analyzers, thus is not converted into lower case for
>>>>>>> searching as was done automatically for you when you
>>>>>>> indexed using StandardAnalyzer.
>>>>>>>
>>>>>>> As for why you aren't getting hits, it's unclear to me. But
>>>>>>> what I'd do is get a copy of Luke and examine your index
>>>>>>> to see what's *really* there. This will often give you clues,
>>>>>>> usually pointing to some kind of analyzer behavior that you
>>>>>>> weren't expecting.
>>>>>>>
>>>>>>> Best
>>>>>>> Erick
>>>>>>>
>>>>>>> On Fri, Jan 2, 2009 at 6:39 AM, Amin Mohammed-Coleman <aminmc@gmail.com
>>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi
>>>>>>>>
>>>>>>>> I have tried this and it doesn't work.  I don't understand  
>>>>>>>> why using
>>>>>>>> "amin"
>>>>>>>> instead of "Amin" would work, is it not case insensitive?
>>>>>>>>
>>>>>>>> I tried "test" for field "body" and this works.  Any other  
>>>>>>>> terms don't
>>>>>>>> work
>>>>>>>> for example:
>>>>>>>>
>>>>>>>> "document"
>>>>>>>> "indexed"
>>>>>>>>
>>>>>>>> these are tokens that were extracted when creating the lucene  
>>>>>>>> document.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for your reply.
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> Amin
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2 Jan 2009, at 10:36, Chris Lu wrote:
>>>>>>>>
>>>>>>>> Basically Lucene stores analyzed tokens, and looks up for the  
>>>>>>>> matches
>>>>>>>>
>>>>>>>>> based
>>>>>>>>> on the tokens.
>>>>>>>>> "Amin" after StandardAnalyzer is "amin", so you need to use  
>>>>>>>>> new
>>>>>>>>> Term("body",
>>>>>>>>> "amin"), instead of new Term("body", "Amin"), to search.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Chris Lu
>>>>>>>>> -------------------------
>>>>>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>>>>>> site: http://www.dbsight.net
>>>>>>>>> demo: http://search.dbsight.com
>>>>>>>>> Lucene Database Search in 3 minutes:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>>>>>> DBSight customer, a shopping comparison site, (anonymous per  
>>>>>>>>> request)
>>>>>>>>> got
>>>>>>>>> 2.6 Million Euro funding!
>>>>>>>>>
>>>>>>>>> On Thu, Jan 1, 2009 at 11:30 PM, Amin Mohammed-Coleman <
>>>>>>>>> aminmc@gmail.com
>>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Sorry I was using the StandardAnalyzer in this instance.
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2 Jan 2009, at 00:55, Chris Lu wrote:
>>>>>>>>>>
>>>>>>>>>> You need to let us know the analyzer you are using.
>>>>>>>>>>
>>>>>>>>>> -- Chris Lu
>>>>>>>>>>> -------------------------
>>>>>>>>>>> Instant Scalable Full-Text Search On Any Database/ 
>>>>>>>>>>> Application
>>>>>>>>>>> site: http://www.dbsight.net
>>>>>>>>>>> demo: http://search.dbsight.com
>>>>>>>>>>> Lucene Database Search in 3 minutes:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>>>>>>>> DBSight customer, a shopping comparison site, (anonymous  
>>>>>>>>>>> per request)
>>>>>>>>>>> got
>>>>>>>>>>> 2.6 Million Euro funding!
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <
>>>>>>>>>>> aminmc@gmail.com
>>>>>>>>>>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> I have created a RTFHandler which takes a RTF file and  
>>>>>>>>>>>>> creates a
>>>>>>>>>>>>> lucene
>>>>>>>>>>>>> Document which is indexed.  The RTFHandler looks like  
>>>>>>>>>>>>> something like
>>>>>>>>>>>>> this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> if (bodyText != null) {
>>>>>>>>>>>>>             Document document = new Document();
>>>>>>>>>>>>>             Field field = new
>>>>>>>>>>>>> Field(MetaDataEnum.BODY.getDescription(), bodyText.trim(),
>>>>>>>>>>>>> Field.Store.YES,
>>>>>>>>>>>>> Field.Index.ANALYZED);
>>>>>>>>>>>>>             document.add(field);
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am using Java Built in RTF text extraction.  When I  
>>>>>>>>>>>>> run my test to
>>>>>>>>>>>>> verify that the document contains text that I expect  
>>>>>>>>>>>>> this works
>>>>>>>>>>>>> fine.
>>>>>>>>>>>>> I
>>>>>>>>>>>>> get
>>>>>>>>>>>>> the following when I print the document:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Document<stored/uncompressed,indexed,tokenized<body:This  
>>>>>>>>>>>>> is a test
>>>>>>>>>>>>> rtf
>>>>>>>>>>>>> document that will be indexed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Amin Mohammed-Coleman>
>>>>>>>>>>>>> stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>>>>>>>>>>>>> stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>>>>>>>>>>>>> stored/uncompressed,indexed<type:RTF_INDEXER>
>>>>>>>>>>>>> stored/uncompressed,indexed<summary:This is a >>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The problem is when I use the following to search I get  
>>>>>>>>>>>>> no result:
>>>>>>>>>>>>>
>>>>>>>>>>>>> MultiSearcher multiSearcher = new MultiSearcher(new  
>>>>>>>>>>>>> Searchable[]
>>>>>>>>>>>>> {rtfIndexSearcher});
>>>>>>>>>>>>>             Term t = new Term("body", "Amin");
>>>>>>>>>>>>>             TermQuery termQuery = new TermQuery(t);
>>>>>>>>>>>>>             TopDocs topDocs =  
>>>>>>>>>>>>> multiSearcher.search(termQuery,
>>>>>>>>>>>>> 1);
>>>>>>>>>>>>>             System.out.println(topDocs.totalHits);
>>>>>>>>>>>>>             multiSearcher.close();
>>>>>>>>>>>>>
>>>>>>>>>>>>> RftIndexSearcher is configured with the directory that  
>>>>>>>>>>>>> holds rtf
>>>>>>>>>>>>> documents.  I have used Luke to look at the document and  
>>>>>>>>>>>>> what I am
>>>>>>>>>>>>> finding
>>>>>>>>>>>>> in the overview tab is the following for the document:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1       body    test
>>>>>>>>>>>>> 1       id      1234
>>>>>>>>>>>>> 1       name    rtfDocumentToIndex.rtf
>>>>>>>>>>>>> 1       path    rtfDocumentToIndex.rtf
>>>>>>>>>>>>> 1       summary This is a
>>>>>>>>>>>>> 1       type    RTF_INDEXER
>>>>>>>>>>>>> 1       body    rtf
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> However on the Document tab I am getting (in the body  
>>>>>>>>>>>>> field):
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is a test rtf document that will be indexed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Amin Mohammed-Coleman
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would expect to get a hit using "Amin" or even  
>>>>>>>>>>>>> "document".  I am
>>>>>>>>>>>>> not
>>>>>>>>>>>>> sure whether the
>>>>>>>>>>>>> line:
>>>>>>>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>>>>>>>
>>>>>>>>>>>>> is incorrect as I am not too sure of the meaning of  
>>>>>>>>>>>>> "Finds the top n
>>>>>>>>>>>>> hits
>>>>>>>>>>>>> for query." for search (Query query, int n) according to  
>>>>>>>>>>>>> java docs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would be grateful if someone may be able to advise on  
>>>>>>>>>>>>> what I may
>>>>>>>>>>>>> be
>>>>>>>>>>>>> doing wrong.  I am using Lucene 2.4.0
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>> Amin
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>
>>> --------------------------
>>> Grant Ingersoll
>>>
>>> Lucene Helpful Hints:
>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>
> --------------------------
> Grant Ingersoll
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>
>
>
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Search Problem

Posted by Grant Ingersoll <gs...@apache.org>.

The mailing list often strips attachments (in fact, I'm surprised your  
earlier ones made it through).  Perhaps you can put them up somewhere  
for download.


On Jan 3, 2009, at 1:07 PM, Amin Mohammed-Coleman wrote:

> Hi again
>
> Sorry I didn't include the WorkItem class!  Here is the final test  
> case.  Apologies!
> On 3 Jan 2009, at 14:02, Grant Ingersoll wrote:
>
>> You shouldn't need to call close and optimize after each document.
>>
>> You also don't need the commit if you are going to immediately close.
>>
>> Also, can you send a standalone test that shows the RTF extraction,  
>> the document creation and the indexing code that demonstrates your  
>> issue.
>>
>> FWIW, and as a complete aside to save you some time after you get  
>> this figured out, instead of re-inventing RTF extraction and PDF  
>> extraction (as you appear to be doing), have a look at Tika (http://lucene.apache.org/tika 
>> )
>>
>> On Jan 3, 2009, at 8:48 AM, Shashi Kant wrote:
>>
>>> Amin,
>>>
>>> Are you calling Close & Optimize after every addDocument?
>>>
>>> I would suggest something like this
>>> try
>>> {
>>>    while //this could be your looping through a data reader for  
>>> example
>>>     {
>>>          indexWriter.addDocument(document);
>>>     }
>>> }
>>>
>>> finally
>>> {
>>> commitAndOptimise()
>>> }
>>>
>>>
>>> HTH
>>>
>>> Shashi
>>>
>>>
>>> ----- Original Message ----
>>> From: Amin Mohammed-Coleman <am...@gmail.com>
>>> To: java-user@lucene.apache.org
>>> Sent: Saturday, January 3, 2009 4:02:52 AM
>>> Subject: Re: Search Problem
>>>
>>>
>>> Hi again!
>>>
>>> I think I may have found the problem but I was wondering if you  
>>> could verify:
>>>
>>> I have the following for my indexer:
>>>
>>> public void add(Document document) {
>>>      IndexWriter indexWriter =  
>>> IndexWriterFactory.createIndexWriter(getDirectory(), getAnalyzer());
>>>      try {
>>>          indexWriter.addDocument(document);
>>>          LOGGER.debug("Added Document:" + document + " to index");
>>>          commitAndOptimise(indexWriter);
>>>      } catch (CorruptIndexException e) {
>>>          throw new IllegalStateException(e);
>>>      } catch (IOException e) {
>>>          throw new IllegalStateException(e);
>>>      }
>>>  }
>>>
>>> the commitAndOptimise(indexWriter) looks like this:
>>>
>>> private void commitAndOptimise(IndexWriter indexWriter) throws  
>>> CorruptIndexException,IOException {
>>>      LOGGER.debug("Committing document and closing index writer");
>>>      indexWriter.optimize();
>>>      indexWriter.commit();
>>>      indexWriter.close();
>>>  }
>>>
>>> It seems as though if I comment out optimize then the overview tab  
>>> in Luke  for the rtf document looks like:
>>>
>>> 5    id    1234
>>> 3    body    document
>>> 3    body    body
>>> 1    body    test
>>> 1    body    rtf
>>> 1    name    rtfDocumentToIndex.rtf
>>> 1    body    new
>>> 1    path    rtfDocumentToIndex.rtf
>>> 1    summary    This is a
>>> 1    type    RTF_INDEXER
>>> 1    body    content
>>>
>>>
>>> This is more what I expected although "Amin Mohammed-Coleman"  
>>> hasn't been stored in the index.  Should I not be using  
>>> indexWriter.optimize() ?
>>>
>>> I tried using the search function in luke and got the following  
>>> results:
>>> body:test ---> returns result
>>> body:document ---> no result
>>> body:content ---> no result
>>> body:rtf ----> returns result
>>>
>>>
>>> Thanks again...sorry to be sending so many emails about this. I am  
>>> in the process of designing and developing a prototype of a  
>>> document and domain indexing/searching component and I would like  
>>> to demo to the rest of my team.
>>>
>>>
>>> Cheers
>>> Amin
>>>
>>>
>>>
>>> On 3 Jan 2009, at 01:23, Erick Erickson wrote:
>>>
>>>> Well, your query results are consistent with what Luke is
>>>> reporting. So I'd go back and test your assumptions. I
>>>> suspect that you're not indexing what you think you are.
>>>>
>>>> For your test document, I'd just print out what you're indexing
>>>> and the field it's going into. *for each field*. that is, every  
>>>> time you
>>>> do a document.add(<field of some kind>), print out that data. I'm
>>>> pretty sure you'll find that you're not getting what you expect.  
>>>> For
>>>> instance, the call to:
>>>>
>>>> MetaDataEnum.BODY.getDescription()
>>>>
>>>> may be returning some nonsense. Or
>>>> bodyText.trim()
>>>>
>>>> isn't doing what you expect.
>>>>
>>>> Lucene is used by many folks, and errors of the magnitude you're
>>>> experiencing would be seen by many people and the user list would
>>>> be flooded with complaints if it were a Lucene issue at root. That
>>>> leaves the code you wrote as the most likely culprit. So try a  
>>>> very simple
>>>> test case with lots of debugging println's. I'm pretty sure you'll
>>>> find the underlying issue with some of your assumptions pretty  
>>>> quickly.
>>>>
>>>> Sorry I can't be more specific, but we'd have to see all of your  
>>>> code
>>>> and the test cases to do that....
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Fri, Jan 2, 2009 at 6:13 PM, Amin Mohammed-Coleman <aminmc@gmail.com 
>>>> >wrote:
>>>>
>>>>> Hi Erick
>>>>>
>>>>> Thanks for your reply.
>>>>>
>>>>> I have used luke to inspect the document and I am some what  
>>>>> confused.  For
>>>>> example when I view the index using the overview tab of Luke I  
>>>>> get the
>>>>> following:
>>>>>
>>>>> 1       body    test
>>>>> 1       id      1234
>>>>> 1       name    rtfDocumentToIndex.rtf
>>>>> 1       path    rtfDocumentToIndex.rtf
>>>>> 1       summary This is a
>>>>> 1       type    RTF_INDEXER
>>>>> 1       body    rtf
>>>>>
>>>>>
>>>>> However when I view the document in the Document tab I get the  
>>>>> full text
>>>>> that was extracted from the rft document (field:body) which is:
>>>>>
>>>>> This is a test rtf document that will be indexed.
>>>>> Amin Mohammed-Coleman
>>>>>
>>>>> I am using the StandardAnaylzer therefore I wouldnt expect the  
>>>>> words
>>>>> document, indexed, Amin Mohammed-Coleman to be removed.
>>>>>
>>>>> I have referenced the Lucene In Action book and I can't see what  
>>>>> I may be
>>>>> doing wrong.  I would be happy to provide a testcase should it  
>>>>> be required.
>>>>> When adding the body field to the document I am doing:
>>>>>
>>>>>    Document document = new Document();
>>>>>                    Field field = new
>>>>> Field(FieldNameEnum.BODY.getDescription(), bodyText.trim(),  
>>>>> Field.Store.YES,
>>>>> Field.Index.ANALYZED);
>>>>>                    document.add(field);
>>>>>
>>>>>
>>>>>
>>>>> When I run the search code the string "test" is the only word  
>>>>> that returns
>>>>> a result (TopDocs), whereas the others do not (e.g. "amin",  
>>>>> "document",
>>>>> "indexed").
>>>>>
>>>>> Thanks again for your help and advice.
>>>>>
>>>>>
>>>>> Cheers
>>>>> Amin
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 2 Jan 2009, at 21:20, Erick Erickson wrote:
>>>>>
>>>>> Casing is usually handled by the analyzer. Since you construct
>>>>>> the term query programmatically, it doesn't go through
>>>>>> any analyzers, thus is not converted into lower case for
>>>>>> searching as was done automatically for you when you
>>>>>> indexed using StandardAnalyzer.
>>>>>>
>>>>>> As for why you aren't getting hits, it's unclear to me. But
>>>>>> what I'd do is get a copy of Luke and examine your index
>>>>>> to see what's *really* there. This will often give you clues,
>>>>>> usually pointing to some kind of analyzer behavior that you
>>>>>> weren't expecting.
>>>>>>
>>>>>> Best
>>>>>> Erick
>>>>>>
>>>>>> On Fri, Jan 2, 2009 at 6:39 AM, Amin Mohammed-Coleman <aminmc@gmail.com
>>>>>>> wrote:
>>>>>>
>>>>>> Hi
>>>>>>>
>>>>>>> I have tried this and it doesn't work.  I don't understand why  
>>>>>>> using
>>>>>>> "amin"
>>>>>>> instead of "Amin" would work, is it not case insensitive?
>>>>>>>
>>>>>>> I tried "test" for field "body" and this works.  Any other  
>>>>>>> terms don't
>>>>>>> work
>>>>>>> for example:
>>>>>>>
>>>>>>> "document"
>>>>>>> "indexed"
>>>>>>>
>>>>>>> these are tokens that were extracted when creating the lucene  
>>>>>>> document.
>>>>>>>
>>>>>>>
>>>>>>> Thanks for your reply.
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> Amin
>>>>>>>
>>>>>>>
>>>>>>> On 2 Jan 2009, at 10:36, Chris Lu wrote:
>>>>>>>
>>>>>>> Basically Lucene stores analyzed tokens, and looks up for the  
>>>>>>> matches
>>>>>>>
>>>>>>>> based
>>>>>>>> on the tokens.
>>>>>>>> "Amin" after StandardAnalyzer is "amin", so you need to use new
>>>>>>>> Term("body",
>>>>>>>> "amin"), instead of new Term("body", "Amin"), to search.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Chris Lu
>>>>>>>> -------------------------
>>>>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>>>>> site: http://www.dbsight.net
>>>>>>>> demo: http://search.dbsight.com
>>>>>>>> Lucene Database Search in 3 minutes:
>>>>>>>>
>>>>>>>>
>>>>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>>>>> DBSight customer, a shopping comparison site, (anonymous per  
>>>>>>>> request)
>>>>>>>> got
>>>>>>>> 2.6 Million Euro funding!
>>>>>>>>
>>>>>>>> On Thu, Jan 1, 2009 at 11:30 PM, Amin Mohammed-Coleman <
>>>>>>>> aminmc@gmail.com
>>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Sorry I was using the StandardAnalyzer in this instance.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2 Jan 2009, at 00:55, Chris Lu wrote:
>>>>>>>>>
>>>>>>>>> You need to let us know the analyzer you are using.
>>>>>>>>>
>>>>>>>>> -- Chris Lu
>>>>>>>>>> -------------------------
>>>>>>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>>>>>>> site: http://www.dbsight.net
>>>>>>>>>> demo: http://search.dbsight.com
>>>>>>>>>> Lucene Database Search in 3 minutes:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>>>>>>> DBSight customer, a shopping comparison site, (anonymous  
>>>>>>>>>> per request)
>>>>>>>>>> got
>>>>>>>>>> 2.6 Million Euro funding!
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <
>>>>>>>>>> aminmc@gmail.com
>>>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> I have created a RTFHandler which takes a RTF file and  
>>>>>>>>>>>> creates a
>>>>>>>>>>>> lucene
>>>>>>>>>>>> Document which is indexed.  The RTFHandler looks like  
>>>>>>>>>>>> something like
>>>>>>>>>>>> this:
>>>>>>>>>>>>
>>>>>>>>>>>> if (bodyText != null) {
>>>>>>>>>>>>              Document document = new Document();
>>>>>>>>>>>>              Field field = new
>>>>>>>>>>>> Field(MetaDataEnum.BODY.getDescription(), bodyText.trim(),
>>>>>>>>>>>> Field.Store.YES,
>>>>>>>>>>>> Field.Index.ANALYZED);
>>>>>>>>>>>>              document.add(field);
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> I am using Java Built in RTF text extraction.  When I run  
>>>>>>>>>>>> my test to
>>>>>>>>>>>> verify that the document contains text that I expect this  
>>>>>>>>>>>> works
>>>>>>>>>>>> fine.
>>>>>>>>>>>> I
>>>>>>>>>>>> get
>>>>>>>>>>>> the following when I print the document:
>>>>>>>>>>>>
>>>>>>>>>>>> Document<stored/uncompressed,indexed,tokenized<body:This  
>>>>>>>>>>>> is a test
>>>>>>>>>>>> rtf
>>>>>>>>>>>> document that will be indexed.
>>>>>>>>>>>>
>>>>>>>>>>>> Amin Mohammed-Coleman>
>>>>>>>>>>>> stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>>>>>>>>>>>> stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>>>>>>>>>>>> stored/uncompressed,indexed<type:RTF_INDEXER>
>>>>>>>>>>>> stored/uncompressed,indexed<summary:This is a >>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The problem is when I use the following to search I get  
>>>>>>>>>>>> no result:
>>>>>>>>>>>>
>>>>>>>>>>>> MultiSearcher multiSearcher = new MultiSearcher(new  
>>>>>>>>>>>> Searchable[]
>>>>>>>>>>>> {rtfIndexSearcher});
>>>>>>>>>>>>              Term t = new Term("body", "Amin");
>>>>>>>>>>>>              TermQuery termQuery = new TermQuery(t);
>>>>>>>>>>>>              TopDocs topDocs =  
>>>>>>>>>>>> multiSearcher.search(termQuery,
>>>>>>>>>>>> 1);
>>>>>>>>>>>>              System.out.println(topDocs.totalHits);
>>>>>>>>>>>>              multiSearcher.close();
>>>>>>>>>>>>
>>>>>>>>>>>> RftIndexSearcher is configured with the directory that  
>>>>>>>>>>>> holds rtf
>>>>>>>>>>>> documents.  I have used Luke to look at the document and  
>>>>>>>>>>>> what I am
>>>>>>>>>>>> finding
>>>>>>>>>>>> in the overview tab is the following for the document:
>>>>>>>>>>>>
>>>>>>>>>>>> 1       body    test
>>>>>>>>>>>> 1       id      1234
>>>>>>>>>>>> 1       name    rtfDocumentToIndex.rtf
>>>>>>>>>>>> 1       path    rtfDocumentToIndex.rtf
>>>>>>>>>>>> 1       summary This is a
>>>>>>>>>>>> 1       type    RTF_INDEXER
>>>>>>>>>>>> 1       body    rtf
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> However on the Document tab I am getting (in the body  
>>>>>>>>>>>> field):
>>>>>>>>>>>>
>>>>>>>>>>>> This is a test rtf document that will be indexed.
>>>>>>>>>>>>
>>>>>>>>>>>> Amin Mohammed-Coleman
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I would expect to get a hit using "Amin" or even  
>>>>>>>>>>>> "document".  I am
>>>>>>>>>>>> not
>>>>>>>>>>>> sure whether the
>>>>>>>>>>>> line:
>>>>>>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>>>>>>
>>>>>>>>>>>> is incorrect as I am not too sure of the meaning of  
>>>>>>>>>>>> "Finds the top n
>>>>>>>>>>>> hits
>>>>>>>>>>>> for query." for search (Query query, int n) according to  
>>>>>>>>>>>> java docs.
>>>>>>>>>>>>
>>>>>>>>>>>> I would be grateful if someone may be able to advise on  
>>>>>>>>>>>> what I may
>>>>>>>>>>>> be
>>>>>>>>>>>> doing wrong.  I am using Lucene 2.4.0
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers
>>>>>>>>>>>> Amin
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: java-user- 
>>>>>>>>> unsubscribe@lucene.apache.org
>>>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user- 
>>>>>>> help@lucene.apache.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>
>> --------------------------
>> Grant Ingersoll
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ











---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Search Problem

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Hi again

Sorry I didn't include the WorkItem class!  Here is the final test  
case.  Apologies!

Re: Search Problem

Posted by Grant Ingersoll <gs...@apache.org>.

You shouldn't need to call close and optimize after each document.

You also don't need the commit if you are going to immediately close.

Also, can you send a standalone test that shows the RTF extraction,  
the document creation and the indexing code that demonstrates your  
issue.

FWIW, and as a complete aside to save you some time after you get this  
figured out, instead of re-inventing RTF extraction and PDF extraction  
(as you appear to be doing), have a look at Tika (http://lucene.apache.org/tika 
)

On Jan 3, 2009, at 8:48 AM, Shashi Kant wrote:

> Amin,
>
> Are you calling Close & Optimize after every addDocument?
>
> I would suggest something like this
> try
> {
>      while //this could be your looping through a data reader for  
> example
>       {
>            indexWriter.addDocument(document);
>       }
> }
>
> finally
> {
>  commitAndOptimise()
> }
>
>
> HTH
>
> Shashi
>
>
> ----- Original Message ----
> From: Amin Mohammed-Coleman <am...@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Saturday, January 3, 2009 4:02:52 AM
> Subject: Re: Search Problem
>
>
> Hi again!
>
> I think I may have found the problem but I was wondering if you  
> could verify:
>
> I have the following for my indexer:
>
> public void add(Document document) {
>        IndexWriter indexWriter =  
> IndexWriterFactory.createIndexWriter(getDirectory(), getAnalyzer());
>        try {
>            indexWriter.addDocument(document);
>            LOGGER.debug("Added Document:" + document + " to index");
>            commitAndOptimise(indexWriter);
>        } catch (CorruptIndexException e) {
>            throw new IllegalStateException(e);
>        } catch (IOException e) {
>            throw new IllegalStateException(e);
>        }
>    }
>
> the commitAndOptimise(indexWriter) looks like this:
>
> private void commitAndOptimise(IndexWriter indexWriter) throws  
> CorruptIndexException,IOException {
>        LOGGER.debug("Committing document and closing index writer");
>        indexWriter.optimize();
>        indexWriter.commit();
>        indexWriter.close();
>    }
>
> It seems as though if I comment out optimize then the overview tab  
> in Luke  for the rtf document looks like:
>
> 5    id    1234
> 3    body    document
> 3    body    body
> 1    body    test
> 1    body    rtf
> 1    name    rtfDocumentToIndex.rtf
> 1    body    new
> 1    path    rtfDocumentToIndex.rtf
> 1    summary    This is a
> 1    type    RTF_INDEXER
> 1    body    content
>
>
> This is more what I expected although "Amin Mohammed-Coleman" hasn't  
> been stored in the index.  Should I not be using  
> indexWriter.optimize() ?
>
> I tried using the search function in luke and got the following  
> results:
> body:test ---> returns result
> body:document ---> no result
> body:content ---> no result
> body:rtf ----> returns result
>
>
> Thanks again...sorry to be sending so many emails about this. I am  
> in the process of designing and developing a prototype of a document  
> and domain indexing/searching component and I would like to demo to  
> the rest of my team.
>
>
> Cheers
> Amin
>
>
>
> On 3 Jan 2009, at 01:23, Erick Erickson wrote:
>
>> Well, your query results are consistent with what Luke is
>> reporting. So I'd go back and test your assumptions. I
>> suspect that you're not indexing what you think you are.
>>
>> For your test document, I'd just print out what you're indexing
>> and the field it's going into. *for each field*. that is, every  
>> time you
>> do a document.add(<field of some kind>), print out that data. I'm
>> pretty sure you'll find that you're not getting what you expect. For
>> instance, the call to:
>>
>> MetaDataEnum.BODY.getDescription()
>>
>> may be returning some nonsense. Or
>> bodyText.trim()
>>
>> isn't doing what you expect.
>>
>> Lucene is used by many folks, and errors of the magnitude you're
>> experiencing would be seen by many people and the user list would
>> be flooded with complaints if it were a Lucene issue at root. That
>> leaves the code you wrote as the most likely culprit. So try a very  
>> simple
>> test case with lots of debugging println's. I'm pretty sure you'll
>> find the underlying issue with some of your assumptions pretty  
>> quickly.
>>
>> Sorry I can't be more specific, but we'd have to see all of your code
>> and the test cases to do that....
>>
>> Best
>> Erick
>>
>> On Fri, Jan 2, 2009 at 6:13 PM, Amin Mohammed-Coleman <aminmc@gmail.com 
>> >wrote:
>>
>>> Hi Erick
>>>
>>> Thanks for your reply.
>>>
>>> I have used luke to inspect the document and I am some what  
>>> confused.  For
>>> example when I view the index using the overview tab of Luke I get  
>>> the
>>> following:
>>>
>>> 1       body    test
>>> 1       id      1234
>>> 1       name    rtfDocumentToIndex.rtf
>>> 1       path    rtfDocumentToIndex.rtf
>>> 1       summary This is a
>>> 1       type    RTF_INDEXER
>>> 1       body    rtf
>>>
>>>
>>> However when I view the document in the Document tab I get the  
>>> full text
>>> that was extracted from the rft document (field:body) which is:
>>>
>>> This is a test rtf document that will be indexed.
>>> Amin Mohammed-Coleman
>>>
>>> I am using the StandardAnaylzer therefore I wouldnt expect the words
>>> document, indexed, Amin Mohammed-Coleman to be removed.
>>>
>>> I have referenced the Lucene In Action book and I can't see what I  
>>> may be
>>> doing wrong.  I would be happy to provide a testcase should it be  
>>> required.
>>> When adding the body field to the document I am doing:
>>>
>>>      Document document = new Document();
>>>                      Field field = new
>>> Field(FieldNameEnum.BODY.getDescription(), bodyText.trim(),  
>>> Field.Store.YES,
>>> Field.Index.ANALYZED);
>>>                      document.add(field);
>>>
>>>
>>>
>>> When I run the search code the string "test" is the only word that  
>>> returns
>>> a result (TopDocs), whereas the others do not (e.g. "amin",  
>>> "document",
>>> "indexed").
>>>
>>> Thanks again for your help and advice.
>>>
>>>
>>> Cheers
>>> Amin
>>>
>>>
>>>
>>>
>>> On 2 Jan 2009, at 21:20, Erick Erickson wrote:
>>>
>>> Casing is usually handled by the analyzer. Since you construct
>>>> the term query programmatically, it doesn't go through
>>>> any analyzers, thus is not converted into lower case for
>>>> searching as was done automatically for you when you
>>>> indexed using StandardAnalyzer.
>>>>
>>>> As for why you aren't getting hits, it's unclear to me. But
>>>> what I'd do is get a copy of Luke and examine your index
>>>> to see what's *really* there. This will often give you clues,
>>>> usually pointing to some kind of analyzer behavior that you
>>>> weren't expecting.
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Fri, Jan 2, 2009 at 6:39 AM, Amin Mohammed-Coleman <aminmc@gmail.com
>>>>> wrote:
>>>>
>>>> Hi
>>>>>
>>>>> I have tried this and it doesn't work.  I don't understand why  
>>>>> using
>>>>> "amin"
>>>>> instead of "Amin" would work, is it not case insensitive?
>>>>>
>>>>> I tried "test" for field "body" and this works.  Any other terms  
>>>>> don't
>>>>> work
>>>>> for example:
>>>>>
>>>>> "document"
>>>>> "indexed"
>>>>>
>>>>> these are tokens that were extracted when creating the lucene  
>>>>> document.
>>>>>
>>>>>
>>>>> Thanks for your reply.
>>>>>
>>>>> Cheers
>>>>>
>>>>> Amin
>>>>>
>>>>>
>>>>> On 2 Jan 2009, at 10:36, Chris Lu wrote:
>>>>>
>>>>> Basically Lucene stores analyzed tokens, and looks up for the  
>>>>> matches
>>>>>
>>>>>> based
>>>>>> on the tokens.
>>>>>> "Amin" after StandardAnalyzer is "amin", so you need to use new
>>>>>> Term("body",
>>>>>> "amin"), instead of new Term("body", "Amin"), to search.
>>>>>>
>>>>>> --
>>>>>> Chris Lu
>>>>>> -------------------------
>>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>>> site: http://www.dbsight.net
>>>>>> demo: http://search.dbsight.com
>>>>>> Lucene Database Search in 3 minutes:
>>>>>>
>>>>>>
>>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>>> DBSight customer, a shopping comparison site, (anonymous per  
>>>>>> request)
>>>>>> got
>>>>>> 2.6 Million Euro funding!
>>>>>>
>>>>>> On Thu, Jan 1, 2009 at 11:30 PM, Amin Mohammed-Coleman <
>>>>>> aminmc@gmail.com
>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>
>>>>>> Hi
>>>>>>
>>>>>>>
>>>>>>> Sorry I was using the StandardAnalyzer in this instance.
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2 Jan 2009, at 00:55, Chris Lu wrote:
>>>>>>>
>>>>>>> You need to let us know the analyzer you are using.
>>>>>>>
>>>>>>> -- Chris Lu
>>>>>>>> -------------------------
>>>>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>>>>> site: http://www.dbsight.net
>>>>>>>> demo: http://search.dbsight.com
>>>>>>>> Lucene Database Search in 3 minutes:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>>>>> DBSight customer, a shopping comparison site, (anonymous per  
>>>>>>>> request)
>>>>>>>> got
>>>>>>>> 2.6 Million Euro funding!
>>>>>>>>
>>>>>>>> On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <
>>>>>>>> aminmc@gmail.com
>>>>>>>>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I have created a RTFHandler which takes a RTF file and  
>>>>>>>>>> creates a
>>>>>>>>>> lucene
>>>>>>>>>> Document which is indexed.  The RTFHandler looks like  
>>>>>>>>>> something like
>>>>>>>>>> this:
>>>>>>>>>>
>>>>>>>>>> if (bodyText != null) {
>>>>>>>>>>                Document document = new Document();
>>>>>>>>>>                Field field = new
>>>>>>>>>> Field(MetaDataEnum.BODY.getDescription(), bodyText.trim(),
>>>>>>>>>> Field.Store.YES,
>>>>>>>>>> Field.Index.ANALYZED);
>>>>>>>>>>                document.add(field);
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> I am using Java Built in RTF text extraction.  When I run  
>>>>>>>>>> my test to
>>>>>>>>>> verify that the document contains text that I expect this  
>>>>>>>>>> works
>>>>>>>>>> fine.
>>>>>>>>>> I
>>>>>>>>>> get
>>>>>>>>>> the following when I print the document:
>>>>>>>>>>
>>>>>>>>>> Document<stored/uncompressed,indexed,tokenized<body:This is  
>>>>>>>>>> a test
>>>>>>>>>> rtf
>>>>>>>>>> document that will be indexed.
>>>>>>>>>>
>>>>>>>>>> Amin Mohammed-Coleman>
>>>>>>>>>> stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>>>>>>>>>> stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>>>>>>>>>> stored/uncompressed,indexed<type:RTF_INDEXER>
>>>>>>>>>> stored/uncompressed,indexed<summary:This is a >>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The problem is when I use the following to search I get no  
>>>>>>>>>> result:
>>>>>>>>>>
>>>>>>>>>> MultiSearcher multiSearcher = new MultiSearcher(new  
>>>>>>>>>> Searchable[]
>>>>>>>>>> {rtfIndexSearcher});
>>>>>>>>>>                Term t = new Term("body", "Amin");
>>>>>>>>>>                TermQuery termQuery = new TermQuery(t);
>>>>>>>>>>                TopDocs topDocs =  
>>>>>>>>>> multiSearcher.search(termQuery,
>>>>>>>>>> 1);
>>>>>>>>>>                System.out.println(topDocs.totalHits);
>>>>>>>>>>                multiSearcher.close();
>>>>>>>>>>
>>>>>>>>>> RftIndexSearcher is configured with the directory that  
>>>>>>>>>> holds rtf
>>>>>>>>>> documents.  I have used Luke to look at the document and  
>>>>>>>>>> what I am
>>>>>>>>>> finding
>>>>>>>>>> in the overview tab is the following for the document:
>>>>>>>>>>
>>>>>>>>>> 1       body    test
>>>>>>>>>> 1       id      1234
>>>>>>>>>> 1       name    rtfDocumentToIndex.rtf
>>>>>>>>>> 1       path    rtfDocumentToIndex.rtf
>>>>>>>>>> 1       summary This is a
>>>>>>>>>> 1       type    RTF_INDEXER
>>>>>>>>>> 1       body    rtf
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> However on the Document tab I am getting (in the body field):
>>>>>>>>>>
>>>>>>>>>> This is a test rtf document that will be indexed.
>>>>>>>>>>
>>>>>>>>>> Amin Mohammed-Coleman
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I would expect to get a hit using "Amin" or even  
>>>>>>>>>> "document".  I am
>>>>>>>>>> not
>>>>>>>>>> sure whether the
>>>>>>>>>> line:
>>>>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>>>>
>>>>>>>>>> is incorrect as I am not too sure of the meaning of "Finds  
>>>>>>>>>> the top n
>>>>>>>>>> hits
>>>>>>>>>> for query." for search (Query query, int n) according to  
>>>>>>>>>> java docs.
>>>>>>>>>>
>>>>>>>>>> I would be grateful if someone may be able to advise on  
>>>>>>>>>> what I may
>>>>>>>>>> be
>>>>>>>>>> doing wrong.  I am using Lucene 2.4.0
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>> Amin
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user- 
>>>>>>> help@lucene.apache.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>>
>>>>>
>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ











---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Search Problem

Posted by Shashi Kant <sh...@yahoo.com>.

Amin,

Are you calling Close & Optimize after every addDocument?

I would suggest something like this
try
{
      while //this could be your looping through a data reader for example
       {
            indexWriter.addDocument(document);
       }
}

finally
{
  commitAndOptimise()
}


HTH

Shashi


----- Original Message ----
From: Amin Mohammed-Coleman <am...@gmail.com>
To: java-user@lucene.apache.org
Sent: Saturday, January 3, 2009 4:02:52 AM
Subject: Re: Search Problem


Hi again!

I think I may have found the problem but I was wondering if you could verify:

I have the following for my indexer:

public void add(Document document) {
        IndexWriter indexWriter = IndexWriterFactory.createIndexWriter(getDirectory(), getAnalyzer());
        try {
            indexWriter.addDocument(document);
            LOGGER.debug("Added Document:" + document + " to index");
            commitAndOptimise(indexWriter);
        } catch (CorruptIndexException e) {
            throw new IllegalStateException(e);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

the commitAndOptimise(indexWriter) looks like this:

private void commitAndOptimise(IndexWriter indexWriter) throws CorruptIndexException,IOException {
        LOGGER.debug("Committing document and closing index writer");
        indexWriter.optimize();
        indexWriter.commit();
        indexWriter.close();
    }

It seems as though if I comment out optimize then the overview tab in Luke  for the rtf document looks like:

5    id    1234
3    body    document
3    body    body
1    body    test
1    body    rtf
1    name    rtfDocumentToIndex.rtf
1    body    new
1    path    rtfDocumentToIndex.rtf
1    summary    This is a
1    type    RTF_INDEXER
1    body    content


This is more what I expected although "Amin Mohammed-Coleman" hasn't been stored in the index.  Should I not be using indexWriter.optimize() ?

I tried using the search function in luke and got the following results:
body:test ---> returns result
body:document ---> no result
body:content ---> no result
body:rtf ----> returns result


Thanks again...sorry to be sending so many emails about this. I am in the process of designing and developing a prototype of a document and domain indexing/searching component and I would like to demo to the rest of my team.


Cheers
Amin



On 3 Jan 2009, at 01:23, Erick Erickson wrote:

> Well, your query results are consistent with what Luke is
> reporting. So I'd go back and test your assumptions. I
> suspect that you're not indexing what you think you are.
> 
> For your test document, I'd just print out what you're indexing
> and the field it's going into. *for each field*. that is, every time you
> do a document.add(<field of some kind>), print out that data. I'm
> pretty sure you'll find that you're not getting what you expect. For
> instance, the call to:
> 
> MetaDataEnum.BODY.getDescription()
> 
> may be returning some nonsense. Or
> bodyText.trim()
> 
> isn't doing what you expect.
> 
> Lucene is used by many folks, and errors of the magnitude you're
> experiencing would be seen by many people and the user list would
> be flooded with complaints if it were a Lucene issue at root. That
> leaves the code you wrote as the most likely culprit. So try a very simple
> test case with lots of debugging println's. I'm pretty sure you'll
> find the underlying issue with some of your assumptions pretty quickly.
> 
> Sorry I can't be more specific, but we'd have to see all of your code
> and the test cases to do that....
> 
> Best
> Erick
> 
> On Fri, Jan 2, 2009 at 6:13 PM, Amin Mohammed-Coleman <am...@gmail.com>wrote:
> 
>> Hi Erick
>> 
>> Thanks for your reply.
>> 
>> I have used luke to inspect the document and I am some what confused.  For
>> example when I view the index using the overview tab of Luke I get the
>> following:
>> 
>> 1       body    test
>> 1       id      1234
>> 1       name    rtfDocumentToIndex.rtf
>> 1       path    rtfDocumentToIndex.rtf
>> 1       summary This is a
>> 1       type    RTF_INDEXER
>> 1       body    rtf
>> 
>> 
>> However when I view the document in the Document tab I get the full text
>> that was extracted from the rft document (field:body) which is:
>> 
>> This is a test rtf document that will be indexed.
>> Amin Mohammed-Coleman
>> 
>> I am using the StandardAnaylzer therefore I wouldnt expect the words
>> document, indexed, Amin Mohammed-Coleman to be removed.
>> 
>> I have referenced the Lucene In Action book and I can't see what I may be
>> doing wrong.  I would be happy to provide a testcase should it be required.
>> When adding the body field to the document I am doing:
>> 
>>       Document document = new Document();
>>                       Field field = new
>> Field(FieldNameEnum.BODY.getDescription(), bodyText.trim(), Field.Store.YES,
>> Field.Index.ANALYZED);
>>                       document.add(field);
>> 
>> 
>> 
>> When I run the search code the string "test" is the only word that returns
>> a result (TopDocs), whereas the others do not (e.g. "amin", "document",
>> "indexed").
>> 
>> Thanks again for your help and advice.
>> 
>> 
>> Cheers
>> Amin
>> 
>> 
>> 
>> 
>> On 2 Jan 2009, at 21:20, Erick Erickson wrote:
>> 
>> Casing is usually handled by the analyzer. Since you construct
>>> the term query programmatically, it doesn't go through
>>> any analyzers, thus is not converted into lower case for
>>> searching as was done automatically for you when you
>>> indexed using StandardAnalyzer.
>>> 
>>> As for why you aren't getting hits, it's unclear to me. But
>>> what I'd do is get a copy of Luke and examine your index
>>> to see what's *really* there. This will often give you clues,
>>> usually pointing to some kind of analyzer behavior that you
>>> weren't expecting.
>>> 
>>> Best
>>> Erick
>>> 
>>> On Fri, Jan 2, 2009 at 6:39 AM, Amin Mohammed-Coleman <aminmc@gmail.com
>>>> wrote:
>>> 
>>> Hi
>>>> 
>>>> I have tried this and it doesn't work.  I don't understand why using
>>>> "amin"
>>>> instead of "Amin" would work, is it not case insensitive?
>>>> 
>>>> I tried "test" for field "body" and this works.  Any other terms don't
>>>> work
>>>> for example:
>>>> 
>>>> "document"
>>>> "indexed"
>>>> 
>>>> these are tokens that were extracted when creating the lucene document.
>>>> 
>>>> 
>>>> Thanks for your reply.
>>>> 
>>>> Cheers
>>>> 
>>>> Amin
>>>> 
>>>> 
>>>> On 2 Jan 2009, at 10:36, Chris Lu wrote:
>>>> 
>>>> Basically Lucene stores analyzed tokens, and looks up for the matches
>>>> 
>>>>> based
>>>>> on the tokens.
>>>>> "Amin" after StandardAnalyzer is "amin", so you need to use new
>>>>> Term("body",
>>>>> "amin"), instead of new Term("body", "Amin"), to search.
>>>>> 
>>>>> --
>>>>> Chris Lu
>>>>> -------------------------
>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>> site: http://www.dbsight.net
>>>>> demo: http://search.dbsight.com
>>>>> Lucene Database Search in 3 minutes:
>>>>> 
>>>>> 
>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>> DBSight customer, a shopping comparison site, (anonymous per request)
>>>>> got
>>>>> 2.6 Million Euro funding!
>>>>> 
>>>>> On Thu, Jan 1, 2009 at 11:30 PM, Amin Mohammed-Coleman <
>>>>> aminmc@gmail.com
>>>>> 
>>>>>> wrote:
>>>>>> 
>>>>> 
>>>>> Hi
>>>>> 
>>>>>> 
>>>>>> Sorry I was using the StandardAnalyzer in this instance.
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 2 Jan 2009, at 00:55, Chris Lu wrote:
>>>>>> 
>>>>>> You need to let us know the analyzer you are using.
>>>>>> 
>>>>>> -- Chris Lu
>>>>>>> -------------------------
>>>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>>>> site: http://www.dbsight.net
>>>>>>> demo: http://search.dbsight.com
>>>>>>> Lucene Database Search in 3 minutes:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>>>> DBSight customer, a shopping comparison site, (anonymous per request)
>>>>>>> got
>>>>>>> 2.6 Million Euro funding!
>>>>>>> 
>>>>>>> On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <
>>>>>>> aminmc@gmail.com
>>>>>>> 
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Hi
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> I have created a RTFHandler which takes a RTF file and creates a
>>>>>>>>> lucene
>>>>>>>>> Document which is indexed.  The RTFHandler looks like something like
>>>>>>>>> this:
>>>>>>>>> 
>>>>>>>>> if (bodyText != null) {
>>>>>>>>>                 Document document = new Document();
>>>>>>>>>                 Field field = new
>>>>>>>>> Field(MetaDataEnum.BODY.getDescription(), bodyText.trim(),
>>>>>>>>> Field.Store.YES,
>>>>>>>>> Field.Index.ANALYZED);
>>>>>>>>>                 document.add(field);
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> I am using Java Built in RTF text extraction.  When I run my test to
>>>>>>>>> verify that the document contains text that I expect this works
>>>>>>>>> fine.
>>>>>>>>> I
>>>>>>>>> get
>>>>>>>>> the following when I print the document:
>>>>>>>>> 
>>>>>>>>> Document<stored/uncompressed,indexed,tokenized<body:This is a test
>>>>>>>>> rtf
>>>>>>>>> document that will be indexed.
>>>>>>>>> 
>>>>>>>>> Amin Mohammed-Coleman>
>>>>>>>>> stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>>>>>>>>> stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>>>>>>>>> stored/uncompressed,indexed<type:RTF_INDEXER>
>>>>>>>>> stored/uncompressed,indexed<summary:This is a >>
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> The problem is when I use the following to search I get no result:
>>>>>>>>> 
>>>>>>>>> MultiSearcher multiSearcher = new MultiSearcher(new Searchable[]
>>>>>>>>> {rtfIndexSearcher});
>>>>>>>>>                 Term t = new Term("body", "Amin");
>>>>>>>>>                 TermQuery termQuery = new TermQuery(t);
>>>>>>>>>                 TopDocs topDocs = multiSearcher.search(termQuery,
>>>>>>>>> 1);
>>>>>>>>>                 System.out.println(topDocs.totalHits);
>>>>>>>>>                 multiSearcher.close();
>>>>>>>>> 
>>>>>>>>> RftIndexSearcher is configured with the directory that holds rtf
>>>>>>>>> documents.  I have used Luke to look at the document and what I am
>>>>>>>>> finding
>>>>>>>>> in the overview tab is the following for the document:
>>>>>>>>> 
>>>>>>>>> 1       body    test
>>>>>>>>> 1       id      1234
>>>>>>>>> 1       name    rtfDocumentToIndex.rtf
>>>>>>>>> 1       path    rtfDocumentToIndex.rtf
>>>>>>>>> 1       summary This is a
>>>>>>>>> 1       type    RTF_INDEXER
>>>>>>>>> 1       body    rtf
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> However on the Document tab I am getting (in the body field):
>>>>>>>>> 
>>>>>>>>> This is a test rtf document that will be indexed.
>>>>>>>>> 
>>>>>>>>> Amin Mohammed-Coleman
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I would expect to get a hit using "Amin" or even "document".  I am
>>>>>>>>> not
>>>>>>>>> sure whether the
>>>>>>>>> line:
>>>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>>> 
>>>>>>>>> is incorrect as I am not too sure of the meaning of "Finds the top n
>>>>>>>>> hits
>>>>>>>>> for query." for search (Query query, int n) according to java docs.
>>>>>>>>> 
>>>>>>>>> I would be grateful if someone may be able to advise on what I may
>>>>>>>>> be
>>>>>>>>> doing wrong.  I am using Lucene 2.4.0
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Cheers
>>>>>>>>> Amin
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>> 
>>>> 
>>>> 
>> 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Search Problem

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Hi again!

I think I may have found the problem but I was wondering if you could  
verify:

I have the following for my indexer:

public void add(Document document) {
		IndexWriter indexWriter =  
IndexWriterFactory.createIndexWriter(getDirectory(), getAnalyzer());
		try {
			indexWriter.addDocument(document);
			LOGGER.debug("Added Document:" + document + " to index");
			commitAndOptimise(indexWriter);
		} catch (CorruptIndexException e) {
			throw new IllegalStateException(e);
		} catch (IOException e) {
			throw new IllegalStateException(e);
		}
	}

the commitAndOptimise(indexWriter) looks like this:

private void commitAndOptimise(IndexWriter indexWriter) throws  
CorruptIndexException,IOException {
		LOGGER.debug("Committing document and closing index writer");
		indexWriter.optimize();
		indexWriter.commit();
		indexWriter.close();
	}

It seems as though if I comment out optimize then the overview tab in  
Luke  for the rtf document looks like:

5	id	1234
3	body	document
3	body	body
1	body	test
1	body	rtf
1	name	rtfDocumentToIndex.rtf
1	body	new
1	path	rtfDocumentToIndex.rtf
1	summary	This is a
1	type	RTF_INDEXER
1	body	content


This is more what I expected although "Amin Mohammed-Coleman" hasn't  
been stored in the index.  Should I not be using  
indexWriter.optimize() ?

I tried using the search function in luke and got the following results:
body:test ---> returns result
body:document ---> no result
body:content ---> no result
body:rtf ----> returns result


Thanks again...sorry to be sending so many emails about this. I am in  
the process of designing and developing a prototype of a document and  
domain indexing/searching component and I would like to demo to the  
rest of my team.


Cheers
Amin



On 3 Jan 2009, at 01:23, Erick Erickson wrote:

> Well, your query results are consistent with what Luke is
> reporting. So I'd go back and test your assumptions. I
> suspect that you're not indexing what you think you are.
>
> For your test document, I'd just print out what you're indexing
> and the field it's going into. *for each field*. that is, every time  
> you
> do a document.add(<field of some kind>), print out that data. I'm
> pretty sure you'll find that you're not getting what you expect. For
> instance, the call to:
>
> MetaDataEnum.BODY.getDescription()
>
> may be returning some nonsense. Or
> bodyText.trim()
>
> isn't doing what you expect.
>
> Lucene is used by many folks, and errors of the magnitude you're
> experiencing would be seen by many people and the user list would
> be flooded with complaints if it were a Lucene issue at root. That
> leaves the code you wrote as the most likely culprit. So try a very  
> simple
> test case with lots of debugging println's. I'm pretty sure you'll
> find the underlying issue with some of your assumptions pretty  
> quickly.
>
> Sorry I can't be more specific, but we'd have to see all of your code
> and the test cases to do that....
>
> Best
> Erick
>
> On Fri, Jan 2, 2009 at 6:13 PM, Amin Mohammed-Coleman <aminmc@gmail.com 
> >wrote:
>
>> Hi Erick
>>
>> Thanks for your reply.
>>
>> I have used luke to inspect the document and I am some what  
>> confused.  For
>> example when I view the index using the overview tab of Luke I get  
>> the
>> following:
>>
>> 1       body    test
>> 1       id      1234
>> 1       name    rtfDocumentToIndex.rtf
>> 1       path    rtfDocumentToIndex.rtf
>> 1       summary This is a
>> 1       type    RTF_INDEXER
>> 1       body    rtf
>>
>>
>> However when I view the document in the Document tab I get the full  
>> text
>> that was extracted from the rft document (field:body) which is:
>>
>> This is a test rtf document that will be indexed.
>> Amin Mohammed-Coleman
>>
>> I am using the StandardAnaylzer therefore I wouldnt expect the words
>> document, indexed, Amin Mohammed-Coleman to be removed.
>>
>> I have referenced the Lucene In Action book and I can't see what I  
>> may be
>> doing wrong.  I would be happy to provide a testcase should it be  
>> required.
>> When adding the body field to the document I am doing:
>>
>>       Document document = new Document();
>>                       Field field = new
>> Field(FieldNameEnum.BODY.getDescription(), bodyText.trim(),  
>> Field.Store.YES,
>> Field.Index.ANALYZED);
>>                       document.add(field);
>>
>>
>>
>> When I run the search code the string "test" is the only word that  
>> returns
>> a result (TopDocs), whereas the others do not (e.g. "amin",  
>> "document",
>> "indexed").
>>
>> Thanks again for your help and advice.
>>
>>
>> Cheers
>> Amin
>>
>>
>>
>>
>> On 2 Jan 2009, at 21:20, Erick Erickson wrote:
>>
>> Casing is usually handled by the analyzer. Since you construct
>>> the term query programmatically, it doesn't go through
>>> any analyzers, thus is not converted into lower case for
>>> searching as was done automatically for you when you
>>> indexed using StandardAnalyzer.
>>>
>>> As for why you aren't getting hits, it's unclear to me. But
>>> what I'd do is get a copy of Luke and examine your index
>>> to see what's *really* there. This will often give you clues,
>>> usually pointing to some kind of analyzer behavior that you
>>> weren't expecting.
>>>
>>> Best
>>> Erick
>>>
>>> On Fri, Jan 2, 2009 at 6:39 AM, Amin Mohammed-Coleman <aminmc@gmail.com
>>>> wrote:
>>>
>>> Hi
>>>>
>>>> I have tried this and it doesn't work.  I don't understand why  
>>>> using
>>>> "amin"
>>>> instead of "Amin" would work, is it not case insensitive?
>>>>
>>>> I tried "test" for field "body" and this works.  Any other terms  
>>>> don't
>>>> work
>>>> for example:
>>>>
>>>> "document"
>>>> "indexed"
>>>>
>>>> these are tokens that were extracted when creating the lucene  
>>>> document.
>>>>
>>>>
>>>> Thanks for your reply.
>>>>
>>>> Cheers
>>>>
>>>> Amin
>>>>
>>>>
>>>> On 2 Jan 2009, at 10:36, Chris Lu wrote:
>>>>
>>>> Basically Lucene stores analyzed tokens, and looks up for the  
>>>> matches
>>>>
>>>>> based
>>>>> on the tokens.
>>>>> "Amin" after StandardAnalyzer is "amin", so you need to use new
>>>>> Term("body",
>>>>> "amin"), instead of new Term("body", "Amin"), to search.
>>>>>
>>>>> --
>>>>> Chris Lu
>>>>> -------------------------
>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>> site: http://www.dbsight.net
>>>>> demo: http://search.dbsight.com
>>>>> Lucene Database Search in 3 minutes:
>>>>>
>>>>>
>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>> DBSight customer, a shopping comparison site, (anonymous per  
>>>>> request)
>>>>> got
>>>>> 2.6 Million Euro funding!
>>>>>
>>>>> On Thu, Jan 1, 2009 at 11:30 PM, Amin Mohammed-Coleman <
>>>>> aminmc@gmail.com
>>>>>
>>>>>> wrote:
>>>>>>
>>>>>
>>>>> Hi
>>>>>
>>>>>>
>>>>>> Sorry I was using the StandardAnalyzer in this instance.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2 Jan 2009, at 00:55, Chris Lu wrote:
>>>>>>
>>>>>> You need to let us know the analyzer you are using.
>>>>>>
>>>>>> -- Chris Lu
>>>>>>> -------------------------
>>>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>>>> site: http://www.dbsight.net
>>>>>>> demo: http://search.dbsight.com
>>>>>>> Lucene Database Search in 3 minutes:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>>>> DBSight customer, a shopping comparison site, (anonymous per  
>>>>>>> request)
>>>>>>> got
>>>>>>> 2.6 Million Euro funding!
>>>>>>>
>>>>>>> On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <
>>>>>>> aminmc@gmail.com
>>>>>>>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi
>>>>>>>>
>>>>>>>>
>>>>>>>>> I have created a RTFHandler which takes a RTF file and  
>>>>>>>>> creates a
>>>>>>>>> lucene
>>>>>>>>> Document which is indexed.  The RTFHandler looks like  
>>>>>>>>> something like
>>>>>>>>> this:
>>>>>>>>>
>>>>>>>>> if (bodyText != null) {
>>>>>>>>>                 Document document = new Document();
>>>>>>>>>                 Field field = new
>>>>>>>>> Field(MetaDataEnum.BODY.getDescription(), bodyText.trim(),
>>>>>>>>> Field.Store.YES,
>>>>>>>>> Field.Index.ANALYZED);
>>>>>>>>>                 document.add(field);
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> I am using Java Built in RTF text extraction.  When I run my  
>>>>>>>>> test to
>>>>>>>>> verify that the document contains text that I expect this  
>>>>>>>>> works
>>>>>>>>> fine.
>>>>>>>>> I
>>>>>>>>> get
>>>>>>>>> the following when I print the document:
>>>>>>>>>
>>>>>>>>> Document<stored/uncompressed,indexed,tokenized<body:This is  
>>>>>>>>> a test
>>>>>>>>> rtf
>>>>>>>>> document that will be indexed.
>>>>>>>>>
>>>>>>>>> Amin Mohammed-Coleman>
>>>>>>>>> stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>>>>>>>>> stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>>>>>>>>> stored/uncompressed,indexed<type:RTF_INDEXER>
>>>>>>>>> stored/uncompressed,indexed<summary:This is a >>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The problem is when I use the following to search I get no  
>>>>>>>>> result:
>>>>>>>>>
>>>>>>>>> MultiSearcher multiSearcher = new MultiSearcher(new  
>>>>>>>>> Searchable[]
>>>>>>>>> {rtfIndexSearcher});
>>>>>>>>>                 Term t = new Term("body", "Amin");
>>>>>>>>>                 TermQuery termQuery = new TermQuery(t);
>>>>>>>>>                 TopDocs topDocs =  
>>>>>>>>> multiSearcher.search(termQuery,
>>>>>>>>> 1);
>>>>>>>>>                 System.out.println(topDocs.totalHits);
>>>>>>>>>                 multiSearcher.close();
>>>>>>>>>
>>>>>>>>> RftIndexSearcher is configured with the directory that holds  
>>>>>>>>> rtf
>>>>>>>>> documents.  I have used Luke to look at the document and  
>>>>>>>>> what I am
>>>>>>>>> finding
>>>>>>>>> in the overview tab is the following for the document:
>>>>>>>>>
>>>>>>>>> 1       body    test
>>>>>>>>> 1       id      1234
>>>>>>>>> 1       name    rtfDocumentToIndex.rtf
>>>>>>>>> 1       path    rtfDocumentToIndex.rtf
>>>>>>>>> 1       summary This is a
>>>>>>>>> 1       type    RTF_INDEXER
>>>>>>>>> 1       body    rtf
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> However on the Document tab I am getting (in the body field):
>>>>>>>>>
>>>>>>>>> This is a test rtf document that will be indexed.
>>>>>>>>>
>>>>>>>>> Amin Mohammed-Coleman
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I would expect to get a hit using "Amin" or even  
>>>>>>>>> "document".  I am
>>>>>>>>> not
>>>>>>>>> sure whether the
>>>>>>>>> line:
>>>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>>>
>>>>>>>>> is incorrect as I am not too sure of the meaning of "Finds  
>>>>>>>>> the top n
>>>>>>>>> hits
>>>>>>>>> for query." for search (Query query, int n) according to  
>>>>>>>>> java docs.
>>>>>>>>>
>>>>>>>>> I would be grateful if someone may be able to advise on what  
>>>>>>>>> I may
>>>>>>>>> be
>>>>>>>>> doing wrong.  I am using Lucene 2.4.0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Amin
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>>

Re: Search Problem

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Hi Erick

I kind of thought it might be something related to the way I am  
creating the Lucene document , however my RtfHandler class test case  
all pass when I assert the following:

@Test
	public void testCanCreateLuceneDocumentForRTFDocument() throws  
Exception {
		JavaBuiltInRTFHandler builtInRTFHandler = new JavaBuiltInRTFHandler();
		Document document = builtInRTFHandler.getDocument(rtfFile);	
		assertNotNull(document);
		String value = document.get(FieldNameEnum.BODY.getDescription());
		assertNotNull(value);
		assertNotSame("", value);
		assertTrue(value.contains("Amin Mohammed-Coleman"));
		assertTrue(value.contains("This is a test rtf document that will be  
indexed."));
		String path = document.get(FieldNameEnum.PATH.getDescription());
		assertNotNull(path);
		assertTrue(path.contains(".rtf"));
		String fileName = document.get(FieldNameEnum.NAME.getDescription());
		assertNotNull(fileName);
		assertEquals(RTF_FILE_NAME, fileName);
		assertEquals(WorkItem.IndexerType.RTF_INDEXER.name(),  
document.get(FieldNameEnum.TYPE.getDescription()));
	
	}


I am printed all the content after adding the body field to the  
document. I will do some further debugging...but I can print the full  
body as I get the following:

"This is a test rtf document that will be indexed.

Amin Mohammed-Coleman"

I have attached the handler class, test class and test document, maybe  
there is something I am doing wrong (not sure what!). This is also  
happening for pdf documents as well!


Thanks again for your help and patience.





On 3 Jan 2009, at 01:23, Erick Erickson wrote:

> Well, your query results are consistent with what Luke is
> reporting. So I'd go back and test your assumptions. I
> suspect that you're not indexing what you think you are.
>
> For your test document, I'd just print out what you're indexing
> and the field it's going into. *for each field*. that is, every time  
> you
> do a document.add(<field of some kind>), print out that data. I'm
> pretty sure you'll find that you're not getting what you expect. For
> instance, the call to:
>
> MetaDataEnum.BODY.getDescription()
>
> may be returning some nonsense. Or
> bodyText.trim()
>
> isn't doing what you expect.
>
> Lucene is used by many folks, and errors of the magnitude you're
> experiencing would be seen by many people and the user list would
> be flooded with complaints if it were a Lucene issue at root. That
> leaves the code you wrote as the most likely culprit. So try a very  
> simple
> test case with lots of debugging println's. I'm pretty sure you'll
> find the underlying issue with some of your assumptions pretty  
> quickly.
>
> Sorry I can't be more specific, but we'd have to see all of your code
> and the test cases to do that....
>
> Best
> Erick
>
> On Fri, Jan 2, 2009 at 6:13 PM, Amin Mohammed-Coleman <aminmc@gmail.com 
> >wrote:
>
>> Hi Erick
>>
>> Thanks for your reply.
>>
>> I have used luke to inspect the document and I am some what  
>> confused.  For
>> example when I view the index using the overview tab of Luke I get  
>> the
>> following:
>>
>> 1       body    test
>> 1       id      1234
>> 1       name    rtfDocumentToIndex.rtf
>> 1       path    rtfDocumentToIndex.rtf
>> 1       summary This is a
>> 1       type    RTF_INDEXER
>> 1       body    rtf
>>
>>
>> However when I view the document in the Document tab I get the full  
>> text
>> that was extracted from the rft document (field:body) which is:
>>
>> This is a test rtf document that will be indexed.
>> Amin Mohammed-Coleman
>>
>> I am using the StandardAnaylzer therefore I wouldnt expect the words
>> document, indexed, Amin Mohammed-Coleman to be removed.
>>
>> I have referenced the Lucene In Action book and I can't see what I  
>> may be
>> doing wrong.  I would be happy to provide a testcase should it be  
>> required.
>> When adding the body field to the document I am doing:
>>
>>       Document document = new Document();
>>                       Field field = new
>> Field(FieldNameEnum.BODY.getDescription(), bodyText.trim(),  
>> Field.Store.YES,
>> Field.Index.ANALYZED);
>>                       document.add(field);
>>
>>
>>
>> When I run the search code the string "test" is the only word that  
>> returns
>> a result (TopDocs), whereas the others do not (e.g. "amin",  
>> "document",
>> "indexed").
>>
>> Thanks again for your help and advice.
>>
>>
>> Cheers
>> Amin
>>
>>
>>
>>
>> On 2 Jan 2009, at 21:20, Erick Erickson wrote:
>>
>> Casing is usually handled by the analyzer. Since you construct
>>> the term query programmatically, it doesn't go through
>>> any analyzers, thus is not converted into lower case for
>>> searching as was done automatically for you when you
>>> indexed using StandardAnalyzer.
>>>
>>> As for why you aren't getting hits, it's unclear to me. But
>>> what I'd do is get a copy of Luke and examine your index
>>> to see what's *really* there. This will often give you clues,
>>> usually pointing to some kind of analyzer behavior that you
>>> weren't expecting.
>>>
>>> Best
>>> Erick
>>>
>>> On Fri, Jan 2, 2009 at 6:39 AM, Amin Mohammed-Coleman <aminmc@gmail.com
>>>> wrote:
>>>
>>> Hi
>>>>
>>>> I have tried this and it doesn't work.  I don't understand why  
>>>> using
>>>> "amin"
>>>> instead of "Amin" would work, is it not case insensitive?
>>>>
>>>> I tried "test" for field "body" and this works.  Any other terms  
>>>> don't
>>>> work
>>>> for example:
>>>>
>>>> "document"
>>>> "indexed"
>>>>
>>>> these are tokens that were extracted when creating the lucene  
>>>> document.
>>>>
>>>>
>>>> Thanks for your reply.
>>>>
>>>> Cheers
>>>>
>>>> Amin
>>>>
>>>>
>>>> On 2 Jan 2009, at 10:36, Chris Lu wrote:
>>>>
>>>> Basically Lucene stores analyzed tokens, and looks up for the  
>>>> matches
>>>>
>>>>> based
>>>>> on the tokens.
>>>>> "Amin" after StandardAnalyzer is "amin", so you need to use new
>>>>> Term("body",
>>>>> "amin"), instead of new Term("body", "Amin"), to search.
>>>>>
>>>>> --
>>>>> Chris Lu
>>>>> -------------------------
>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>> site: http://www.dbsight.net
>>>>> demo: http://search.dbsight.com
>>>>> Lucene Database Search in 3 minutes:
>>>>>
>>>>>
>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>> DBSight customer, a shopping comparison site, (anonymous per  
>>>>> request)
>>>>> got
>>>>> 2.6 Million Euro funding!
>>>>>
>>>>> On Thu, Jan 1, 2009 at 11:30 PM, Amin Mohammed-Coleman <
>>>>> aminmc@gmail.com
>>>>>
>>>>>> wrote:
>>>>>>
>>>>>
>>>>> Hi
>>>>>
>>>>>>
>>>>>> Sorry I was using the StandardAnalyzer in this instance.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2 Jan 2009, at 00:55, Chris Lu wrote:
>>>>>>
>>>>>> You need to let us know the analyzer you are using.
>>>>>>
>>>>>> -- Chris Lu
>>>>>>> -------------------------
>>>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>>>> site: http://www.dbsight.net
>>>>>>> demo: http://search.dbsight.com
>>>>>>> Lucene Database Search in 3 minutes:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>>>> DBSight customer, a shopping comparison site, (anonymous per  
>>>>>>> request)
>>>>>>> got
>>>>>>> 2.6 Million Euro funding!
>>>>>>>
>>>>>>> On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <
>>>>>>> aminmc@gmail.com
>>>>>>>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi
>>>>>>>>
>>>>>>>>
>>>>>>>>> I have created a RTFHandler which takes a RTF file and  
>>>>>>>>> creates a
>>>>>>>>> lucene
>>>>>>>>> Document which is indexed.  The RTFHandler looks like  
>>>>>>>>> something like
>>>>>>>>> this:
>>>>>>>>>
>>>>>>>>> if (bodyText != null) {
>>>>>>>>>                 Document document = new Document();
>>>>>>>>>                 Field field = new
>>>>>>>>> Field(MetaDataEnum.BODY.getDescription(), bodyText.trim(),
>>>>>>>>> Field.Store.YES,
>>>>>>>>> Field.Index.ANALYZED);
>>>>>>>>>                 document.add(field);
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> I am using Java Built in RTF text extraction.  When I run my  
>>>>>>>>> test to
>>>>>>>>> verify that the document contains text that I expect this  
>>>>>>>>> works
>>>>>>>>> fine.
>>>>>>>>> I
>>>>>>>>> get
>>>>>>>>> the following when I print the document:
>>>>>>>>>
>>>>>>>>> Document<stored/uncompressed,indexed,tokenized<body:This is  
>>>>>>>>> a test
>>>>>>>>> rtf
>>>>>>>>> document that will be indexed.
>>>>>>>>>
>>>>>>>>> Amin Mohammed-Coleman>
>>>>>>>>> stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>>>>>>>>> stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>>>>>>>>> stored/uncompressed,indexed<type:RTF_INDEXER>
>>>>>>>>> stored/uncompressed,indexed<summary:This is a >>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The problem is when I use the following to search I get no  
>>>>>>>>> result:
>>>>>>>>>
>>>>>>>>> MultiSearcher multiSearcher = new MultiSearcher(new  
>>>>>>>>> Searchable[]
>>>>>>>>> {rtfIndexSearcher});
>>>>>>>>>                 Term t = new Term("body", "Amin");
>>>>>>>>>                 TermQuery termQuery = new TermQuery(t);
>>>>>>>>>                 TopDocs topDocs =  
>>>>>>>>> multiSearcher.search(termQuery,
>>>>>>>>> 1);
>>>>>>>>>                 System.out.println(topDocs.totalHits);
>>>>>>>>>                 multiSearcher.close();
>>>>>>>>>
>>>>>>>>> RftIndexSearcher is configured with the directory that holds  
>>>>>>>>> rtf
>>>>>>>>> documents.  I have used Luke to look at the document and  
>>>>>>>>> what I am
>>>>>>>>> finding
>>>>>>>>> in the overview tab is the following for the document:
>>>>>>>>>
>>>>>>>>> 1       body    test
>>>>>>>>> 1       id      1234
>>>>>>>>> 1       name    rtfDocumentToIndex.rtf
>>>>>>>>> 1       path    rtfDocumentToIndex.rtf
>>>>>>>>> 1       summary This is a
>>>>>>>>> 1       type    RTF_INDEXER
>>>>>>>>> 1       body    rtf
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> However on the Document tab I am getting (in the body field):
>>>>>>>>>
>>>>>>>>> This is a test rtf document that will be indexed.
>>>>>>>>>
>>>>>>>>> Amin Mohammed-Coleman
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I would expect to get a hit using "Amin" or even  
>>>>>>>>> "document".  I am
>>>>>>>>> not
>>>>>>>>> sure whether the
>>>>>>>>> line:
>>>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>>>
>>>>>>>>> is incorrect as I am not too sure of the meaning of "Finds  
>>>>>>>>> the top n
>>>>>>>>> hits
>>>>>>>>> for query." for search (Query query, int n) according to  
>>>>>>>>> java docs.
>>>>>>>>>
>>>>>>>>> I would be grateful if someone may be able to advise on what  
>>>>>>>>> I may
>>>>>>>>> be
>>>>>>>>> doing wrong.  I am using Lucene 2.4.0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Amin
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>>

Re: Search Problem

Posted by Erick Erickson <er...@gmail.com>.

Well, your query results are consistent with what Luke is
reporting. So I'd go back and test your assumptions. I
suspect that you're not indexing what you think you are.

For your test document, I'd just print out what you're indexing
and the field it's going into. *for each field*. that is, every time you
do a document.add(<field of some kind>), print out that data. I'm
pretty sure you'll find that you're not getting what you expect. For
instance, the call to:

MetaDataEnum.BODY.getDescription()

may be returning some nonsense. Or
bodyText.trim()

isn't doing what you expect.

Lucene is used by many folks, and errors of the magnitude you're
experiencing would be seen by many people and the user list would
be flooded with complaints if it were a Lucene issue at root. That
leaves the code you wrote as the most likely culprit. So try a very simple
test case with lots of debugging println's. I'm pretty sure you'll
find the underlying issue with some of your assumptions pretty quickly.

Sorry I can't be more specific, but we'd have to see all of your code
and the test cases to do that....

Best
Erick

On Fri, Jan 2, 2009 at 6:13 PM, Amin Mohammed-Coleman <am...@gmail.com>wrote:

> Hi Erick
>
> Thanks for your reply.
>
> I have used luke to inspect the document and I am some what confused.  For
> example when I view the index using the overview tab of Luke I get the
> following:
>
> 1       body    test
> 1       id      1234
> 1       name    rtfDocumentToIndex.rtf
> 1       path    rtfDocumentToIndex.rtf
> 1       summary This is a
> 1       type    RTF_INDEXER
> 1       body    rtf
>
>
> However when I view the document in the Document tab I get the full text
> that was extracted from the rft document (field:body) which is:
>
> This is a test rtf document that will be indexed.
> Amin Mohammed-Coleman
>
> I am using the StandardAnaylzer therefore I wouldnt expect the words
> document, indexed, Amin Mohammed-Coleman to be removed.
>
> I have referenced the Lucene In Action book and I can't see what I may be
> doing wrong.  I would be happy to provide a testcase should it be required.
>  When adding the body field to the document I am doing:
>
>        Document document = new Document();
>                        Field field = new
> Field(FieldNameEnum.BODY.getDescription(), bodyText.trim(), Field.Store.YES,
> Field.Index.ANALYZED);
>                        document.add(field);
>
>
>
> When I run the search code the string "test" is the only word that returns
> a result (TopDocs), whereas the others do not (e.g. "amin", "document",
> "indexed").
>
> Thanks again for your help and advice.
>
>
> Cheers
> Amin
>
>
>
>
> On 2 Jan 2009, at 21:20, Erick Erickson wrote:
>
>  Casing is usually handled by the analyzer. Since you construct
>> the term query programmatically, it doesn't go through
>> any analyzers, thus is not converted into lower case for
>> searching as was done automatically for you when you
>> indexed using StandardAnalyzer.
>>
>> As for why you aren't getting hits, it's unclear to me. But
>> what I'd do is get a copy of Luke and examine your index
>> to see what's *really* there. This will often give you clues,
>> usually pointing to some kind of analyzer behavior that you
>> weren't expecting.
>>
>> Best
>> Erick
>>
>> On Fri, Jan 2, 2009 at 6:39 AM, Amin Mohammed-Coleman <aminmc@gmail.com
>> >wrote:
>>
>>  Hi
>>>
>>> I have tried this and it doesn't work.  I don't understand why using
>>> "amin"
>>> instead of "Amin" would work, is it not case insensitive?
>>>
>>> I tried "test" for field "body" and this works.  Any other terms don't
>>> work
>>> for example:
>>>
>>> "document"
>>> "indexed"
>>>
>>> these are tokens that were extracted when creating the lucene document.
>>>
>>>
>>> Thanks for your reply.
>>>
>>> Cheers
>>>
>>> Amin
>>>
>>>
>>> On 2 Jan 2009, at 10:36, Chris Lu wrote:
>>>
>>> Basically Lucene stores analyzed tokens, and looks up for the matches
>>>
>>>> based
>>>> on the tokens.
>>>> "Amin" after StandardAnalyzer is "amin", so you need to use new
>>>> Term("body",
>>>> "amin"), instead of new Term("body", "Amin"), to search.
>>>>
>>>> --
>>>> Chris Lu
>>>> -------------------------
>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>> site: http://www.dbsight.net
>>>> demo: http://search.dbsight.com
>>>> Lucene Database Search in 3 minutes:
>>>>
>>>>
>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>> DBSight customer, a shopping comparison site, (anonymous per request)
>>>> got
>>>> 2.6 Million Euro funding!
>>>>
>>>> On Thu, Jan 1, 2009 at 11:30 PM, Amin Mohammed-Coleman <
>>>> aminmc@gmail.com
>>>>
>>>>> wrote:
>>>>>
>>>>
>>>> Hi
>>>>
>>>>>
>>>>> Sorry I was using the StandardAnalyzer in this instance.
>>>>>
>>>>> Cheers
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 2 Jan 2009, at 00:55, Chris Lu wrote:
>>>>>
>>>>> You need to let us know the analyzer you are using.
>>>>>
>>>>>  -- Chris Lu
>>>>>> -------------------------
>>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>>> site: http://www.dbsight.net
>>>>>> demo: http://search.dbsight.com
>>>>>> Lucene Database Search in 3 minutes:
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>>> DBSight customer, a shopping comparison site, (anonymous per request)
>>>>>> got
>>>>>> 2.6 Million Euro funding!
>>>>>>
>>>>>> On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <
>>>>>> aminmc@gmail.com
>>>>>>
>>>>>>  wrote:
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>  Hi
>>>>>>>
>>>>>>>
>>>>>>>> I have created a RTFHandler which takes a RTF file and creates a
>>>>>>>> lucene
>>>>>>>> Document which is indexed.  The RTFHandler looks like something like
>>>>>>>> this:
>>>>>>>>
>>>>>>>> if (bodyText != null) {
>>>>>>>>                  Document document = new Document();
>>>>>>>>                  Field field = new
>>>>>>>> Field(MetaDataEnum.BODY.getDescription(), bodyText.trim(),
>>>>>>>> Field.Store.YES,
>>>>>>>> Field.Index.ANALYZED);
>>>>>>>>                  document.add(field);
>>>>>>>>
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>> I am using Java Built in RTF text extraction.  When I run my test to
>>>>>>>> verify that the document contains text that I expect this works
>>>>>>>> fine.
>>>>>>>> I
>>>>>>>> get
>>>>>>>> the following when I print the document:
>>>>>>>>
>>>>>>>> Document<stored/uncompressed,indexed,tokenized<body:This is a test
>>>>>>>> rtf
>>>>>>>> document that will be indexed.
>>>>>>>>
>>>>>>>> Amin Mohammed-Coleman>
>>>>>>>> stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>>>>>>>> stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>>>>>>>> stored/uncompressed,indexed<type:RTF_INDEXER>
>>>>>>>> stored/uncompressed,indexed<summary:This is a >>
>>>>>>>>
>>>>>>>>
>>>>>>>> The problem is when I use the following to search I get no result:
>>>>>>>>
>>>>>>>>  MultiSearcher multiSearcher = new MultiSearcher(new Searchable[]
>>>>>>>> {rtfIndexSearcher});
>>>>>>>>                  Term t = new Term("body", "Amin");
>>>>>>>>                  TermQuery termQuery = new TermQuery(t);
>>>>>>>>                  TopDocs topDocs = multiSearcher.search(termQuery,
>>>>>>>> 1);
>>>>>>>>                  System.out.println(topDocs.totalHits);
>>>>>>>>                  multiSearcher.close();
>>>>>>>>
>>>>>>>> RftIndexSearcher is configured with the directory that holds rtf
>>>>>>>> documents.  I have used Luke to look at the document and what I am
>>>>>>>> finding
>>>>>>>> in the overview tab is the following for the document:
>>>>>>>>
>>>>>>>> 1       body    test
>>>>>>>> 1       id      1234
>>>>>>>> 1       name    rtfDocumentToIndex.rtf
>>>>>>>> 1       path    rtfDocumentToIndex.rtf
>>>>>>>> 1       summary This is a
>>>>>>>> 1       type    RTF_INDEXER
>>>>>>>> 1       body    rtf
>>>>>>>>
>>>>>>>>
>>>>>>>> However on the Document tab I am getting (in the body field):
>>>>>>>>
>>>>>>>> This is a test rtf document that will be indexed.
>>>>>>>>
>>>>>>>> Amin Mohammed-Coleman
>>>>>>>>
>>>>>>>>
>>>>>>>> I would expect to get a hit using "Amin" or even "document".  I am
>>>>>>>> not
>>>>>>>> sure whether the
>>>>>>>> line:
>>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>>
>>>>>>>> is incorrect as I am not too sure of the meaning of "Finds the top n
>>>>>>>> hits
>>>>>>>> for query." for search (Query query, int n) according to java docs.
>>>>>>>>
>>>>>>>> I would be grateful if someone may be able to advise on what I may
>>>>>>>> be
>>>>>>>> doing wrong.  I am using Lucene 2.4.0
>>>>>>>>
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>> Amin
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>>
>>>>>
>>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>

Re: Search Problem

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Hi Erick

Thanks for your reply.

I have used luke to inspect the document and I am some what confused.   
For example when I view the index using the overview tab of Luke I get  
the following:

1	body	test
1	id	1234
1	name	rtfDocumentToIndex.rtf
1	path	rtfDocumentToIndex.rtf
1	summary	This is a
1	type	RTF_INDEXER
1	body	rtf


However when I view the document in the Document tab I get the full  
text that was extracted from the rft document (field:body) which is:

This is a test rtf document that will be indexed.
Amin Mohammed-Coleman

I am using the StandardAnaylzer therefore I wouldnt expect the words  
document, indexed, Amin Mohammed-Coleman to be removed.

I have referenced the Lucene In Action book and I can't see what I may  
be doing wrong.  I would be happy to provide a testcase should it be  
required.  When adding the body field to the document I am doing:

	Document document = new Document();
			Field field = new Field(FieldNameEnum.BODY.getDescription(),  
bodyText.trim(), Field.Store.YES, Field.Index.ANALYZED);
			document.add(field);
		


When I run the search code the string "test" is the only word that  
returns a result (TopDocs), whereas the others do not (e.g. "amin",  
"document", "indexed").

Thanks again for your help and advice.


Cheers
Amin



On 2 Jan 2009, at 21:20, Erick Erickson wrote:

> Casing is usually handled by the analyzer. Since you construct
> the term query programmatically, it doesn't go through
> any analyzers, thus is not converted into lower case for
> searching as was done automatically for you when you
> indexed using StandardAnalyzer.
>
> As for why you aren't getting hits, it's unclear to me. But
> what I'd do is get a copy of Luke and examine your index
> to see what's *really* there. This will often give you clues,
> usually pointing to some kind of analyzer behavior that you
> weren't expecting.
>
> Best
> Erick
>
> On Fri, Jan 2, 2009 at 6:39 AM, Amin Mohammed-Coleman <aminmc@gmail.com 
> >wrote:
>
>> Hi
>>
>> I have tried this and it doesn't work.  I don't understand why  
>> using "amin"
>> instead of "Amin" would work, is it not case insensitive?
>>
>> I tried "test" for field "body" and this works.  Any other terms  
>> don't work
>> for example:
>>
>> "document"
>> "indexed"
>>
>> these are tokens that were extracted when creating the lucene  
>> document.
>>
>>
>> Thanks for your reply.
>>
>> Cheers
>>
>> Amin
>>
>>
>> On 2 Jan 2009, at 10:36, Chris Lu wrote:
>>
>> Basically Lucene stores analyzed tokens, and looks up for the matches
>>> based
>>> on the tokens.
>>> "Amin" after StandardAnalyzer is "amin", so you need to use new
>>> Term("body",
>>> "amin"), instead of new Term("body", "Amin"), to search.
>>>
>>> --
>>> Chris Lu
>>> -------------------------
>>> Instant Scalable Full-Text Search On Any Database/Application
>>> site: http://www.dbsight.net
>>> demo: http://search.dbsight.com
>>> Lucene Database Search in 3 minutes:
>>>
>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>> DBSight customer, a shopping comparison site, (anonymous per  
>>> request) got
>>> 2.6 Million Euro funding!
>>>
>>> On Thu, Jan 1, 2009 at 11:30 PM, Amin Mohammed-Coleman <aminmc@gmail.com
>>>> wrote:
>>>
>>> Hi
>>>>
>>>> Sorry I was using the StandardAnalyzer in this instance.
>>>>
>>>> Cheers
>>>>
>>>>
>>>>
>>>>
>>>> On 2 Jan 2009, at 00:55, Chris Lu wrote:
>>>>
>>>> You need to let us know the analyzer you are using.
>>>>
>>>>> -- Chris Lu
>>>>> -------------------------
>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>> site: http://www.dbsight.net
>>>>> demo: http://search.dbsight.com
>>>>> Lucene Database Search in 3 minutes:
>>>>>
>>>>>
>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>> DBSight customer, a shopping comparison site, (anonymous per  
>>>>> request)
>>>>> got
>>>>> 2.6 Million Euro funding!
>>>>>
>>>>> On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <aminmc@gmail.com
>>>>>
>>>>>> wrote:
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Hi
>>>>>>
>>>>>>>
>>>>>>> I have created a RTFHandler which takes a RTF file and creates a
>>>>>>> lucene
>>>>>>> Document which is indexed.  The RTFHandler looks like  
>>>>>>> something like
>>>>>>> this:
>>>>>>>
>>>>>>> if (bodyText != null) {
>>>>>>>                   Document document = new Document();
>>>>>>>                   Field field = new
>>>>>>> Field(MetaDataEnum.BODY.getDescription(), bodyText.trim(),
>>>>>>> Field.Store.YES,
>>>>>>> Field.Index.ANALYZED);
>>>>>>>                   document.add(field);
>>>>>>>
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>> I am using Java Built in RTF text extraction.  When I run my  
>>>>>>> test to
>>>>>>> verify that the document contains text that I expect this  
>>>>>>> works fine.
>>>>>>> I
>>>>>>> get
>>>>>>> the following when I print the document:
>>>>>>>
>>>>>>> Document<stored/uncompressed,indexed,tokenized<body:This is a  
>>>>>>> test rtf
>>>>>>> document that will be indexed.
>>>>>>>
>>>>>>> Amin Mohammed-Coleman>
>>>>>>> stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>>>>>>> stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>>>>>>> stored/uncompressed,indexed<type:RTF_INDEXER>
>>>>>>> stored/uncompressed,indexed<summary:This is a >>
>>>>>>>
>>>>>>>
>>>>>>> The problem is when I use the following to search I get no  
>>>>>>> result:
>>>>>>>
>>>>>>>   MultiSearcher multiSearcher = new MultiSearcher(new  
>>>>>>> Searchable[]
>>>>>>> {rtfIndexSearcher});
>>>>>>>                   Term t = new Term("body", "Amin");
>>>>>>>                   TermQuery termQuery = new TermQuery(t);
>>>>>>>                   TopDocs topDocs =  
>>>>>>> multiSearcher.search(termQuery,
>>>>>>> 1);
>>>>>>>                   System.out.println(topDocs.totalHits);
>>>>>>>                   multiSearcher.close();
>>>>>>>
>>>>>>> RftIndexSearcher is configured with the directory that holds rtf
>>>>>>> documents.  I have used Luke to look at the document and what  
>>>>>>> I am
>>>>>>> finding
>>>>>>> in the overview tab is the following for the document:
>>>>>>>
>>>>>>> 1       body    test
>>>>>>> 1       id      1234
>>>>>>> 1       name    rtfDocumentToIndex.rtf
>>>>>>> 1       path    rtfDocumentToIndex.rtf
>>>>>>> 1       summary This is a
>>>>>>> 1       type    RTF_INDEXER
>>>>>>> 1       body    rtf
>>>>>>>
>>>>>>>
>>>>>>> However on the Document tab I am getting (in the body field):
>>>>>>>
>>>>>>> This is a test rtf document that will be indexed.
>>>>>>>
>>>>>>> Amin Mohammed-Coleman
>>>>>>>
>>>>>>>
>>>>>>> I would expect to get a hit using "Amin" or even "document".   
>>>>>>> I am not
>>>>>>> sure whether the
>>>>>>> line:
>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>
>>>>>>> is incorrect as I am not too sure of the meaning of "Finds the  
>>>>>>> top n
>>>>>>> hits
>>>>>>> for query." for search (Query query, int n) according to java  
>>>>>>> docs.
>>>>>>>
>>>>>>> I would be grateful if someone may be able to advise on what I  
>>>>>>> may be
>>>>>>> doing wrong.  I am using Lucene 2.4.0
>>>>>>>
>>>>>>>
>>>>>>> Cheers
>>>>>>> Amin
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>

Re: Search Problem

Posted by Erick Erickson <er...@gmail.com>.

Casing is usually handled by the analyzer. Since you construct
the term query programmatically, it doesn't go through
any analyzers, thus is not converted into lower case for
searching as was done automatically for you when you
indexed using StandardAnalyzer.

As for why you aren't getting hits, it's unclear to me. But
what I'd do is get a copy of Luke and examine your index
to see what's *really* there. This will often give you clues,
usually pointing to some kind of analyzer behavior that you
weren't expecting.

Best
Erick

On Fri, Jan 2, 2009 at 6:39 AM, Amin Mohammed-Coleman <am...@gmail.com>wrote:

> Hi
>
> I have tried this and it doesn't work.  I don't understand why using "amin"
> instead of "Amin" would work, is it not case insensitive?
>
> I tried "test" for field "body" and this works.  Any other terms don't work
> for example:
>
> "document"
> "indexed"
>
> these are tokens that were extracted when creating the lucene document.
>
>
> Thanks for your reply.
>
> Cheers
>
> Amin
>
>
> On 2 Jan 2009, at 10:36, Chris Lu wrote:
>
>  Basically Lucene stores analyzed tokens, and looks up for the matches
>> based
>> on the tokens.
>> "Amin" after StandardAnalyzer is "amin", so you need to use new
>> Term("body",
>> "amin"), instead of new Term("body", "Amin"), to search.
>>
>> --
>> Chris Lu
>> -------------------------
>> Instant Scalable Full-Text Search On Any Database/Application
>> site: http://www.dbsight.net
>> demo: http://search.dbsight.com
>> Lucene Database Search in 3 minutes:
>>
>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>> DBSight customer, a shopping comparison site, (anonymous per request) got
>> 2.6 Million Euro funding!
>>
>> On Thu, Jan 1, 2009 at 11:30 PM, Amin Mohammed-Coleman <aminmc@gmail.com
>> >wrote:
>>
>>  Hi
>>>
>>> Sorry I was using the StandardAnalyzer in this instance.
>>>
>>> Cheers
>>>
>>>
>>>
>>>
>>> On 2 Jan 2009, at 00:55, Chris Lu wrote:
>>>
>>> You need to let us know the analyzer you are using.
>>>
>>>> -- Chris Lu
>>>> -------------------------
>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>> site: http://www.dbsight.net
>>>> demo: http://search.dbsight.com
>>>> Lucene Database Search in 3 minutes:
>>>>
>>>>
>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>> DBSight customer, a shopping comparison site, (anonymous per request)
>>>> got
>>>> 2.6 Million Euro funding!
>>>>
>>>> On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <aminmc@gmail.com
>>>>
>>>>> wrote:
>>>>>
>>>>
>>>>
>>>>
>>>>> Hi
>>>>>
>>>>>>
>>>>>> I have created a RTFHandler which takes a RTF file and creates a
>>>>>> lucene
>>>>>> Document which is indexed.  The RTFHandler looks like something like
>>>>>> this:
>>>>>>
>>>>>> if (bodyText != null) {
>>>>>>                    Document document = new Document();
>>>>>>                    Field field = new
>>>>>> Field(MetaDataEnum.BODY.getDescription(), bodyText.trim(),
>>>>>> Field.Store.YES,
>>>>>> Field.Index.ANALYZED);
>>>>>>                    document.add(field);
>>>>>>
>>>>>>
>>>>>> }
>>>>>>
>>>>>> I am using Java Built in RTF text extraction.  When I run my test to
>>>>>> verify that the document contains text that I expect this works fine.
>>>>>>  I
>>>>>> get
>>>>>> the following when I print the document:
>>>>>>
>>>>>> Document<stored/uncompressed,indexed,tokenized<body:This is a test rtf
>>>>>> document that will be indexed.
>>>>>>
>>>>>> Amin Mohammed-Coleman>
>>>>>> stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>>>>>> stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>>>>>> stored/uncompressed,indexed<type:RTF_INDEXER>
>>>>>> stored/uncompressed,indexed<summary:This is a >>
>>>>>>
>>>>>>
>>>>>> The problem is when I use the following to search I get no result:
>>>>>>
>>>>>>    MultiSearcher multiSearcher = new MultiSearcher(new Searchable[]
>>>>>> {rtfIndexSearcher});
>>>>>>                    Term t = new Term("body", "Amin");
>>>>>>                    TermQuery termQuery = new TermQuery(t);
>>>>>>                    TopDocs topDocs = multiSearcher.search(termQuery,
>>>>>> 1);
>>>>>>                    System.out.println(topDocs.totalHits);
>>>>>>                    multiSearcher.close();
>>>>>>
>>>>>> RftIndexSearcher is configured with the directory that holds rtf
>>>>>> documents.  I have used Luke to look at the document and what I am
>>>>>> finding
>>>>>> in the overview tab is the following for the document:
>>>>>>
>>>>>> 1       body    test
>>>>>> 1       id      1234
>>>>>> 1       name    rtfDocumentToIndex.rtf
>>>>>> 1       path    rtfDocumentToIndex.rtf
>>>>>> 1       summary This is a
>>>>>> 1       type    RTF_INDEXER
>>>>>> 1       body    rtf
>>>>>>
>>>>>>
>>>>>> However on the Document tab I am getting (in the body field):
>>>>>>
>>>>>> This is a test rtf document that will be indexed.
>>>>>>
>>>>>> Amin Mohammed-Coleman
>>>>>>
>>>>>>
>>>>>> I would expect to get a hit using "Amin" or even "document".  I am not
>>>>>> sure whether the
>>>>>> line:
>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>
>>>>>> is incorrect as I am not too sure of the meaning of "Finds the top n
>>>>>> hits
>>>>>> for query." for search (Query query, int n) according to java docs.
>>>>>>
>>>>>> I would be grateful if someone may be able to advise on what I may be
>>>>>> doing wrong.  I am using Lucene 2.4.0
>>>>>>
>>>>>>
>>>>>> Cheers
>>>>>> Amin
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Search Problem

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Hi

I have tried this and it doesn't work.  I don't understand why using  
"amin" instead of "Amin" would work, is it not case insensitive?

I tried "test" for field "body" and this works.  Any other terms don't  
work for example:

"document"
"indexed"

these are tokens that were extracted when creating the lucene document.


Thanks for your reply.

Cheers

Amin

On 2 Jan 2009, at 10:36, Chris Lu wrote:

> Basically Lucene stores analyzed tokens, and looks up for the  
> matches based
> on the tokens.
> "Amin" after StandardAnalyzer is "amin", so you need to use new  
> Term("body",
> "amin"), instead of new Term("body", "Amin"), to search.
>
> -- 
> Chris Lu
> -------------------------
> Instant Scalable Full-Text Search On Any Database/Application
> site: http://www.dbsight.net
> demo: http://search.dbsight.com
> Lucene Database Search in 3 minutes:
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
> DBSight customer, a shopping comparison site, (anonymous per  
> request) got
> 2.6 Million Euro funding!
>
> On Thu, Jan 1, 2009 at 11:30 PM, Amin Mohammed-Coleman <aminmc@gmail.com 
> >wrote:
>
>> Hi
>>
>> Sorry I was using the StandardAnalyzer in this instance.
>>
>> Cheers
>>
>>
>>
>>
>> On 2 Jan 2009, at 00:55, Chris Lu wrote:
>>
>> You need to let us know the analyzer you are using.
>>> -- Chris Lu
>>> -------------------------
>>> Instant Scalable Full-Text Search On Any Database/Application
>>> site: http://www.dbsight.net
>>> demo: http://search.dbsight.com
>>> Lucene Database Search in 3 minutes:
>>>
>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>> DBSight customer, a shopping comparison site, (anonymous per  
>>> request) got
>>> 2.6 Million Euro funding!
>>>
>>> On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <aminmc@gmail.com
>>>> wrote:
>>>
>>>
>>>>
>>>> Hi
>>>>>
>>>>> I have created a RTFHandler which takes a RTF file and creates a  
>>>>> lucene
>>>>> Document which is indexed.  The RTFHandler looks like something  
>>>>> like
>>>>> this:
>>>>>
>>>>> if (bodyText != null) {
>>>>>                     Document document = new Document();
>>>>>                     Field field = new
>>>>> Field(MetaDataEnum.BODY.getDescription(), bodyText.trim(),
>>>>> Field.Store.YES,
>>>>> Field.Index.ANALYZED);
>>>>>                     document.add(field);
>>>>>
>>>>>
>>>>> }
>>>>>
>>>>> I am using Java Built in RTF text extraction.  When I run my  
>>>>> test to
>>>>> verify that the document contains text that I expect this works  
>>>>> fine.  I
>>>>> get
>>>>> the following when I print the document:
>>>>>
>>>>> Document<stored/uncompressed,indexed,tokenized<body:This is a  
>>>>> test rtf
>>>>> document that will be indexed.
>>>>>
>>>>> Amin Mohammed-Coleman>
>>>>> stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>>>>> stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>>>>> stored/uncompressed,indexed<type:RTF_INDEXER>
>>>>> stored/uncompressed,indexed<summary:This is a >>
>>>>>
>>>>>
>>>>> The problem is when I use the following to search I get no result:
>>>>>
>>>>>     MultiSearcher multiSearcher = new MultiSearcher(new  
>>>>> Searchable[]
>>>>> {rtfIndexSearcher});
>>>>>                     Term t = new Term("body", "Amin");
>>>>>                     TermQuery termQuery = new TermQuery(t);
>>>>>                     TopDocs topDocs =  
>>>>> multiSearcher.search(termQuery,
>>>>> 1);
>>>>>                     System.out.println(topDocs.totalHits);
>>>>>                     multiSearcher.close();
>>>>>
>>>>> RftIndexSearcher is configured with the directory that holds rtf
>>>>> documents.  I have used Luke to look at the document and what I am
>>>>> finding
>>>>> in the overview tab is the following for the document:
>>>>>
>>>>> 1       body    test
>>>>> 1       id      1234
>>>>> 1       name    rtfDocumentToIndex.rtf
>>>>> 1       path    rtfDocumentToIndex.rtf
>>>>> 1       summary This is a
>>>>> 1       type    RTF_INDEXER
>>>>> 1       body    rtf
>>>>>
>>>>>
>>>>> However on the Document tab I am getting (in the body field):
>>>>>
>>>>> This is a test rtf document that will be indexed.
>>>>>
>>>>> Amin Mohammed-Coleman
>>>>>
>>>>>
>>>>> I would expect to get a hit using "Amin" or even "document".  I  
>>>>> am not
>>>>> sure whether the
>>>>> line:
>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>
>>>>> is incorrect as I am not too sure of the meaning of "Finds the  
>>>>> top n
>>>>> hits
>>>>> for query." for search (Query query, int n) according to java  
>>>>> docs.
>>>>>
>>>>> I would be grateful if someone may be able to advise on what I  
>>>>> may be
>>>>> doing wrong.  I am using Lucene 2.4.0
>>>>>
>>>>>
>>>>> Cheers
>>>>> Amin
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Search Problem

Posted by Chris Lu <ch...@gmail.com>.

Basically Lucene stores analyzed tokens, and looks up for the matches based
on the tokens.
"Amin" after StandardAnalyzer is "amin", so you need to use new Term("body",
"amin"), instead of new Term("body", "Amin"), to search.

-- 
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!

On Thu, Jan 1, 2009 at 11:30 PM, Amin Mohammed-Coleman <am...@gmail.com>wrote:

> Hi
>
> Sorry I was using the StandardAnalyzer in this instance.
>
> Cheers
>
>
>
>
> On 2 Jan 2009, at 00:55, Chris Lu wrote:
>
>  You need to let us know the analyzer you are using.
>> -- Chris Lu
>> -------------------------
>> Instant Scalable Full-Text Search On Any Database/Application
>> site: http://www.dbsight.net
>> demo: http://search.dbsight.com
>> Lucene Database Search in 3 minutes:
>>
>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>> DBSight customer, a shopping comparison site, (anonymous per request) got
>> 2.6 Million Euro funding!
>>
>> On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <aminmc@gmail.com
>> >wrote:
>>
>>
>>>
>>>  Hi
>>>>
>>>> I have created a RTFHandler which takes a RTF file and creates a lucene
>>>> Document which is indexed.  The RTFHandler looks like something like
>>>> this:
>>>>
>>>> if (bodyText != null) {
>>>>                      Document document = new Document();
>>>>                      Field field = new
>>>> Field(MetaDataEnum.BODY.getDescription(), bodyText.trim(),
>>>> Field.Store.YES,
>>>> Field.Index.ANALYZED);
>>>>                      document.add(field);
>>>>
>>>>
>>>> }
>>>>
>>>> I am using Java Built in RTF text extraction.  When I run my test to
>>>> verify that the document contains text that I expect this works fine.  I
>>>> get
>>>> the following when I print the document:
>>>>
>>>> Document<stored/uncompressed,indexed,tokenized<body:This is a test rtf
>>>> document that will be indexed.
>>>>
>>>> Amin Mohammed-Coleman>
>>>> stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>>>> stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>>>> stored/uncompressed,indexed<type:RTF_INDEXER>
>>>> stored/uncompressed,indexed<summary:This is a >>
>>>>
>>>>
>>>> The problem is when I use the following to search I get no result:
>>>>
>>>>      MultiSearcher multiSearcher = new MultiSearcher(new Searchable[]
>>>> {rtfIndexSearcher});
>>>>                      Term t = new Term("body", "Amin");
>>>>                      TermQuery termQuery = new TermQuery(t);
>>>>                      TopDocs topDocs = multiSearcher.search(termQuery,
>>>> 1);
>>>>                      System.out.println(topDocs.totalHits);
>>>>                      multiSearcher.close();
>>>>
>>>> RftIndexSearcher is configured with the directory that holds rtf
>>>> documents.  I have used Luke to look at the document and what I am
>>>> finding
>>>> in the overview tab is the following for the document:
>>>>
>>>> 1       body    test
>>>> 1       id      1234
>>>> 1       name    rtfDocumentToIndex.rtf
>>>> 1       path    rtfDocumentToIndex.rtf
>>>> 1       summary This is a
>>>> 1       type    RTF_INDEXER
>>>> 1       body    rtf
>>>>
>>>>
>>>> However on the Document tab I am getting (in the body field):
>>>>
>>>> This is a test rtf document that will be indexed.
>>>>
>>>> Amin Mohammed-Coleman
>>>>
>>>>
>>>> I would expect to get a hit using "Amin" or even "document".  I am not
>>>> sure whether the
>>>> line:
>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>
>>>> is incorrect as I am not too sure of the meaning of "Finds the top n
>>>> hits
>>>> for query." for search (Query query, int n) according to java docs.
>>>>
>>>> I would be grateful if someone may be able to advise on what I may be
>>>> doing wrong.  I am using Lucene 2.4.0
>>>>
>>>>
>>>> Cheers
>>>> Amin
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Search Problem

Posted by Amin Mohammed-Coleman <am...@gmail.com>.

Hi

Sorry I was using the StandardAnalyzer in this instance.

Cheers



On 2 Jan 2009, at 00:55, Chris Lu wrote:

> You need to let us know the analyzer you are using.
> --  
> Chris Lu
> -------------------------
> Instant Scalable Full-Text Search On Any Database/Application
> site: http://www.dbsight.net
> demo: http://search.dbsight.com
> Lucene Database Search in 3 minutes:
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
> DBSight customer, a shopping comparison site, (anonymous per  
> request) got
> 2.6 Million Euro funding!
>
> On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <aminmc@gmail.com 
> >wrote:
>
>>
>>
>>> Hi
>>>
>>> I have created a RTFHandler which takes a RTF file and creates a  
>>> lucene
>>> Document which is indexed.  The RTFHandler looks like something  
>>> like this:
>>>
>>> if (bodyText != null) {
>>>                       Document document = new Document();
>>>                       Field field = new
>>> Field(MetaDataEnum.BODY.getDescription(), bodyText.trim(),  
>>> Field.Store.YES,
>>> Field.Index.ANALYZED);
>>>                       document.add(field);
>>>
>>>
>>> }
>>>
>>> I am using Java Built in RTF text extraction.  When I run my test to
>>> verify that the document contains text that I expect this works  
>>> fine.  I get
>>> the following when I print the document:
>>>
>>> Document<stored/uncompressed,indexed,tokenized<body:This is a test  
>>> rtf
>>> document that will be indexed.
>>>
>>> Amin Mohammed-Coleman>
>>> stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>>> stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>>> stored/uncompressed,indexed<type:RTF_INDEXER>
>>> stored/uncompressed,indexed<summary:This is a >>
>>>
>>>
>>> The problem is when I use the following to search I get no result:
>>>
>>>       MultiSearcher multiSearcher = new MultiSearcher(new  
>>> Searchable[]
>>> {rtfIndexSearcher});
>>>                       Term t = new Term("body", "Amin");
>>>                       TermQuery termQuery = new TermQuery(t);
>>>                       TopDocs topDocs =  
>>> multiSearcher.search(termQuery,
>>> 1);
>>>                       System.out.println(topDocs.totalHits);
>>>                       multiSearcher.close();
>>>
>>> RftIndexSearcher is configured with the directory that holds rtf
>>> documents.  I have used Luke to look at the document and what I am  
>>> finding
>>> in the overview tab is the following for the document:
>>>
>>> 1       body    test
>>> 1       id      1234
>>> 1       name    rtfDocumentToIndex.rtf
>>> 1       path    rtfDocumentToIndex.rtf
>>> 1       summary This is a
>>> 1       type    RTF_INDEXER
>>> 1       body    rtf
>>>
>>>
>>> However on the Document tab I am getting (in the body field):
>>>
>>> This is a test rtf document that will be indexed.
>>>
>>> Amin Mohammed-Coleman
>>>
>>>
>>> I would expect to get a hit using "Amin" or even "document".  I am  
>>> not
>>> sure whether the
>>> line:
>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>
>>> is incorrect as I am not too sure of the meaning of "Finds the top  
>>> n hits
>>> for query." for search (Query query, int n) according to java docs.
>>>
>>> I would be grateful if someone may be able to advise on what I may  
>>> be
>>> doing wrong.  I am using Lucene 2.4.0
>>>
>>>
>>> Cheers
>>> Amin
>>>
>>>
>>>
>>>
>>>
>>
>>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Search Problem

Posted by Chris Lu <ch...@gmail.com>.

You need to let us know the analyzer you are using.
-- 
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!

On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <am...@gmail.com>wrote:

>
>
>> Hi
>>
>> I have created a RTFHandler which takes a RTF file and creates a lucene
>> Document which is indexed.  The RTFHandler looks like something like this:
>>
>> if (bodyText != null) {
>>                        Document document = new Document();
>>                        Field field = new
>> Field(MetaDataEnum.BODY.getDescription(), bodyText.trim(), Field.Store.YES,
>> Field.Index.ANALYZED);
>>                        document.add(field);
>>
>>
>> }
>>
>> I am using Java Built in RTF text extraction.  When I run my test to
>> verify that the document contains text that I expect this works fine.  I get
>> the following when I print the document:
>>
>> Document<stored/uncompressed,indexed,tokenized<body:This is a test rtf
>> document that will be indexed.
>>
>> Amin Mohammed-Coleman>
>> stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>> stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>> stored/uncompressed,indexed<type:RTF_INDEXER>
>> stored/uncompressed,indexed<summary:This is a >>
>>
>>
>> The problem is when I use the following to search I get no result:
>>
>>        MultiSearcher multiSearcher = new MultiSearcher(new Searchable[]
>> {rtfIndexSearcher});
>>                        Term t = new Term("body", "Amin");
>>                        TermQuery termQuery = new TermQuery(t);
>>                        TopDocs topDocs = multiSearcher.search(termQuery,
>> 1);
>>                        System.out.println(topDocs.totalHits);
>>                        multiSearcher.close();
>>
>> RftIndexSearcher is configured with the directory that holds rtf
>> documents.  I have used Luke to look at the document and what I am finding
>> in the overview tab is the following for the document:
>>
>> 1       body    test
>> 1       id      1234
>> 1       name    rtfDocumentToIndex.rtf
>> 1       path    rtfDocumentToIndex.rtf
>> 1       summary This is a
>> 1       type    RTF_INDEXER
>> 1       body    rtf
>>
>>
>> However on the Document tab I am getting (in the body field):
>>
>> This is a test rtf document that will be indexed.
>>
>> Amin Mohammed-Coleman
>>
>>
>> I would expect to get a hit using "Amin" or even "document".  I am not
>> sure whether the
>> line:
>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>
>> is incorrect as I am not too sure of the meaning of "Finds the top n hits
>> for query." for search (Query query, int n) according to java docs.
>>
>> I would be grateful if someone may be able to advise on what I may be
>> doing wrong.  I am using Lucene 2.4.0
>>
>>
>> Cheers
>> Amin
>>
>>
>>
>>
>>
>
>