Posted to java-user@lucene.apache.org by Celso Fontes <ce...@gmail.com> on 2010/12/12 02:34:00 UTC

Problems with "tagged" and "non tagged" text

Hi, I have the same text in two files:

****TXT      file: http://pastebin.com/u9Rd9VVA
****(X)HTM file: http://pastebin.com/ydHmTQZ8

I am running this query:

   APC (adenomatous polyposis coli) actin assembly

With the OR operator and the Snowball analyzer, this results in:

    +content:apc +(+content:adenomat +content:polyposi +content:coli) +content:actin +content:assembl


But only the TXT file is returned as a hit. Why?


P.S. If I try without the "()" I get the same result.
Thanks,
Celso

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Problems with "tagged" and "non tagged" text

Posted by Celso Fontes <ce...@gmail.com>.
Dear Erick,
Sorry, I am actually using the "AND" operator; I wrote it wrong in the
email (I am very tired).
Here is the main part of the code:

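            // Build one Lucene document per file.
            Document document = new Document();
            String path = file.getCanonicalPath();

            // Store the file path so it can be shown with the search results.
            document.add(new Field("title", path,
                    Field.Store.YES,
                    Field.Index.ANALYZED));

            // Index the file contents as read from disk (tokenized, not stored).
            Reader reader = new FileReader(file);
            document.add(new Field("content", reader));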

As you can see, I am indexing the files. With other queries I get good
results for the HTM files. This HTM file, for example, is returned for
this query:

   APC (adenomatous polyposis coli) Colon Cancer

Thanks,
Celso.

2010/12/12 Erick Erickson <er...@gmail.com>:
> Unless you provide details on how you are indexing these documents,
> it's pretty hard to help.
>
> It's also hard to reconcile your statement that OR is the default operator
> with the results you posted. The '+' all over the place really points to AND
> as the default.
>
> There's no magic in Lucene that will automatically put the "content" of
> an (X)HTM document in the content field of your document. How are you
> ensuring that the doc is indexed as you expect?
>
> Luke is a very valuable tool for inspecting your index to see if it is what
> you think it is...
>
> Best
> Erick



Re: Problems with "tagged" and "non tagged" text

Posted by Erick Erickson <er...@gmail.com>.
Unless you provide details on how you are indexing these documents,
it's pretty hard to help.

It's also hard to reconcile your statement that OR is the default operator
with the results you posted. The '+' all over the place really points to AND
as the default.
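
To illustrate the point about '+': each '+' marks a clause as required, so
a document matches only if every such term is present in the field. The
following is a minimal sketch in plain Java with no Lucene calls. The class
RequiredClauseDemo and its methods are hypothetical names made up for this
illustration, and a document is modelled simply as the set of its indexed
terms. It is not how Lucene implements matching.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: models required ('+') versus optional
// clauses over a set of already-analyzed terms.
final class RequiredClauseDemo {

    // AND semantics: every query term must be present in the document.
    static boolean matchesAll(Set<String> docTerms, List<String> queryTerms) {
        return docTerms.containsAll(queryTerms);
    }

    // OR semantics: a single matching term is enough.
    static boolean matchesAny(Set<String> docTerms, List<String> queryTerms) {
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stemmed terms taken from the parsed query in the original post.
        List<String> query = Arrays.asList(
                "apc", "adenomat", "polyposi", "coli", "actin", "assembl");

        // Made-up document that lacks one of the terms ("assembl").
        Set<String> doc = new HashSet<>(Arrays.asList(
                "apc", "adenomat", "polyposi", "coli", "actin"));

        System.out.println("AND match: " + matchesAll(doc, query));
        System.out.println("OR match:  " + matchesAny(doc, query));
    }
}
```

With AND as the default, one missing term is enough to drop a document
from the results, which is why it matters exactly which terms ended up
in the index for each file.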

There's no magic in Lucene that will automatically put the "content" of
an (X)HTM document in the content field of your document. How are you
ensuring that the doc is indexed as you expect?
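
One way to make the HTM file comparable to the TXT file is to reduce the
markup to plain text before it reaches the content field. The sketch below
is only an assumption about what such a step could look like. NaiveHtmlText
and extractText are hypothetical names, the regex approach is deliberately
crude, and a real HTML parser would be the safer choice for anything beyond
a quick experiment. The thread does not confirm that markup is the cause of
the missing hit.

```java
import java.util.regex.Pattern;

// Hypothetical helper: a crude markup stripper for experiments only.
final class NaiveHtmlText {

    // Whole <script>...</script> and <style>...</style> elements.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // HTML comments, possibly spanning several lines.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    // Any remaining tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String extractText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode only a handful of common entities; "&amp;" goes last so
        // that "&amp;lt;" does not get decoded twice.
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&amp;", "&");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>APC (adenomatous polyposis coli) "
                + "regulates <b>actin</b>&nbsp;assembly</p></body></html>";
        System.out.println(extractText(html));
    }
}
```

The extracted string could then be wrapped in a StringReader and handed to
the same Field("content", reader) call used for the TXT file, so that both
files go through the analyzer as plain text.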

Luke is a very valuable tool for inspecting your index to see if it is what
you think it is...

Best
Erick
