Posted to java-user@lucene.apache.org by sp...@gmx.eu on 2008/02/14 22:41:38 UTC

RE: Design questions

> Rather than index one doc per page, you could index a special
> token between pages. Say you index $$$$$$$$$ as the special
> token. 

I have decided to use this approach, but...

What token can I use? It must be a token that is never removed by an
analyzer and never altered in a way that makes it no longer unique in the
resulting token stream.

Is something like $0123456789$ the way to go?

Thank you.




RE: Design questions

Posted by sp...@gmx.eu.
> Why not just use $$$$$$$$? 

Because nearly every analyzer removes it (SimpleAnalyzer, GermanAnalyzer,
RussianAnalyzer, FrenchAnalyzer, ...).
I just tested it with Luke in the search dialog.
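
For reference, here is a minimal sketch of doing the same check in code
instead of in Luke (written against the Lucene 2.x API of that time;
GermanAnalyzer and the candidate string are only examples):

  import java.io.StringReader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.de.GermanAnalyzer;

  public class BoundaryTokenCheck {
    public static void main(String[] args) throws Exception {
      Analyzer analyzer = new GermanAnalyzer();
      String candidate = "$0123456789$";
      // Run the candidate boundary string through the analyzer and print
      // every token that survives; no output means the analyzer drops it.
      TokenStream ts = analyzer.tokenStream("text", new StringReader(candidate));
      Token token;
      while ((token = ts.next()) != null) {
        System.out.println(token.termText());
      }
    }
  }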




RE: Design questions

Posted by sp...@gmx.eu.
> You need to watch both the positionIncrementGap
> (which, as I remember, gets added for each new field of the
> same name you add to the document). Make it 0 rather than
> whatever it currently is. You may have to create a new analyzer
> by subclassing your favorite analyzer and overriding
> getPositionIncrementGap (?)

Well, I'm using GermanAnalyzer, and it does not override
getPositionIncrementGap from Analyzer.
And Analyzer's getPositionIncrementGap returns 0.
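
If the gap ever needs to be something other than the default, a subclass
along these lines should do it (just a sketch against the Lucene 2.x API;
the class name and the value 10000 are placeholders, not recommendations):

  import org.apache.lucene.analysis.de.GermanAnalyzer;

  public class PageGapGermanAnalyzer extends GermanAnalyzer {
    // Same token stream as GermanAnalyzer, but with an explicit position
    // gap between repeated Field instances of the same name.
    public int getPositionIncrementGap(String fieldName) {
      return 10000;
    }
  }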

> Also, I'm not sure whether the term increment (see 
> get/setPositionIncrement)
> needs to be taken into account. See the SynonymAnalyzer in
> Lucene in Action.

I cannot find the source of SynonymAnalyzer.




Re: Design questions

Posted by Erick Erickson <er...@gmail.com>.
You need to watch both the positionIncrementGap
(which, as I remember, gets added for each new field of the
same name you add to the document). Make it 0 rather than
whatever it currently is. You may have to create a new analyzer
by subclassing your favorite analyzer and overriding
getPositionIncrementGap (?)

Also, I'm not sure whether the term increment (see get/setPositionIncrement)
needs to be taken into account. See the SynonymAnalyzer in
Lucene in Action.
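
(This is not the Lucene in Action source, only a rough sketch of the
mechanism it relies on, against the Lucene 2.x API: a TokenFilter that
emits an extra token at the same position as the original by giving it a
position increment of 0.)

  import java.io.IOException;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;

  public class SamePositionFilter extends TokenFilter {
    private Token pending;  // extra token waiting to be emitted

    public SamePositionFilter(TokenStream input) {
      super(input);
    }

    public Token next() throws IOException {
      if (pending != null) {
        Token stacked = pending;
        pending = null;
        return stacked;
      }
      Token current = input.next();
      if (current == null) {
        return null;
      }
      // Queue an upper-cased copy that occupies the same position as the
      // original token (position increment 0 = stacked on top of it).
      Token extra = new Token(current.termText().toUpperCase(),
                              current.startOffset(), current.endOffset());
      extra.setPositionIncrement(0);
      pending = extra;
      return current;
    }
  }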

On Fri, Feb 15, 2008 at 8:37 AM, <sp...@gmx.eu> wrote:

> > >   Document doc = new Document();
> > >   for (int i = 0; i < pages.length; i++) {
> > >     doc.add(new Field("text", pages[i], Field.Store.NO,
> > > Field.Index.TOKENIZED));
> > >     doc.add(new Field("text", "$$", Field.Store.NO,
> > > Field.Index.UN_TOKENIZED));
> > >   }
> >
> > UN_TOKENIZED. Nice idea!
> > I will check this out.
>
>
> Hm... when I try this, something strange happens with my offsets.
>
> When I use
> doc.add(new Field("text", pages[i] +
> "012345678901234567890123456789012345678901234567890123456789",
> Field.Store.NO, Field.Index.TOKENIZED))
> everything is fine. Offsets are as I expect.
>
> But when I use
> doc.add(new Field("text", pages[i], Field.Store.NO, Field.Index.TOKENIZED
> ))
> doc.add(new Field("text",
> "012345678901234567890123456789012345678901234567890123456789",
> Field.Store.NO, Field.Index.UN_TOKENIZED))
>
> the offsets of my terms are too high.
>
> What is the difference?
>
> Thank you.
>
>
>
>

RE: Design questions

Posted by sp...@gmx.eu.
> >   Document doc = new Document();
> >   for (int i = 0; i < pages.length; i++) {
> >     doc.add(new Field("text", pages[i], Field.Store.NO, 
> > Field.Index.TOKENIZED));
> >     doc.add(new Field("text", "$$", Field.Store.NO, 
> > Field.Index.UN_TOKENIZED));
> >   }
> 
> UN_TOKENIZED. Nice idea!
> I will check this out.


Hm... when I try this, something strange happens with my offsets.

When I use 
doc.add(new Field("text", pages[i] +
"012345678901234567890123456789012345678901234567890123456789",
Field.Store.NO, Field.Index.TOKENIZED)) 
everything is fine. Offsets are as I expect.

But when I use 
doc.add(new Field("text", pages[i], Field.Store.NO, Field.Index.TOKENIZED))
doc.add(new Field("text",
"012345678901234567890123456789012345678901234567890123456789",
Field.Store.NO, Field.Index.UN_TOKENIZED))

the offsets of my terms are too high.

What is the difference?

Thank you.
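
A quick way to see exactly which offsets get recorded is to dump the term
vectors, as in this minimal sketch (it assumes the "text" field is also
indexed with Field.TermVector.WITH_POSITIONS_OFFSETS, which the snippets
above do not do, and the index path is made up):

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.TermFreqVector;
  import org.apache.lucene.index.TermPositionVector;
  import org.apache.lucene.index.TermVectorOffsetInfo;

  public class DumpOffsets {
    public static void main(String[] args) throws Exception {
      IndexReader reader = IndexReader.open("/path/to/index");  // hypothetical
      // Dump the stored offsets of every term in document 0 so the two
      // indexing variants can be compared side by side.
      TermFreqVector tfv = reader.getTermFreqVector(0, "text");
      TermPositionVector tpv = (TermPositionVector) tfv;
      String[] terms = tpv.getTerms();
      for (int i = 0; i < terms.length; i++) {
        TermVectorOffsetInfo[] offsets = tpv.getOffsets(i);
        for (int j = 0; offsets != null && j < offsets.length; j++) {
          System.out.println(terms[i] + " -> [" + offsets[j].getStartOffset()
              + ", " + offsets[j].getEndOffset() + "]");
        }
      }
      reader.close();
    }
  }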




RE: Design questions

Posted by sp...@gmx.eu.
>   Document doc = new Document();
>   for (int i = 0; i < pages.length; i++) {
>     doc.add(new Field("text", pages[i], Field.Store.NO, 
> Field.Index.TOKENIZED));
>     doc.add(new Field("text", "$$", Field.Store.NO, 
> Field.Index.UN_TOKENIZED));
>   }

UN_TOKENIZED. Nice idea!
I will check this out.

> 2) if your goal is just to be able to make sure you can query for
> phrases without crossing page boundaries, it's a lot simpler just to
> use a really big positionIncrementGap with your analyzer (and add each
> page as a separate Field instance). Boundary tokens like these are
> really only necessary if you want more complex queries (like "find X
> and Y on the same page but not in the same sentence").

Hm. This is what Erik already recommended.
I would have to store the field with TermVector.WITH_POSITIONS, right?

But I do not know the maximum number of terms per page, and I do not know
the maximum number of pages.
I have already had documents with more than 50,000 pages (A4) and documents
with a single page but 100 MB of data.
How many terms can 100 MB contain? Hm...
Since positions are stored as ints, I could allow at most about 40,000
positions per page (50,000 pages * 40,000 positions is already close to
Integer.MAX_VALUE).
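
The back-of-the-envelope math, as a tiny sketch (the page count is the
worst case mentioned above, not a hard limit):

  public class GapBudget {
    public static void main(String[] args) {
      long maxPages = 50000L;  // largest page count seen so far
      // Positions are ints, so pages * (gap + tokens per page) has to stay
      // below Integer.MAX_VALUE.
      long budgetPerPage = Integer.MAX_VALUE / maxPages;  // about 42949
      System.out.println("position budget per page: " + budgetPerPage);
    }
  }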





RE: Design questions

Posted by sp...@gmx.eu.
Well, it seems this may be a solution for me too.
But I'm afraid that someone will change this string one day, and then my
app will not work anymore...
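
One way to keep it in a single, documented place, as a small sketch (the
file name, property key and default value below are made up for
illustration):

  import java.io.FileInputStream;
  import java.util.Properties;

  public class BoundaryToken {
    // Load the page-boundary marker from one config file so it is obvious
    // that changing it would invalidate every existing index.
    public static String load() throws Exception {
      Properties props = new Properties();
      props.load(new FileInputStream("indexer.properties"));
      return props.getProperty("page.boundary.token", "dfgjkjrkruigduhfkdgjrugr");
    }
  }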

> -----Original Message-----
> From: Adrian Smith [mailto:adrian.m.smith@gmail.com] 
> Sent: Freitag, 15. Februar 2008 13:02
> To: java-user@lucene.apache.org
> Subject: Re: Design questions
> 
> Hi,
> 
> I have a similar situation. I also considered using $. But for the sake
> of not running into (potential) problems with tokenisers, I just defined
> a string in a config file which is guaranteed never to occur in a
> document and will never be searched for, e.g.
> 
> dfgjkjrkruigduhfkdgjrugr
> 
> Cheers, Adrian
> --
> Java Software Developer
> http://www.databasesandlife.com/
> 
> 
> 
> On 15/02/2008, Chris Hostetter <ho...@fucit.org> wrote:
> >
> >
> > I haven't really been following this thread that closely, but...
> >
> > : Why not just use $$$$$$$$? Check to ensure that it makes
> > : it through whatever analyzer you choose though. For instance,
> > : LetterTokenizer will remove it...
> >
> >
> > 1) i'm 99% sure you can do something like this...
> >
> >   Document doc = new Document();
> >   for (int i = 0; i < pages.length; i++) {
> >     doc.add(new Field("text", pages[i], Field.Store.NO,
> > Field.Index.TOKENIZED));
> >     doc.add(new Field("text", "$$", Field.Store.NO,
> > Field.Index.UN_TOKENIZED));
> >   }
> >
> > ...and you'll get your magic token regardless of whether it would
> > normally make it through your analyzer. In fact: you want it to be
> > something your analyzer could never produce, even if it appears in
> > the original text, so you don't get false boundaries (i.e. if you use
> > an Analyzer that lowercases everything, then "A" makes a perfectly
> > fine boundary token).
> >
> > 2) if your goal is just to be able to make sure you can query for
> > phrases without crossing page boundaries, it's a lot simpler just to
> > use a really big positionIncrementGap with your analyzer (and add
> > each page as a separate Field instance). Boundary tokens like these
> > are really only necessary if you want more complex queries (like
> > "find X and Y on the same page but not in the same sentence").
> >
> >
> >
> >
> > -Hoss
> >
> >
> >
> >
> >
> 




Re: Design questions

Posted by Adrian Smith <ad...@gmail.com>.
Hi,

I have a similar situation. I also considered using $. But for the sake of
not running into (potential) problems with tokenisers, I just defined a
string in a config file which is guaranteed never to occur in a document
and will never be searched for, e.g.

dfgjkjrkruigduhfkdgjrugr

Cheers, Adrian
--
Java Software Developer
http://www.databasesandlife.com/



On 15/02/2008, Chris Hostetter <ho...@fucit.org> wrote:
>
>
> I haven't really been following this thread that closely, but...
>
> : Why not just use $$$$$$$$? Check to ensure that it makes
> : it through whatever analyzer you choose though. For instance,
> : LetterTokenizer will remove it...
>
>
> 1) i'm 99% sure you can do something like this...
>
>   Document doc = new Document();
>   for (int i = 0; i < pages.length; i++) {
>     doc.add(new Field("text", pages[i], Field.Store.NO,
> Field.Index.TOKENIZED));
>     doc.add(new Field("text", "$$", Field.Store.NO,
> Field.Index.UN_TOKENIZED));
>   }
>
> ...and you'll get your magic token regardless of whether it would normally
> make it through your analyzer. In fact: you want it to be something your
> analyzer could never produce, even if it appears in the original text, so
> you don't get false boundaries (i.e. if you use an Analyzer that lowercases
> everything, then "A" makes a perfectly fine boundary token).
>
> 2) if your goal is just to be able to make sure you can query for phrases
> without crossing page boundaries, it's a lot simpler just to use a
> really big positionIncrementGap with your analyzer (and add each page as a
> separate Field instance). Boundary tokens like these are really only
> necessary if you want more complex queries (like "find X and Y on
> the same page but not in the same sentence").
>
>
>
>
> -Hoss
>
>
>
>
>

Re: Design questions

Posted by Chris Hostetter <ho...@fucit.org>.
I haven't really been following this thread that closely, but...

: Why not just use $$$$$$$$? Check to ensure that it makes
: it through whatever analyzer you choose though. For instance,
: LetterTokenizer will remove it...

1) i'm 99% sure you can do something like this...

  Document doc = new Document();
  for (int i = 0; i < pages.length; i++) {
    doc.add(new Field("text", pages[i], Field.Store.NO, Field.Index.TOKENIZED));
    doc.add(new Field("text", "$$", Field.Store.NO, Field.Index.UN_TOKENIZED));
  }

...and you'll get your magic token regardless of whether it would normally
make it through your analyzer. In fact: you want it to be something your
analyzer could never produce, even if it appears in the original text, so
you don't get false boundaries (i.e. if you use an Analyzer that lowercases
everything, then "A" makes a perfectly fine boundary token).

2) if your goal is just to be able to make sure you can query for phrases
without crossing page boundaries, it's a lot simpler just to use a
really big positionIncrementGap with your analyzer (and add each page as a
separate Field instance). Boundary tokens like these are really only
necessary if you want more complex queries (like "find X and Y on
the same page but not in the same sentence").
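
In code, option (2) is just the loop above without the boundary field, as
a sketch (the IndexWriter named "writer" is assumed here and would have to
be built with a big-gap analyzer, e.g. the subclass sketched earlier in
the thread):

  // One Field instance per page, no boundary tokens; the analyzer's
  // positionIncrementGap keeps phrase matches from crossing pages.
  Document doc = new Document();
  for (int i = 0; i < pages.length; i++) {
    doc.add(new Field("text", pages[i], Field.Store.NO, Field.Index.TOKENIZED));
  }
  writer.addDocument(doc);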




-Hoss




Re: Design questions

Posted by Erick Erickson <er...@gmail.com>.
Why not just use $$$$$$$$? Check to ensure that it makes
it through whatever analyzer you choose though. For instance,
LetterTokenizer will remove it...

Erick

On Thu, Feb 14, 2008 at 4:41 PM, <sp...@gmx.eu> wrote:

> > Rather than index one doc per page, you could index a special
> > token between pages. Say you index $$$$$$$$$ as the special
> > token.
>
> I have decided to use this approach, but...
>
> What token can I use? It must be a token that is never removed by an
> analyzer and never altered in a way that makes it no longer unique in the
> resulting token stream.
>
> Is something like $0123456789$ the way to go?
>
> Thank you.
>
>
>
>