RE: Design questions
Posted to java-user@lucene.apache.org by sp...@gmx.eu on 2008/02/14 22:41:38 UTC
> Rather than index one doc per page, you could index a special
> token between pages. Say you index $$$$$$$$$ as the special
> token.
I have decided to use this version, but...
What token can I use? It must be a token that is never removed by an
analyzer, or altered in a way that makes it no longer unique in the resulting
token stream.
Is something like $0123456789$ the way to go?
Thank you.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
RE: Design questions
Posted by sp...@gmx.eu.
> Why not just use $$$$$$$$?
Because nearly every analyzer removes it (SimpleAnalyzer, German, Russian,
French...).
I just tested it with Luke in the search dialog.
RE: Design questions
Posted by sp...@gmx.eu.
> You need to watch both the positionIncrementGap
> (which, as I remember, gets added for each new field of the
> same name you add to the document). Make it 0 rather than
> whatever it is currently. You may have to create a new analyzer
> by subclassing your favorite analyzer and overriding
> getPositionIncrementGap (?)
Well, I'm using GermanAnalyzer, which does not override
getPositionIncrementGap from Analyzer.
And in Analyzer, getPositionIncrementGap returns 0.
> Also, I'm not sure whether the term increment (see
> get/setPositionIncrement)
> needs to be taken into account. See the SynonymAnalyzer in
> Lucene in Action.
I cannot find the source of SynonymAnalyzer.
Re: Design questions
Posted by Erick Erickson <er...@gmail.com>.
You need to watch both the positionIncrementGap
(which, as I remember, gets added for each new field of the
same name you add to the document). Make it 0 rather than
whatever it is currently. You may have to create a new analyzer
by subclassing your favorite analyzer and overriding
getPositionIncrementGap (?)
Also, I'm not sure whether the term increment (see get/setPositionIncrement)
needs to be taken into account. See the SynonymAnalyzer in
Lucene in Action.
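To see what the gap changes, here is a small plain-Java simulation (no Lucene classes; the class and method names are invented for illustration) of how term positions accumulate across repeated fields of the same name:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PositionIncrementGapSketch {
    // Assigns Lucene-style term positions to whitespace tokens from several
    // field instances of the same name, adding `gap` between instances.
    static Map<String, Integer> positions(String[] fieldValues, int gap) {
        Map<String, Integer> pos = new LinkedHashMap<>();
        int next = 0;
        for (int i = 0; i < fieldValues.length; i++) {
            if (i > 0) next += gap; // the positionIncrementGap
            for (String token : fieldValues[i].split("\\s+")) {
                pos.put(token, next++);
            }
        }
        return pos;
    }

    public static void main(String[] args) {
        String[] pages = { "alpha beta", "gamma delta" };
        System.out.println(positions(pages, 0));
        // {alpha=0, beta=1, gamma=2, delta=3}
        System.out.println(positions(pages, 1000));
        // {alpha=0, beta=1, gamma=1002, delta=1003}
    }
}
```

With a gap of 0 the last token of one page and the first token of the next are adjacent, so a phrase query could match across the boundary; a large gap (or a boundary token) prevents that.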
On Fri, Feb 15, 2008 at 8:37 AM, <sp...@gmx.eu> wrote:
> > > Document doc = new Document();
> > > for (int i = 0; i < pages.length; i++) {
> > >   doc.add(new Field("text", pages[i], Field.Store.NO,
> > >           Field.Index.TOKENIZED));
> > >   doc.add(new Field("text", "$$", Field.Store.NO,
> > >           Field.Index.UN_TOKENIZED));
> > > }
> >
> > UN_TOKENIZED. Nice idea!
> > I will check this out.
>
>
> Hm... when I try this, something strange happens with my offsets.
>
> When I use
> doc.add(new Field("text", pages[i] +
> "012345678901234567890123456789012345678901234567890123456789",
> Field.Store.NO, Field.Index.TOKENIZED))
> everything is fine. Offsets are as I expect.
>
> But when I use
> doc.add(new Field("text", pages[i], Field.Store.NO, Field.Index.TOKENIZED
> ))
> doc.add(new Field("text",
> "012345678901234567890123456789012345678901234567890123456789",
> Field.Store.NO, Field.Index.UN_TOKENIZED))
>
> the offsets of my terms are too high.
>
> What is the difference?
>
> Thank you.
RE: Design questions
Posted by sp...@gmx.eu.
> > Document doc = new Document();
> > for (int i = 0; i < pages.length; i++) {
> >   doc.add(new Field("text", pages[i], Field.Store.NO,
> >           Field.Index.TOKENIZED));
> >   doc.add(new Field("text", "$$", Field.Store.NO,
> >           Field.Index.UN_TOKENIZED));
> > }
>
> UN_TOKENIZED. Nice idea!
> I will check this out.
Hm... when I try this, something strange happens with my offsets.
When I use
doc.add(new Field("text", pages[i] +
"012345678901234567890123456789012345678901234567890123456789",
Field.Store.NO, Field.Index.TOKENIZED))
everything is fine. Offsets are as I expect.
But when I use
doc.add(new Field("text", pages[i], Field.Store.NO, Field.Index.TOKENIZED))
doc.add(new Field("text",
"012345678901234567890123456789012345678901234567890123456789",
Field.Store.NO, Field.Index.UN_TOKENIZED))
the offsets of my terms are too high.
What is the difference?
Thank you.
RE: Design questions
Posted by sp...@gmx.eu.
> Document doc = new Document();
> for (int i = 0; i < pages.length; i++) {
>   doc.add(new Field("text", pages[i], Field.Store.NO,
>           Field.Index.TOKENIZED));
>   doc.add(new Field("text", "$$", Field.Store.NO,
>           Field.Index.UN_TOKENIZED));
> }
UN_TOKENIZED. Nice idea!
I will check this out.
> 2) if your goal is just to be able to make sure you can query for phrases
> without crossing page boundaries, it's a lot simpler just to use a
> really big positionIncrementGap with your analyzer (and add each page as a
> separate Field instance). Boundary tokens like these are really only
> necessary if you want more complex queries (like "find X and Y on
> the same page but not in the same sentence")
Hm. This is what Erick already recommended.
I would have to store the field with TermVector.WITH_POSITIONS, right?
But I do not know the maximum number of terms per page, and I do not know the
maximum number of pages.
I have already had documents with more than 50,000 pages (A4), and documents with
1 page but 100 MB of data.
How many terms can 100 MB have? Hm...
Since positions are stored as int, I could have a maximum of 40,000 terms per
page (50,000 pages * 40,000 terms -> nearly Integer.MAX_VALUE).
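The arithmetic above can be double-checked in plain Java (a throwaway sketch; the class name is invented), computing in long so the product cannot overflow int:

```java
public class PositionBudget {
    public static void main(String[] args) {
        long pages = 50_000L;        // worst case seen so far
        long termsPerPage = 40_000L; // assumed per-page budget
        long maxPosition = pages * termsPerPage; // long arithmetic, no overflow
        System.out.println(maxPosition);                     // 2000000000
        System.out.println(Integer.MAX_VALUE);               // 2147483647
        System.out.println(maxPosition < Integer.MAX_VALUE); // true, barely
    }
}
```

So 2,000,000,000 positions do still fit in an int, but with only about 7% headroom.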
RE: Design questions
Posted by sp...@gmx.eu.
Well, it seems that this may be a solution for me too.
But I'm afraid that someone will change this string one day, and then my app
will not work anymore...
> -----Original Message-----
> From: Adrian Smith [mailto:adrian.m.smith@gmail.com]
> Sent: Freitag, 15. Februar 2008 13:02
> To: java-user@lucene.apache.org
> Subject: Re: Design questions
>
> Hi,
>
> I have a similar situation. I also considered using $. But
> for the sake of
> not running into (potential) problems with Tokenisers, I just
> defined a
> string in a config file which for sure is never going to
> occur in a document
> and will never be searched for, e.g.
>
> dfgjkjrkruigduhfkdgjrugr
>
> Cheers, Adrian
> --
> Java Software Developer
> http://www.databasesandlife.com/
>
>
>
> On 15/02/2008, Chris Hostetter <ho...@fucit.org> wrote:
> >
> >
> > I haven't really been following this thread that closely, but...
> >
> > : Why not just use $$$$$$$$? Check to ensure that it makes
> >
> > : it through whatever analyzer you choose though. For instance,
> > : LetterTokenizer will remove it...
> >
> >
> > 1) I'm 99% sure you can do something like this...
> >
> > Document doc = new Document();
> > for (int i = 0; i < pages.length; i++) {
> >   doc.add(new Field("text", pages[i], Field.Store.NO,
> >           Field.Index.TOKENIZED));
> >   doc.add(new Field("text", "$$", Field.Store.NO,
> >           Field.Index.UN_TOKENIZED));
> > }
> >
> > ...and you'll get your magic token regardless of whether it would normally
> > make it through your analyzer. In fact: you want it to be something your
> > analyzer could never produce, even if it appears in the original text, so
> > you don't get false boundaries (i.e.: if you use an Analyzer that lowercases
> > everything, then "A" makes a perfectly fine boundary token.)
> >
> > 2) if your goal is just to be able to make sure you can query for phrases
> > without crossing page boundaries, it's a lot simpler just to use a
> > really big positionIncrementGap with your analyzer (and add each page as a
> > separate Field instance). Boundary tokens like these are really only
> > necessary if you want more complex queries (like "find X and Y on
> > the same page but not in the same sentence")
> >
> >
> >
> >
> > -Hoss
Re: Design questions
Posted by Adrian Smith <ad...@gmail.com>.
Hi,
I have a similar situation. I also considered using $. But for the sake of
not running into (potential) problems with Tokenisers, I just defined a
string in a config file which for sure is never going to occur in a document
and will never be searched for, e.g.
dfgjkjrkruigduhfkdgjrugr
Cheers, Adrian
--
Java Software Developer
http://www.databasesandlife.com/
On 15/02/2008, Chris Hostetter <ho...@fucit.org> wrote:
>
>
> I haven't really been following this thread that closely, but...
>
> : Why not just use $$$$$$$$? Check to ensure that it makes
>
> : it through whatever analyzer you choose though. For instance,
> : LetterTokenizer will remove it...
>
>
> 1) I'm 99% sure you can do something like this...
>
> Document doc = new Document();
> for (int i = 0; i < pages.length; i++) {
>   doc.add(new Field("text", pages[i], Field.Store.NO,
>           Field.Index.TOKENIZED));
>   doc.add(new Field("text", "$$", Field.Store.NO,
>           Field.Index.UN_TOKENIZED));
> }
>
> ...and you'll get your magic token regardless of whether it would normally
> make it through your analyzer. In fact: you want it to be something your
> analyzer could never produce, even if it appears in the original text, so
> you don't get false boundaries (i.e.: if you use an Analyzer that lowercases
> everything, then "A" makes a perfectly fine boundary token.)
>
> 2) if your goal is just to be able to make sure you can query for phrases
> without crossing page boundaries, it's a lot simpler just to use a
> really big positionIncrementGap with your analyzer (and add each page as a
> separate Field instance). Boundary tokens like these are really only
> necessary if you want more complex queries (like "find X and Y on
> the same page but not in the same sentence")
>
>
>
>
> -Hoss
Re: Design questions
Posted by Chris Hostetter <ho...@fucit.org>.
I haven't really been following this thread that closely, but...
: Why not just use $$$$$$$$? Check to ensure that it makes
: it through whatever analyzer you choose though. For instance,
: LetterTokenizer will remove it...
1) I'm 99% sure you can do something like this...
Document doc = new Document();
for (int i = 0; i < pages.length; i++) {
  doc.add(new Field("text", pages[i], Field.Store.NO, Field.Index.TOKENIZED));
  doc.add(new Field("text", "$$", Field.Store.NO, Field.Index.UN_TOKENIZED));
}
...and you'll get your magic token regardless of whether it would normally
make it through your analyzer. In fact: you want it to be something your
analyzer could never produce, even if it appears in the original text, so
you don't get false boundaries (i.e.: if you use an Analyzer that lowercases
everything, then "A" makes a perfectly fine boundary token.)
2) if your goal is just to be able to make sure you can query for phrases
without crossing page boundaries, it's a lot simpler just to use a
really big positionIncrementGap with your analyzer (and add each page as a
separate Field instance). Boundary tokens like these are really only
necessary if you want more complex queries (like "find X and Y on
the same page but not in the same sentence")
-Hoss
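The boundary-token loop in point 1 can be simulated without Lucene; the sketch below (helper names invented) appends a "$$" token after each page and shows why an exact phrase can no longer match across a page boundary:

```java
import java.util.ArrayList;
import java.util.List;

public class BoundaryTokenSketch {
    // Builds one token stream for all pages, appending a "$$" boundary
    // token after each page, mirroring the indexing loop above.
    static List<String> buildStream(String[] pages) {
        List<String> stream = new ArrayList<>();
        for (String page : pages) {
            for (String t : page.split("\\s+")) stream.add(t);
            stream.add("$$");
        }
        return stream;
    }

    // True only if the two words sit next to each other in the stream,
    // i.e. an exact two-word phrase match would succeed.
    static boolean adjacent(List<String> stream, String a, String b) {
        for (int i = 0; i + 1 < stream.size(); i++) {
            if (stream.get(i).equals(a) && stream.get(i + 1).equals(b)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> s = buildStream(new String[]{ "alpha beta", "gamma delta" });
        System.out.println(s);                            // [alpha, beta, $$, gamma, delta, $$]
        System.out.println(adjacent(s, "alpha", "beta")); // true: same page
        System.out.println(adjacent(s, "beta", "gamma")); // false: "$$" sits between pages
    }
}
```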
Re: Design questions
Posted by Erick Erickson <er...@gmail.com>.
Why not just use $$$$$$$$? Check to ensure that it makes
it through whatever analyzer you choose though. For instance,
LetterTokenizer will remove it...
Erick
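Erick's caveat is easy to reproduce without Lucene: the sketch below (class and method names invented) mimics LetterTokenizer by keeping only maximal runs of letters, and both an all-$ marker and $0123456789$ disappear:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LetterTokenizerSketch {
    // Rough stand-in for Lucene's LetterTokenizer: emit maximal runs of
    // letters, dropping everything else ($, digits, punctuation).
    static List<String> tokenize(String text) {
        return Arrays.stream(text.split("[^\\p{L}]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("page one text")); // [page, one, text]
        System.out.println(tokenize("$$$$$$$$"));      // [] -> the marker vanishes
        System.out.println(tokenize("$0123456789$"));  // [] -> digits are not letters either
    }
}
```

This is exactly why the UN_TOKENIZED-field trick discussed later in the thread is attractive: the boundary token bypasses the analyzer entirely.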
On Thu, Feb 14, 2008 at 4:41 PM, <sp...@gmx.eu> wrote:
> > Rather than index one doc per page, you could index a special
> > token between pages. Say you index $$$$$$$$$ as the special
> > token.
>
> I have decided to use this version, but...
>
> What token can I use? It must be a token that is never removed by an
> analyzer, or altered in a way that makes it no longer unique in the resulting
> token stream.
>
> Is something like $0123456789$ the way to go?
>
> Thank you.