Posted to java-user@lucene.apache.org by Anna Hunecke <an...@yahoo.de> on 2010/01/19 13:57:23 UTC

Indexing and Searching linked files

Hi!
I have been working with Lucene for a while now. So far, I found helpful tips on this list, so I hope somebody can help me with my problem:

In our app, information is grouped in so-called cards. Now it should also be possible to search the files linked to the cards. You can link arbitrarily many files to a card, and the size of the files is not restricted either.
So, as far as I can see, there are two ways to do this:

1. Add the content of the files to the search index of the card. First, I thought that I could just have an additional field in the index which contains the content of all the files. But then, if the files are very big, I could hit the field size limit, and would possibly not get the content of all files indexed. So, I would need one field per file. The problem I have then is that I don't know how many files I have and how large the index would get. This is risky, because some customers have a lot of data.

2. Create a separate index for files. The documents in this index would contain one file each, so I would not have the problem that I don't know how many fields I have. But then, the searching is a problem:
I would need to search on both the card and the document index, and somehow merge the results together. I sort by score always, but, as I understand it, the scores of the results of two different indexes are not comparable.

So, which way do you think is better?

Best,
Anna


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Unary Operators and Operator Precedence

Posted by Marvin Humphrey <ma...@rectangular.com>.
> 3.) Does grouping or nesting affect results with unary operators? Does 
> using unary operators with binary operators affect results? For example, 
> in the query:
> 
>     (+a +b) OR c
> 
> has the "required" effect of the + (plus) operator been eliminated by 
> the OR operator, so that a record is a hit as long as it contains c, 
> never mind whether it contains a or b, both of which are supposedly 
> required?

IMO, that's the only sensible way to handle unary operators.  If they were
global rather than nested, what would this query produce?

   (a AND -b) OR (b AND c)

Marvin Humphrey
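Marvin's reading, where + is required only within its own group, can be spelled out directly. Below is a tiny sketch hard-coded for this one query (the class name is made up, and this is not a general query parser):

```java
import java.util.Set;

// Nested-scope semantics for the query (+a +b) OR c:
// + makes a term required only within its own group, so the query
// matches any record containing both a and b, or any record containing c.
class NestedClauseDemo {
    static boolean matches(Set<String> docTerms) {
        boolean leftGroup = docTerms.contains("a") && docTerms.contains("b"); // (+a +b)
        boolean rightClause = docTerms.contains("c");                         // c
        return leftGroup || rightClause;                                      // OR
    }
}
```

So a record containing only c is a hit, and a record containing only a is not.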




Re: Unary Operators and Operator Precedence

Posted by Ahmet Arslan <io...@yahoo.com>.
> Here are some questions about unary
> operators and operator precedence or default order of
> operation.
> 
> We all know the importance of order of operation of binary
> operators (ones that operate on two operands) such as AND
> and OR. We know how to impose express order of operation by
> grouping and nesting.
> 
> But what about unary operators, like + (plus), the
> "required operator", and - (minus), the "prohibited
> operator"? Unary operators operate on only one operand.

You might find this link interesting.

http://wiki.apache.org/lucene-java/BooleanQuerySyntax





Unary Operators and Operator Precedence

Posted by "T. R. Halvorson" <tr...@midrivers.com>.
Here are some questions about unary operators and operator precedence or 
default order of operation.

We all know the importance of order of operation of binary operators 
(ones that operate on two operands) such as AND and OR. We know how to 
impose express order of operation by grouping and nesting.

But what about unary operators, like + (plus), the "required operator", 
and - (minus), the "prohibited operator"? Unary operators operate on 
only one operand.

Here are the questions (leaving the NOT operator, which is also unary, out 
of consideration for the time being, because I think we know what is going 
on with that):

1.) Does order of operation matter with unary operators?

2.) If there is operator precedence or default order of operation for 
unary operators in Lucene, is it documented or published?

3.) Does grouping or nesting affect results with unary operators? Does 
using unary operators with binary operators affect results? For example, 
in the query:

     (+a +b) OR c

has the "required" effect of the + (plus) operator been eliminated by 
the OR operator, so that a record is a hit as long as it contains c, 
never mind whether it contains a or b, both of which are supposedly 
required?

T. R.

trh@midrivers.com
www.linkedin.com/in/trhalvorson
www.ncodian.com
http://twitter.com/trhalvorson 




Re: Indexing and Searching linked files

Posted by Erick Erickson <er...@gmail.com>.
What's a reasonable upper limit on the number of files? Because I think it
would be simpler, at least to start, to allow your field to be larger
(say, 1B tokens: 1,000 files of 1M tokens each), but restrict the input of
each file to 1M tokens per file. The most elegant way would probably be to
subclass the TokenStream of your choice and cause it to return end-of-input
(e.g. next() returns null) after every 1M tokens, regardless of whether
there were more. You'd have to add some methods to allow you to signal
starting a new file, but that shouldn't be difficult....

The tricky part here would be if you had to answer the question "which
file associated with this card matched for this search?". It's doable;
I'd think about storing the begin/end offsets of each file in a
metadata field for the card (probably stored but not indexed)......

A cruder approach would be to read the input files yourself, approximate
1M tokens, and feed *that* to the analyzer, but that's perilously close to
re-writing a tokenizer.....

HTH
Erick
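Erick's subclassing idea can be sketched without the Lucene classes. The wrapper below stands in for the capped TokenStream he describes: next() reports end-of-input once the per-file budget is spent, and startNewFile() is the extra signalling method he mentions. The class and method names are made up, and a real implementation would subclass TokenStream rather than wrap an Iterator:

```java
import java.util.Iterator;

// Stand-in for a capped TokenStream: emit at most maxTokensPerFile
// tokens, then report end-of-input (null) even if more tokens remain.
// startNewFile() resets the budget when the next linked file begins.
class CappedTokenSource {
    private final Iterator<String> tokens;
    private final int maxTokensPerFile;
    private int emitted = 0;

    CappedTokenSource(Iterator<String> tokens, int maxTokensPerFile) {
        this.tokens = tokens;
        this.maxTokensPerFile = maxTokensPerFile;
    }

    // Mirrors the old TokenStream.next() contract: null means end-of-input.
    String next() {
        if (emitted >= maxTokensPerFile || !tokens.hasNext()) {
            return null;
        }
        emitted++;
        return tokens.next();
    }

    // Signal that a new file starts, so its tokens get their own budget.
    void startNewFile() {
        emitted = 0;
    }
}
```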




Re: Indexing and Searching linked files

Posted by Danil ŢORIN <to...@gmail.com>.
You can simply index both "files" and "cards" into the same index (no need
for two indexes).

Lucene easily supports documents of different structure.

You can add some boosting per field or document, and tune the similarity
to get the most important stuff to the top.
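With cards and files in one index, every hit shares one score space, so a per-type boost can prefer one kind of document. The toy sketch below shows only that re-ranking idea; the type names, boost values, and scores are invented, and a real implementation would apply boosts at index or query time through Lucene rather than multiplying afterwards:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Toy re-ranking of hits from a single mixed index: each hit carries a
// "type" discriminator field, and a per-type boost lifts the type you
// care about before sorting everything into one ranked list.
class RankTogether {
    record Hit(String type, String id, double score) {}

    static List<Hit> rank(List<Hit> hits, Map<String, Double> typeBoost) {
        List<Hit> boosted = new ArrayList<>();
        for (Hit h : hits) {
            double boost = typeBoost.getOrDefault(h.type(), 1.0);
            boosted.add(new Hit(h.type(), h.id(), h.score() * boost));
        }
        boosted.sort(Comparator.comparingDouble(Hit::score).reversed());
        return boosted;
    }
}
```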





Re: Proximity of More than Single Words?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Yes, that's just phrase slop, allowing for variable gaps between words.
I *believe* the Surround QP, which works with the Span family of queries, does handle what you are looking for.


Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch







Re: Proximity of More than Single Words?

Posted by Ahmet Arslan <io...@yahoo.com>.
> For proximity expressions, the query
> parser documentation says, "use the tilde, "~", symbol at
> the end of a Phrase." It gives the example "jakarta
> apache"~10
> 
> Does this mean that proximity can only be operated on
> single words enquoted in quotation marks?

Yes, if you are using QueryParser to generate your queries; it does not support nested proximity. But if you are constructing your queries programmatically, it can be done with the SpanQuery family.

> To clarify the
> question by comparision, on some systems, the w/ proximity
> operator lets one search for:
> 
> crude w/4 "west texas"
> 
> or
> 
> "spot prices" w/3 "gulf coast"

There is an org.apache.lucene.queryParser.surround.parser.QueryParser that supports nested proximity search. However, its syntax does not use quotes. There are two operators: ordered (w) and unordered (n). Your examples can be translated as follows:

crude 4w (west w texas)

(spot w prices) 3w (gulf w coast)

Also, the surround parser does not analyze query words. Different query parsers are compared here:
http://www.lucidimagination.com/blog/2009/02/22/exploring-query-parsers/
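The ordered (w) operator can be illustrated with plain token positions. The sketch below checks whether a term is followed by a phrase starting within k positions, which is roughly what "crude 4w (west w texas)" asks for. The class and method names are made up, and the real surround parser compiles to SpanQuery objects instead of scanning token lists:

```java
import java.util.List;

// Ordered proximity over a token list: term a must occur, and the phrase
// must start at most k positions after it, in that order.
class OrderedProximity {
    static boolean withinOrdered(List<String> tokens, String a,
                                 List<String> phrase, int k) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(a)) continue;
            // scan the k positions after a for the start of the phrase
            int maxStart = Math.min(i + k, tokens.size() - phrase.size());
            for (int j = i + 1; j <= maxStart; j++) {
                if (tokens.subList(j, j + phrase.size()).equals(phrase)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```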





Proximity of More than Single Words?

Posted by "T. R. Halvorson" <tr...@midrivers.com>.
For proximity expressions, the query parser documentation says, "use the 
tilde, "~", symbol at the end of a Phrase." It gives the example "jakarta 
apache"~10

Does this mean that proximity can only be operated on single words enclosed 
in quotation marks? To clarify the question by comparison, on some systems, 
the w/ proximity operator lets one search for:

crude w/4 "west texas"

or

"spot prices" w/3 "gulf coast"

The Lucene documentation seems to imply that such searches cannot be 
constructed in any straightforward way (although there might be a way to get 
the effect by going around Cobb's Hill). Or does the Lucene syntax allow the 
examples to be cast as:

"crude "west texas""~4

or

""spot prices" "gulf coast""~3

If not, is it a fair assessment to say that in Lucene, proximity is limited 
to being a part of phrase searching, and its function is exhausted by 
allowing a slop factor in matching phrases?

Thanks in advance for any help with this.

T. R.
trh@midrivers.com
http://www.linkedin.com/in/trhalvorson
www.ncodian.com
http://twitter.com/trhalvorson 




Re: Indexing and Searching linked files

Posted by Anna Hunecke <an...@yahoo.de>.
The field size is restricted to 1 million tokens, because of the very reasons you mentioned.
So, even if I have one separate field for the content of each file, I might reach the limit if a file is really big. But I can't help that. What I want to avoid is that the whole content of some files cannot be found because I used one field for the content of all files and they just could not be appended anymore.




Re: Indexing and Searching linked files

Posted by Erick Erickson <er...@gmail.com>.
What field size limit are you talking about here? Because 10,000
tokens is the default, but you can increase it to Integer.MAX_VALUE.

So are you really talking billions of tokens here? Your index
quickly becomes unmanageable if you're allowing it to grow
by such increments.

One can argue, IMO, that the first N (10M, say) tokens/file is
"enough" and there's not much real value in the rest, but that
can be a weak argument depending on the problem space....

But if you're really committed to indexing an unbounded number
of arbitrarily large files...you'll fail. Sometime, somewhere, somebody
will want to index enough to violate whatever limits you have (disk,
memory, time, whatever). So I think you'd be farther ahead to ask your
product manager what limits are reasonable and go from there...

HTH
Erick
