
Posted to java-user@lucene.apache.org by oh...@cox.net on 2009/08/20 23:35:47 UTC

Possible to invoke same Lucene query on a String?

Hi,

This question is going to be a little complicated to explain, but let me try.

I have implemented an indexer app based on the demo IndexFiles app, and a web app based on the luceneweb web app for the searching.

In my case, the "Documents" that I'm indexing are a proprietary file type, and each document has kind of "sub-documents".  So, in my indexer, I parse each of the sub-documents, and, for a given "Document", I build a long string containing terms that I extracted from each of the sub-documents, then I do:

doc.add(new Field("contents", longstring, Field.Store.YES, Field.Index.ANALYZED));

I also add the longstring to another non-indexed field, summary:

doc.add(new Field("summary", longstring, Field.Store.YES, Field.Index.NO));

The modified luceneweb web app that I use is pretty vanilla, and originally, what I was asked to do was to be able to search just for a Document, i.e., given a query like "X and Y" (document containing both term=X and term=Y), return the file path+name for the document.  I also was displaying the terms associated with each sub-document by parsing the 'summary' string.

So, for example, if "Document1" contained 3 sub-documents (which contained (term1, term2), (term1a, term2a), and (term1b, term2b), respectively), and if I queried for "term1a AND term2a", the web app would display something like:

Document1                 subdoc1 term1 term2
                          subdoc2 term1a term2a
                          subdoc3 term1b term2b

However, I've now been asked to implement the ability to query the sub-documents. 

In other words, rather than the web app displaying what I showed above, they want it to return something like just:

Document1                 subdoc2 term1a term2a

Right now, the web app gets the 'summary' (again, in a long string), then just breaks it into subdoc1, subdoc2, and subdoc3 lines, just for display purposes, so to do what I've been asked, I need to query the 3 sub-strings from the 'summary', i.e., run the "term1a AND term2a" query against the following strings:

subdoc1 term1 term2
subdoc2 term1a term2a
subdoc3 term1b term2b

I guess that I can write a method to do that, but I want to make sure that the sub-document/string query "duplicates" the behavior of the Lucene query.

It seems like there should be a way to duplicate the Lucene query logic by using something (methods) in Lucene itself??

I've been reviewing the Javadocs, but I'm still fairly new to Lucene, so I was hoping that someone could point me in the right direction?
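
For reference, the hand-written method mentioned above could look roughly like the plain-Java sketch below. It is not Lucene code and does not reuse Lucene's query logic: it assumes whitespace tokenization, lower-casing, and a pure AND of terms, so an analyzer that tokenizes differently would give different answers. The names SubDocAndMatcher and matchAll are made up for illustration. (Lucene's contrib area also has a MemoryIndex class aimed at matching a query against a single in-memory text, which may be closer to what is being asked for.)

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: a hand-rolled AND match
// over the sub-document lines parsed from the 'summary' string.
final class SubDocAndMatcher {

    // Lower-case and split on whitespace (assumption: a real analyzer may differ).
    static Set<String> tokens(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return new HashSet<String>();
        }
        return new HashSet<String>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Return the sub-document lines that contain every required term.
    static List<String> matchAll(List<String> subDocLines, String... requiredTerms) {
        List<String> hits = new ArrayList<String>();
        for (String line : subDocLines) {
            Set<String> present = tokens(line);
            boolean all = true;
            for (String term : requiredTerms) {
                if (!present.contains(term.toLowerCase(Locale.ROOT))) {
                    all = false;
                    break;
                }
            }
            if (all) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> summary = Arrays.asList(
            "subdoc1 term1 term2",
            "subdoc2 term1a term2a",
            "subdoc3 term1b term2b");
        System.out.println(matchAll(summary, "term1a", "term2a"));
    }
}
```

Run against the three example lines with the terms term1a and term2a, this prints [subdoc2 term1a term2a].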

My apologies for the longish post, but I hope that I've been able to explain clearly :)!!

Thanks,
Jim

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Possible to invoke same Lucene query on a String?

Posted by oh...@cox.net.

---- Paul Cowan <co...@aconex.com> wrote: 
> ohaya@cox.net wrote:
> > - I'd have to create a (very small) index, for each sub-document, where I do the Document.add() with just the (for example) two terms, then
> > - Run a query against the 1-entry index, which
> > - Would either give me a "yes" or "no" (for that sub-document)
> > 
> > As I said, I'm concerned about overhead.  Some of the documents are quite large, containing >20K sub-documents.  That means that, for such a document, I'd have to create >20K indexes.
> 
> No, I'm talking about a separate document in the same index.
> 
> There are a few approaches here:
> 
> 1) Index each sub-document separately. So if you have fields 'doc#', 
> 'docname', 'subdoc#', and 'subdocterms', you might do:
> 
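>     // Doc, SubDoc, docs and writer (an open IndexWriter) are placeholders;
>     // the accessors are assumed to return Strings.
>     for (Doc parent : docs) {
>       for (SubDoc child : parent.subDocs()) {
>         // One Lucene document per sub-document; parent fields are repeated.
>         Document luceneDoc = new Document();
>         luceneDoc.add(new Field("doc#", parent.number(), Field.Store.YES, Field.Index.NOT_ANALYZED));
>         luceneDoc.add(new Field("docname", parent.name(), Field.Store.YES, Field.Index.ANALYZED));
>         luceneDoc.add(new Field("subdoc#", child.number(), Field.Store.YES, Field.Index.NOT_ANALYZED));
>         luceneDoc.add(new Field("subdocterms", child.data(), Field.Store.YES, Field.Index.ANALYZED));
>         writer.addDocument(luceneDoc);
>       }
>     }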
> 
> This means that in your index after indexing 2 docs with 2 subdocs each, 
> you'll have
>     (Lucene #)   doc#   docname   subdoc#   subdocterms
>     ----------------------------------------------------
>     0            100    Foo       101       subdoc1 terms here
>     1            100    Foo       102       subdoc2 terms
>     2            200    Bar       201       subdoc1 terms from doc2
>     3            200    Bar       202       some more subdoc text
> 
> So the search you're doing is actually on the subdoc level. This can get 
> complicated, especially as subdocs from the same parent doc may come 
> back out of order, etc, depending on scoring/sorting.
> 
> Also, if there is a lot of data at the parent level, you're obviously 
> duplicating it. This can get nasty.
> 
> 2) Maintain a (logically) separate subdoc index. You could have 
> something like:
>     doc#   docname  bigblobofdocdata
>     ---------------------------------
>     100    Foo      lots of data here...
>     200    Bar      and lots more here..
> in one index, and
>     doc#   subdoc#  subdocterms
>     ---------------------------------
>     100    101       subdoc1 terms here
>     100    102       subdoc2 terms
>     200    201       subdoc1 terms from doc2
>     200    202       some more subdoc text
> 
> Then you can FIRST search on the doc index to do any matches on 
> 'docname' etc, then use the IDs you find to filter the subdoc index -- 
> so if the user searches for 'docname=foo' and 'subdocterms=text', you 
> first do the docname search to get the docname-matching doc (100), then 
> do a search on the second index for 'subdocterms', but also filter where 
> doc#=100.
> 
> Note they don't HAVE to be separate indexes -- you can actually keep 
> these in the same physical index, with some sort of discriminator (all 
> docs in an index don't have to have the same fields).
> 
> 3) Do some really hardcore tricks with spanqueries. This is what I'm 
> working on at the moment, so it's near and dear to my heart. It's not 
> for the faint-hearted, though, and if you're new to Lucene may scare you 
> off, sorry! Basically Lucene has the concept of 'positions' for terms -- 
> metadata about where in the document the term can be found. This lets 
> you do 'near' queries, etc.
> 
> We're taking advantage of that to do some many-to-one stuff like your 
> problem. Using the first example, with term positions indicated in [], 
> we position terms from different subdocs with a large gap between them, 
> like so:
> 
>     (Lucene #)   doc#   docname   subdoc#   subdocterms
>     ----------------------------------------------------
>     0            100    Foo       101[0]    subdoc1[0] terms[1] here[2]
>                                   102[100]  subdoc2[100] terms[101]
> 
>     1            200    Bar       201[0]    subdoc1[0] terms[1] from[2]
>                                             doc2[3]
>                                   202[100]  some[100] more[101]
>                                             subdoc[102] text[103]
> 
> So in each doc, subdoc #1's terms start at 0, #2's at 100, #3's at 200, 
> etc. Then when we search we can say 'the terms you're looking for must 
> be in the same 100-position block' to find only subdocs that match all 
> subdoc-related subqueries. This is pretty hairy but is working well for 
> us -- massively reduces our indexing and search times compared to the 
> duplicated document way I mentioned above.
> 
> Cheers,
> 
> Paul


Paul,

Oh boy, you've given me a LOT to chew on :)!!

At first read, I like your #1 approach, maybe because it's easiest for me to understand.  I have to think about it, but my first thought is that we might not need/want the sub-doc index to persist after they're used (maybe!), so create the sub-doc index "on-the-fly" for each Document, maybe using that example I linked as the template, do the query, then move on to the next Document...

I'll have to think about it.  Like I said, lots of ideas in your message :)...

Having said that, I keep thinking wouldn't it be much easier if, as I originally posted, there was a way to invoke a "Lucene query" on just a String object :(??

Of course, if, after some more thought, it makes more sense to persist the sub-doc index(es), then I guess not...

Again, thanks.  Now, I'll have to re-read what you wrote, a couple of times.  

Jim


Re: Possible to invoke same Lucene query on a String?

Posted by Paul Cowan <co...@aconex.com>.

ohaya@cox.net wrote:
> - I'd have to create a (very small) index, for each sub-document, where I do the Document.add() with just the (for example) two terms, then
> - Run a query against the 1-entry index, which
> - Would either give me a "yes" or "no" (for that sub-document)
> 
> As I said, I'm concerned about overhead.  Some of the documents are quite large, containing >20K sub-documents.  That means that, for such a document, I'd have to create >20K indexes.

No, I'm talking about a separate document in the same index.

There are a few approaches here:

1) Index each sub-document separately. So if you have fields 'doc#', 
'docname', 'subdoc#', and 'subdocterms', you might do:

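    // Doc, SubDoc, docs and writer (an open IndexWriter) are placeholders;
    // the accessors are assumed to return Strings.
    for (Doc parent : docs) {
      for (SubDoc child : parent.subDocs()) {
        // One Lucene document per sub-document; parent fields are repeated.
        Document luceneDoc = new Document();
        luceneDoc.add(new Field("doc#", parent.number(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        luceneDoc.add(new Field("docname", parent.name(), Field.Store.YES, Field.Index.ANALYZED));
        luceneDoc.add(new Field("subdoc#", child.number(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        luceneDoc.add(new Field("subdocterms", child.data(), Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(luceneDoc);
      }
    }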

This means that in your index after indexing 2 docs with 2 subdocs each, 
you'll have
    (Lucene #)   doc#   docname   subdoc#   subdocterms
    ----------------------------------------------------
    0            100    Foo       101       subdoc1 terms here
    1            100    Foo       102       subdoc2 terms
    2            200    Bar       201       subdoc1 terms from doc2
    3            200    Bar       202       some more subdoc text

So the search you're doing is actually on the subdoc level. This can get 
complicated, especially as subdocs from the same parent doc may come 
back out of order, etc, depending on scoring/sorting.

Also, if there is a lot of data at the parent level, you're obviously 
duplicating it. This can get nasty.
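
Since sub-documents from the same parent may come back interleaved, the results usually need regrouping before display. A minimal plain-Java sketch of that step, assuming each hit has already been reduced to its stored doc# and subdoc# values (HitGrouper and SubDocHit are hypothetical names, not Lucene classes):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class HitGrouper {

    // Stand-in for the stored "doc#" and "subdoc#" fields of one hit.
    static final class SubDocHit {
        final String docNumber;
        final String subDocNumber;

        SubDocHit(String docNumber, String subDocNumber) {
            this.docNumber = docNumber;
            this.subDocNumber = subDocNumber;
        }
    }

    // Group hits by parent doc#. Parents keep the order of their first
    // (best-ranked) hit; sub-documents are sorted by subdoc# as strings.
    static Map<String, List<String>> groupByParent(List<SubDocHit> rankedHits) {
        Map<String, List<String>> grouped = new LinkedHashMap<String, List<String>>();
        for (SubDocHit hit : rankedHits) {
            List<String> subDocs = grouped.get(hit.docNumber);
            if (subDocs == null) {
                subDocs = new ArrayList<String>();
                grouped.put(hit.docNumber, subDocs);
            }
            subDocs.add(hit.subDocNumber);
        }
        for (List<String> subDocs : grouped.values()) {
            Collections.sort(subDocs);
        }
        return grouped;
    }
}
```

With hits arriving in the order 202, 102, 201, 101, the grouped result lists parent 200 with 201 and 202 first, then parent 100 with 101 and 102.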

2) Maintain a (logically) separate subdoc index. You could have 
something like:
    doc#   docname  bigblobofdocdata
    ---------------------------------
    100    Foo      lots of data here...
    200    Bar      and lots more here..
in one index, and
    doc#   subdoc#  subdocterms
    ---------------------------------
    100    101       subdoc1 terms here
    100    102       subdoc2 terms
    200    201       subdoc1 terms from doc2
    200    202       some more subdoc text

Then you can FIRST search on the doc index to do any matches on 
'docname' etc, then use the IDs you find to filter the subdoc index -- 
so if the user searches for 'docname=foo' and 'subdocterms=text', you 
first do the docname search to get the docname-matching doc (100), then 
do a search on the second index for 'subdocterms', but also filter where 
doc#=100.

Note they don't HAVE to be separate indexes -- you can actually keep 
these in the same physical index, with some sort of discriminator (all 
docs in an index don't have to have the same fields).
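
The two-step flow can be sketched without Lucene, using in-memory rows in place of the two indexes. This is only an illustration of the control flow under simple assumptions (exact, case-insensitive docname match; whitespace-split terms); TwoStepSearch, ParentRow and SubDocRow are hypothetical names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the "doc index" plus "subdoc index" layout.
final class TwoStepSearch {

    static final class ParentRow {
        final String docNumber;
        final String docName;

        ParentRow(String docNumber, String docName) {
            this.docNumber = docNumber;
            this.docName = docName;
        }
    }

    static final class SubDocRow {
        final String docNumber;
        final String subDocNumber;
        final String terms;

        SubDocRow(String docNumber, String subDocNumber, String terms) {
            this.docNumber = docNumber;
            this.subDocNumber = subDocNumber;
            this.terms = terms;
        }
    }

    // Step 1: collect the doc# of every parent whose docname matches.
    static Set<String> findParents(List<ParentRow> parents, String docName) {
        Set<String> ids = new LinkedHashSet<String>();
        for (ParentRow parent : parents) {
            if (parent.docName.equalsIgnoreCase(docName)) {
                ids.add(parent.docNumber);
            }
        }
        return ids;
    }

    // Step 2: search sub-documents for a term, filtered to the allowed doc# values.
    static List<String> findSubDocs(List<SubDocRow> subDocs, Set<String> allowedDocs, String term) {
        List<String> hits = new ArrayList<String>();
        for (SubDocRow row : subDocs) {
            if (!allowedDocs.contains(row.docNumber)) {
                continue;
            }
            if (Arrays.asList(row.terms.trim().split("\\s+")).contains(term)) {
                hits.add(row.subDocNumber);
            }
        }
        return hits;
    }
}
```

With the example rows above, docname=foo resolves to doc# 100, and a search for 'terms' within it returns 101 and 102. A search for 'text' within Foo returns nothing, because 'text' only occurs in subdoc 202 under Bar.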

3) Do some really hardcore tricks with spanqueries. This is what I'm 
working on at the moment, so it's near and dear to my heart. It's not 
for the faint-hearted, though, and if you're new to Lucene may scare you 
off, sorry! Basically Lucene has the concept of 'positions' for terms -- 
metadata about where in the document the term can be found. This lets 
you do 'near' queries, etc.

We're taking advantage of that to do some many-to-one stuff like your 
problem. Using the first example, with term positions indicated in [], 
we position terms from different subdocs with a large gap between them, 
like so:

    (Lucene #)   doc#   docname   subdoc#   subdocterms
    ----------------------------------------------------
    0            100    Foo       101[0]    subdoc1[0] terms[1] here[2]
                                  102[100]  subdoc2[100] terms[101]

    1            200    Bar       201[0]    subdoc1[0] terms[1] from[2]
                                            doc2[3]
                                  202[100]  some[100] more[101]
                                            subdoc[102] text[103]

So in each doc, subdoc #1's terms start at 0, #2's at 100, #3's at 200, 
etc. Then when we search we can say 'the terms you're looking for must 
be in the same 100-position block' to find only subdocs that match all 
subdoc-related subqueries. This is pretty hairy but is working well for 
us -- massively reduces our indexing and search times compared to the 
duplicated document way I mentioned above.
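
To make the position arithmetic concrete, here is a plain-Java sketch of the same idea. It does not use Lucene's span queries or analyzers; it only mimics the bookkeeping: each sub-document's terms are numbered from a multiple of 100, and a match requires all query terms to fall in the same block. PositionBlocks and its methods are hypothetical names, and the sketch assumes no sub-document has more than 100 terms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of the position-gap idea; not Lucene code.
final class PositionBlocks {

    static final int GAP = 100; // block size per sub-document, as in the example above

    // Record term positions: sub-document i's terms start at i * GAP.
    static Map<String, List<Integer>> index(List<String> subDocs) {
        Map<String, List<Integer>> positions = new HashMap<String, List<Integer>>();
        for (int i = 0; i < subDocs.size(); i++) {
            String[] terms = subDocs.get(i).trim().split("\\s+");
            if (terms.length > GAP) {
                throw new IllegalArgumentException("sub-document " + i + " overflows its block");
            }
            for (int j = 0; j < terms.length; j++) {
                if (terms[j].isEmpty()) {
                    continue; // blank sub-document
                }
                List<Integer> list = positions.get(terms[j]);
                if (list == null) {
                    list = new ArrayList<Integer>();
                    positions.put(terms[j], list);
                }
                list.add(i * GAP + j);
            }
        }
        return positions;
    }

    // Return the block numbers (sub-document indexes) that contain every query term.
    static List<Integer> blocksContainingAll(Map<String, List<Integer>> positions, String... terms) {
        Set<Integer> candidates = null;
        for (String term : terms) {
            Set<Integer> blocks = new TreeSet<Integer>();
            List<Integer> termPositions = positions.get(term);
            if (termPositions != null) {
                for (int position : termPositions) {
                    blocks.add(position / GAP);
                }
            }
            if (candidates == null) {
                candidates = blocks;
            } else {
                candidates.retainAll(blocks);
            }
        }
        return candidates == null ? new ArrayList<Integer>() : new ArrayList<Integer>(candidates);
    }
}
```

Indexing the two sub-documents of doc 200 puts 'text' at position 103. A query for 'some' and 'text' matches block 1 (the second sub-document), while 'terms' and 'text' match no block because they sit in different sub-documents.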

Cheers,

Paul


Re: Possible to invoke same Lucene query on a String?

Posted by oh...@cox.net.

---- Paul Cowan <co...@aconex.com> wrote: 
> ohaya@cox.net wrote:
> > Document1                 subdoc1 term1 term2
> >                           subdoc2 term1a term2a
> >                           subdoc3 term1b term2b
> >
> > However, I've now been asked to implement the ability to query the sub-documents. 
> >
> > In other words, rather than the web app displaying what I showed above, they want it to return something like just:
> >
> > Document1                 subdoc2 term1a term2a
> 
> Just checking here... you only want to match where the terms are in 
> specific sub-documents? That is, if someone searches for 'term1a AND 
> term2b', what do you want to see? Nothing (because no sub-document 
> matches both terms)? Or subdoc2 and subdoc3, because they're both part 
> of the reason that Document1 matched?
> 
> If the former, then just indexing each sub-doc as a separate document 
> (duplicating the document-level information) may be the simplest option.
> 
> Cheers,
> 
> Paul
>

Hi Paul,

Hah!

Yes, it's the former I think...

The "Hah!" was because I was googling, and just ran across this:

http://javatechniques.com/blog/lucene-in-memory-text-search-example/

which, I think, creates an in-memory index, then searches it.

I was reading through that, as I saw your message.

As I was reading though, I am wondering:  This seems like it would create an awful lot of overhead?

In other words:

- I'd have to create a (very small) index, for each sub-document, where I do the Document.add() with just the (for example) two terms, then
- Run a query against the 1-entry index, which
- Would either give me a "yes" or "no" (for that sub-document)

As I said, I'm concerned about overhead.  Some of the documents are quite large, containing >20K sub-documents.  That means that, for such a document, I'd have to create >20K indexes.

Is there really no other way to do this?  I guess that, in my mind, I keep thinking about somehow "redirecting" Lucene to do a search on a single String object (that was just a kind of metaphor)?

Comments?

Thanks for your response!

Jim


Re: Possible to invoke same Lucene query on a String?

Posted by Paul Cowan <co...@aconex.com>.

ohaya@cox.net wrote:
> Document1                 subdoc1 term1 term2
>                           subdoc2 term1a term2a
>                           subdoc3 term1b term2b
>
> However, I've now been asked to implement the ability to query the sub-documents. 
>
> In other words, rather than the web app displaying what I showed above, they want it to return something like just:
>
> Document1                 subdoc2 term1a term2a

Just checking here... you only want to match where the terms are in 
specific sub-documents? That is, if someone searches for 'term1a AND 
term2b', what do you want to see? Nothing (because no sub-document 
matches both terms)? Or subdoc2 and subdoc3, because they're both part 
of the reason that Document1 matched?

If the former, then just indexing each sub-doc as a separate document 
(duplicating the document-level information) may be the simplest option.
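
The difference between the two readings can be shown with a small plain-Java sketch. It is not Lucene code; it assumes case-sensitive, whitespace-split terms, and the names MatchSemantics, strictPerSubDoc and contributing are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the two possible meanings of "X AND Y".
final class MatchSemantics {

    static List<String> words(String line) {
        return Arrays.asList(line.trim().split("\\s+"));
    }

    // Reading 1: every term must occur inside the same sub-document.
    static List<String> strictPerSubDoc(List<String> subDocs, List<String> terms) {
        List<String> hits = new ArrayList<String>();
        for (String subDoc : subDocs) {
            if (words(subDoc).containsAll(terms)) {
                hits.add(subDoc);
            }
        }
        return hits;
    }

    // Reading 2: the document as a whole must contain every term; report
    // each sub-document that contributes at least one of them.
    static List<String> contributing(List<String> subDocs, List<String> terms) {
        Set<String> allWords = new HashSet<String>();
        for (String subDoc : subDocs) {
            allWords.addAll(words(subDoc));
        }
        List<String> hits = new ArrayList<String>();
        if (!allWords.containsAll(terms)) {
            return hits; // the parent document itself does not match
        }
        for (String subDoc : subDocs) {
            if (!Collections.disjoint(words(subDoc), terms)) {
                hits.add(subDoc);
            }
        }
        return hits;
    }
}
```

For 'term1a AND term2b' on the Document1 example, the first reading returns nothing and the second returns the subdoc2 and subdoc3 lines.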

Cheers,

Paul


Re: Possible to invoke same Lucene query on a String?

Posted by oh...@cox.net.

Hi,

I guess that, in short, what I'm really trying to find out is:

If I construct a Lucene query, can I (somehow) use that to query a String object that I have, rather than querying against a Lucene index?

Thanks,
Jim



