Posted to java-user@lucene.apache.org by 1world1love <jd...@yahoo.com> on 2008/06/11 00:35:33 UTC

retrieve all docs efficiently - just one field

Greetings all. I have read many posts concerning similar use cases, but I am
still a little hazy on the best way to achieve what I need to do. Here is
the background:

2 million documents with multiple sections; some sections contain structured
data, some unstructured.

We parse the docs and place the structured data in Oracle, where each
section is a table, with one master table to relate them all.

We index the unstructured sections with Lucene, where each section is a
document (meaning a total of about 30 million documents) with extra fields,
including one for the primary key of the master table and then some meta
fields to describe the section - type, date, etc.

For a common use case, say we have a table called demographics with a number
field that represents age (overly simplistic but gets the point across).

So say we want all people over the age of 50 who may have visited Panama: 

--
We have our Lucene index and we want to search the section text for the word
"panama"

AND

We want to select from the demographics table where age > 50.
--

Now I need to intersect the master table IDs from my Lucene hits and my
table results.

I have a Java stored procedure that runs the Lucene query and creates a
temporary table with a single column where I insert the master ID from the
hits of my Lucene query. I can then do a join with my structured query
results.
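
For illustration, the intersection step itself can be sketched in plain Java.
This is only a minimal sketch, assuming both ID lists fit in memory; the class
and method names are hypothetical, and it is an in-memory stand-in for the
temporary-table join described above, not the stored procedure itself. Because
several section documents can carry the same master ID, duplicates are dropped
along the way.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of Lucene or Oracle.
class MasterIdIntersection {

    // Returns the master IDs present in both inputs, sorted and without duplicates.
    static Set<Long> intersect(List<Long> luceneHitIds, List<Long> sqlIds) {
        Set<Long> fromSql = new HashSet<>(sqlIds);
        Set<Long> result = new TreeSet<>();
        for (Long id : luceneHitIds) {
            if (fromSql.contains(id)) {
                result.add(id); // the set ignores repeated master IDs
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up IDs: two section documents share master ID 7.
        List<Long> luceneHitIds = Arrays.asList(7L, 7L, 12L, 40L);
        List<Long> sqlIds = Arrays.asList(3L, 7L, 40L, 99L);
        System.out.println(intersect(luceneHitIds, sqlIds)); // prints [7, 40]
    }
}
```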

The problem here is obviously the speed of iterating through the hits to
extract the single field that I need.

Notes: 
- I must be able to get a full set of results, though I only need the one id
field
- We originally went with Oracle Text, which was simple but limited and
quite slow for most queries


I have read a little about the HitCollector class and the FieldSelector API,
but I am still not sure how they may help me or even if they can.

I have also toyed with the idea of using TermDocs, but the queries
may get a little complex with various ORs/ANDs/NOTs, though probably not
spans and so forth.

Any suggestions will be greatly appreciated.

Thanks,

J

-- 
View this message in context: http://www.nabble.com/retrieve-all-docs-efficiently---just-one-field-tp17766268p17766268.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: retrieve all docs efficiently - just one field

Posted by Johannes Christen <j....@insiders.de>.

There might be quite a lot of users in different groups, but more importantly, the user access rights might change, and keeping them up to date in the index would be a real challenge.

But thanks for the LUCENE-879 tip. I will look into that next week.

	Jo


-----Original Message-----
From: Karl Wettin [mailto:karl.wettin@gmail.com]
Sent: Wednesday, 11 June 2008 15:43
To: java-user@lucene.apache.org
Subject: Re: retrieve all docs efficiently - just one field


On 11 Jun 2008, at 09:38, Johannes Christen wrote:
>
> That might be a solution in this case, but I have the same kind of  
> problem in another case.
> We index documents from an NTFS source. One field is the URI of the  
> document.
> After a query has been processed, we perform an access check on the  
> hits to ensure the user has access rights to open the document. If  
> we have a big result set it takes very long to retrieve the URIs  
> from all the hits, which we need to perform the access check against  
> the file system.

How many users usually have access to any given document? Can't you  
just index them with the document?

> Any good solution for this?
> I think a fixed document ID in Lucene would help a lot in these cases.
> The mapping between lucene documents and other systems (e.g. Oracle)  
> would be much faster.

LUCENE-879 is a proof of concept that shows how you can enforce the  
Lucene document numbers if you really want to go that way.


        karl





Re: retrieve all docs efficiently - just one field

Posted by Karl Wettin <ka...@gmail.com>.
On 11 Jun 2008, at 09:38, Johannes Christen wrote:
>
> That might be a solution in this case, but I have the same kind of  
> problem in another case.
> We index documents from an NTFS source. One field is the URI of the  
> document.
> After a query has been processed, we perform an access check on the  
> hits to ensure the user has access rights to open the document. If  
> we have a big result set it takes very long to retrieve the URIs  
> from all the hits, which we need to perform the access check against  
> the file system.

How many users usually have access to any given document? Can't you  
just index them with the document?

> Any good solution for this?
> I think a fixed document ID in Lucene would help a lot in these cases.
> The mapping between lucene documents and other systems (e.g. Oracle)  
> would be much faster.

LUCENE-879 is a proof of concept that shows how you can enforce the  
Lucene document numbers if you really want to go that way.


        karl



Re: retrieve all docs efficiently - just one field

Posted by Johannes Christen <j....@insiders.de>.
That might be a solution in this case, but I have the same kind of problem in another case.
We index documents from an NTFS source. One field is the URI of the document.
After a query has been processed, we perform an access check on the hits to ensure the user has access rights to open the document. With a big result set, it takes a very long time to retrieve the URIs from all the hits, which we need in order to perform the access check against the file system.

Is there a good solution for this?
I think a fixed document ID in Lucene would help a lot in these cases. The mapping between Lucene documents and other systems (e.g. Oracle) would be much faster.

	Jo

-----Original Message-----
From: Karl Wettin [mailto:karl.wettin@gmail.com]
Sent: Wednesday, 11 June 2008 01:55
To: java-user@lucene.apache.org
Subject: Re: retrieve all docs efficiently - just one field


On 11 Jun 2008, at 00:35, 1world1love wrote:
>
> We have our lucene index and we want to search the section text for  
> the word
> "panama"
>
> AND
>
> We want to select from the demographics table where age > 50.
> --
>
> Now I need to intersect the master table IDs from my lucene hits and  
> my
> table results.

I might be missing something here -- can't you just add the age field  
to the index and include that in your query?


             karl



Re: retrieve all docs efficiently - just one field

Posted by 1world1love <jd...@yahoo.com>.


karl wettin-3 wrote:
> 
> 
> I might be missing something here -- can't you just add the age field  
> to the index and include that in your query?
> 
> 

Thanks for the response Karl:

I just used the age field as an example, but in reality the structured data
is copious and has complex relationships, so there are dozens of such
tables to manage it. The unstructured data is actually the simpler
element of the data model.

Also, in presenting the data, we must perform a number of aggregations and
summaries that are fairly straightforward in SQL, but would be quite tedious
and time-consuming to do programmatically with Lucene.
-- 
View this message in context: http://www.nabble.com/retrieve-all-docs-efficiently---just-one-field-tp17766268p17777993.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: retrieve all docs efficiently - just one field

Posted by Karl Wettin <ka...@gmail.com>.
On 11 Jun 2008, at 00:35, 1world1love wrote:
>
> We have our lucene index and we want to search the section text for  
> the word
> "panama"
>
> AND
>
> We want to select from the demographics table where age > 50.
> --
>
> Now I need to intersect the master table IDs from my lucene hits and  
> my
> table results.

I might be missing something here -- can't you just add the age field  
to the index and include that in your query?


             karl



Re: retrieve all docs efficiently - just one field

Posted by 1world1love <jd...@yahoo.com>.
Thanks Erick. That is what I was assuming, but I couldn't confirm whether it
was worth going down those paths to achieve what I was hoping for. Your essay
was very informative about realistic expectations with the FieldSelector.

I actually just got through reading the discussion on deprecating Hits, which
essentially provides great detail about the summary you provided (link for
anyone else who comes upon this thread and is curious -
https://issues.apache.org/jira/browse/LUCENE-1290).

I am still not quite sure exactly how to use the HitCollector API, but
I will make a first pass at refactoring my code to use both.


Erick Erickson wrote:
> 
> It can be a major bottleneck ....
> 

-- 
View this message in context: http://www.nabble.com/retrieve-all-docs-efficiently---just-one-field-tp17766268p17779004.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: retrieve all docs efficiently - just one field

Posted by Erick Erickson <er...@gmail.com>.
<<<I have read a little about the hitcollector class and the fieldselector
api, but I am still not sure how they may help me or even if they can.>>>

I infer from this that you're using a Hits object to get your IDs to insert
in your temporary table. Here's the problem with Hits: it re-executes
the query every 100 (200?) hits. So you can think of it as

while (more hits) {
   if ((count % 100) == 0)
      execute the search and throw away the first <count> items
   work with the document
}

It can be a major bottleneck to re-execute the query every 100 hits you look
at. HitCollector avoids this re-execution, and can result in very significant
speedups when iterating through many documents.
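
To put rough numbers on that, here is a back-of-the-envelope cost model in
plain Java. It does not use Lucene; the class and method names are
hypothetical, the batch size of 100 is taken from the loop above, and it
assumes that every re-execution visits every matching document. Treat it as a
sketch of the argument rather than a measurement of the Hits class.

```java
// Cost model only: hypothetical names, no Lucene classes involved.
class HitsCostModel {

    // Documents scored when the query is re-run before every batch of
    // batchSize results, as in the loop described above.
    static long scoredWithReExecution(int totalHits, int batchSize) {
        long scored = 0;
        for (int count = 0; count < totalHits; count++) {
            if (count % batchSize == 0) {
                scored += totalHits; // assumption: each re-run visits every match
            }
        }
        return scored;
    }

    // Documents scored when a collector callback sees each hit exactly once.
    static long scoredWithSinglePass(int totalHits) {
        return totalHits;
    }

    public static void main(String[] args) {
        int totalHits = 1_000_000;
        // 10,000 re-executions times 1,000,000 matches each
        System.out.println(scoredWithReExecution(totalHits, 100)); // prints 10000000000
        System.out.println(scoredWithSinglePass(totalHits));       // prints 1000000
    }
}
```

Under these assumptions the work grows quadratically with the number of hits
when the query is re-run per batch, and linearly with a single pass.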

FieldSelector will allow lazy fetching. That is, when you call something
like IndexReader.document(idx, selector), only those fields that you specify
with the selector are loaded from the document. In your case, you would load
only the ID you care about and insert that in your temporary table. This can
also result in very significant savings, especially if you only want to load
a very small field from a document that has very large fields. See a writeup
I did for one of my projects on the Lucene Wiki:

http://wiki.apache.org/lucene-java/FieldSelectorPerformance?highlight=(FieldSelector)
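
The shape of the two ideas together (collect document numbers in one pass,
then read only the one field you need) can be sketched with toy stand-ins in
plain Java. None of the types below are Lucene classes: MatchCollector, the
in-memory "index" and the field names masterId and text are all hypothetical.
With Lucene itself, the callback role would be played by HitCollector and the
single-field read by a document call that takes a FieldSelector.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-ins for illustration only; not Lucene's API.
class CollectThenLoadSketch {

    // Hypothetical callback, invoked once per matching document.
    interface MatchCollector {
        void collect(int docNumber);
    }

    // Toy index: each stored document is a map of field name to value.
    private final List<Map<String, String>> stored = new ArrayList<>();

    void addDoc(String masterId, String text) {
        Map<String, String> doc = new HashMap<>();
        doc.put("masterId", masterId);
        doc.put("text", text);
        stored.add(doc);
    }

    // Toy search: a single pass that reports each match to the collector.
    void search(String term, MatchCollector collector) {
        for (int doc = 0; doc < stored.size(); doc++) {
            if (stored.get(doc).get("text").contains(term)) {
                collector.collect(doc);
            }
        }
    }

    // Step 1: remember only document numbers while collecting.
    // Step 2: afterwards, read just the masterId field of each match.
    List<String> masterIdsFor(String term) {
        BitSet matches = new BitSet();
        search(term, doc -> matches.set(doc));
        List<String> ids = new ArrayList<>();
        for (int doc = matches.nextSetBit(0); doc >= 0; doc = matches.nextSetBit(doc + 1)) {
            ids.add(stored.get(doc).get("masterId"));
        }
        return ids;
    }
}
```

The collected IDs would then be inserted into the temporary table in one
batch instead of being pulled one hit at a time.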


Hope this helps
Erick


