Posted to java-user@lucene.apache.org by Eugeny N Dzhurinsky <bo...@redwerk.com> on 2006/08/11 09:35:12 UTC

search document for keywords and keyphrases

Hello!

I have an assignment that requires searching documents for keywords or
keyphrases.

For instance, I have a database of keywords/keyphrases, which might contain
several million items. Now I need to find out whether a document contains any
of the keywords/phrases listed in that database.

I was thinking of implementing a finite-state machine and using B-trees, so
that I iterate over the document char by char and go down the tree until I
find a word or phrase that matches the character sequence.
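
The tree-walking idea can be sketched roughly as follows. This is only an
illustration under stated assumptions, not a tested design: it uses a trie
(prefix tree) rather than a B-tree, it walks the document word by word
instead of char by char, and the class and method names (KeyphraseTrie, add,
findAll) are made up for this example. It restarts the walk at every word; the
Aho-Corasick algorithm avoids those restarts by adding failure links.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: a word-level trie for finding known phrases in a text. */
public class KeyphraseTrie {

    /** One trie node: children keyed by the next word, plus the phrase ending here, if any. */
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        String phrase; // non-null when a keyword/keyphrase ends at this node
    }

    private final Node root = new Node();

    /** Assumption: words are lower-cased runs of letters and digits. */
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    /** Adds one keyword or keyphrase, one trie level per word. */
    public void add(String phrase) {
        Node node = root;
        for (String word : tokenize(phrase)) {
            node = node.children.computeIfAbsent(word, k -> new Node());
        }
        if (node != root) {
            node.phrase = phrase;
        }
    }

    /** Returns every stored phrase that occurs in the document, in order of first match. */
    public List<String> findAll(String document) {
        List<String> words = tokenize(document);
        Set<String> found = new LinkedHashSet<>();
        for (int start = 0; start < words.size(); start++) {
            Node node = root;
            // Walk down the trie for as long as the following words keep matching.
            for (int i = start; i < words.size(); i++) {
                node = node.children.get(words.get(i));
                if (node == null) {
                    break;
                }
                if (node.phrase != null) {
                    found.add(node.phrase);
                }
            }
        }
        return new ArrayList<>(found);
    }

    public static void main(String[] args) {
        KeyphraseTrie trie = new KeyphraseTrie();
        trie.add("finite-state machine");
        trie.add("machine");
        trie.add("b-tree");
        System.out.println(trie.findAll("I was thinking of a finite-state machine."));
    }
}
```

With several million entries, memory use of such an in-memory trie would have
to be measured; the sketch does not address that.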

I think Lucene works in a similar way when it performs a search, so maybe I
can use Lucene?

-- 
Eugene N Dzhurinsky

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: search document for keywords and keyphrases

Posted by Eugeny N Dzhurinsky <bo...@redwerk.com>.
On Fri, Aug 11, 2006 at 02:39:19PM +0300, Eugeny N Dzhurinsky wrote:
> On Fri, Aug 11, 2006 at 01:22:26PM +0200, Simon Willnauer wrote:
> > Sure you can do this.
> > You index your document with the keywords assigned to the document and
> > search with a Boolean query to get all documents having the keywords
> > 1,2,...n-1,n. Just be aware that there are limits on Boolean
> > queries in Lucene (see setMaxClauseCount()), and large ones can be very
> > memory consuming.
> 
> Well, I don't understand what you mean by "index your document with the
> keywords assigned to the document". There are no keywords assigned to a
> document. I was thinking of it this way: maybe it is possible to index the
> database of keywords and use the entire document as the search phrase?
> 
> Would that find single words as well as phrases?
> 
> For example, if the document contains the phrase
> 
> some things happens here
> 
> and there are these entries in the keywords database
> 
> some things
> happens here
> some things happens here
> 
> I will need to get all of these entries.
> 
> > But I guess you will search for a small number of
> > keywords, won't you?
> 
> This database could be VERY large, several million records.

So could somebody please advise: will Lucene suit my requirements, or do I
have to develop a solution myself or look for something else?

Thanks in advance.

-- 
Eugene N Dzhurinsky



Re: search document for keywords and keyphrases

Posted by Eugeny N Dzhurinsky <bo...@redwerk.com>.
On Fri, Aug 11, 2006 at 01:22:26PM +0200, Simon Willnauer wrote:
> Sure you can do this.
> You index your document with the keywords assigned to the document and
> search with a Boolean query to get all documents having the keywords
> 1,2,...n-1,n. Just be aware that there are limits on Boolean
> queries in Lucene (see setMaxClauseCount()), and large ones can be very
> memory consuming.

Well, I don't understand what you mean by "index your document with the
keywords assigned to the document". There are no keywords assigned to a
document. I was thinking of it this way: maybe it is possible to index the
database of keywords and use the entire document as the search phrase?

Would that find single words as well as phrases?

For example, if the document contains the phrase

some things happens here

and there are these entries in the keywords database

some things
happens here
some things happens here

I will need to get all of these entries.
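
To make the idea concrete without Lucene, here is a rough sketch: keep the
keyword database in a hash set, then build every run of up to N consecutive
words from the document and look each run up in the set. This is only an
assumption about how it could be done; the class name ShingleLookup, the
maxWords parameter and the whitespace-only tokenization (punctuation is not
stripped) are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: match a document against a set of keywords/keyphrases. */
public class ShingleLookup {

    /** Assumption: words are separated by whitespace and compared case-insensitively. */
    public static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    /** Brings a keyword into the same form as the word runs built in match(). */
    public static String normalize(String phrase) {
        return String.join(" ", words(phrase));
    }

    /**
     * Returns the keywords that occur in the document. Keywords must already be
     * normalized; maxWords is the length of the longest keyphrase, in words.
     */
    public static Set<String> match(Set<String> keywords, String document, int maxWords) {
        List<String> w = words(document);
        Set<String> hits = new LinkedHashSet<>();
        for (int start = 0; start < w.size(); start++) {
            StringBuilder run = new StringBuilder();
            // Grow the run one word at a time: 1 word, 2 words, ... up to maxWords.
            for (int len = 1; len <= maxWords && start + len <= w.size(); len++) {
                if (len > 1) {
                    run.append(' ');
                }
                run.append(w.get(start + len - 1));
                String candidate = run.toString();
                if (keywords.contains(candidate)) {
                    hits.add(candidate);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>();
        for (String k : Arrays.asList("some things", "happens here", "some things happens here")) {
            keywords.add(normalize(k));
        }
        // All three entries are found for this document.
        System.out.println(match(keywords, "some things happens here", 4));
    }
}
```

The number of lookups grows with document length times maxWords, not with the
size of the keyword database, which is why the database size matters mainly
for memory here.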

> But I guess you will search for a small number of
> keywords, won't you?

This database could be VERY large, several million records.

-- 
Eugene N Dzhurinsky



Re: search document for keywords and keyphrases

Posted by Simon Willnauer <si...@googlemail.com>.
Sure you can do this.
You index your document with the keywords assigned to the document and
search with a Boolean query to get all documents having the keywords
1,2,...n-1,n. Just be aware that there are limits on Boolean
queries in Lucene (see setMaxClauseCount()), and large ones can be very
memory consuming. But I guess you will search for a small number of
keywords, won't you?

regards simon

On 8/11/06, Eugeny N Dzhurinsky <bo...@redwerk.com> wrote:
> On Fri, Aug 11, 2006 at 08:11:31PM +1000, Jason Polites wrote:
> > Yes, you could use Lucene for this, but it may be overkill for your
> > requirement.  If I understand you correctly, all you need to do is find
> > documents which match "any" of the words in your list?  Do you need to rank
> > the results?   If not, it's probably easier just to create your own inverted
> > index of the documents you need to search.  If you actually need to rank
> > results, then Lucene is probably easier as it does this for you.
>
> No. I have a single document. I need to know whether this document contains
> ANY of the keywords listed in that database. A keyword may be a single word
> or several words separated by spaces.
>
> As a result I need to get the list of keywords from that database that
> occur in the document.
>
> --
> Eugene N Dzhurinsky



Re: search document for keywords and keyphrases

Posted by Eugeny N Dzhurinsky <bo...@redwerk.com>.
On Fri, Aug 11, 2006 at 08:11:31PM +1000, Jason Polites wrote:
> Yes, you could use Lucene for this, but it may be overkill for your
> requirement.  If I understand you correctly, all you need to do is find
> documents which match "any" of the words in your list?  Do you need to rank
> the results?   If not, it's probably easier just to create your own inverted
> index of the documents you need to search.  If you actually need to rank
> results, then Lucene is probably easier as it does this for you.

No. I have a single document. I need to know whether this document contains
ANY of the keywords listed in that database. A keyword may be a single word
or several words separated by spaces.

As a result I need to get the list of keywords from that database that
occur in the document.

-- 
Eugene N Dzhurinsky



Re: search document for keywords and keyphrases

Posted by Jason Polites <ja...@gmail.com>.
Yes, you could use Lucene for this, but it may be overkill for your
requirement.  If I understand you correctly, all you need to do is find
documents which match "any" of the words in your list?  Do you need to rank
the results?   If not, it's probably easier just to create your own inverted
index of the documents you need to search.  If you actually need to rank
results, then Lucene is probably easier as it does this for you.
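
A home-made inverted index of the kind meant here might look something like
the sketch below. It is a minimal illustration under assumptions of my own:
the class name TinyInvertedIndex, the integer document ids and the simple
letters-and-digits tokenization are all hypothetical, and there is no ranking
at all, only an unordered "matches any word" lookup.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: term -> set of ids of the documents containing that term. */
public class TinyInvertedIndex {

    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Assumption: terms are lower-cased runs of letters and digits. */
    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of all documents containing at least one of the given single words. */
    public Set<Integer> matchAny(Collection<String> terms) {
        Set<Integer> result = new TreeSet<>();
        for (String term : terms) {
            Set<Integer> docs = postings.getOrDefault(
                    term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
            result.addAll(docs);
        }
        return result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument(1, "Lucene ranks the results for you");
        index.addDocument(2, "An inverted index maps terms to documents");
        index.addDocument(3, "Nothing relevant in here");
        System.out.println(index.matchAny(Arrays.asList("lucene", "index")));
    }
}
```

Phrases would need extra work (for example storing word positions), which is
one of the things Lucene already handles.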



On 8/11/06, Eugeny N Dzhurinsky <bo...@redwerk.com> wrote:
>
> Hello!
>
> I have an assignment that requires searching documents for keywords or
> keyphrases.
>
> For instance, I have a database of keywords/keyphrases, which might contain
> several million items. Now I need to find out whether a document contains any
> of the keywords/phrases listed in that database.
>
> I was thinking of implementing a finite-state machine and using B-trees, so
> that I iterate over the document char by char and go down the tree until I
> find a word or phrase that matches the character sequence.
>
> I think Lucene works in a similar way when it performs a search, so maybe I
> can use Lucene?
>
> --
> Eugene N Dzhurinsky