Posted to dev@couchdb.apache.org by Damien Katz <da...@apache.org> on 2008/07/11 23:29:13 UTC

Integrated Full Text Indexing and Reporting Re: CouchDB 0.9 and 1.0

CouchDB needs integrated full-text indexing support. We should be able
to support multiple full-text engines, but our reference
implementation will be Apache Lucene.

Initially (I'm hoping for 0.9.0) we should be able to index all
documents and their attachments (for types that Lucene can index
anyway) and return queries against that index via. Jan has begun this
work and I think someone has this mostly working now somewhere, but
it's not in trunk?

By 1.0, we should also do view intersections with full-text results.
At query time, CouchDB gets back a list of matching documents, then
finds the emitted view rows from those documents and returns them
sorted by relevance score. This will require some enhancements to the
internal view API, but the data and required index (view keys by doc
id) already exist to make this efficient.
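
To make the query-time step concrete, here is a minimal sketch in Python
(not CouchDB's Erlang internals). All names in it (Hit, ViewRow,
rows_by_doc_id, intersect_view_with_hits) are hypothetical, and the plain
dict only stands in for the existing by-doc-id view index; it assumes the
full-text engine hands back (doc id, score) pairs.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Hit:
    """One full-text match (hypothetical shape): a doc id plus its relevance score."""
    doc_id: str
    score: float


@dataclass(frozen=True)
class ViewRow:
    """One row emitted by a view for a given document (hypothetical shape)."""
    doc_id: str
    key: Any
    value: Any


def intersect_view_with_hits(
    hits: Iterable[Hit],
    rows_by_doc_id: Dict[str, List[ViewRow]],
) -> List[Tuple[float, ViewRow]]:
    """Return (score, row) pairs for rows emitted by matching docs, best score first."""
    results: List[Tuple[float, ViewRow]] = []
    for hit in hits:
        # Assumption: the dict stands in for the view's by-doc-id index.
        # Documents that matched the search but emitted no rows drop out here.
        for row in rows_by_doc_id.get(hit.doc_id, ()):
            results.append((hit.score, row))
    # Python's sort is stable, so rows from the same document keep emit order.
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results
```

The cost is one index lookup per matching document plus a sort of the
collected rows, which is why having the by-doc-id index already in place
matters.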

Perhaps not initially, but eventually the full-text engine will be
integrated as proper CouchDB HTTP and daemon plug-ins (once those
APIs are established).

On Jul 2, 2008, at 3:08 AM, Jan Lehnardt wrote:

> Hello everybody,
> this thread is meant to collect missing work items (features and
> bugs) for our 1.0 release and a discussion about how to split
> them up between 0.9 and 1.0.
>
> Take it away: Damien.
>
> Cheers
> Jan
> --

Re: Integrated Full Text Indexing and Reporting Re: CouchDB 0.9 and 1.0

Posted by Paul Davis <pa...@gmail.com>.

The patch for Issue74 only affects the line protocol between the
external processes. I think the biggest showstopper for full-text
search right now is that the way CouchDB will interface with external
software is still in flux. Whether things move towards some sort of
plugin interface, etc., should probably be settled before doing too
much work on this. (Assuming that most of the FTI work will be
involved in the integration step.)

Also, the note on intersecting views with FTI search results is
interesting, but I'm not certain how that would work
implementation-wise. I could see some pretty harsh runtime
characteristics come into play when attempting to merge between
indices that are in and out of CouchDB.

Not to say it wouldn't be a kick-ass feature, but it almost seems like
something that wouldn't be feasible without an Erlang FTI engine. In
other news, intersections for arbitrary views might be an entirely
separate feature to implement.

Paul

On Sat, Jul 12, 2008 at 5:24 PM, Jan Lehnardt <ja...@apache.org> wrote:
>
> On Jul 11, 2008, at 22:29 , Damien Katz wrote:
>
>> CouchDB needs integrated full-text indexing support. We should be able to
>> support multiple full-text engines, but our reference implementation will be
>> Apache Lucene.
>>
>> Initially (I'm hoping for 0.9.0) we should be able to index all documents
>> and their attachments (for types that Lucene can index anyway) and return
>> queries against that index via. Jan has begun this work and I think someone
>> has this mostly working now somewhere, but it's not in trunk?
>
> We have a patch that improves the API here:
> https://issues.apache.org/jira/browse/COUCHDB-74
> and there is the
> http://svn.apache.org/repos/asf/incubator/couchdb/branches/lucene-search/
> branch that this patch should be applied to. Further work should be
> continued there. At this point the only difference between trunk and the
> branch is the addition of the /db/_search API call. The branch also might
> need to be brought up to date with trunk. It has no current maintainer,
> although Paul Davis voiced interest in pushing this forward. Also, there
> were attempts at adding other search engines, but they never surfaced. If I
> remember correctly, the problem that views cannot be searched without
> expanding the view server stopped most work.
>
>
>> By 1.0, we should also do view intersections with full-text results. At
>> query time, CouchDB gets back a list of matching documents, then finds
>> the emitted view rows from those documents and returns them sorted by
>> relevance score. This will require some enhancements to the internal view
>> API, but the data and required index (view keys by doc id) already exist to
>> make this efficient.
>
> I opened a bug report for this.
>
>
> --
>
> Since I started the work on Lucene I am by open source work definition
> somewhat responsible for the life of this. But I'd rather not, at least for
> the Java side of things. If somebody (heya Paul, still in?) wants to take
> this over, that'd be mighty cool.
>
>
> Cheers
> Jan
> --
>
>> Perhaps not initially, but eventually the full-text engine will be
>> integrated as proper CouchDB HTTP and daemon plug-ins (once those APIs
>> are established).
>>
>> On Jul 2, 2008, at 3:08 AM, Jan Lehnardt wrote:
>>
>>> Hello everybody,
>>> this thread is meant to collect missing work items (features and
>>> bugs) for our 1.0 release and a discussion about how to split
>>> them up between 0.9 and 1.0.
>>>
>>> Take it away: Damien.
>>>
>>> Cheers
>>> Jan
>>> --
>>
>>
>
>

Re: Integrated Full Text Indexing and Reporting Re: CouchDB 0.9 and 1.0

Posted by Jan Lehnardt <ja...@apache.org>.

On Jul 11, 2008, at 22:29 , Damien Katz wrote:

> CouchDB needs integrated full-text indexing support. We should be
> able to support multiple full-text engines, but our reference
> implementation will be Apache Lucene.
>
> Initially (I'm hoping for 0.9.0) we should be able to index all
> documents and their attachments (for types that Lucene can index
> anyway) and return queries against that index via. Jan has begun
> this work and I think someone has this mostly working now somewhere,
> but it's not in trunk?

We have a patch that improves the API here:
https://issues.apache.org/jira/browse/COUCHDB-74
and there is the
http://svn.apache.org/repos/asf/incubator/couchdb/branches/lucene-search/
branch that this patch should be applied to. Further work should be
continued there. At this point the only difference between trunk and
the branch is the addition of the /db/_search API call. The branch
also might need to be brought up to date with trunk. It has no current
maintainer, although Paul Davis voiced interest in pushing this
forward. Also, there were attempts at adding other search engines, but
they never surfaced. If I remember correctly, the problem that views
cannot be searched without expanding the view server stopped most
work.

> By 1.0, we should also do view intersections with full-text
> results. At query time, CouchDB gets back a list of matching
> documents, then finds the emitted view rows from those documents
> and returns them sorted by relevance score. This will require some
> enhancements to the internal view API, but the data and required
> index (view keys by doc id) already exist to make this efficient.

I opened a bug report for this.

--

Since I started the work on Lucene I am by open source work definition  
somewhat responsible for the life of this. But I'd rather not, at  
least for the Java side of things. If somebody (heya Paul, still in?)  
wants to take this over, that'd be mighty cool.

Cheers
Jan
--

> Perhaps not initially, but eventually the full-text engine will be
> integrated as proper CouchDB HTTP and daemon plug-ins (once those
> APIs are established).
>
> On Jul 2, 2008, at 3:08 AM, Jan Lehnardt wrote:
>
>> Hello everybody,
>> this thread is meant to collect missing work items (features and
>> bugs) for our 1.0 release and a discussion about how to split
>> them up between 0.9 and 1.0.
>>
>> Take it away: Damien.
>>
>> Cheers
>> Jan
>> --
>
>