Posted to commits@couchdb.apache.org by Apache Wiki <wi...@apache.org> on 2009/04/01 00:19:56 UTC

[Couchdb Wiki] Update of "Full text search" by RobertNewson

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Couchdb Wiki" for change notification.

The following page has been changed by RobertNewson:
http://wiki.apache.org/couchdb/Full_text_search

The comment on the change is:
rewritten for couchdb-lucene

------------------------------------------------------------------------------
- == Fulltext Indexing and Searching ==
+ == Full-text Indexing and Searching ==
- CouchDB comes with fulltext indexing support using [http://lucene.apache.org Apache Lucene] as a reference implementation. The integration is modular and allows any fulltext search technology to be used in combination with CouchDB.
+ 
+ Lucene integration with CouchDB is available through an external project called couchdb-lucene (http://github.com/rnewson/couchdb-lucene).
+ 
  
  === Index interface ===
  
+ couchdb-lucene's indexing process is configured as an update notification process, as follows:
- CouchDB uses stdio for interfacing to the search engine,whenever a document is changed the name of the database 
- containing the document is sent to stdout.
- 
- CouchDB does not expect to receive anything on stdin (read: it will crash if it does).
- 
- ==== setup ====
- 
- The indexer is started by CouchDB using the command line specified in
- the couch.ini configuration parameter:
  
  {{{
- DbUpdateNotificationProcess
+ [update_notification]
+ indexer=/usr/bin/java -jar /path/to/couchdb-lucene-<version>-jar-with-dependencies.jar -index
  }}}
- 
  
  === Search interface ===
  
+ couchdb-lucene's search process is configured as an external process, accessible via an httpd_handler, as follows:
- CouchDB again uses stdio to interface to the searcher part.
- 
- Currently this interface is not exposed through Futon, so to try it out you need to
- start CouchDB with the 
- interactive option -i to get an Erlang shell.
- 
- From there you can write search queries like:
  
  {{{
- couch_ft_query:execute("database", "+ query +string").
+ [couchdb]
+ os_process_timeout=60000 ; increase the timeout to 60 seconds.
+ 
+ [external]
+ fti=/usr/bin/java -jar /path/to/couchdb-lucene-<version>-jar-with-dependencies.jar -search
+ 
+ [httpd_db_handlers]
+ _fti = {couch_httpd_external, handle_external_req, <<"fti">>}
  }}}
  
+ You can install the httpd_handler under any name you like, but the name must match between the [external] and [httpd_db_handlers] sections. The rest of the document assumes 'fti'.
  
+ The search handler accepts the following query parameters:
+ 
+ q:: the query to run (e.g., subject:hello)
+ sort:: the comma-separated fields to sort on. Prefix a field with / for ascending order or \ for descending order (ascending is the default if not specified).
+ limit:: the maximum number of results to return
+ skip:: the number of results to skip
+ include_docs:: whether to include the source docs
+ stale=ok:: If you set the stale option to ok, couchdb-lucene may skip refreshing the index. Searches may be faster, as Lucene caches important data (especially for sorting). A query without stale=ok will use the latest data committed to the index.
+ debug:: if false, a normal application/json response with results is returned. If true, a pretty-printed HTML blob is returned instead.
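
As an illustration of how the parameters above combine, here is a minimal sketch that builds a search URL. It assumes the handler is installed as `_fti` (as in the configuration above) and is therefore reached at `/<database>/_fti`. The host, port, database name `mydb`, the `date` sort field, and the helper `buildSearchUrl` are hypothetical and not part of couchdb-lucene.

```javascript
// Build a couchdb-lucene search URL from the documented query parameters.
// Base URL and database name are placeholders, not values from this page.
function buildSearchUrl(baseUrl, db, params) {
  const query = new URLSearchParams();
  // Copy only the parameters documented above; anything else is ignored.
  const allowed = ["q", "sort", "limit", "skip", "include_docs", "stale", "debug"];
  for (const name of allowed) {
    if (params[name] !== undefined) {
      query.set(name, String(params[name]));
    }
  }
  // "_fti" assumes the handler name used in the [httpd_db_handlers] section.
  return `${baseUrl}/${encodeURIComponent(db)}/_fti?${query.toString()}`;
}

const url = buildSearchUrl("http://localhost:5984", "mydb", {
  q: "subject:hello",
  sort: "\\date", // backslash prefix: descending order on a hypothetical "date" field
  limit: 10,
  skip: 0,
  stale: "ok",
});
console.log(url);
```

URLSearchParams percent-encodes characters such as the colon and the backslash, so the query and sort prefixes survive the trip through the URL.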
- For this example the string "database\n" followed by "+ query +string\n" is
- transmitted to stdout.
- 
- The result must follow this exact scheme to ensure that CouchDB understands
- it. The first line must be "ok\n". The next two lines contain the id of the 
- highest ranking document that matches the query and the score or rank: 
- "docid1\n7\n". These two lines are repeated for all matching documents. 
- The end of the result must be signaled with an empty newline "\n\n". 
- 
- In case of an error, the first line consists of "error\n" and the second
- line of the error message: "Invalid Foo Condition\n".
- 
- ==== setup ====
- The searcher is started by CouchDB using the command line specified in
- the couch.ini configuration parameter:
- 
- {{{
- FullTextSearchQueryServer
- }}}
- 
  
  === Lucene reference implementation ===
  
+ You can customize the indexing process (by default, all attributes of all documents are indexed) using a special document at _design/lucene with a "transform" function:
- ==== RFC: Use of special design document ====
- Please not that this is currently in discussion and not actually set in code.
- A database to index must contain a special design document in this format:
  
  {{{
  {
+   "transform":"function(doc) { return doc; }"
-   "_id":"_design/fulltextsearch",
-   "_rev":"123",
-   "fulltext_options": {
-     "views": {
-       "names" : {"index":"view-value", "return":"document"},
-       "cities": {"index":"view-key", "return":"view"}
-     }
-   }
  }
  }}}
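
The identity transform above indexes every attribute. As a hedged sketch of a narrower transform, the following keeps only two fields. It assumes that couchdb-lucene indexes whatever object the function returns; the page does not spell this out. The field names `subject` and `body` are hypothetical.

```javascript
// Hypothetical transform: keep only the fields worth searching.
// Assumption: couchdb-lucene indexes the object this function returns.
const transform = function (doc) {
  return {
    subject: doc.subject,
    body: doc.body,
  };
};

// The design document stores the function as a string, as in the example above.
const designDoc = {
  _id: "_design/lucene",
  transform: transform.toString(),
};

console.log(JSON.stringify(designDoc, null, 2));
```

Writing the function as real code and serializing it with `toString()` avoids hand-escaping quotes inside the JSON string.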
  
+ ==== Dependencies ====
- The Lucene indexer uses the defined views in this document to guide the indexing
- process. 
  
+ couchdb-lucene uses Maven 2 to manage dependencies, so you shouldn't have to deal with them directly.
- In this example the views "names" and "cities" must also be defined in the database. 
- Lucene will index the "view-value" for the "names" view and return documents as
- search results, 
- for the "cities" view it will index the view-key and return the view in search results.
- 
- For info on views in CouchDB see: ["Introduction to CouchDB views"]
- 
- 
- ==== Dependencies ====
- The Lucene indexer depends on these projects .jar files to work
-  * couchdb4j.jar (see below)
-    * commons-beanutils.jar
-    * commons-codec-1.3.jar
-    * commons-collections.jar
-    * commons-httpclient-3.1.jar
-    * commons-lang.jar
-    * commons-logging-1.1.jar
-    * ezmorph-1.0.3.jar
-    * json-lib-2.0-jdk15.jar
-  * lucene-core-2.3.1.jar
- 
- Note: all the couchdb4j dependencies (as you can see some have not
- version info supplied) is probably easily checked out from the
- couchdb4j repository (see below).
- 
- Note: at this time of writing couchdb4j needs to be patched using the patches
- specified in issue 6 and 8 
- on the coucdb4j issue tracking list: 
- 
- http://code.google.com/p/couchdb4j/issues/list
- 
- So checkout trunk patch and build.
  
  At least Java version 5 is needed.
  
  ==== Compiling ====
+ 
  The Lucene search engine is not built as part of CouchDB.
  
  You need to:
   * Set up a Java development environment (at least version 5).
+  * Check out the couchdb-lucene source with git clone git://github.com/rnewson/couchdb-lucene.git
+  * cd couchdb-lucene
+  * type 'mvn'
-  * Checkout CouchDB source.
-  * Change directory to src/fulltext/lucene
-  * Compile using javac with CLASSPATH with the needed dependencies (listed above)
-  * Do: jar cf !CouchLucene.jar *.class 
  
+ As a result, you should get a file target/couchdb-lucene-<version>-jar-with-dependencies.jar.
- As result you should get a file !CouchLucene.jar to include in your CLASSPATH at
- runtime.
  
- ==== Runtime setup ====
- You need a path to your java runtime (at least version 5).
- You have to setup your java CLASSPATH to contain all the .jar files listed in the
- dependency list,
- alternatively you can specify it on the command line defined for the .ini options like:
- 
- {{{
- FullTextSearchQueryServer=java -cp /path/to/couchdb4j/lib/couchdb4j.jar:...
- LuceneSearcher
- DbUpdateNotificationProcess=java -cp /path/to/couchdb4j/lib/couchdb4j.jar:...
- LuceneIndexer
- }}}
- 
- Note above example works on Unix like OS's
-