You are viewing a plain text version of this content. The canonical link for it is here.
Posted to derby-dev@db.apache.org by "Rick Hillegas (JIRA)" <ji...@apache.org> on 2014/02/18 01:25:20 UTC
[jira] [Updated] (DERBY-590) How to integrate Derby with Lucene API?

     [ https://issues.apache.org/jira/browse/DERBY-590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rick Hillegas updated DERBY-590:
--------------------------------

    Attachment: derby-590-01-ag-publicAccessToLuceneRoutines.diff

Attaching derby-590-01-ag-publicAccessToLuceneRoutines.diff. This patch builds on Andrew's approach, wiring Lucene support into Derby's SQL authorization scheme. If this approach looks promising, I will write additional regression tests to verify the claims I make about this solution.

Before discussing what this patch does, let me recap what it does NOT do:

1) It does NOT encrypt Lucene indexes if the database is encrypted.

2) It does NOT work with in-memory databases.

3) It does NOT work with backup/restore.

In addition, I have changed the meaning of the updateIndex() procedure and lost the very useful meaning which it used to have. There are at least two important application profiles which have different needs from an index-updating procedure:

i) Update-intensive: These applications do not want to incur the performance hit of re-indexing a document every time it changes. These applications may work best with a bulk-reindexing cron job which executes when user activity is low. This is the use-case supported by the revised updateIndex() procedure: the Lucene index is dropped and completely recreated although schema objects connected to it are not bounced.

ii) Read-intensive: These applications can incur the performance hit of re-indexing a document every time it changes. These applications are well served by the trigger-driven usage of updateIndex() provided in Andrew's original patch.

I think that it may be possible to recover Andrew's original functionality. But this requires more familiarity with Lucene than I can claim. Maybe Andrew can propose some ideas and we can converge on a solution.

So much for the caveats. Here are the changes made by the new patch:

A) The layout of the Lucene support directory is changed. There is now a subdirectory for each schema which has indexed tables in it. Under that directory, there is a subdirectory for each table which has indexed columns. Under the table-specific directory, there is a subdirectory for each text column which is indexed. Those are the leaf directories which contain the actual Lucene files. So the directory structure looks like this:

{noformat}
databaseName
    seg0
    log
    lucene
        SCHEMA1
            TABLE1
                TEXT_COLUMN1
                TEXT_COLUMN2
...
{noformat}

B) The luceneQuery() table function has been changed to be context aware. See the work recently done on https://issues.apache.org/jira/browse/DERBY-6117. Users no longer invoke luceneQuery() directly. Instead, for each indexed column, we create a table function specific to that column. The table function has the name $schemaName.$tableName__$columnName. This is the crucial change which allows us to wire Lucene support up to Derby's SQL authorization. The schema owner can grant EXECUTE privilege on the column-specific table function. In addition, the underlying context-aware luceneQuery() machinery starts out by issuing a select against the text column and the key columns of the table. This ensures that users must enjoy SELECT privilege on all of those columns in order to run the column-specific table function.

C) A text column can be indexed only if the table has a primary key. The primary key can have multiple columns in it and the columns can be any indexable datatype. The column-specific table function returns a data set which includes the whole primary key plus the Lucene document id plus the rank number (the relevance of that document according to Lucene's calculations). This makes it easy to join the table function to the original table in order to obtain more extensive results.

D) The arguments to the Lucene support procedures are now case-insensitive SQL identifiers rather than case-sensitive strings. This is a departure from the convention followed by most Derby system procedures, but I think the departure is welcome and long overdue.

E) The Lucene support has been moved out of the tools jar and into the engine jar. This fixes the big surprise I recorded on https://issues.apache.org/jira/browse/DERBY-6470. This change also made it possible to use IdUtil in order to implement the semantics described above in D).

F) It is assumed that the Lucene jars will be checked into the Derby source tree and that the Lucene support classes will always be built. However, there is still no plan to include Lucene jars in Derby binary distributions. Users who want to enable the optional Lucene support will have to install the Lucene jars themselves.

G) In general, the following convention has been followed for each supported operation: Transactional writes are performed before any calls are made to Lucene. This means that if the Lucene calls raise an error, then the transactional writes are rolled back. I think that following a consistent convention like this will make it easier for users to reason about how Lucene support behaves.

H) Miscellaneous improvements have been made in the areas of code factoring and integration with the Java security manager.

Here's how you use the revised tool:

Let's say that you have a table with this shape...

{noformat}
create table lucenetest.titles
(
    ID int generated always as identity primary key,
    ISBN varchar(16),
    PRINTISBN varchar(16),
    title varchar(1024),
    subtitle varchar(1024),
    author varchar(1024),
    series varchar(1024),
    publisher varchar(1024),
    collections varchar(128),
    collections2 varchar(128)
);
{noformat}

The DBO loads the Lucene support tool as follows...

{noformat}
call syscs_util.syscs_register_tool( 'luceneSupport', true );
{noformat}

This creates the LuceneSupport schema, containing the following procedures and functions...

{noformat}
create procedure LuceneSupport.createIndex
(
    schemaname varchar( 128 ),
    tablename varchar( 128 ),
    textcolumn varchar( 128 )
)
parameter style java modifies sql data language java
external name 'org.apache.derby.impl.optional.lucene.LuceneSupport.createIndex';

create procedure LuceneSupport.dropIndex
(
    schemaname varchar( 128 ),
	tablename varchar( 128 ),
	textcolumn varchar( 128 )
)
parameter style java modifies sql data language java
external name 'org.apache.derby.impl.optional.lucene.LuceneSupport.dropIndex';

create procedure LuceneSupport.updateIndex
(
    schemaname varchar( 128 ),
	tablename varchar( 128 ),
	textcolumn varchar( 128 )
)
parameter style java reads sql data language java
external name 'org.apache.derby.impl.optional.lucene.LuceneSupport.updateIndex';

create function LuceneSupport.listIndexes() returns table
(
    id int,
	schemaname char( 128 ),
	tablename char( 128 ),
	columnname char( 128 ),
	lastupdated timestamp
)
language java parameter style DERBY_JDBC_RESULT_SET contains sql
external name 'org.apache.derby.impl.optional.lucene.LuceneSupport.listIndexes'
{noformat}

The LUCENETEST user can then index a text column as follows. The DBO can do this too. However, no-one else can index the column. That is because the createIndex() procedure attempts to create a table function in the lucenetest schema. Other users do not enjoy that privilege...

{noformat}
call LuceneSupport.createIndex( 'lucenetest', 'titles', 'title' );
{noformat}

...which creates the following directory structure...

{noformat}
databaseName
    seg0
    log
    lucene
        LUCENETEST
            TITLES
                TITLE
...
{noformat}

...and the following column-specific table function:

{noformat}
create function lucenetest.titles__title( query varchar( 32672 ), rankCutoff double )
returns table
(
    ID int,
    documentID int,
	rank double
)
language java parameter style derby_jdbc_result_set contains sql
external name 'org.apache.derby.impl.optional.lucene.LuceneSupport.luceneQuery';
{noformat}

The LUCENETEST user can then query the Lucene index and join it to the original table as follows...

{noformat}
select title, author, publisher, documentID, rank
from lucenetest.titles t, table ( lucenetest.titles__title( 'grapes', 0 ) ) l
where t.id = l.id;
{noformat}

The LUCENETEST user (or the DBO but no-one else) can drop the Lucene index as follows...

{noformat}
call LuceneSupport.dropIndex( 'lucenetest', 'titles','title' );
{noformat}

...which drops the lucene/LUCENETEST/TITLES/TITLE directory. Directory deletion cascades up. That is, if there are no more indexed columns in lucenetest.titles, then we also drop lucene/LUCENETEST/TITLES. And we drop lucene/LUCENETEST if there are no more indexed tables in the lucenetest schema.

The DBO (and only the DBO) can unload Lucene support as follows...

{noformat}
call syscs_util.syscs_register_tool( 'luceneSupport', false );
{noformat}

...which drops all procedures and table functions created by the tool. This command also drops the lucene subdirectory and everything under it.


Touches the following files:

--------------------

M       java/engine/org/apache/derby/impl/sql/compile/StaticMethodCallNode.java

Adds the Lucene support package to the list of Derby packages whose entry points can be bound to user-defined routines.

--------------------

A       java/engine/org/apache/derby/impl/optional
A       java/engine/org/apache/derby/impl/optional/lucene
A       java/engine/org/apache/derby/impl/optional/lucene/LuceneQueryVTI.java
A       java/engine/org/apache/derby/impl/optional/lucene/LuceneSupport.java
A       java/engine/org/apache/derby/impl/optional/lucene/LuceneListIndexesVTI.java
A       java/engine/org/apache/derby/impl/optional/lucene/build.xml
M       java/engine/org/apache/derby/impl/build.xml
M       java/engine/org/apache/derby/catalog/Java5SystemProcedures.java

The Lucene support optional tool.

--------------------

M       java/engine/org/apache/derby/vti/VTITemplate.java

Some additional support for context-aware table functions. This support allows the function to query the JDBC metadata for the shape of the ResultSet it returns.

--------------------

M       java/engine/org/apache/derby/loc/messages.xml
M       java/shared/org/apache/derby/shared/common/reference/SQLState.java

New error messages for Lucene support.


--------------------

M       build.xml
A       tools/java/lucene-analyzers-common-4.5.0.jar
A       tools/java/lucene-queryparser-4.5.0.jar
A       tools/java/lucene-core-4.5.0.jar
M       tools/ant/properties/extrapath.properties
M       tools/jar/extraDBMSclasses.properties

Miscellaneous build machinery.

--------------------

A       tools/release/notices/lucene.txt

The Lucene notice file which must be included in the Derby NOTICE file if we are to ship the lucene jars in Derby source distributions.

--------------------

M       java/testing/org/apache/derbyTesting/functionTests/tests/lang/_Suite.java
A       java/testing/org/apache/derbyTesting/functionTests/tests/lang/LuceneSupportTest.java

Initial tests.


> How to integrate Derby with Lucene API?
> ---------------------------------------
>
>                 Key: DERBY-590
>                 URL: https://issues.apache.org/jira/browse/DERBY-590
>             Project: Derby
>          Issue Type: Improvement
>          Components: Documentation, SQL
>            Reporter: Abhijeet Mahesh
>              Labels: derby_triage10_11
>         Attachments: derby-590-01-ag-publicAccessToLuceneRoutines.diff, lucene_demo.diff, lucene_demo_2.diff
>
>
> In order to use derby with lucene API what should be the steps to be taken? 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)