Posted to dev@jena.apache.org by "Paolo Castagna (JIRA)" <ji...@apache.org> on 2010/12/21 18:02:03 UTC
[jira] Issue Comment Edited: (JENA-9) LARQ as a separate module from ARQ

    [ https://issues.apache.org/jira/browse/JENA-9?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12973762#action_12973762 ] 

Paolo Castagna edited comment on JENA-9 at 12/21/10 12:01 PM:
--------------------------------------------------------------

> Merge JENA-5 fix 

I have merged Andy's changes that fix JENA-5 into the separate LARQ module.
I have also removed the @author annotations from comments, since I noticed Andy did the same in ARQ.

> Upgrade Lucene version to 2.9.3 and fix tests (if there are failures). Remove code using deprecated Lucene APIs and upgrade to Lucene 3.0.x. 

Done. LARQ now uses Lucene 3.0.3. However, Lucene 2.9.3 can still be used as a drop-in replacement if someone needs or wants it.

> Decide how many results to return when the user does not specify it, 1000? More? 

It's now a constant in LARQ.java, set to 1000.

> Should we use the index to suppress duplicates instead of in-memory data structures? 

IndexBuilderLiteral.java is now using the index rather than an in-memory data structure to avoid adding duplicate documents to the Lucene index:

{code}
    if ( ! super.index.getIndex().hasMatch(LARQ.fLex + ":\"" + node.getLiteralLexicalForm() + "\"" ))
    {
        if ( indexThisLiteral(s.getLiteral()))
            index.index(node, node.getLiteralLexicalForm()) ;
    }
{code}
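
One point worth checking in the duplicate test: the lexical form is concatenated into the query string unescaped. Assuming hasMatch hands that string to Lucene's query parser, a literal containing a double quote or a backslash could end the phrase early. The class below is a minimal sketch of one way to guard against that; PhraseQuerySketch and its methods are hypothetical names and are not part of LARQ.

```java
/** Hypothetical helper (not part of LARQ): builds a quoted phrase query for one field. */
final class PhraseQuerySketch {
    private PhraseQuerySketch() {}

    /** Backslash-escapes '\' and '"' so the text cannot terminate the phrase early. */
    static String escapePhrase(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || c == '"') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Returns field:"escaped text". */
    static String phraseQuery(String field, String text) {
        return field + ":\"" + escapePhrase(text) + "\"";
    }
}
```

With such a helper, the check could read hasMatch(PhraseQuerySketch.phraseQuery(LARQ.fLex, node.getLiteralLexicalForm())). Lucene also ships QueryParser.escape(String), which escapes a wider set of special characters and may be the simpler choice.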

> We could use the Model to decide when there are no more triples with a specified literal and therefore it's ok to remove it from Lucene. 

Done. For example, see IndexBuilderLiteral.java:

{code}
    public void unindexStatement(Statement s)
    { 
        if ( ! indexThisStatement(s) )
            return ;

        if ( s.getObject().isLiteral() )
        {
            // Use the Model as a reference count: unindex the literal only
            // when no remaining statement has it as its object.
            StmtIterator iter = s.getModel().listStatements((Resource)null, (Property)null, s.getObject()) ;
            try {
                if ( ! iter.hasNext() ) {
                    Node node = s.getObject().asNode() ;
                    if ( indexThisLiteral(s.getLiteral()) )
                        index.unindex(node, node.getLiteralLexicalForm()) ;
                }
            } finally { iter.close() ; } // release the iterator even if unindexing fails
        }
    }
{code}
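
The idea of letting the Model play the role of a reference count can be sketched without Jena or Lucene. The toy classes below are hypothetical stand-ins (a set of triples for the Model, a set of strings for the Lucene index) and only illustrate the rule: a literal leaves the index when the last triple that uses it is removed.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

/** Toy stand-ins (hypothetical, not Jena or LARQ classes) for a model and a text index. */
final class UnindexSketch {

    /** A minimal triple whose object is a literal's lexical form. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String literal;

        Triple(String subject, String predicate, String literal) {
            this.subject = subject;
            this.predicate = predicate;
            this.literal = literal;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Triple)) return false;
            Triple t = (Triple) o;
            return subject.equals(t.subject)
                && predicate.equals(t.predicate)
                && literal.equals(t.literal);
        }

        @Override
        public int hashCode() {
            return Objects.hash(subject, predicate, literal);
        }
    }

    private final Set<Triple> model = new HashSet<>();
    private final Set<String> indexedLiterals = new HashSet<>();

    /** Adds the triple and indexes its literal; the set ignores duplicates. */
    void add(Triple t) {
        model.add(t);
        indexedLiterals.add(t.literal);
    }

    /** Removes the triple and unindexes its literal if no other triple uses it. */
    void remove(Triple t) {
        if (!model.remove(t)) {
            return; // the triple was not in the model, nothing to do
        }
        // The model is the reference count: look for any remaining use of the literal.
        boolean stillUsed = model.stream().anyMatch(x -> x.literal.equals(t.literal));
        if (!stillUsed) {
            indexedLiterals.remove(t.literal);
        }
    }

    boolean isIndexed(String literal) {
        return indexedLiterals.contains(literal);
    }
}
```

As in unindexStatement, the lookup runs after the statement has been removed, so an empty result means the removed statement was the last user of the literal. The cost is one scan (or one model lookup) per removal instead of a separately maintained counter.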

> See how the new NRT capabilities of Lucene can be used from LARQ. 

See IndexBuilderBase.java:

{code}
    protected IndexReader getIndexReader()
    {
        try {
            flushWriter() ;
            if ( indexWriter != null ) {
                return indexWriter.getReader() ; // use Lucene's Near Real Time (NRT) reader
            } else {
                return IndexReader.open(dir, true) ; // no writer: open a read-only reader
            }
        } catch (Exception e) { throw new ARQLuceneException("getIndexReader", e) ; }
    }
{code}

> Review package names (currently c.h.h.j.sparql.larq and c.h.h.j.query.larq). Should we move to c.h.h.j.larq.*? 

I think we should, but I have not done it yet.

In fact, we could move to org.apache.jena.larq.* instead. What do you think?

  
> LARQ as a separate module from ARQ
> ----------------------------------
>
>                 Key: JENA-9
>                 URL: https://issues.apache.org/jira/browse/JENA-9
>             Project: Jena
>          Issue Type: Task
>          Components: LARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>
> LARQ can be extracted from ARQ as a separate module depending on ARQ.
> ARQ should not depend on LARQ (to avoid dependency cycles) and it could check if LARQ is available in the classpath and wire the property function in dynamically.
> LARQ can have a different release cycle from ARQ and people who do not need free text search will not need to include Lucene in their classpath.
> A separate (experimental) module is available here: https://jena.svn.sourceforge.net/svnroot/jena/LARQ/trunk/
> List of things to do/decide includes:
>  - Merge JENA-5 fix 
>  - Upgrade Lucene version to 2.9.3 and fix tests (if there are failures).
>  - Remove code using deprecated Lucene APIs and upgrade to Lucene 3.0.x.
>  - Decide how many results to return when the user does not specify it, 1000? More?
>  - Should we use the index to suppress duplicates instead of in-memory data structures?
>  - How do we implement removals/unindex?
>     - We could use the Model to decide when there are no more triples with a specified literal and therefore it's ok to remove it from Lucene.
>  - See how the new NRT capabilities of Lucene can be used from LARQ.
>  - Review package names (currently c.h.h.j.sparql.larq and c.h.h.j.query.larq). Should we move to c.h.h.j.larq.*?
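
The classpath check mentioned in the issue description could start from a probe like the one below. This is a minimal sketch under stated assumptions: OptionalModuleSketch is a hypothetical name, and the class name to probe for is left as a parameter because the issue does not fix one.

```java
/** Hypothetical probe (not ARQ code): reports whether a class can be loaded. */
final class OptionalModuleSketch {
    private OptionalModuleSketch() {}

    /** Returns true if className is loadable, without running its static initializers. */
    static boolean isAvailable(String className) {
        try {
            Class.forName(className, false, OptionalModuleSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Absent, or present but unusable (for example, a missing dependency).
            return false;
        }
    }
}
```

ARQ could call such a probe once at initialization and register the property function only when it returns true, which keeps ARQ free of a compile-time dependency on LARQ.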

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.