Posted to users@jena.apache.org by fr...@voila.fr on 2011/04/19 18:01:49 UTC

LARQ: My usage patterns. Then, a request for a new method.  [Re: LARQ: what's the plan to cope with Lucene 3.x?]

Hi Paolo & you all Jena-rous people,
 
This is in answer to Paolo Castagna's question
> Yes, please, share your usage patterns.
Since you ask with such kind interest: OK, here it goes:
 
The original problem:
to offer textual search over all the literals in an RDF repository. 
300,000 to 30,000,000 literals, almost all of them localized; 
in about 30 languages, most of them Western. 
Indexing is run offline just after building/updating the repository, 
single-threaded with exclusive access to the repository; 
querying is online, while the repository is read-only for everyone; 
language for a given query is specified with the query. 
 
The LARQ part of the solution, I'm glad to acknowledge, 
is straightforward except for 2 details; more on this below.
The hirsute part is Analyzer management, a Lucene-only thing I won't discuss here.
 
I settled for an extension to only 2 LARQ classes, plus a 3rd class of my own:
1) class MultiLangIdxBuilder extends IndexBuilderNode; 
it has a member 
    private IndexWriter m_writer;
an override 
    IndexWriter getIndexWriter() {
        return m_writer;
    }
and a method 
    void examineStmtObject(Literal l) throws IOException {
        final String code = l.getLanguage();
        m_writer = this.getWriterFromLangCode(code);
    }
    // Hairy plumbing starts here:
    private IndexWriter getWriterFromLangCode(String code) {
        [...]

2) class IndexerRDF extends IndexBuilderSubject, 
so as to have its IndexBuilderModel.index assigned a MultiLangIdxBuilder at instantiation time.
 
The cool news is that this is enough to thoroughly confine LARQ. 
Callers that use only those 2 classes plus the following 3rd:
class MultiLangIndex {
    void setExecIndex(Context ctx, String code) {
        LARQ.setDefaultIndex(ctx, this.fromLangCode(code));
    }
    // More hairy plumbing here:
    private IndexLARQ fromLangCode(String code) {
        [...]
 
won't have to import any LARQ class; 
in their source code, that is. 
That's why it matters little in the end, 
whether LARQ comes from org.apache or from com.hp. 
 
The less-than-utterly-cool news is that my IndexerRDF class had to override 
the whole super.indexStatement() method.
Were it not for the crucial call to 
((MultiLangIdxBuilder) index).examineStmtObject(...),
I'd have been more than happy to just call the superclass. 
Here's the full snippet, an almost exact paraphrase as you can see:
    public void indexStatement(Statement s) {
        if ( ! indexThisStatement(s) ) return ;
        try {
            Node subject = s.getSubject().asNode() ;
            if ( ! s.getObject().isLiteral()) {
                return;
            }
            final Literal l = s.getLiteral();
            if ( ! LARQ.isString(l)) {
                return ;
            }
            getBuilder().examineStmtObject(l);
            getBuilder().index(subject, l.getLexicalForm());
        } catch (Exception e) {
            throw new ARQLuceneException("indexStatement", e) ;
        }
    }
    private MultiLangIdxBuilder getBuilder() {
        return (MultiLangIdxBuilder) index;
    }
The logic in indexStatement() is rather sophisticated, 
& I'd much rather use it than plagiarize it. 
Turn & toss it as much as I pleased, though, I found no way around it. 
 
Hence my 1st petition: would you consider having IndexBuilderSubject.indexStatement() 
systematically show the object of the statement to its (IndexBuilderNode) index, 
just prior to calling index.index(Node, String)? 
 
One way to achieve this is to pull up the above snippet to IndexBuilderSubject, 
& to add an overridable IndexBuilderNode.examineStmtObject() that does nothing by default. 
Regardless of the way, the idea is to give subclasses a chance 
to reconfigure themselves in anticipation of the call to index.index().
Note that it could be useful in other situations as well; 
e.g. when, unbeknown to LARQ, 
the object of the statement is actually a subclass of Literal, 
chosen in agreement with the (also subclassed) index.
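
To make the request concrete, here is a minimal sketch of the hook idea. It is built from plain stand-in classes, not the real LARQ types: NodeIndex, MultiLangNodeIndex, the static indexStatement() helper and the String-based signatures are all hypothetical, chosen only so that the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-ins only: none of these types are the real LARQ classes. */
final class HookSketch {

    /** Plays the part of IndexBuilderNode: the hook does nothing by default. */
    static class NodeIndex {
        final List<String> log = new ArrayList<>();

        /** Proposed hook; subclasses may override it to reconfigure themselves. */
        void examineStmtObject(String lexicalForm, String lang) {
            // intentionally empty
        }

        void index(String subject, String lexicalForm) {
            log.add(subject + " -> " + lexicalForm);
        }
    }

    /** Plays the part of MultiLangIdxBuilder: remembers the language it was shown. */
    static class MultiLangNodeIndex extends NodeIndex {
        private String currentLang = "";

        @Override
        void examineStmtObject(String lexicalForm, String lang) {
            currentLang = lang; // the real class would pick an IndexWriter here
        }

        @Override
        void index(String subject, String lexicalForm) {
            log.add("[" + currentLang + "] " + subject + " -> " + lexicalForm);
        }
    }

    /** Plays the part of IndexBuilderSubject.indexStatement(). */
    static void indexStatement(NodeIndex index, String subject,
                               String lexicalForm, String lang) {
        index.examineStmtObject(lexicalForm, lang); // show the object first
        index.index(subject, lexicalForm);          // then index as usual
    }

    public static void main(String[] args) {
        NodeIndex plain = new NodeIndex();
        indexStatement(plain, "ex:s1", "hello", "en");
        NodeIndex multi = new MultiLangNodeIndex();
        indexStatement(multi, "ex:s1", "bonjour", "fr");
        System.out.println(plain.log);
        System.out.println(multi.log);
    }
}
```

With a no-op default, existing subclasses would behave exactly as before; only a subclass that overrides the hook sees any difference.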
 
My 2nd petition is truly minor: would you consider allowing IndexBuilderBase.indexWriter to remain null, & allowing subclasses to assign it after construction? This would spare me the construction of a dummy IndexWriter complete with dummy Directory, & the overriding of getIndexWriter(). 
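
Again as a sketch only, with stand-in classes rather than the real LARQ or Lucene ones: Writer, BaseBuilder, PerLangBuilder, setWriter() and the "index-" naming are hypothetical. The point is merely that the base class tolerates a null writer until a subclass assigns one.

```java
/** Stand-ins only: these are not the real LARQ or Lucene classes. */
final class LateWriterSketch {

    /** Plays the part of Lucene's IndexWriter. */
    static class Writer {
        final String name;

        Writer(String name) {
            this.name = name;
        }
    }

    /** Plays the part of IndexBuilderBase, with the writer allowed to start out null. */
    static class BaseBuilder {
        private Writer writer; // stays null until a subclass assigns it

        protected void setWriter(Writer w) {
            writer = w;
        }

        protected Writer getWriter() {
            if (writer == null) {
                throw new IllegalStateException("no writer assigned yet");
            }
            return writer;
        }
    }

    /** A subclass that chooses its writer after construction, per language code. */
    static class PerLangBuilder extends BaseBuilder {
        void useLanguage(String code) {
            setWriter(new Writer("index-" + code));
        }

        String currentWriterName() {
            return getWriter().name;
        }
    }

    public static void main(String[] args) {
        PerLangBuilder builder = new PerLangBuilder();
        builder.useLanguage("fr");
        System.out.println(builder.currentWriterName());
    }
}
```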
 
What do you think of those requests? Thanks in advance for your feedback. 
 
Cheers, 
     François Jurain. 


