Posted to solr-user@lucene.apache.org by Glen Newton <gl...@gmail.com> on 2009/04/29 16:08:39 UTC

Re: Advice on custom DIH or other solutions: LuSql

The next version of LuSql[1] supports a solution for this kind of
issue: reading from JDBC (which may include a long and complex query)
and then writing the results to a single (flattened) JDBC table that
can subsequently serve as the source table for Solr. This might be
helpful in your particular case.
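
As a rough sketch of the flattening step (plain JDBC with made-up
connection details, tables, and columns; LuSql wraps this kind of loop
for you):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class FlattenToStaging {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details and schema, for illustration only
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb", "user", "password");

        // The long, complex query: many joins collapsed into one result set
        Statement read = con.createStatement();
        ResultSet rs = read.executeQuery(
                "SELECT a.id, a.title, au.name AS author "
                + "FROM article a JOIN author au ON au.id = a.author_id");

        // Write each row into a single flat staging table that Solr
        // (via DIH or LuSql) can then read with a trivial SELECT
        PreparedStatement write = con.prepareStatement(
                "INSERT INTO solr_staging (id, title, author) VALUES (?, ?, ?)");
        while (rs.next()) {
            write.setLong(1, rs.getLong("id"));
            write.setString(2, rs.getString("title"));
            write.setString(3, rs.getString("author"));
            write.executeUpdate();
        }
        con.close();
    }
}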

Since I am talking about the next version (0.93) of LuSql, I should
describe it better:
The first version (0.9) used JDBC as a source and Lucene as a sink.
The sink portion was pluggable, so destinations other than Lucene
indexes were possible.
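
To give a feel for the pluggable design, a sink boils down to
something like this (a hypothetical interface just to illustrate the
shape of the plug-in point; LuSql's actual class names and method
signatures differ):

import org.apache.lucene.document.Document;

// Hypothetical sketch of a pluggable sink, for illustration only
public interface DocumentSink {
    void init(String config) throws Exception;   // open the index/connection/file
    void handle(Document doc) throws Exception;  // write one document
    void close() throws Exception;               // flush and release resources
}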

The new version of LuSql also has a pluggable source, and the
sources/sinks implemented (or to be implemented for the release) are:
Sources:
- JDBC
- Lucene
- BDB (from LuSql)
- Serialized documents (from LuSql)
- http (as client)
- http (as restful server: not done yet)
- RMI client
- Minion[2] (not done yet)
- Terrier[3] (not done yet)

Sinks:
- Lucene
- BDB
- JDBC
- RMI server
- SolrJ
- XML
- Serialized documents
- Minion (not done yet)
- Terrier (not done yet)
- Lemur[4] (not done yet)


So LuSql has evolved from a JDBC-to-Lucene tool into a more general
tool for the transformation of document-like (in the Lucene sense of
Document) data objects.

For example, take the above user's use case: for whatever reason, the
JDBC connection is too slow or takes too long to complete. Use LuSql
to convert the JDBC source into BDB; then use the BDB (a fast local
file) either directly, or through LuSql to another sink, say SolrJ
writing to Solr.
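
The last hop, pushing documents to Solr via the SolrJ sink, looks
roughly like this (2009-era SolrJ API; the URL and field names are
made up):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PushToSolr {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr instance on the default port
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "article-1");
        doc.addField("title", "An example title");
        doc.addField("author", "An example author");

        server.add(doc);   // queue the document
        server.commit();   // make it searchable
    }
}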

LuSql will also be useful to information retrieval researchers, who
may want to quickly compare different IR tools on the same corpus.

I am finishing up the implementation this week, then moving on to
testing and, the hardest part, updating the documentation. I am
looking at 3-4 weeks before an RC1 release.

If you have any questions or suggestions for sources/sinks, please
contact me.

thanks,

Glen

glen.newton@gmail.com

[1]http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
[2]https://minion.dev.java.net/
[3]http://ir.dcs.gla.ac.uk/terrier/
[4]http://www.lemurproject.org/lemur/

2009/4/29 Noble Paul നോബിള്‍  नोब्ळ् <no...@gmail.com>:
> On Wed, Apr 29, 2009 at 3:24 PM, Wouter Samaey <wo...@gmail.com> wrote:
>> Hi there,
>>
>> I'm currently in the process of learning more about Solr, and how I
>> can implement it into my project.
>>
>> Since my database is very large and complex, I'm looking into the best
>> way of keeping my documents current in Solr. I have read the pages about
>> DIH and find it useful, but I may need more logic to filter out
>> documents or manipulate them. In order to use DIH, I'd need to run
>> huge queries and joins...
>>
>> Now, I see several ways of going forward:
>>
>> - customize DIH with new classes so I can read directly from my
>> RDBMS (will be slow)
>> - let the webapp build an XML file, and simply take that as a datasource
>> instead of the RDBMS (fewer queries, and I can use memcached for the
>> heavy stuff)
>> - let the webapp instruct Solr to add, update or remove a document as
>> changes occur in real time, instead of the DIH delta queries. For
>> loading from scratch, I'll still need to find a solution like the
>> ones above. (webapp drives Solr directly, instead of DIH polling)
>>
>> Is there some general advice you can give? I understand every app is
>> different, but this must be an issue many have considered before.
>>
>> Kind regards
>>
>> Wouter Samaey
>>
> The disadvantage of DIH pulling data out of your db could be that
> complex queries take a long time. The best strategy, as I see it, is to
> maintain a simple temp db where your app can write rows as you generate
> data. Periodically, ask DIH to read from this temp DB and update the
> index. This approach is good even when you wish to rebuild the index.
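
(As an aside: this temp-db strategy maps onto a DIH data-config along
these lines. The table, columns, and connection details are invented,
and the exact delta attributes depend on your Solr version:

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="user" password="password"/>
  <document>
    <!-- Full import: a trivial SELECT over the flat staging table -->
    <entity name="staged"
            query="SELECT id, title, author FROM solr_staging"
            deltaQuery="SELECT id FROM solr_staging
                        WHERE last_modified &gt; '${dataimporter.last_index_time}'">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="author" name="author"/>
    </entity>
  </document>
</dataConfig>
)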
>
>
> --
> --Noble Paul
>


