Posted to solr-user@lucene.apache.org by Erik Hatcher <er...@ehatchersolutions.com> on 2009/07/09 16:27:16 UTC

DIH: URLDataSource and incremental indexing

I'm exploring other ways of getting data into Solr via  
DataImportHandler than through a relational database, particularly the  
URLDataSource.

I see the special commands for deleting by id and query as well as the  
$hasMore/$nextUrl techniques, but I'm unclear on exactly how one would  
go about designing a data source over HTTP that worked cleanly for  
full importing and also for delta indexing.

For sake of argument, suppose I have /data.xml[?since=<some timestamp>] 
[&start=X&rows=Y] and it could return documents in Solr XML (or really  
any basic format) since the last time it was updated (or all records  
if no since parameter is provided).  And the service could also return  
which records to remove since that timestamp too.  Can I get there  
from here using URLDataSource?
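For illustration, the kind of payload I have in mind would look something like this (all element names invented, not any real format):

```xml
<records>
  <record>
    <id>doc1</id>
    <title>Updated document</title>
    <status>modified</status>
  </record>
  <record>
    <id>doc2</id>
    <status>deleted</status>
  </record>
  <hasMore>true</hasMore>
  <nextStart>100</nextStart>
</records>
```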

Have folks been doing this?  If so, anyone care to share some basic  
tips/tricks/examples?

Thanks,
	Erik


Re: DIH: URLDataSource and incremental indexing

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
Hi Erik,
DIH is designed to handle this with a Transformer.

I am assuming that your API returns the delta as deleted/modified/added documents.

Always run a full-import with clean=false. Depending on the values
returned by the API, your Transformer can use $deleteDocById for
deletes, etc.
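A minimal sketch of what that could look like in data-config.xml (the endpoint URL, the feed's field names, and the "deleted" status value are assumptions about your service, not anything DIH mandates):

```xml
<dataConfig>
  <dataSource type="URLDataSource" />
  <script><![CDATA[
    function markDeletes(row) {
      // If the feed flags a record as deleted, hand its id to DIH's
      // $deleteDocById special command instead of indexing it.
      if (row.get('status') == 'deleted') {
        row.put('$deleteDocById', row.get('id'));
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="delta"
            processor="XPathEntityProcessor"
            url="http://example.com/data.xml?since=${dih.last_index_time}"
            forEach="/records/record"
            transformer="script:markDeletes">
      <field column="id"     xpath="/records/record/id" />
      <field column="title"  xpath="/records/record/title" />
      <field column="status" xpath="/records/record/status" />
    </entity>
  </document>
</dataConfig>
```

Running this as a full-import with clean=false leaves existing documents in place and applies only the adds/updates/deletes the feed reports.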

$nextUrl and $hasMore can also be used to fetch the data in batches.
Again, these variables can be generated and put into the row by the
Transformer.

We did this for one of our internal message-board APIs using a
JavaScript transformer. You can do it with a Java Transformer as
well.
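For the paging part, the script transformer only has to put $hasMore and $nextUrl into a row; DIH will then fetch the next URL and keep going. A sketch (again, the "hasMore"/"nextStart" fields and the URL shape are assumed properties of your feed):

```xml
<script><![CDATA[
  function paginate(row) {
    // If the feed says there is more data, tell DIH where to fetch
    // the next batch from via the $nextUrl/$hasMore special commands.
    if (row.get('hasMore') == 'true') {
      row.put('$hasMore', 'true');
      row.put('$nextUrl', 'http://example.com/data.xml?start='
                          + row.get('nextStart') + '&rows=100');
    }
    return row;
  }
]]></script>
```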

On Thu, Jul 9, 2009 at 7:57 PM, Erik Hatcher<er...@ehatchersolutions.com> wrote:



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com