Posted to solr-user@lucene.apache.org by Shawn Heisey <so...@elyograg.org> on 2011/08/16 00:25:17 UTC
Indexing from a database via SolrJ
Is there a simple way to get all the fields from a jdbc resultset into a
bunch of SolrJ documents, which I will then send to be indexed in Solr?
I would like to avoid the looping required to copy the data one field at
a time. Copying it one document at a time would be acceptable, but it
would be nice if there was a way to copy them all at once.
Another idea that occurred to me is to add the dataimporter jar to my
project and leverage it to do the heavy lifting, but I will need some
pointers about what objects and methods to research. Is that a
reasonable idea, or is it too integrated into the server code to be used
with SolrJ?
Can anyone point me in the right direction?
Thanks,
Shawn
Re: Indexing from a database via SolrJ
Posted by Shawn Heisey <so...@elyograg.org>.
On 8/16/2011 1:12 PM, Shawn Heisey wrote:
> On 8/16/2011 11:23 AM, Erick Erickson wrote:
>> The problem with anything "automatic" is that I don't see how it
>> could know
>> which fields in the document to map DB columns to. Unless you had
>> fields that exactly matched column names, it would be iffy...
>>
>> I assume DIH actually does something like this, but don't know any way
>> of having SolrJ automagically do this.
>>
>> At root these kinds of things don't generalize well, but that doesn't
>> mean
>> that there's not a good case for doing this.
>
> In my case, the Solr field names are in perfect sync with the database
> field names. My DIH config doesn't mention any fields by name, it
> just passes them as-is and lets the schema handle everything. I'm
> perfectly OK with handling everything myself in my code, but if
> someone had already invented the wheel, no sense in designing a new
> one. :)
>
> Thanks for all your help, Erick.
Here's what I've ended up with in my method that takes a ResultSet and
puts the data into Solr. I have to get a testbed set up before I can
actually test this code, which will take me a while. I'm inviting
comment now, knowing it might have bugs. Eclipse is happy with it, but
that doesn't mean it works. :)
    /**
     * Takes an SQL ResultSet and adds the documents to Solr. Does it in
     * batches of fetchSize.
     *
     * @param rs the ResultSet to pull documents from
     * @return the total number of documents added
     * @throws SQLException
     * @throws IOException
     * @throws SolrServerException
     */
    private long addResultSet(ResultSet rs) throws SQLException,
            SolrServerException, IOException
    {
        long count = 0;
        int innerCount = 0;
        Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();

        ResultSetMetaData rsm = rs.getMetaData();
        int numColumns = rsm.getColumnCount();
        /* JDBC columns are 1-based, so index 0 of this array goes unused. */
        String[] colNames = new String[numColumns + 1];
        for (int i = 1; i < (numColumns + 1); i++)
        {
            colNames[i] = rsm.getColumnName(i);
        }
        while (rs.next())
        {
            count++;
            innerCount++;
            SolrInputDocument doc = new SolrInputDocument();
            for (int j = 1; j < (numColumns + 1); j++)
            {
                Object f;
                switch (rsm.getColumnType(j))
                {
                    case Types.BIGINT:
                    {
                        f = rs.getLong(j);
                        break;
                    }
                    case Types.INTEGER:
                    {
                        f = rs.getInt(j);
                        break;
                    }
                    case Types.DATE:
                    {
                        f = rs.getDate(j);
                        break;
                    }
                    case Types.FLOAT:
                    {
                        f = rs.getFloat(j);
                        break;
                    }
                    case Types.DOUBLE:
                    {
                        f = rs.getDouble(j);
                        break;
                    }
                    case Types.TIME:
                    {
                        f = rs.getTime(j);
                        break;
                    }
                    case Types.BOOLEAN:
                    {
                        f = rs.getBoolean(j);
                        break;
                    }
                    default:
                    {
                        f = rs.getString(j);
                    }
                }
                doc.addField(colNames[j], f);
            }
            docs.add(doc);

            /*
             * When we reach fetchSize, index the documents and reset the
             * inner counter.
             */
            if (innerCount == IdxStatic.fetchSize)
            {
                solrCore.add(docs);
                docs.clear();
                innerCount = 0;
            }
        }

        /*
         * If the outer loop ended before the inner loop reset, index the
         * remaining documents.
         */
        if (innerCount != 0)
        {
            solrCore.add(docs);
        }
        return count;
    }
Re: Indexing from a database via SolrJ
Posted by Shawn Heisey <so...@elyograg.org>.
On 8/16/2011 11:23 AM, Erick Erickson wrote:
> The problem with anything "automatic" is that I don't see how it could know
> which fields in the document to map DB columns to. Unless you had
> fields that exactly matched column names, it would be iffy...
>
> I assume DIH actually does something like this, but don't know any way
> of having SolrJ automagically do this.
>
> At root these kinds of things don't generalize well, but that doesn't mean
> that there's not a good case for doing this.
In my case, the Solr field names are in perfect sync with the database
field names. My DIH config doesn't mention any fields by name, it just
passes them as-is and lets the schema handle everything. I'm perfectly
OK with handling everything myself in my code, but if someone had
already invented the wheel, no sense in designing a new one. :)
Thanks for all your help, Erick.
Shawn
Re: Indexing from a database via SolrJ
Posted by Erick Erickson <er...@gmail.com>.
The problem with anything "automatic" is that I don't see how it could know
which fields in the document to map DB columns to. Unless you had
fields that exactly matched column names, it would be iffy...
I assume DIH actually does something like this, but don't know any way
of having SolrJ automagically do this.
At root these kinds of things don't generalize well, but that doesn't mean
that there's not a good case for doing this.
Best
Erick
On Tue, Aug 16, 2011 at 11:26 AM, Shawn Heisey <so...@elyograg.org> wrote:
> On 8/16/2011 7:14 AM, Erick Erickson wrote:
>>
>> What have you tried and what doesn't it do that you want it to do?
>>
>> This works, instantiating the StreamingUpdateSolrServer (server) and
>> the JDBC connection/SQL statement are left as exercises for the
>> reader<G>.:
>>
>> while (rs.next()) {
>> SolrInputDocument doc = new SolrInputDocument();
>>
>> String id = rs.getString("id");
>> String title = rs.getString("title");
>> String text = rs.getString("text");
>>
>> doc.addField("id", id);
>> doc.addField("title", title);
>> doc.addField("text", text);
>>
>> docs.add(doc);
>> ++counter;
>> ++total;
>> if (counter > 100) { // Completely arbitrary, just batch up more than one document for throughput!
>> server.add(docs);
>> docs.clear();
>> counter = 0;
>> }
>> }
>
> I've implemented a basic loop with the structure you've demonstrated, but it
> currently doesn't do anything yet with SolrInputDocument or
> SolrDocumentList. I figured there would be a way to avoid going through the
> field list one by one, but what you've written suggests that the
> field-by-field method is required. I can live with that.
>
> It does look like addField just takes an Object, so hopefully I can create a
> loop that determines the type of each field from the JDBC metadata,
> retrieves the correct Java type from the ResultSet, and inserts it. I
> imagine that everything still works if you happen to insert a field that
> doesn't exist in the index. This must be how the DIH does it, so I was
> hoping that the DIH might expose a method that takes a ResultSet and
> produces a SolrDocumentList. I still have to take a deeper look at the
> source and documentation.
>
> Thanks for the help so far, I can get a little more implemented now.
>
> Shawn
>
>
Re: Indexing from a database via SolrJ
Posted by Shawn Heisey <so...@elyograg.org>.
On 8/16/2011 7:14 AM, Erick Erickson wrote:
> What have you tried and what doesn't it do that you want it to do?
>
> This works, instantiating the StreamingUpdateSolrServer (server) and
> the JDBC connection/SQL statement are left as exercises for the
> reader<G>.:
>
> while (rs.next()) {
> SolrInputDocument doc = new SolrInputDocument();
>
> String id = rs.getString("id");
> String title = rs.getString("title");
> String text = rs.getString("text");
>
> doc.addField("id", id);
> doc.addField("title", title);
> doc.addField("text", text);
>
> docs.add(doc);
> ++counter;
> ++total;
> if (counter > 100) { // Completely arbitrary, just batch up more than one document for throughput!
> server.add(docs);
> docs.clear();
> counter = 0;
> }
> }
I've implemented a basic loop with the structure you've demonstrated,
but it currently doesn't do anything yet with SolrInputDocument or
SolrDocumentList. I figured there would be a way to avoid going through
the field list one by one, but what you've written suggests that the
field-by-field method is required. I can live with that.
It does look like addField just takes an Object, so hopefully I can
create a loop that determines the type of each field from the JDBC
metadata, retrieves the correct Java type from the ResultSet, and
inserts it. I imagine that everything still works if you happen to
insert a field that doesn't exist in the index. This must be how the
DIH does it, so I was hoping that the DIH might expose a method that
takes a ResultSet and produces a SolrDocumentList. I still have to take
a deeper look at the source and documentation.
Thanks for the help so far, I can get a little more implemented now.
Shawn
Re: Indexing from a database via SolrJ
Posted by Erick Erickson <er...@gmail.com>.
What have you tried and what doesn't it do that you want it to do?
This works, instantiating the StreamingUpdateSolrServer (server) and
the JDBC connection/SQL statement are left as exercises for the
reader <G>.:
while (rs.next()) {
    SolrInputDocument doc = new SolrInputDocument();

    String id = rs.getString("id");
    String title = rs.getString("title");
    String text = rs.getString("text");

    doc.addField("id", id);
    doc.addField("title", title);
    doc.addField("text", text);

    docs.add(doc);
    ++counter;
    ++total;
    if (counter > 100) { // Completely arbitrary, just batch up more than one document for throughput!
        server.add(docs);
        docs.clear();
        counter = 0;
    }
}
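[Editor's note: a minimal sketch of one possible shape for the parts left as an exercise above. The Solr URL, JDBC URL, credentials, and query are placeholders, and the StreamingUpdateSolrServer queue size and thread count are tuning guesses, not values from this thread.]

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;

public class Indexer
{
    public static void main(String[] args) throws Exception
    {
        // Queue up to 100 documents, flushed by 2 background threads.
        StreamingUpdateSolrServer server = new StreamingUpdateSolrServer(
                "http://localhost:8983/solr", 100, 2);

        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://dbhost/dbname", "user", "password");
        Statement st = conn.createStatement();
        // Hint the driver to stream rows instead of buffering the whole table.
        st.setFetchSize(1000);
        ResultSet rs = st.executeQuery("SELECT id, title, text FROM documents");

        // ... the while (rs.next()) loop from the message above goes here ...

        rs.close();
        st.close();
        conn.close();
        server.commit();
    }
}
```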
Best
Erick
On Mon, Aug 15, 2011 at 6:25 PM, Shawn Heisey <so...@elyograg.org> wrote:
> Is there a simple way to get all the fields from a jdbc resultset into a
> bunch of SolrJ documents, which I will then send to be indexed in Solr? I
> would like to avoid the looping required to copy the data one field at a
> time. Copying it one document at a time would be acceptable, but it would
> be nice if there was a way to copy them all at once.
>
> Another idea that occurred to me is to add the dataimporter jar to my
> project and leverage it to do the heavy lifting, but I will need some
> pointers about what objects and methods to research. Is that a reasonable
> idea, or is it too integrated into the server code to be used with SolrJ?
>
> Can anyone point me in the right direction?
>
> Thanks,
> Shawn
>
>
Re: Indexing from a database via SolrJ
Posted by Shawn Heisey <so...@elyograg.org>.
On 8/15/2011 5:55 PM, Arcadius Ahouansou wrote:
> Hi Shawn.
>
> Unless you are doing complex pre-processing before indexing, you may want to
> have a look at:
> http://wiki.apache.org/solr/DataImportHandler#Usage_with_RDBMS
>
> That should take care of it without any coding.
>
> You may need to periodically do an HTTP GET to trigger the import.
I'm aware of this, and my current build system written in Perl works
this way. When I need to do a full index rebuild, I will still use the
DIH, but it has become too limiting for regular indexing needs. It will
be inadequate for things that we have in development. We need more
flexibility, so I am wanting to handle the interface to the DB myself
and index directly with SolrJ.
Thanks,
Shawn
Re: Indexing from a database via SolrJ
Posted by Arcadius Ahouansou <ar...@menelic.com>.
Hi Shawn.
Unless you are doing complex pre-processing before indexing, you may want to
have a look at:
http://wiki.apache.org/solr/DataImportHandler#Usage_with_RDBMS
That should take care of it without any coding.
You may need to periodically do an HTTP GET to trigger the import.
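[Editor's note: a minimal sketch of that periodic trigger in plain Java. The Solr base URL is a placeholder, and the /dataimport handler path assumes the default DIH request handler name.]

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class DihTrigger
{
    // Build the trigger URL; kept as a separate method so it is easy to check.
    static String buildUrl(String solrBase, String command)
    {
        return solrBase + "/dataimport?command=" + command;
    }

    public static void main(String[] args) throws Exception
    {
        // "full-import" rebuilds everything; "delta-import" picks up changes.
        URL url = new URL(buildUrl("http://localhost:8983/solr", "delta-import"));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null)
        {
            System.out.println(line); // DIH returns an XML status response
        }
        in.close();
    }
}
```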
Arcadius.
On Mon, Aug 15, 2011 at 11:25 PM, Shawn Heisey <so...@elyograg.org> wrote:
> Is there a simple way to get all the fields from a jdbc resultset into a
> bunch of SolrJ documents, which I will then send to be indexed in Solr? I
> would like to avoid the looping required to copy the data one field at a
> time. Copying it one document at a time would be acceptable, but it would
> be nice if there was a way to copy them all at once.
>
> Another idea that occurred to me is to add the dataimporter jar to my
> project and leverage it to do the heavy lifting, but I will need some
> pointers about what objects and methods to research. Is that a reasonable
> idea, or is it too integrated into the server code to be used with SolrJ?
>
> Can anyone point me in the right direction?
>
> Thanks,
> Shawn
>
>
--
W: www.menelic.com
---