Posted to solr-user@lucene.apache.org by Shawn Heisey <so...@elyograg.org> on 2011/08/16 00:25:17 UTC
Indexing from a database via SolrJ
Is there a simple way to get all the fields from a jdbc resultset into a
bunch of SolrJ documents, which I will then send to be indexed in Solr?
I would like to avoid the looping required to copy the data one field at
a time. Copying it one document at a time would be acceptable, but it
would be nice if there was a way to copy them all at once.
Another idea that occurred to me is to add the dataimporter jar to my
project and leverage it to do the heavy lifting, but I will need some
pointers about what objects and methods to research. Is that a
reasonable idea, or is it too integrated into the server code to be used
with SolrJ?
Can anyone point me in the right direction?
Thanks,
Shawn
Re: Indexing from a database via SolrJ
Posted by Shawn Heisey <so...@elyograg.org>.
On 8/16/2011 1:12 PM, Shawn Heisey wrote:
> On 8/16/2011 11:23 AM, Erick Erickson wrote:
>> The problem with anything "automatic" is that I don't see how it
>> could know
>> which fields in the document to map DB columns to. Unless you had
>> fields that exactly matched column names, it would be iffy...
>>
>> I assume DIH actually does something like this, but don't know any way
>> of having SolrJ automagically do this.
>>
>> At root these kinds of things don't generalize well, but that doesn't
>> mean
>> that there's not a good case for doing this.
>
> In my case, the Solr field names are in perfect sync with the database
> field names. My DIH config doesn't mention any fields by name, it
> just passes them as-is and lets the schema handle everything. I'm
> perfectly OK with handling everything myself in my code, but if
> someone had already invented the wheel, no sense in designing a new
> one. :)
>
> Thanks for all your help, Erick.
Here's what I've ended up with in my method that takes a ResultSet and
puts the data into Solr. I have to get a testbed set up before I can
actually test this code, which will take me a while. I'm inviting
comment now, knowing it might have bugs. Eclipse is happy with it, but
that doesn't mean it works. :)
    /**
     * Takes an SQL ResultSet and adds the documents to Solr. Does it in
     * batches of fetchSize.
     *
     * @param rs the ResultSet to pull documents from
     * @return the total number of documents added
     * @throws SQLException
     * @throws IOException
     * @throws SolrServerException
     */
    private long addResultSet(ResultSet rs) throws SQLException,
            SolrServerException, IOException
    {
        long count = 0;
        int innerCount = 0;
        Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();

        ResultSetMetaData rsm = rs.getMetaData();
        int numColumns = rsm.getColumnCount();
        /* JDBC columns are 1-based, so index 0 of this array goes unused. */
        String[] colNames = new String[numColumns + 1];
        for (int i = 1; i < (numColumns + 1); i++)
        {
            colNames[i] = rsm.getColumnName(i);
        }
        while (rs.next())
        {
            count++;
            innerCount++;
            SolrInputDocument doc = new SolrInputDocument();
            for (int j = 1; j < (numColumns + 1); j++)
            {
                Object f;
                switch (rsm.getColumnType(j))
                {
                    case Types.BIGINT:
                    {
                        f = rs.getLong(j);
                        break;
                    }
                    case Types.INTEGER:
                    {
                        f = rs.getInt(j);
                        break;
                    }
                    case Types.DATE:
                    {
                        f = rs.getDate(j);
                        break;
                    }
                    case Types.FLOAT:
                    {
                        f = rs.getFloat(j);
                        break;
                    }
                    case Types.DOUBLE:
                    {
                        f = rs.getDouble(j);
                        break;
                    }
                    case Types.TIME:
                    {
                        f = rs.getTime(j);
                        break;
                    }
                    case Types.BOOLEAN:
                    {
                        f = rs.getBoolean(j);
                        break;
                    }
                    default:
                    {
                        f = rs.getString(j);
                    }
                }
                doc.addField(colNames[j], f);
            }
            docs.add(doc);

            /*
             * When we reach fetchSize, index the documents and reset the
             * inner counter.
             */
            if (innerCount == IdxStatic.fetchSize)
            {
                solrCore.add(docs);
                docs.clear();
                innerCount = 0;
            }
        }

        /*
         * If the outer loop ended before the inner loop reset, index the
         * remaining documents.
         */
        if (innerCount != 0)
        {
            solrCore.add(docs);
        }
        return count;
    }
Re: Indexing from a database via SolrJ
Posted by Shawn Heisey <so...@elyograg.org>.
On 8/16/2011 11:23 AM, Erick Erickson wrote:
> The problem with anything "automatic" is that I don't see how it could know
> which fields in the document to map DB columns to. Unless you had
> fields that exactly matched column names, it would be iffy...
>
> I assume DIH actually does something like this, but don't know any way
> of having SolrJ automagically do this.
>
> At root these kinds of things don't generalize well, but that doesn't mean
> that there's not a good case for doing this.
In my case, the Solr field names are in perfect sync with the database
field names. My DIH config doesn't mention any fields by name, it just
passes them as-is and lets the schema handle everything. I'm perfectly
OK with handling everything myself in my code, but if someone had
already invented the wheel, no sense in designing a new one. :)
Thanks for all your help, Erick.
Shawn
Re: Indexing from a database via SolrJ
Posted by Erick Erickson <er...@gmail.com>.
The problem with anything "automatic" is that I don't see how it could know
which fields in the document to map DB columns to. Unless you had
fields that exactly matched column names, it would be iffy...
I assume DIH actually does something like this, but don't know any way
of having SolrJ automagically do this.
At root these kinds of things don't generalize well, but that doesn't mean
that there's not a good case for doing this.
Best
Erick
On Tue, Aug 16, 2011 at 11:26 AM, Shawn Heisey <so...@elyograg.org> wrote:
> On 8/16/2011 7:14 AM, Erick Erickson wrote:
>>
>> What have you tried and what doesn't it do that you want it to do?
>>
>> This works, instantiating the StreamingUpdateSolrServer (server) and
>> the JDBC connection/SQL statement are left as exercises for the
>> reader<G>.:
>>
>> while (rs.next()) {
>> SolrInputDocument doc = new SolrInputDocument();
>>
>> String id = rs.getString("id");
>> String title = rs.getString("title");
>> String text = rs.getString("text");
>>
>> doc.addField("id", id);
>> doc.addField("title", title);
>> doc.addField("text", text);
>>
>> docs.add(doc);
>> ++counter;
>> ++total;
>> if (counter > 100) { // Completely arbitrary, just batch up more than one document for throughput!
>> server.add(docs);
>> docs.clear();
>> counter = 0;
>> }
>> }
>
> I've implemented a basic loop with the structure you've demonstrated, but it
> currently doesn't do anything yet with SolrInputDocument or
> SolrDocumentList. I figured there would be a way to avoid going through the
> field list one by one, but what you've written suggests that the
> field-by-field method is required. I can live with that.
>
> It does look like addField just takes an Object, so hopefully I can create a
> loop that determines the type of each field from the JDBC metadata,
> retrieves the correct Java type from the ResultSet, and inserts it. I
> imagine that everything still works if you happen to insert a field that
> doesn't exist in the index. This must be how the DIH does it, so I was
> hoping that the DIH might expose a method that takes a ResultSet and
> produces a SolrDocumentList. I still have to take a deeper look at the
> source and documentation.
>
> Thanks for the help so far, I can get a little more implemented now.
>
> Shawn
>
>
Re: Indexing from a database via SolrJ
Posted by Shawn Heisey <so...@elyograg.org>.
On 8/16/2011 7:14 AM, Erick Erickson wrote:
> What have you tried and what doesn't it do that you want it to do?
>
> This works, instantiating the StreamingUpdateSolrServer (server) and
> the JDBC connection/SQL statement are left as exercises for the
> reader<G>.:
>
> while (rs.next()) {
> SolrInputDocument doc = new SolrInputDocument();
>
> String id = rs.getString("id");
> String title = rs.getString("title");
> String text = rs.getString("text");
>
> doc.addField("id", id);
> doc.addField("title", title);
> doc.addField("text", text);
>
> docs.add(doc);
> ++counter;
> ++total;
> if (counter > 100) { // Completely arbitrary, just batch up more than one document for throughput!
> server.add(docs);
> docs.clear();
> counter = 0;
> }
> }
I've implemented a basic loop with the structure you've demonstrated,
but it currently doesn't do anything yet with SolrInputDocument or
SolrDocumentList. I figured there would be a way to avoid going through
the field list one by one, but what you've written suggests that the
field-by-field method is required. I can live with that.
It does look like addField just takes an Object, so hopefully I can
create a loop that determines the type of each field from the JDBC
metadata, retrieves the correct Java type from the ResultSet, and
inserts it. I imagine that everything still works if you happen to
insert a field that doesn't exist in the index. This must be how the
DIH does it, so I was hoping that the DIH might expose a method that
takes a ResultSet and produces a SolrDocumentList. I still have to take
a deeper look at the source and documentation.
Thanks for the help so far, I can get a little more implemented now.
Shawn
Re: Indexing from a database via SolrJ
Posted by Erick Erickson <er...@gmail.com>.
What have you tried and what doesn't it do that you want it to do?
This works, instantiating the StreamingUpdateSolrServer (server) and
the JDBC connection/SQL statement are left as exercises for the
reader <G>.:
while (rs.next()) {
    SolrInputDocument doc = new SolrInputDocument();

    String id = rs.getString("id");
    String title = rs.getString("title");
    String text = rs.getString("text");

    doc.addField("id", id);
    doc.addField("title", title);
    doc.addField("text", text);

    docs.add(doc);
    ++counter;
    ++total;
    if (counter > 100) { // Completely arbitrary, just batch up more than one document for throughput!
        server.add(docs);
        docs.clear();
        counter = 0;
    }
}
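[Editor's note: a minimal sketch of one possible shape for the parts left as an exercise above. The Solr URL, JDBC URL, credentials, and query are placeholders, and the StreamingUpdateSolrServer queue size and thread count are tuning guesses, not values from this thread.]

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;

public class Indexer
{
    public static void main(String[] args) throws Exception
    {
        // Queue up to 100 documents, flushed by 2 background threads.
        StreamingUpdateSolrServer server = new StreamingUpdateSolrServer(
                "http://localhost:8983/solr", 100, 2);

        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://dbhost/dbname", "user", "password");
        Statement st = conn.createStatement();
        // Hint the driver to stream rows instead of buffering the whole table.
        st.setFetchSize(1000);
        ResultSet rs = st.executeQuery("SELECT id, title, text FROM documents");

        // ... the while (rs.next()) loop from the message above goes here ...

        rs.close();
        st.close();
        conn.close();
        server.commit();
    }
}
```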
Best
Erick
On Mon, Aug 15, 2011 at 6:25 PM, Shawn Heisey <so...@elyograg.org> wrote:
> Is there a simple way to get all the fields from a jdbc resultset into a
> bunch of SolrJ documents, which I will then send to be indexed in Solr? I
> would like to avoid the looping required to copy the data one field at a
> time. Copying it one document at a time would be acceptable, but it would
> be nice if there was a way to copy them all at once.
>
> Another idea that occurred to me is to add the dataimporter jar to my
> project and leverage it to do the heavy lifting, but I will need some
> pointers about what objects and methods to research. Is that a reasonable
> idea, or is it too integrated into the server code to be used with SolrJ?
>
> Can anyone point me in the right direction?
>
> Thanks,
> Shawn
>
>
Re: Indexing from a database via SolrJ
Posted by Shawn Heisey <so...@elyograg.org>.
On 8/15/2011 5:55 PM, Arcadius Ahouansou wrote:
> Hi Shawn.
>
> Unless you are doing complex pre-processing before indexing, you may want to
> have a look at:
> http://wiki.apache.org/solr/DataImportHandler#Usage_with_RDBMS
>
> That should take care of it without any coding.
>
> You may need to periodically do an HTTP GET to trigger the import.
I'm aware of this, and my current build system written in Perl works
this way. When I need to do a full index rebuild, I will still use the
DIH, but it has become too limiting for regular indexing needs. It will
be inadequate for things that we have in development. We need more
flexibility, so I am wanting to handle the interface to the DB myself
and index directly with SolrJ.
Thanks,
Shawn
Re: Indexing from a database via SolrJ
Posted by Arcadius Ahouansou <ar...@menelic.com>.
Hi Shawn.
Unless you are doing complex pre-processing before indexing, you may want to
have a look at:
http://wiki.apache.org/solr/DataImportHandler#Usage_with_RDBMS
That should take care of it without any coding.
You may need to periodically do an HTTP GET to trigger the import.
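[Editor's note: a minimal sketch of that periodic trigger in plain Java. The Solr base URL is a placeholder, and the /dataimport handler path assumes the default DIH request handler name.]

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class DihTrigger
{
    // Build the trigger URL; kept as a separate method so it is easy to check.
    static String buildUrl(String solrBase, String command)
    {
        return solrBase + "/dataimport?command=" + command;
    }

    public static void main(String[] args) throws Exception
    {
        // "full-import" rebuilds everything; "delta-import" picks up changes.
        URL url = new URL(buildUrl("http://localhost:8983/solr", "delta-import"));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null)
        {
            System.out.println(line); // DIH returns an XML status response
        }
        in.close();
    }
}
```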
Arcadius.
On Mon, Aug 15, 2011 at 11:25 PM, Shawn Heisey <so...@elyograg.org> wrote:
> Is there a simple way to get all the fields from a jdbc resultset into a
> bunch of SolrJ documents, which I will then send to be indexed in Solr? I
> would like to avoid the looping required to copy the data one field at a
> time. Copying it one document at a time would be acceptable, but it would
> be nice if there was a way to copy them all at once.
>
> Another idea that occurred to me is to add the dataimporter jar to my
> project and leverage it to do the heavy lifting, but I will need some
> pointers about what objects and methods to research. Is that a reasonable
> idea, or is it too integrated into the server code to be used with SolrJ?
>
> Can anyone point me in the right direction?
>
> Thanks,
> Shawn
>
>
--
W: www.menelic.com
---