Posted to solr-user@lucene.apache.org by Brian Lamb <br...@journalexperts.com> on 2011/12/11 21:54:11 UTC

MySQL data import

Hi all,

I have a few questions about how the MySQL data import works. It seems it
creates a separate connection for each entity I create. Is there any way to
avoid this?

By nature of my schema, I have several multivalued fields. Each one I
populate with a separate entity. Is there a better way to do it? For
example, could I pull in all the singular data in one sitting and then come
back later and populate the multivalued items?

An alternate approach in some cases would be to do a GROUP_CONCAT and then
populate the multivalued column with some transformation. Is that possible?

Lastly, is it possible to use copyField to copy three regular fields into
one multiValued field and have all the data show up?

Thanks,

Brian Lamb

Re: MySQL data import

Posted by Shawn Heisey <so...@elyograg.org>.
On 12/11/2011 1:54 PM, Brian Lamb wrote:
> By nature of my schema, I have several multivalued fields. Each one I
> populate with a separate entity. Is there a better way to do it? For
> example, could I pull in all the singular data in one sitting and then come
> back in later and populate with the multivalued items.
>
> An alternate approach in some cases would be to do a GROUP_CONCAT and then
> populate the multivalued column with some transformation. Is that possible?
>
> Lastly, is it possible to use copyField to copy three regular fields into
> one multiValued field and have all the data show up?

The best way to proceed may depend on whether you actually need the 
field to be multivalued (returning an array in search results), or if 
you simply need to be able to search on all the values.  For me, it's 
the latter - the field isn't stored.

I use the GROUP_CONCAT method (hidden in a database view, so Solr 
doesn't need to know about it) to put multiple values into a field, 
separated by semicolons.  I then use the following single-valued 
fieldType to split those up and make all the values searchable. The
tokenizer splits on semicolons followed by zero or more spaces; the
pattern filter strips leading and trailing punctuation from each token.
The ICU filter is basically a better implementation of the ASCII folding
filter and the lowercase filter in a single pass. The others are
fairly self-explanatory:

<!-- lowercases and tokenizes on semicolons -->
<fieldType name="lcsemi" class="solr.TextField" sortMissingLast="true"
    positionIncrementGap="0" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="; *"/>
    <!-- strip leading and trailing punctuation from each token -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
            replacement="$2"
            allowempty="false"/>
    <!-- ASCII folding plus lowercasing in a single pass -->
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
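
For illustration, the hidden view might look something like this (the table
and column names here are invented; adjust to your own schema):

CREATE VIEW item_for_solr AS
SELECT i.id,
       i.title,
       GROUP_CONCAT(t.tag SEPARATOR '; ') AS tags
FROM item i
LEFT JOIN item_tag t ON t.item_id = i.id
GROUP BY i.id, i.title;

The dataimport entity then simply selects from item_for_solr, and the
lcsemi type above splits the tags column back into tokens at index time.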

If you actually do need the field to be multivalued, then you'll need to
do a dataimport transformation, as Gora described in his reply.

Thanks,
Shawn


Re: MySQL data import

Posted by Brian Lamb <br...@journalexperts.com>.
Hi all,

Any tips on this one?

Thanks,

Brian Lamb

On Sun, Dec 11, 2011 at 3:54 PM, Brian Lamb
<br...@journalexperts.com>wrote:

> Hi all,
>
> I have a few questions about how the MySQL data import works. It seems it
> creates a separate connection for each entity I create. Is there any way to
> avoid this?
>
> By nature of my schema, I have several multivalued fields. Each one I
> populate with a separate entity. Is there a better way to do it? For
> example, could I pull in all the singular data in one sitting and then come
> back in later and populate with the multivalued items.
>
> An alternate approach in some cases would be to do a GROUP_CONCAT and then
> populate the multivalued column with some transformation. Is that possible?
>
> Lastly, is it possible to use copyField to copy three regular fields into
> one multiValued field and have all the data show up?
>
> Thanks,
>
> Brian Lamb
>

Re: MySQL data import

Posted by Erick Erickson <er...@gmail.com>.
Here's a quick demo I wrote at one point. I haven't run it in a while,
but you should be able to get the idea.


package jdbc;


import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.common.SolrInputDocument;

import java.io.IOException;
import java.sql.*;
import java.util.ArrayList;
import java.util.Collection;


public class Indexer {
  public static void main(String[] args) {
    startIndex("http://localhost:8983/solr");
  }

  private static void startIndex(String url) {
    Connection con = DataSource.getConnection();
    try {

      long start = System.currentTimeMillis();
      // Create a multi-threaded communications channel to the Solr server.
      // Full interface (3.3) at:
      // http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html
      StreamingUpdateSolrServer server =
          new StreamingUpdateSolrServer(url, 10, 4);

      // You may want to set these timeouts higher; Solr occasionally
      // has long pauses while segments merge.
      server.setSoTimeout(1000);  // socket read timeout
      server.setConnectionTimeout(100);
      //server.setDefaultMaxConnectionsPerHost(100);
      //server.setMaxTotalConnections(100);
      //server.setFollowRedirects(false);  // defaults to false
      // allowCompression defaults to false.
      // Server side must support gzip or deflate for this to have any effect.
      //server.setAllowCompression(true);
      server.setMaxRetries(1); // defaults to 0.  > 1 not recommended.
      server.setParser(new XMLResponseParser()); // binary parser is used by default

      doDocuments(server, con);
      server.commit(); // Only needs to be done at the end; autocommit
                       // or commitWithin should do the rest.
      long endTime = System.currentTimeMillis();
      System.out.println("Total Time Taken->" + (endTime - start) + " ms");

    } catch (Exception e) {
      e.printStackTrace();
      String msg = e.getMessage();
      System.out.println(msg);
    } finally {
      DataSource.closeConnection(con); // release the JDBC connection when done
    }
  }

  private static void doDocuments(StreamingUpdateSolrServer server,
      Connection con) throws SQLException, IOException, SolrServerException {

    Statement st = con.createStatement();
    ResultSet rs = st.executeQuery("select id,title,text from test");

    // SolrInputDocument interface (3.3) at
    // http://lucene.apache.org/solr/api/org/apache/solr/common/SolrInputDocument.html
    Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
    int total = 0;
    int counter = 0;

    while (rs.next()) {
      // DO NOT move this outside the while loop, or be sure to call doc.clear()
      SolrInputDocument doc = new SolrInputDocument();

      String id = rs.getString("id");
      String title = rs.getString("title");
      String text = rs.getString("text");

      doc.addField("id", id);
      doc.addField("title", title);
      doc.addField("text", text);

      docs.add(doc);
      ++counter;
      ++total;
      // The batch size is completely arbitrary; just send more than one
      // document at a time for throughput!
      if (counter > 1000) {
        server.add(docs);
        docs.clear();
        counter = 0;
      }
    }
    if (!docs.isEmpty()) {
      server.add(docs); // send the final partial batch
    }
    rs.close();
    st.close();
    System.out.println("Total " + total + " Docs added successfully");
  }
}

// Trivial class showing connecting to a MySql database server via jdbc...
class DataSource {
  public static Connection getConnection() {
    Connection conn = null;
    try {

      Class.forName("com.mysql.jdbc.Driver").newInstance();
      System.out.println("Driver Loaded......");
      conn = DriverManager.getConnection("jdbc:mysql://172.16.0.169:3306/test?"
          + "user=testuser&password=test123");
      System.out.println("Connection built......");
    } catch (Exception ex) {
      System.out.println(ex);
    }
    return conn;
  }

  public static void closeConnection(Connection con) {
    try {
      if (con != null)
        con.close();
    } catch (SQLException e) {
      e.printStackTrace();
    }
  }
}
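
For reference, the autocommit alternative mentioned in the comments above
is configured in solrconfig.xml with something like this (the values are
illustrative, not recommendations):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime> <!-- milliseconds -->
  </autoCommit>
</updateHandler>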

On Mon, Dec 12, 2011 at 2:57 PM, Brian Lamb
<br...@journalexperts.com> wrote:
> Thanks all. Erick, is there documentation on doing things with SolrJ and a
> JDBC connection?
>
> On Mon, Dec 12, 2011 at 1:34 PM, Erick Erickson <er...@gmail.com>wrote:
>
>> You might want to consider just doing the whole
>> thing in SolrJ with a JDBC connection. When things
>> get complex, it's sometimes more straightforward.
>>
>> Best
>> Erick...
>>
>> P.S. Yes, it's pretty standard to have a single
>> field be the destination for several copyField
>> directives.
>>
>> On Mon, Dec 12, 2011 at 12:48 PM, Gora Mohanty <go...@mimirtech.com> wrote:
>> > On Mon, Dec 12, 2011 at 2:24 AM, Brian Lamb
>> > <br...@journalexperts.com> wrote:
>> >> Hi all,
>> >>
>> >> I have a few questions about how the MySQL data import works. It seems
>> it
>> >> creates a separate connection for each entity I create. Is there any
>> way to
>> >> avoid this?
>> >
>> > Not sure, but I do not think that it is possible. However, from your
>> description
>> > below, I think that you are unnecessarily multiplying entities.
>> >
>> >> By nature of my schema, I have several multivalued fields. Each one I
>> >> populate with a separate entity. Is there a better way to do it? For
>> >> example, could I pull in all the singular data in one sitting and then
>> come
>> >> back in later and populate with the multivalued items.
>> >
>> > Not quite sure as to what you mean. Would it be possible for you
>> > to post your schema.xml, and the DIH configuration file? Preferably,
>> > put these on pastebin.com, and send us links. Also, you should
>> > obfuscate details like access passwords.
>> >
>> >> An alternate approach in some cases would be to do a GROUP_CONCAT and
>> then
>> >> populate the multivalued column with some transformation. Is that
>> possible?
>> > [...]
>> >
>> > This is how we have been handling it. A complete description would
>> > be long, but here is the gist of it:
>> > * A transformer will be needed. In this case, we found it easiest
>> >  to use a Java-based transformer. Thus, your entity should include
>> >  something like
>> >  <entity name="myname" dataSource="mysource"
>> > transformer="com.mycompany.search.solr.handler.JobsNumericTransformer...>
>> >  ...
>> >  </entity>
>> >  Here, the class name to be used for the transformer attribute follows
>> >  the usual Java rules, and the .jar needs to be made available to Solr.
>> > * The SELECT statement for the entity looks something like
>> >  select group_concat( myfield SEPARATOR '@||@')...
>> >  The separator should be something that does not occur in your
>> >  normal data stream.
>> > * Within the entity, define
>> >   <field column="myfield"/>
>> > * There are complications involved if NULL values are allowed
>> >   for the field, in which case you would need to use COALESCE,
>> >   maybe along with CAST
>> > * The transformer would look up "myfield", split along the separator,
>> >   and populate the multi-valued field.
>> >
>> > This *is* a little complicated, so I would also like to hear about
>> > possible alternatives.
>> >
>> > Regards,
>> > Gora
>>

Re: MySQL data import

Posted by Brian Lamb <br...@journalexperts.com>.
Thanks all. Erick, is there documentation on doing things with SolrJ and a
JDBC connection?

On Mon, Dec 12, 2011 at 1:34 PM, Erick Erickson <er...@gmail.com>wrote:

> You might want to consider just doing the whole
> thing in SolrJ with a JDBC connection. When things
> get complex, it's sometimes more straightforward.
>
> Best
> Erick...
>
> P.S. Yes, it's pretty standard to have a single
> field be the destination for several copyField
> directives.
>
> On Mon, Dec 12, 2011 at 12:48 PM, Gora Mohanty <go...@mimirtech.com> wrote:
> > On Mon, Dec 12, 2011 at 2:24 AM, Brian Lamb
> > <br...@journalexperts.com> wrote:
> >> Hi all,
> >>
> >> I have a few questions about how the MySQL data import works. It seems
> it
> >> creates a separate connection for each entity I create. Is there any
> way to
> >> avoid this?
> >
> > Not sure, but I do not think that it is possible. However, from your
> description
> > below, I think that you are unnecessarily multiplying entities.
> >
> >> By nature of my schema, I have several multivalued fields. Each one I
> >> populate with a separate entity. Is there a better way to do it? For
> >> example, could I pull in all the singular data in one sitting and then
> come
> >> back in later and populate with the multivalued items.
> >
> > Not quite sure as to what you mean. Would it be possible for you
> > to post your schema.xml, and the DIH configuration file? Preferably,
> > put these on pastebin.com, and send us links. Also, you should
> > obfuscate details like access passwords.
> >
> >> An alternate approach in some cases would be to do a GROUP_CONCAT and
> then
> >> populate the multivalued column with some transformation. Is that
> possible?
> > [...]
> >
> > This is how we have been handling it. A complete description would
> > be long, but here is the gist of it:
> > * A transformer will be needed. In this case, we found it easiest
> >  to use a Java-based transformer. Thus, your entity should include
> >  something like
> >  <entity name="myname" dataSource="mysource"
> > transformer="com.mycompany.search.solr.handler.JobsNumericTransformer...>
> >  ...
> >  </entity>
> >  Here, the class name to be used for the transformer attribute follows
> >  the usual Java rules, and the .jar needs to be made available to Solr.
> > * The SELECT statement for the entity looks something like
> >  select group_concat( myfield SEPARATOR '@||@')...
> >  The separator should be something that does not occur in your
> >  normal data stream.
> > * Within the entity, define
> >   <field column="myfield"/>
> > * There are complications involved if NULL values are allowed
> >   for the field, in which case you would need to use COALESCE,
> >   maybe along with CAST
> > * The transformer would look up "myfield", split along the separator,
> >   and populate the multi-valued field.
> >
> > This *is* a little complicated, so I would also like to hear about
> > possible alternatives.
> >
> > Regards,
> > Gora
>

Re: MySQL data import

Posted by Erick Erickson <er...@gmail.com>.
You might want to consider just doing the whole
thing in SolrJ with a JDBC connection. When things
get complex, it's sometimes more straightforward.

Best
Erick...

P.S. Yes, it's pretty standard to have a single
field be the destination for several copyField
directives.
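
For example, a minimal sketch in schema.xml (field names and types here
are invented for illustration; the destination has to be multiValued to
hold all three values):

<field name="title"    type="text_general" indexed="true" stored="true"/>
<field name="author"   type="text_general" indexed="true" stored="true"/>
<field name="keywords" type="text_general" indexed="true" stored="true"/>
<field name="all_text" type="text_general" indexed="true" stored="false"
       multiValued="true"/>

<copyField source="title"    dest="all_text"/>
<copyField source="author"   dest="all_text"/>
<copyField source="keywords" dest="all_text"/>

All three values show up in all_text at index time; make it stored="true"
if you also need them returned in search results.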

On Mon, Dec 12, 2011 at 12:48 PM, Gora Mohanty <go...@mimirtech.com> wrote:
> On Mon, Dec 12, 2011 at 2:24 AM, Brian Lamb
> <br...@journalexperts.com> wrote:
>> Hi all,
>>
>> I have a few questions about how the MySQL data import works. It seems it
>> creates a separate connection for each entity I create. Is there any way to
>> avoid this?
>
> Not sure, but I do not think that it is possible. However, from your description
> below, I think that you are unnecessarily multiplying entities.
>
>> By nature of my schema, I have several multivalued fields. Each one I
>> populate with a separate entity. Is there a better way to do it? For
>> example, could I pull in all the singular data in one sitting and then come
>> back in later and populate with the multivalued items.
>
> Not quite sure as to what you mean. Would it be possible for you
> to post your schema.xml, and the DIH configuration file? Preferably,
> put these on pastebin.com, and send us links. Also, you should
> obfuscate details like access passwords.
>
>> An alternate approach in some cases would be to do a GROUP_CONCAT and then
>> populate the multivalued column with some transformation. Is that possible?
> [...]
>
> This is how we have been handling it. A complete description would
> be long, but here is the gist of it:
> * A transformer will be needed. In this case, we found it easiest
>  to use a Java-based transformer. Thus, your entity should include
>  something like
>  <entity name="myname" dataSource="mysource"
> transformer="com.mycompany.search.solr.handler.JobsNumericTransformer...>
>  ...
>  </entity>
>  Here, the class name to be used for the transformer attribute follows
>  the usual Java rules, and the .jar needs to be made available to Solr.
> * The SELECT statement for the entity looks something like
>  select group_concat( myfield SEPARATOR '@||@')...
>  The separator should be something that does not occur in your
>  normal data stream.
> * Within the entity, define
>   <field column="myfield"/>
> * There are complications involved if NULL values are allowed
>   for the field, in which case you would need to use COALESCE,
>   maybe along with CAST
> * The transformer would look up "myfield", split along the separator,
>   and populate the multi-valued field.
>
> This *is* a little complicated, so I would also like to hear about
> possible alternatives.
>
> Regards,
> Gora

Re: MySQL data import

Posted by Gora Mohanty <go...@mimirtech.com>.
On Mon, Dec 12, 2011 at 2:24 AM, Brian Lamb
<br...@journalexperts.com> wrote:
> Hi all,
>
> I have a few questions about how the MySQL data import works. It seems it
> creates a separate connection for each entity I create. Is there any way to
> avoid this?

Not sure, but I do not think that it is possible. However, from your description
below, I think that you are unnecessarily multiplying entities.

> By nature of my schema, I have several multivalued fields. Each one I
> populate with a separate entity. Is there a better way to do it? For
> example, could I pull in all the singular data in one sitting and then come
> back in later and populate with the multivalued items.

Not quite sure as to what you mean. Would it be possible for you
to post your schema.xml, and the DIH configuration file? Preferably,
put these on pastebin.com, and send us links. Also, you should
obfuscate details like access passwords.

> An alternate approach in some cases would be to do a GROUP_CONCAT and then
> populate the multivalued column with some transformation. Is that possible?
[...]

This is how we have been handling it. A complete description would
be long, but here is the gist of it:
* A transformer will be needed. In this case, we found it easiest
  to use a Java-based transformer. Thus, your entity should include
  something like
    <entity name="myname" dataSource="mysource"
            transformer="com.mycompany.search.solr.handler.JobsNumericTransformer...>
      ...
    </entity>
  Here, the class name to be used for the transformer attribute follows
  the usual Java rules, and the .jar needs to be made available to Solr.
* The SELECT statement for the entity looks something like
    select group_concat(myfield SEPARATOR '@||@') ...
  The separator should be something that does not occur in your
  normal data stream.
* Within the entity, define
    <field column="myfield"/>
* There are complications if NULL values are allowed for the field,
  in which case you would need to use COALESCE, maybe along with CAST.
* The transformer would look up "myfield", split it on the separator,
  and populate the multi-valued field; a minimal sketch follows below.
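
A minimal sketch of such a transformer (the class name, package, and the
"myfield" column are hypothetical; error handling omitted):

package com.mycompany.search.solr.handler;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

// Splits the GROUP_CONCATed "myfield" value on the '@||@' separator and
// puts a List back into the row; DIH then assigns the list to the
// multiValued Solr field of the same name.
public class SplitFieldTransformer extends Transformer {
  private static final Pattern SEPARATOR =
      Pattern.compile(Pattern.quote("@||@"));

  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    Object value = row.get("myfield");
    if (value != null) {
      List<String> values = new ArrayList<String>();
      for (String part : SEPARATOR.split(value.toString())) {
        String trimmed = part.trim();
        if (trimmed.length() > 0) {
          values.add(trimmed);
        }
      }
      row.put("myfield", values); // replace the scalar with the split values
    }
    return row;
  }
}

Compile this into a .jar, put it where Solr's lib directives can see it,
and reference the class in the entity's transformer attribute as above.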

This *is* a little complicated, so I would also like to hear about
possible alternatives.

Regards,
Gora