Posted to solr-user@lucene.apache.org by Rohit Gandhe <ro...@gmail.com> on 2010/02/04 21:03:08 UTC

Indexing CSV without HTTP

Hi Everyone,

We are indexing quite a lot of data using the update/csv handler. For
reasons I can't get into right now, I can't implement a DIH, since I
can only access the DB through stored procedures and stored procedure
support in DIH is not yet available. Indexing takes about 3 hours, and
I don't want to tax the search server during that time, so I came up
with a two-server solution: an indexing server indexes the file every
night, and the resulting index is then copied to the search server.
Maintaining a full-fledged Tomcat/Jetty just for indexing is too much
of a pain, so I wrote a small utility Java class that starts an
embedded server, indexes the CSV, and shuts the server down. I would
like the community's input on this solution.

Is this okay to do?
Is there a better way to do this without running two separate servers?
Is my class safe enough to run every night in a production environment?

Here's my utility class. This is just a POC, and before I productionize
it, I would like some input from the Solr czars here.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.CoreContainer;
import org.apache.solr.core.CoreDescriptor;
import org.apache.solr.core.SolrConfig;
import org.apache.solr.core.SolrCore;

import java.io.File;

public class StandaloneSolrIndexer {

    public static void main(String args[]) throws Exception {

        SolrCore core = null;
        CoreContainer container = null;
        try {
            container = new CoreContainer();

            SolrConfig config = new SolrConfig("/tmp/solr", "solrconfig.xml", null);
            CoreDescriptor descriptor = new CoreDescriptor(container, "core1", "/tmp/solr");

            core = new SolrCore("core1", "/tmp/solr/data", config, null, descriptor);
            container.register("core1", core, false);

            SolrServer server = new EmbeddedSolrServer(container, "core1");

            //Start by deleting everything
            server.deleteByQuery("*:*");

            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/csv");
            req.addFile(new File("/tmp/product-5k.tsv"));

            req.setParam("commit", "true");
            req.setParam("stream.contentType", "text/plain;charset=utf-8");
            req.setParam("escape", "\\");
            req.setParam("separator", "\t");
            req.setParam("fieldnames",
                    "product_id,account_id,name,category_tags,short_desc,upc,"
                    + "manu_mdl_num,ext_prd_id,brand,long_desc,sku,seller,"
                    + "seller_email,vertical,cat,subcat");
            req.setParam("skipLines", "1");

            NamedList<Object> result = server.request(req);
            System.out.println("Result "
                    + "====================================================================================:\n"
                    + result);

        } finally {
            if (core != null) core.close();
            if (container != null) container.shutdown();
        }
    }
}
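
For a job that runs unattended every night, one cheap safeguard is to
check the TSV before the delete-all runs, so a truncated export does not
wipe the index. The following is only a sketch under stated assumptions:
the class name TsvPreflight and the method badLines are hypothetical and
not part of Solr, and the check splits on plain tabs, so it does not
honor the "escape" parameter or quoted fields the way the CSV handler
would. For the file above, expectedColumns would be 16 (the length of
the fieldnames list) and skipLines would be 1.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Solr: a naive column-count check for a TSV file.
public class TsvPreflight {

    /**
     * Returns the 1-based line numbers whose tab-separated column count
     * differs from expectedColumns. The first skipLines lines are ignored.
     * Assumption of this sketch: empty lines are ignored and fields never
     * contain escaped or quoted tabs.
     */
    public static List<Integer> badLines(Reader in, int expectedColumns, int skipLines)
            throws IOException {
        List<Integer> bad = new ArrayList<>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        int lineNo = 0;
        while ((line = reader.readLine()) != null) {
            lineNo++;
            if (lineNo <= skipLines || line.isEmpty()) {
                continue;
            }
            // A limit of -1 keeps trailing empty columns in the count.
            int columns = line.split("\t", -1).length;
            if (columns != expectedColumns) {
                bad.add(lineNo);
            }
        }
        return bad;
    }

    public static void main(String[] args) throws IOException {
        // Small in-memory sample: header, one good row, one short row, one row with empty columns.
        String sample = "id\tname\tsku\n1\tWidget\tW-1\n2\tBroken row\n3\t\t\n";
        System.out.println(badLines(new StringReader(sample), 3, 1));
    }
}
```

In the indexer, a call like this could sit before deleteByQuery, with the
run aborted when the returned list is non-empty.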


Thanks,
Rohit

Re: Indexing CSV without HTTP

Posted by Rohit Gandhe <ro...@gmail.com>.

Thanks, Yonik! We want to move to index replication soon (in a couple
of months), which will also help with incremental updates. But for now
we want a quick-and-dirty solution without running two servers. Does
the utility look OK for indexing a CSV file? Is it safe to run in a
production environment? I know maintaining custom server code is not a
good idea, but this is just until we can implement index replication.

On Thu, Feb 4, 2010 at 12:28 PM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
> On Thu, Feb 4, 2010 at 3:03 PM, Rohit Gandhe <ro...@gmail.com> wrote:
>> We are indexing quite a lot of data using update/csv handler. For
>> reasons I can't get into right now, I can't implement a DIH since I
>> can only access the DB using Stored Procs and stored proc support in
>> DIH is not yet available. Indexing takes about 3 hours and I don't
>> want to tax the server too much during indexing so I came up with a
>> two server solution. Indexing server to index the file every night and
>> subsequently copy the index on the search server.
>
> Why not use the built-in index replication?
>
>> Maintaining a full
>> fledged Tomcat/Jetty for just indexing is too much of a pain, so I
>> wrote a small utility Java class which starts an Embedded Server,
>
> Surely maintaining your own custom server code is going to be more
> work than simply running a server provided by the community?
>
> -Yonik
> http://www.lucidimagination.com
>

Re: Indexing CSV without HTTP

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Thu, Feb 4, 2010 at 3:03 PM, Rohit Gandhe <ro...@gmail.com> wrote:
> We are indexing quite a lot of data using update/csv handler. For
> reasons I can't get into right now, I can't implement a DIH since I
> can only access the DB using Stored Procs and stored proc support in
> DIH is not yet available. Indexing takes about 3 hours and I don't
> want to tax the server too much during indexing so I came up with a
> two server solution. Indexing server to index the file every night and
> subsequently copy the index on the search server.

Why not use the built-in index replication?

> Maintaining a full
> fledged Tomcat/Jetty for just indexing is too much of a pain, so I
> wrote a small utility Java class which starts an Embedded Server,

Surely maintaining your own custom server code is going to be more
work than simply running a server provided by the community?

-Yonik
http://www.lucidimagination.com