Posted to solr-user@lucene.apache.org by Rohit Gandhe <ro...@gmail.com> on 2010/02/04 21:03:08 UTC
Indexing CSV without HTTP
Hi Everyone,
We are indexing quite a lot of data using the update/csv handler. For
reasons I can't get into right now, I can't implement a DIH, since I
can only access the DB using stored procs and stored proc support in
DIH is not yet available. Indexing takes about 3 hours and I don't
want to tax the server too much during indexing, so I came up with a
two-server solution: an indexing server indexes the file every night
and then copies the index to the search server. Maintaining a
full-fledged Tomcat/Jetty just for indexing is too much of a pain, so I
wrote a small utility Java class which starts an embedded server,
indexes the CSV and shuts down the server. I would like the
community's input on this solution.
Is this okay to do?
Is there a better way to do this without running two separate servers?
Is my class safe enough to run every night in a production environment?
Here's my utility class. This is just a POC and before I productionize
it, I would like some input from the Solr czars here.
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.CoreContainer;
import org.apache.solr.core.CoreDescriptor;
import org.apache.solr.core.SolrConfig;
import org.apache.solr.core.SolrCore;

import java.io.File;

public class StandaloneSolrIndexer {

    public static void main(String[] args) throws Exception {
        SolrCore core = null;
        CoreContainer container = null;
        try {
            // Build a single-core container by hand from the config under /tmp/solr
            container = new CoreContainer();
            SolrConfig config = new SolrConfig("/tmp/solr", "solrconfig.xml", null);
            CoreDescriptor descriptor = new CoreDescriptor(container, "core1", "/tmp/solr");
            core = new SolrCore("core1", "/tmp/solr/data", config, null, descriptor);
            container.register("core1", core, false);
            SolrServer server = new EmbeddedSolrServer(container, "core1");

            // Start by deleting everything
            server.deleteByQuery("*:*");

            // Send the tab-separated file to the CSV update handler and commit
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/csv");
            req.addFile(new File("/tmp/product-5k.tsv"));
            req.setParam("commit", "true");
            req.setParam("stream.contentType", "text/plain;charset=utf-8");
            req.setParam("escape", "\\");
            req.setParam("separator", "\t");
            req.setParam("fieldnames",
                    "product_id,account_id,name,category_tags,short_desc,upc,manu_mdl_num,ext_prd_id,brand,long_desc,sku,seller,seller_email,vertical,cat,subcat");
            // Skip the header row of the file
            req.setParam("skipLines", "1");

            NamedList<Object> result = server.request(req);
            System.out.println("Result ====================================================================================: \n" + result);
        } finally {
            if (core != null) core.close();
            if (container != null) container.shutdown();
        }
    }
}
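
The class above covers only the indexing half of the plan; the "copy the index to the search server" step is not shown. A minimal sketch of that step, assuming the search server's index directory is reachable as a local or mounted path and that the embedded server has already shut down, might look like the following. The class name IndexCopier, the method names, and the ".new"/".old" directory suffixes are all hypothetical, and the search server would still need to be told to reopen its index afterwards, which is not covered here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Solr: publishes a freshly built index directory.
public class IndexCopier {

    /** Recursively copies everything under source into target, creating target if needed. */
    public static void copyTree(Path source, Path target) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(source)) {
            paths = walk.collect(Collectors.toList()); // parents are listed before their children
        }
        for (Path p : paths) {
            Path dest = target.resolve(source.relativize(p).toString());
            if (Files.isDirectory(p)) {
                Files.createDirectories(dest);
            } else {
                Files.copy(p, dest, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    /** Deletes a directory tree if it exists (children first, then parents). */
    public static void deleteTree(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    /**
     * Copies builtIndex next to liveIndex as "<name>.new", keeps the current live
     * directory as "<name>.old", then renames the new copy into place.
     * The two renames are not one atomic operation; this is only a sketch.
     */
    public static void publish(Path builtIndex, Path liveIndex) throws IOException {
        Path staging = liveIndex.resolveSibling(liveIndex.getFileName() + ".new");
        Path previous = liveIndex.resolveSibling(liveIndex.getFileName() + ".old");
        deleteTree(staging);
        copyTree(builtIndex, staging);
        deleteTree(previous);
        if (Files.exists(liveIndex)) {
            Files.move(liveIndex, previous);
        }
        Files.move(staging, liveIndex);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: IndexCopier <built-index-dir> <live-index-dir>");
            System.exit(1);
        }
        publish(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Copying into a sibling directory first and renaming afterwards means the search server never sees a half-copied index, and the ".old" directory gives a simple way to roll back if the nightly build turns out to be bad.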
Thanks,
Rohit
Re: Indexing CSV without HTTP
Posted by Rohit Gandhe <ro...@gmail.com>.
Thanks Yonik! We want to move to index replication soon (in a couple of
months), which will also help with incremental updates. But for now we
want a quick and dirty solution without running two servers. Does the
utility look OK for indexing a CSV file? Is it safe to run in a production
environment? I know maintaining custom server code is not a good idea,
but this is just until we can implement index replication.
On Thu, Feb 4, 2010 at 12:28 PM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
> On Thu, Feb 4, 2010 at 3:03 PM, Rohit Gandhe <ro...@gmail.com> wrote:
>> We are indexing quite a lot of data using update/csv handler. For
>> reasons I can't get into right now, I can't implement a DIH since I
>> can only access the DB using Stored Procs and stored proc support in
>> DIH is not yet available. Indexing takes about 3 hours and I don't
>> want to tax the server too much during indexing so I came up with a
>> two server solution. Indexing server to index the file every night and
>> subsequently copy the index on the search server.
>
> Why not use the built-in index replication?
>
>> Maintaining a full
>> fledged Tomcat/Jetty for just indexing is too much of a pain, so I
>> wrote a small utility Java class which starts an Embedded Server,
>
> Surely maintaining your own custom server code is going to be more
> work than simply running a server provided by the community?
>
> -Yonik
> http://www.lucidimagination.com
>
Re: Indexing CSV without HTTP
Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Thu, Feb 4, 2010 at 3:03 PM, Rohit Gandhe <ro...@gmail.com> wrote:
> We are indexing quite a lot of data using update/csv handler. For
> reasons I can't get into right now, I can't implement a DIH since I
> can only access the DB using Stored Procs and stored proc support in
> DIH is not yet available. Indexing takes about 3 hours and I don't
> want to tax the server too much during indexing so I came up with a
> two server solution. Indexing server to index the file every night and
> subsequently copy the index on the search server.
Why not use the built-in index replication?
> Maintaining a full
> fledged Tomcat/Jetty for just indexing is too much of a pain, so I
> wrote a small utility Java class which starts an Embedded Server,
Surely maintaining your own custom server code is going to be more
work than simply running a server provided by the community?
-Yonik
http://www.lucidimagination.com