Posted to solr-user@lucene.apache.org by Hal Arres <he...@gmail.com> on 2014/05/19 12:31:31 UTC
Slow file-import
Hello there,
I am working on an import configuration for my Solr index and have run
into some issues with it.
As a first step I configured an import handler that loads data from a
database into the Solr index. It works just fine, but it is very slow
(7K documents per second). So I wanted to switch to a
data-import-handler using a FileDataSource. (I am running Solr 4.6.1.)
I have to import nearly 150,000,000 lines each night, and each line has
the following characteristics:
- fields are separated by tabs
- 70 fields per line
- each line is roughly 600 characters long
- each line contains multiple data types (date, int, string, ...)
At the moment the files are loaded into the database, from which Solr
then imports them (database import handler).
To improve the import performance I wanted to import the files directly.
This is the first approach I tested:
---------------
<entity
    name="files"
    dataSource="null"
    rootEntity="false"
    processor="FileListEntityProcessor"
    baseDir="/tmp"
    fileName=".*\.infile"
    onError="abort"
    recursive="false">
  <entity
      name="csv_file"
      processor="LineEntityProcessor"
      url="${files.fileAbsolutePath}"
      dataSource="fds"
      transformer="RegexTransformer">
    <field column="rawLine"
           regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
           groupNames="field1,,,field4,field5"/>
  </entity>
</entity>
-----------------
If I import fewer than 10 fields this works just fine. But as soon as I
extend the regex to about 30 capture groups, the time to import a single
line increases to more than 10 seconds!
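(A guess at the cause, plus an untested sketch: a long chain of greedy
(.*) groups forces the regex engine to backtrack heavily on a
600-character line. Replacing each (.*) with ([^\t]*) removes that
freedom, because no group can ever run past the next tab:
---------------
<field column="rawLine"
       regex="^([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)$"
       groupNames="field1,,,field4,field5"/>
-----------------
This is just the five-group field definition from my config rewritten;
the 30-group variant would follow the same pattern.)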
So I tried another approach and moved the transformation into a script:
----------------
<script><![CDATA[
function parse(row) {
    // split the raw line on tabs and map each piece onto its field
    var rawLine = row.get("rawLine");
    var arr = rawLine.split("\t");
    row.put("field1", arr[0]);
    // ... fields 2 to 66 are filled in the same way ...
    row.put("field67", arr[66]);
    // row.remove("rawLine");
    return row;
}
]]></script>
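For completeness, the script is hooked into the entity via
transformer="script:parse" (abridged; in my real config the entity is
still nested inside the outer files entity, as above):
----------------
<dataConfig>
  <script><![CDATA[ /* parse() as above */ ]]></script>
  <document>
    <entity name="csv_file"
            processor="LineEntityProcessor"
            url="${files.fileAbsolutePath}"
            dataSource="fds"
            transformer="script:parse"/>
  </document>
</dataConfig>
-----------------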
-----------------
But this was just slightly faster than the database import.
Does anyone have an idea how I can improve the import performance?
Thank you very, very much,
Sebastian
Re: Slow file-import
Posted by Ahmet Arslan <io...@yahoo.com>.
Hi,
Try http://wiki.apache.org/solr/UpdateCSV; it should be faster.
See 'Tab-delimited importing' at the end of the wiki page.
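Something along these lines (an untested sketch; the field names and the
file path are placeholders you would replace with your own 70 fields):
----------------
# %09 is the URL-encoded tab separator; the files have no header line,
# so the schema field names are passed explicitly via fieldnames
curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%09&header=false&fieldnames=field1,field2,field3' \
  --data-binary @/tmp/data.infile \
  -H 'Content-type: text/plain; charset=utf-8'
-----------------
If remote streaming is enabled in solrconfig.xml, passing
stream.file=/tmp/data.infile instead of the request body lets Solr read
the file locally and avoids pushing 150M lines over HTTP.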
Cheers,
Ahmet