Posted to solr-user@lucene.apache.org by Antelmo Aguilar <An...@nd.edu> on 2015/12/14 18:04:20 UTC

Help Indexing Large File

Hello,

I am trying to index a very large file in Solr (around 5 GB).  However, I
get out-of-memory errors using curl.  I tried using the post script and I
had some success with it.  After indexing several hundred thousand records
though, I got the following error message:

*SimplePostTool: FATAL: IOException while posting data:
java.io.IOException: too many bytes written*

Would it be possible to get some help on where I can start looking to solve
this issue?  I tried finding some type of log that would give me more
information, but I have not had any luck.  The only logs I was able to find
related to this error were the logs from Solr, but I assume these are from
the "server" perspective and not the "client's" perspective of the error.  I
would really appreciate the help.

Thanks,
Antelmo

Re: Help Indexing Large File

Posted by Erick Erickson <er...@gmail.com>.
Well, this usually means the maximum packet size has been exceeded.
There are several possibilities here that I'm going to skip over,
because first I have to ask the purpose of indexing a 5 GB file.

Indexing such a huge file has several problems from a user's perspective:
1> assuming the bulk of it is text, it'll be hit on many, many searches.
2> because it is so large, it'll probably rank quite a ways down the
list, so users may rarely see it.
3> even if it is found and a user clicks on the doc, what then? You
can't reasonably fetch it from a server and display it.

In short, before diving into the mechanics of why you get this error
and correcting that, I'd be sure it made any sense to even try to
index this doc. It may, don't get me wrong. Just askin'.

Also, if this is some kind of binary file (say a movie or something),
and what you're trying to index is actually the metadata, consider
extracting that on the client side with SolrJ and/or Tika and just
sending the data you expect to index to Solr. This scales much better
than sending huge docs to Solr and letting that poor little server
extract _and_ index _and_ serve queries ;).
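A minimal sketch of that client-side approach with SolrJ (5.x-style API; the
URL, collection name, and field names here are made up for illustration, and
solr-solrj must be on the classpath):

```java
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexMetadataOnly {
    public static void main(String[] args) throws Exception {
        // Send only the fields you actually want searchable,
        // instead of streaming the whole 5 GB file to Solr.
        try (HttpSolrClient client =
                 new HttpSolrClient("http://localhost:8983/solr/mycollection")) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "bigfile-001");   // hypothetical field names
            doc.addField("title", "Title extracted client-side");
            doc.addField("content_type", "application/octet-stream");
            client.add(doc);
            client.commit();
        }
    }
}
```

The point of the design is that Solr only ever sees a few small fields, while
the expensive extraction work stays on the client.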

Best,
Erick

Re: Help Indexing Large File

Posted by Jack Krupansky <ja...@gmail.com>.
What is the nature of the file? Is it Solr XML, CSV, PDF (via Solr Cell),
or... what? If it's a PDF, maybe it has lots of high-resolution images. If so,
you may need to strip out the images and just send the text, which would be a
lot smaller. For example, you could run Tika locally to extract the text
and then index the raw text.
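A small sketch of that local extraction step using Tika's facade API (this
assumes the standalone tika-app jar is on the classpath, and the file path is
hypothetical):

```java
import java.io.File;
import org.apache.tika.Tika;

public class ExtractTextLocally {
    public static void main(String[] args) throws Exception {
        // Parse the rich document locally and keep only its text,
        // so images and other binary payload never reach Solr.
        Tika tika = new Tika();
        String text = tika.parseToString(new File("/path/to/big.pdf"));
        System.out.println("Extracted " + text.length() + " chars of text");
    }
}
```

The extracted text can then be posted to Solr as an ordinary field value via
curl, bin/post, or SolrJ.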

-- Jack Krupansky

Re: Help Indexing Large File

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Antelmo Aguilar <An...@nd.edu> wrote:
> I am trying to index a very large file in Solr (around 5GB).  However, I
> get out of memory errors using Curl.  I tried using the post script and I
> had some success with it.  After indexing several hundred thousand records
> though, I got the following error message:

This indicates that your file contains a lot of documents. The solution is to create smaller files and send more of them. Maybe a few hundred MB, to keep it manageable?
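A minimal, hedged sketch of that splitting step in plain Java (assuming a
line-oriented input such as CSV or one JSON document per line; for CSV with a
header row you would also need to copy the header into each chunk):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SplitFile {

    // Split a large line-oriented file (e.g. CSV without a header, or
    // one JSON document per line) into chunks of at most maxLines lines.
    // Returns the number of chunk files written.
    static int split(Path input, Path outDir, int maxLines) throws IOException {
        Files.createDirectories(outDir);
        int chunk = 0;
        try (BufferedReader in = Files.newBufferedReader(input)) {
            String line = in.readLine();
            while (line != null) {
                Path out = outDir.resolve("chunk-" + chunk + ".txt");
                try (BufferedWriter w = Files.newBufferedWriter(out)) {
                    int written = 0;
                    while (line != null && written < maxLines) {
                        w.write(line);
                        w.newLine();
                        written++;
                        line = in.readLine();
                    }
                }
                chunk++;
            }
        }
        return chunk;
    }
}
```

Each chunk can then be sent with bin/post or curl exactly as before, just in
more, smaller requests.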

> *SimplePostTool: FATAL: IOException while posting data:
> java.io.IOException: too many bytes written*

A look at the postData method in SimplePostTool (at least for Solr 4.10, which is what my editor had open) reveals that it takes the length of the file as an int, which overflows when the file is larger than 2GB. This means the HttpURLConnection used for posting gets the wrong expected size and throws the exception when that size is exceeded.
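The wrap-around is easy to demonstrate: narrowing a 5 GB length to int keeps only the low 32 bits, so the connection expects just 1 GB, which would match the post succeeding for several hundred thousand records before failing:

```java
public class OverflowDemo {
    public static void main(String[] args) {
        long fileSize = 5L * 1024 * 1024 * 1024; // 5 GB, as in the report
        int truncated = (int) fileSize;          // narrowing cast keeps low 32 bits

        System.out.println(fileSize);  // 5368709120
        System.out.println(truncated); // 1073741824 -- i.e. 1 GB, silently wrong
    }
}
```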

A real fix (if it is not already in Solr 5) would be to fail fast if the file is larger than Integer.MAX_VALUE.

- Toke Eskildsen