Posted to solr-user@lucene.apache.org by "keshari.prerna" <ke...@gmail.com> on 2013/10/22 19:58:11 UTC
Indexing logs files of thousands of GBs
Hello,
I am trying to index log files (all text data) stored in the file system. The data
can be as big as 1000 GB or more. I am working on Windows.
A sample file can be found at
https://www.dropbox.com/s/mslwwnme6om38b5/batkid.glnxa64.66441
I tried using FileListEntityProcessor with TikaEntityProcessor, which ended
in a Java heap exception that I couldn't get rid of no matter how much I
increased my RAM size.
data-config.xml
<dataConfig>
  <dataSource name="bin" type="FileDataSource" />
  <document>
    <entity name="f" dataSource="null" rootEntity="true"
            processor="FileListEntityProcessor"
            transformer="TemplateTransformer"
            baseDir="//mathworks/devel/bat/A/logs/66048/"
            fileName=".*\.*" onError="skip" recursive="true">
      <field column="fileAbsolutePath" name="path" />
      <field column="fileSize" name="size"/>
      <field column="fileLastModified" name="lastmodified" />
      <entity name="file" dataSource="bin"
              processor="TikaEntityProcessor" url="${f.fileAbsolutePath}"
              format="text" onError="skip" transformer="TemplateTransformer"
              rootEntity="true">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
Then I used FileListEntityProcessor with LineEntityProcessor, which never
stopped indexing even after 40 hours or so.
data-config.xml
<dataConfig>
  <dataSource name="bin" type="FileDataSource" />
  <document>
    <entity name="f" dataSource="null" rootEntity="true"
            processor="FileListEntityProcessor"
            transformer="TemplateTransformer"
            baseDir="//mathworks/devel/bat/A/logs/"
            fileName=".*\.*" onError="skip" recursive="true">
      <field column="fileAbsolutePath" name="path" />
      <field column="fileSize" name="size"/>
      <field column="fileLastModified" name="lastmodified" />
      <entity name="file" dataSource="bin"
              processor="LineEntityProcessor" url="${f.fileAbsolutePath}"
              format="text" onError="skip"
              rootEntity="true">
        <field column="content" name="rawLine"/>
      </entity>
    </entity>
  </document>
</dataConfig>
Is there any way I can use post.jar to index text files recursively? Or any
other way that works without a Java heap exception and doesn't take days to
index?
I am completely stuck here. Any help would be greatly appreciated.
Thanks,
Prerna
--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-logs-files-of-thousands-of-GBs-tp4097073.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Indexing logs files of thousands of GBs
Posted by Erick Erickson <er...@gmail.com>.
Throwing a multi-gigabyte file at Solr and expecting it
to index it is asking for a bit too much. You either
have to stream it up and break it apart or something
similar.
And consider what happens if you just index the log as
a single document. How do you search it? Do you return
several G as the result? Most applications break
the log file up into individual documents and index each event
individually to enable searches like
"all OOM errors between 12:00 and 13:00 yesterday" or
similar. How do you expect to do such a thing if it's one
big document?
I may be completely off base here, but I think you need to
define the problem you're solving more clearly. I can flat
guarantee that trying to index a large log file as one document
will be unsatisfactory to search, even if you can get it into
the index.
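A minimal sketch of that per-event splitting (the class name, timestamp pattern, and the assumption that events start with a timestamped line are illustrative, not from this thread):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class LogSplitter {
    // Assumed event delimiter: lines beginning with an ISO-like timestamp,
    // e.g. "2013-10-22 19:58:11 ...". Adjust to the actual log format.
    private static final Pattern EVENT_START =
            Pattern.compile("^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}");

    /**
     * Splits a log file into one string per event; continuation lines
     * (stack traces, wrapped messages) stay attached to the preceding event.
     */
    public static List<String> splitEvents(Path logFile) throws IOException {
        List<String> events = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try (BufferedReader reader = Files.newBufferedReader(logFile)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // A new timestamped line closes the previous event.
                if (EVENT_START.matcher(line).find() && current.length() > 0) {
                    events.add(current.toString());
                    current.setLength(0);
                }
                if (current.length() > 0) current.append('\n');
                current.append(line);
            }
        }
        if (current.length() > 0) events.add(current.toString());
        return events;
    }
}
```

Each returned event string would then become its own document (with the leading timestamp parsed into a date field), so queries like "all OOM errors between 12:00 and 13:00" become ordinary range queries.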
Best,
Erick
On Wed, Oct 30, 2013 at 12:47 PM, Otis Gospodnetic <otis.gospodnetic@gmail.com> wrote:
Re: Indexing logs files of thousands of GBs
Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi,
Hm, sorry for not helping with this particular issue directly, but it
looks like you are *uploading* your logs and indexing that way?
Wouldn't pushing them be a better fit when it comes to log indexing?
We recently contributed a Logstash output that can index logs to Solr,
which may be of interest - have a look at
https://twitter.com/otisg/status/395563043045638144 -- includes a
little diagram that shows how this fits into the picture.
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
On Wed, Oct 30, 2013 at 9:55 AM, keshari.prerna <ke...@gmail.com> wrote:
Re: Indexing logs files of thousands of GBs
Posted by "keshari.prerna" <ke...@gmail.com>.
I have set the multipartUploadLimitInKB parameter to 10240 (it was 2048
earlier). Now it gives the following error for the same files:
http://localhost:8983/solr/logsIndexing returned non ok status:500,
message:the request was rejected because its size (10486046) exceeds the
configured maximum (10485760).
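For reference, that limit is configured on the requestParsers element inside requestDispatcher in solrconfig.xml; the fragment below is a sketch against Solr 4.x defaults, with illustrative values:

```xml
<!-- in solrconfig.xml -->
<requestDispatcher>
  <!-- Caps the size of a single upload request, in KB.
       10240 KB = 10 MB; the rejected request above (10486046 bytes)
       was just over this cap. -->
  <requestParsers enableRemoteStreaming="false"
                  multipartUploadLimitInKB="10240" />
</requestDispatcher>
```

Note that the failing request was only slightly over the new 10 MB cap; since server.add() sends the whole document in one request, a big enough file will outgrow any fixed limit, which is another argument for splitting files into smaller per-event documents.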
Re: Indexing logs files of thousands of GBs
Posted by "keshari.prerna" <ke...@gmail.com>.
Hello,
As suggested by Chris, I am now accessing the files using a Java program and
creating SolrInputDocuments, but I ran into this exception while doing
server.add(document). When I tried to increase "ramBufferSizeMB", it doesn't
let me make it more than 2 GB.
org.apache.solr.client.solrj.SolrServerException: Server at http://localhost:8983/solr/logsIndexing returned non ok status:500, message:the request was rejected because its size (2097454) exceeds the configured maximum (2097152)
org.apache.commons.fileupload.FileUploadBase$SizeLimitExceededException: the request was rejected because its size (2097454) exceeds the configured maximum (2097152)
    at org.apache.commons.fileupload.FileUploadBase$FileItemIteratorImpl$1.raiseError(FileUploadBase.java:902)
    at org.apache.commons.fileupload.util.LimitedInputStream.checkLimit(LimitedInputStream.java:71)
    at org.apache.commons.fileupload.util.LimitedInputStream.read(LimitedInputStream.java:128)
    at org.apache.commons.fileupload.MultipartStream$ItemInputStream.makeAvailable(MultipartStream.java:977)
    at org.apache.commons.fileupload.MultipartStream$ItemInputStream.read(MultipartStream.java:887)
    at java.io.InputStream.read(Unknown Source)
    at org.apache.commons.fileupload.util.Streams.copy(Streams.java:94)
    at org.apache.commons.fileupload.util.Streams.copy(Streams.java:64)
    at org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:362)
    at org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
    at org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:344)
    at org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:397)
    at org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHand
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:328)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:211)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:121)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:106)
    at Filewalker.walk(LogsIndexer.java:48)
    at Filewalker.main(LogsIndexer.java:69)
How do I get rid of this?
Thanks,
Prerna
Re: Indexing logs files of thousands of GBs
Posted by Erick Erickson <er...@gmail.com>.
As a supplement to what Chris said, if you can
partition the walking amongst a number of clients
you can also parallelize the indexing. If you're using
SolrCloud 4.5+, there are also some nice optimizations
in SolrCloud to keep intra-shard routing to a minimum.
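The partitioning idea can be sketched as below; the class name, thread count, and the indexFile callback (which stands in for the actual per-file SolrJ work) are assumptions for illustration:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.stream.Stream;

public class ParallelIndexer {
    /**
     * Distributes the top-level entries of root across a fixed thread pool;
     * each worker applies indexFile to every regular file in its subtree.
     * indexFile must be thread-safe (a shared SolrJ client is).
     */
    public static void indexInParallel(Path root, int threads,
                                       Consumer<Path> indexFile)
            throws IOException, InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (DirectoryStream<Path> tops = Files.newDirectoryStream(root)) {
            for (Path top : tops) {
                pool.submit(() -> {
                    try (Stream<Path> files = Files.walk(top)) {
                        files.filter(Files::isRegularFile).forEach(indexFile);
                    } catch (IOException e) {
                        e.printStackTrace(); // skip unreadable subtrees
                    }
                });
            }
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }
}
```

Partitioning by top-level directory is the simplest split; if the subtrees are very uneven in size, a work queue of individual files balances better.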
FWIW,
Erick
On Wed, Oct 23, 2013 at 12:59 PM, Chris Geeringh <ge...@gmail.com> wrote:
Re: Indexing logs files of thousands of GBs
Posted by Chris Geeringh <ge...@gmail.com>.
Prerna,
The FileListEntityProcessor has a terribly inefficient recursive method,
which will be using up all your heap building a list of files.
I would suggest writing a client application that traverses your filesystem
with the NIO APIs available in Java 7: Files.walkFileTree() and a FileVisitor.
As you "walk", post up to the server with SolrJ.
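A minimal sketch of that walker (the class name and batch size are assumptions, and the SolrJ call is left as a comment so the walker itself stays self-contained):

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

public class LogFileWalker extends SimpleFileVisitor<Path> {
    private static final int BATCH_SIZE = 1000; // assumed; tune to heap size
    private final List<Path> batch = new ArrayList<>();
    private long fileCount = 0;

    @Override
    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
        batch.add(file);
        fileCount++;
        if (batch.size() >= BATCH_SIZE) {
            flush();
        }
        return FileVisitResult.CONTINUE;
    }

    /**
     * Send the current batch to Solr and clear it. With SolrJ this would be
     * roughly: build one SolrInputDocument per file (or per log event),
     * then server.add(docs), committing periodically rather than per batch.
     */
    private void flush() {
        // solrServer.add(toDocuments(batch));  // SolrJ call, omitted here
        batch.clear();
    }

    /** Walks the tree and returns how many regular files were visited. */
    public long walk(Path root) throws IOException {
        Files.walkFileTree(root, this);
        flush(); // post the final partial batch
        return fileCount;
    }
}
```

Unlike FileListEntityProcessor, this holds at most one batch of paths in memory at a time instead of the whole file list.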
Cheers,
Chris
On 22 October 2013 18:58, keshari.prerna <ke...@gmail.com> wrote: