Posted to solr-user@lucene.apache.org by "keshari.prerna" <ke...@gmail.com> on 2013/10/22 19:58:11 UTC

Indexing log files of thousands of GBs

Hello,

I am trying to index log files (all text data) stored in a file system. The data
can be as big as 1000 GB or more. I am working on Windows.

A sample file can be found at
https://www.dropbox.com/s/mslwwnme6om38b5/batkid.glnxa64.66441

I tried using FileListEntityProcessor with TikaEntityProcessor, which ended
up in a Java heap exception that I couldn't get rid of no matter how much I
increased the heap size.

data-config.xml

<dataConfig>
    <dataSource name="bin" type="FileDataSource" />
    <document>
        <entity name="f" dataSource="null" rootEntity="true"
                processor="FileListEntityProcessor"
                transformer="TemplateTransformer"
                baseDir="//mathworks/devel/bat/A/logs/66048/"
                fileName=".*\.*" onError="skip" recursive="true">

            <field column="fileAbsolutePath" name="path" />
            <field column="fileSize" name="size"/>
            <field column="fileLastModified" name="lastmodified" />

            <entity name="file" dataSource="bin"
                    processor="TikaEntityProcessor" url="${f.fileAbsolutePath}"
                    format="text" onError="skip" transformer="TemplateTransformer"
                    rootEntity="true">
                <field column="text" name="text"/>
            </entity>
        </entity>
    </document>
</dataConfig>

Then I used FileListEntityProcessor with LineEntityProcessor, which never
finished indexing even after 40 hours or so.

data-config.xml

<dataConfig>
    <dataSource name="bin" type="FileDataSource" />
    <document>
        <entity name="f" dataSource="null" rootEntity="true"
                processor="FileListEntityProcessor"
                transformer="TemplateTransformer"
                baseDir="//mathworks/devel/bat/A/logs/"
                fileName=".*\.*" onError="skip" recursive="true">

            <field column="fileAbsolutePath" name="path" />
            <field column="fileSize" name="size"/>
            <field column="fileLastModified" name="lastmodified" />

            <entity name="file" dataSource="bin"
                    processor="LineEntityProcessor" url="${f.fileAbsolutePath}"
                    format="text" onError="skip"
                    rootEntity="true">
                <field column="content" name="rawLine"/>
            </entity>
        </entity>
    </document>
</dataConfig>

Is there any way I can use post.jar to index text files recursively? Or any
other way that works without a Java heap exception and doesn't take days to
index?

I am completely stuck here. Any help would be greatly appreciated.

Thanks,
Prerna




Re: Indexing log files of thousands of GBs

Posted by Erick Erickson <er...@gmail.com>.
Throwing a multi-gigabyte file at Solr and expecting it
to index it is asking for a bit too much. You either
have to stream it up and break it apart or something
similar.

And consider what happens if you just index the log as
a single document. How do you search it? Do you return
several G as the result? Most applications break
the log file up into individual documents and index each event
individually to enable searches like
"all OOM errors between 12:00 and 13:00 yesterday" or
similar. How do you expect to do such a thing if it's one
big document?

I may be completely off base here, but I think you need to
define the problem you're solving more clearly. I can flat
guarantee that trying to index a large log file as one document
will be unsatisfactory to search, even if you can get it into
the index.
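
To make that concrete, "one document per event" might look something like the
sketch below. The field names and the timestamp/level regex here are made up
for illustration, not taken from your sample file, so treat it as a shape
rather than a drop-in:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.solr.common.SolrInputDocument;

public class LogEventMapper {
    // Hypothetical pattern: "<timestamp> <LEVEL> <message>". Adjust to the real log format.
    private static final Pattern LINE =
        Pattern.compile("^(\\S+\\s+\\S+)\\s+(ERROR|WARN|INFO|DEBUG)\\s+(.*)$");

    public static SolrInputDocument toDoc(String path, long lineNo, String line) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", path + "#" + lineNo);   // one unique document per event
        doc.addField("path", path);
        Matcher m = LINE.matcher(line);
        if (m.matches()) {
            doc.addField("timestamp", m.group(1));
            doc.addField("level", m.group(2));
            doc.addField("message", m.group(3));
        } else {
            doc.addField("message", line);         // unparsed lines stay searchable
        }
        return doc;
    }
}

With documents shaped like that, a query such as "level:ERROR AND
timestamp:[yesterday noon TO 13:00]" becomes straightforward.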

Best,
Erick


On Wed, Oct 30, 2013 at 12:47 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

> Hi,
>
> Hm, sorry for not helping with this particular issue directly, but it
> looks like you are *uploading* your logs and indexing that way?
> Wouldn't pushing them be a better fit when it comes to log indexing?
> We recently contributed a Logstash output that can index logs to Solr,
> which may be of interest - have a look at
> https://twitter.com/otisg/status/395563043045638144 -- includes a
> little diagram that shows how this fits into the picture.
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
> On Wed, Oct 30, 2013 at 9:55 AM, keshari.prerna
> <ke...@gmail.com> wrote:
> > Hello,
> >
> > As suggested by Chris, now I am accessing the files using java program
> and
> > creating SolrInputDocument, but i ran into this exception while doing
> > server.add(document). When i tried to increase "ramBufferSizeMB", it
> doesn't
> > let me make it more than 2 gig.
> >
> > org.apache.solr.client.solrj.SolrServerException: Server at
> > http://localhost:8983/solr/logsIndexing returned non ok status:500,
> > message:the request was rejected because its size (2097454) exceeds the
> > configured maximum (2097152)
> > org.apache.commons.fileupload.FileUploadBase$SizeLimitExceededException:
> the
> > request was rejected because its size (2097454) exceeds the configured
> > maximum (2097152)       at
> >
> org.apache.commons.fileupload.FileUploadBase$FileItemIteratorImpl$1.raiseError(FileUploadBase.java:902)
> > at
> >
> org.apache.commons.fileupload.util.LimitedInputStream.checkLimit(LimitedInputStream.java:71)
> > at
> >
> org.apache.commons.fileupload.util.LimitedInputStream.read(LimitedInputStream.java:128)
> > at
> >
> org.apache.commons.fileupload.MultipartStream$ItemInputStream.makeAvailable(MultipartStream.java:977)
> > at
> >
> org.apache.commons.fileupload.MultipartStream$ItemInputStream.read(MultipartStream.java:887)
> > at java.io.InputStream.read(Unknown Source)     at
> > org.apache.commons.fileupload.util.Streams.copy(Streams.java:94)
>  at
> > org.apache.commons.fileupload.util.Streams.copy(Streams.java:64)
>  at
> >
> org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:362)
> > at
> >
> org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
> > at
> >
> org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:344)
> > at
> >
> org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:397)
> > at
> >
> org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115)
> > at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
> > at
> >
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> > at
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> > at
> >
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> > at
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> > at
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> > at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> > at org.mortbay.jetty.handler.ContextHand
> >         at
> >
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:328)
> >         at
> >
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:211)
> >         at
> >
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
> >         at
> org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:121)
> >         at
> org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:106)
> >         at Filewalker.walk(LogsIndexer.java:48)
> >         at Filewalker.main(LogsIndexer.java:69)
> >
> > How do I get rid of this?
> >
> > Thanks,
> > Prerna
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/Indexing-logs-files-of-thousands-of-GBs-tp4097073p4098438.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Indexing log files of thousands of GBs

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi,

Hm, sorry for not helping with this particular issue directly, but it
looks like you are *uploading* your logs and indexing that way?
Wouldn't pushing them be a better fit when it comes to log indexing?
We recently contributed a Logstash output that can index logs to Solr,
which may be of interest - have a look at
https://twitter.com/otisg/status/395563043045638144 -- includes a
little diagram that shows how this fits into the picture.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/



On Wed, Oct 30, 2013 at 9:55 AM, keshari.prerna
<ke...@gmail.com> wrote:
> Hello,
>
> As suggested by Chris, now I am accessing the files using java program and
> creating SolrInputDocument, but i ran into this exception while doing
> server.add(document). When i tried to increase "ramBufferSizeMB", it doesn't
> let me make it more than 2 gig.
>
> org.apache.solr.client.solrj.SolrServerException: Server at
> http://localhost:8983/solr/logsIndexing returned non ok status:500,
> message:the request was rejected because its size (2097454) exceeds the
> configured maximum (2097152)
> org.apache.commons.fileupload.FileUploadBase$SizeLimitExceededException: the
> request was rejected because its size (2097454) exceeds the configured
> maximum (2097152)       at
> org.apache.commons.fileupload.FileUploadBase$FileItemIteratorImpl$1.raiseError(FileUploadBase.java:902)
> at
> org.apache.commons.fileupload.util.LimitedInputStream.checkLimit(LimitedInputStream.java:71)
> at
> org.apache.commons.fileupload.util.LimitedInputStream.read(LimitedInputStream.java:128)
> at
> org.apache.commons.fileupload.MultipartStream$ItemInputStream.makeAvailable(MultipartStream.java:977)
> at
> org.apache.commons.fileupload.MultipartStream$ItemInputStream.read(MultipartStream.java:887)
> at java.io.InputStream.read(Unknown Source)     at
> org.apache.commons.fileupload.util.Streams.copy(Streams.java:94)        at
> org.apache.commons.fileupload.util.Streams.copy(Streams.java:64)        at
> org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:362)
> at
> org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
> at
> org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:344)
> at
> org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:397)
> at
> org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
> at
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> at
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> at org.mortbay.jetty.handler.ContextHand
>         at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:328)
>         at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:211)
>         at
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
>         at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:121)
>         at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:106)
>         at Filewalker.walk(LogsIndexer.java:48)
>         at Filewalker.main(LogsIndexer.java:69)
>
> How do I get rid of this?
>
> Thanks,
> Prerna
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Indexing-logs-files-of-thousands-of-GBs-tp4097073p4098438.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Indexing log files of thousands of GBs

Posted by "keshari.prerna" <ke...@gmail.com>.
I have set the multipartUploadLimitInKB parameter to 10240 (it was 2048
earlier):

multipartUploadLimitInKB="10240"

Now it gives the following error for the same files:

http://localhost:8983/solr/logsIndexing returned non ok status:500,
message:the request was rejected because its size (10486046) exceeds the
configured maximum (10485760).
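
For reference, the limit I changed lives on the <requestParsers> element in
solrconfig.xml; mine now looks roughly like this (other attributes elided):

    <requestDispatcher ...>
        <requestParsers multipartUploadLimitInKB="10240" ... />
    </requestDispatcher>

10240 KB is 10485760 bytes, which matches the new maximum in the error, so the
change is being picked up; the request I am sending is simply still larger
than the limit.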




Re: Indexing log files of thousands of GBs

Posted by "keshari.prerna" <ke...@gmail.com>.
Hello,

As suggested by Chris, I am now accessing the files with a Java program and
creating SolrInputDocuments, but I ran into this exception while calling
server.add(document). When I tried to increase "ramBufferSizeMB", it wouldn't
let me set it to more than 2 GB.

org.apache.solr.client.solrj.SolrServerException: Server at
http://localhost:8983/solr/logsIndexing returned non ok status:500,
message:the request was rejected because its size (2097454) exceeds the
configured maximum (2097152)
org.apache.commons.fileupload.FileUploadBase$SizeLimitExceededException: the
request was rejected because its size (2097454) exceeds the configured
maximum (2097152)
	at org.apache.commons.fileupload.FileUploadBase$FileItemIteratorImpl$1.raiseError(FileUploadBase.java:902)
	at org.apache.commons.fileupload.util.LimitedInputStream.checkLimit(LimitedInputStream.java:71)
	at org.apache.commons.fileupload.util.LimitedInputStream.read(LimitedInputStream.java:128)
	at org.apache.commons.fileupload.MultipartStream$ItemInputStream.makeAvailable(MultipartStream.java:977)
	at org.apache.commons.fileupload.MultipartStream$ItemInputStream.read(MultipartStream.java:887)
	at java.io.InputStream.read(Unknown Source)
	at org.apache.commons.fileupload.util.Streams.copy(Streams.java:94)
	at org.apache.commons.fileupload.util.Streams.copy(Streams.java:64)
	at org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:362)
	at org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
	at org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:344)
	at org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:397)
	at org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
	at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
	at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
	at org.mortbay.jetty.handler.ContextHand
	at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:328)
	at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:211)
	at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
	at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:121)
	at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:106)
	at Filewalker.walk(LogsIndexer.java:48)
	at Filewalker.main(LogsIndexer.java:69)

How do I get rid of this?

Thanks,
Prerna




Re: Indexing log files of thousands of GBs

Posted by Erick Erickson <er...@gmail.com>.
As a supplement to what Chris said, if you can
partition the walking amongst a number of clients
you can also parallelize the indexing. If you're using
SolrCloud 4.5+, there are also some nice optimizations
in SolrCloud to keep intra-shard routing to a minimum.
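
If you stay with a single client, ConcurrentUpdateSolrServer in SolrJ gives
you some of that parallelism by draining a queue of documents with several
background threads. A rough sketch (the core name, queue size, and thread
count are placeholders, not recommendations):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        // Buffer up to 10000 docs and send them with 4 background threads,
        // so the file walker keeps running while updates are in flight.
        ConcurrentUpdateSolrServer server =
                new ConcurrentUpdateSolrServer("http://localhost:8983/solr/logsIndexing", 10000, 4);

        SolrInputDocument doc = new SolrInputDocument();  // stand-in for docs produced by the walker
        doc.addField("id", "example#1");
        doc.addField("rawLine", "example log line");
        server.add(doc);

        server.blockUntilFinished();  // wait for the background queue to drain
        server.commit();
        server.shutdown();
    }
}

Partitioning the directory tree across several such clients (or machines) is
then just a matter of giving each one a different subset of top-level
directories to walk.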

FWIW,
Erick


On Wed, Oct 23, 2013 at 12:59 PM, Chris Geeringh <ge...@gmail.com> wrote:

> Prerna,
>
> The FileListEntityProcessor has a terribly inefficient recursive method,
> which will be using up all your heap building a list of files.
>
> I would suggest writing a client application and traverse your filesystem
> with NIO available in Java 7. Files.walkFileTree() and a FileVisitor.
>
> As you "walk" post up to the server with SolrJ.
>
> Cheers,
> Chris
>
>
> On 22 October 2013 18:58, keshari.prerna <ke...@gmail.com> wrote:
>
> > Hello,
> >
> > I am tried to index log files (all text data) stored in file system. Data
> > can be as big as 1000 GBs or more. I am working on windows.
> >
> > A sample file can be found at
> > https://www.dropbox.com/s/mslwwnme6om38b5/batkid.glnxa64.66441
> >
> > I tried using FileListEntityProcessor with TikaEntityProcessor which
> ended
> > up in java heap exception and couldn't get rid of it no matter how much I
> > increase my ram size.
> > data-confilg.xml
> >
> > <dataConfig>
> >     <dataSource name="bin" type="FileDataSource" />
> >     <document>
> >         <entity name="f" dataSource="null" rootEntity="true"
> >             processor="FileListEntityProcessor"
> > transformer="TemplateTransformer"
> >             baseDir="//mathworks/devel/bat/A/logs/66048/"
> >             fileName=".*\.*" onError="skip" recursive="true">
> >
> >             <field column="fileAbsolutePath" name="path" />
> >             <field column="fileSize" name="size"/>
> >             <field column="fileLastModified" name="lastmodified" />
> >
> >             <entity name="file" dataSource="bin"
> > processor="TikaEntityProcessor" url="${f.fileAbsolutePath}" format="text"
> > onError="skip" transformer="TemplateTransformer"
> >            rootEntity="true">
> >                 <field column="text" name="text"/>
> >             </entity>
> >         </entity>
> >     </document>
> > </dataConfig>
> >
> > Then i used FileListEntityProcessor with LineEntityProcessor which never
> > stopped indexing even after 40 hours or so.
> >
> > data-config.xml
> >
> > <dataConfig>
> >     <dataSource name="bin" type="FileDataSource" />
> >     <document>
> >         <entity name="f" dataSource="null" rootEntity="true"
> >             processor="FileListEntityProcessor"
> > transformer="TemplateTransformer"
> >             baseDir="//mathworks/devel/bat/A/logs/"
> >             fileName=".*\.*" onError="skip" recursive="true">
> >
> >             <field column="fileAbsolutePath" name="path" />
> >             <field column="fileSize" name="size"/>
> >             <field column="fileLastModified" name="lastmodified" />
> >
> >             <entity name="file" dataSource="bin"
> > processor="LineEntityProcessor" url="${f.fileAbsolutePath}" format="text"
> > onError="skip"
> >            rootEntity="true">
> >                 <field column="content" name="rawLine"/>
> >             </entity>
> >         </entity>
> >     </document>
> > </dataConfig>
> >
> > Is there any way i can use post.jar to index text file recursively. Or
> any
> > other way which works without java heap exception and doesn't take days
> to
> > index.
> >
> > I am completely stuck here. Any help would be greatly appreciated.
> >
> > Thanks,
> > Prerna
> >
> >
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/Indexing-logs-files-of-thousands-of-GBs-tp4097073.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>

Re: Indexing log files of thousands of GBs

Posted by Chris Geeringh <ge...@gmail.com>.
Prerna,

The FileListEntityProcessor has a terribly inefficient recursive method,
which will be using up all your heap building a list of files.

I would suggest writing a client application and traverse your filesystem
with NIO available in Java 7. Files.walkFileTree() and a FileVisitor.

As you "walk" post up to the server with SolrJ.

Cheers,
Chris


On 22 October 2013 18:58, keshari.prerna <ke...@gmail.com> wrote:

> Hello,
>
> I am tried to index log files (all text data) stored in file system. Data
> can be as big as 1000 GBs or more. I am working on windows.
>
> A sample file can be found at
> https://www.dropbox.com/s/mslwwnme6om38b5/batkid.glnxa64.66441
>
> I tried using FileListEntityProcessor with TikaEntityProcessor which ended
> up in java heap exception and couldn't get rid of it no matter how much I
> increase my ram size.
> data-confilg.xml
>
> <dataConfig>
>     <dataSource name="bin" type="FileDataSource" />
>     <document>
>         <entity name="f" dataSource="null" rootEntity="true"
>             processor="FileListEntityProcessor"
> transformer="TemplateTransformer"
>             baseDir="//mathworks/devel/bat/A/logs/66048/"
>             fileName=".*\.*" onError="skip" recursive="true">
>
>             <field column="fileAbsolutePath" name="path" />
>             <field column="fileSize" name="size"/>
>             <field column="fileLastModified" name="lastmodified" />
>
>             <entity name="file" dataSource="bin"
> processor="TikaEntityProcessor" url="${f.fileAbsolutePath}" format="text"
> onError="skip" transformer="TemplateTransformer"
>            rootEntity="true">
>                 <field column="text" name="text"/>
>             </entity>
>         </entity>
>     </document>
> </dataConfig>
>
> Then i used FileListEntityProcessor with LineEntityProcessor which never
> stopped indexing even after 40 hours or so.
>
> data-config.xml
>
> <dataConfig>
>     <dataSource name="bin" type="FileDataSource" />
>     <document>
>         <entity name="f" dataSource="null" rootEntity="true"
>             processor="FileListEntityProcessor"
> transformer="TemplateTransformer"
>             baseDir="//mathworks/devel/bat/A/logs/"
>             fileName=".*\.*" onError="skip" recursive="true">
>
>             <field column="fileAbsolutePath" name="path" />
>             <field column="fileSize" name="size"/>
>             <field column="fileLastModified" name="lastmodified" />
>
>             <entity name="file" dataSource="bin"
> processor="LineEntityProcessor" url="${f.fileAbsolutePath}" format="text"
> onError="skip"
>            rootEntity="true">
>                 <field column="content" name="rawLine"/>
>             </entity>
>         </entity>
>     </document>
> </dataConfig>
>
> Is there any way i can use post.jar to index text file recursively. Or any
> other way which works without java heap exception and doesn't take days to
> index.
>
> I am completely stuck here. Any help would be greatly appreciated.
>
> Thanks,
> Prerna
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Indexing-logs-files-of-thousands-of-GBs-tp4097073.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>