Posted to solr-user@lucene.apache.org by Peter Spam <ps...@mac.com> on 2011/10/21 02:59:04 UTC

Can Solr handle large text files?

I have about 20k text files, some very small, but some up to 300MB, and would like to do text searching with highlighting.

Imagine the text is the contents of your syslog.

I would like to type in some terms, such as "error" and "mail", and have Solr return the syslog lines with those terms PLUS two lines of context.  Pretty much just like Google's highlighting.

1) Can Solr handle this?  I had extremely long query times when I tried this with Solr 1.4.1 (yes, I was using TermVectors, etc.).  I tried breaking the files into 1MB pieces, but then searching returned the wrong number of documents (i.e., if one file had a term 5 times, and that was the only file that had the term, I want 1 result, not 5 results).

2) What sort of tokenizer would be best?  Here's what I'm using:

   <field name="body" type="text_pl" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true" />

    <fieldType name="text_pl" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
      </analyzer>
    </fieldType>


Thanks!
Pete

Re: Can Solr handle large text files?

Posted by Chris Hostetter <ho...@fucit.org>.

: I have about 20k text files, some very small, but some up to 300MB, and 
: would like to do text searching with highlighting.
: 
: Imagine the text is the contents of your syslog.
: 
: I would like to type in some terms, such as "error" and "mail", and have 
: Solr return the syslog lines with those terms PLUS two lines of context.  
: Pretty much just like Google's highlighting.

The devil is in the details.  

based on the description of your problem, i would not index each TXT file 
as a single document.  instead i would index each *line* of each TXT file 
as a document, and in stored (but not indexed) fields i would put the extra 
lines of context for highlighting.

but that assumes that the results you are interested in are matching 
*lines* and not matching *files* -- based on your syslog example that 
seems like what you want (ie: "find me log entries containing 'error' and 
'mail'" ... not "find me entire log files that contain at least one error 
and at least one mention of mail, even if they have nothing to do with one 
another").  if that's not your goal, then please provide a more precise 
example of your use case.


-Hoss
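
A minimal client-side sketch of the line-per-document idea described above, assuming the log lines are already in memory. The field names (`body`, `context_before`, `context_after`) and the id scheme are illustrative assumptions, not something from this thread; in the schema, only `body` would be indexed and the context fields would be stored but not indexed.

```python
from typing import Dict, Iterator, List


def line_documents(filename: str, lines: List[str], context: int = 2) -> Iterator[Dict[str, str]]:
    """Yield one document per log line, carrying its neighbouring lines as context."""
    for i, line in enumerate(lines):
        before = lines[max(0, i - context):i]      # up to `context` lines before the match
        after = lines[i + 1:i + 1 + context]       # up to `context` lines after the match
        yield {
            "id": f"{filename}:{i + 1}",           # hypothetical id: file name plus 1-based line number
            "filename": filename,
            "body": line,                          # the searchable line itself
            "context_before": "\n".join(before),   # assumed stored-only field
            "context_after": "\n".join(after),     # assumed stored-only field
        }
```

Each dictionary would then be sent to Solr by whatever client is in use. The trade-off is a much larger number of small documents, with every line stored up to 2 * context + 1 times.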

Re: Can Solr handle large text files?

Posted by Peter Spam <ps...@mac.com>.

Thanks for the response, Karsten.

1) What's the recommended maximum chunk size?
2) Does my tokenizer look reasonable?


Thanks!
Pete

On Oct 21, 2011, at 2:28 AM, karsten-solr@gmx.de wrote:

> Hi Peter,
> 
> Highlighting in large text files cannot be fast without dividing the original text into small pieces.
> So take a look at
> http://xtf.cdlib.org/documentation/under-the-hood/#Chunking
> and at
> http://www.lucidimagination.com/blog/2010/09/16/2446/
> 
> This means that you should divide your files and use
> Result Grouping / Field Collapsing
> to list only one hit per original document.
> 
> (XTF would also solve your problem "out of the box", but XTF does not use Solr.)
> 
> Best regards
>  Karsten
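
As a rough sketch of the grouping suggestion above: once each file is split into chunks that share a file-name field, a grouped query can collapse the chunks back to one hit per original file. The base URL and the field names `filename` and `body` are assumptions for illustration; `group`, `group.field`, `group.limit`, `hl` and `hl.fl` are standard Solr request parameters (Result Grouping requires Solr 3.3 or later).

```python
from typing import List
from urllib.parse import urlencode


def grouped_query_url(base_url: str, terms: List[str],
                      group_field: str = "filename", body_field: str = "body") -> str:
    """Build a select URL that returns one highlighted chunk per original file."""
    params = [
        ("q", " AND ".join(f"{body_field}:{t}" for t in terms)),  # every term must match the chunk
        ("group", "true"),             # turn on Result Grouping
        ("group.field", group_field),  # collapse chunks that came from the same file
        ("group.limit", "1"),          # keep only the top chunk per file
        ("hl", "true"),                # highlight matches
        ("hl.fl", body_field),         # ...in the chunk text
    ]
    return base_url + "?" + urlencode(params)
```

For example, `grouped_query_url("http://localhost:8983/solr/select", ["error", "mail"])` yields a URL whose response lists each matching file once, however many of its chunks contain the terms.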

Re: Can Solr handle large text files?

Posted by Peter Spam <ps...@me.com>.

Thanks for the reminder - I had that set to 214xxx... (the max), but perf was terrible when I injected large files.

So what's the max recommended field size in kb?  I can try chopping up the syslogs into arbitrarily small pieces, but would love to know where to start.

Thanks!

On Oct 23, 2011, at 2:01 PM, Erick Erickson <er...@gmail.com> wrote:

> Also be aware that by default Solr is configured to index only the
> first 10,000 tokens
> of each field. See maxFieldLength in solrconfig.xml
> 
> Best
> Erick
> 
> On Fri, Oct 21, 2011 at 7:34 PM, Peter Spam <ps...@mac.com> wrote:
>> Thanks for your note, Anand.  What was the maximum chunk size for you?  Could you post the relevant portions of your configuration file?
>> 
>> 
>> Thanks!
>> Pete
>> 
>> On Oct 21, 2011, at 4:20 AM, Anand.Nigam@rbs.com wrote:
>> 
>>> Hi,
>>> 
>>> I was also facing the issue of highlighting large text files. I applied the solution proposed here and it worked, but I am getting the following error:
>>> 
>>> 
>>> Basically, 'hitGrouped.vm' is not found. I am using solr-3.4.0. Where can I get this file? It is referenced in browse.vm:
>>> 
>>> <div class="results">
>>>  #if($response.response.get('grouped'))
>>>    #foreach($grouping in $response.response.get('grouped'))
>>>      #parse("hitGrouped.vm")
>>>    #end
>>>  #else
>>>    #foreach($doc in $response.results)
>>>      #parse("hit.vm")
>>>    #end
>>>  #end
>>> </div>
>>> 
>>> 
>>> HTTP Status 500 - Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config java.lang.RuntimeException: Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:268) at org.apache.solr.response.SolrVelocityResourceLoader.getResourceStream(SolrVelocityResourceLoader.java:42) at org.apache.velocity.Template.process(Template.java:98) at org.apache.velocity.runtime.resource.ResourceManagerImpl.loadResource(ResourceManagerImpl.java:446) at
>>> 
>>> Thanks & Regards,
>>> Anand
>>> Anand Nigam
>>> RBS Global Banking & Markets
>>> Office: +91 124 492 5506

Re: Can Solr handle large text files?

Posted by Erick Erickson <er...@gmail.com>.

Also be aware that by default Solr is configured to index only the
first 10,000 tokens
of each field. See maxFieldLength in solrconfig.xml

Best
Erick


Re: Can Solr handle large text files?

Posted by Peter Spam <ps...@mac.com>.

Has the performance of highlighting large text documents been improved in Solr 4?


Thanks!
Pete

On Nov 5, 2011, at 9:03 AM, Erick Erickson <er...@gmail.com> wrote:

> Sure, if you write a custom update handler. But I'm not at all sure
> this is "ideal".
> You're requiring all that data to be transmitted across the wire and processed
> by Solr. Assuming you have more than one input source, the Solr server in
> the background will be handling up to N documents simultaneously. Plus
> the effort to index. I think I'd recommend splitting them up on the client side.
> 
> Best
> Erick
> 
> On Fri, Nov 4, 2011 at 3:23 AM, Peter Spam <ps...@mac.com> wrote:
>> Solr 4.0 (11/1 snapshot)
>> Data: 80k files, average size 2.5MB, largest is 750MB;
>> Solr: Each document is max 256k; total docs = 800k
>> Machine: Early 2009 Mac Pro, 6GB RAM, 1GBmin/2GBmax given to Solr Java; Admin shows 30% mem usage
>> 
>> I originally tried injecting the entire file into a single Solr document, and this had disastrous results when trying to highlight.  I've now tried splitting each file into 256k segments per Solr document, and the results are better, but still not what I was hoping for.  Queries are around 2-8 seconds, with some reaching into 30+ second territory.
>> 
>> Ideally, I'd like to feed Solr the metadata and the entire file at once, and have the back-end split the file into thousands of pieces.  Is this possible?
>> 
>> 
>> Thanks!
>> Pete
>> 
>> On Nov 1, 2011, at 5:15 PM, Peter Spam wrote:
>> 
>>> Wow, 50 lines is tiny!  Is that how small you need to go, to get good highlighting performance?
>>> 
>>> I'm looking at documents that can be up to 800MB in size, so I've decided to split them down into 256k chunks.  I'm still indexing right now - I'm curious to see how performance is when the injection is finished.
>>> 
>>> Has anyone done analysis on where the knee in the curve is, wrt document size vs. # of documents?
>>> 
>>> 
>>> Thanks!
>>> Pete
>>> 
>>> On Oct 31, 2011, at 9:28 PM, Anand.Nigam@rbs.com wrote:
>>> 
>>>> Hi,
>>>> 
>>>> Basically I need to index very large log files. I have modified the ExtractingDocumentLoader to create a new document for every 50 lines of the log file being indexed (the number is configurable through a system property). The 'Filename' field is the same for every document created from one log file, and the unique id is generated by appending the line numbers to the file name, e.g. 'log.txt (line no. 100 -150)'. Each doc is given a custom score, stored in a field called 'custom_score', which is directly proportional to its distance from the beginning of the file.
>>>> 
>>>> I have also found 'hitGrouped.vm' on the net. Since I am reading only 50 lines for each document, the default max chunk size works for me, but it can easily be adjusted depending on the number of lines you are reading per doc.
>>>> 
>>>> Now I group on the 'filename' field and show the results from the docs with the highest score; as a result I am able to show the last matching results from a log file. The query parameters that I am using for search are:
>>>> 
>>>> http://localhost:8080/solr/select?defType=dismax&qf=Content&q=Solr&fl=id,score&defType=dismax&bf=sub(1000,caprice_score)&group=true&group.field=FileName
>>>> 
>>>> The results are amazing: I am able to index and search very large log files (a few hundred MB) with very low memory requirements. Highlighting is also working fine.
>>>> 
>>>> Thanks & Regards,
>>>> Anand
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Anand Nigam
>>>> RBS Global Banking & Markets
>>>> Office: +91 124 492 5506

Re: Can Solr handle large text files?

Posted by Erick Erickson <er...@gmail.com>.

Sure, if you write a custom update handler. But I'm not at all sure
this is "ideal".
You're requiring all that data to be transmitted across the wire and processed
by Solr. Assuming you have more than one input source, the Solr server in
the background will be handling up to N documents simultaneously. Plus
the effort to index. I think I'd recommend splitting them up on the client side.

Best
Erick
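
One way to do the client-side splitting recommended above is sketched below. It streams a file and emits one document per fixed number of lines, similar in spirit to the 50-line chunks reported earlier in this thread. The field names, the id format and the helper `_make_doc` are hypothetical; nothing here is taken from an actual configuration posted to the list.

```python
from typing import Dict, Iterable, Iterator, List


def chunk_documents(filename: str, lines: Iterable[str],
                    lines_per_doc: int = 50) -> Iterator[Dict[str, object]]:
    """Yield one document per `lines_per_doc` lines without loading the whole file."""
    buffer: List[str] = []
    start = 1                                  # 1-based number of the first line in the buffer
    for line_no, line in enumerate(lines, start=1):
        buffer.append(line)
        if len(buffer) == lines_per_doc:
            yield _make_doc(filename, start, line_no, buffer)
            buffer = []
            start = line_no + 1
    if buffer:                                 # final, possibly shorter, chunk
        yield _make_doc(filename, start, start + len(buffer) - 1, buffer)


def _make_doc(filename: str, first: int, last: int, lines: List[str]) -> Dict[str, object]:
    """Hypothetical helper: assemble the fields for one chunk."""
    return {
        "id": f"{filename} (lines {first}-{last})",  # unique per chunk
        "filename": filename,                        # shared by all chunks of a file, usable for grouping
        "first_line": first,                         # position of the chunk within the file
        "body": "".join(lines),                      # lines keep their own newline characters
    }
```

Passing an open file handle as `lines` keeps memory use proportional to one chunk rather than to the file, which matters for inputs in the hundreds of megabytes. The chunk size is a tuning knob; the thread reports both 50-line and 256k chunks, with no agreed optimum.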

On Fri, Nov 4, 2011 at 3:23 AM, Peter Spam <ps...@mac.com> wrote:
> Solr 4.0 (11/1 snapshot)
> Data: 80k files, average size 2.5MB, largest is 750MB;
> Solr: Each document is max 256k; total docs = 800k
> Machine: Early 2009 Mac Pro, 6GB RAM, 1GBmin/2GBmax given to Solr Java; Admin shows 30% mem usage
>
> I originally tried injecting the entire file into a single Solr document, and this had disastrous results when trying to highlight.  I've now tried splitting each file into 256k segments per Solr document, and the results are better, but still not what I was hoping for.  Queries are around 2-8 seconds, with some reaching into 30+ second territory.
>
> Ideally, I'd like to feed Solr the metadata and the entire file at once, and have the back-end split the file into thousands of pieces.  Is this possible?
>
>
> Thanks!
> Pete
>
> On Nov 1, 2011, at 5:15 PM, Peter Spam wrote:
>
>> Wow, 50 lines is tiny!  Is that how small you need to go, to get good highlighting performance?
>>
>> I'm looking at documents that can be up to 800MB in size, so I've decided to split them down into 256k chunks.  I'm still indexing right now - I'm curious to see how performance is when the injection is finished.
>>
>> Has anyone done analysis on where the knee in the curve is, wrt document size vs. # of documents?
>>
>>
>> Thanks!
>> Pete
>>
>> On Oct 31, 2011, at 9:28 PM, Anand.Nigam@rbs.com wrote:
>>
>>> Hi,
>>>
>>> Basically, I need to index very large log files. I have modified the ExtractingDocumentLoader to create a new document for every 50 lines of the log file being indexed (the count is configurable through a system property). The 'Filename' field is the same for every document created from one log file, and the unique id is generated by appending the line numbers to the file name, e.g. 'log.txt (line no. 100 -150)'. Each doc is given a custom score, stored in a field called 'custom_score', which is directly proportional to its distance from the beginning of the file.
>>>
>>> I have also found 'hitGrouped.vm' on the net. Since I am reading only 50 lines for each document, the default max chunk size works for me, but it can easily be adjusted depending on the number of lines you are reading per doc.
>>>
>>> Now I group on the 'filename' field and show the results from the docs with the highest score; as a result, I am able to show the last matching results from each log file. The query parameters that I am using for the search are:
>>>
>>> http://localhost:8080/solr/select?defType=dismax&qf=Content&q=Solr&fl=id,score&bf=sub(1000,caprice_score)&group=true&group.field=FileName
>>>
>>> The results are amazing: I am able to index and search very large log files (a few hundred MB) with very low memory requirements. Highlighting is also working fine.
>>>
>>> Thanks & Regards,
>>> Anand
>>>
>>>
>>>
>>>
>>>
>>> Anand Nigam
>>> RBS Global Banking & Markets
>>> Office: +91 124 492 5506
>>>
>>> -----Original Message-----
>>> From: Peter Spam [mailto:pspam@mac.com]
>>> Sent: 21 October 2011 23:04
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Can Solr handle large text files?
>>>
>>> Thanks for your note, Anand.  What was the maximum chunk size for you?  Could you post the relevant portions of your configuration file?
>>>
>>>
>>> Thanks!
>>> Pete
>>>
>>> On Oct 21, 2011, at 4:20 AM, Anand.Nigam@rbs.com wrote:
>>>
>>>> Hi,
>>>>
>>>> I was also facing the issue of highlighting large text files. I applied the solution proposed here and it worked, but I am getting the following error:
>>>>
>>>>
>>>> Basically, 'hitGrouped.vm' is not found. I am using solr-3.4.0. Where
>>>> can I get this file? It is referenced in browse.vm:
>>>>
>>>> <div class="results">
>>>> #if($response.response.get('grouped'))
>>>>  #foreach($grouping in $response.response.get('grouped'))
>>>>    #parse("hitGrouped.vm")
>>>>  #end
>>>> #else
>>>>  #foreach($doc in $response.results)
>>>>    #parse("hit.vm")
>>>>  #end
>>>> #end
>>>> </div>
>>>>
>>>>
>>>> HTTP Status 500 - Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config
>>>> java.lang.RuntimeException: Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config
>>>>   at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:268)
>>>>   at org.apache.solr.response.SolrVelocityResourceLoader.getResourceStream(SolrVelocityResourceLoader.java:42)
>>>>   at org.apache.velocity.Template.process(Template.java:98)
>>>>   at org.apache.velocity.runtime.resource.ResourceManagerImpl.loadResource(ResourceManagerImpl.java:446)
>>>>   at
>>>>
>>>> Thanks & Regards,
>>>> Anand
>>>> Anand Nigam
>>>> RBS Global Banking & Markets
>>>> Office: +91 124 492 5506
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: karsten-solr@gmx.de [mailto:karsten-solr@gmx.de]
>>>> Sent: 21 October 2011 14:58
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Can Solr handle large text files?
>>>>
>>>> Hi Peter,
>>>>
>>>> Highlighting in large text files cannot be fast without dividing the original text into small pieces.
>>>> So take a look at
>>>> http://xtf.cdlib.org/documentation/under-the-hood/#Chunking
>>>> and at
>>>> http://www.lucidimagination.com/blog/2010/09/16/2446/
>>>>
>>>> This means that you should divide your files and use Result Grouping / Field Collapsing to list only one hit per original document.
>>>>
>>>> (XTF would also solve your problem "out of the box", but XTF does not use Solr.)
>>>>
>>>> Best regards
>>>> Karsten
>>>>
>>>> -------- Original Message --------
>>>>> Date: Thu, 20 Oct 2011 17:59:04 -0700
>>>>> From: Peter Spam <ps...@mac.com>
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Can Solr handle large text files?
>>>>
>>>>> I have about 20k text files, some very small, but some up to 300MB,
>>>>> and would like to do text searching with highlighting.
>>>>>
>>>>> Imagine the text is the contents of your syslog.
>>>>>
>>>>> I would like to type in some terms, such as "error" and "mail", and
>>>>> have Solr return the syslog lines with those terms PLUS two lines of context.
>>>>> Pretty much just like Google's highlighting.
>>>>>
>>>>> 1) Can Solr handle this?  I had extremely long query times when I
>>>>> tried this with Solr 1.4.1 (yes I was using TermVectors, etc.).  I
>>>>> tried breaking the files into 1MB pieces, but searching would be
>>>>> wonky => return the wrong number of documents (ie. if one file had a
>>>>> term 5 times, and that was the only file that had the term, I want 1 result, not 5 results).
>>>>>
>>>>> 2) What sort of tokenizer would be best?  Here's what I'm using:
>>>>>
>>>>> <field name="body" type="text_pl" indexed="true" stored="true"
>>>>> multiValued="false" termVectors="true" termPositions="true"
>>>>> termOffsets="true" />
>>>>>
>>>>>  <fieldType name="text_pl" class="solr.TextField">
>>>>>    <analyzer>
>>>>>      <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>>>      <filter class="solr.WordDelimiterFilterFactory"
>>>>> generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0"
>>>>> catenateAll="0" splitOnCaseChange="0"/>
>>>>>    </analyzer>
>>>>>  </fieldType>
>>>>>
>>>>>
>>>>> Thanks!
>>>>> Pete
>>>>
>>>> **********************************************************************
>>>> ************* The Royal Bank of Scotland plc. Registered in Scotland
>>>> No 90312.
>>>> Registered Office: 36 St Andrew Square, Edinburgh EH2 2YB.
>>>> Authorised and regulated by the Financial Services Authority. The
>>>> Royal Bank of Scotland N.V. is authorised and regulated by the De
>>>> Nederlandsche Bank and has its seat at Amsterdam, the Netherlands, and
>>>> is registered in the Commercial Register under number 33002587.
>>>> Registered Office: Gustav Mahlerlaan 350, Amsterdam, The Netherlands.
>>>> The Royal Bank of Scotland N.V. and The Royal Bank of Scotland plc are
>>>> authorised to act as agent for each other in certain jurisdictions.
>>>>
>>>> This e-mail message is confidential and for use by the addressee only.
>>>> If the message is received by anyone other than the addressee, please
>>>> return the message to the sender by replying to it and then delete the
>>>> message from your computer. Internet e-mails are not necessarily
>>>> secure. The Royal Bank of Scotland plc and The Royal Bank of Scotland
>>>> N.V. including its affiliates ("RBS group") does not accept
>>>> responsibility for changes made to this message after it was sent. For
>>>> the protection of RBS group and its clients and customers, and in
>>>> compliance with regulatory requirements, the contents of both incoming
>>>> and outgoing e-mail communications, which could include proprietary
>>>> information and Non-Public Personal Information, may be read by
>>>> authorised persons within RBS group other than the intended recipient(s).
>>>>
>>>> Whilst all reasonable care has been taken to avoid the transmission of
>>>> viruses, it is the responsibility of the recipient to ensure that the
>>>> onward transmission, opening or use of this message and any
>>>> attachments will not adversely affect its systems or data. No
>>>> responsibility is accepted by the RBS group in this regard and the
>>>> recipient should carry out such virus and other checks as it considers appropriate.
>>>>
>>>> Visit our website at www.rbs.com
>>>>
>>>> **********************************************************************
>>>> *************
>>>>
>>>
>>
>
>
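
For reference, the grouped dismax request quoted in this thread can be assembled programmatically instead of being typed by hand. This is only a minimal Python sketch, not Anand's actual client code: the host, port, and field names (Content, FileName, caprice_score) are copied from the quoted URL, and the helper name build_grouped_query is made up for illustration.

```python
from urllib.parse import urlencode

# Base URL copied from the query quoted in the thread.
SOLR_SELECT = "http://localhost:8080/solr/select"


def build_grouped_query(terms, group_field="FileName"):
    """Return a select URL that collapses chunk hits to one group per file."""
    params = [
        ("defType", "dismax"),
        ("qf", "Content"),                  # field searched, as in the quoted URL
        ("q", terms),
        ("fl", "id,score"),
        ("bf", "sub(1000,caprice_score)"),  # boost function from the quoted URL
        ("group", "true"),                  # Result Grouping / Field Collapsing
        ("group.field", group_field),       # one group per original log file
    ]
    return SOLR_SELECT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_grouped_query("error mail"))
```

Building the query string with urlencode keeps the parentheses and comma in the boost function correctly escaped, which is easy to get wrong when pasting the URL by hand.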

Re: Can Solr handle large text files?

Posted by Peter Spam <ps...@mac.com>.

Solr 4.0 (11/1 snapshot)
Data: 80k files, average size 2.5MB, largest is 750MB
Solr: Each document is max 256k; total docs = 800k
Machine: Early 2009 Mac Pro, 6GB RAM, 1GBmin/2GBmax given to Solr Java; Admin shows 30% mem usage

I originally tried injecting the entire file into a single Solr document, and this had disastrous results when trying to highlight.  I've now tried splitting each file into 256k segments per Solr document, and the results are better, but still not what I was hoping for.  Queries are around 2-8 seconds, with some reaching into 30+ second territory.

Ideally, I'd like to feed Solr the metadata and the entire file at once, and have the back-end split the file into thousands of pieces.  Is this possible?
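
One way to get a similar effect is to do the split on the client before posting. What follows is a rough Python sketch under stated assumptions, not a description of anything Solr does internally: it cuts a file on line boundaries into pieces of at most 256k and emits one document per piece. The field names id, filename, and chunk are hypothetical (only body comes from the schema posted earlier), as are the helper names chunk_lines and file_to_docs.

```python
def chunk_lines(lines, max_bytes=256 * 1024):
    """Yield lists of lines whose UTF-8 size stays at or below max_bytes.

    A single line longer than max_bytes is emitted as its own chunk.
    """
    chunk, size = [], 0
    for line in lines:
        n = len(line.encode("utf-8"))
        if chunk and size + n > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += n
    if chunk:
        yield chunk


def file_to_docs(filename, lines, max_bytes=256 * 1024):
    """Turn one log file into Solr-style documents, one per chunk."""
    for i, chunk in enumerate(chunk_lines(lines, max_bytes)):
        yield {
            "id": f"{filename}#{i}",  # hypothetical unique key
            "filename": filename,     # shared by all chunks, usable for grouping
            "chunk": i,               # position of the chunk within the file
            "body": "".join(chunk),
        }


if __name__ == "__main__":
    sample = [f"line {n}: error in mail daemon\n" for n in range(10)]
    for doc in file_to_docs("syslog.txt", sample, max_bytes=100):
        print(doc["id"], len(doc["body"]))
```

Splitting on line boundaries keeps each syslog line whole, so a highlighted hit never straddles two documents; the shared filename field is what a grouped query would collapse on.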


Thanks!
Pete


Re: Can Solr handle large text files?

Posted by Peter Spam <ps...@mac.com>.

Wow, 50 lines is tiny!  Is that how small you need to go, to get good highlighting performance?
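
Chunk size is one lever; the highlighting request parameters are another. As a rough sketch (not a tuned configuration), the snippet below builds a request asking for a few short snippets from the body field. hl, hl.fl, hl.snippets, hl.fragsize, and hl.maxAnalyzedChars are standard Solr highlighting parameters; the values chosen and the helper name highlight_params are assumptions for illustration only.

```python
from urllib.parse import urlencode


def highlight_params(terms, field="body", snippets=3, fragsize=200,
                     max_chars=256 * 1024):
    """Build query-string parameters for a highlighted search."""
    return {
        "q": f"{field}:({terms})",
        "hl": "true",                      # turn highlighting on
        "hl.fl": field,                    # field to highlight
        "hl.snippets": snippets,           # max snippets per document
        "hl.fragsize": fragsize,           # approximate snippet length in chars
        "hl.maxAnalyzedChars": max_chars,  # how much of the field is examined
    }


if __name__ == "__main__":
    print(urlencode(highlight_params("error mail")))
```

Setting hl.maxAnalyzedChars to the chunk size is a guess at a sensible pairing: with smaller chunks the highlighter has less text to examine per hit, which is the effect the 50-line split relies on.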

I'm looking at documents that can be up to 800MB in size, so I've decided to split them down into 256k chunks.  I'm still indexing right now - I'm curious to see how performance is when the injection is finished.

Has anyone done analysis on where the knee in the curve is, wrt document size vs. # of documents?
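
For a sense of scale on the document-count side of that trade-off, here is a quick back-of-the-envelope calculation. It assumes 1 MB = 1024 x 1024 bytes and 256k = 256 x 1024 bytes; chunk_count is just an illustrative helper, and it says nothing about where query time actually starts to degrade.

```python
KIB = 1024
MIB = 1024 * KIB


def chunk_count(file_bytes, chunk_bytes):
    """Number of chunks needed to cover a file (ceiling division)."""
    return -(-file_bytes // chunk_bytes)


if __name__ == "__main__":
    # How many Solr documents one 800 MB file becomes at several chunk sizes.
    for chunk in (1 * MIB, 256 * KIB, 64 * KIB):
        print(chunk // KIB, "KiB chunks ->", chunk_count(800 * MIB, chunk), "docs")
```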


Thanks!
Pete


Re: Can Solr handle large text files?

Posted by Peter Spam <ps...@mac.com>.

Oh by the way - what analyzer are you using for your log files?  Here's what I'm trying:

    <fieldType name="text_pl" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
      </analyzer>
    </fieldType>


Thanks!
Pete

On Oct 31, 2011, at 9:28 PM, Anand.Nigam@rbs.com wrote:

> Hi,
> 
> Basically, I need to index very large log files. I have modified the ExtractingDocumentLoader to create a new document for every 50 lines of the log file being indexed (the count is configurable through a system property). The 'Filename' field is the same for every document created from one log file, and the unique id is generated by appending the line numbers to the file name, e.g. 'log.txt (line no. 100 -150)'. Each doc is given a custom score, stored in a field called 'custom_score', which is directly proportional to its distance from the beginning of the file.
> 
> I have also found 'hitGrouped.vm' on the net. Since I am reading only 50 lines for each document, the default max chunk size works for me, but it can easily be adjusted depending on the number of lines you are reading per doc.
> 
> Now I group on the 'filename' field and show the results from the docs with the highest score; as a result, I am able to show the last matching results from each log file. The query parameters that I am using for the search are:
> 
> http://localhost:8080/solr/select?defType=dismax&qf=Content&q=Solr&fl=id,score&bf=sub(1000,caprice_score)&group=true&group.field=FileName
> 
> The results are amazing: I am able to index and search very large log files (a few hundred MB) with very low memory requirements. Highlighting is also working fine.
> 
> Thanks & Regards,
> Anand
> 
> 
> 
> 
> 
> Anand Nigam
> RBS Global Banking & Markets
> Office: +91 124 492 5506   
> 
> -----Original Message-----
> From: Peter Spam [mailto:pspam@mac.com] 
> Sent: 21 October 2011 23:04
> To: solr-user@lucene.apache.org
> Subject: Re: Can Solr handle large text files?
> 
> Thanks for your note, Anand.  What was the maximum chunk size for you?  Could you post the relevant portions of your configuration file?
> 
> 
> Thanks!
> Pete
> 
> On Oct 21, 2011, at 4:20 AM, Anand.Nigam@rbs.com wrote:
> 
>> Hi,
>> 
>> I was also facing the issue of highlighting large text files. I applied the solution proposed here and it worked, but I am getting the following error:
>> 
>> 
>> Basically, 'hitGrouped.vm' is not found. I am using solr-3.4.0. Where 
>> can I get this file? It is referenced in browse.vm:
>> 
>> <div class="results">
>> #if($response.response.get('grouped'))
>>   #foreach($grouping in $response.response.get('grouped'))
>>     #parse("hitGrouped.vm")
>>   #end
>> #else
>>   #foreach($doc in $response.results)
>>     #parse("hit.vm")
>>   #end
>> #end
>> </div>
>> 
>> 
>> HTTP Status 500 - Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config
>> java.lang.RuntimeException: Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config
>>   at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:268)
>>   at org.apache.solr.response.SolrVelocityResourceLoader.getResourceStream(SolrVelocityResourceLoader.java:42)
>>   at org.apache.velocity.Template.process(Template.java:98)
>>   at org.apache.velocity.runtime.resource.ResourceManagerImpl.loadResource(ResourceManagerImpl.java:446)
>>   at
>> 
>> Thanks & Regards,
>> Anand
>> Anand Nigam
>> RBS Global Banking & Markets
>> Office: +91 124 492 5506   
>> 
>> 
>> -----Original Message-----
>> From: karsten-solr@gmx.de [mailto:karsten-solr@gmx.de]
>> Sent: 21 October 2011 14:58
>> To: solr-user@lucene.apache.org
>> Subject: Re: Can Solr handle large text files?
>> 
>> Hi Peter,
>> 
>> Highlighting in large text files cannot be fast without dividing the original text into small pieces.
>> So take a look at
>> http://xtf.cdlib.org/documentation/under-the-hood/#Chunking
>> and at
>> http://www.lucidimagination.com/blog/2010/09/16/2446/
>> 
>> This means that you should divide your files and use Result Grouping / Field Collapsing to list only one hit per original document.
>> 
>> (XTF would also solve your problem "out of the box", but XTF does not use Solr.)
>> 
>> Best regards
>> Karsten
>> 
>> -------- Original Message --------
>>> Date: Thu, 20 Oct 2011 17:59:04 -0700
>>> From: Peter Spam <ps...@mac.com>
>>> To: solr-user@lucene.apache.org
>>> Subject: Can Solr handle large text files?
>> 
>>> I have about 20k text files, some very small, but some up to 300MB, 
>>> and would like to do text searching with highlighting.
>>> 
>>> Imagine the text is the contents of your syslog.
>>> 
>>> I would like to type in some terms, such as "error" and "mail", and 
>>> have Solr return the syslog lines with those terms PLUS two lines of context.
>>> Pretty much just like Google's highlighting.
>>> 
>>> 1) Can Solr handle this?  I had extremely long query times when I 
>>> tried this with Solr 1.4.1 (yes I was using TermVectors, etc.).  I 
>>> tried breaking the files into 1MB pieces, but searching would be 
>>> wonky => return the wrong number of documents (ie. if one file had a 
>>> term 5 times, and that was the only file that had the term, I want 1 result, not 5 results).
>>> 
>>> 2) What sort of tokenizer would be best?  Here's what I'm using:
>>> 
>>>  <field name="body" type="text_pl" indexed="true" stored="true"
>>> multiValued="false" termVectors="true" termPositions="true" 
>>> termOffsets="true" />
>>> 
>>>   <fieldType name="text_pl" class="solr.TextField">
>>>     <analyzer>
>>>       <tokenizer class="solr.StandardTokenizerFactory"/>
>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>       <filter class="solr.WordDelimiterFilterFactory"
>>> generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0"
>>> catenateAll="0" splitOnCaseChange="0"/>
>>>     </analyzer>
>>>   </fieldType>
>>> 
>>> 
>>> Thanks!
>>> Pete
>> 
>

RE: Can Solr handle large text files?

Posted by An...@rbs.com.

Hi,

Basically, I need to index very large log files. I have modified the ExtractingDocumentLoader to create a new document for every 50 lines of the log file being indexed (the number is configurable through a system property). The 'Filename' field is kept the same for all documents created from one log file, and the unique id is generated by appending the line numbers to the file name, e.g. 'log.txt (line no. 100 -150)'. Each doc is given a custom score, stored in a field called 'custom_score', which is directly proportional to its distance from the beginning of the file.
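
A minimal sketch of the chunking scheme described above, written in Python rather than as a change to ExtractingDocumentLoader. The function names, the 'Content' field and the exact id format are assumptions for illustration only; the real modification lives inside Solr's Java code and is not shown in this thread.

```python
from typing import Dict, Iterable, Iterator, List


def _make_doc(filename: str, first: int, last: int, block: List[str]) -> Dict[str, object]:
    """Build one document dict for lines first..last of a log file."""
    return {
        # Unique id: file name plus the line range covered by this chunk.
        "id": f"{filename} (line no. {first}-{last})",
        # Same value for every chunk of one file, so results can be grouped on it.
        "FileName": filename,
        "Content": "".join(block),
        # Assumed scoring rule: grows with the distance from the start of the file.
        "custom_score": first,
    }


def chunk_log_lines(filename: str, lines: Iterable[str],
                    lines_per_doc: int = 50) -> Iterator[Dict[str, object]]:
    """Yield one document per block of `lines_per_doc` lines."""
    block: List[str] = []
    start = 1  # 1-based line number of the first line in the current block
    for line_no, line in enumerate(lines, start=1):
        block.append(line)
        if len(block) == lines_per_doc:
            yield _make_doc(filename, start, line_no, block)
            block, start = [], line_no + 1
    if block:  # final, possibly shorter, block
        yield _make_doc(filename, start, start + len(block) - 1, block)


if __name__ == "__main__":
    sample = (f"line {i}\n" for i in range(1, 121))
    for doc in chunk_log_lines("log.txt", sample):
        print(doc["id"], doc["custom_score"])
```

Keeping the chunk size as a parameter mirrors the configurable 50-line setting mentioned above; smaller chunks keep each highlighted document small at the cost of more documents in the index.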

I have also found 'hitGrouped.vm' on the net. Since I am reading only 50 lines for each document, the default max chunk size works for me, but it can easily be adjusted depending on the number of lines you are reading per doc.

Now I have done the grouping based on the 'filename' field and show the results from the docs having the highest score; as a result, I am able to show the last matching results from a log file. The query parameters that I am using for the search are:

http://localhost:8080/solr/select?defType=dismax&qf=Content&q=Solr&fl=id,score&bf=sub(1000,caprice_score)&group=true&group.field=FileName
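
For anyone scripting this, here is a small sketch that builds a grouped query URL of the same shape with Python's standard library. The helper name is made up for illustration, the host and field names are simply copied from the URL above, and the boost function (bf) is left out to keep the example short.

```python
from urllib.parse import urlencode


def grouped_query_url(base_url: str, terms: str,
                      group_field: str = "FileName") -> str:
    """Return a Solr select URL that groups hits on `group_field`."""
    params = [
        ("defType", "dismax"),         # dismax query parser, as in the URL above
        ("qf", "Content"),             # field to search in
        ("q", terms),                  # user's search terms
        ("fl", "id,score"),            # fields to return
        ("group", "true"),             # switch on Result Grouping
        ("group.field", group_field),  # one group per original log file
    ]
    # urlencode percent-encodes values, so spaces and commas are safe to pass in.
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(grouped_query_url("http://localhost:8080/solr/select", "error mail"))
```

With group=true and group.field set, all chunks of one file fall into a single group, which is what gives one result per original file instead of one per chunk.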

The results are amazing: I am able to index and search very large log files (a few hundred MB) with very low memory requirements. Highlighting is also working fine.

Thanks & Regards,
Anand





Anand Nigam
RBS Global Banking & Markets
Office: +91 124 492 5506   

-----Original Message-----
From: Peter Spam [mailto:pspam@mac.com] 
Sent: 21 October 2011 23:04
To: solr-user@lucene.apache.org
Subject: Re: Can Solr handle large text files?

Thanks for your note, Anand.  What was the maximum chunk size for you?  Could you post the relevant portions of your configuration file?


Thanks!
Pete

On Oct 21, 2011, at 4:20 AM, Anand.Nigam@rbs.com wrote:

> Hi,
> 
> I was also facing the issue of highlighting large text files. I applied the solution proposed here and it worked, but I am getting the following error:
> 
> 
> Basically, 'hitGrouped.vm' is not found. I am using solr-3.4.0. Where 
> can I get this file from? It is referenced in browse.vm:
> 
> <div class="results">
>  #if($response.response.get('grouped'))
>    #foreach($grouping in $response.response.get('grouped'))
>      #parse("hitGrouped.vm")
>    #end
>  #else
>    #foreach($doc in $response.results)
>      #parse("hit.vm")
>    #end
>  #end
> </div>
> 
> 
> HTTP Status 500 - Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config
> java.lang.RuntimeException: Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config
>     at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:268)
>     at org.apache.solr.response.SolrVelocityResourceLoader.getResourceStream(SolrVelocityResourceLoader.java:42)
>     at org.apache.velocity.Template.process(Template.java:98)
>     at org.apache.velocity.runtime.resource.ResourceManagerImpl.loadResource(ResourceManagerImpl.java:446)
>     at
> 
> Thanks & Regards,
> Anand
> Anand Nigam
> RBS Global Banking & Markets
> Office: +91 124 492 5506   
> 
> 
> -----Original Message-----
> From: karsten-solr@gmx.de [mailto:karsten-solr@gmx.de]
> Sent: 21 October 2011 14:58
> To: solr-user@lucene.apache.org
> Subject: Re: Can Solr handle large text files?
> 
> Hi Peter,
> 
> Highlighting in large text files cannot be fast without dividing the original text into small pieces.
> So take a look at
> http://xtf.cdlib.org/documentation/under-the-hood/#Chunking
> and at
> http://www.lucidimagination.com/blog/2010/09/16/2446/
> 
> This means that you should divide your files and use Result Grouping / Field Collapsing to list only one hit per original document.
> 
> (XTF would also solve your problem "out of the box", but XTF does not use Solr.)
> 
> Best regards
>  Karsten
> 
> -------- Original Message --------
>> Date: Thu, 20 Oct 2011 17:59:04 -0700
>> From: Peter Spam <ps...@mac.com>
>> To: solr-user@lucene.apache.org
>> Subject: Can Solr handle large text files?
> 
>> I have about 20k text files, some very small, but some up to 300MB, 
>> and would like to do text searching with highlighting.
>> 
>> Imagine the text is the contents of your syslog.
>> 
>> I would like to type in some terms, such as "error" and "mail", and 
>> have Solr return the syslog lines with those terms PLUS two lines of context.
>> Pretty much just like Google's highlighting.
>> 
>> 1) Can Solr handle this?  I had extremely long query times when I 
>> tried this with Solr 1.4.1 (yes I was using TermVectors, etc.).  I 
>> tried breaking the files into 1MB pieces, but searching would be 
>> wonky => return the wrong number of documents (ie. if one file had a 
>> term 5 times, and that was the only file that had the term, I want 1 result, not 5 results).
>> 
>> 2) What sort of tokenizer would be best?  Here's what I'm using:
>> 
>>   <field name="body" type="text_pl" indexed="true" stored="true"
>> multiValued="false" termVectors="true" termPositions="true" 
>> termOffsets="true" />
>> 
>>    <fieldType name="text_pl" class="solr.TextField">
>>      <analyzer>
>>        <tokenizer class="solr.StandardTokenizerFactory"/>
>>        <filter class="solr.LowerCaseFilterFactory"/>
>>        <filter class="solr.WordDelimiterFilterFactory"
>> generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0"
>> catenateAll="0" splitOnCaseChange="0"/>
>>      </analyzer>
>>    </fieldType>
>> 
>> 
>> Thanks!
>> Pete
> 
>

Re: Can Solr handle large text files?

Posted by Peter Spam <ps...@mac.com>.

Thanks for your note, Anand.  What was the maximum chunk size for you?  Could you post the relevant portions of your configuration file?


Thanks!
Pete

On Oct 21, 2011, at 4:20 AM, Anand.Nigam@rbs.com wrote:

> Hi,
> 
> I was also facing the issue of highlighting large text files. I applied the solution proposed here and it worked, but I am getting the following error:
> 
> 
> Basically, 'hitGrouped.vm' is not found. I am using solr-3.4.0. Where can I get this file from? It is referenced in browse.vm:
> 
> <div class="results">
>  #if($response.response.get('grouped'))
>    #foreach($grouping in $response.response.get('grouped'))
>      #parse("hitGrouped.vm")
>    #end
>  #else
>    #foreach($doc in $response.results)
>      #parse("hit.vm")
>    #end
>  #end
> </div>
> 
> 
> HTTP Status 500 - Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config
> java.lang.RuntimeException: Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config
>     at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:268)
>     at org.apache.solr.response.SolrVelocityResourceLoader.getResourceStream(SolrVelocityResourceLoader.java:42)
>     at org.apache.velocity.Template.process(Template.java:98)
>     at org.apache.velocity.runtime.resource.ResourceManagerImpl.loadResource(ResourceManagerImpl.java:446)
>     at
> 
> Thanks & Regards,
> Anand
> Anand Nigam
> RBS Global Banking & Markets
> Office: +91 124 492 5506   
> 
> 
> -----Original Message-----
> From: karsten-solr@gmx.de [mailto:karsten-solr@gmx.de] 
> Sent: 21 October 2011 14:58
> To: solr-user@lucene.apache.org
> Subject: Re: Can Solr handle large text files?
> 
> Hi Peter,
> 
> Highlighting in large text files cannot be fast without dividing the original text into small pieces.
> So take a look at
> http://xtf.cdlib.org/documentation/under-the-hood/#Chunking
> and at
> http://www.lucidimagination.com/blog/2010/09/16/2446/
> 
> This means that you should divide your files and use Result Grouping / Field Collapsing to list only one hit per original document.
> 
> (XTF would also solve your problem "out of the box", but XTF does not use Solr.)
> 
> Best regards
>  Karsten
> 
> -------- Original Message --------
>> Date: Thu, 20 Oct 2011 17:59:04 -0700
>> From: Peter Spam <ps...@mac.com>
>> To: solr-user@lucene.apache.org
>> Subject: Can Solr handle large text files?
> 
>> I have about 20k text files, some very small, but some up to 300MB, 
>> and would like to do text searching with highlighting.
>> 
>> Imagine the text is the contents of your syslog.
>> 
>> I would like to type in some terms, such as "error" and "mail", and 
>> have Solr return the syslog lines with those terms PLUS two lines of context.
>> Pretty much just like Google's highlighting.
>> 
>> 1) Can Solr handle this?  I had extremely long query times when I 
>> tried this with Solr 1.4.1 (yes I was using TermVectors, etc.).  I 
>> tried breaking the files into 1MB pieces, but searching would be wonky 
>> => return the wrong number of documents (ie. if one file had a term 5 
>> times, and that was the only file that had the term, I want 1 result, not 5 results).
>> 
>> 2) What sort of tokenizer would be best?  Here's what I'm using:
>> 
>>   <field name="body" type="text_pl" indexed="true" stored="true"
>> multiValued="false" termVectors="true" termPositions="true" 
>> termOffsets="true" />
>> 
>>    <fieldType name="text_pl" class="solr.TextField">
>>      <analyzer>
>>        <tokenizer class="solr.StandardTokenizerFactory"/>
>>        <filter class="solr.LowerCaseFilterFactory"/>
>>        <filter class="solr.WordDelimiterFilterFactory"
>> generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0"
>> catenateAll="0" splitOnCaseChange="0"/>
>>      </analyzer>
>>    </fieldType>
>> 
>> 
>> Thanks!
>> Pete
> 
>

RE: Can Solr handle large text files?

Posted by An...@rbs.com.

Hi,

I was also facing the issue of highlighting large text files. I applied the solution proposed here and it worked, but I am getting the following error:


Basically, 'hitGrouped.vm' is not found. I am using solr-3.4.0. Where can I get this file from? It is referenced in browse.vm:

<div class="results">
  #if($response.response.get('grouped'))
    #foreach($grouping in $response.response.get('grouped'))
      #parse("hitGrouped.vm")
    #end
  #else
    #foreach($doc in $response.results)
      #parse("hit.vm")
    #end
  #end
</div>


HTTP Status 500 - Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config
java.lang.RuntimeException: Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config
    at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:268)
    at org.apache.solr.response.SolrVelocityResourceLoader.getResourceStream(SolrVelocityResourceLoader.java:42)
    at org.apache.velocity.Template.process(Template.java:98)
    at org.apache.velocity.runtime.resource.ResourceManagerImpl.loadResource(ResourceManagerImpl.java:446)
    at

Thanks & Regards,
Anand
Anand Nigam
RBS Global Banking & Markets
Office: +91 124 492 5506   


-----Original Message-----
From: karsten-solr@gmx.de [mailto:karsten-solr@gmx.de] 
Sent: 21 October 2011 14:58
To: solr-user@lucene.apache.org
Subject: Re: Can Solr handle large text files?

Hi Peter,

Highlighting in large text files cannot be fast without dividing the original text into small pieces.
So take a look at
http://xtf.cdlib.org/documentation/under-the-hood/#Chunking
and at
http://www.lucidimagination.com/blog/2010/09/16/2446/

This means that you should divide your files and use Result Grouping / Field Collapsing to list only one hit per original document.

(XTF would also solve your problem "out of the box", but XTF does not use Solr.)

Best regards
  Karsten

-------- Original Message --------
> Date: Thu, 20 Oct 2011 17:59:04 -0700
> From: Peter Spam <ps...@mac.com>
> To: solr-user@lucene.apache.org
> Subject: Can Solr handle large text files?

> I have about 20k text files, some very small, but some up to 300MB, 
> and would like to do text searching with highlighting.
> 
> Imagine the text is the contents of your syslog.
> 
> I would like to type in some terms, such as "error" and "mail", and 
> have Solr return the syslog lines with those terms PLUS two lines of context.
> Pretty much just like Google's highlighting.
> 
> 1) Can Solr handle this?  I had extremely long query times when I 
> tried this with Solr 1.4.1 (yes I was using TermVectors, etc.).  I 
> tried breaking the files into 1MB pieces, but searching would be wonky 
> => return the wrong number of documents (ie. if one file had a term 5 
> times, and that was the only file that had the term, I want 1 result, not 5 results).
> 
> 2) What sort of tokenizer would be best?  Here's what I'm using:
> 
>    <field name="body" type="text_pl" indexed="true" stored="true"
> multiValued="false" termVectors="true" termPositions="true" 
> termOffsets="true" />
> 
>     <fieldType name="text_pl" class="solr.TextField">
>       <analyzer>
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0"
> catenateAll="0" splitOnCaseChange="0"/>
>       </analyzer>
>     </fieldType>
> 
> 
> Thanks!
> Pete

*********************************************************************************** 
The Royal Bank of Scotland plc. Registered in Scotland No 90312. 
Registered Office: 36 St Andrew Square, Edinburgh EH2 2YB. 
Authorised and regulated by the Financial Services Authority. The 
Royal Bank of Scotland N.V. is authorised and regulated by the 
De Nederlandsche Bank and has its seat at Amsterdam, the 
Netherlands, and is registered in the Commercial Register under 
number 33002587. Registered Office: Gustav Mahlerlaan 350, 
Amsterdam, The Netherlands. The Royal Bank of Scotland N.V. and 
The Royal Bank of Scotland plc are authorised to act as agent for each 
other in certain jurisdictions. 
  
This e-mail message is confidential and for use by the addressee only. 
If the message is received by anyone other than the addressee, please 
return the message to the sender by replying to it and then delete the 
message from your computer. Internet e-mails are not necessarily 
secure. The Royal Bank of Scotland plc and The Royal Bank of Scotland 
N.V. including its affiliates ("RBS group") does not accept responsibility 
for changes made to this message after it was sent. For the protection
of RBS group and its clients and customers, and in compliance with
regulatory requirements, the contents of both incoming and outgoing
e-mail communications, which could include proprietary information and
Non-Public Personal Information, may be read by authorised persons
within RBS group other than the intended recipient(s). 

Whilst all reasonable care has been taken to avoid the transmission of 
viruses, it is the responsibility of the recipient to ensure that the onward 
transmission, opening or use of this message and any attachments will 
not adversely affect its systems or data. No responsibility is accepted 
by the RBS group in this regard and the recipient should carry out such 
virus and other checks as it considers appropriate. 

Visit our website at www.rbs.com 

***********************************************************************************

Re: Can Solr handle large text files?

Posted by ka...@gmx.de.

Hi Peter,

Highlighting in large text files cannot be fast without dividing the original text into small pieces.
So take a look at
http://xtf.cdlib.org/documentation/under-the-hood/#Chunking
and at
http://www.lucidimagination.com/blog/2010/09/16/2446/

This means that you should divide your files and use
Result Grouping / Field Collapsing
to list only one hit per original document.

(XTF would also solve your problem "out of the box", but XTF does not use Solr.)
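
To make the "one hit per original document" idea concrete, here is a rough Python sketch that reads a grouped Solr JSON response and keeps only the top chunk of each file. The function name, the 'FileName' grouping field and the sample response are hypothetical; the response layout follows what Result Grouping returns with wt=json, but please check it against your own Solr version.

```python
from typing import Dict, List, Tuple


def top_hit_per_file(response: Dict, group_field: str = "FileName") -> List[Tuple[str, str]]:
    """Return (file name, id of best-scoring chunk) for every group in the response."""
    groups = response["grouped"][group_field]["groups"]
    hits = []
    for group in groups:
        docs = group["doclist"]["docs"]
        if docs:  # the first doc in a group is its top-ranked chunk
            hits.append((group["groupValue"], docs[0]["id"]))
    return hits


if __name__ == "__main__":
    # Hypothetical response: five matching chunks that belong to two files.
    sample = {
        "grouped": {
            "FileName": {
                "matches": 5,
                "groups": [
                    {"groupValue": "syslog.1",
                     "doclist": {"numFound": 4, "start": 0,
                                 "docs": [{"id": "syslog.1 (line no. 151-200)"}]}},
                    {"groupValue": "mail.log",
                     "doclist": {"numFound": 1, "start": 0,
                                 "docs": [{"id": "mail.log (line no. 1-50)"}]}},
                ],
            }
        }
    }
    for filename, chunk_id in top_hit_per_file(sample):
        print(filename, "->", chunk_id)
```

Here five matching chunks collapse into two results, one per file, which is the behaviour Peter asked for in his first question.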

Best regards
  Karsten

-------- Original Message --------
> Date: Thu, 20 Oct 2011 17:59:04 -0700
> From: Peter Spam <ps...@mac.com>
> To: solr-user@lucene.apache.org
> Subject: Can Solr handle large text files?

> I have about 20k text files, some very small, but some up to 300MB, and
> would like to do text searching with highlighting.
> 
> Imagine the text is the contents of your syslog.
> 
> I would like to type in some terms, such as "error" and "mail", and have
> Solr return the syslog lines with those terms PLUS two lines of context. 
> Pretty much just like Google's highlighting.
> 
> 1) Can Solr handle this?  I had extremely long query times when I tried
> this with Solr 1.4.1 (yes I was using TermVectors, etc.).  I tried breaking
> the files into 1MB pieces, but searching would be wonky => return the wrong
> number of documents (ie. if one file had a term 5 times, and that was the
> only file that had the term, I want 1 result, not 5 results).  
> 
> 2) What sort of tokenizer would be best?  Here's what I'm using:
> 
>    <field name="body" type="text_pl" indexed="true" stored="true"
> multiValued="false" termVectors="true" termPositions="true" termOffsets="true" />
> 
>     <fieldType name="text_pl" class="solr.TextField">
>       <analyzer>
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0"
> catenateAll="0" splitOnCaseChange="0"/>
>       </analyzer>
>     </fieldType>
> 
> 
> Thanks!
> Pete