Posted to solr-user@lucene.apache.org by Peter Spam <ps...@mac.com> on 2010/07/08 02:33:09 UTC

Using hl.regex.pattern to print complete lines

Hi,

I have a text file broken apart by carriage returns, and I'd like to only return entire lines.  So, I'm trying to use this:

	&hl.fragmenter=regex
	&hl.regex.pattern=^.*$

... but I still get fragments, even if I crank up the hl.regex.slop to 3 or so.  I also tried a pattern of "\n.*\n" which seems to work better, but still isn't right.  Any ideas?


-Pete

Re: Using hl.regex.pattern to print complete lines

Posted by Lance Norskog <go...@gmail.com>.

Java's regex dialect differs in places from other regex engines, so
writing a small test program and experimenting is the reliable way to
check. If the expression turns out to be exactly what you want and
highlighting still does not do what you expect, you might have found
a bug in highlighting.
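
A test program of that kind could look like the sketch below. It runs a
pattern through plain java.util.regex with Matcher.find() and collects
what each match covers, using the "\n.*\n" pattern from the original
question. The class and method names are made up for illustration, and
the sketch says nothing about how Solr's regex fragmenter applies the
pattern internally; it only shows what the Java engine does with it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical test harness; not part of Solr or Lucene.
final class RegexProbe {

    // Returns every successive match of `regex` in `text`, in order.
    static List<String> matches(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree\nfour\n";
        // Each match consumes the newline on both sides, so the next
        // search starts after the closing newline and the neighbouring
        // lines ("one" and "three") are never matched.
        for (String s : matches("\\n.*\\n", text)) {
            System.out.println("[" + s.replace("\n", "\\n") + "]");
        }
    }
}
```

With this input only "two" and "four" are matched, which may be part of
why "\n.*\n" looked better than "^.*$" but still was not right.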

Lucene/Solr highlighting has always been a difficult area, and might
not do everything right.

On Wed, Jul 21, 2010 at 4:20 PM, Peter Spam <ps...@mac.com> wrote:
> Still not working ... any ideas?
>
>
> -Pete
>
> On Jul 14, 2010, at 11:56 AM, Peter Spam wrote:
>
>> Any other thoughts, Chris?  I've been messing with this a bit, and can't seem to get (?m)^.*$ to do what I want.
>>
>> 1) I don't care how many characters it returns, I'd like entire lines all the time
>> 2) I just want it to always return 3 lines: the line before, the actual line, and the line after.
>> 3) This should be like "grep -C1"
>>
>> Thanks for your time!
>>
>>
>> -Pete
>>
>> On Jul 9, 2010, at 12:08 AM, Peter Spam wrote:
>>
>>> Ah, this makes sense.  I've changed my regex to "(?m)^.*$", and it works better, but I still get fragments before and after some returns.
>>> Thanks for the hint!
>>>
>>>
>>> -Pete
>>>
>>> On Jul 8, 2010, at 6:27 PM, Chris Hostetter wrote:
>>>
>>>>
>>>> : If you can use the latest branch_3x or trunk, hl.fragListBuilder=single
>>>> : is available that is for getting entire field contents with search terms
>>>> : highlighted. To use it, set hl.useFastVectorHighlighter to true.
>>>>
>>>> He doesn't want the entire field -- his stored field values contain
>>>> multi-line strings (using newline characters) and he wants to make
>>>> fragments per "line" (ie: bounded by newline characters, or the start/end
>>>> of the entire field value)
>>>>
>>>> Peter: i haven't looked at the code, but i expect that the problem is that
>>>> the java regex engine isn't being used in a way that makes ^ and $ match
>>>> any line boundary -- they are probably only matching the start/end of the
>>>> field (and . is probably only matching non-newline characters)
>>>>
>>>> java regexes support embedded flags (ie: "(?xyz)your regex") so you might
>>>> try that (i don't remember what the correct modifier flag is for the
>>>> multiline mode off the top of my head)
>>>>
>>>> -Hoss
>>>>
>>>
>>
>
>



-- 
Lance Norskog
goksron@gmail.com

Re: Using hl.regex.pattern to print complete lines

Posted by Peter Spam <ps...@mac.com>.

Still not working ... any ideas?


-Pete

On Jul 14, 2010, at 11:56 AM, Peter Spam wrote:

> Any other thoughts, Chris?  I've been messing with this a bit, and can't seem to get (?m)^.*$ to do what I want.
> 
> 1) I don't care how many characters it returns, I'd like entire lines all the time
> 2) I just want it to always return 3 lines: the line before, the actual line, and the line after.
> 3) This should be like "grep -C1"
> 
> Thanks for your time!
> 
> 
> -Pete
> 
> On Jul 9, 2010, at 12:08 AM, Peter Spam wrote:
> 
>> Ah, this makes sense.  I've changed my regex to "(?m)^.*$", and it works better, but I still get fragments before and after some returns.
>> Thanks for the hint!
>> 
>> 
>> -Pete
>> 
>> On Jul 8, 2010, at 6:27 PM, Chris Hostetter wrote:
>> 
>>> 
>>> : If you can use the latest branch_3x or trunk, hl.fragListBuilder=single
>>> : is available that is for getting entire field contents with search terms
>>> : highlighted. To use it, set hl.useFastVectorHighlighter to true.
>>> 
>>> He doesn't want the entire field -- his stored field values contain 
>>> multi-line strings (using newline characters) and he wants to make 
>>> fragments per "line" (ie: bounded by newline characters, or the start/end 
>>> of the entire field value)
>>> 
>>> Peter: i haven't looked at the code, but i expect that the problem is that 
>>> the java regex engine isn't being used in a way that makes ^ and $ match 
>>> any line boundary -- they are probably only matching the start/end of the 
>>> field (and . is probably only matching non-newline characters)
>>> 
>>> java regexes support embedded flags (ie: "(?xyz)your regex") so you might 
>>> try that (i don't remember what the correct modifier flag is for the 
>>> multiline mode off the top of my head)
>>> 
>>> -Hoss
>>> 
>> 
>

Re: Using hl.regex.pattern to print complete lines

Posted by Peter Spam <ps...@mac.com>.

Any other thoughts, Chris?  I've been messing with this a bit, and can't seem to get (?m)^.*$ to do what I want.

1) I don't care how many characters it returns, I'd like entire lines all the time
2) I just want it to always return 3 lines: the line before, the actual line, and the line after.
3) This should be like "grep -C1"
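
For reference, the three requirements above can be sketched outside Solr
as client-side post-processing of the stored field value. This is only
an illustration of the "grep -C1" behaviour, not a highlighter option:
the class and method names are invented, and it uses a plain substring
test rather than Solr's analyzed query match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical client-side helper; not a Solr highlighter feature.
final class LineContext {

    // For each line containing `term`, returns that line together with
    // the line before and the line after it (where they exist), joined
    // by "\n". Unlike grep -C1, overlapping contexts are not merged.
    static List<String> withNeighbours(String fieldValue, String term) {
        // Accept \r\n, \n, or bare \r as line separators.
        String[] lines = fieldValue.split("\\r?\\n|\\r", -1);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < lines.length; i++) {
            if (!lines[i].contains(term)) {
                continue;
            }
            int from = Math.max(0, i - 1);
            int to = Math.min(lines.length - 1, i + 1);
            out.add(String.join("\n", Arrays.copyOfRange(lines, from, to + 1)));
        }
        return out;
    }

    public static void main(String[] args) {
        String stored = "start job\nerror: disk full\nretrying\ndone";
        for (String block : withNeighbours(stored, "error")) {
            System.out.println(block);
            System.out.println("--");
        }
    }
}
```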

Thanks for your time!


-Pete

On Jul 9, 2010, at 12:08 AM, Peter Spam wrote:

> Ah, this makes sense.  I've changed my regex to "(?m)^.*$", and it works better, but I still get fragments before and after some returns.
> Thanks for the hint!
> 
> 
> -Pete
> 
> On Jul 8, 2010, at 6:27 PM, Chris Hostetter wrote:
> 
>> 
>> : If you can use the latest branch_3x or trunk, hl.fragListBuilder=single
>> : is available that is for getting entire field contents with search terms
>> : highlighted. To use it, set hl.useFastVectorHighlighter to true.
>> 
>> He doesn't want the entire field -- his stored field values contain 
>> multi-line strings (using newline characters) and he wants to make 
>> fragments per "line" (ie: bounded by newline characters, or the start/end 
>> of the entire field value)
>> 
>> Peter: i haven't looked at the code, but i expect that the problem is that 
>> the java regex engine isn't being used in a way that makes ^ and $ match 
>> any line boundary -- they are probably only matching the start/end of the 
>> field (and . is probably only matching non-newline characters)
>> 
>> java regexes support embedded flags (ie: "(?xyz)your regex") so you might 
>> try that (i don't remember what the correct modifier flag is for the 
>> multiline mode off the top of my head)
>> 
>> -Hoss
>> 
>

Re: Using hl.regex.pattern to print complete lines

Posted by Peter Spam <ps...@mac.com>.

Ah, this makes sense.  I've changed my regex to "(?m)^.*$", and it works better, but I still get fragments before and after some returns.
Thanks for the hint!


-Pete

On Jul 8, 2010, at 6:27 PM, Chris Hostetter wrote:

> 
> : If you can use the latest branch_3x or trunk, hl.fragListBuilder=single
> : is available that is for getting entire field contents with search terms
> : highlighted. To use it, set hl.useFastVectorHighlighter to true.
> 
> He doesn't want the entire field -- his stored field values contain 
> multi-line strings (using newline characters) and he wants to make 
> fragments per "line" (ie: bounded by newline characters, or the start/end 
> of the entire field value)
> 
> Peter: i haven't looked at the code, but i expect that the problem is that 
> the java regex engine isn't being used in a way that makes ^ and $ match 
> any line boundary -- they are probably only matching the start/end of the 
> field (and . is probably only matching non-newline characters)
> 
> java regexes support embedded flags (ie: "(?xyz)your regex") so you might 
> try that (i don't remember what the correct modifier flag is for the 
> multiline mode off the top of my head)
> 
> -Hoss
>

Re: Using hl.regex.pattern to print complete lines

Posted by Chris Hostetter <ho...@fucit.org>.

: If you can use the latest branch_3x or trunk, hl.fragListBuilder=single
: is available that is for getting entire field contents with search terms
: highlighted. To use it, set hl.useFastVectorHighlighter to true.

He doesn't want the entire field -- his stored field values contain 
multi-line strings (using newline characters) and he wants to make 
fragments per "line" (ie: bounded by newline characters, or the start/end 
of the entire field value)

Peter: i haven't looked at the code, but i expect that the problem is that 
the java regex engine isn't being used in a way that makes ^ and $ match 
any line boundary -- they are probably only matching the start/end of the 
field (and . is probably only matching non-newline characters)

java regexes support embedded flags (ie: "(?xyz)your regex") so you might 
try that (i don't remember what the correct modifier flag is for the 
multiline mode off the top of my head)
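
A small sketch of the flag behaviour described above, using plain
java.util.regex outside Solr. In Java the embedded multiline flag is
(?m) and the dotall flag is (?s). The class and method names are
invented for the example, and whether the highlighter compiles the
pattern the same way is an assumption that would need checking in the
code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Solr or Lucene.
final class LineFlagsDemo {

    // Collects every successive match of `regex` in `text`.
    static List<String> findAll(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String field = "alpha\nbeta\ngamma";

        // No flags: ^ and $ anchor to the whole input and . stops at
        // newlines, so a multi-line value yields no match at all.
        System.out.println(findAll("^.*$", field));

        // (?m): ^ and $ also match at line boundaries, one match per line.
        System.out.println(findAll("(?m)^.*$", field));

        // (?s): . also matches newlines, so the whole value is one match.
        System.out.println(findAll("(?s)^.*$", field));
    }
}
```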

-Hoss

Re: Using hl.regex.pattern to print complete lines

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.

(10/07/09 9:30), Peter Spam wrote:
> Thanks for the note, Koji.  However, hl.fragsize=0 seems to return the entire document, rather than just one single line.
>
> Here's what I tried (what I previously had was commented out):
>
> regexv = "^.*$"
> thequery = '/solr/select?facet=true&facet.limit=10&fl=id,score,filename&tv=true&timeAllowed=3000&facet.field=filename&qt=tvrh&wt=ruby' + (p['fq'].empty? ? '' : ('&fq='+p['fq'].to_s) ) + '&q=' + CGI::escape(p['q'].to_s) + '&rows=' + p['rows'].to_s + "&hl=true&hl.snippets=1&hl.fragsize=0" #&hl.regex.slop=.8&hl.fragsize=200&hl.fragmenter=regex&hl.regex.pattern=" + CGI::escape(regexv)
>
> Thanks for your help.
>
>
> -Peter
>
>    
Peter,

Are you sure you are using GapFragmenter when you set hl.fragsize to 0?

I've never tried regex fragmenter...

If you can use the latest branch_3x or trunk, hl.fragListBuilder=single
is available; it returns the entire field contents with search terms
highlighted. To use it, set hl.useFastVectorHighlighter to true.
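
The two parameter combinations mentioned in this message can be spelled
out as request URLs. The sketch below only assembles the query strings;
the /solr/select path is the one from Peter's query, while the class
name, the method names, and the field name "body" are placeholders. As
far as I know, the fast vector highlighter also needs term vectors with
positions and offsets on the highlighted field, so treat that as a
prerequisite to verify. Both variants are aimed at whole field values,
not single lines.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for assembling the requests; not a Solr API.
final class HighlightRequests {

    // Joins parameters into a query string, URL-encoding the values.
    static String toQuery(Map<String, String> params) {
        StringBuilder sb = new StringBuilder("/solr/select?");
        boolean first = true;
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (!first) {
                sb.append('&');
            }
            first = false;
            try {
                sb.append(e.getKey()).append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            } catch (UnsupportedEncodingException ex) {
                throw new IllegalStateException(ex); // UTF-8 is always present
            }
        }
        return sb.toString();
    }

    // GapFragmenter with hl.fragsize=0 (behaves like NullFragmenter).
    static String wholeFieldViaGap(String q, String field) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("q", q);
        p.put("hl", "true");
        p.put("hl.fl", field);
        p.put("hl.fragmenter", "gap");
        p.put("hl.fragsize", "0");
        return toQuery(p);
    }

    // Fast vector highlighter with the "single" frag list builder.
    static String wholeFieldViaFvh(String q, String field) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("q", q);
        p.put("hl", "true");
        p.put("hl.fl", field);
        p.put("hl.useFastVectorHighlighter", "true");
        p.put("hl.fragListBuilder", "single");
        return toQuery(p);
    }

    public static void main(String[] args) {
        System.out.println(wholeFieldViaGap("disk full", "body"));
        System.out.println(wholeFieldViaFvh("disk full", "body"));
    }
}
```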

Koji

-- 
http://www.rondhuit.com/en/

Re: Using hl.regex.pattern to print complete lines

Posted by Peter Spam <ps...@mac.com>.

Thanks for the note, Koji.  However, hl.fragsize=0 seems to return the entire document, rather than just one single line.

Here's what I tried (what I previously had was commented out):

regexv = "^.*$"
thequery = '/solr/select?facet=true&facet.limit=10&fl=id,score,filename&tv=true&timeAllowed=3000&facet.field=filename&qt=tvrh&wt=ruby' + (p['fq'].empty? ? '' : ('&fq='+p['fq'].to_s) ) + '&q=' + CGI::escape(p['q'].to_s) + '&rows=' + p['rows'].to_s + "&hl=true&hl.snippets=1&hl.fragsize=0" #&hl.regex.slop=.8&hl.fragsize=200&hl.fragmenter=regex&hl.regex.pattern=" + CGI::escape(regexv)

Thanks for your help.


-Peter

On Jul 8, 2010, at 3:47 PM, Koji Sekiguchi wrote:

> (10/07/09 2:44), Peter Spam wrote:
>> To clarify, I never want a snippet, I always want a whole line returned.  Is this possible?  Thanks!
>> 
>> 
>> -Pete
>> 
>>   
> Hello Pete,
> 
> Use NullFragmenter. It can be used via GapFragmenter with
> hl.fragsize=0.
> 
> Koji
> 
> -- 
> http://www.rondhuit.com/en/
>

Re: Using hl.regex.pattern to print complete lines

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.

(10/07/09 2:44), Peter Spam wrote:
> To clarify, I never want a snippet, I always want a whole line returned.  Is this possible?  Thanks!
>
>
> -Pete
>
>    
Hello Pete,

Use NullFragmenter. It can be used via GapFragmenter with
hl.fragsize=0.

Koji

-- 
http://www.rondhuit.com/en/

Re: Using hl.regex.pattern to print complete lines

Posted by Peter Spam <ps...@mac.com>.

To clarify, I never want a snippet, I always want a whole line returned.  Is this possible?  Thanks!


-Pete

On Jul 7, 2010, at 5:33 PM, Peter Spam wrote:

> Hi,
> 
> I have a text file broken apart by carriage returns, and I'd like to only return entire lines.  So, I'm trying to use this:
> 
> 	&hl.fragmenter=regex
> 	&hl.regex.pattern=^.*$
> 
> ... but I still get fragments, even if I crank up the hl.regex.slop to 3 or so.  I also tried a pattern of "\n.*\n" which seems to work better, but still isn't right.  Any ideas?
> 
> 
> -Pete