Posted to solr-user@lucene.apache.org by Jochen Just <jo...@avono.de> on 2013/03/18 14:17:48 UTC

Incorrect snippets using FastVectorHighlighter

Hi list,

I have defined the following field type in my schema.xml in order to support in-word (substring) search.

   <fieldType name="string_parts_back" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="1000"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
         <tokenizer class="solr.KeywordTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
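
To illustrate what this field type is meant to do, here is a minimal Python sketch. It is not Solr code: the helper names index_grams and query_term are made up for illustration, and the sketch assumes the index analyzer effectively stores every lowercased substring of the field value while the query analyzer keeps the whole query as a single lowercased term. Under that assumption, a query matches when it equals one of the stored grams.

```python
def index_grams(value, min_gram=1, max_gram=1000):
    """Approximate the index analyzer: all substrings, lowercased, de-duplicated."""
    grams = set()
    n = len(value)
    for start in range(n):
        # Gram sizes are capped by max_gram and by the remaining characters.
        for size in range(min_gram, min(max_gram, n - start) + 1):
            grams.add(value[start:start + size].lower())
    return grams


def query_term(query):
    """Approximate the query analyzer: the whole query as one lowercased token."""
    return query.lower()


if __name__ == "__main__":
    grams = index_grams("Superkalifragilistischexpialligetisch")
    # The in-word query matches because it is one of the indexed grams.
    print(query_term("uperkalifragilistische") in grams)
```

The sketch only models matching; it says nothing about the offsets the highlighter relies on.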

Searching itself works as expected, but highlighting is giving me headaches.
At first I did not use the FastVectorHighlighter, and highlighting did not
work at all for fields of this type. Since switching to the FastVectorHighlighter,
highlighting works most of the time, but sometimes it doesn't.

Given a document containing the word 'Superkalifragilistischexpialligetisch',
when I search for 'uperkalifragilistische' I would expect the result
'S<em>uperkalifragilistische</em>xpialligetisch', but I get
'S<em>uperkalifragilist</em>ischexpialligetisch'. So 'ische' is missing from
the highlighted part.

Sadly, I have not been able to create a simple setup that reproduces this; it only happens in our in-house live system.
However, if I remove some fields from the qf attribute of the edismax parser in solrconfig.xml, the problem goes away.
Some of the removed fields have the fieldType string_parts_back.

Does anyone have a clue what's going on?

Thanks in advance,
Jochen


-- 
Jochen Just                   Fon:   (++49) 711/28 07 57-193
avono AG                      Mobil: (++49) 172/73 85 387
Breite Straße 2               Mail:  jochen.just@avono.de
70173 Stuttgart               WWW:   http://www.avono.de

Re: Incorrect snippets using FastVectorHighlighter

Posted by Jochen Just <jo...@avono.de>.
On 19.03.2013 02:49, Koji Sekiguchi wrote:
>> So just to be clear: there is no way to highlight results if I
>> use a variable gram size. Neither the original highlighter nor
>> FVH does the job. Or am I missing something?
> 
> I don't know whether the latest original highlighter still has
> such a restriction,
Just in case somebody is interested: Solr 4.2 can't reliably highlight
results when a variable gram size is used, neither with the original
Highlighter nor with the FastVectorHighlighter.
> but when FVH was introduced in 2.9, the original highlighter
> couldn't deal with an n-gram field if n > 1, because the k-th term's
> end offset can be larger than the (k+1)-th term's start offset.
> 
>> Btw, does any documentation exist on how the FVH works?
> 
> See package summary:
> 
> http://lucene.apache.org/core/4_2_0/highlighter/org/apache/lucene/search/vectorhighlight/package-summary.html
> 
> koji



Re: Incorrect snippets using FastVectorHighlighter

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
> So just to be clear:
> There is no way to highlight results if I use a variable gram size.
> Neither the original highlighter nor FVH does the job.
> Or am I missing something?

I don't know whether the latest original highlighter still has such a
restriction, but when FVH was introduced in 2.9, the original highlighter
couldn't deal with an n-gram field if n > 1, because the k-th term's end
offset can be larger than the (k+1)-th term's start offset.
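
The offset overlap described above can be seen in a small Python sketch. This is not Lucene code: grams_with_offsets and count_overlaps are hypothetical helpers, and the emission order used here (by start offset, then by gram size) is an assumption that may differ from what NGramFilterFactory actually does in a given version.

```python
def grams_with_offsets(text, min_gram, max_gram):
    """Return (gram, start, end) triples, ordered by start offset, then size."""
    out = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end > len(text):
                break  # no room left for grams of this size or larger
            out.append((text[start:end], start, end))
    return out


def count_overlaps(grams):
    """Count neighbouring pairs where term k ends after term k+1 starts."""
    return sum(
        1
        for (_, _, end_k), (_, start_next, _) in zip(grams, grams[1:])
        if end_k > start_next
    )


if __name__ == "__main__":
    for gram, start, end in grams_with_offsets("solr", 1, 3):
        print(gram, start, end)
    print("overlaps:", count_overlaps(grams_with_offsets("solr", 1, 3)))
```

With a gram size of exactly 1 no neighbouring terms overlap in this sketch; as soon as n > 1, consecutive grams share characters, which is the situation the original highlighter could not handle.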

> Btw, does any documentation exist on how the FVH works?

See package summary:

http://lucene.apache.org/core/4_2_0/highlighter/org/apache/lucene/search/vectorhighlight/package-summary.html

koji
-- 
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html

Re: Incorrect snippets using FastVectorHighlighter

Posted by Jochen Just <jo...@avono.de>.
So just to be clear:
There is no way to highlight results if I use a variable gram size.
Neither the original highlighter nor FVH does the job.
Or am I missing something?
Btw, does any documentation exist on how the FVH works?

Jochen
On 18.03.2013 15:00, Koji Sekiguchi wrote:
> Hi Jochen,
> 
> There is a restriction in FVH: it cannot deal with variable gram size. That is, minGramSize must equal maxGramSize in your NGramFilterFactory setting.
> 
> koji



Re: Incorrect snippets using FastVectorHighlighter

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
Hi Jochen,

There is a restriction in FVH: it cannot deal with variable gram size.
That is, minGramSize must equal maxGramSize in your NGramFilterFactory setting.

koji