Posted to solr-user@lucene.apache.org by Fatima Issawi <is...@qu.edu.qa> on 2014/02/06 14:23:34 UTC

Highlight results in Arabic are backward

Hello,

I am getting highlight results in Arabic, but the order of the words is backwards. Querying on that field gives me the correct result, though. Is there a setting I’m missing?

An extract from an example query from my Solr Console is below:

{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "indent": "true",
      "q": "author:\"فيشر\"",
      "_": "1391692704242",
      "hl.simple.pre": "<em>",
      "hl.simple.post": "</em>",
      "hl.fl": "author",
      "wt": "json",
      "hl": "true"
    }
  },
  "response": {
    "numFound": 4,
    "start": 0,
    "docs": [
      {
        "pagenumber": 1,
        "id": "1",
        "author": "د. فيشر السعر",
        "author_s": "د. فيشر السعر",
        "collector": "فاطمة عيساوي",
  },
  "highlighting": {
    "1": {
      "author": [
        "د. <em>فيشر</em> السعر"
      ]

RE: Highlight results in Arabic are backward

Posted by Fatima Issawi <is...@qu.edu.qa>.
Thank you. I will look into that.

> -----Original Message-----
> From: Alexandre Rafalovitch [mailto:arafalov@gmail.com]
> Sent: Sunday, February 09, 2014 9:35 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Highlight results in Arabic are backward
> 
> You will most probably put your English and Arabic content into different
> fields, mostly because you will want to apply different field type
> definitions (tokenizers, etc.) to your English and Arabic text.
> 
> Also, if you are doing some deliberate design now, I would search the web for
> articles on multilingual approaches to Solr. There are some deeper issues.
> Some good questions are covered here:
> http://info.basistech.com/blog/bid/171842/Indexing-Strategies-for-Multilingual-Search-with-Solr-and-Rosette
> (even though it discusses a commercial tool). There is also a series of 12
> blog posts on dealing with Solr for CJK in libraries.
> Your issues will be different, but there will be overlap.

Re: Highlight results in Arabic are backward

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
You will most probably put your English and Arabic content into
different fields, mostly because you will want to apply different
field type definitions (tokenizers, etc.) to your English and Arabic
text.

Also, if you are doing some deliberate design now, I would search the
web for articles on multilingual approaches to Solr. There are some
deeper issues. Some good questions are covered here:
http://info.basistech.com/blog/bid/171842/Indexing-Strategies-for-Multilingual-Search-with-Solr-and-Rosette
(even though it discusses a commercial tool). There is also a
series of 12 blog posts on dealing with Solr for CJK in libraries.
Your issues will be different, but there will be overlap.
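
The separate-fields idea can be sketched on the indexing side. The Python below is only an illustration and does not come from this thread: the field names `author_ar`, `author_en`, and `direction_s` are hypothetical, and treating any predominantly right-to-left string as Arabic is a crude assumption that would misfile Hebrew or Persian text. It counts strong bidirectional character classes with the standard `unicodedata` module and picks a field accordingly.

```python
import unicodedata


def dominant_direction(text):
    """Return "rtl" if strong right-to-left characters outnumber strong left-to-right ones."""
    rtl = sum(1 for ch in text if unicodedata.bidirectional(ch) in ("R", "AL"))
    ltr = sum(1 for ch in text if unicodedata.bidirectional(ch) == "L")
    return "rtl" if rtl > ltr else "ltr"


def to_solr_doc(doc_id, author):
    """Build a document dict with a per-language author field (hypothetical field names)."""
    direction = dominant_direction(author)
    # Assumption: right-to-left content is Arabic; everything else is English.
    suffix = "ar" if direction == "rtl" else "en"
    return {"id": doc_id, f"author_{suffix}": author, "direction_s": direction}


# Escapes are used so that no editor or mail client can reorder the text.
print(to_solr_doc("1", "\u062f. \u0641\u064a\u0634\u0631 \u0627\u0644\u0633\u0639\u0631"))
print(to_solr_doc("2", "Fisher"))
```

Storing the direction next to the text is one possible way for a client to decide which `dir` attribute to emit when rendering results; a real design would more likely rely on a known language per document than on this heuristic.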

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Sun, Feb 9, 2014 at 12:56 PM, Fatima Issawi <is...@qu.edu.qa> wrote:
> Thank you both for responding.
>
> Is there a way to tell Solr to add those attributes to the field when it returns results (e.g. language is Arabic or English, or direction is LTR or RTL)?
>
> Right now I only have Arabic content indexed, but we plan to add English in the near future. I don't want to have to re-do everything later if there is a better way of designing this now.
>
> Regards,
> Fatima

RE: Highlight results in Arabic are backward

Posted by Fatima Issawi <is...@qu.edu.qa>.
Thank you both for responding. 

Is there a way to tell Solr to add those attributes to the field when it returns results (e.g. language is Arabic or English, or direction is LTR or RTL)?

Right now I only have Arabic content indexed, but we plan to add English in the near future. I don't want to have to re-do everything later if there is a better way of designing this now.

Regards,
Fatima

> -----Original Message-----
> From: Alexandre Rafalovitch [mailto:arafalov@gmail.com]
> Sent: Friday, February 07, 2014 3:48 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Highlight results in Arabic are backward
> 
> Arabic is complex. Basically, don't trust anything you see until you put that
> content on the screen with the surrounding tag marked with attribute
> dir='rtl' (e.g. <p dir='rtl'>arabic test</p>).

Re: Highlight results in Arabic are backward

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Arabic is complex. Basically, don't trust anything you see until you
put that content on the screen with the surrounding tag marked with
attribute dir='rtl' (e.g. <p dir='rtl'>arabic test</p>).
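
A minimal sketch of that check, assuming Python and the `<em>`/`</em>` highlight markers used in the query in this thread: it escapes the text, restores only the highlight tags, and wraps the result in a paragraph marked `dir="rtl"`. The function name and the `lang="ar"` attribute are illustrative choices, not part of Solr.

```python
import html


def rtl_paragraph(text, pre="<em>", post="</em>"):
    """Wrap a field value or highlight snippet in a right-to-left paragraph."""
    # Escape everything first so stray markup in the data cannot leak through.
    escaped = html.escape(text, quote=False)
    # Restore only the highlight markers that were deliberately inserted.
    escaped = escaped.replace(html.escape(pre, quote=False), pre)
    escaped = escaped.replace(html.escape(post, quote=False), post)
    return f'<p dir="rtl" lang="ar">{escaped}</p>'


# Escapes are used so that no editor or mail client can reorder the text.
snippet = "\u062f. <em>\u0641\u064a\u0634\u0631</em> \u0627\u0644\u0633\u0639\u0631"
print(rtl_paragraph(snippet))
```

Opening the printed HTML in a browser shows the words in reading order with the highlighted word emphasized, which is a more trustworthy view than a JSON console or a mail client.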

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Thu, Feb 6, 2014 at 10:12 PM, Steve Rowe <sa...@gmail.com> wrote:
> Hi Fatima,
>
> I don’t think there’s an actual problem; it just looks like one because the program you’re using to look at the JSON makes a different choice for laying out the highlighting results than it does for the field values.
>
> In fact, all the bytes are the same, and in the same order for both the “author” field text and the highlighting text, though some space characters are ASCII space (U+0020) in one and non-breaking space (U+00A0) in the other.
>
> By the way, I see the same thing as you in my email client (OS X Mail.app).  I assume there is a rule shared by our programs about complex layout like this, where right-to-left text is mixed with left-to-right text, likely based on the proportion of each, that triggers a left-to-right word sequencing instead of the expected right-to-left word sequencing.
>
> Anyway, I pulled out the author field and highlighting texts into an HTML document and viewed it in my browser (Safari), and both are laid out the same (with the exception of the emphasis given to the highlighted word):
>
> ——
> <html>
> <body>
> <p>"author": "د. فيشر السعر",</p>
> <p>"highlighting": { "1": { "author": [ "د. <em>فيشر</em> السعر" ] } }</p>
> </body>
> </html>
> ——
>
> Steve

Re: Highlight results in Arabic are backward

Posted by Steve Rowe <sa...@gmail.com>.
Hi Fatima,

I don’t think there’s an actual problem; it just looks like one because the program you’re using to look at the JSON makes a different choice for laying out the highlighting results than it does for the field values.

In fact, all the bytes are the same, and in the same order for both the “author” field text and the highlighting text, though some space characters are ASCII space (U+0020) in one and non-breaking space (U+00A0) in the other.
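
One way to check this independently of any renderer is to compare code points rather than what is drawn on screen. The sketch below is an illustration in Python under stated assumptions: the strings are written with escapes so no display layer can reorder them, and the positions of the non-breaking spaces in `snippet` are invented for the example.

```python
import unicodedata


def strip_highlight(text, pre="<em>", post="</em>"):
    """Remove the highlight markers so only the original characters remain."""
    return text.replace(pre, "").replace(post, "")


def normalize_spaces(text):
    """Treat NO-BREAK SPACE (U+00A0) as an ordinary space (U+0020)."""
    return text.replace("\u00a0", " ")


def describe(text):
    """List each character as a code point plus its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}" for ch in text]


def same_logical_order(field_value, snippet):
    """True if both strings hold the same characters in the same stored order."""
    return normalize_spaces(field_value) == normalize_spaces(strip_highlight(snippet))


field_value = "\u062f. \u0641\u064a\u0634\u0631 \u0627\u0644\u0633\u0639\u0631"
# Assumption: non-breaking spaces placed here purely for illustration.
snippet = "\u062f.\u00a0<em>\u0641\u064a\u0634\u0631</em>\u00a0\u0627\u0644\u0633\u0639\u0631"

print(same_logical_order(field_value, snippet))
for line in describe(strip_highlight(snippet)):
    print(line)
```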

By the way, I see the same thing as you in my email client (OS X Mail.app).  I assume there is a rule shared by our programs about complex layout like this, where right-to-left text is mixed with left-to-right text, likely based on the proportion of each, that triggers a left-to-right word sequencing instead of the expected right-to-left word sequencing.
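
For what it is worth, the Unicode Bidirectional Algorithm chooses a default paragraph direction from the first strong character rather than from proportions, and it arranges text in directional runs. The following Python sketch is heavily simplified (it ignores neutral-character resolution details, embedding levels, and explicit controls, so it is not an implementation of the algorithm), but it shows why inserting `<em>` tags can change the layout: the plain field value forms a single right-to-left run, while the highlighted snippet is broken into alternating runs.

```python
import unicodedata
from itertools import groupby


def strong_direction(ch):
    """Return "R" or "L" for strongly directional characters, else None."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in ("R", "AL"):
        return "R"
    if bidi == "L":
        return "L"
    return None


def first_strong_direction(text):
    """Direction of the first strong character; defaults to "L" if there is none."""
    return next((d for d in map(strong_direction, text) if d is not None), "L")


def directional_runs(text):
    """Collapse the strong characters into a sequence of alternating runs."""
    strong = [d for d in map(strong_direction, text) if d is not None]
    return [d for d, _ in groupby(strong)]


field_value = "\u062f. \u0641\u064a\u0634\u0631 \u0627\u0644\u0633\u0639\u0631"
snippet = "\u062f. <em>\u0641\u064a\u0634\u0631</em> \u0627\u0644\u0633\u0639\u0631"

print(directional_runs(field_value))
print(directional_runs(snippet))
print(first_strong_direction('"author": "' + field_value + '"'))
```

In a left-to-right paragraph the separate runs are placed left to right, so the Arabic words of the snippet appear in the opposite order from the single-run field value even though the stored characters are in the same order.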

Anyway, I pulled out the author field and highlighting texts into an HTML document and viewed it in my browser (Safari), and both are laid out the same (with the exception of the emphasis given to the highlighted word):

——
<html>
<body>
<p>"author": "د. فيشر السعر",</p>
<p>"highlighting": { "1": { "author": [ "د. <em>فيشر</em> السعر" ] } }</p>
</body>
</html>
——

Steve
