Posted to dev@jackrabbit.apache.org by "Julian Reschke (JIRA)" <ji...@apache.org> on 2011/09/16 16:22:09 UTC

[jira] [Created] (JCR-3075) incorrect HTML excerpt generation for queries on japanese text content

incorrect HTML excerpt generation for queries on japanese text content 
-----------------------------------------------------------------------

                 Key: JCR-3075
                 URL: https://issues.apache.org/jira/browse/JCR-3075
             Project: Jackrabbit Content Repository
          Issue Type: Bug
          Components: jackrabbit-core
            Reporter: Julian Reschke
            Priority: Minor


The generated excerpt highlights single characters instead of full words. Test case (to be added to FullTextQueryTest):

    public void testJapaneseAndHighlight() throws RepositoryException {
        // http://translate.google.com/#auto|en|%E3%82%B3%E3%83%B3%E3%83%86%E3%83%B3%E3%83%88
        String jContent = "\u30b3\u30f3\u30c6\u30f3\u30c8";
        // http://translate.google.com/#auto|en|%E3%83%86%E3%82%B9%E3%83%88
        String jTest = "\u30c6\u30b9\u30c8";

        String content = "some text with japanese: " + jContent
                + " ('content')" + " and " + jTest + " ('test').";

        // expected excerpt; note this may change if excerpt providers change
        String expectedExcerpt = "<div><span>some text with japanese: " + jContent
                + " ('content') and <strong>" + jTest
                + "</strong> ('test').</span></div>";

        // store the mixed English/Japanese text on a new node
        Node n = testRootNode.addNode("node1");
        n.setProperty("title", content);
        testRootNode.getSession().save();

        // full-text search for the Japanese word, requesting an excerpt
        String xpath = "/jcr:root" + testRoot + "/element(*, nt:unstructured)"
                + "[jcr:contains(., '" + jTest + "')]/rep:excerpt(.)";
        Query q = superuser.getWorkspace().getQueryManager()
                .createQuery(xpath, Query.XPATH);

        QueryResult qr = q.execute();
        RowIterator it = qr.getRows();
        int cnt = 0;
        while (it.hasNext()) {
            cnt++;
            Row found = it.nextRow();
            assertEquals(n.getPath(), found.getPath());
            String excerpt = found.getValue("rep:excerpt(.)").getString();
            assertEquals(expectedExcerpt, excerpt);
        }

        // exactly one node should match
        assertEquals(1, cnt);
    }


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JCR-3075) incorrect HTML excerpt generation for queries on japanese text content

Posted by "Julian Reschke (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109519#comment-13109519 ] 

Julian Reschke commented on JCR-3075:
-------------------------------------

ExcerptTest#testEncodeIllegalCharsNoHighlights fails because the patch switched it from testing excerpt(text) to excerpt(.).

Other than that the new tests look good. I know next to nothing about Lucene, but if the code fixes the new tests while not breaking anything old, then this should probably go in.

Thanks for taking over!

[jira] [Updated] (JCR-3075) incorrect HTML excerpt generation for queries on japanese text content

Posted by "Alex Parvulescu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JCR-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Parvulescu updated JCR-3075:
---------------------------------

    Attachment: JCR-3075.patch

This problem also affects excerpt generation for any quoted phrase search, not just Japanese.
Normally a quoted phrase should be treated as a single item when building the excerpt.

Now, because of LUCENE-2458, a normal search using Japanese turns into the equivalent of a quoted search in, say, English.
Because the excerpt generator has trouble with phrases, any Japanese search gets each character of the search token highlighted separately, instead of one highlight covering the whole word.

The patch should fix both the original issue, and highlighting for any quoted search.
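
To make the intended behaviour concrete, here is a minimal sketch of phrase-aware highlighting. It is not the code from the attached patch: the class PhraseHighlightSketch, its toy tokenizer and the method names are all made up for illustration, and it does no HTML escaping. It only shows the idea of emitting one highlight per phrase match, with the terms in query order, rather than one highlight per term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; not Jackrabbit's excerpt provider. */
final class PhraseHighlightSketch {

    /** A token together with its character offsets in the source text. */
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Toy analyzer (an assumption): one token per katakana character,
     *  otherwise one token per run of letters or digits. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (isKatakana(c)) {
                tokens.add(new Token(String.valueOf(c), i, i + 1));
                i++;
            } else if (Character.isLetterOrDigit(c)) {
                int start = i;
                while (i < text.length()
                        && Character.isLetterOrDigit(text.charAt(i))
                        && !isKatakana(text.charAt(i))) {
                    i++;
                }
                tokens.add(new Token(text.substring(start, i), start, i));
            } else {
                i++; // skip whitespace and punctuation
            }
        }
        return tokens;
    }

    static boolean isKatakana(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.KATAKANA;
    }

    /** Wraps every in-order occurrence of the phrase in a single strong element. */
    static String highlight(String text, List<String> phraseTerms) {
        List<Token> tokens = tokenize(text);
        StringBuilder out = new StringBuilder();
        int copied = 0; // text before this offset is already in 'out'
        int n = phraseTerms.size();
        int i = 0;
        while (n > 0 && i + n <= tokens.size()) {
            if (matchesAt(tokens, i, phraseTerms)) {
                Token first = tokens.get(i);
                Token last = tokens.get(i + n - 1);
                out.append(text, copied, first.start)
                        .append("<strong>")
                        .append(text, first.start, last.end)
                        .append("</strong>");
                copied = last.end;
                i += n; // continue after the whole phrase
            } else {
                i++;
            }
        }
        return out.append(text, copied, text.length()).toString();
    }

    /** True if the tokens starting at 'from' equal the phrase terms, in order. */
    static boolean matchesAt(List<Token> tokens, int from, List<String> phraseTerms) {
        for (int k = 0; k < phraseTerms.size(); k++) {
            if (!tokens.get(from + k).term.equals(phraseTerms.get(k))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(highlight("some text: \u30c6\u30b9\u30c8 ('test').",
                Arrays.asList("\u30c6", "\u30b9", "\u30c8")));
    }
}
```

With this sketch, highlight("the quick brown fox", Arrays.asList("quick", "brown")) yields "the <strong>quick brown</strong> fox", and the same terms in reverse order produce no highlight at all, because the match is made on consecutive tokens in query order.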
The problem is there is one test failing and I'm not sure why :(

The failing test is ExcerptTest#testEncodeIllegalCharsNoHighlights, which apparently fails because there is more info on the node returned from the search than expected.
This should not happen, as I haven't touched that part of the code (node indexing), but sadly it does so I still need to investigate.

I'd also welcome some feedback on the approach.

[jira] [Resolved] (JCR-3075) incorrect HTML excerpt generation for queries on japanese text content

Posted by "Alex Parvulescu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JCR-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Parvulescu resolved JCR-3075.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 2.2.8
         Assignee: Alex Parvulescu

[jira] [Updated] (JCR-3075) incorrect HTML excerpt generation for queries on japanese text content

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JCR-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated JCR-3075:
-------------------------------

    Fix Version/s: 2.1.6
                   2.0.4

Merged to the 2.1 branch in revision 1174630 and to 2.0 in revision 1174642.

[jira] [Commented] (JCR-3075) incorrect HTML excerpt generation for queries on japanese text content

Posted by "Alex Parvulescu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109573#comment-13109573 ] 

Alex Parvulescu commented on JCR-3075:
--------------------------------------

Fixed on trunk in revisions #1173694 and #1173707.
Backported to 2.2 in revision #1173706.

I had to tweak the trunk code (rev #1173707) because there were some changes in the Lucene API, and I wanted to have the same code on both trunk and the 2.2 branch.


[jira] [Updated] (JCR-3075) incorrect HTML excerpt generation for queries on japanese text content

Posted by "Alex Parvulescu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JCR-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Parvulescu updated JCR-3075:
---------------------------------

    Attachment: JCR-3075-v2.patch

Thanks Julian for going through the patch and identifying the problem (as usual, it was me who broke the test ;)

OK, I added some more tests to the suite and tweaked the patch a little more.
It turns out the previous patch did not preserve the token order for a phrase query, which is not good.

So now all the tests pass.

I'll probably commit this soon, but I'll attach v2 here too, just to keep track.


[jira] [Updated] (JCR-3075) incorrect HTML excerpt generation for queries on japanese text content

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JCR-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated JCR-3075:
-------------------------------

    Fix Version/s:     (was: 2.2.8)
                   2.2.9

[jira] [Commented] (JCR-3075) incorrect HTML excerpt generation for queries on japanese text content

Posted by "Alex Parvulescu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108009#comment-13108009 ] 

Alex Parvulescu commented on JCR-3075:
--------------------------------------

Hmm, it would appear that LUCENE-2458 [0] has something to do with this.
(It treats a single word made of a 3-character sequence as a phrase of 3 one-character words.)
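
A rough illustration of that effect, with no Lucene classes involved: the sketch below assumes an analyzer that emits one token per katakana character, and a query builder that turns any word yielding more than one token into a phrase. The class CjkPhraseSketch and both method names are hypothetical, and the output strings are just labels, not real query objects.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only; the names here are hypothetical, not Lucene API. */
final class CjkPhraseSketch {

    /** Toy analyzer (an assumption): each katakana character becomes its own
     *  token, any other run of characters stays together as one token. */
    static List<String> analyze(String word) {
        List<String> terms = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (Character.UnicodeBlock.of(c) == Character.UnicodeBlock.KATAKANA) {
                if (run.length() > 0) {
                    terms.add(run.toString());
                    run.setLength(0);
                }
                terms.add(String.valueOf(c));
            } else {
                run.append(c);
            }
        }
        if (run.length() > 0) {
            terms.add(run.toString());
        }
        return terms;
    }

    /** One token gives a term query; several tokens from a single word give a
     *  phrase. Expects a non-empty word without whitespace. */
    static String describeQuery(String word) {
        List<String> terms = analyze(word);
        if (terms.size() == 1) {
            return "term: " + terms.get(0);
        }
        return "phrase: \"" + String.join(" ", terms) + "\"";
    }

    public static void main(String[] args) {
        System.out.println(describeQuery("test"));
        System.out.println(describeQuery("\u30c6\u30b9\u30c8"));
    }
}
```

Under these assumptions the English word stays a single term, while the three-character Japanese word is reported as a phrase of three one-character terms, which is the shape of query the excerpt code then has to highlight.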

I'm not sure what the best way forward is. Upgrading Lucene to 3.1, maybe?
I don't think it's just a drop-in replacement; I tried, and there are some errors on repo startup.

This raises the question of the Lucene version that JR is using, which seems to be really behind. We should probably try to update more often, but that is a different topic of discussion.


[0] https://issues.apache.org/jira/browse/LUCENE-2458

