Posted to dev@jackrabbit.apache.org by "Alex Parvulescu (JIRA)" <ji...@apache.org> on 2011/09/19 19:25:09 UTC

[jira] [Commented] (JCR-3075) incorrect HTML excerpt generation for queries on japanese text content

    [ https://issues.apache.org/jira/browse/JCR-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108009#comment-13108009 ] 

Alex Parvulescu commented on JCR-3075:
--------------------------------------

Hmm, it looks like LUCENE-2458 [0] has something to do with this
(it treats a single word made of a 3-character sequence as a phrase of 3 words of 1 character each).
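
To illustrate the effect described above, here is a minimal, self-contained sketch. It is not Lucene or Jackrabbit code: the class ExcerptSketch and its methods are hypothetical names made up for this example, and the highlighter is deliberately naive. It only shows why a term that gets split into one token per character tends to end up highlighted character by character, while a term kept as one token is highlighted as a whole.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Lucene or Jackrabbit.
public class ExcerptSketch {

    // Assumed behaviour: one token per char, so a 3-character word becomes
    // a "phrase" of 3 one-character words (fine for BMP characters such as katakana).
    public static List<String> perCharTokens(String term) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i < term.length(); i++) {
            tokens.add(String.valueOf(term.charAt(i)));
        }
        return tokens;
    }

    // Desired behaviour: the whole term stays a single token.
    public static List<String> wholeWordToken(String term) {
        List<String> tokens = new ArrayList<String>();
        tokens.add(term);
        return tokens;
    }

    // Naive highlighter: finds the first place where the tokens occur back to back
    // and wraps each token separately in <strong>...</strong>.
    public static String highlight(String text, List<String> tokens) {
        StringBuilder phrase = new StringBuilder();
        for (String t : tokens) {
            phrase.append(t);
        }
        int start = text.indexOf(phrase.toString());
        if (start < 0) {
            return text; // no match, nothing to highlight
        }
        StringBuilder out = new StringBuilder(text.substring(0, start));
        for (String t : tokens) {
            out.append("<strong>").append(t).append("</strong>");
        }
        out.append(text.substring(start + phrase.length()));
        return out.toString();
    }

    public static void main(String[] args) {
        String jTest = "\u30c6\u30b9\u30c8";
        String text = "and " + jTest + " ('test')";
        // One <strong> element per character.
        System.out.println(highlight(text, perCharTokens(jTest)));
        // One <strong> element around the whole word.
        System.out.println(highlight(text, wholeWordToken(jTest)));
    }
}
```

The real excerpt provider may merge or format the highlighted fragments differently; the sketch only shows where the per-character split comes from.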

I'm not sure what the best way forward is. Maybe upgrade Lucene to 3.1?
I don't think it's a drop-in replacement, though: I tried, and there were some errors on repository startup.

This raises the question of the Lucene version that JR is using, which seems to be far behind. We should probably try to update more often, but that is a separate discussion.


[0] https://issues.apache.org/jira/browse/LUCENE-2458


> incorrect HTML excerpt generation for queries on japanese text content 
> -----------------------------------------------------------------------
>
>                 Key: JCR-3075
>                 URL: https://issues.apache.org/jira/browse/JCR-3075
>             Project: Jackrabbit Content Repository
>          Issue Type: Bug
>          Components: jackrabbit-core
>            Reporter: Julian Reschke
>            Priority: Minor
>
> The generated excerpt highlights single characters instead of full words. Test case (to be added to FullTextQueryTest):
>      public void testJapaneseAndHighlight() throws RepositoryException {
>         // http://translate.google.com/#auto|en|%E3%82%B3%E3%83%B3%E3%83%86%E3%83%B3%E3%83%88
>         String jContent = "\u30b3\u30f3\u30c6\u30f3\u30c8";
>         // http://translate.google.com/#auto|en|%E3%83%86%E3%82%B9%E3%83%88
>         String jTest = "\u30c6\u30b9\u30c8";
>         
>         String content = "some text with japanese: " + jContent
>                 + " ('content')" + " and " + jTest + " ('test').";
>         // expected excerpt; note this may change if excerpt providers change
>         String expectedExcerpt = "<div><span>some text with japanese: " + jContent
>                 + " ('content') and <strong>" + jTest
>                 + "</strong> ('test').</span></div>";
>         
>         Node n = testRootNode.addNode("node1");
>         n.setProperty("title", content);
>         testRootNode.getSession().save();
>         
>         String xpath = "/jcr:root" + testRoot + "/element(*, nt:unstructured)"
>                 + "[jcr:contains(., '" + jTest + "')]/rep:excerpt(.)";
>         Query q = superuser.getWorkspace().getQueryManager()
>                 .createQuery(xpath, Query.XPATH);
>         
>         QueryResult qr = q.execute();
>         RowIterator it = qr.getRows();
>         int cnt = 0;
>         while (it.hasNext()) {
>             cnt++;
>             Row found = it.nextRow();
>             assertEquals(n.getPath(), found.getPath());
>             String excerpt = found.getValue("rep:excerpt(.)").getString();
>             assertEquals(expectedExcerpt, excerpt);
>         }
>         
>         assertEquals(1, cnt);
>     }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira