You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2010/08/09 22:01:16 UTC

[jira] Created: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

javabin should use UTF-8, not modified UTF-8
--------------------------------------------

                 Key: SOLR-2034
                 URL: https://issues.apache.org/jira/browse/SOLR-2034
             Project: Solr
          Issue Type: Bug
            Reporter: Robert Muir
         Attachments: SOLR-2034.patch

for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900423#action_12900423 ] 

Hoss Man commented on SOLR-2034:
--------------------------------

bq. I don't think adding many hoops for back compatibility is worth the trouble. Note that that does not mean people can not use solrj to talk across different versions - they may have to use xml though....

Agreed, my chief concern is what happens when someone tries to use SolrJ 1.4 to talk to Solr 3.1 w/javabin (or vice versa).

A) If they get an error: great, i'm totaly fine with that -- we just document that they should use XML in this case.

B) If the commands succeed, but the string data is _always_ corrupted, that's not ideal -- but not totally horrible since the probably should be immediately obvious and should have read the documentation and known not to do that.

C) if the commands succeed, but the string data is _sometimes_ corrupted (as i recall, not every character is different in UTF8 vs Java's  modified UTF8, correct?) then that seems really bad ... people may start using javabin to update their index and not notice for quite some time that big hard to identify chunks of their data are corrupted.

as long a someone sanity checks that the situation is either #A or #B before committing, i'm totally cool with it ... but #C scares the bejesus out of me.

(i'll try to run some tests myself in the next few days if no one else gets a chance)


> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Updated: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-2034:
------------------------------

    Attachment: SOLR-2034.patch

I bumped the BinaryResponseParser version (only version i can find here).

its not really obvious to me if this is actually written over the wire / how to conditionalize modified-UTF-8 based on it, and seems risky.

I think its best to just go to UTF-8 and never look back (but if someone knows how to support modified UTF-8 when version=1, thats great, I just don't have the heart)

> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Shawn Heisey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925648#action_12925648 ] 

Shawn Heisey commented on SOLR-2034:
------------------------------------

So it requires modifying the code?  I am not a Java programmer, nor am I familiar with Solr internals and objects.  I do understand programming principles and know a few other languages, so I can sometimes muddle my way through the unfamiliar.

I took a look at ReplicationHandler.java and RequestHandlerBase.java and was not able to find anything having to do with javabin or looking remotely like what you mentioned.  I did find a reference to BinaryQueryResponseWriter, but the section of code that uses it also has this in it, suggesting that replication and javabin might be heavily intertwined:

throw new RuntimeException("This is a binary writer , Cannot write to a characterstream");


> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900442#action_12900442 ] 

Robert Muir commented on SOLR-2034:
-----------------------------------

Hoss man, I hear your concerns but i don't understand how we can address any of this.

This is really one of the problems of modified-UTF8, and really my big concern with using it (that clients will be wrongly written, see my example above). Its not really possible or reasonable to address it at the javabin layer: it needs to be done at a higher protocol level.

of course, if we can figure this out, then maybe it would be easy to provide back compat too, but i didnt see any obvious places in the code where any versioning is written over the wire.


> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Resolved: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved SOLR-2034.
-------------------------------

      Assignee: Robert Muir
    Resolution: Fixed

Committed revision 990180 (trunk), 990183 (3x), and updated the 'javabin' page on the wiki.

I tried to make the change easy to understand in CHANGES/wiki, if you have improvements to
the wording please do not hesitate.


> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902744#action_12902744 ] 

Robert Muir commented on SOLR-2034:
-----------------------------------

if no one objects to the latest patch, i'd like to commit in a day or two.


> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900443#action_12900443 ] 

Robert Muir commented on SOLR-2034:
-----------------------------------

{quote}
JavaBinCodec.VERSION should be bumped.
It is serialized and verified when decoding, and currently an exception is thrown if it does not match the current version.
{quote}

Ahhh, I totally missed that version byte. So did I bump the wrong version in the patch (BinaryResponseParser's)? I'll fix


> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925642#action_12925642 ] 

Ryan McKinley commented on SOLR-2034:
-------------------------------------

try:
{code:java}
server.setParser(new XMLResponseParser()); 
{code}
You may also have to set:
{code:java} 
server.setRequestWriter(new RequestWriter());
{code}

> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Shawn Heisey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925638#action_12925638 ] 

Shawn Heisey commented on SOLR-2034:
------------------------------------

I would be happy to take that route.  You'll have to tell me how to switch to XML.  I have already tried a couple of things that didn't work.

> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898391#action_12898391 ] 

Hoss Man commented on SOLR-2034:
--------------------------------

+1 to the patch

My only concern is upgrade compatibility -- it would be preferable if people upgrading either Solr or their SolrJ client (but not both at the exact same moment) would still have a functioning system.

As i recall, the BinaryResponseWriter / Parser use a version param and version metadata in the response (just like the XmlResponseWriter) to indicate the codec version requested and the code version returned -- this seems like the kind of thing that should probably warrant a new coden impl with a new version number.

that said: i didn't follow the details of the binary response writer / parser / codec implementation very closely, so i have no idea how hard it will be to make it all work smoothly for people: if it's a pain in the ass then i'm totally fine with saying that SolrJ 3.x can't talk to Solr 1.x (and vice versa) ... but we should still probably update the binary code version info to make it clear there is a difference

> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903940#action_12903940 ] 

Noble Paul commented on SOLR-2034:
----------------------------------

Is interoperability the issue? The question is , is there any other client using javabin format? Or is it just to be standards compliant?

> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899730#action_12899730 ] 

Ryan McKinley commented on SOLR-2034:
-------------------------------------

I don't think adding many hoops for back compatibility is worth the trouble.  Note that that does not mean people can not use solrj to talk across different versions -- they may have to use xml though.

> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925614#action_12925614 ] 

Ryan McKinley commented on SOLR-2034:
-------------------------------------

Can you switch to xml format?  then when everythign is upgraded, switch back?

> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Shawn Heisey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925576#action_12925576 ] 

Shawn Heisey commented on SOLR-2034:
------------------------------------

I started upgrading my slave shards today from 1.4.1 to 3.1 and ran into a bit of a showstopper:

SEVERE: Master at: http://HOST:8983/solr/live/replication is not available. Index fetch failed. Exception: Invalid version or the data in not in 'javabin' format

What advice do you have?  I can't upgrade both master and slave at the same time - we have to be able to fully test against the new version.  I would rather not do the testing with completely static indexes, it would be better to have data being added and deleted normally.


> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901225#action_12901225 ] 

Robert Muir commented on SOLR-2034:
-----------------------------------

thanks hoss, ill commit tomorrow if there are no objections.

> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Updated: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-2034:
------------------------------

    Attachment: SOLR-2034.patch

well, i object to my own patch, because i think it would suck to have solrj depend on the lucene jar.

here's a modified version with its own utf-8 conversion and no bytesref/unicodeutil


> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904002#action_12904002 ] 

Robert Muir commented on SOLR-2034:
-----------------------------------

bq. ok. I was wondering if we are planning to implement javabin in any other languages

I can't speak for this, except to say that if we wanted to, I think implementing the encoding
portion would now be significantly easier, especially for languages that don't use an
internal UTF-16 representation like Java does.

> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898405#action_12898405 ] 

Robert Muir commented on SOLR-2034:
-----------------------------------

Hoss, thanks, I agree with regards to backwards compat, unfortunately its not immediately obvious to me how to implement the versioning (seamless like you said, would be preferable).

the only thing i see is the version in the response parser, but i will play some and see if i can do it in a versioned way (any more pointers would be very helpful).

ultimately the goal would be to make it easier for non-java clients to implement this protocol. although the wiki says only the java client implements this, i found an issue for the .NET client here: http://code.google.com/p/solrnet/issues/detail?id=71

I took a look at the github source code (http://github.com/mausch/SolrNet/blob/javabin/SolrNet/Impl/JavaBinCodec.cs) and was a little concerned to see writeChars implemented with Encoding.UTF8.GetBytes... I know its likely a work in progress etc, but I think it illustrates the benefits of standard UTF-8.



> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901204#action_12901204 ] 

Hoss Man commented on SOLR-2034:
--------------------------------

beautiful.

+1 commit

> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Updated: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-2034:
------------------------------

    Attachment: SOLR-2034.patch

> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903964#action_12903964 ] 

Robert Muir commented on SOLR-2034:
-----------------------------------

Noble, please see my comment: https://issues.apache.org/jira/browse/SOLR-2034?focusedCommentId=12898405&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12898405

thats an example of another client trying to implement javabin, and wrongly implementing the modified-utf8 conversion... but there doesn't need to be any justification to not use modified utf-8 over the wire, its just wrong.

> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900076#action_12900076 ] 

Robert Muir commented on SOLR-2034:
-----------------------------------

Thanks Ryan, I'll wait a few days before committing to see if there are any objections.

If there aren't, i'll update wiki / CHANGES loud and clear that the binary format has changed.


> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Updated: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-2034:
------------------------------

    Attachment: SOLR-2034.patch

OK, i bumped the byte version as Yonik suggested, and tried to use an old client.

Here's the exception:

{noformat}
java.lang.RuntimeException: Invalid version or the data in not in 'javabin' format
	at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
	at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39)
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:477)
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
	at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
	at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
{noformat}


> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900432#action_12900432 ] 

Yonik Seeley commented on SOLR-2034:
------------------------------------

Seems OK.  I think modified UTF-8 was originally used so that the string chars could be directly written to the output stream instead of to a temp buffer.  But copying to a temp buffer first shouldn't have that much overhead.

JavaBinCodec.VERSION should be bumped.
It is serialized and verified when decoding, and currently an exception is thrown if it does not match the current version.

> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2034) javabin should use UTF-8, not modified UTF-8

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903974#action_12903974 ] 

Noble Paul commented on SOLR-2034:
----------------------------------

ok. I was wondering if we are planning to implement javabin in any other languages

> javabin should use UTF-8, not modified UTF-8
> --------------------------------------------
>
>                 Key: SOLR-2034
>                 URL: https://issues.apache.org/jira/browse/SOLR-2034
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>         Attachments: SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch, SOLR-2034.patch
>
>
> for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org