Posted to dev@tika.apache.org by "Abhishek Jain (JIRA)" <ji...@apache.org> on 2011/09/24 09:42:26 UTC

[jira] [Created] (TIKA-729) TIKA CharsetDetector not detecting UTF-16BE/UTF-16LE encodings

TIKA CharsetDetector not detecting UTF-16BE/UTF-16LE encodings
--------------------------------------------------------------

                 Key: TIKA-729
                 URL: https://issues.apache.org/jira/browse/TIKA-729
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.9
            Reporter: Abhishek Jain


I came across this bug while converting Unicode files to UTF-16. For files written in UTF-16BE or UTF-16LE, CharsetDetector reports the encoding as "ISO-8859-1".

{code}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;
import org.xml.sax.SAXException;

public class TikaTextConverter {
  public static void main(String args[]) throws IOException, SAXException, TikaException {
    String inputPath = "/tmp/input.csv";
      
    Writer writer = new OutputStreamWriter(new FileOutputStream(inputPath), "UTF-16LE");
    writer.write("Line1, Some text, Some more text");
    writer.close();
    
    InputStream inputStream = TikaInputStream.get(new File(inputPath).toURI().toURL(), new Metadata());
    
    CharsetDetector detector = new CharsetDetector();
    detector.setText(inputStream);
    
    CharsetMatch[] matches = detector.detectAll();
    for (CharsetMatch match : matches) {
      System.out.println(match.getName());
    }
  }
}
{code} 
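
For reference, Java's "UTF-16LE" charset writes no byte order mark (BOM), so the file produced above starts directly with the text bytes. Since the follow-up comment on this issue says that files with a BOM are detected, one possible workaround is to write U+FEFF explicitly as the first character. The sketch below uses only the JDK and does not call Tika; the class name BomWriterSketch and the helper encode are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class BomWriterSketch {
  // Hypothetical helper: encodes text as UTF-16LE, optionally prefixed with a BOM.
  static byte[] encode(String text, boolean withBom) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    Writer writer = new OutputStreamWriter(bytes, StandardCharsets.UTF_16LE);
    try {
      if (withBom) {
        writer.write('\uFEFF'); // U+FEFF is encoded as the bytes FF FE in UTF-16LE
      }
      writer.write(text);
    } finally {
      writer.close(); // flushes the encoder into the byte array
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] plain = encode("Line1", false);
    byte[] marked = encode("Line1", true);
    // Print the first two bytes of each encoding as hex.
    System.out.printf("without BOM: %02X %02X%n", plain[0], plain[1]);
    System.out.printf("with BOM:    %02X %02X%n", marked[0], marked[1]);
  }
}
```

Whether a BOM is acceptable depends on the consumers of the file; some tools treat the leading U+FEFF as content.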

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (TIKA-729) TIKA CharsetDetector not detecting UTF-16BE/UTF-16LE encodings

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-729.
-----------------------------

    Resolution: Duplicate

[jira] [Commented] (TIKA-729) TIKA CharsetDetector not detecting UTF-16BE/UTF-16LE encodings

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113942#comment-13113942 ] 

Nick Burch commented on TIKA-729:
---------------------------------

Duplicate of TIKA-721. Files with a BOM will be detected; those without one currently won't be.
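
To illustrate the distinction: a UTF-16 BOM is the character U+FEFF at the very start of the stream, which appears as the bytes FE FF in big-endian and FF FE in little-endian order. The following is a minimal sketch of such a check using only the JDK. It is not Tika's detection code, and the class BomSniffSketch and the method utf16FromBom are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

public class BomSniffSketch {
  // Hypothetical helper: returns the UTF-16 variant indicated by a leading BOM,
  // or null when the data does not start with one.
  static String utf16FromBom(byte[] data) {
    if (data.length < 2) {
      return null;
    }
    int b0 = data[0] & 0xFF;
    int b1 = data[1] & 0xFF;
    if (b0 == 0xFE && b1 == 0xFF) {
      return "UTF-16BE";
    }
    if (b0 == 0xFF && b1 == 0xFE) {
      return "UTF-16LE";
    }
    return null;
  }

  public static void main(String[] args) {
    String text = "Line1, Some text, Some more text";
    // Java's "UTF-16" encoder prepends a big-endian BOM;
    // "UTF-16LE" and "UTF-16BE" write none unless U+FEFF is added by hand.
    System.out.println(utf16FromBom(text.getBytes(StandardCharsets.UTF_16)));
    System.out.println(utf16FromBom(text.getBytes(StandardCharsets.UTF_16LE)));
    System.out.println(utf16FromBom(("\uFEFF" + text).getBytes(StandardCharsets.UTF_16LE)));
  }
}
```

A fuller check would also rule out UTF-32LE, whose BOM (FF FE 00 00) begins with the same two bytes as the UTF-16LE one.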
