Posted to dev@tika.apache.org by "Hong-Thai Nguyen (JIRA)" <ji...@apache.org> on 2014/07/24 11:52:38 UTC
[jira] [Resolved] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hong-Thai Nguyen resolved TIKA-1373.
------------------------------------
Resolution: Fixed
> AutoDetectParser extracts no text when SourceCodeParser is selected
> -------------------------------------------------------------------
>
> Key: TIKA-1373
> URL: https://issues.apache.org/jira/browse/TIKA-1373
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.5
> Reporter: Andrés Aguilar-Umaña
>
> When the AutoDetectParser is used from Java code and the SourceCodeParser is selected (e.g. for Java files), the handler receives no text.
> I have this test program:
> {code}
> String data = "public class HelloWorld {}";
> ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
> Parser autoDetectParser = new AutoDetectParser();
> BodyContentHandler bch = new BodyContentHandler(50);
> ParseContext parseContext = new ParseContext();
> Metadata metadata = new Metadata();
> metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
> try {
>     autoDetectParser.parse(bais, bch, metadata, parseContext);
> } catch (Exception e) {
>     e.printStackTrace();
> }
> System.out.println("Text extracted: " + bch.toString());
> {code}
> It returns (using the SourceCodeParser):
> {code}
> Text extracted:
> {code}
> But when I use this code:
> {code}
> String data = "public class HelloWorld {}";
> ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
> Parser autoDetectParser = new AutoDetectParser();
> BodyContentHandler bch = new BodyContentHandler(50);
> ParseContext parseContext = new ParseContext();
> Metadata metadata = new Metadata();
> metadata.set(Metadata.CONTENT_TYPE, "text/plain");
> try {
>     autoDetectParser.parse(bais, bch, metadata, parseContext);
> } catch (Exception e) {
>     e.printStackTrace();
> }
> System.out.println("Text extracted: " + bch.toString());
> {code}
> The Text Parser is used and I get:
> {code}
> Text extracted: public class HelloWorld {}
> {code}
> I have also tested this command:
> {code}
> > java -jar tika-app-1.5.jar -t D:\text.java
> (no text)
> {code}
--
This message was sent by Atlassian JIRA
(v6.2#6252)