Posted to dev@tika.apache.org by "Andrés Aguilar-Umaña (JIRA)" <ji...@apache.org> on 2014/07/22 23:12:40 UTC
[jira] [Updated] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrés Aguilar-Umaña updated TIKA-1373:
---------------------------------------
Description:
When the AutoDetectParser is used from Java code and the SourceCodeParser is selected (e.g. for Java files), the handler receives no text.
I have this test program:
String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
autoDetectParser = new SourceCodeParser();
// Collect at most 50 characters of body text
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
// Content type hint for Java source
metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
try {
    autoDetectParser.parse(bais, bch, metadata, parseContext);
} catch (Exception e) {
    e.printStackTrace();
}
System.out.println("Text extracted: " + bch.toString());
It returns (using the SourceCodeParser):
> Text extracted:
But when I use this code:
String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
autoDetectParser = new SourceCodeParser();
// Collect at most 50 characters of body text
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
// Only difference from the first program: the content type hint is text/plain
metadata.set(Metadata.CONTENT_TYPE, "text/plain");
try {
    autoDetectParser.parse(bais, bch, metadata, parseContext);
} catch (Exception e) {
    e.printStackTrace();
}
System.out.println("Text extracted: " + bch.toString());
The Text Parser is used and I get:
> Text extracted: public class HelloWorld {}
I have also tested this command:
> java -jar tika-app-1.5.jar -t D:\text.java
(no text)
>
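The two test programs above differ only in the content type hint, so the comparison can be run from a single harness. The following is a minimal sketch, not part of Tika: the class name ExtractionComparison and the extractor function are hypothetical, and the stub in main only mimics the behaviour reported above. To reproduce against a real installation, the stub would be replaced by a function that runs the parse call from the test program and returns bch.toString().

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class ExtractionComparison {

    /**
     * Calls the extractor once per content type and records the trimmed text
     * each call yields. A null result is stored as an empty string.
     * The extractor is a hypothetical stand-in for the parse call.
     */
    public static Map<String, String> compare(Function<String, String> extractor,
                                              String... contentTypes) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String type : contentTypes) {
            String text = extractor.apply(type);
            results.put(type, text == null ? "" : text.trim());
        }
        return results;
    }

    public static void main(String[] args) {
        String data = "public class HelloWorld {}";
        // Stub that mimics the reported behaviour (assumption, no Tika involved):
        // text comes back only when the content type hint is text/plain.
        Function<String, String> stub = type -> "text/plain".equals(type) ? data : "";
        compare(stub, "text/x-java-source", "text/plain").forEach((type, text) ->
                System.out.println(type + " -> " + (text.isEmpty() ? "(no text)" : text)));
    }
}
```

Keeping the content types in one call makes it easy to add further source code types to the comparison.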
> AutoDetectParser extracts no text when SourceCodeParser is selected
> -------------------------------------------------------------------
>
> Key: TIKA-1373
> URL: https://issues.apache.org/jira/browse/TIKA-1373
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.5
> Reporter: Andrés Aguilar-Umaña
>
--
This message was sent by Atlassian JIRA
(v6.2#6252)