Posted to dev@tika.apache.org by "Benoit Moreau (JIRA)" <ji...@apache.org> on 2014/04/21 11:47:15 UTC

[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser

    [ https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975502#comment-13975502 ] 

Benoit Moreau commented on TIKA-1224:
-------------------------------------

I'm disappointed because it does not work!

For example:

> java -jar tika-app-1.5.jar -t Test.java
Output is empty

> java -jar tika-app-1.5.jar -h Test.java
Output is strange

> java -jar tika-app-1.5.jar -T Test.java
Output is what I would expect from -h instead:
{code:xml}
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="htt
p://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head>     <meta http-equiv=
"content-type" content="text/html; charset=ISO-8859-1" />     <meta name="genera
tor" content="JHighlight v1.0 (http://jhighlight.dev.java.net)" />     <title>Te
st.java</title>     <link rel="Help" href="http://jhighlight.dev.java.net" />
  <style type="text/css"> .java_type { color: rgb(0,44,221); } .java_keyword { c
olor: rgb(0,0,0); font-weight: bold; } .java_javadoc_comment { color: rgb(147,14
7,147); background-color: rgb(247,247,247); font-style: italic; } .java_comment
{ color: rgb(147,147,147); background-color: rgb(247,247,247); } .java_operator
{ color: rgb(0,124,31); } .java_plain { color: rgb(0,0,0); } .java_literal { col
or: rgb(188,0,0); } code { color: rgb(0,0,0); font-family: monospace; font-size:
 12px; white-space: nowrap; } .java_javadoc_tag { color: rgb(147,147,147); backg
round-color: rgb(247,247,247); font-style: italic; font-weight: bold; } .java_se
parator { color: rgb(0,33,255); } h1 { font-family: sans-serif; font-size: 16pt;
 font-weight: bold; color: rgb(0,0,0); background: rgb(210,210,210); border: sol
id 1px black; padding: 5px; text-align: center; }     </style> </head> <body> <h
1>Test.java</h1><code><span class="java_javadoc_comment">/**&nbsp;*&nbsp;Class&n
bsp;Test.&nbsp;*&nbsp;*&nbsp;</span><span class="java_javadoc_tag">@author</span
><span class="java_javadoc_comment">&nbsp;ben.12&nbsp;*/</span><span class="java
_keyword">public</span><span class="java_plain">&nbsp;</span><span class="java_k
eyword">class</span><span class="java_plain">&nbsp;</span><span class="java_type
">Test</span><span class="java_plain">&nbsp;</span><span class="java_separator">
{</span><span class="java_plain">&nbsp;&nbsp;</span><span class="java_comment">/
/&nbsp;Class&nbsp;Test}</span><br /> </code> </body> </html>
{code}
But everything is on a single line, the indentation is lost, and the file name appears at the beginning.
The author is not in the head meta tags.
The last "}" is highlighted as a comment.
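
A guess about the last point, not a confirmed diagnosis: the output above is consistent with line breaks being dropped before the source is highlighted. If that is the case, a "//" comment no longer ends at the end of its line and swallows the closing "}". The sketch below only illustrates that effect. The class LineCommentDemo and both methods are hypothetical names of mine, not Tika or JHighlight code.

```java
// Hypothetical sketch: NOT Tika or JHighlight code, only an illustration.
class LineCommentDemo {

    // Returns the first "//" comment: from "//" up to the next line break,
    // or up to the end of the input when no line break follows.
    // Returns "" when there is no "//". Deliberately naive: it does not
    // skip "//" inside string literals or block comments.
    static String firstLineComment(String source) {
        int start = source.indexOf("//");
        if (start < 0) {
            return "";
        }
        int end = source.indexOf('\n', start);
        return end < 0 ? source.substring(start) : source.substring(start, end);
    }

    // Assumption: simulates a reader that discards line terminators.
    static String dropLineBreaks(String source) {
        return source.replace("\r", "").replace("\n", "");
    }

    public static void main(String[] args) {
        String source = "public class Test {\n\t// Class Test\n}\n";

        // Line breaks intact: the comment stops at the end of its line.
        System.out.println(firstLineComment(source));
        // prints: // Class Test

        // Line breaks dropped: the comment runs to the end of the input.
        System.out.println(firstLineComment(dropLineBreaks(source)));
        // prints: // Class Test}
    }
}
```

If this guess is right, keeping the line terminators when passing the source to the highlighter should fix both the single-line output and the mis-highlighted "}".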

\\
My input Java file:
{code:title=Test.java}
/**
 * Class Test.
 *
 * @author ben.12
 */
public class Test {
	// Class Test
}
{code}

> Adding Source code (Java, Groovy, C) parser
> -------------------------------------------
>
>                 Key: TIKA-1224
>                 URL: https://issues.apache.org/jira/browse/TIKA-1224
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.5
>            Reporter: Hong-Thai Nguyen
>            Priority: Minor
>
> We can parse some source code file formats:
> text/x-java-source
> text/x-groovy
> text/x-c
> For HTML rendering of the code, we can use jhighlight: http://www.ohloh.net/p/jhighlight



--
This message was sent by Atlassian JIRA
(v6.2#6252)