Posted to dev@tika.apache.org by "samraj (JIRA)" <ji...@apache.org> on 2011/05/18 18:05:47 UTC

[jira] [Created] (TIKA-663) JSP files data extraction failed

JSP files data extraction failed
--------------------------------

                 Key: TIKA-663
                 URL: https://issues.apache.org/jira/browse/TIKA-663
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.9
         Environment: Windows, Java 6
            Reporter: samraj


We have been using Tika for text extraction. In 0.8, the contents of JSP files were extracted correctly, but in 0.9 the same files are not extracted correctly. Please provide a solution.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (TIKA-663) JSP files data extraction failed

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035525#comment-13035525 ] 

Ken Krugler commented on TIKA-663:
----------------------------------

Please include an example JSP, what was extracted in 0.8, what you now get with 0.9, and a list of the problems.

Thanks!

[jira] [Updated] (TIKA-663) JSP files data extraction failed

Posted by "samraj (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

samraj updated TIKA-663:
------------------------

    Attachment: File_3.jsp
                File_2.jsp
                File_1.jsp

I have attached some of the JSP files.

Only the title is extracted; the other lines are skipped by the parser.

[jira] [Commented] (TIKA-663) JSP files data extraction failed

Posted by "Dave Meikle (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149992#comment-13149992 ] 

Dave Meikle commented on TIKA-663:
----------------------------------

Hi,

I'm not sure how you are using Tika, but I would have expected you to be hit by this from Tika 0.5 onward (the change to TagSoup for HTML parsing): the TagSoup parser does not pick up the <% and %> tags within the JSP, so that content does not appear in the output.

Tika 0.8 also used TagSoup, so I would have thought you would see the same behaviour in that version?

I suspect we will want to add a new mime-type entry for JSP files that passes them to the plain text parser, since the existing glob mapping is being beaten by the magic mapping for HTML in these files.  Something like this should do the trick[1]:

<mime-info>
...
  <mime-type type="application/x-httpd-jsp">
    <sub-class-of type="text/plain"/>
    <magic priority="50">
      <match value="&lt;%@" type="string" offset="0"/>
    </magic>
    <glob pattern="*.jsp"/>
  </mime-type>
  ...
</mime-info>

That will remove some of the other metadata that the {{HtmlParser}} extracts (title, etc.) for anyone else who has been using Tika to parse JSP files before 0.5, but it will give them the content and will be correct for the original intent, based on the glob in the existing mime-types.xml.

Does anyone have any objections to this?  If not, I will make the change, as I would expect Tika to treat a JSP file as text so that the script contents are extracted as well.

Cheers,
Dave

[1] You can place the above XML in custom-mimetypes.xml within the package org.apache.tika.mime on your classpath to try this out.
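
To make the intent of the magic and glob entries above concrete, here is a minimal, JDK-only sketch of the check they describe: does the content start with "<%@" at offset 0, or does the file name end in ".jsp"? This is an illustration only, not Tika's detector. The class name JspDetectSketch, its method names, the fallback type, and the simple "magic first, then glob" ordering are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: a simplified "magic at offset 0, else glob" check. Not Tika code. */
public class JspDetectSketch {
    // Mirrors <match value="&lt;%@" type="string" offset="0"/> from the proposed entry.
    private static final byte[] JSP_MAGIC = "<%@".getBytes(StandardCharsets.US_ASCII);

    /** Returns true when the leading bytes of the content equal the magic prefix. */
    static boolean matchesMagic(byte[] head) {
        if (head.length < JSP_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < JSP_MAGIC.length; i++) {
            if (head[i] != JSP_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns true when the file name matches the *.jsp glob (case-sensitive here). */
    static boolean matchesGlob(String fileName) {
        return fileName.endsWith(".jsp");
    }

    /** Hypothetical helper: picks a type from the content prefix first, then the name. */
    static String detect(String fileName, byte[] head) {
        if (matchesMagic(head) || matchesGlob(fileName)) {
            return "application/x-httpd-jsp";
        }
        return "application/octet-stream"; // assumed fallback for this sketch
    }

    public static void main(String[] args) {
        byte[] head = "<%@ page language=\"java\" %>\n<html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect("index.jsp", head));
    }
}
```

A JSP that begins with plain HTML would fail the prefix check and be caught only by its file name in this sketch, which is why the glob entry is kept alongside the magic entry.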


                
[jira] [Commented] (TIKA-663) JSP files data extraction failed

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150306#comment-13150306 ] 

Nick Burch commented on TIKA-663:
---------------------------------

The mimetype entry looks good to me, so I've added it (with an additional match for <%-- which many of my files use) in r1202089.

I'll have to defer to someone else on the parsing side though...
                
[jira] [Commented] (TIKA-663) JSP files data extraction failed

Posted by "Dave Meikle (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150324#comment-13150324 ] 

Dave Meikle commented on TIKA-663:
----------------------------------

Thanks Nick.  I was going to add it last night but forgot my SVN password (reset now).

Yes, I would be interested in others' views, as this raises the same question for other similar file types that mix HTML and scripting tags, where the script content would also be missed if the file is picked up by the HtmlParser.

                