Posted to dev@nutch.apache.org by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2010/02/12 07:55:27 UTC

[jira] Created: (NUTCH-789) Improvements to Tika parser

Improvements to Tika parser
---------------------------

                 Key: NUTCH-789
                 URL: https://issues.apache.org/jira/browse/NUTCH-789
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
         Environment: reported by Sami, in NUTCH-766
            Reporter: Chris A. Mattmann
            Assignee: Chris A. Mattmann
            Priority: Minor
             Fix For: 1.1


In NUTCH-766, Sami reported that he has made a few improvements to the Tika parser. We'll track that progress here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-789) Improvements to Tika parser

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-789:
--------------------------------

      Component/s:     (was: fetcher)
                   parser
    Fix Version/s:     (was: 1.1)

I have created a separate issue for the upgrade to Tika 0.7 and moved this one out of 1.1.



[jira] Commented: (NUTCH-789) Improvements to Tika parser

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833714#action_12833714 ] 

Sami Siren commented on NUTCH-789:
----------------------------------

It would be really useful to include these improvements, since that way almost all parsers (except Flash?) would be covered.



[jira] Commented: (NUTCH-789) Improvements to Tika parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853212#action_12853212 ] 

Chris A. Mattmann commented on NUTCH-789:
-----------------------------------------

Hey Julien -- okey dok, Tika 0.7 has been released. Feel free to upgrade, and close this one out...after that, I'll cut the Nutch 1.1 RC.

Thanks!

Cheers,
Chris




[jira] Updated: (NUTCH-789) Improvements to Tika parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-789:
------------------------------------

    Attachment: NutchTikaConfig.java
                TikaParser.java

- updates contributed by Sami. I'll generate a diff and then re-attach.



[jira] Commented: (NUTCH-789) Improvements to Tika parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852048#action_12852048 ] 

Chris A. Mattmann commented on NUTCH-789:
-----------------------------------------

Folks, I'm going to put together an RC for Tika 0.7 and take care of JIRA now. Once I do that, we can try and close out this issue for 1.1. I should be able to do this before the 48 hr deadline I threw up for Nutch 1.1...



[jira] Commented: (NUTCH-789) Improvements to Tika parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853285#action_12853285 ] 

Chris A. Mattmann commented on NUTCH-789:
-----------------------------------------

Hey Julien, Tika 0.7 is available from Maven central:

http://repo1.maven.org/maven2/org/apache/tika/tika-parsers/

Cheers,
Chris




[jira] Commented: (NUTCH-789) Improvements to Tika parser

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851331#action_12851331 ] 

Andrzej Bialecki  commented on NUTCH-789:
-----------------------------------------

There are no diffs, so it's difficult to figure out what has changed. I think Tika will soon release v. 0.7, which may also affect this patch if we decide to upgrade before our release. I asked the Tika guys about their release; let's wait a couple more days.



[jira] Commented: (NUTCH-789) Improvements to Tika parser

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851316#action_12851316 ] 

Julien Nioche commented on NUTCH-789:
-------------------------------------

Shall we postpone the work on this issue to after 1.1?



[jira] Commented: (NUTCH-789) Improvements to Tika parser

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853251#action_12853251 ] 

Julien Nioche commented on NUTCH-789:
-------------------------------------

I will upgrade as soon as 0.7 is available from http://repo1.maven.org/maven2/org/apache/tika/, which is not the case yet.
I will leave this issue open but remove the 1.1 fix version.
