Posted to dev@tika.apache.org by "Craig Stires (Created) (JIRA)" <ji...@apache.org> on 2012/02/02 16:25:53 UTC

[jira] [Created] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds

Tika add parsing support for ANPA-1312 news wire feeds
------------------------------------------------------

                 Key: TIKA-858
                 URL: https://issues.apache.org/jira/browse/TIKA-858
             Project: Tika
          Issue Type: New Feature
          Components: mime, parser
    Affects Versions: 0.10
            Reporter: Craig Stires


This submission adds support for ANPA-1312 news wire feeds.

This is the format used by AP, AFP, NYT, and Reuters in their daily news wire broadcasts.

This was a pretty significant development effort, so I am happy to share it back as a thank-you to the Tika community.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds

Posted by "Craig Stires (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Stires updated TIKA-858:
------------------------------

    Attachment: tika-mimetypes_ANPA.patch

This patch adds file-type recognition for ANPA files. It applies to apache-tika-0.10/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
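
For readers unfamiliar with magic-based recognition, here is a minimal sketch of the idea in plain Java. It is not the content of tika-mimetypes_ANPA.patch: the class name, method names, and the magic bytes (SYN SYN SOH, i.e. 0x16 0x16 0x01, at offset 0) are assumptions made for illustration. The actual start-of-header sequence should be taken from the specification attached to this issue.

```java
// Hypothetical sketch only; not the code or the values from the attached patch.
final class AnpaMagicSketch {

    // ASSUMPTION: an ANPA-1312 message starts with SYN SYN SOH at offset 0.
    static final byte[] ASSUMED_MAGIC = {0x16, 0x16, 0x01};

    /** Returns true if data begins with every byte of magic, in order. */
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false; // too short to contain the magic prefix
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Checks the assumed ANPA prefix against the first bytes of a file. */
    static boolean looksLikeAnpa(byte[] data) {
        return startsWith(data, ASSUMED_MAGIC);
    }

    public static void main(String[] args) {
        byte[] wire = {0x16, 0x16, 0x01, 0x61, 0x30, 0x30, 0x31};
        byte[] plain = {0x48, 0x65, 0x6C, 0x6C, 0x6F}; // "Hello"
        System.out.println(looksLikeAnpa(wire));  // prints true
        System.out.println(looksLikeAnpa(plain)); // prints false
    }
}
```

In Tika itself this kind of check is declared as a magic match entry in tika-mimetypes.xml rather than written by hand; the sketch only shows what such an entry tests.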
                

        

[jira] [Issue Comment Edited] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds

Posted by "Craig Stires (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198874#comment-13198874 ] 

Craig Stires edited comment on TIKA-858 at 2/2/12 3:31 PM:
-----------------------------------------------------------

This is the change to the parser module that makes it recognize the ANPA parser.
This patch applies to apache-tika-0.10/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
                
      was (Author: craig_s):
    This is the change to the parser module, which recognizes the ANPA parser
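
As background, files under META-INF/services follow Java's provider-configuration convention: one fully qualified class name per line, with '#' starting a comment. The sketch below parses such a file's text in plain Java. The class and method names are hypothetical, and the sample entry is only inferred from the IptcAnpaParser.java path given elsewhere in this thread; the real patch contents are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; it illustrates the services-file format,
// not how Tika loads its parsers internally.
final class ServicesFileSketch {

    /** Extracts provider class names: one per line, '#' begins a comment. */
    static List<String> providerNames(String fileText) {
        List<String> names = new ArrayList<>();
        for (String line : fileText.split("\\R")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop the trailing comment
            }
            line = line.trim();
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // ASSUMPTION: the patch adds a line like the one below.
        String text = "# other parser entries omitted\n"
                + "org.apache.tika.parser.iptc.IptcAnpaParser\n";
        System.out.println(providerNames(text));
    }
}
```

Without such an entry, a parser class on the classpath would typically not be picked up by service-based discovery, which is presumably why the patch is needed alongside the parser source.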
                  

        

[jira] [Updated] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds

Posted by "Craig Stires (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Stires updated TIKA-858:
------------------------------

    Attachment: org.apache.tika.parser.Parser_ANPA.patch

This is the change to the parser module that makes it recognize the ANPA parser.
                

        

[jira] [Commented] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207342#comment-13207342 ] 

Nick Burch commented on TIKA-858:
---------------------------------

Are you able to supply a sample file, and a unit test that uses it?

(Without a unit test, it'll be hard to verify that it works properly, and doesn't accidentally get broken in the future)
                

        

[jira] [Commented] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207345#comment-13207345 ] 

Nick Burch commented on TIKA-858:
---------------------------------

Additionally, what reference did you find for the chosen mimetype for these files? (I couldn't spot one on a quick check.)
                

        

[jira] [Updated] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds

Posted by "Craig Stires (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Stires updated TIKA-858:
------------------------------

    Attachment: 7901V5.pdf

Attaching the specification docs for the ANPA formats. [7901V5.pdf]
It covers the start-of-header used for mime-type recognition, as well as the spec for the rest of the document structure.
                

        

[jira] [Updated] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds

Posted by "Craig Stires (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Stires updated TIKA-858:
------------------------------

    Attachment: IptcAnpaParser.java

This is the file that parses and categorizes the ANPA wire feeds.
It gets added at apache-tika-0.10/tika-parsers/src/main/java/org/apache/tika/parser/iptc/IptcAnpaParser.java
                

        

[jira] [Commented] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264357#comment-13264357 ] 

Nick Burch commented on TIKA-858:
---------------------------------

Thanks for the patch, I've applied it in r1331794.

However, we do still need a unit test for this. Are you able to get a small, sample ANPA file for us to use in a unit test?
                