Posted to dev@tika.apache.org by "Craig Stires (Created) (JIRA)" <ji...@apache.org> on 2012/02/02 16:25:53 UTC
[jira] [Created] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds
Tika add parsing support for ANPA-1312 news wire feeds
------------------------------------------------------
Key: TIKA-858
URL: https://issues.apache.org/jira/browse/TIKA-858
Project: Tika
Issue Type: New Feature
Components: mime, parser
Affects Versions: 0.10
Reporter: Craig Stires
This submission adds support for ANPA-1312 news wire feeds.
This is the format used by AP, AFP, NYT, and Reuters in their daily news wire broadcasts.
This was a pretty significant development effort, so I am happy to share it back as a thank-you to the Tika community.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds
Posted by "Craig Stires (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Craig Stires updated TIKA-858:
------------------------------
Attachment: tika-mimetypes_ANPA.patch
This patch adds file recognition for ANPA file types. It applies against apache-tika-0.10/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
[jira] [Issue Comment Edited] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds
Posted by "Craig Stires (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198874#comment-13198874 ]
Craig Stires edited comment on TIKA-858 at 2/2/12 3:31 PM:
-----------------------------------------------------------
This is the change to the parser module that registers the ANPA parser so that it is recognized.
This patch applies against apache-tika-0.10/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
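For readers unfamiliar with provider files under META-INF/services: they follow the java.util.ServiceLoader convention of one fully qualified class name per line, with '#' starting a comment. The sketch below is only an illustration of that convention, not the patch itself. The class name ServiceFileSketch and the file excerpt are hypothetical; the parser class name is inferred from the IptcAnpaParser.java path given elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: reads provider class names
// from the text of a META-INF/services provider-configuration file.
final class ServiceFileSketch {

    /** Returns the class names listed in the file, ignoring comments and blank lines. */
    static List<String> providerNames(String fileContent) {
        List<String> names = new ArrayList<>();
        for (String line : fileContent.split("\\R")) {
            int hash = line.indexOf('#');      // '#' starts a comment
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Assumed excerpt of the provider file after the patch is applied.
        String content = String.join("\n",
                "# hypothetical excerpt",
                "org.apache.tika.parser.html.HtmlParser",
                "org.apache.tika.parser.iptc.IptcAnpaParser");
        // Prints true when the ANPA parser is registered.
        System.out.println(providerNames(content)
                .contains("org.apache.tika.parser.iptc.IptcAnpaParser"));
    }
}
```

A check like this only confirms that the line is present in the file; whether the class actually loads still depends on it being on the classpath.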
[jira] [Updated] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds
Posted by "Craig Stires (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Craig Stires updated TIKA-858:
------------------------------
Attachment: org.apache.tika.parser.Parser_ANPA.patch
This is the change to the parser module that registers the ANPA parser so that it is recognized.
[jira] [Commented] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds
Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207342#comment-13207342 ]
Nick Burch commented on TIKA-858:
---------------------------------
Are you able to supply a sample file, and a unit test that uses it?
(Without a unit test, it'll be hard to verify that it works properly and doesn't accidentally get broken in the future.)
[jira] [Commented] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds
Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207345#comment-13207345 ]
Nick Burch commented on TIKA-858:
---------------------------------
Additionally, what reference did you find for the chosen mimetype for these files? (I couldn't spot one on a quick check, that's all.)
[jira] [Updated] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds
Posted by "Craig Stires (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Craig Stires updated TIKA-858:
------------------------------
Attachment: 7901V5.pdf
Attaching the specification docs for the ANPA formats. [7901V5.pdf]
This discusses the start of header used for mime-type recognition, as well as the spec for the rest of the document structure.
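As a rough illustration of recognition by start of header, here is a minimal sketch. It assumes a message opens with the ASCII SOH control character (0x01), possibly preceded by SYN (0x16) padding bytes; that layout is an assumption made for the example, and the attached spec is the authority on the real rules. The class and method names are hypothetical and are not taken from the patch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the detection logic from the patch.
final class AnpaHeaderSketch {
    private static final byte SYN = 0x16; // ASCII synchronous idle (assumed optional padding)
    private static final byte SOH = 0x01; // ASCII start of header

    /** Returns true if the buffer starts with SOH, optionally preceded by SYN bytes. */
    static boolean looksLikeWireMessage(byte[] head) {
        int i = 0;
        while (i < head.length && head[i] == SYN) {
            i++; // skip any leading SYN padding
        }
        return i < head.length && head[i] == SOH;
    }

    public static void main(String[] args) {
        byte[] sample = {SYN, SYN, SOH, 'a', '0', '0', '0', '1'};
        System.out.println(looksLikeWireMessage(sample)); // prints true
        System.out.println(looksLikeWireMessage(
                "plain text".getBytes(StandardCharsets.US_ASCII))); // prints false
    }
}
```

A prefix check this short will also match unrelated binary data that happens to begin with the same bytes, so a real detector would likely need to look further into the header.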
[jira] [Updated] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds
Posted by "Craig Stires (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Craig Stires updated TIKA-858:
------------------------------
Attachment: IptcAnpaParser.java
This is the file that parses and categorizes the ANPA wire feeds.
This gets added to apache-tika-0.10/tika-parsers/src/main/java/org/apache/tika/parser/iptc/IptcAnpaParser.java
[jira] [Commented] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds
Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264357#comment-13264357 ]
Nick Burch commented on TIKA-858:
---------------------------------
Thanks for the patch, I've applied it in r1331794.
However, we do still need a unit test for this. Are you able to get a small, sample ANPA file for us to use in a unit test?