Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/06/27 20:45:04 UTC

[jira] [Comment Edited] (TIKA-1601) Integrate Jackcess to handle MSAccess files

    [ https://issues.apache.org/jira/browse/TIKA-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603946#comment-14603946 ] 

Tim Allison edited comment on TIKA-1601 at 6/27/15 6:44 PM:
------------------------------------------------------------

Not anywhere near committing, but this is a rough start.

Some TODOs:
* -Figure out how to get non-ASCII text out correctly-
* Figure out how to grab attachments from the accdb file
* Figure out if there's a flag for cells containing HTML-marked-up text so that we can strip the markup [0]
* Figure out if there's a way to prevent Jackcess from trying to open linked files [0]
* Add unit tests :)

I used [~centic]'s code [1] to pull ~3k mdb files from CommonCrawl for testing.

[0]: https://sourceforge.net/p/jackcess/discussion/456474/thread/038878e6/
[1]: https://github.com/centic9/CommonCrawlDocumentDownload



was (Author: tallison@mitre.org):
Not anywhere near committing, but this is a rough start.

Some TODOs:
* Figure out how to get non-ascii text out correctly
* Figure out how to grab attachments from the accdb file
* Add unit tests :)


> Integrate Jackcess to handle MSAccess files
> -------------------------------------------
>
>                 Key: TIKA-1601
>                 URL: https://issues.apache.org/jira/browse/TIKA-1601
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>         Attachments: jackcess_nocommit_v1.patch, testAccess2.zip
>
>
> Recently, James Ahlborn, the current maintainer of [Jackcess|http://jackcess.sourceforge.net/], kindly agreed to relicense Jackcess under Apache 2.0.  [~boneill], the CTO at [Health Market Science, a LexisNexis® Company|https://www.healthmarketscience.com/], also agreed with this relicensing and led the charge to obtain all necessary corporate approval to deliver a [CCLA|https://www.apache.org/licenses/cla-corporate.txt] for Jackcess to Apache.  As anyone who has tried to get corporate approval for anything knows, this can require no small amount of effort.
> If I may speak on behalf of Tika and the larger Apache community, I offer a sincere thanks to James, Brian and the other developers and contributors to Jackcess!!!
> Once the licensing info has been changed in Jackcess and the new release is available in Maven, we can integrate Jackcess into Tika and add the capability to process MSAccess files.
> As a side note, I reached out to the developers and contributors to determine if there were any objections.  I couldn't find addresses for everyone, and not everyone replied, but those who did offered their support for this move.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)