Posted to dev@tika.apache.org by "Sujen Shah (JIRA)" <ji...@apache.org> on 2015/07/29 04:04:04 UTC

[jira] [Updated] (TIKA-1699) Integrate the GROBID PDF extractor in Tika

     [ https://issues.apache.org/jira/browse/TIKA-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sujen Shah updated TIKA-1699:
-----------------------------
    Description: 
GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents with a particular focus on technical and scientific publications.
It has a Java API that can be used to augment PDF parsing for journal articles and to extract additional metadata about a paper, such as authors, publication, and citations.

It would be nice to have this integrated into Tika. I have tried it locally and will issue a pull request soon.

  was:
GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents with a particular focus on technical and scientific publications.
It has a Java API that can be used to augment PDF parsing for journal articles and to extract additional metadata about a paper, such as authors, publication, and citations.

I have tried integrating it locally and will issue a pull request soon.


> Integrate the GROBID PDF extractor in Tika
> ------------------------------------------
>
>                 Key: TIKA-1699
>                 URL: https://issues.apache.org/jira/browse/TIKA-1699
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Sujen Shah
>              Labels: memex


