Posted to user@tika.apache.org by "tesmai4@gmail.com" <te...@gmail.com> on 2017/06/08 14:44:23 UTC

Grobid with TXT and HTML files

Dear Thamme,


https://grobid.readthedocs.io/en/latest/grobid-04-2015.pdf

The above presentation says that Grobid supports raw text. My input files
are in TXT and HTML formats. Do you have any idea how this can be supported
as raw text?



Regards,




On Wed, May 3, 2017 at 6:16 PM, Thamme Gowda <th...@apache.org> wrote:

> Hello,
>
> There is a nice project called Grobid [1] that does most of what you are
> describing.
> Tika has a Grobid parser built in (it calls Grobid over a REST API); check
> out [2] for details.
>
> I have a project that makes use of Tika with Grobid and NER support. It
> also builds a search index using solr.
> Check out [3] for setup and [4] for parsing and indexing to Solr if you'd
> like to try out my Python project.
> Here I am able to extract title, author names, affiliations, and the whole
> text of articles.
> I did not extract sections within the main body of research articles.  I
> assume there should be a way to configure it in Grobid.
>
> Alternatively, if Grobid can't detect sections, you can try the XHTML
> content handler, which preserves the basic structure of the PDF using <p>,
> <br> and heading tags. So technically it should be possible to write a
> wrapper that breaks Tika's XHTML output into sections.
>
> To get the XHTML output:
>
> # In bash, do `pip install tika` if tika isn't already installed
> import tika
> tika.initVM()
> from tika import parser
>
>
> file_path = "<pdf_dir>/2538.pdf"
> data = parser.from_file(file_path, xmlContent=True)
> print(data['content'])
>
>
>
>
> Best,
> Thamme
>
> [1] http://grobid.readthedocs.io/en/latest/Introduction/
> [2] https://wiki.apache.org/tika/GrobidJournalParser
> [3] https://github.com/USCDataScience/parser-indexer-py/tree/master/parser-server
> [4] https://github.com/USCDataScience/parser-indexer-py/blob/master/docs/parser-index-journals.md
>
> *--*
> *Thamme Gowda*
> TG | @thammegowda <https://twitter.com/thammegowda>
> ~Sent via somebody's Webmail server!
>
> On Wed, May 3, 2017 at 9:34 AM, tesmai4@gmail.com <te...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am working with published research articles using Apache Tika. These
>> articles have distinct sections like abstract, introduction, literature
>> review, methodology, experimental setup, discussion and conclusions. Is
>> there some way to extract document sections with Apache Tika?
>>
>> Regards,
>>
>
>
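Thamme's wrapper idea, splitting Tika's XHTML output on its heading tags,
could be sketched like this using only the Python standard library; the
class name and the sample markup are hypothetical, not code from the thread:

```python
from html.parser import HTMLParser

class SectionSplitter(HTMLParser):
    """Group Tika's XHTML output into sections keyed by heading text."""

    HEADINGS = ("h1", "h2", "h3")

    def __init__(self):
        super().__init__()
        self.sections = {"preamble": []}   # text seen before the first heading
        self.current = "preamble"
        self._in_heading = False
        self._heading_text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = True
            self._heading_text = []

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            # A heading just closed: start collecting text under a new section
            self.current = "".join(self._heading_text).strip()
            self.sections[self.current] = []
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self._heading_text.append(data)
        elif data.strip():
            self.sections[self.current].append(data.strip())

# Toy stand-in for data['content'] from parser.from_file(..., xmlContent=True)
xhtml = ("<html><body><h2>Abstract</h2><p>Short summary.</p>"
         "<h2>Introduction</h2><p>Opening text.</p></body></html>")
splitter = SectionSplitter()
splitter.feed(xhtml)
print(splitter.sections["Abstract"])   # -> ['Short summary.']
```

Real journal PDFs will need more care (headings split across tags, running
headers, and so on), but the same pattern applies to the `data['content']`
string produced by the snippet in Thamme's message above.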

Re: Limit on input PDF file size in Tika?

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 8 Jun 2017, tesmai4@gmail.com wrote:
> Thanks for your reply. I am calling Apache Tika in Java code like this:
>
> public String extractPDFText(String faInputFileName) throws
> IOException,TikaException {
>
>       //Handler for body text of the PDF article
> BodyContentHandler handler = new BodyContentHandler();

Change this to "new BodyContentHandler(-1)" to remove the write limit.
More details in the javadocs:
https://tika.apache.org/1.15/api/org/apache/tika/sax/BodyContentHandler.html#BodyContentHandler-int-

Nick

Re: Limit on input PDF file size in Tika?

Posted by "tesmai4@gmail.com" <te...@gmail.com>.
Thanks for your reply. I am calling Apache Tika in Java code like this:

public String extractPDFText(String faInputFileName)
        throws IOException, TikaException {

    // Handler for the body text of the PDF article
    BodyContentHandler handler = new BodyContentHandler();

    // Metadata of the article
    Metadata metadata = new Metadata();

    // Input file stream
    FileInputStream inputstream = new FileInputStream(new File(faInputFileName));

    // Parse context, used while parsing the InputStream
    ParseContext pcontext = new ParseContext();

    try {
        // Parse the document using Tika's PDF parser. A case statement
        // will be added later for handling other file types.
        PDFParser pdfparser = new PDFParser();

        // Do the parsing by calling the parse method of pdfparser
        pdfparser.parse(inputstream, handler, metadata, pcontext);
    } catch (Exception e) {
        System.out.println("Exception caught: " + e.getMessage());
    }

    // Convert the body handler to a string and return it to the caller
    return handler.toString();
}

Regards,


On Thu, Jun 8, 2017 at 4:29 PM, Nick Burch <ap...@gagravarr.org> wrote:

> On Thu, 8 Jun 2017, tesmai4@gmail.com wrote:
>
>> My tika code is not extracting full body text of larger PDF files.
>>
>> Files more than 1 MB  in size and around 20 pages are partially extracted.
>> Is there any limit on input PDF file  size in tika
>>
>
> How are you calling Apache Tika? Direct java calls to TikaConfig +
> AutoDetectParser? Using the Tika facade class? Using the Tika App on the
> command line? Tika Server? Other?
>
> Nick
>

Re: Limit on input PDF file size in Tika?

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 8 Jun 2017, tesmai4@gmail.com wrote:
> My tika code is not extracting full body text of larger PDF files.
>
> Files more than 1 MB  in size and around 20 pages are partially extracted.
> Is there any limit on input PDF file  size in tika

How are you calling Apache Tika? Direct java calls to TikaConfig + 
AutoDetectParser? Using the Tika facade class? Using the Tika App on the 
command line? Tika Server? Other?

Nick

Limit on input PDF file size in Tika?

Posted by "tesmai4@gmail.com" <te...@gmail.com>.
Dear all,

My Tika code is not extracting the full body text of larger PDF files.

Files more than 1 MB in size and around 20 pages long are only partially
extracted. Is there any limit on the input PDF file size in Tika?

Regards

Re: Grobid with TXT and HTML files

Posted by Thamme Gowda <th...@apache.org>.
Hi,

Thanks for the explanation. I do not know if Grobid can extract from text
and HTML (please look at the documentation).

P.S.
You may also explore regexes for the plain text and XPath for HTML as
alternatives if GROBID doesn't work.
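For the plain-text case, a regex split along those lines might look like
this (the heading list and helper are assumptions, not tested against real
articles):

```python
import re

# Assumed section headings; adjust to the journals you work with.
HEADINGS = ["Abstract", "Introduction", "Conclusions"]

# Match a line that consists of exactly one of the headings.
pattern = re.compile(r"^(%s)\s*$" % "|".join(HEADINGS), re.MULTILINE)

def split_plain_text(text):
    """Split plain text into {heading: body} using heading lines as boundaries."""
    parts = pattern.split(text)
    # parts = [preamble, heading1, body1, heading2, body2, ...]
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}

article = """Abstract
Short summary.

Introduction
Opening text.

Conclusions
Closing text.
"""

print(split_plain_text(article)["Abstract"])   # -> Short summary.
```

For the HTML inputs, the same section boundaries can be located with XPath
(for example lxml's `.xpath('//h2')`) instead of regexes.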

*--*
*Thamme Gowda*
TG | @thammegowda <https://twitter.com/thammegowda>
~Sent via somebody's Webmail server!

On Mon, Jun 12, 2017 at 3:44 AM, tesmai4@gmail.com <te...@gmail.com>
wrote:

> Dear Thamme,
>
> Yes, sometimes we are unable to find a PDF of a published research
> article. The published article, or part of it, is sometimes available as
> HTML. These articles are available to me either in TXT or in HTML format.
> This is the context of my input files being TXT or HTML.
>
> Regards,
>
>
> On Sat, Jun 10, 2017 at 10:18 AM, Thamme Gowda <th...@apache.org>
> wrote:
>
>> Hi,
>>
>> I have used the Grobid parser with PDF files only. I have no idea what
>> you are trying to extract from raw text or HTML.
>>
>> Since you said:
>> 1. "I am working with published research articles using Apache Tika."
>> 2. "My input files are in TXT and HTML formats",
>>
>> Are you saying your research articles are in .txt and .html files? And
>> you are trying to extract sections such as abstract, introduction,
>> literature review, etc. from these files?
>>
>> *--*
>> *Thamme Gowda*
>> TG | @thammegowda <https://twitter.com/thammegowda>
>> ~Sent via somebody's Webmail server!
>>