Posted to dev@tika.apache.org by Laurence Vanhelsuwe <in...@softwarepearls.com> on 2020/09/26 11:53:54 UTC

Tika lib is huge.. why?

I found Tika during a quest to extract PDF metadata in Java. Did I screw up the JAR download, or is Tika really 70 MB?

Kind regards,

  Laurence


Re: Tika lib is huge.. why?

Posted by Tim Allison <ta...@apache.org>.
Please take a look at the main branch for what we’re planning to do in Tika
2.0.0. The goal is to modularize parsers at a finer granularity.

We need all the testing/feedback we can get for a successful launch.

On Sat, Sep 26, 2020 at 2:19 PM Laurence Vanhelsuwe <in...@softwarepearls.com>
wrote:

> PDFBox did the trick for me. Thanks for the golden tip. :-)

Re: Tika lib is huge.. why?

Posted by Laurence Vanhelsuwe <in...@softwarepearls.com>.
PDFBox did the trick for me. Thanks for the golden tip. :-)

> On 26 Sep 2020, at 19:06, Dave Fisher <wa...@comcast.net> wrote:
> 
> IIRC - if you know you only want PDF extraction then take a look at Apache PDFBox. PDFBox.apache.org


Re: Tika lib is huge.. why?

Posted by Dave Fisher <wa...@comcast.net>.
IIRC, if you know you only want PDF extraction, take a look at Apache PDFBox: PDFBox.apache.org

> On Sep 26, 2020, at 10:00 AM, Laurence Vanhelsuwe <in...@softwarepearls.com> wrote:
> 
> Thanks for the explanation.
> 
> I understand the approach, but in my particular use case I cannot reasonably justify inflating my application size from 7 MB to 77 MB just to add a feature that amounts to less than 1% of its overall functionality.
> 
> I guess there’s no way to surgically extract just the PDF metadata parsing functionality from Tika?
> 
> Laurence


Re: Tika lib is huge.. why?

Posted by Laurence Vanhelsuwe <in...@softwarepearls.com>.
Thanks for the explanation.

I understand the approach, but in my particular use case I cannot reasonably justify inflating my application size from 7 MB to 77 MB just to add a feature that amounts to less than 1% of its overall functionality.

I guess there’s no way to surgically extract just the PDF metadata parsing functionality from Tika?

 Laurence

> On 26 Sep 2020, at 18:04, Keith Bennett <ke...@gmail.com> wrote:
> 
> Tika coordinates the use of many external-to-Tika parser libraries, each
> with their own dependencies, for parsing many types of files. These parser
> libraries are bundled into the tika-app jar file for your convenience. I
> believe it's these libraries that make up the bulk of the download. For
> example, if you unzip the jar file and inspect the contents, you can see
> that just one of these parsers, poi, consists of 24 MB:
> 
> % cd org/apache/poi
> 
> % du -sh .
> 24M .
> 
> - Keith


Re: Tika lib is huge.. why?

Posted by Keith Bennett <ke...@gmail.com>.
Tika coordinates the use of many external-to-Tika parser libraries, each
with its own dependencies, for parsing many types of files. These parser
libraries are bundled into the tika-app jar file for your convenience. I
believe it's these libraries that make up the bulk of the download. For
example, if you unzip the jar file and inspect the contents, you can see
that just one of these parsers, poi, takes up 24 MB:

% cd org/apache/poi

% du -sh .
24M .

- Keith
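
For anyone who would rather measure this without unpacking the jar, here is a
minimal Java sketch of the same check. It reads the archive's entry table with
java.util.zip.ZipFile and sums the uncompressed entry sizes per package prefix.
The class name JarSizeByPackage, the helper names, and the grouping depth of 3
are assumptions made for illustration; none of this comes from Tika itself, and
the totals will only roughly match du, which reports disk usage rather than
byte counts.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of Tika: reports which packages dominate a jar.
public class JarSizeByPackage {

    /** Sums uncompressed entry sizes, grouped by the first `depth` directory segments. */
    public static Map<String, Long> sizeByPrefix(String jarPath, int depth) throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (ZipFile zip = new ZipFile(jarPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue; // directories carry no payload
                }
                long size = entry.getSize(); // -1 when the size is not recorded
                if (size < 0) {
                    continue;
                }
                totals.merge(prefix(entry.getName(), depth), size, Long::sum);
            }
        }
        return totals;
    }

    /** Returns up to `depth` leading directory segments of an entry name. */
    public static String prefix(String entryName, int depth) {
        String[] parts = entryName.split("/");
        int n = Math.min(depth, parts.length - 1); // last segment is the file name
        if (n == 0) {
            return "(root)";
        }
        return String.join("/", Arrays.copyOfRange(parts, 0, n));
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java JarSizeByPackage <path-to-jar>");
            return;
        }
        // Print the ten largest prefixes, biggest first.
        sizeByPrefix(args[0], 3).entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(10)
                .forEach(e -> System.out.printf("%8.1f MB  %s%n", e.getValue() / 1e6, e.getKey()));
    }
}
```

On Java 11 or later this can be run directly as
"java JarSizeByPackage.java <path-to-jar>", passing the path of the tika-app jar
you downloaded. With a depth of 3, the POI classes would be grouped under
org/apache/poi, the same directory inspected with du above.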

On Sat, Sep 26, 2020 at 7:54 AM Laurence Vanhelsuwe <in...@softwarepearls.com>
wrote:

> I found Tika during a quest to extract PDF metadata in Java. Did I screw
> up the JAR download, or is Tika really 70 MB?
>
> Kind regards,
>
>   Laurence
>
>