Posted to dev@tika.apache.org by Sergey Beryozkin <sb...@gmail.com> on 2019/08/15 14:15:15 UTC

Re: Quarkus integration

Hi,
The initial documentation is here:
https://quarkus.io/guides/tika-guide

Lots more to come over time. A few users have already tried it, and I hope
to see more feedback from them soon.
Sergey

On Fri, May 10, 2019 at 6:04 PM Sergey Beryozkin <sb...@gmail.com>
wrote:

> I've managed to get the PDFParser running in native mode, but I had to
> delay the initialization of org.apache.pdfbox.pdmodel.font.PDType1Font.
> This class has static PDType1Font instances, and one of them leads to
> org.apache.fontbox.ttf.RAFDataStream, which opens a file handle. Graal
> cannot convert that to native code at build time, so the initialization
> of PDType1Font has to be delayed until run time.
>
> If we start from the PDF parser, the call path to RAFDataStream starts
> from:
>
>
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
>      at org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.<init>(PDAcroForm.java:93)
>      at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
>      at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
>
> I guess I may need to create a PR for PDFBox where RAFDataStream opens a
> stream lazily, with a check like ensureOpen() being added to its read
> methods...
>
> Sergey
>
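
For illustration, the lazy-open idea with an ensureOpen() check could look
roughly like the sketch below. This is only a minimal sketch under stated
assumptions: LazyFileStream and LazyOpenDemo are hypothetical names, and this
is not the actual FontBox RAFDataStream code or API. The point is that the
constructor only records the file, so creating an instance from a static
initializer does not open a file handle; the handle is opened on the first
read.

```java
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;

// Hypothetical class for illustration; not the FontBox RAFDataStream.
final class LazyFileStream implements Closeable {
    private final File file;
    private RandomAccessFile raf; // stays null until the first read

    LazyFileStream(File file) {
        // Only remember the file. No file handle is opened here, so an
        // instance created in a static initializer holds no OS resource.
        this.file = file;
    }

    // Opens the underlying file the first time it is needed.
    private void ensureOpen() throws IOException {
        if (raf == null) {
            raf = new RandomAccessFile(file, "r");
        }
    }

    boolean isOpen() {
        return raf != null;
    }

    int read() throws IOException {
        ensureOpen();
        return raf.read();
    }

    int read(byte[] buffer, int offset, int length) throws IOException {
        ensureOpen();
        return raf.read(buffer, offset, length);
    }

    @Override
    public void close() throws IOException {
        if (raf != null) {
            raf.close();
            raf = null;
        }
    }
}

class LazyOpenDemo {
    public static void main(String[] args) throws IOException {
        // Create a small temporary file to read from.
        File tmp = File.createTempFile("lazy-open", ".bin");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), new byte[] {42, 7});

        try (LazyFileStream in = new LazyFileStream(tmp)) {
            System.out.println("open after construction: " + in.isOpen());
            System.out.println("first byte: " + in.read());
            System.out.println("open after read: " + in.isOpen());
        }
    }
}
```

The sketch is not thread-safe, and a read after close() would simply reopen
the file; a real patch would need to decide how to handle both cases.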
> On Fri, May 3, 2019 at 1:22 PM Sergey Beryozkin <sb...@gmail.com>
> wrote:
>
>> Yes, please add 'sergeyb'; I've just assigned myself a CXF issue as
>> 'sergeyb'. Sorry about these multiple ids; I'll try to keep using a
>> single one.
>>
>> Thanks, Sergey
>>
>>
>>
>> On Fri, May 3, 2019 at 12:13 PM Tim Allison <ta...@apache.org> wrote:
>>
>>> I can add 'sergeyb' if you'd prefer!
>>>
>>> On Fri, May 3, 2019 at 5:43 AM Sergey Beryozkin <sb...@gmail.com>
>>> wrote:
>>> >
>>> > Though I might need to settle on 'sergeyb' eventually, since it is my
>>> > Apache committer id.
>>> > Thanks...
>>> >
>>> > On Fri, May 3, 2019 at 10:29 AM Sergey Beryozkin <sberyozkin@gmail.com>
>>> > wrote:
>>> >
>>> > > Oh, I forgot I had a 'sergey_beryozkin' id as well. This is not good;
>>> > > it shows how long ago I last contributed :-) (I did try
>>> > > 'sergey.beryozkin' though).
>>> > >
>>> > > Thanks for checking it, I've just assigned this issue to myself.
>>> > > Cheers, Sergey
>>> > >
>>> > >
>>> > > On Thu, May 2, 2019 at 6:08 PM Sergey Beryozkin <sberyozkin@gmail.com>
>>> > > wrote:
>>> > >
>>> > >> Hi Tim
>>> > >>
>>> > >> I can't assign
>>> > >> https://issues.apache.org/jira/browse/TIKA-2862
>>> > >>
>>> > >> to myself, though I used to be able to. I know I had some time away
>>> > >> from Tika, but I'm keen to return with a few contributions :-)
>>> > >> Please update my record so that I can assign issues again.
>>> > >> Cheers, Sergey
>>> > >>
>>> > >> On Tue, Apr 30, 2019 at 6:22 PM Sergey Beryozkin <sberyozkin@gmail.com>
>>> > >> wrote:
>>> > >>
>>> > >>> Hi Tim, All
>>> > >>>
>>> > >>> I've started working on integrating Tika with Quarkus [1]. The main
>>> > >>> idea is to be able to use Tika in native image mode.
>>> > >>> I'll most likely start creating PRs soon to get the native image
>>> > >>> issues resolved; these are related to some libraries statically
>>> > >>> initializing FileDescriptors, etc.
>>> > >>>
>>> > >>> Thanks, Sergey
>>> > >>>
>>> > >>> [1]
>>> > >>> https://github.com/sberyozkin/quarkus/tree/tika_extension/extensions/tika
>>> > >>> [2]
>>> > >>> https://github.com/sberyozkin/quarkus-quickstarts/tree/tika/getting-started-tika
>>> > >>>
>>> > >>>
>>>
>>

Re: Quarkus integration

Posted by Sergey Beryozkin <sb...@gmail.com>.
If someone from the large Tika team could give the extension a try whenever
time allows, that would be super; it would help me improve it. If
you do decide to try it, please post your feedback to
https://groups.google.com/forum/#!forum/quarkus-dev
or, if it fails miserably for your documents, maybe here first :-)
Cheers, Sergey
