Posted to dev@tika.apache.org by Avtar Singh Mehra <as...@gmail.com> on 2017/03/08 18:43:40 UTC

Require guidance from where to start contributing in Apache Tika

Hello Everyone,
I am new to Apache Tika but have plenty of experience with other Apache
software such as Apache Solr, Apache Lucene, and Apache Velocity. I would
like to start contributing to the Apache Tika community. It would be a great
help if someone could guide me on where I should start contributing
to Apache Tika.

Thanks in advance
Avtar

Re: Require guidance from where to start contributing in Apache Tika

Posted by Thamme Gowda <th...@apache.org>.
Hi Avtar,

Welcome to the Tika community.

If you are interested in improving the ObjectRecognition parser, I have a few
suggestions for you!

@Dev let us know if the suggestion below is a good improvement.
@Avtar wait until we get some feedback/votes/ACK, then create a Jira issue
for this and assign it to yourself if you choose to go forward.

*Background:*
When we built ObjectRecognitionParser to do image recognition, there was
little good support for this in Java frameworks. All the popular neural
network toolkits were in C++ or Python. Since nothing ran within the JVM, we
tried several ways to glue them to Tika (CLI, JNI, gRPC, REST).
However, this is slowly changing. Deeplearning4j, the most famous
neural network library for the JVM, now supports importing models that were
pre-trained in Python/C++-based toolkits [5].

*Improvement:*
It would be nice to have an implementation of ObjectRecogniser that
doesn't require any external setup (such as installing native libraries or
starting REST services). Reasons: it is easier to distribute, and it cuts
the IO time.

*Steps:*
1. Refer to the wiki [1] to understand how ObjectRecognition is done in Tika.
The goal here is to dig into the codebase and become familiar with it.

2. Then refer to the example at [2] to learn how image recognition is done in
Deeplearning4j. Here you will be using a pre-trained VGG-16 model [3].

3. Read the guidelines/workflow for contributions in the wiki [6].

4. Then create org.apache.tika.parser.recognition.dl4j.DL4JImageRecogniser,
similar to org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser.
On a side note, this implementation should work with many other image
recognition models. Once we get VGG's model [3] working, we should be able
to switch to Google's InceptionNet model [4] without much effort by
changing the model path (that will be a later goal; we need to wait for the
next release of DL4J to make this model work).
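
The idea in step 4 can be sketched roughly as follows. This is only an
illustration under stated assumptions: ScoringModel, Label and
InJvmRecogniserSketch are hypothetical names, not Tika or DL4J APIs, and the
actual model loading (for example through the DL4J model import described
in [5]) is left out. It only shows that the recogniser depends on "image in,
one score per class out", so switching from one pre-trained model to another
means supplying a different model rather than changing the recogniser.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical result type: a class name with its score.
// Tika's own result type is deliberately not reproduced here.
final class Label {
    final String name;
    final double confidence;

    Label(String name, double confidence) {
        this.name = name;
        this.confidence = confidence;
    }
}

// Hypothetical abstraction over a loaded network that runs inside the JVM.
// A DL4J-backed version would load the model from a path and run it here.
interface ScoringModel {
    double[] score(byte[] image);
}

// Hypothetical in-JVM recogniser: no native libraries, no REST service.
final class InJvmRecogniserSketch {
    private final ScoringModel model;
    private final String[] classNames;
    private final int topN;
    private final double minConfidence;

    InJvmRecogniserSketch(ScoringModel model, String[] classNames,
                          int topN, double minConfidence) {
        this.model = model;
        this.classNames = classNames;
        this.topN = topN;
        this.minConfidence = minConfidence;
    }

    // Returns at most topN labels whose score is at least minConfidence,
    // ordered from highest to lowest score.
    List<Label> recognise(byte[] image) {
        double[] scores = model.score(image);
        if (scores.length != classNames.length) {
            throw new IllegalStateException("model output size does not match class list");
        }
        List<Label> labels = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] >= minConfidence) {
                labels.add(new Label(classNames[i], scores[i]));
            }
        }
        labels.sort(Comparator.comparingDouble((Label l) -> l.confidence).reversed());
        return new ArrayList<>(labels.subList(0, Math.min(topN, labels.size())));
    }
}
```

For example, with a fake model that always returns the scores
{0.05, 0.70, 0.25} for the classes cat, dog and fox, topN = 2 and
minConfidence = 0.10, recognise() would return dog followed by fox.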



[1] https://wiki.apache.org/tika/TikaAndVision
[2] https://deeplearning4j.org/build_vgg_webapp
[3] http://www.robots.ox.ac.uk/~vgg/research/very_deep/
[4] https://github.com/USCDataScience/dl4j-kerasimport-examples/tree/master/dl4j-import-example
[5] https://deeplearning4j.org/model-import-keras
[6] https://wiki.apache.org/tika/UsingGit


Best,
TG

*--*
*Thamme Gowda*
TG | @thammegowda <https://twitter.com/thammegowda>
~Sent via somebody's Webmail server!

On Thu, Mar 9, 2017 at 6:17 AM, Avtar Singh Mehra <as...@gmail.com>
wrote:

> Thank you everyone for the help. I really appreciate it.
> I would like to work on Object Recognition parser, and understand it so as
> to understand the working of the parsers. I am interested in pursuing it as
> my GSoC project for summer.
> I would appreciate it if someone could point me to small improvements i can
> do in it.
>
> Thanks
> Avtar

Re: Require guidance from where to start contributing in Apache Tika

Posted by Avtar Singh Mehra <as...@gmail.com>.
Thank you everyone for the help. I really appreciate it.
I would like to work on Object Recognition parser, and understand it so as
to understand the working of the parsers. I am interested in pursuing it as
my GSoC project for summer.
I would appreciate it if someone could point me to small improvements i can
do in it.

Thanks
Avtar

On 9 March 2017 at 04:47, Nick Burch <ap...@gagravarr.org> wrote:

> On Thu, 9 Mar 2017, Avtar Singh Mehra wrote:
>
>> I am new to Apache Tika but have plenty of experience with other Apache
>> Softwares like Apache Solr, Apache Lucene, Apache Velocity etc. I would
>> like to start contributing to Apache Tika community. It would be great help
>> if someone could guide me regarding from where i should start contributing
>> to Apache Tika.
>>
>
> The first two places I'd suggest looking are
> http://tika.apache.org/contribute.html and
> http://tika.apache.org/1.14/parser_guide.html (Get Tika parsing up and
> running in 5 minutes). Make sure you're able to add a new dummy mime
> type and parser, understand how it works etc. See also
> https://wiki.apache.org/tika/Troubleshooting%20Tika for when you hit
> issues...
>
> Once you've got the hang of that, let us know of any gaps in the
> documentation!
>
> Finally, either pick a JIRA that interests you, or an unsupported format,
> and have a try. Use the contributing guide to guide you on submitting
> patches, and don't be scared to ask for help :)
>
> Nick
>

Re: Require guidance from where to start contributing in Apache Tika

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 9 Mar 2017, Avtar Singh Mehra wrote:
> I am new to Apache Tika but have plenty of experience with other Apache
> Softwares like Apache Solr, Apache Lucene, Apache Velocity etc. I would
> like to start contributing to Apache Tika community. It would be great help
> if someone could guide me regarding from where i should start contributing
> to Apache Tika.

The first two places I'd suggest looking are 
http://tika.apache.org/contribute.html and 
http://tika.apache.org/1.14/parser_guide.html (Get Tika parsing up and 
running in 5 minutes). Make sure you're able to add a new dummy mime 
type and parser, understand how it works etc. See also 
https://wiki.apache.org/tika/Troubleshooting%20Tika for when you hit 
issues...

Once you've got the hang of that, let us know of any gaps in the 
documentation!

Finally, either pick a JIRA that interests you, or an unsupported format, 
and have a try. Use the contributing guide to guide you on submitting 
patches, and don't be scared to ask for help :)

Nick

RE: Require guidance from where to start contributing in Apache Tika

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Pick a parser/area of interest and look for any open tickets.  Or scan JIRA and look for issues that look interesting to you.

If you'd like to contribute a new parser, I could probably come up with a list pretty quickly of parsers we'd like to have. :)

