Posted to dev@tika.apache.org by Yves Zoundi <yv...@gmail.com> on 2008/05/19 14:05:30 UTC

OSGI bundle for Tika

Hi everybody,

It would be nice to create sub-projects from the main Apache Tika Maven
project. The mime detection part is pretty useful and its code could
live in a separate project. That would allow people to use it without
the rest of Tika's code.

I was looking for a mime detection solution. I looked at JMimeInfo,
jmimemagic and mime-util. After a few tests, I chose to use Apache
Tika's code.

I removed a few classes from the source code and created a jar with the
mime detection code. I needed to use Tika in an OSGi environment and it
was a bit painful to use Tika out of the box (that is, without first
embedding it in an OSGi bundle that exports the Tika packages).

I had to create a manifest and, as Tika's code is not huge, I was able to
export the packages quickly. I needed to import the javax.xml.parsers,
SAX and DOM packages, as Tika uses them to load the mimetypes
configuration file.
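
For readers who want to see roughly what such a manifest involves, here
is a minimal sketch using the JDK's java.util.jar.Manifest class. The
bundle symbolic name, the version and the exported package name
(org.apache.tika.mime) are assumptions made for illustration; they are
not taken from the thread or from a Tika release. Only the three
imported package families (javax.xml.parsers, SAX, DOM) come from the
message above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

class BundleManifestSketch {

    // Builds a minimal OSGi bundle manifest in memory.
    // Symbolic name, version and exported package are hypothetical values.
    static Manifest buildManifest() {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        // Set Manifest-Version first; the JDK writer expects it in the main section.
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        main.putValue("Bundle-ManifestVersion", "2");
        main.putValue("Bundle-SymbolicName", "org.example.tika.mime"); // assumption
        main.putValue("Bundle-Version", "0.1.0");                      // assumption
        // Package made visible to other bundles (assumed package name).
        main.putValue("Export-Package", "org.apache.tika.mime");
        // XML packages needed to load the mimetypes configuration file.
        main.putValue("Import-Package", "javax.xml.parsers,org.xml.sax,org.w3c.dom");
        return manifest;
    }

    // Renders the manifest in the text form that goes into META-INF/MANIFEST.MF.
    static String render(Manifest manifest) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        manifest.write(out);
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        System.out.print(render(buildManifest()));
    }
}
```

In practice the same headers can be written by hand into
META-INF/MANIFEST.MF or generated by the build; the sketch only shows
which headers carry the export and import information.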

The thing I didn't see in the mime detection code was a serializer to
save the mimetypes. 

In a typical application, people usually want to:
- Load a mime type configuration file from some location
- Add and remove mime types
- Add file extension patterns to existing mime types
- Store the mime types back to that location

So my questions are:
- If I load the mimetypes from a file and add some mimetype entries at
runtime, how can I save the file back without doing it manually with
DOM, JDOM or dom4j?
- Would it be possible to create an OSGi bundle for the mime detection
library?

Thanks,
Keep up the good work

Yves Zoundi 
Blog : http://yveszoundi.blogspot.com
XPontus XML Editor : http://xpontus.sf.net
VFSJFileChooser : http://vfsjfilechooser.sf.net

-- 
Your attitude, not your aptitude, will determine your altitude
Zig Ziglar

You have to learn the rules of the game. And then you have to play
better than anyone else.
Albert Einstein

Act as if it was impossible to fail.
Dorothea Brande


Re: OSGI bundle for Tika

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On Mon, May 19, 2008 at 1:05 PM, Yves Zoundi <yv...@gmail.com> wrote:
> Hi everybody,
>
> It would be nice to create sub-projects from Apache Tika main maven
> project. The mime detection part is pretty useful and its code could be
> in a separate project. That would allow people to use it without the
> rest of the Tika's code.

i'm keen on using the MIME magic in RAT as well

> I was looking at a mime detection solution. I looked at JMimeInfo,
> jmimemagic and mime-util. After few tests, I choose to use Apache Tika's
> code.
>
> I removed few classes from the source code and created a jar with the
> mime detection code. I needed to use Tika in an OSGI environment and it
> was a bit painful to use Tika out of the box(without embedding it in an
> OSGI bundle which would export Tika packages later).
>
> I had to create a manifest and as Tika's code is not huge, I was able to
> export the packages quickly. I need to import javax.xml.parsers, sax and
> dom packages as Tika use them to load the mimetypes configuration file.
>
> The thing I didn't see in the mime detection code was a serializer to
> save the mimetypes.
>
> In a typical application, people usually :
> - Want a mime type configuration file somewhere that they can load
> - Want to be able to add/remove mimetypes
> - Add file extensions patterns to existing mime types
> - Store back the mime types to its location.

not sure whether this is really a core activity. IMHO this would work
better as a bolt-on.

> So my questions are :
> - If I load the mimetypes from a file, and add some mimetype entries at
> runtime, how can I save back the file without doing it manually with
> dom, jdom or dom4j?
> - Would it be possible to create an OSGI bundle for the mime detection
> library?

submit a patch ;-)

- robert

Re: OSGI bundle for Tika

Posted by Yves Zoundi <yv...@gmail.com>.
Hello Jukka,

Yes, I am afraid of carrying too many Tika dependencies. For now, the
project is still in the incubator and I believe the code base will grow
significantly sooner or later. It might then be difficult to extract the
mime detection library. I think the mime detection code is worth having
its own Maven project. I didn't see any dependency other than
commons-codec.

I really like the idea of a tika-core containing the main interfaces.
Partitioning is good, but at this point I guess it would add extra
complexity and extra work. Once most interfaces are well defined, I think
it will be easy to know what to partition and how to do it without worrying
about the architecture.

I will file a feature request for a mimetypes serializer. If I have time, I
might write the serializer and send it to you guys for evaluation.

Regards,
Yves Zoundi


2008/5/20 Jukka Zitting <ju...@gmail.com>:

> Hi,
>
> On Mon, May 19, 2008 at 3:05 PM, Yves Zoundi <yv...@gmail.com> wrote:
> > It would be nice to create sub-projects from Apache Tika main maven
> > project. The mime detection part is pretty useful and its code could be
> > in a separate project. That would allow people to use it without the
> > rest of the Tika's code.
>
> I think we can do that. Are you more worried about the size of the
> tika jar or all the parser dependencies you don't need?
>
> We might want to split Tika into two parts, say tika-core and
> tika-parsers, where tika-core would contain all the core interfaces
> and classes with no dependencies to external libraries (except of
> course the standard Java 5 class libraries). We could go even further
> by partitioning the core library by function, but I'm not sure if that
> is worth the extra complexity.
>
> > I removed few classes from the source code and created a jar with the
> > mime detection code. I needed to use Tika in an OSGI environment and it
> > was a bit painful to use Tika out of the box(without embedding it in an
> > OSGI bundle which would export Tika packages later).
> >
> > I had to create a manifest and as Tika's code is not huge, I was able to
> > export the packages quickly. I need to import javax.xml.parsers, sax and
> > dom packages as Tika use them to load the mimetypes configuration file.
>
> It should be possible to add the OSGi bundle information automatically
> in the normal Maven build. You might want to file an improvement
> request for this.
>
> > The thing I didn't see in the mime detection code was a serializer to
> > save the mimetypes.
>
> Our use cases so far have had only manual modifications of the
> configuration files, but I don't see why we couldn't make it possible
> to programmatically modify the configuration. In fact I've already
> done some work towards making the media type registry easier to
> manage, and a serializer for the configuration file would be a nice
> addition. Could you file a feature request for that?
>
> > In a typical application, people usually :
> > - Want a mime type configuration file somewhere that they can load
> > - Want to be able to add/remove mimetypes
> > - Add file extensions patterns to existing mime types
> > - Store back the mime types to its location.
> >
> > So my questions are :
> > - If I load the mimetypes from a file, and add some mimetype entries at
> > runtime, how can I save back the file without doing it manually with
> > dom, jdom or dom4j?
>
> Currently the only way is to modify the XML file directly, but as
> mentioned above a higher level serialization feature would be nice.
>
> > - Would it be possible to create an OSGI bundle for the mime detection
> > library?
>
> Certainly.
>
> BR,
>
> Jukka Zitting
>




Re: OSGI bundle for Tika

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, May 19, 2008 at 3:05 PM, Yves Zoundi <yv...@gmail.com> wrote:
> It would be nice to create sub-projects from Apache Tika main maven
> project. The mime detection part is pretty useful and its code could be
> in a separate project. That would allow people to use it without the
> rest of the Tika's code.

I think we can do that. Are you more worried about the size of the
tika jar or all the parser dependencies you don't need?

We might want to split Tika into two parts, say tika-core and
tika-parsers, where tika-core would contain all the core interfaces
and classes with no dependencies on external libraries (except, of
course, the standard Java 5 class libraries). We could go even further
by partitioning the core library by function, but I'm not sure if that
is worth the extra complexity.

> I removed few classes from the source code and created a jar with the
> mime detection code. I needed to use Tika in an OSGI environment and it
> was a bit painful to use Tika out of the box(without embedding it in an
> OSGI bundle which would export Tika packages later).
>
> I had to create a manifest and as Tika's code is not huge, I was able to
> export the packages quickly. I need to import javax.xml.parsers, sax and
> dom packages as Tika use them to load the mimetypes configuration file.

It should be possible to add the OSGi bundle information automatically
in the normal Maven build. You might want to file an improvement
request for this.

> The thing I didn't see in the mime detection code was a serializer to
> save the mimetypes.

Our use cases so far have had only manual modifications of the
configuration files, but I don't see why we couldn't make it possible
to programmatically modify the configuration. In fact I've already
done some work towards making the media type registry easier to
manage, and a serializer for the configuration file would be a nice
addition. Could you file a feature request for that?

> In a typical application, people usually :
> - Want a mime type configuration file somewhere that they can load
> - Want to be able to add/remove mimetypes
> - Add file extensions patterns to existing mime types
> - Store back the mime types to its location.
>
> So my questions are :
> - If I load the mimetypes from a file, and add some mimetype entries at
> runtime, how can I save back the file without doing it manually with
> dom, jdom or dom4j?

Currently the only way is to modify the XML file directly, but as
mentioned above a higher level serialization feature would be nice.
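
As a rough illustration of what such a higher level serialization
feature could look like, here is a minimal sketch that keeps media
types and their glob patterns in memory and writes them out with the
JDK's StAX XMLStreamWriter, so no hand-written DOM code is needed. This
is not Tika's API: the class MimeTypesSerializerSketch and its methods
are hypothetical, and the mime-info / mime-type / glob element layout
is an assumption about the configuration file format. It also assumes
a current JDK rather than Java 5.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class MimeTypesSerializerSketch {

    // Hypothetical in-memory registry: media type name -> glob patterns,
    // kept in insertion order so the output is stable.
    private final Map<String, List<String>> globsByType = new LinkedHashMap<>();

    // Registers a glob pattern, creating the media type entry if needed.
    void addGlob(String mediaType, String pattern) {
        globsByType.computeIfAbsent(mediaType, k -> new ArrayList<>()).add(pattern);
    }

    // Removes a media type together with all of its patterns.
    void removeType(String mediaType) {
        globsByType.remove(mediaType);
    }

    // Serializes the registry; the element and attribute names are assumed.
    String toXml() throws XMLStreamException {
        StringWriter buffer = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(buffer);
        xml.writeStartDocument();
        xml.writeStartElement("mime-info");
        for (Map.Entry<String, List<String>> entry : globsByType.entrySet()) {
            xml.writeStartElement("mime-type");
            xml.writeAttribute("type", entry.getKey());
            for (String pattern : entry.getValue()) {
                xml.writeEmptyElement("glob");
                xml.writeAttribute("pattern", pattern);
            }
            xml.writeEndElement(); // closes mime-type
        }
        xml.writeEndElement(); // closes mime-info
        xml.writeEndDocument();
        xml.close();
        return buffer.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        MimeTypesSerializerSketch registry = new MimeTypesSerializerSketch();
        registry.addGlob("text/plain", "*.txt");
        registry.addGlob("text/plain", "*.log");
        registry.addGlob("application/x-example", "*.example");
        registry.removeType("application/x-example");
        System.out.println(registry.toXml());
    }
}
```

A real serializer would also need to cover whatever else the
configuration file stores per type; the sketch handles only type names
and glob patterns.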

> - Would it be possible to create an OSGI bundle for the mime detection
> library?

Certainly.

BR,

Jukka Zitting