Posted to dev@nutch.apache.org by Jukka Zitting <ju...@gmail.com> on 2006/08/16 13:59:00 UTC

Thoughts on Parser design and dependencies

Hi,

I have some questions about the dependencies of the Parser interface,
especially from the perspective of generalizing it to the potential
Tika project. The current dependencies are:

   * Configurable - depends on the Hadoop configuration system
   * Pluggable - depends on the Nutch plugin system
   * Content - depends on the Nutch protocol model
   * Parse - depends on the Nutch index content model

I notice that Nutch uses a custom plugin and configuration system. Is
there a technical reason for having your own instead of using some of
the existing IoC and other component frameworks? If we are to make the
Parser components easily usable outside Nutch we'd need to either
remove those dependencies or include (either directly or by reference)
the plugin/configuration system in Tika. I'd personally prefer to
remove those dependencies in favor of more IoC-friendly JavaBean
conventions, but I'm not familiar with the background of the Parser
components.

The Parser interface is also bound to the ideas of fetching content
from the network and indexing it using a standard content model
through the Content and Parse dependencies. For the Tika project I'd
like to look for ways to generalize this, as neither of these ideas
applies, for example, to the needs of the Apache Jackrabbit project. My
TextExtractor proposal avoids these dependencies by using just a
binary stream, a content type and an optional character encoding to
produce a single text stream, but that approach fails to support more
structured index content models. I'm trying to find a solution that
combines the best parts of both approaches.
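
A minimal sketch of the TextExtractor idea described above, for
illustration only. The interface name comes from this message; the
method signature, the PlainTextExtractor class and the UTF-8 fallback
are assumptions, not the actual proposal.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical shape of the extractor: binary stream + content type +
// optional encoding in, a single text stream out. No Nutch or Hadoop types.
interface TextExtractor {
    Reader extractText(InputStream stream, String contentType, String encoding)
            throws IOException;
}

// Illustrative implementation that only understands text/plain.
class PlainTextExtractor implements TextExtractor {
    @Override
    public Reader extractText(InputStream stream, String contentType, String encoding)
            throws IOException {
        if (!"text/plain".equals(contentType)) {
            throw new IOException("Unsupported content type: " + contentType);
        }
        // The encoding is optional; this sketch falls back to UTF-8.
        Charset charset = (encoding != null)
                ? Charset.forName(encoding)
                : StandardCharsets.UTF_8;
        return new InputStreamReader(stream, charset);
    }
}

class TextExtractorDemo {
    // Drains a Reader into a String.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello parser".getBytes(StandardCharsets.UTF_8);
        Reader text = new PlainTextExtractor()
                .extractText(new ByteArrayInputStream(data), "text/plain", null);
        System.out.println(readAll(text)); // prints "hello parser"
    }
}
```

Note that the only output is a Reader, which is exactly why this shape
cannot carry a structured content model.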

Ideally I'd like to see a parser implementation in Tika that avoids
the Nutch dependencies but can still be used in Nutch without changing
any of the existing code or configuration files. Something like a
TikaParser adapter class might be needed to achieve that.
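
Roughly what such an adapter could look like. This is a sketch under
heavy assumptions: PortableParser is a made-up Nutch-independent
contract, and NutchContent, NutchParse and NutchParser are simplified
stand-ins declared locally, not the real Nutch classes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical Nutch-independent parser contract (not an actual Tika API).
interface PortableParser {
    String parse(InputStream stream, String contentType) throws IOException;
}

// Simplified stand-ins for the Nutch side; the real Content, Parse and
// Parser types carry far more (metadata, outlinks, configuration).
class NutchContent {
    final byte[] bytes;
    final String contentType;

    NutchContent(byte[] bytes, String contentType) {
        this.bytes = bytes;
        this.contentType = contentType;
    }
}

class NutchParse {
    final String text;

    NutchParse(String text) {
        this.text = text;
    }
}

interface NutchParser {
    NutchParse getParse(NutchContent content);
}

// The adapter: callers keep using the Nutch-side interface, while the
// actual work is delegated to a parser that knows nothing about Nutch.
class TikaParserAdapter implements NutchParser {
    private final PortableParser delegate;

    TikaParserAdapter(PortableParser delegate) {
        this.delegate = delegate;
    }

    @Override
    public NutchParse getParse(NutchContent content) {
        try (InputStream in = new ByteArrayInputStream(content.bytes)) {
            return new NutchParse(delegate.parse(in, content.contentType));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class TikaParserAdapterDemo {
    public static void main(String[] args) {
        // Toy portable parser: decodes the bytes as UTF-8 text.
        PortableParser portable =
                (in, type) -> new String(in.readAllBytes(), StandardCharsets.UTF_8);
        NutchParser parser = new TikaParserAdapter(portable);
        NutchParse parse = parser.getParse(new NutchContent(
                "plain body".getBytes(StandardCharsets.UTF_8), "text/plain"));
        System.out.println(parse.text); // prints "plain body"
    }
}
```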

BR,

Jukka Zitting

-- 
Yukatan - http://yukatan.fi/ - info@yukatan.fi
Software craftsmanship, JCR consulting, and Java development

Re: Thoughts on Parser design and dependencies

Posted by Sami Siren <ss...@gmail.com>.
Jukka Zitting wrote:
> I notice that Nutch uses a custom plugin and configuration system. Is
> there a technical reason for having your own instead of using some of
> the existing IoC and other component frameworks? If we are to make the

I have been wondering about this myself as well. I really don't like
the dependencies on the current plugin system all over the place. It is
an antipattern.

--
  Sami Siren

Re: Thoughts on Parser design and dependencies

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jukka Zitting wrote:
> Hi,
>
> On 8/19/06, Sami Siren <ss...@gmail.com> wrote:
>> So far Nutch has been built to deal mainly with text type documents.
>> There's however need also to deal with non textual object eg.  images,
>> movies, sound which will provide content only in form of metadata (ok,
>> perhaps some text also about the context of object if applicable), so
>> the metadata names we have today are only a subset of what might be.
>>
>> I really would not want to restrict the metadata the interface can carry
>> to a fixed set.
>
> But if it's an open Map, how do you index and search using that, i.e.
> what is the mapping between the Map keys used by a parser component
> and the field names in the resulting Lucene index? How do we enforce
> that an MPEG parser uses the same Map keys as a JPEG parser when
> encountering metadata with the same semantics?
>
> I'm not opposed to using a Map for truly variable metadata, like HTML
> <meta/> tags with unknown names, but if we want common handling for
> example for Dublin Core metadata, it would be better to enforce that
> on the interface level.

Well, Nutch already does this in a way, but it's a "soft" endorsement 
rather than a hard enforcement .. ;) We define keys for all common 
metadata sets (DC, Office, HttpHeaders), and plugin writers are supposed 
to use them, unless they can't find any metadata key with matching 
semantics.

Then, other indexing plugins expect certain metadata to be available 
under these keys, and create appropriate Lucene fields, again using 
predefined field names.
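
To make the "soft" convention concrete, here is a small sketch. The
constant names, key strings and index field names below are all made up
for illustration; they are not the constants Nutch actually defines.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Shared key constants that parser plugins are expected to reuse.
// Names and values are illustrative, not Nutch's actual constants.
final class CommonKeys {
    static final String DC_TITLE = "dc.title";
    static final String DC_CREATOR = "dc.creator";
    static final String HTTP_CONTENT_TYPE = "Content-Type";

    private CommonKeys() {
    }
}

// An indexing step that looks for the well-known keys and emits index
// fields under predefined (here: invented) field names.
class IndexFieldMapper {
    private static final Map<String, String> KEY_TO_FIELD = new LinkedHashMap<>();

    static {
        KEY_TO_FIELD.put(CommonKeys.DC_TITLE, "title");
        KEY_TO_FIELD.put(CommonKeys.DC_CREATOR, "author");
        KEY_TO_FIELD.put(CommonKeys.HTTP_CONTENT_TYPE, "type");
    }

    static Map<String, String> toIndexFields(Map<String, String> metadata) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : KEY_TO_FIELD.entrySet()) {
            String value = metadata.get(e.getKey());
            if (value != null) {
                fields.put(e.getValue(), value);
            }
        }
        return fields;
    }
}

class IndexFieldMapperDemo {
    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(CommonKeys.DC_TITLE, "Parser design");
        // A key without agreed semantics is simply not picked up.
        metadata.put("x-private-key", "ignored");
        System.out.println(IndexFieldMapper.toIndexFields(metadata));
        // prints {title=Parser design}
    }
}
```

Nothing stops a plugin from inventing its own key here, which is the
"soft" part: the indexing side only sees what was stored under the
agreed constants.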

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Thoughts on Parser design and dependencies

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 8/19/06, Sami Siren <ss...@gmail.com> wrote:
> So far Nutch has been built to deal mainly with text type documents.
> There's however need also to deal with non textual object eg.  images,
> movies, sound which will provide content only in form of metadata (ok,
> perhaps some text also about the context of object if applicable), so
> the metadata names we have today are only a subset of what might be.
>
> I really would not want to restrict the metadata the interface can carry
> to a fixed set.

But if it's an open Map, how do you index and search using that, i.e.
what is the mapping between the Map keys used by a parser component
and the field names in the resulting Lucene index? How do we enforce
that an MPEG parser uses the same Map keys as a JPEG parser when
encountering metadata with the same semantics?

I'm not opposed to using a Map for truly variable metadata, like HTML
<meta/> tags with unknown names, but if we want common handling for
example for Dublin Core metadata, it would be better to enforce that
on the interface level.
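
A sketch of what enforcing this on the interface level could mean. The
type and method names are hypothetical, and only two Dublin Core
elements are shown.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Compile-time contract for metadata with agreed semantics.
// Illustrative only: two Dublin Core elements, invented method names.
interface DublinCoreMetadata {
    String getTitle();

    String getCreator();
}

// Parse result: typed accessors for the common set, plus an open map
// for truly variable metadata such as unknown HTML meta tags.
interface ParsedMetadata extends DublinCoreMetadata {
    Map<String, String> getOtherMetadata();
}

class SimpleParsedMetadata implements ParsedMetadata {
    private final String title;
    private final String creator;
    private final Map<String, String> other = new LinkedHashMap<>();

    SimpleParsedMetadata(String title, String creator) {
        this.title = title;
        this.creator = creator;
    }

    void putOther(String name, String value) {
        other.put(name, value);
    }

    @Override
    public String getTitle() {
        return title;
    }

    @Override
    public String getCreator() {
        return creator;
    }

    @Override
    public Map<String, String> getOtherMetadata() {
        return Collections.unmodifiableMap(other);
    }
}

class ParsedMetadataDemo {
    // The consumer can rely on getTitle() regardless of which parser
    // (MPEG, JPEG, ...) produced the metadata.
    static String indexTitle(DublinCoreMetadata metadata) {
        return metadata.getTitle();
    }

    public static void main(String[] args) {
        SimpleParsedMetadata m = new SimpleParsedMetadata("Holiday clip", "J. Doe");
        m.putOther("x-camera-model", "unknown");
        System.out.println(indexTitle(m)); // prints "Holiday clip"
    }
}
```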

BR,

Jukka Zitting


Re: Thoughts on Parser design and dependencies

Posted by Sami Siren <ss...@gmail.com>.
Jukka Zitting wrote:
> Hi,
> 
> On 8/18/06, Andrzej Bialecki <ab...@getopt.org> wrote:
>> A very important aspect of the Parser interface (or actually, the Parse
>> and Content classes) is that they each may contain arbitrary metadata.
>> This is required for discovering and passing around both the original
>> metadata (such as protocol headers, document properties, etc), and other
>> secondary content (such as data from external sources, or derived 
>> metadata).
> 
> Is there a list of all the different metadata items that get passed in
> or out of the parser components? My hunch is that the list of items is
> relatively short and that even though different parsers might input or
> output different metadata, it still might make sense to come up with a
> general content model that serves the needs of everyone.
>
>> Simply returning a String doesn't cut it. Returning a java.util.Map may
>> be an option, if you use standard Metadata constants as keys - still,
>> Nutch would have to repackage this anyway into a Writable. And we would
>> lose a nice property of the current Metadata class, which is the ability
>> to tolerate minor syntax variations and to store multiple values per key.
> 
> The problem I see with a Map or a similar keyed solution is that you
> only get to specify the metadata contract as documented (if ever)
> keys instead of as a compile-time interface. Using a Map is fine if
> the set of managed information truly varies at runtime, but not when
> the set is fixed or at least well bounded.

So far Nutch has been built to deal mainly with text documents. There
is, however, also a need to deal with non-textual objects, e.g. images,
movies and sound, which provide content only in the form of metadata
(ok, perhaps also some text about the context of the object, if
applicable), so the metadata names we have today are only a subset of
what might be needed.

I really would not want to restrict the metadata the interface can carry 
to a fixed set.

--
  Sami Siren


Re: Thoughts on Parser design and dependencies

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 8/18/06, Andrzej Bialecki <ab...@getopt.org> wrote:
> A very important aspect of the Parser interface (or actually, the Parse
> and Content classes) is that they each may contain arbitrary metadata.
> This is required for discovering and passing around both the original
> metadata (such as protocol headers, document properties, etc), and other
> secondary content (such as data from external sources, or derived metadata).

Is there a list of all the different metadata items that get passed in
or out of the parser components? My hunch is that the list of items is
relatively short and that even though different parsers might input or
output different metadata, it still might make sense to come up with a
general content model that serves the needs of everyone.

> Simply returning a String doesn't cut it. Returning a java.util.Map may
> be an option, if you use standard Metadata constants as keys - still,
> Nutch would have to repackage this anyway into a Writable. And we would
> lose a nice property of the current Metadata class, which is the ability
> to tolerate minor syntax variations and to store multiple values per key.

The problem I see with a Map or a similar keyed solution is that you
only get to specify the metadata contract as documented (if ever)
keys instead of as a compile-time interface. Using a Map is fine if
the set of managed information truly varies at runtime, but not when
the set is fixed or at least well bounded.

Another concern with both the Parse class in Nutch and my
TextExtractor interface is that the body content is returned as a
single text stream (a String and a Reader respectively). This doesn't
allow the parser to pass along extra information like the emphasis of
certain parts (think of headings or links in html) or the language of
the text (e.g. xml:lang). I'm not too familiar with Lucene to know if
it could use such information, so this might be a YAGNI, but inversion
of control with a Builder interface would be a pretty powerful
solution for passing such information.
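
A rough sketch of the Builder idea, assuming a made-up callback
interface. The ContentBuilder methods and the hard-coded ToyHtmlParser
are illustrative only; the point is that the parser pushes structured
events and each consumer decides what to keep.

```java
import java.util.ArrayList;
import java.util.List;

// Callback interface the parser pushes content into. The method set is
// a guess at what might be useful, not a worked-out design.
interface ContentBuilder {
    void heading(String text);

    void text(String text, String language);

    void link(String href, String anchorText);
}

// Consumer that flattens everything into a single text stream,
// which is what a plain String or Reader result gives you today.
class PlainTextBuilder implements ContentBuilder {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void heading(String text) {
        out.append(text).append('\n');
    }

    @Override
    public void text(String text, String language) {
        out.append(text).append('\n');
    }

    @Override
    public void link(String href, String anchorText) {
        out.append(anchorText).append('\n');
    }

    String getText() {
        return out.toString();
    }
}

// Consumer that keeps only headings, e.g. to weight them differently.
class HeadingCollector implements ContentBuilder {
    final List<String> headings = new ArrayList<>();

    @Override
    public void heading(String text) {
        headings.add(text);
    }

    @Override
    public void text(String text, String language) {
        // body text is not interesting to this consumer
    }

    @Override
    public void link(String href, String anchorText) {
        // links are not interesting to this consumer
    }
}

class ToyHtmlParser {
    // Hard-coded events stand in for real parsing of a document.
    static void parse(ContentBuilder builder) {
        builder.heading("Parser design");
        builder.text("Some body text.", "en");
        builder.link("http://example.org/", "a link");
    }

    public static void main(String[] args) {
        HeadingCollector collector = new HeadingCollector();
        parse(collector);
        System.out.println(collector.headings); // prints [Parser design]
    }
}
```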

BR,

Jukka Zitting


Re: Thoughts on Parser design and dependencies

Posted by Andrzej Bialecki <ab...@getopt.org>.
Sami Siren wrote:
>> Original motivation for this was http headers and meta tags, which 
>> can have multiple values. Another case is the language 
>> identification, where the same key may have multiple values, coming 
>> from different sources. Additionally, MapWritable supports any 
>> Writable, which is quite handy to store non-string data and to avoid 
>> converting to/from strings.
>
> I am not jerking on MapWritable, in fact I think it's quite efficient 
> piece of code :) IMO the support for Writable in values is valuable 
> but for keys hmmm... perhaps TextWritable is enough.

Yes, that's a sensible assumption - at least I've never had any need for 
any other types of keys so far ...

>
> I did play around this thing earlier by implementing something that 
> you could call external meta data. Which in concrete means that I 
> created a separate sequence file of (in this particular case DMOZ 
> data), keyed it by url and used that as sort of metadata during 
> indexing phase (mapped together with rest of nutch data).
>
> The benefit of this kind of approach compared to current one (static 
> metadata inside crawldb) is that I can manage the metadata completely 
> separated from crawldb operations and crawldb operations run faster 
> because of less data to move around.

Yes, that's true - I sometimes use this approach too. However, the 
downside of this method is its relative complexity: instead of just 
adding a key/value pair and having it automagically appear wherever you 
have a CrawlDatum, you now have to manage a separate data file, modify 
it using custom tools and then make sure that all parts of Nutch can 
optionally include this file in the input (and output if you were to 
modify it) ... It can be done, of course, it's just much more complex.

-- 
Best regards,
Andrzej Bialecki     <><



Re: Thoughts on Parser design and dependencies

Posted by Sami Siren <ss...@gmail.com>.
Andrzej Bialecki wrote:
> Sami Siren wrote:
>> Andrzej Bialecki wrote:
>>> Jukka Zitting wrote:
>>>
>>>> The Parser interface is also bound to the ideas of fetching content
>>>> from the network and indexing it using a standard content model
>>>> through the Content and Parse dependencies. For the Tika project I'd
>>>> like to look for ways to generalize this, as neither of these ideas
>>>> apply for example to the needs of the Apache Jackrabbit project. My
>>>> TextExtractor proposal avoids these dependencies by using just a
>>>> binary stream, a content type and an optional character encoding to
>>>> produce a single text stream, but that approach fails to support more
>>>> structured index content models. I'm trying to find a solution that
>>>> combines the best parts of both approaches.
>>>
>>> A very important aspect of the Parser interface (or actually, the 
>>> Parse and Content classes) is that they each may contain arbitrary 
>>> metadata. This is required for discovering and passing around both 
>>> the original metadata (such as protocol headers, document properties, 
>>> etc), and other secondary content (such as data from external 
>>> sources, or derived metadata).
>>>
>>> Simply returning a String doesn't cut it. Returning a java.util.Map 
>>> may be an option, if you use standard Metadata constants as keys - 
>>> still, Nutch would have to repackage this anyway into a Writable. And 
>>> we would lose a nice property of the current Metadata class, which is 
>>> the ability to tolerate minor syntax variations and to store multiple 
>>> values per key.
>>>
>> The tolerance for syntax variations should instead of written into 
>> meta data object be in a separate class perhaps implemented as a 
>> decorator to actual meta data. In fact places where nutch needs to 
>> take advantage of this functionality (actually in case of http headers 
>> only??) are rarer (in number) than those where we know exactly the 
>> names of meta data keys (because we put them there).
>>
>> I'd +1 if we'd go for a Map as a interface to meta data and in the 
>> same time perhaps change the Crawldb's metadata to the same meta data 
>> implementation or subclass of it.
> 
> Hmm. Please keep in mind that we need to use a Writable, both for the 
> Map itself and also for every value that we put there. I'm worried that 
> this could lead to excessive re-packaging of all objects coming out of 
> Parsers, from their original formats (Map<String, String>) to MapWritable.

Yes, that is a potential problem, especially from the efficiency point
of view. One should test how much of a (performance) problem it
actually is.

> Since the goal here is to get rid of dependencies on Nutch or Hadoop, 
> this means that Nutch will have to do such conversion because Tika would 
> not support Writable.
> 
>>
>> Perhaps we could even go for Map<String,String> or is there actually 
>> some use case for having multiple values for single key?
> 
> Original motivation for this was http headers and meta tags, which can 
> have multiple values. Another case is the language identification, where 
> the same key may have multiple values, coming from different sources. 
> Additionally, MapWritable supports any Writable, which is quite handy to 
> store non-string data and to avoid converting to/from strings.

I am not knocking MapWritable; in fact I think it's quite an efficient
piece of code :) IMO the support for Writable in values is valuable, but
for keys, hmmm... perhaps TextWritable is enough.

I did play around with this earlier by implementing something you could
call external metadata. Concretely, I created a separate sequence file
(in this particular case of DMOZ data), keyed it by URL and used it as
a sort of metadata during the indexing phase (mapped together with the
rest of the Nutch data).

The benefit of this kind of approach compared to the current one
(static metadata inside the crawldb) is that I can manage the metadata
completely separately from crawldb operations, and crawldb operations
run faster because there is less data to move around.
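
In spirit, the indexing-time join looks something like the sketch
below. This is a deliberate simplification: in-memory maps keyed by URL
stand in for the crawl data and the separate sequence file, and the
class and field names are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ExternalMetadataJoin {
    // Merges per-URL external metadata into per-URL crawl data.
    // URLs without external metadata pass through unchanged; external
    // entries for URLs that were never crawled are ignored.
    static Map<String, Map<String, String>> join(
            Map<String, Map<String, String>> crawlData,
            Map<String, Map<String, String>> externalMetadata) {
        Map<String, Map<String, String>> joined = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> e : crawlData.entrySet()) {
            Map<String, String> merged = new LinkedHashMap<>(e.getValue());
            Map<String, String> extra = externalMetadata.get(e.getKey());
            if (extra != null) {
                merged.putAll(extra);
            }
            joined.put(e.getKey(), merged);
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> crawl = new LinkedHashMap<>();
        crawl.put("http://example.org/", Map.of("title", "Example"));

        // Maintained separately from the crawl data, keyed by the same URL.
        Map<String, Map<String, String>> external = new LinkedHashMap<>();
        external.put("http://example.org/", Map.of("dmoz.category", "Computers"));

        System.out.println(join(crawl, external));
    }
}
```

The external map can be rebuilt or replaced without touching the crawl
data at all, which is the benefit described above.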

--
  Sami Siren



Re: Thoughts on Parser design and dependencies

Posted by Andrzej Bialecki <ab...@getopt.org>.
Sami Siren wrote:
> Andrzej Bialecki wrote:
>> Jukka Zitting wrote:
>>
>>> The Parser interface is also bound to the ideas of fetching content
>>> from the network and indexing it using a standard content model
>>> through the Content and Parse dependencies. For the Tika project I'd
>>> like to look for ways to generalize this, as neither of these ideas
>>> apply for example to the needs of the Apache Jackrabbit project. My
>>> TextExtractor proposal avoids these dependencies by using just a
>>> binary stream, a content type and an optional character encoding to
>>> produce a single text stream, but that approach fails to support more
>>> structured index content models. I'm trying to find a solution that
>>> combines the best parts of both approaches.
>>
>> A very important aspect of the Parser interface (or actually, the 
>> Parse and Content classes) is that they each may contain arbitrary 
>> metadata. This is required for discovering and passing around both 
>> the original metadata (such as protocol headers, document properties, 
>> etc), and other secondary content (such as data from external 
>> sources, or derived metadata).
>>
>> Simply returning a String doesn't cut it. Returning a java.util.Map 
>> may be an option, if you use standard Metadata constants as keys - 
>> still, Nutch would have to repackage this anyway into a Writable. And 
>> we would lose a nice property of the current Metadata class, which is 
>> the ability to tolerate minor syntax variations and to store multiple 
>> values per key.
>>
> The tolerance for syntax variations should instead of written into 
> meta data object be in a separate class perhaps implemented as a 
> decorator to actual meta data. In fact places where nutch needs to 
> take advantage of this functionality (actually in case of http headers 
> only??) are rarer (in number) than those where we know exactly the 
> names of meta data keys (because we put them there).
>
> I'd +1 if we'd go for a Map as a interface to meta data and in the 
> same time perhaps change the Crawldb's metadata to the same meta data 
> implementation or subclass of it.

Hmm. Please keep in mind that we need to use a Writable, both for the 
Map itself and also for every value that we put there. I'm worried that 
this could lead to excessive re-packaging of all objects coming out of 
Parsers, from their original formats (Map<String, String>) to MapWritable.

Since the goal here is to get rid of dependencies on Nutch or Hadoop, 
this means that Nutch will have to do such conversion because Tika would 
not support Writable.
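
To illustrate the repackaging step I mean, here is a sketch. The
SketchWritable interface is a local stand-in that mirrors the two
methods of Hadoop's Writable (write and readFields) so the example
compiles without Hadoop; StringMapWritable is an invented class, not
the real MapWritable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Local stand-in mirroring the shape of Hadoop's Writable contract.
interface SketchWritable {
    void write(DataOutput out) throws IOException;

    void readFields(DataInput in) throws IOException;
}

// Invented wrapper for a String-to-String map (values must be non-null).
class StringMapWritable implements SketchWritable {
    final Map<String, String> entries = new LinkedHashMap<>();

    // The repackaging step: every entry coming out of a parser is
    // copied into the serializable wrapper.
    static StringMapWritable fromMap(Map<String, String> plain) {
        StringMapWritable w = new StringMapWritable();
        w.entries.putAll(plain);
        return w;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(entries.size());
        for (Map.Entry<String, String> e : entries.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        entries.clear();
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            String key = in.readUTF();
            entries.put(key, in.readUTF());
        }
    }
}

class RepackagingDemo {
    // Serializes and deserializes, as would happen between job stages.
    static StringMapWritable roundTrip(StringMapWritable in) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        in.write(new DataOutputStream(bytes));
        StringMapWritable copy = new StringMapWritable();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        return copy;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> fromParser = new LinkedHashMap<>();
        fromParser.put("title", "Parser design");
        System.out.println(roundTrip(StringMapWritable.fromMap(fromParser)).entries);
    }
}
```

The fromMap copy is the per-document overhead in question; whether it
matters in practice would need measuring.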

>
> Perhaps we could even go for Map<String,String> or is there actually 
> some use case for having multiple values for single key?

The original motivation for this was HTTP headers and meta tags, which 
can have multiple values. Another case is language identification, where 
the same key may have multiple values, coming from different sources. 
Additionally, MapWritable supports any Writable, which is quite handy to 
store non-string data and to avoid converting to/from strings.
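
A minimal sketch of the multiple-values-per-key property, using only
String values. The MultiValuedMetadata class is hypothetical and much
simpler than the actual Metadata or MapWritable classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical container that keeps every value added under a name.
class MultiValuedMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Adds a value without replacing the ones already stored.
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // All values for the name, in insertion order; empty if none.
    List<String> getValues(String name) {
        List<String> found = values.get(name);
        return (found == null)
                ? Collections.emptyList()
                : Collections.unmodifiableList(found);
    }

    // Convenience accessor for callers that expect a single value.
    String get(String name) {
        List<String> found = values.get(name);
        return (found == null || found.isEmpty()) ? null : found.get(0);
    }
}

class MultiValuedMetadataDemo {
    public static void main(String[] args) {
        MultiValuedMetadata metadata = new MultiValuedMetadata();
        // The same header may legitimately occur more than once.
        metadata.add("Set-Cookie", "a=1");
        metadata.add("Set-Cookie", "b=2");
        // Language detected from different sources.
        metadata.add("language", "en");
        metadata.add("language", "fi");
        System.out.println(metadata.getValues("Set-Cookie")); // prints [a=1, b=2]
        System.out.println(metadata.get("language"));         // prints en
    }
}
```

With a plain Map<String,String> the second add would silently overwrite
the first value.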

-- 
Best regards,
Andrzej Bialecki     <><



Re: Thoughts on Parser design and dependencies

Posted by Sami Siren <ss...@gmail.com>.
Andrzej Bialecki wrote:
> Jukka Zitting wrote:
> 
>> The Parser interface is also bound to the ideas of fetching content
>> from the network and indexing it using a standard content model
>> through the Content and Parse dependencies. For the Tika project I'd
>> like to look for ways to generalize this, as neither of these ideas
>> apply for example to the needs of the Apache Jackrabbit project. My
>> TextExtractor proposal avoids these dependencies by using just a
>> binary stream, a content type and an optional character encoding to
>> produce a single text stream, but that approach fails to support more
>> structured index content models. I'm trying to find a solution that
>> combines the best parts of both approaches.
> 
> A very important aspect of the Parser interface (or actually, the Parse 
> and Content classes) is that they each may contain arbitrary metadata. 
> This is required for discovering and passing around both the original 
> metadata (such as protocol headers, document properties, etc), and other 
> secondary content (such as data from external sources, or derived 
> metadata).
> 
> Simply returning a String doesn't cut it. Returning a java.util.Map may 
> be an option, if you use standard Metadata constants as keys - still, 
> Nutch would have to repackage this anyway into a Writable. And we would 
> lose a nice property of the current Metadata class, which is the ability 
> to tolerate minor syntax variations and to store multiple values per key.
> 
Instead of being written into the metadata object, the tolerance for
syntax variations should live in a separate class, perhaps implemented
as a decorator around the actual metadata. In fact, the places where
Nutch needs this functionality (actually only in the case of HTTP
headers??) are fewer than those where we know the exact names of the
metadata keys (because we put them there).
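
Something along these lines, as a sketch. The TolerantMetadata class
and its normalization rule (lower-case, drop everything except letters
and digits) are assumptions made for this example; the current Metadata
class may well tolerate variations differently.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical decorator: the wrapped map stays a plain exact-match
// map, and only callers that need tolerance go through the wrapper.
class TolerantMetadata {
    private final Map<String, String> delegate;
    // normalized key -> key as actually stored in the delegate
    private final Map<String, String> canonicalToActual = new HashMap<>();

    // The index is built once; keys added to the delegate afterwards
    // are not visible through this wrapper.
    TolerantMetadata(Map<String, String> delegate) {
        this.delegate = delegate;
        for (String key : delegate.keySet()) {
            canonicalToActual.putIfAbsent(normalize(key), key);
        }
    }

    // Assumed rule: ignore case and any non-alphanumeric characters.
    static String normalize(String key) {
        return key.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    String get(String key) {
        String actual = canonicalToActual.get(normalize(key));
        return (actual == null) ? null : delegate.get(actual);
    }
}

class TolerantMetadataDemo {
    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Content-Type", "text/html");

        TolerantMetadata tolerant = new TolerantMetadata(headers);
        System.out.println(tolerant.get("content_type")); // prints text/html
        System.out.println(headers.get("content_type"));  // prints null
    }
}
```

Code that knows the exact key names keeps using the plain map and pays
nothing for the tolerance.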

I'd +1 if we'd go for a Map as an interface to metadata and at the same
time perhaps change the CrawlDb's metadata to the same metadata
implementation or a subclass of it.

Perhaps we could even go for Map<String,String>, or is there actually
some use case for having multiple values for a single key?

--
  Sami Siren

Re: Thoughts on Parser design and dependencies

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jukka Zitting wrote:

> The Parser interface is also bound to the ideas of fetching content
> from the network and indexing it using a standard content model
> through the Content and Parse dependencies. For the Tika project I'd
> like to look for ways to generalize this, as neither of these ideas
> apply for example to the needs of the Apache Jackrabbit project. My
> TextExtractor proposal avoids these dependencies by using just a
> binary stream, a content type and an optional character encoding to
> produce a single text stream, but that approach fails to support more
> structured index content models. I'm trying to find a solution that
> combines the best parts of both approaches.

A very important aspect of the Parser interface (or actually, the Parse 
and Content classes) is that they each may contain arbitrary metadata. 
This is required for discovering and passing around both the original 
metadata (such as protocol headers, document properties, etc), and other 
secondary content (such as data from external sources, or derived metadata).

Simply returning a String doesn't cut it. Returning a java.util.Map may 
be an option, if you use standard Metadata constants as keys - still, 
Nutch would have to repackage this anyway into a Writable. And we would 
lose a nice property of the current Metadata class, which is the ability 
to tolerate minor syntax variations and to store multiple values per key.

>
> Ideally I'd like to see a parser implementation in Tika that avoids
> the Nutch dependencies but can still be used in Nutch without changing
> any of the existing code or configuration files. Something like a
> TikaParser adapter class might be needed to achieve that.

It seems to me that such an adapter is unavoidable. Most probably
similar adapters would have to be used for all other dependencies
(Configurable etc). The big question is how to minimize the
intermediate object creation, and to come up with interfaces that are
robust enough to support all current use cases in Nutch, but at the
same time don't introduce too many layers of delegation...

-- 
Best regards,
Andrzej Bialecki     <><