Posted to dev@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2012/04/17 12:35:07 UTC

NUTCH-1129

Hi Guys,

I could probably be using this within the latter part of my university
work, also we have nearly launched Any23 0.7.0-incubating so I suppose now
is as good a time as ever to formally set about preparing a strategy for
what an Any23 parser (and indexing filter?) implementation would look like.

A couple of questions first:

parse-tika shadows the parse-html implementation in more or less every
way, apart from: (i) TikaParser methods are not declared public (I think
this is consistent throughout the plugin); (ii)
TikaParser#ParseResult.getParse() embeds the Tika parser logic. Also,
TikaParser doesn't have a main method.

1) Why are Tika methods not declared as public?
2) Why doesn't TikaParser ship with a main method?
3) We previously discussed implementing the Any23 parser plugin as a tika
wrapper, therefore it would look very similar to parse-tika?
4) I like the look of the feed plugin where it also ships with a custom
indexingfilter implementation, my thoughts were also to provide a custom
Any23IndexingFilter implementation?

Any comments would be great before I begin coding this up. I'm keen to get
it going, but not before it has the support from you guys as well.

Thanks in advance for any direction.

Lewis


-- 
*Lewis*

Re: NUTCH-1129

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Julien,

On Tue, Apr 17, 2012 at 12:03 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> why should they? If they are not used outside the class then I think it is
> good practice to keep them private. Since TikaParser implements Parser the
> only method that we can expect to be called is getParse() and it is public
>

Yeah oops ;0) (duh)

>
> it could have one for testing but IMHO using the ParserChecker is a better
> way of testing as it is closer to real use
>

Yeah +1 it really is.

>  I don't remember this but I remember suggesting that the Any23 parser
> should be a tika parser which is not the same as a Tika wrapper. I expect
> other people in Tika-land to have a use for it, and we'd get the benefit of
> it automatically with parse-tika
>

Yeah I think for Any23 this would be a great goal to work towards, however
not within the scope of a parse-any23 plugin for Nutch :0) Time will tell
for this one. I think we are all getting to know the capabilities of Any23
just now so it's still early days.

>
> Depends on what you want to do? What would we get out of Any23? How would
> that be used on the search side?
>

I need to look into this. Having spoken with Paolo Castagna about this
before, we discussed a TDB implementation enabling us to scrape structured
Any23 stuff and send it directly to TDB. However, this is separate from a
neat indexing filter (or filters) which, for example, embraces the mimeType
and indexes it accordingly. The reason the feed plugin grabbed my attention
was that the FeedIndexingFilter grabs important info from the feed and
passes it on in such a way that we can index and search conveniently and
efficiently through piles of feeds. For the latter part of this I need to do
some investigation RE different formats and how they can be represented
within an index, allowing us to conveniently navigate triples etc.

Thanks for now Julien.

Lewis
-- 
*Lewis*

Re: NUTCH-1129

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Sebastian,

I'm taking this over to any23-dev@incubator.apache.org to discuss there.
We know what we want over here @ Nutch: we want to utilise the Any23
parsers to scrape additional structured information from webpages and the
like. However, as you mention, the subsequent task of presenting it
(HtmlIndexingFilter) is not quite as straightforward. It gets trickier
when you take into account the growing range of formats that Any23 is able
to extract, which differ not only in syntax but also in semantic
representation.

I'll get you over on any23 lists. Thanks

Lewis

On Tue, Apr 17, 2012 at 11:19 PM, Sebastian Nagel <
wastl.nagel@googlemail.com> wrote:

> >> Well, we could easily use certain microdata key/value pairs in our results
> >> to greatly improve search and navigation.
>
> Microdata is a good show-case for the Any23 plugin.
>
> Another example would be semantic markup in shops.
> Any23 already does a good job in extracting the semantic content:
>
>  $ any23tools Rover \
>   'http://www.shopforia.com/cgi-bin/apf4/apf4.cgi?Operation=ItemLookup&ItemId=B007P4VOWC'
>
> The question is how to map triples to key-value pairs (NutchFields)
> in a straight-forward but configurable way.
> The triples
>  <#Offering_0635753498301> <#hasPriceSpecification> <#UnitPriceSpecification> .
>  <#UnitPriceSpecification> <#hasCurrencyValue> "249.99"^^<#float> ;
>        <#hasCurrency> "USD"^^<#string> ;
>        ... .
> and the pair
>  price = 249.99 USD
> are the same information. Nutch (or Solr etc.) require the latter form
> if you want to set up a shop search. But conversion is not as simple
> (maybe I'm wrong?):
>  - information may be spread over several triples
>  - there may be multiple products per document
>   (same predicate for different subjects) => use sub-documents?
>
> Sebastian
>
>
> On 04/17/2012 08:05 PM, Lewis John Mcgibbney wrote:
>
>> Hi Markus,
>>
>> On Tue, Apr 17, 2012 at 12:21 PM, Markus Jelsma
>> <ma...@openindex.io> wrote:
>>
>>> You did indeed suggest that. However, if building a wrapper is fairly
>>> straightforward then it may not be a bad idea. I haven't seen any hint of
>>> Tika having Any23 on-board any time soon so we might have to wait a very
>>> long time if we want to rely on Tika.
>>>
>>>
>> Yeah +1. As I explained to Julien we are some way from thinking about
>> integration into Tika and subsequently writing the parser implementation(s)
>> for use within Tika.
>>
>>
>>
>>>
>>> Well, we could easily use certain microdata key/value pairs in our results
>>> to greatly improve search and navigation.
>>>
>>>
>> Yeah, microdata is just one from a whole bunch of formats Any23 can handle.
>> My reservations were how to represent the many different formats in a way
>> which would be easily navigable (is that a word?) within an index. There is
>> obviously work to be done here from my side.
>>
>> Thanks
>>
>> Lewis
>>
>>
>


-- 
*Lewis*

Fwd: NUTCH-1129

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Guys,

A rather interesting discussion has emerged over on dev@nutch regarding
building the Any23 Nutch plugin[0], please see Sebastian Nagel's comments
below for the most recent contribution... which got me thinking more about
it today. I would advise you to maybe read over the short conversation
before reading on as it's better in context :0)

The overwhelming majority of Nutch users build searchable Solr indexes from
the content they retrieve via Nutch, therefore we're looking to build a
plugin solution which does a double task
1) Tika wrapped Any23 Parser plugin - enabling us to use core Any23 parsers
for extraction.
2) An HtmlIndexingFilter - enabling us to process the triples and to get
them into a Solr index in such a way which is easily searchable via fields.

As we discussed, and as Sebastian graphically highlights below, this is not
clear cut, so I wanted to hear anyone's thoughts/input on building 2)
before I begin.

Thanks in advance

Lewis

[0] http://www.mail-archive.com/dev%40nutch.apache.org/msg07104.html

---------- Forwarded message ----------
From: Sebastian Nagel <wa...@googlemail.com>
Date: Tue, Apr 17, 2012 at 11:19 PM
Subject: Re: NUTCH-1129
To: dev@nutch.apache.org

>> Well, we could easily use certain microdata key/value pairs in our results
>> to greatly improve search and navigation.

Microdata is a good show-case for the Any23 plugin.

Another example would be semantic markup in shops.
Any23 already does a good job in extracting the semantic content:

 $ any23tools Rover \
  'http://www.shopforia.com/cgi-bin/apf4/apf4.cgi?Operation=ItemLookup&ItemId=B007P4VOWC'

The question is how to map triples to key-value pairs (NutchFields)
in a straight-forward but configurable way.
The triples
 <#Offering_0635753498301> <#hasPriceSpecification> <#UnitPriceSpecification> .
 <#UnitPriceSpecification> <#hasCurrencyValue> "249.99"^^<#float> ;
       <#hasCurrency> "USD"^^<#string> ;
       ... .
and the pair
 price = 249.99 USD
are the same information. Nutch (or Solr etc.) require the latter form
if you want to set up a shop search. But conversion is not as simple
(maybe I'm wrong?):
 - information may be spread over several triples
 - there may be multiple products per document
  (same predicate for different subjects) => use sub-documents?

Sebastian

On 04/17/2012 08:05 PM, Lewis John Mcgibbney wrote:

> Hi Markus,
>
> On Tue, Apr 17, 2012 at 12:21 PM, Markus Jelsma
> <ma...@openindex.io> wrote:
>
>> You did indeed suggest that. However, if building a wrapper is fairly
>> straightforward then it may not be a bad idea. I haven't seen any hint of
>> Tika having Any23 on-board any time soon so we might have to wait a very
>> long time if we want to rely on Tika.
>>
>>
> Yeah +1. As I explained to Julien we are some way from thinking about
> integration into Tika and subsequently writing the parser implementation(s)
> for use within Tika.
>
>
>
>>
>> Well, we could easily use certain microdata key/value pairs in our results
>> to greatly improve search and navigation.
>>
>>
> Yeah, microdata is just one from a whole bunch of formats Any23 can handle.
> My reservations were how to represent the many different formats in a way
> which would be easily navigable (is that a word?) within an index. There is
> obviously work to be done here from my side.
>
> Thanks
>
> Lewis
>
>

-- 
*Lewis*

Re: NUTCH-1129

Posted by Sebastian Nagel <wa...@googlemail.com>.

 >> Well, we could easily use certain microdata key/value pairs in our results
 >> to greatly improve search and navigation.

Microdata is a good show-case for the Any23 plugin.

Another example would be semantic markup in shops.
Any23 already does a good job in extracting the semantic content:

  $ any23tools Rover \
    'http://www.shopforia.com/cgi-bin/apf4/apf4.cgi?Operation=ItemLookup&ItemId=B007P4VOWC'

The question is how to map triples to key-value pairs (NutchFields)
in a straight-forward but configurable way.
The triples
  <#Offering_0635753498301> <#hasPriceSpecification> <#UnitPriceSpecification> .
  <#UnitPriceSpecification> <#hasCurrencyValue> "249.99"^^<#float> ;
         <#hasCurrency> "USD"^^<#string> ;
	... .
and the pair
  price = 249.99 USD
are the same information. Nutch (or Solr etc.) require the latter form
if you want to set up a shop search. But conversion is not as simple
(maybe I'm wrong?):
  - information may be spread over several triples
  - there may be multiple products per document
    (same predicate for different subjects) => use sub-documents?
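
To make the mapping problem concrete, here is a minimal sketch (in Python
for brevity; it uses neither the Any23 nor the Nutch API). It assumes the
triples have already been extracted as plain (subject, predicate, object)
tuples. FIELD_RULES, index_triples, follow and to_sub_documents are
hypothetical names invented for this illustration: the rule table says which
predicate paths are joined into one field, and every subject that carries
the root predicate becomes its own sub-document.

```python
from collections import defaultdict

# Hypothetical configuration: field name -> predicate paths whose values
# are joined (in order) into a single flat field value.
FIELD_RULES = {
    "price": [
        ("hasPriceSpecification", "hasCurrencyValue"),
        ("hasPriceSpecification", "hasCurrency"),
    ],
}


def index_triples(triples):
    """Build a subject -> predicate -> [objects] lookup."""
    graph = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        graph[subject][predicate].append(obj)
    return graph


def follow(graph, start, path):
    """Follow a predicate path from `start`; return every value reached."""
    nodes = [start]
    for predicate in path:
        nodes = [o for n in nodes for o in graph.get(n, {}).get(predicate, [])]
    return nodes


def to_sub_documents(triples, root_predicate="hasPriceSpecification"):
    """Return one flat key-value document per subject with `root_predicate`."""
    graph = index_triples(triples)
    docs = {}
    for subject, predicates in graph.items():
        if root_predicate not in predicates:
            continue  # not a product-like subject under this assumption
        fields = {}
        for field, paths in FIELD_RULES.items():
            parts = [v for path in paths for v in follow(graph, subject, path)]
            if parts:
                fields[field] = " ".join(parts)
        docs[subject] = fields
    return docs


# The triples from the example above, with datatypes dropped for simplicity.
SAMPLE = [
    ("#Offering_0635753498301", "hasPriceSpecification", "#UnitPriceSpecification"),
    ("#UnitPriceSpecification", "hasCurrencyValue", "249.99"),
    ("#UnitPriceSpecification", "hasCurrency", "USD"),
]

print(to_sub_documents(SAMPLE))
# {'#Offering_0635753498301': {'price': '249.99 USD'}}
```

Keying the result by subject is what keeps several products on one page
apart. Whether those entries should become separate index documents or
multi-valued fields is exactly the open question.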

Sebastian

On 04/17/2012 08:05 PM, Lewis John Mcgibbney wrote:
> Hi Markus,
>
> On Tue, Apr 17, 2012 at 12:21 PM, Markus Jelsma
> <ma...@openindex.io> wrote:
>
>> You did indeed suggest that. However, if building a wrapper is fairly
>> straightforward then it may not be a bad idea. I haven't seen any hint of
>> Tika having Any23 on-board any time soon so we might have to wait a very
>> long time if we want to rely on Tika.
>>
>
> Yeah +1. As I explained to Julien we are some way from thinking about
> integration into Tika and subsequently writing the parser implementation(s)
> for use within Tika.
>
>
>>
>>
>> Well, we could easily use certain microdata key/value pairs in our results
>> to greatly improve search and navigation.
>>
>
> Yeah, microdata is just one from a whole bunch of formats Any23 can handle.
> My reservations were how to represent the many different formats in a way
> which would be easily navigable (is that a word?) within an index. There is
> obviously work to be done here from my side.
>
> Thanks
>
> Lewis
>

Re: NUTCH-1129

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Markus,

On Tue, Apr 17, 2012 at 12:21 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> You did indeed suggest that. However, if building a wrapper is fairly
> straightforward then it may not be a bad idea. I haven't seen any hint of
> Tika having Any23 on-board any time soon so we might have to wait a very
> long time if we want to rely on Tika.
>

Yeah +1. As I explained to Julien we are some way from thinking about
integration into Tika and subsequently writing the parser implementation(s)
for use within Tika.

>
>
> Well, we could easily use certain microdata key/value pairs in our results
> to greatly improve search and navigation.
>

Yeah, microdata is just one from a whole bunch of formats Any23 can handle.
My reservations were how to represent the many different formats in a way
which would be easily navigable (is that a word?) within an index. There is
obviously work to be done here from my side.

Thanks

Lewis

Re: NUTCH-1129

Posted by Markus Jelsma <ma...@openindex.io>.

Hi guys,

On Tuesday 17 April 2012 13:03:16 Julien Nioche wrote:
> Hi Lewis
> 
> > 1) Why are Tika methods not declared as public?
> 
> 
> why should they? If they are not used outside the class then I think it is
> good practice to keep them private. Since TikaParser implements Parser the
> only method that we can expect to be called is getParse() and it is public
> 
> > 2) Why doesn't TikaParser ship with a main method?
> 
> it could have one for testing but IMHO using the ParserChecker is a better
> way of testing as it is closer to real use
> 
> > 3) We previously discussed implementing the Any23 parser plugin as a tika
> > wrapper, therefore it would look very similar to parse-tika?
> 
> I don't remember this but I remember suggesting that the Any23 parser
> should be a tika parser which is not the same as a Tika wrapper. I expect
> other people in Tika-land to have a use for it, and we'd get the benefit of
> it automatically with parse-tika

You did indeed suggest that. However, if building a wrapper is fairly 
straightforward then it may not be a bad idea. I haven't seen any hint of Tika 
having Any23 on-board any time soon so we might have to wait a very long time 
if we want to rely on Tika.

> 
> > 4) I like the look of the feed plugin where it also ships with a custom
> > indexingfilter implementation, my thoughts were also to provide a custom
> > Any23IndexingFilter implementation?
> 
> Depends on what you want to do? What would we get out of Any23? How would
> that be used on the search side?

Well, we could easily use certain microdata key/value pairs in our results to 
greatly improve search and navigation.

Thanks
Markus

> 
> 
> Thanks
> 
> Julien

-- 
Markus Jelsma - CTO - Openindex

Re: NUTCH-1129

Posted by Julien Nioche <li...@gmail.com>.

Hi Lewis

> 1) Why are Tika methods not declared as public?
>

why should they? If they are not used outside the class then I think it is
good practice to keep them private. Since TikaParser implements Parser the
only method that we can expect to be called is getParse() and it is public


> 2) Why doesn't TikaParser ship with a main method?
>

it could have one for testing but IMHO using the ParserChecker is a better
way of testing as it is closer to real use


> 3) We previously discussed implementing the Any23 parser plugin as a tika
> wrapper, therefore it would look very similar to parse-tika?
>

I don't remember this but I remember suggesting that the Any23 parser
should be a tika parser which is not the same as a Tika wrapper. I expect
other people in Tika-land to have a use for it, and we'd get the benefit of
it automatically with parse-tika


> 4) I like the look of the feed plugin where it also ships with a custom
> indexingfilter implementation, my thoughts were also to provide a custom
> Any23IndexingFilter implementation?
>

Depends on what you want to do? What would we get out of Any23? How would
that be used on the search side?


Thanks

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: NUTCH-1129

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Chris,

On Tue, Apr 17, 2012 at 3:20 PM, Mattmann, Chris A (388J) <
chris.a.mattmann@jpl.nasa.gov> wrote:

>
> I think it would be super awesome to add the Any23 parsing functionality
> as a Tika parser, and potentially
> an extension to the MIME repository to detect microformats, etc. Then in
> Nutch, we could take advantage of
> the any23 parser with the existing tika-parser interface.
>
> Thoughts?
>

Well on top of what I've managed to ramble on elsewhere, I think that this
utopian vision is something I will definitely be pushing for. It makes
perfect sense but I think it's a case of Any23 maturing within the
incubator before we can push it up to the Tika PMC for this stuff. It would
be a win win situation as Nutch would benefit also.

For the time being I think a step back (a Tika wrapper plugin implementing
the Any23 functionality) would be a good start. I'll be making headway on
this over the next while, so I will keep you all up to date with it.

Thanks
Lewis

Re: NUTCH-1129

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.

Hey Lewis,

On Apr 17, 2012, at 3:35 AM, Lewis John Mcgibbney wrote:

> 3) We previously discussed implementing the Any23 parser plugin as a tika wrapper, therefore it would look very similar to parse-tika?

I think it would be super awesome to add the Any23 parsing functionality as a Tika parser, and potentially
an extension to the MIME repository to detect microformats, etc. Then in Nutch, we could take advantage of
the any23 parser with the existing tika-parser interface.
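
As a rough illustration of the detection idea only (a Python sketch, not
Tika's MIME repository mechanism or any Any23 API), a page could be sniffed
for embedded structured markup before deciding whether to hand it to an
Any23-style parser. HTML microdata is used as the example here because it is
marked by the itemscope/itemprop/itemtype attributes; microformats, which
rely on class names, would need a different check. MicrodataSniffer and
has_microdata are hypothetical names.

```python
from html.parser import HTMLParser


class MicrodataSniffer(HTMLParser):
    """Records whether any start tag carries an HTML microdata attribute."""

    MARKERS = {"itemscope", "itemprop", "itemtype"}

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased
        # and valueless attributes such as itemscope have value None.
        if any(name in self.MARKERS for name, _ in attrs):
            self.found = True


def has_microdata(html):
    """Return True if the HTML string contains microdata attributes."""
    sniffer = MicrodataSniffer()
    sniffer.feed(html)
    sniffer.close()
    return sniffer.found


product = ('<div itemscope itemtype="http://schema.org/Product">'
           '<span itemprop="name">Widget</span></div>')
print(has_microdata(product))               # True
print(has_microdata('<p class="x">hi</p>'))  # False
```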

Thoughts?

Thanks!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++