Posted to dev@nutch.apache.org by Jérôme Charron <je...@gmail.com> on 2005/11/24 00:01:54 UTC

[proposal] Generic Markup Language Parser

Hi,

We (Chris Mattmann, François Martelet, Sébastien Le Callonnec and I) just
added a new proposal on the nutch Wiki:
http://wiki.apache.org/nutch/MarkupLanguageParserProposal

Here is the Summary of Issue:
"Currently, Nutch provides some specific markup language parsing plugins:
one for handling HTML, another one for RSS, but no generic XML parsing
plugin. This is extremely cumbersome as adding support for a new markup
language implies that you have to develop the whole XML parsing code from
scratch. This methodology causes: (1) code duplication, with little or no
reuse of common pieces of XML parsing code, and (2) dependency library
duplication, where many XML parsing plugins may rely on similar xml parsing
libraries, such as jaxen, or jdom, or dom4j, etc., but each parsing plugin
keeps its own local copy of these libraries. It is also very difficult to
identify precisely the type of XML content encountered during a parse. That
difficult issue is outside the scope of this proposal, and will be
identified in a future proposal."
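
To make the duplication concrete, here is a minimal sketch (an illustration,
not code from the proposal; the input file name is made up) of the DOM
bootstrap that every markup parsing plugin currently repeats before its
format-specific tree-walking, each plugin bundling its own XML library:

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    public class PerPluginBoilerplate {
        public static void main(String[] args) throws Exception {
            // The step every markup plugin re-implements: bytes -> DOM.
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse("fetched.xml"); // hypothetical input
            // Only this part should differ per format: walking the tree.
            System.out.println("root: " + doc.getDocumentElement().getTagName());
        }
    }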

Thanks for your feedback, comments, suggestions (and votes).

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: [proposal] Generic Markup Language Parser

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> Gentlemen, please let's keep a civilized tone to this exchange, or take 
> it off the list.

+1

Doug

Re: [proposal] Generic Markup Language Parser

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hello,
I do agree with Andrzej. I do not see it as a solution for
parse-html. But a generic XML plugin may well have some use for some
people (even if not for me).
Regards
Piotr


Andrzej Bialecki wrote:
> Stefan Groschupf wrote:
> 
> [...]
> 
> Gentlemen, please let's keep a civilized tone to this exchange, or take 
> it off the list.
> 
> I applaud this effort, I can certainly sympathize with its goals - just 
> the other day I struggled with parsing an XML feed into Nutch segments. 
> It would be very welcome to have a generic platform to handle all kinds 
> of XML input and a way to express mappings from any XML schema to the
> standard metadata used in Nutch.
> 
> You don't have to use XSL to accomplish this - an XPath processor would
> do fine in many cases. Even if you use XSL and avoid certain costly
> constructs, you can keep decent performance, with the benefit of the
> flexibility and standards-compliance that come with XSL (people already
> know how to use it).
> 
> At the same time I see little benefit of creating an intermediate XML - 
> as soon as the data extraction is completed the same information can be 
> passed perfectly well using the Nutch internal classes (ParseImpl and
> friends) - unless you want to replace the original Content in segments 
> with this intermediate XML.
> 
> I also don't think this solution would be suitable for parse-html, where
> top-notch performance is crucial and where by default we have to
> deal with invalid or even non-well-formed documents - and fixing,
> parsing and extracting in one step, as we do it today, seems to be the 
> most efficient way to go. So, I very much doubt you will be able to get 
> the same performance if you use your approach.
> 
> So, if you add this as a generic parse-xml framework, to be used where 
> it makes sense in terms of flexibility and performance - I think this 
> would change very little for those who are not interested in XML 
> content, but it would be a big help for those who have to deal with it.
> 


Re: [proposal] Generic Markup Language Parser

Posted by Andrzej Bialecki <ab...@getopt.org>.
Stefan Groschupf wrote:

[...]

Gentlemen, please let's keep a civilized tone to this exchange, or take 
it off the list.

I applaud this effort, I can certainly sympathize with its goals - just 
the other day I struggled with parsing an XML feed into Nutch segments. 
It would be very welcome to have a generic platform to handle all kinds 
of XML input and a way to express mappings from any XML schema to the
standard metadata used in Nutch.

You don't have to use XSL to accomplish this - an XPath processor would
do fine in many cases. Even if you use XSL and avoid certain costly
constructs, you can keep decent performance, with the benefit of the
flexibility and standards-compliance that come with XSL (people already
know how to use it).
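
As a rough sketch of the XPath route (illustrative only; the file name and
expressions are made up, using the JDK 5 javax.xml.xpath API):

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;

    public class XPathMapping {
        public static void main(String[] args) throws Exception {
            // Parse the fetched XML once...
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse("feed.xml");
            XPath xpath = XPathFactory.newInstance().newXPath();
            // ...then express each schema-to-metadata mapping as a plain
            // XPath expression instead of a full XSL stylesheet.
            String title = xpath.evaluate("/rss/channel/title", doc);
            String link = xpath.evaluate("/rss/channel/link", doc);
            System.out.println(title + " <" + link + ">");
        }
    }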

At the same time I see little benefit of creating an intermediate XML - 
as soon as the data extraction is completed the same information can be 
passed perfectly well using the Nutch internal classes (ParseImpl and
friends) - unless you want to replace the original Content in segments 
with this intermediate XML.

I also don't think this solution would be suitable for parse-html, where
top-notch performance is crucial and where by default we have to
deal with invalid or even non-well-formed documents - and fixing,
parsing and extracting in one step, as we do it today, seems to be the 
most efficient way to go. So, I very much doubt you will be able to get 
the same performance if you use your approach.

So, if you add this as a generic parse-xml framework, to be used where 
it makes sense in terms of flexibility and performance - I think this 
would change very little for those who are not interested in XML 
content, but it would be a big help for those who have to deal with it.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [proposal] Generic Markup Language Parser

Posted by Stefan Groschupf <sg...@media-style.com>.
> speed == scalability ????
> Oh damn, is it a new theory, Stefan?
No? How will you run a search engine that is able to scale up to a
billion pages if it can only parse 20 pages per second?
Do you have unlimited hardware resources? Let me know, I would be
interested to join your project.
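
A back-of-envelope sketch makes the scale point (the 20 pages/second figure
is the one quoted above; everything else is illustrative):

    public class ParseBudget {
        public static void main(String[] args) {
            long pages = 1000000000L;     // a billion pages
            double pagesPerSecond = 20.0; // the parse rate quoted above
            double days = pages / pagesPerSecond / 86400.0;
            // Prints roughly 579: about a year and a half of parsing
            // on a single node, before any fetching or indexing.
            System.out.printf("%.0f days on one node%n", days);
        }
    }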

>
> Yes, since I have corrected many bugs in the plugin system (not  
> yours I
> hope), I clearly understand how it works, and what its goal is...
>  ;-)
>
I'm not following the contribution mails that closely, but wasn't your
last contribution
http://issues.apache.org/jira/browse/NUTCH-10?
Wasn't the language detection performance problem reported by one of the
community?
And I may be wrong, but wasn't there a zip parser plugin contributed
to nutch back in 2003?
Sorry, I am unable to browse the old SourceForge mailing list archive.


>> P.S. Do you think it makes sense to run another public nutch mailing
>> list, since 'THE nutch [...]' (mailing list is
>> nutch-dev@lucene.apache.org), 'Isn't it?'
>> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg01513.html
>
> Is there another public nutch mailing list somewhere Stefan?
> Please give me the address...
>
Sure, maybe it is interesting for you:
http://fr.groups.yahoo.com/group/frutch/
It's a French-speaking workgroup, and I'm pretty sure they are
nutch related, since
they mention that in the description and use the nutch figure in
the logo as well.

HTH,
Stefan 

unsubscribe me please

Posted by Keith Campbell <ke...@mac.com>.
 
On Thursday, November 24, 2005, at 04:03PM, Jérôme Charron <je...@gmail.com> wrote:

>> Over the last years there is one thing I have noticed that matters in a search
>> engine - minimalism.
>
>If you are honest, Stefan, take a closer look at the end of the proposal
>(here is a copy):
>Issues
>
>Create performance benchmarks and ensure that the new implementation gives
>at least the same performance as the parse-html plugin (the most used parse
>plugin in whole-web crawling)
>
>> Minimalism.
>> Minimalism == speed, speed == scalability,
>
>speed == scalability ????
>Oh damn, is it a new theory, Stefan?
>
>
>> scalability == serious
>
>high availability == serious (too)
>monitoring == serious (too)
>there is a lot of serious stuff you know, and I really think that
>features == serious (too)
>
>> I don't think it would be a good move to slow down html parsing (most
>> used parser) to make rss parser writing easier for developers.
>
>One more time: take a closer look at the proposal. The idea is to provide a
>convenient way to add markup-language-related plugins (you know, rss and atom
>are the first steps toward more structured content... more is to come), not
>to replace the existing html and rss ones if their performance is better.
>Adapting the html and rss parsers to the proposal is just for architectural
>"beauty", but it is not mandatory.
>You know, actually, Nutch is widely used for thematic and intranet search
>engines. And in such a context this proposal really makes sense (in such a
>context it makes sense to have a protocol-jdbc plugin, for instance).
>
>> From my perspective we have much more general things to solve in
>> nutch (manageability, monitoring, ndfs block based task-routing, more
>> dynamic search servers) than improving things we already have.
>
>It's your point of view.
>You know, I think there is something magic about nutch: people are
>focused on different subjects.
>Some are more focused on infrastructure, others on parsing, others
>on language technology...
>That's a big chance for nutch... our complementarity...
>(but it's true, the subjects you mentioned are some very interesting
>improvements, especially monitoring. A product deployed on many nodes
>cannot be serious if there is no way to monitor the whole system).
>
>
>> Anyway as you may know we have a plugin system and one goal of the
>> plugin system is to give developers the freedom to develop custom
>> plugins. :-)
>
>Yes, since I have corrected many bugs in the plugin system (not yours I
>hope), I clearly understand how it works, and what its goal is...
> ;-)
>
>> P.S. Do you think it makes sense to run another public nutch mailing
>> list, since 'THE nutch [...]' (mailing list is
>> nutch-dev@lucene.apache.org), 'Isn't it?'
>> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg01513.html
>
>Is there another public nutch mailing list somewhere Stefan?
>Please give me the address...
>
>Best Regards
>
>Jérôme
>
>--
>http://motrech.free.fr/
>http://www.frutch.org/
>
>

Re: [proposal] Generic Markup Language Parser

Posted by Jérôme Charron <je...@gmail.com>.
> Over the last years there is one thing I have noticed that matters in a search
> engine - minimalism.

If you are honest, Stefan, take a closer look at the end of the proposal
(here is a copy):
Issues

Create performance benchmarks and ensure that the new implementation gives
at least the same performance as the parse-html plugin (the most used parse
plugin in whole-web crawling)

> Minimalism.
> Minimalism == speed, speed == scalability,

speed == scalability ????
Oh damn, is it a new theory, Stefan?


> scalability == serious

high availability == serious (too)
monitoring == serious (too)
there is a lot of serious stuff you know, and I really think that
features == serious (too)

> I don't think it would be a good move to slow down html parsing (most
> used parser) to make rss parser writing easier for developers.

One more time: take a closer look at the proposal. The idea is to provide a
convenient way to add markup-language-related plugins (you know, rss and atom
are the first steps toward more structured content... more is to come), not
to replace the existing html and rss ones if their performance is better.
Adapting the html and rss parsers to the proposal is just for architectural
"beauty", but it is not mandatory.
You know, actually, Nutch is widely used for thematic and intranet search
engines. And in such a context this proposal really makes sense (in such a
context it makes sense to have a protocol-jdbc plugin, for instance).

> From my perspective we have much more general things to solve in
> nutch (manageability, monitoring, ndfs block based task-routing, more
> dynamic search servers) than improving things we already have.

It's your point of view.
You know, I think there is something magic about nutch: people are
focused on different subjects.
Some are more focused on infrastructure, others on parsing, others
on language technology...
That's a big chance for nutch... our complementarity...
(but it's true, the subjects you mentioned are some very interesting
improvements, especially monitoring. A product deployed on many nodes
cannot be serious if there is no way to monitor the whole system).


> Anyway as you may know we have a plugin system and one goal of the
> plugin system is to give developers the freedom to develop custom
> plugins. :-)

Yes, since I have corrected many bugs in the plugin system (not yours I
hope), I clearly understand how it works, and what its goal is...
 ;-)

> P.S. Do you think it makes sense to run another public nutch mailing
> list, since 'THE nutch [...]' (mailing list is
> nutch-dev@lucene.apache.org), 'Isn't it?'
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg01513.html

Is there another public nutch mailing list somewhere Stefan?
Please give me the address...

Best Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: [proposal] Generic Markup Language Parser

Posted by Stefan Groschupf <sg...@media-style.com>.
> Correct me if I'm wrong, but isn't log4j used a lot within Nutch? :-)
No, nutch uses java logging; only some plugins use jars that depend
on log4j.

Stefan 

RE: [proposal] Generic Markup Language Parser

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Stefan, and Jerome,

> A mail archive is an amazing source of information, isn't it?! :-)
> To answer your question, just ask yourself how many pages per second
> you plan to fetch and parse, and how many queries per second a lucene
> index is able to handle - and can deliver in the ui.
> I have here something like 200++ pages per second to a maximum of 20
> queries per second.
> http://wiki.apache.org/nutch/HardwareRequirements

I'm not sure that our proposal really affects the ui at all. Parsing occurs
only during a fetch, which creates the index for the ui, no? So why mention
the number of queries per second that the ui can handle?

> 
> Speed improvements in the ui can be done by caching components you use to
> assemble the ui. "There are some ways to improve speed."
> But seriously, I don't think there will be any pages that contain
> 'cacheable' items until parsing.
> Over the last years there is one thing I have noticed that matters in a search
> engine - minimalism.
> There is no usage in nutch of a logging library,

Correct me if I'm wrong, but isn't log4j used a lot within Nutch? :-)

> no RMI and no meta
> data in the web db. Why?
> Minimalism.
> Minimalism == speed, speed == scalability, scalability == serious
> enterprise search engine projects.
> 
> I don't think it would be a good move to slow down html parsing (most
> used parser) to make rss parser writing easier for developers.

This proposal isn't meant just for RSS; that's seriously constraining the
scope. The proposal is meant to make writing * XML * parsers easier. Note the
"XML". RSS is just a small subset of XML as a whole. And there currently
exists no default support for generic XML documents in Nutch.


> BTW, we already have an html and a feed parser that work, as far as I know.
> I guess 90 % of the nutch users use the html parser but only 10 % the
> feed-parser (since blogs are mostly html as well).

This may or may not be true; however, I wouldn't be surprised if it were,
because it is representative of the division of content on the web -- HTML
is definitely orders of magnitude more pervasive than RSS.

> 
> From my perspective we have much more general things to solve in
> nutch (manageability, monitoring, ndfs block based task-routing, more
> dynamic search servers) than improving things we already have.

I would tend to agree with Jerome on this one -- these seem to be the items
on your agenda: a representative set indeed, but by no means an exhaustive
set of what's needed to improve and benefit Nutch. One of the motivations
behind our proposal was several emails posted to the Nutch list by users
interested in crawling blogs and RSS:

http://www.opensubscriber.com/message/nutch-general@lists.sourceforge.net/2369417.html

One of my replies to this thread was a message on October 19th, 2005, which
really identified the main problem:

http://www.opensubscriber.com/message/nutch-general@lists.sourceforge.net/2369576.html

There is a lack of a general XML parser in Nutch that would allow it to deal
with general XML content based on user-defined schemas and DTDs. Our
proposal would be the initial step towards a solution to this overall
problem. At least, that's part of its intention.


> Anyway as you may know we have a plugin system and one goal of the
> plugin system is to give developers the freedom to develop custom
> plugins. :-)

Indeed. And our goal is to help developers in their endeavors by providing a
starting point and a generic solution for XML-based parsing plugins :-)

Cheers,
  Chris


> 
> Cheers,
> Stefan
> B-)
> 
> P.S. Do you think it makes sense to run another public nutch mailing
> list, since 'THE nutch [...]' (mailing list is
> nutch-dev@lucene.apache.org), 'Isn't it?'
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg01513.html
> 
> 
> 
> On 24.11.2005 at 19:28, Jérôme Charron wrote:
> 
> > Hi Stefan,
> >
> > And thanks for taking time to read the doc and giving us your
> > feedback.
> >
> >> -1!
> >> Xsl is terribly slow!
> >> Xml will blow up memory and storage usage.
> >
> > But there is still something I don't understand...
> > Regarding a previous discussion we had about the use of OpenSearch
> > API to
> > replace Servlet => HTML by Servlet => XML => HTML (using xsl),
> > here is a copy of one of my comment:
> >
> > In my opinion, it is the front-end "dreamed" architecture. But more
> > pragmatically, I'm not sure it's a good idea. XSL transformation is a
> > rather slow process!! And the Nutch front-end must be very responsive.
> >
> > and then your response, and Doug's response too:
> > Stefan:
> > We have already done experiments using XSLT.
> > There are some ways to improve speed; however, it is 20++% slower
> > than jsp.
> > Doug:
> > I don't think this would make a significant impact on overall Nutch
> > search
> > performance.
> > (the complete thread is available at
> > http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg03811.html
> > )
> >
> > I'm a little bit confused... why must the use of xsl be considered
> > too time and memory expensive in the back-end process,
> > but not in the front-end?
> >
> >> Dublin core may be good for the semantic web, but not for content
> >> storage.
> >
> > It is not used as content storage, but just as an intermediate
> > step: the
> > output of the xsl transformation will then be indexed using
> > standard
> > nutch APIs.
> > (notice that this xml file schema maps perfectly to the Parse and
> > ParseData
> > objects)
> >
> >
> >> In general the goal must be to minimize memory usage and improve
> >> performance; such a parser would increase memory usage and definitely
> >> slow down parsing.
> >
> > Not improving the flexibility, extensibility and features?
> >
> > Jérôme
> >
> > --
> > http://motrech.free.fr/
> > http://www.frutch.org/


Re: [proposal] Generic Markup Language Parser

Posted by Stefan Groschupf <sg...@media-style.com>.
Jérôme,

A mail archive is an amazing source of information, isn't it?! :-)
To answer your question, just ask yourself how many pages per second
you plan to fetch and parse, and how many queries per second a lucene
index is able to handle - and can deliver in the ui.
I have here something like 200++ pages per second to a maximum of 20
queries per second.
http://wiki.apache.org/nutch/HardwareRequirements

Speed improvements in the ui can be done by caching components you use to
assemble the ui. "There are some ways to improve speed."
But seriously, I don't think there will be any pages that contain
'cacheable' items until parsing.
Over the last years there is one thing I have noticed that matters in a search
engine - minimalism.
There is no usage in nutch of a logging library, no RMI and no meta
data in the web db. Why?
Minimalism.
Minimalism == speed, speed == scalability, scalability == serious  
enterprise search engine projects.

I don't think it would be a good move to slow down html parsing (most
used parser) to make rss parser writing easier for developers.
BTW, we already have an html and a feed parser that work, as far as I know.
I guess 90 % of the nutch users use the html parser but only 10 % the
feed-parser (since blogs are mostly html as well).

From my perspective we have much more general things to solve in
nutch (manageability, monitoring, ndfs block based task-routing, more
dynamic search servers) than improving things we already have.
Anyway as you may know we have a plugin system and one goal of the  
plugin system is to give developers the freedom to develop custom  
plugins. :-)

Cheers,
Stefan
B-)

P.S. Do you think it makes sense to run another public nutch mailing
list, since 'THE nutch [...]' (mailing list is
nutch-dev@lucene.apache.org), 'Isn't it?'
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg01513.html



On 24.11.2005 at 19:28, Jérôme Charron wrote:

> Hi Stefan,
>
> And thanks for taking time to read the doc and giving us your  
> feedback.
>
>> -1!
>> Xsl is terribly slow!
>> Xml will blow up memory and storage usage.
>
> But there is still something I don't understand...
> Regarding a previous discussion we had about the use of OpenSearch  
> API to
> replace Servlet => HTML by Servlet => XML => HTML (using xsl),
> here is a copy of one of my comments:
>
> In my opinion, it is the front-end "dreamed" architecture. But more
> pragmatically, I'm not sure it's a good idea. XSL transformation is a
> rather slow process!! And the Nutch front-end must be very responsive.
>
> and then your response, and Doug's response too:
> Stefan:
> We have already done experiments using XSLT.
> There are some ways to improve speed; however, it is 20++% slower
> than jsp.
> Doug:
> I don't think this would make a significant impact on overall Nutch  
> search
> performance.
> (the complete thread is available at
> http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg03811.html
> )
>
> I'm a little bit confused... why must the use of xsl be considered
> too time and memory expensive in the back-end process,
> but not in the front-end?
>
>> Dublin core may be good for the semantic web, but not for content
>> storage.
>
> It is not used as content storage, but just as an intermediate
> step: the
> output of the xsl transformation will then be indexed using
> standard
> nutch APIs.
> (notice that this xml file schema maps perfectly to the Parse and
> ParseData
> objects)
>
>
>> In general the goal must be to minimize memory usage and improve
>> performance; such a parser would increase memory usage and definitely
>> slow down parsing.
>
> Not improving the flexibility, extensibility and features?
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/


Re: [proposal] Generic Markup Language Parser

Posted by Jérôme Charron <je...@gmail.com>.
Hi Stefan,

And thanks for taking time to read the doc and giving us your feedback.

> -1!
> Xsl is terribly slow!
> Xml will blow up memory and storage usage.

But there is still something I don't understand...
Regarding a previous discussion we had about the use of OpenSearch API to
replace Servlet => HTML by Servlet => XML => HTML (using xsl),
here is a copy of one of my comments:

In my opinion, it is the front-end "dreamed" architecture. But more
pragmatically, I'm not sure it's a good idea. XSL transformation is a
rather slow process!! And the Nutch front-end must be very responsive.

and then your response, and Doug's response too:
Stefan:
We have already done experiments using XSLT.
There are some ways to improve speed; however, it is 20++% slower than jsp.
Doug:
I don't think this would make a significant impact on overall Nutch search
performance.
(the complete thread is available at
http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg03811.html
)

I'm a little bit confused... why must the use of xsl be considered too
time and memory expensive in the back-end process,
but not in the front-end?

> Dublin core may be good for the semantic web, but not for content storage.

It is not used as content storage, but just as an intermediate step: the
output of the xsl transformation will then be indexed using the standard
nutch APIs.
(notice that this xml file schema maps perfectly to the Parse and ParseData
objects)
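
A minimal sketch of that mapping (an illustration; it assumes the 0.7-era
constructors ParseData(title, outlinks, metadata) and
ParseImpl(text, parseData), and elides outlinks):

    import java.util.Properties;
    import org.apache.nutch.parse.Outlink;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.parse.ParseImpl;

    public class ParseOutMapping {
        // Fields extracted from the intermediate xml file go straight
        // into the standard nutch parse objects (constructor signatures
        // assumed from the 0.7-era API).
        static Parse toParse(String title, String text, Properties meta) {
            Outlink[] outlinks = new Outlink[0]; // elided in this sketch
            return new ParseImpl(text, new ParseData(title, outlinks, meta));
        }
    }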


> In general the goal must be to minimize memory usage and improve
> performance; such a parser would increase memory usage and definitely
> slow down parsing.

Not improving the flexibility, extensibility and features?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser

Posted by Jérôme Charron <je...@gmail.com>.
> Are we talking about parsing rdf, or about storing parsed html
> text in rdf and converting it via xslt to pure text?
> I may be misunderstanding something. I very much like the idea of a general
> rdf parser. Back in the day I played around with jena.sf.net.
> Parsing yes; replacing the nutch sequence file and the concept of Writables
> with xml is, from my point of view, a bad idea.

One more time: please read the proposal again, along with my responses.
The proposal doesn't suggest replacing the way data is stored in Nutch.
It is just a proposal for a generic xml parser (as the title suggests).


> :-) I'm the last one to inhibit innovation, but I would love to see
> nutch able to parse billions of pages.

Today, parsing billions of pages is not the only challenge for search engines
(look at Google, which no longer displays the number of indexed pages).
Parsing many content types, and the language technologies (language-specific
stemming, analysis, querying, summarization, ...), are some of the other new
challenges...
The "low level" challenges are important, but they must not be a brake on the
"high level" processes.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser

Posted by Stefan Groschupf <sg...@media-style.com>.
On 25.11.2005 at 11:30, Erik Hatcher wrote:

>
> On 24 Nov 2005, at 23:49, Chris Mattmann wrote:
>>> Dublin core may be good for the semantic web, but not for content
>>> storage.
>>
>> I completely disagree with that.
>
> Me too.
Are we talking about parsing rdf, or about storing parsed html
text in rdf and converting it via xslt to pure text?
I may be misunderstanding something. I very much like the idea of a general
rdf parser. Back in the day I played around with jena.sf.net.
Parsing yes; replacing the nutch sequence file and the concept of Writables
with xml is, from my point of view, a bad idea.

>
> Stefan - please don't inhibit innovation.
:-) I'm the last one to inhibit innovation, but I would love to see
nutch able to parse billions of pages.
As you can read in my last posting, I contributed the plugin system back
in the day to give freedom to all developers.

Stefan


Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On 24 Nov 2005, at 23:49, Chris Mattmann wrote:
>> Dublin core may be good for the semantic web, but not for content
>> storage.
>
> I completely disagree with that.

Me too.

> In fact, I think many people would disagree
> with it. Dublin core is a "standard" metadata model for
> electronic
> resources. It is by no means the entire spectrum of metadata that  
> could be
> stored for electronic content. However, rather than creating your own
> "author" field, or "content creator", or "document creator", or  
> whatever you
> want to call it, I think it would be nice to provide the DC  
> metadata because
> at least it is well known and provides interoperability with other  
> content
> storage systems. Check out DSpace from MIT. Check out ISO-11179  
> registry
> systems. Check out the ISO standard OAIS reference model for archiving
> systems. Each of these systems has recognized that standard  
> metadata is an
> important concern in any content management system.

Further along these lines... Nutch's instigation had a bit to do with
Google's dominance, and look where Google is headed now!  Semantic
web, oh my!  Google Base currently is just scratching the surface of
where they'll head.  Nutch could certainly be used in this sort of
space.  I was using it that way, but have currently backed off to
something much simpler to begin with: using Nutch to crawl library
archives with RDF data backing the web pages, pointed to by <link>
tags in the <head> section.  That RDF is dumped into a powerful
triplestore (Kowari), with the goal of blending structured RDF queries
with full-text queries.

I strongly suspect that there will be more efforts to tweak Nutch  
into the semantic web space.  I'd be surprised otherwise.

>> The magic word is minimalism.
>> So I vote against this suggestion!
>> Stefan
>
> In general, this proposal represents a step forward in being able  
> to parse
> generic XML content in Nutch, which is a very challenging problem.  
> Thanks
> for your suggestions, however, I think that our proposal would help  
> Nutch to
> move forward in being able to handle generic forms of XML markup content.

Stefan - please don't inhibit innovation.  Just because you don't  
agree with the approach, let them have the freedom to prove it out  
with encouragement, not negativity.  Plugins can be turned off, and  
if it isn't acceptable to be in the core then so be it, it doesn't  
even have to be an officially supported plugin.  But I, for one,  
would like to encourage them to continue on with their XML efforts  
and see where it leads.

RDF, microformats, triplestores, structured querying, faceted  
browsing.... these are the things I need, with of course full-text  
search, and this is the direction Google is headed in a major way.   
Full-text is great and all, but it's only part of the story, and a  
crude one in many respects. :)  Scraping HTML for "meaning"... insanity.

	Erik



RE: [proposal] Generic Markup Language Parser

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Stefan,


> -1!
> Xsl is terribly slow!

You have to consider what the XSL will be used for. Our proposal suggests
XSL as a means of intermediate transformation of markup content on the
"backend", as Jerome suggested in his reply. This means that whenever markup
content is encountered (specifically, XML-based content), XSL will be used
to create an intermediary "parse-out" xml file containing the fields to
index. Given the small percentage of xml-based markup content out there (of
course excluding "html") compared to regular content, I don't think this
would significantly degrade performance.
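
A minimal sketch of that step (illustration only; the stylesheet and file
names are hypothetical), using just the JDK's javax.xml.transform API:

    import java.io.File;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class ParseOutTransform {
        public static void main(String[] args) throws Exception {
            // One stylesheet per recognized schema maps the raw markup to
            // the intermediary "parse-out" file of fields to index.
            Transformer t = TransformerFactory.newInstance().newTransformer(
                    new StreamSource(new File("rss-to-parseout.xsl")));
            t.transform(new StreamSource(new File("fetched.xml")),
                        new StreamResult(new File("parse-out.xml")));
        }
    }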

> Xml will blow up memory and storage usage.

Possibly, but I would think that we would do it in a clever fashion. For
instance, the parse-out xml files would most likely be small (~kb) files
that could be deleted if space is a concern. It could be a parameterized
option. 

> Dublin core may be good for the semantic web, but not for content storage.

I completely disagree with that. In fact, I think many people would disagree
with it. Dublin core is a "standard" metadata model for electronic
resources. It is by no means the entire spectrum of metadata that could be
stored for electronic content. However, rather than creating your own
"author" field, or "content creator", or "document creator", or whatever you
want to call it, I think it would be nice to provide the DC metadata because
at least it is well known and provides interoperability with other content
storage systems. Check out DSpace from MIT. Check out ISO-11179 registry
systems. Check out the ISO standard OAIS reference model for archiving
systems. Each of these systems has recognized that standard metadata is an
important concern in any content management system.
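
For illustration only (the key names follow the Dublin Core element set; the
Properties object is just a stand-in for whatever metadata holder the plugin
uses):

    import java.util.Properties;

    public class DublinCoreFields {
        public static void main(String[] args) {
            Properties meta = new Properties();
            // One well-known name per concept, instead of ad-hoc variants
            // like "author" / "content creator" / "document creator".
            meta.setProperty("dc.title", "Generic Markup Language Parser");
            meta.setProperty("dc.creator", "Chris Mattmann");
            meta.setProperty("dc.format", "text/xml");
            System.out.println(meta);
        }
    }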

> In general the goal must be to minimize memory usage and improve
> performance; such a parser would increase memory usage and definitely
> slow down parsing.

I don't think it would slow down parsing significantly; as I mentioned above,
markup content represents a small portion of the content out there.

> The magic word is minimalism.
> So I vote against this suggestion!
> Stefan

In general, this proposal represents a step forward in being able to parse
generic XML content in Nutch, which is a very challenging problem. Thanks
for your suggestions, however, I think that our proposal would help Nutch to
move forward in being able to handle generic forms of XML markup content.


Cheers,
   Chris Mattmann

> 
> 
> 
> 
> 
> On 24.11.2005 at 00:01, Jérôme Charron wrote:
> 
> > Hi,
> >
> > We (Chris Mattmann, François Martelet, Sébastien Le Callonnec and
> > I) just
> > added a new proposal on the nutch Wiki:
> > http://wiki.apache.org/nutch/MarkupLanguageParserProposal
> >
> > Here is the Summary of Issue:
> > "Currently, Nutch provides some specific markup language parsing
> > plugins:
> > one for handling HTML, another one for RSS, but no generic XML parsing
> > plugin. This is extremely cumbersome as adding support for a new
> > markup
> > language implies that you have to develop the whole XML parsing
> > code from
> > scratch. This methodology causes: (1) code duplication, with little
> > or no
> > reuse of common pieces of XML parsing code, and (2) dependency library
> > duplication, where many XML parsing plugins may rely on similar xml
> > parsing
> > libraries, such as jaxen, or jdom, or dom4j, etc., but each parsing
> > plugin
> > keeps its own local copy of these libraries. It is also very
> > difficult to
> > identify precisely the type of XML content encountered during a
> > parse. That
> > difficult issue is outside the scope of this proposal, and will be
> > identified in a future proposal."
> >
> > Thanks for your feedback, comments, suggestions (and votes).
> >
> > Regards
> >
> > Jérôme
> >
> > --
> > http://motrech.free.fr/
> > http://www.frutch.org/


Re: [proposal] Generic Markup Language Parser

Posted by Stefan Groschupf <sg...@media-style.com>.
-1!
Xsl is terribly slow!
Xml will blow up memory and storage usage.
Dublin core may be good for the semantic web, but not for content storage.
In general the goal must be to minimize memory usage and improve
performance; such a parser would increase memory usage and definitely
slow down parsing.
The magic word is minimalism.
So I vote against this suggestion!
Stefan





On 24.11.2005 at 00:01, Jérôme Charron wrote:

> Hi,
>
> We (Chris Mattmann, François Martelet, Sébastien Le Callonnec and
> I) just
> added a new proposal on the nutch Wiki:
> http://wiki.apache.org/nutch/MarkupLanguageParserProposal
>
> Here is the Summary of Issue:
> "Currently, Nutch provides some specific markup language parsing  
> plugins:
> one for handling HTML, another one for RSS, but no generic XML parsing
> plugin. This is extremely cumbersome as adding support for a new  
> markup
> language implies that you have to develop the whole XML parsing  
> code from
> scratch. This methodology causes: (1) code duplication, with little  
> or no
> reuse of common pieces of XML parsing code, and (2) dependency library
> duplication, where many XML parsing plugins may rely on similar xml  
> parsing
> libraries, such as jaxen, or jdom, or dom4j, etc., but each parsing  
> plugin
> keeps its own local copy of these libraries. It is also very  
> difficult to
> identify precisely the type of XML content encountered during a  
> parse. That
> difficult issue is outside the scope of this proposal, and will be
> identified in a future proposal."
>
> Thanks for your feedback, comments, suggestions (and votes).
>
> Regards
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/