Posted to dev@nutch.apache.org by Erik Hatcher <er...@ehatchersolutions.com> on 2005/11/25 11:30:27 UTC

Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser

On 24 Nov 2005, at 23:49, Chris Mattmann wrote:
>> Dublin core may be good for the semantic web, but not for content  
>> storage.
>
> I completely disagree with that.

Me too.

> In fact, I think many people would disagree
> with that. Dublin core is a "standard" metadata model for  
> electronic
> resources. It is by no means the entire spectrum of metadata that  
> could be
> stored for electronic content. However, rather than creating your own
> "author" field, or "content creator", or "document creator", or  
> whatever you
> want to call it, I think it would be nice to provide the DC  
> metadata because
> at least it is well known and provides interoperability with other  
> content
> storage systems. Check out DSpace from MIT. Check out ISO-11179  
> registry
> systems. Check out the ISO standard OAIS reference model for archiving
> systems. Each of these systems has recognized that standard  
> metadata is an
> important concern in any content management system.
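
To make that concrete: using the DC element names instead of home-grown
field names could look as simple as the sketch below, which assumes
nothing but a plain key/value map (the class name and the sample values
are placeholders, not any particular Nutch, DSpace, or OAIS API):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class DublinCoreFields {

        public static void main(String[] args) {
            // Standard Dublin Core element names instead of ad-hoc keys
            // like "author" / "content creator" / "document creator".
            Map<String, String> meta = new LinkedHashMap<String, String>();
            meta.put("dc:title",      "Some crawled document");
            meta.put("dc:creator",    "Example Author");
            meta.put("dc:date",       "2005-11-25");
            meta.put("dc:format",     "text/xml");
            meta.put("dc:language",   "en");
            meta.put("dc:identifier", "http://example.org/docs/sample.xml");

            // Any DC-aware system (DSpace, an OAIS archive, another
            // crawler) can read these keys without a field-mapping step.
            for (Map.Entry<String, String> e : meta.entrySet()) {
                System.out.println(e.getKey() + " = " + e.getValue());
            }
        }
    }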

Further along these lines... Nutch's instigation had a bit to do with  
Google's dominance, and look where Google is headed now!  Semantic  
web, oh my!  Google Base currently is just scratching the surface of  
where they'll head.  Nutch could certainly be used in this sort of  
space.  I was using it there, but for now have backed off to something  
much simpler to begin with: using Nutch to crawl library archives with  
RDF data  
backing the web pages, pointed to by <link> tags in the <head>  
section.  That RDF is dumped into a powerful triplestore (Kowari),  
with the goal of blending structured RDF queries with full-text queries.
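
For what it's worth, the glue for that is small: find the RDF <link> in
the page <head>, fetch the RDF it points to, and hand it to the
triplestore.  A rough, self-contained sketch of just the first step,
using a naive regex; the markup and URL below are placeholders, and a
real setup would then load the RDF with a toolkit such as Jena (which
comes up later in this thread) and push the statements into Kowari:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RdfLinkExtractor {

        // Naive pattern for <link ... type="application/rdf+xml" ... href="...">.
        // It assumes type comes before href; real-world HTML is messier.
        private static final Pattern RDF_LINK = Pattern.compile(
            "<link[^>]+type=[\"']application/rdf\\+xml[\"'][^>]*href=[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) {
            String html = "<html><head>"
                + "<link rel=\"meta\" type=\"application/rdf+xml\""
                + " href=\"http://example.org/archive/item42.rdf\"/>"
                + "</head><body>...</body></html>";

            Matcher m = RDF_LINK.matcher(html);
            if (m.find()) {
                String rdfUrl = m.group(1);
                System.out.println("RDF description found at: " + rdfUrl);
                // A real pipeline would now fetch that URL into an RDF
                // model and load the statements into the triplestore.
            }
        }
    }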

I strongly suspect that there will be more efforts to tweak Nutch  
into the semantic web space.  I'd be surprised otherwise.

>> The magic word is minimalism.
>> So I vote against this suggestion!
>> Stefan
>
> In general, this proposal represents a step forward in being able  
> to parse
> generic XML content in Nutch, which is a very challenging problem.  
> Thanks
> for your suggestions; however, I think that our proposal would help  
> Nutch to
> move forward in being able to handle generic forms of XML markup content.

Stefan - please don't inhibit innovation.  Just because you don't  
agree with the approach, let them have the freedom to prove it out  
with encouragement, not negativity.  Plugins can be turned off, and  
if it isn't acceptable in the core then so be it; it doesn't  
even have to be an officially supported plugin.  But I, for one,  
would like to encourage them to continue on with their XML efforts  
and see where it leads.

RDF, microformats, triplestores, structured querying, faceted  
browsing.... these are the things I need, with of course full-text  
search, and this is the direction Google is headed in a major way.   
Full-text is great and all, but it's only part of the story, and a  
crude one in many respects. :)  Scraping HTML for "meaning"... insanity.

	Erik



Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser

Posted by Jérôme Charron <je...@gmail.com>.
> Are we talking about parsing RDF, or about storing parsed HTML
> text in RDF and converting it via XSLT to plain text?
> I may be misunderstanding something. I very much like the idea of a
> general RDF parser. Back in the day I played around with jena.sf.net.
> Parsing, yes; but replacing the Nutch sequence file and the concept of
> Writables with XML is, from my point of view, a bad idea.

Once again, please read the proposal and my responses one more time.
The proposal doesn't suggest replacing the way data is stored in Nutch.
It is just a proposal for a generic XML parser (as the title suggests).
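
As a rough illustration of what "generic" means here (a sketch only, not
the proposal's actual design): the standard DOM API is enough to pull
the indexable text out of an arbitrary XML document without knowing its
schema in advance.

    import java.io.ByteArrayInputStream;

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;

    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class GenericXmlTextExtractor {

        // Walk the DOM tree and collect every text node, whatever the schema.
        static void collectText(Node node, StringBuilder out) {
            if (node.getNodeType() == Node.TEXT_NODE) {
                String text = node.getNodeValue().trim();
                if (text.length() > 0) {
                    out.append(text).append(' ');
                }
            }
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                collectText(children.item(i), out);
            }
        }

        public static void main(String[] args) throws Exception {
            String xml = "<doc><title>A title</title><body>Some body text.</body></doc>";
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));

            StringBuilder text = new StringBuilder();
            collectText(doc.getDocumentElement(), text);
            System.out.println(text.toString().trim());   // prints: A title Some body text.
        }
    }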


> :-) I'm the last one to inhibit innovation, but I would love to see
> Nutch able to parse billions of pages.

Today, parsing billions of pages is not the only challenge for search engines
(look at Google, which no longer displays the number of indexed pages).
Parsing many content types and the language technologies (language-specific
stemming, analysis, querying, summarization, ...) are some of the
other new challenges...
The "low level" challenges are important, but they must not be a brake on
"high level" processes.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser

Posted by Stefan Groschupf <sg...@media-style.com>.
On 25.11.2005 at 11:30, Erik Hatcher wrote:

>
> On 24 Nov 2005, at 23:49, Chris Mattmann wrote:
>>> Dublin core may be good for the semantic web, but not for content  
>>> storage.
>>
>> I completely disagree with that.
>
> Me too.
Are we talking about parsing RDF, or about storing parsed HTML  
text in RDF and converting it via XSLT to plain text?
I may be misunderstanding something. I very much like the idea of a  
general RDF parser. Back in the day I played around with jena.sf.net.
Parsing, yes; but replacing the Nutch sequence file and the concept of  
Writables with XML is, from my point of view, a bad idea.
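
For readers who haven't touched that code: a Writable is simply a record
that serializes itself to a compact binary stream and back; that
compactness is what is at stake in replacing it with XML.  A simplified
sketch of the contract (the real interface lived in Nutch's io package
at the time, if I recall correctly; names and fields here are
illustrative only):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInput;
    import java.io.DataInputStream;
    import java.io.DataOutput;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class PageRecord {
        private String url;
        private long fetchTime;

        public PageRecord() { }

        public PageRecord(String url, long fetchTime) {
            this.url = url;
            this.fetchTime = fetchTime;
        }

        public void write(DataOutput out) throws IOException {
            out.writeUTF(url);         // a handful of bytes per field...
            out.writeLong(fetchTime);  // ...versus markup around every record in XML
        }

        public void readFields(DataInput in) throws IOException {
            url = in.readUTF();
            fetchTime = in.readLong();
        }

        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new PageRecord("http://example.org/", System.currentTimeMillis())
                .write(new DataOutputStream(bytes));

            PageRecord copy = new PageRecord();
            copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
            System.out.println(copy.url + " (" + bytes.size() + " bytes on disk)");
        }
    }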

>
> Stefan - please don't inhibit innovation.
:-) I'm the last one to inhibit innovation, but I would love to see  
Nutch able to parse billions of pages.
As you can read in my last posting, I contributed the plugin system  
back in the day precisely to give all developers that freedom.

Stefan