Posted to user@nutch.apache.org by de...@butterflycluster.com on 2006/09/19 06:16:02 UTC

crawl/index/search

hi there,

I've been playing with Nutch for a few weeks now and am starting to put
together something usable, but I need some suggestions here.

Here's the problem: crawl the web (maybe 50 sites or so) and extract
physical addresses.

I want to index physical addresses found during the crawl, so my search
results should return "Company Name, State" as the title; the summary
can be whatever is found on that page. [This is just an example to
simplify what I mean.]

Looking at the Nutch code, it seems that to index this I have to parse
the HTML content and pick out the details I need to be searchable. At
the moment only things found in the META tags are indexed, but I want to
extend this with custom fields such as company name, state, etc.

What's the best way to go about this? I want to write a plugin for it.
Which classes do I start with, and how do I tackle this?

Thanks
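As a concrete illustration of the extraction step being described, a rough, self-contained sketch follows. Everything here is an assumption for illustration: the class name, the naive "City, ST 12345" pattern, and the idea of returning "City, ST" strings are not Nutch code, just the shape of the logic that a custom parser/indexer would feed into.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: pull "City, ST 12345"-style US address fragments
// out of page text, the kind of value a custom index field might hold.
// Pattern and class name are illustrative, not part of Nutch.
public class AddressSketch {

    // One or more capitalized words, a comma, a two-letter state code,
    // then a five-digit ZIP. Deliberately simplistic.
    private static final Pattern US_ADDRESS = Pattern.compile(
        "([A-Z][a-z]+(?: [A-Z][a-z]+)*),\\s*([A-Z]{2})\\s+(\\d{5})");

    public static List<String> extract(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = US_ADDRESS.matcher(text);
        while (m.find()) {
            // Keep only the "City, ST" part for the result title.
            hits.add(m.group(1) + ", " + m.group(2));
        }
        return hits;
    }

    public static void main(String[] args) {
        String page = "Acme Corp, 1 Main St, Springfield, IL 62701. Contact us.";
        System.out.println(extract(page)); // prints [Springfield, IL]
    }
}
```

A real solution would run something like `extract` during parsing and store the result in a custom field, which is where the plugin question below comes in.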




Re: crawl/index/search

Posted by Richard Braman <rb...@taxcodesoftware.org>.
Iain,
Thanks for the pointer to GATE.  I will take a look at it too.
Richard




RE: crawl/index/search

Posted by Iain <ia...@idcl.co.uk>.
Not exactly.

I've been on a training course and talked to people who use it, but I have
yet to use it in anger.  You might also look at KIM (from Ontotext), which
uses GATE and does more ontology-type things; KIM is commercial, though.


Iain
---------------
Iain Downs (Microsoft MVP)
Commercial Software Therapist
E:  iain@idcl.co.uk     T:+44 (0) 1423 872988
W: www.idcl.co.uk
http://mvp.support.microsoft.com




Re: crawl/index/search

Posted by Fadzi Ushewokunze <de...@butterflycluster.com>.
Iain,

Ah thanks for that. I am actually playing with it right now.
Are you using it?




RE: crawl/index/search

Posted by Iain <ia...@idcl.co.uk>.
You might want to check out GATE from Sheffield University.  It's like UIMA
in concept, but more mature and probably richer.

They've got a number of modules which integrate with Lucene, so integration
with Nutch should be easier.


Iain




Re: crawl/index/search

Posted by Fadzi Ushewokunze <de...@butterflycluster.com>.
Richard,

Thanks for the insight.

I have spent the past few days looking into lightweight structured text,
text mining, and eventually natural language processing. Through further
research I came across UIMA from IBM, and I liked the idea behind it.
I played around with it, but it is a huge monster!

It's still new to me, so I am still getting my head around it, but I think
it has the potential to achieve a lot. Have you ever dealt with it?

For that matter, if anyone in the community has, it would be nice to
get some info on this, especially if you have integrated it with
Nutch/Lucene.

It seems UIMA will be entering the Apache Incubator; it also has a
decent-sized community behind it already.

Anyway, this whole world (NLP, structured text, etc.) is new to me at the
moment, so I am still evaluating my requirements and what tools are
available.





Re: crawl/index/search

Posted by Richard Braman <rb...@taxcodesoftware.org>.
Getting other information out of the page requires parsing. In this case
you have to come up with some pretty complicated regular expressions,
unless the information you want, like the company name, is going to be
in the same place on each site.

I don't know how to tackle this problem with anything that comes stock
with Nutch, but writing a plugin would be the way to go, especially if
it ends up in the public domain.

I have thought about developing a similar plugin, but the question
becomes: what do you use?  I view regular expressions as having many
shortcomings.  For instance, they are usually only a custom solution for
locating a particular piece of information in a particular structure.
I would like a more robust framework for matching patterns that is easy
to use, can be extended, and so forth.  Regular expressions won't cut it
in many cases and don't allow normal users to write their own.  For
example, what is a regular expression for a company name?  An email
address would be an easy one to write a regex for, which is why so many
spammers use web crawlers to harvest email addresses from the web.
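To illustrate the contrast being drawn here, a minimal sketch of the "easy" case follows. The class name and the deliberately simple pattern are assumptions for the sketch; a production-grade email grammar is far messier than this.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a crude email-address pattern of the kind described
// above. Real email syntax (per RFC 2822) is much more complicated, but
// a rough pattern like this is why harvesting emails is "easy" while a
// regex for "company name" is not.
public class EmailSketch {

    private static final Pattern EMAIL = Pattern.compile(
        "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    public static Set<String> harvest(String text) {
        Set<String> found = new LinkedHashSet<>(); // dedupe, keep order
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(
            harvest("Mail iain@idcl.co.uk or dev@butterflycluster.com."));
    }
}
```

There is no analogous pattern for a company name, which is exactly the gap the NLP/ontology tools discussed below try to fill.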

It turns out there is a whole field of information retrieval developing
technologies dedicated to parsing through text and using advanced
ontologies to determine anything and everything about the text in a
document.  They can determine whether a term is a noun, verb, adjective,
and so forth.  They can also determine whether something matches a
pattern such as an email address, street address, or company name.  The
problem is that most of this is not in the public domain.

I think most users use regexes to find what they are looking for, but I
am quite sure that using an advanced parsing library would yield a more
robust plugin.

In my research, I stumbled on LAPIS, a lightweight-structure framework
for text processing that uses advanced technology.  LAPIS is open
source, developed at MIT, and is Java based.  I used it and was quite
impressed with its ease of use.  I think this would be a very
interesting framework to adapt to Nutch.  If anyone else knows any other
open source libraries for determining structure, please comment.  You
can read more about LAPIS here, or Google "lightweight structure text":
http://www.softwaresecretweapons.com/jspwiki/Wiki.jsp?page=Lapis

I would be willing to help you if you would be willing to put the plugin
into the public domain.

Here are the 0.7 docs for writing plugins:
http://wiki.apache.org/nutch/WritingPluginExample
