Posted to solr-user@lucene.apache.org by Savannah Beckett <sa...@yahoo.com> on 2010/07/21 17:38:21 UTC

faceted search with job title

Hi,
  I am currently using Nutch to crawl job pages from job boards, and they are 
in my Solr index now.  I want to do faceted search on the job titles.  How?  
The job title can be anywhere on the page, e.g. the title, a header, or the 
content.  If I use an IndexFilter in Nutch to search the content for job titles, 
there are hundreds of thousands of possible titles, so I can't hard-code them 
all.  Do you have a better idea?  I think I need the job title in a separate 
field in the index to make it work with Solr faceted search, am I right?
Thanks.


      

Re: faceted search with job title

Posted by Savannah Beckett <sa...@yahoo.com>.
I don't see how it can be done without writing SAX or DOM code for each job 
board, which is unmaintainable if a lot of new job boards keep being added to 
the crawl.  Maybe I should use regex matching?  Then I would just need to swap 
in a regex pattern for each job board without writing any new SAX or DOM code.  
But are regex patterns flexible enough for all job boards?
Thanks.




________________________________
From: "Nagelberg, Kallin" <KN...@globeandmail.com>
To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
Sent: Wed, July 21, 2010 10:39:32 AM
Subject: RE: faceted search with job title

Yeah you should definitely just setup a custom parser for each site.. should be 
easy to extract title using groovy's xml parsing along with tagsoup for sloppy 
html. If you can't find the pattern for each site leading to the job title how 
can you expect solr to? Humans have the advantage here :P

-Kallin Nagelberg

-----Original Message-----
From: Savannah Beckett [mailto:savannah_beckett30@yahoo.com] 
Sent: Wednesday, July 21, 2010 12:20 PM
To: solr-user@lucene.apache.org
Cc: dave.searle@magicalia.com
Subject: Re: faceted search with job title

mmm...there must be a better way...each job board has a different format.  If 
there are constantly new job boards being crawled, I don't think I can manually 
look for the specific sequence of tags that leads to the job title.  Most of 
them don't even have a class or id.  There is no guarantee that the job title 
will be in the title tag or a header tag.  Something else can be in the title.  
Should I do this in a class that extends IndexFilter in Nutch?
Thanks. 




________________________________
From: Dave Searle <da...@magicalia.com>
To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
Sent: Wed, July 21, 2010 8:42:55 AM
Subject: RE: faceted search with job title

You'd probably need to do some post processing on the pages and set up rules for 
each website to grab that specific bit of data. You could load the html into an 
xml parser, then use xpath to grab content from a particular tag with a class or 
id, based on the particular website



-----Original Message-----
From: Savannah Beckett [mailto:savannah_beckett30@yahoo.com] 
Sent: 21 July 2010 16:38
To: solr-user@lucene.apache.org
Subject: faceted search with job title

Hi,
  I am currently using Nutch to crawl job pages from job boards, and they are 
in my Solr index now.  I want to do faceted search on the job titles.  How?  
The job title can be anywhere on the page, e.g. the title, a header, or the 
content.  If I use an IndexFilter in Nutch to search the content for job titles, 
there are hundreds of thousands of possible titles, so I can't hard-code them 
all.  Do you have a better idea?  I think I need the job title in a separate 
field in the index to make it work with Solr faceted search, am I right?
Thanks.


      

Re: faceted search with job title

Posted by Dave Searle <da...@magicalia.com>.
You could grab your XPath rules from a db too. This is what I did for a price-scraping app I wrote a while ago: new sites were added with a set of rules through a web UI. You could certainly use regex of course, but IMO that's more complex than writing a simple XPath expression. Using JavaScript or some DOM traversal code, you could quite easily create a point-and-click tool to generate rules simply and quickly.
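
A minimal sketch of the rules-in-a-database idea, assuming one XPath rule per job board keyed by host name. The table name, column names, host names and XPath strings below are all hypothetical, and SQLite is used only because it needs no setup; the thread does not say which database was used.

```python
import sqlite3
from urllib.parse import urlparse

# Hypothetical schema: one title-extraction rule per job board host.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rules (host TEXT PRIMARY KEY, title_xpath TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO rules VALUES (?, ?)",
    [
        ("jobs.example.com", ".//h1[@class='job-title']"),
        ("careers.example.org", ".//div[@id='posting']/h2"),
    ],
)


def rule_for(url):
    """Return the stored XPath for the page's host, or None for an unknown board."""
    host = urlparse(url).hostname
    row = conn.execute(
        "SELECT title_xpath FROM rules WHERE host = ?", (host,)
    ).fetchone()
    return row[0] if row else None


print(rule_for("http://jobs.example.com/posting/123"))
```

Adding a job board then means inserting a row rather than recompiling anything, which is the point of keeping the rules as data.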

On 21 Jul 2010, at 23:10, Savannah Beckett <sa...@yahoo.com> wrote:

> And I will have to recompile the DOM or SAX code each time I add a job board for 
> crawling.  A regex pattern is only a string, which can be stored in a text file or 
> db and retrieved based on the job board.  What do you think?

Re: faceted search with job title

Posted by Ken Krugler <kk...@transpac.com>.
Hi Savannah,

A few comments below, scattered in-line...

-- Ken

On Jul 21, 2010, at 3:08pm, Savannah Beckett wrote:

> And I will have to recompile the DOM or SAX code each time I add a job
> board for crawling.  A regex pattern is only a string, which can be stored
> in a text file or db and retrieved based on the job board.  What do you think?

You can store the XPath expressions in a text file as strings, and  
load/compile them as needed.
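
As a rough sketch of that, assuming a tab-separated rules file with one "site key, XPath" pair per line: the file layout, site keys and sample page are made up for illustration. Python's built-in ElementTree supports only a subset of XPath and needs well-formed input, so a real page would first go through a clean-up step such as TagSoup.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical rules file: "<site key><TAB><XPath>" per line, '#' starts a comment.
RULES_TEXT = """\
# site\txpath
boardA\t.//h1[@class='job-title']
boardB\t.//div[@id='posting']/h2
"""


def load_rules(fileobj):
    """Parse the rules file into a {site: xpath} dictionary."""
    rules = {}
    for line in fileobj:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        site, xpath = line.split("\t", 1)
        rules[site] = xpath
    return rules


def extract_title(site, page_xml, rules):
    """Apply the site's XPath to a well-formed page; None if nothing matches."""
    xpath = rules.get(site)
    if xpath is None:
        return None
    node = ET.fromstring(page_xml).find(xpath)
    if node is None:
        return None
    return "".join(node.itertext()).strip()


rules = load_rules(io.StringIO(RULES_TEXT))
page = "<html><body><div id='posting'><h2>Senior Bottlewasher</h2></div></body></html>"
print(extract_title("boardB", page, rules))
```

The same loader would work on a file opened with open(), since it only iterates over lines.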

> From: "Nagelberg, Kallin" <KN...@globeandmail.com>
> To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
> Sent: Wed, July 21, 2010 10:39:32 AM
> Subject: RE: faceted search with job title
>
> Yeah you should definitely just setup a custom parser for each site..
> should be easy to extract title using groovy's xml parsing along with
> tagsoup for sloppy html.

Definitely yes re using TagSoup to clean up bad HTML.

And definitely yes to needing per-site "rules" (typically XPath +  
optional regex as needed) to extract specific details.
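
One way such a rule could look, as a sketch only: an XPath locates the element, and an optional regex then isolates the title inside that element's text. The rule layout, site keys, patterns and sample pages are hypothetical, and the input is assumed to be already cleaned up into well-formed markup.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical per-site rules: "xpath" finds the element, and the optional
# "regex" keeps only its first capture group from the element's text.
RULES = {
    "boardA": {"xpath": ".//title", "regex": r"^(.*?)\s+-\s+Board A Jobs$"},
    "boardB": {"xpath": ".//h2", "regex": None},
}


def extract_title(site, page_xml):
    """Return the job title for a page, or None if the rule does not match."""
    rule = RULES[site]
    node = ET.fromstring(page_xml).find(rule["xpath"])
    if node is None:
        return None
    # Collapse runs of whitespace so the regex sees a single clean line.
    text = " ".join("".join(node.itertext()).split())
    if rule["regex"]:
        match = re.search(rule["regex"], text)
        if match is None:
            return None
        text = match.group(1)
    return text


page_a = "<html><head><title>Sr. Bottlewasher - Board A Jobs</title></head><body/></html>"
page_b = "<html><body><h2>  Line   Cook </h2></body></html>"
print(extract_title("boardA", page_a))
print(extract_title("boardB", page_b))
```

Here the regex handles the case raised earlier in the thread, where the title tag holds the job title plus something else.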

For a common class of sites powered by the same back-end, you can
often re-use the same general rules, because the markup that you care
about is consistent.

> If you can't find the pattern for each site leading to the job title how
> can you expect solr to? Humans have the advantage here :P
>
> -Kallin Nagelberg
>
> -----Original Message-----
> From: Savannah Beckett [mailto:savannah_beckett30@yahoo.com]
> Sent: Wednesday, July 21, 2010 12:20 PM
> To: solr-user@lucene.apache.org
> Cc: dave.searle@magicalia.com
> Subject: Re: faceted search with job title
>
> mmm...there must be a better way...each job board has a different format.
> If there are constantly new job boards being crawled, I don't think I can
> manually look for the specific sequence of tags that leads to the job
> title.  Most of them don't even have a class or id.  There is no guarantee
> that the job title will be in the title tag or a header tag.  Something
> else can be in the title.  Should I do this in a class that extends
> IndexFilter in Nutch?

When I do this kind of thing I use Bixo (http://openbixo.org), but  
that requires knowledge of Cascading (& some Hadoop) in order to  
construct web mining workflows.

> ________________________________
> From: Dave Searle <da...@magicalia.com>
> To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
> Sent: Wed, July 21, 2010 8:42:55 AM
> Subject: RE: faceted search with job title
>
> You'd probably need to do some post processing on the pages and set up
> rules for each website to grab that specific bit of data. You could load
> the html into an xml parser, then use xpath to grab content from a
> particular tag with a class or id, based on the particular website
>
>
>
> -----Original Message-----
> From: Savannah Beckett [mailto:savannah_beckett30@yahoo.com]
> Sent: 21 July 2010 16:38
> To: solr-user@lucene.apache.org
> Subject: faceted search with job title
>
> Hi,
>   I am currently using Nutch to crawl job pages from job boards, and they
> are in my Solr index now.  I want to do faceted search on the job titles.
> How?  The job title can be anywhere on the page, e.g. the title, a header,
> or the content.  If I use an IndexFilter in Nutch to search the content for
> job titles, there are hundreds of thousands of possible titles, so I can't
> hard-code them all.  Do you have a better idea?  I think I need the job
> title in a separate field in the index to make it work with Solr faceted
> search, am I right?

Yes, you'd want a separate "job title" field in the index. Though  
often the job titles are slight variants on each other, so this would  
probably work much better if you automatically found common phrases  
and used those, otherwise you get "Senior Bottlewasher" and "Sr.  
Bottlewasher" and "Sr Bottlewasher" as separate facet values.
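
To make the variant problem concrete, here is a small sketch that collapses such titles before they reach the facet field. Note that it does not find common phrases automatically; it swaps in a simpler technique, a hand-written abbreviation table plus case and punctuation folding, and the table contents are hypothetical.

```python
import re
from collections import Counter

# Hypothetical abbreviation table; a real one would be much larger.
ABBREVIATIONS = {"sr": "senior", "jr": "junior", "mgr": "manager"}


def normalize_title(title):
    """Lower-case, drop punctuation, expand known abbreviations, re-capitalize."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return " ".join(ABBREVIATIONS.get(word, word) for word in words).title()


titles = ["Senior Bottlewasher", "Sr. Bottlewasher", "Sr Bottlewasher", "Jr. Cook"]

# Count titles per normalized value, roughly what a facet on that field would show.
facets = Counter(normalize_title(t) for t in titles)
print(facets)
```

The normalized value would go into the separate job title field, with the raw title kept in another field for display. Dropping punctuation is lossy for titles such as "C++ Developer", so the tokenizing pattern would need tuning for real data.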

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





Re: faceted search with job title

Posted by Savannah Beckett <sa...@yahoo.com>.
And I will have to recompile the DOM or SAX code each time I add a job board for 
crawling.  A regex pattern is only a string, which can be stored in a text file or 
db and retrieved based on the job board.  What do you think?






      

RE: faceted search with job title

Posted by "Nagelberg, Kallin" <KN...@globeandmail.com>.
Yeah, you should definitely just set up a custom parser for each site. It should be easy to extract the title using Groovy's XML parsing along with TagSoup for sloppy HTML. If you can't find the pattern leading to the job title for each site, how can you expect Solr to? Humans have the advantage here :P

-Kallin Nagelberg



      

Re: faceted search with job title

Posted by Savannah Beckett <sa...@yahoo.com>.
mmm...there must be a better way...each job board has a different format.  If 
there are constantly new job boards being crawled, I don't think I can manually 
look for the specific sequence of tags that leads to the job title.  Most of 
them don't even have a class or id.  There is no guarantee that the job title 
will be in the title tag or a header tag.  Something else can be in the title.  
Should I do this in a class that extends IndexFilter in Nutch?
Thanks. 






      

RE: faceted search with job title

Posted by Dave Searle <da...@magicalia.com>.
You'd probably need to do some post-processing on the pages and set up rules for each website to grab that specific bit of data. You could load the HTML into an XML parser, then use XPath to grab content from a particular tag with a class or id, depending on the website.


