Posted to dev@lenya.apache.org by Robert Goene <ro...@goene.nl> on 2005/05/12 23:50:14 UTC

HTMLParser

Hi,

I am trying to extend the current HTMLParser of lenya 1.2.1 to support 
keywords.

My approach is to use the <dc:description/> field for the storage of 
these keywords, because it is already part of the metadata editor (not 
the best argument ever...). I would like to access the <dc:description/> 
tag from HTMLDocument.java and add the split string as fields to Lucene.

My problem is the HTMLParser. Is there a straightforward way to get a 
specific tag from the parser?

Any help would be appreciated!

Robert Goené
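
A minimal sketch of the extraction asked about above, under stated
assumptions: the metadata is well-formed XML, and the dc: prefix is bound
to the Dublin Core namespace http://purl.org/dc/elements/1.1/. It uses the
JDK's DOM parser instead of Lenya's HTMLParser, and the names
DescriptionKeywords and extract are hypothetical. Adding the resulting
strings to a Lucene document is deliberately left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of Lenya or Lucene.
final class DescriptionKeywords {

    // Dublin Core element namespace (assumed binding of the dc: prefix).
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Returns the non-empty, trimmed lines of the first <dc:description/>.
    static List<String> extract(String xml) {
        List<String> keywords = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match on the dc namespace
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagNameNS(DC_NS, "description");
            if (nodes.getLength() == 0) {
                return keywords; // no description element: no keywords
            }
            String text = nodes.item(0).getTextContent();
            // One keyword per line, as described in the message above.
            for (String line : text.split("\\r?\\n")) {
                String keyword = line.trim();
                if (!keyword.isEmpty()) {
                    keywords.add(keyword);
                }
            }
            return keywords;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse metadata XML", e);
        }
    }
}
```

Each returned string could then become one field value in the index. A
document with a DOCTYPE may make the parser try to fetch the DTD, which
this sketch does not handle.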

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lenya.apache.org
For additional commands, e-mail: dev-help@lenya.apache.org

Re: HTMLParser

Posted by Robert Goene <ro...@goene.nl>.

solprovider@gmail.com wrote:
>>Does anyone have a configuration file to index an xhtml file? I seem to
>>be able to add fields to the index, but without any content...
>>Regards, Robert
> 
> 
> Check out my Search:
> http://www.solprovider.com/solprovider/lenya.nsf/Home?readform&pg=search
> 
> It uses the ConfigurableIndexer, and has an example of the
> configuration file.  Most of the rest is to fix the URLs
> displayed in results so they work.  You may want to implement the
> whole thing, then add your keywords field to everything.
> 
Thanks a lot! Your site contains just what I needed!

Re: HTMLParser

Posted by so...@gmail.com.

> Does anyone have a configuration file to index an xhtml file? I seem to
> be able to add fields to the index, but without any content...
> Regards, Robert

Check out my Search:
http://www.solprovider.com/solprovider/lenya.nsf/Home?readform&pg=search

It uses the ConfigurableIndexer, and has an example of the
configuration file.  Most of the rest is to fix the URLs
displayed in results so they work.  You may want to implement the
whole thing, then add your keywords field to everything.

solprovider

Re: HTMLParser

Posted by Robert Goene <ro...@goene.nl>.

solprovider@gmail.com wrote:
>>>>>>I am trying to extend the current HTMLParser of lenya 1.2.1 to support
>>>>>>keywords.
> 
> 
>>>Lucene can index data (removing all tags) into several fields which
>>>can be used by search.  The default is to crawl a website for all HTML
>>>pages, then index the entire page into a "content" field.  My version
>>>of search indexes the XML documents in {pub}/content/live, keeps the
>>>"content" field, and adds fields for "language", "title", and
>>>"description".  Each field is configured using an XPATH expression.
> 
> 
>>>So the easy answer should be:
>>>1. Decide to index Lenya's XML or HTML.  If HTML, make certain the
>>>keywords are displayed in the header so they can be accessed using
>>>XPATH.
>>>2. Configure Lucene to add keywords to a new field.  Create the index.
>>>3. Change the Search page to allow selection by keywords.
> 
> 
>>This only leaves me with the question how i should add the keywords.
>>Right now, it is just one string with a \n separator for the different
>>keywords. I would also like to add a boost factor to the individual
>>keywords.
> 
> 
>>The alternative would be a nice extension of the Lenya GUI to edit an
>>xml list of keywords and boost factor. This sounds more lenya-like to a
>>lenya newbie as i am. Any suggestions?
> 
> 
> Thanks Michi (see his post): Lucene's default is for HTML, but any
> configuration requires XML, so you'll be working with XML.
> 
> You can create a new "keywords" field for use by the Search front-end.
>  Lucene indexes on words, so separating with a space works well.  It
> does not do well separating using tags, because they are removed
> without adding a whitespace separator.  (I think that is called a
> bug.)
> 
> What business purpose would "boost" help?  Lucene would probably need
> to be completely rewritten to support something like it.  Can you
> design an interface that adds enough value to compensate for the extra
> confusion?

The boost is a very nice Lucene function to fine-tune the search results. 
I need it because my indexed documents will have very similar keywords, 
so I need a more sophisticated mechanism to control the search results. I 
think I'll take a look at the ConfigurableIndexer and maybe add a 
field type to parse the content.

Does anyone have a configuration file to index an XHTML file? I seem to 
be able to add fields to the index, but without any content...

Regards, Robert
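
A rough sketch of how per-keyword boost factors could be carried in the
newline-separated description string. The "keyword^boost" line format is
an assumption made here for illustration (it borrows the caret notation of
Lucene's query syntax), and BoostedKeywords is a hypothetical class, not
part of Lenya or Lucene. A keyword without a boost gets 1.0, on the
assumption that this is the neutral value. Applying the parsed factors to
index fields is not shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: parses lines such as "lenya^2.5" into keyword/boost pairs.
final class BoostedKeywords {

    // Neutral boost used when a line carries no explicit factor (assumption).
    static final float DEFAULT_BOOST = 1.0f;

    static Map<String, Float> parse(String description) {
        Map<String, Float> result = new LinkedHashMap<>(); // keeps input order
        for (String line : description.split("\\r?\\n")) {
            String entry = line.trim();
            if (entry.isEmpty()) {
                continue; // skip blank lines
            }
            int caret = entry.lastIndexOf('^');
            if (caret < 0) {
                result.put(entry, DEFAULT_BOOST);
                continue;
            }
            String keyword = entry.substring(0, caret).trim();
            if (keyword.isEmpty()) {
                continue; // a bare "^2.0" has no keyword to boost
            }
            // Throws NumberFormatException if the factor is not a number.
            float boost = Float.parseFloat(entry.substring(caret + 1).trim());
            result.put(keyword, boost);
        }
        return result;
    }
}
```

Keeping the factor next to the keyword means the existing description
field of the metadata editor can still be used, at the cost of a format
that authors have to learn.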


Re: HTMLParser

Posted by so...@gmail.com.

> >>>>I am trying to extend the current HTMLParser of lenya 1.2.1 to support
> >>>>keywords.

> > Lucene can index data (removing all tags) into several fields which
> > can be used by search.  The default is to crawl a website for all HTML
> > pages, then index the entire page into a "content" field.  My version
> > of search indexes the XML documents in {pub}/content/live, keeps the
> > "content" field, and adds fields for "language", "title", and
> > "description".  Each field is configured using an XPATH expression.

> > So the easy answer should be:
> > 1. Decide to index Lenya's XML or HTML.  If HTML, make certain the
> > keywords are displayed in the header so they can be accessed using
> > XPATH.
> > 2. Configure Lucene to add keywords to a new field.  Create the index.
> > 3. Change the Search page to allow selection by keywords.

> This only leaves me with the question how i should add the keywords.
> Right now, it is just one string with a \n separator for the different
> keywords. I would also like to add a boost factor to the individual
> keywords.

> The alternative would be a nice extension of the Lenya GUI to edit an
> xml list of keywords and boost factor. This sounds more lenya-like to a
> lenya newbie as i am. Any suggestions?

Thanks Michi (see his post): Lucene's default is for HTML, but any
configuration requires XML, so you'll be working with XML.

You can create a new "keywords" field for use by the Search front-end.
Lucene indexes on words, so separating with a space works well.  It
does not do well separating using tags, because they are removed
without adding a whitespace separator.  (I think that is called a
bug.)

What business purpose would "boost" serve?  Lucene would probably need
to be completely rewritten to support something like it.  Can you
design an interface that adds enough value to compensate for the extra
confusion?

solprovider
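
The separator problem described above can be illustrated with a small
sketch. It does not reproduce the code Lenya or Lucene actually run; it
only contrasts removing tags outright with replacing each tag by a space
before splitting on whitespace. TagStripping and its methods are
hypothetical names, and the regular expression is a simplification that
ignores comments and CDATA sections.

```java
// Hypothetical illustration of tag removal with and without a separator.
final class TagStripping {

    // Deletes tags outright: adjacent element contents run together.
    static String stripTags(String markup) {
        return markup.replaceAll("<[^>]*>", "");
    }

    // Replaces each tag with a space, then collapses runs of whitespace.
    static String stripTagsWithSpaces(String markup) {
        return markup.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Splits text into words on whitespace, as a word-based indexer would.
    static String[] words(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}
```

For the input <k>lucene</k><k>search</k>, the first method yields the
single word "lucenesearch", while the second yields the two words
"lucene" and "search". Storing the keywords separated by spaces avoids
the issue altogether.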

Re: HTMLParser

Posted by Michael Wechner <mi...@wyona.com>.

Robert Goene wrote:

>
> This only leaves me with the question how i should add the keywords. 
> Right now, it is just one string with a \n separator for the different 
> keywords. I would also like to add a boost factor to the individual 
> keywords.
>
> The alternative would be a nice extension of the Lenya GUI to edit an 
> xml list of keywords and boost factor. This sounds more lenya-like to 
> a lenya newbie as i am. Any suggestions?


you might want to take a look at the OSCOM publication, which can be 
downloaded from

http://www.wyona.org/

HTH

Michi



-- 
Michael Wechner
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
michael.wechner@wyona.com                        michi@apache.org


Re: HTMLParser

Posted by Robert Goene <ro...@goene.nl>.

solprovider@gmail.com wrote:
> On 5/12/05, Robert Goene <ro...@goene.nl> wrote:
> 
>>>>I am trying to extend the current HTMLParser of lenya 1.2.1 to support
>>>>keywords.
>>
>>Is there an xml parser for lucene somewhere? Should be fairly easy. The
>>documents that i am indexing are xhtml, so there is no need for a parser
>>that can handle those illegal html files.
> 
> 
> I am trying to understand the purpose of this, so let me know if this
> answer is completely off-topic.  I believe your issue can be solved
> without touching Java.

Completely on-topic.
> 
> I do not think Lucene cares whether data is HTML or XML; it treats it
> all as XML.  I have not tried it with poorly written HTML, since Lenya
> always closes tags in the correct order, and I have only used Lucene
> with Lenya.
> 
> Lucene can index data (removing all tags) into several fields which
> can be used by search.  The default is to crawl a website for all HTML
> pages, then index the entire page into a "content" field.  My version
> of search indexes the XML documents in {pub}/content/live, keeps the
> "content" field, and adds fields for "language", "title", and
> "description".  Each field is configured using an XPATH expression.
> 
> So the easy answer should be:
> 1. Decide to index Lenya's XML or HTML.  If HTML, make certain the
> keywords are displayed in the header so they can be accessed using
> XPATH.
> 2. Configure Lucene to add keywords to a new field.  Create the index.
> 3. Change the Search page to allow selection by keywords.
> 

This only leaves me with the question of how I should add the keywords. 
Right now, it is just one string with a \n separator between the different 
keywords. I would also like to add a boost factor to the individual 
keywords.

The alternative would be a nice extension of the Lenya GUI to edit an 
XML list of keywords and boost factors. This sounds more Lenya-like to a 
Lenya newbie like me. Any suggestions?


Re: HTMLParser

Posted by Michael Wechner <mi...@wyona.com>.

solprovider@gmail.com wrote:

>On 5/12/05, Robert Goene <ro...@goene.nl> wrote:
>  
>
>>>>I am trying to extend the current HTMLParser of lenya 1.2.1 to support
>>>>keywords.
>>>>        
>>>>
>>Is there an xml parser for lucene somewhere? Should be fairly easy. The
>>documents that i am indexing are xhtml, so there is no need for a parser
>>that can handle those illegal html files.
>>    
>>
>
>I am trying to understand the purpose of this, so let me know if this
>answer is completely off-topic.  I believe your issue can be solved
>without touching Java.
>
>I do not think Lucene cares whether data is HTML or XML; it treats it
>all as XML. 
>

sorry, no ;-) Lucene requires a Lucene document object/class, which
defines all the fields, but yes, this Lucene document class has
nothing to do with where the data comes from or what format it was in.

Lenya generally provides two parsers for generating a Lucene document
object:

src/java/org/apache/lenya/lucene/index/ConfigurableDocumentCreator.java
which assumes that the original data is XML

and

src/java/org/apache/lenya/lucene/index/DefaultDocumentCreator.java
which assumes that the original data is HTML

this grew historically and is therefore not very nicely named, but
on the other hand it works fine.

Which document creator one wants to use can be configured within

MY-PUB/config/search/lucene.xconf

e.g.

<indexer class="org.apache.lenya.lucene.index.ConfigurableIndexer">
  <configuration src="cmfs-luceneDoc.xconf"/>
  <extensions src="xml"/>
</indexer>

or


<indexer class="org.apache.lenya.lucene.index.DefaultIndexer"/>


Hope that makes it a bit clearer ;-)

Michi




> I have not tried it with poorly written HTML, since Lenya
>always closes tags in the correct order, and I have only used Lucene
>with Lenya.
>
>Lucene can index data (removing all tags) into several fields which
>can be used by search.  The default is to crawl a website for all HTML
>pages, then index the entire page into a "content" field.  My version
>of search indexes the XML documents in {pub}/content/live, keeps the
>"content" field, and adds fields for "language", "title", and
>"description".  Each field is configured using an XPATH expression.
>
>So the easy answer should be:
>1. Decide to index Lenya's XML or HTML.  If HTML, make certain the
>keywords are displayed in the header so they can be accessed using
>XPATH.
>2. Configure Lucene to add keywords to a new field.  Create the index.
>3. Change the Search page to allow selection by keywords.
>
>solprovider


-- 
Michael Wechner
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
michael.wechner@wyona.com                        michi@apache.org


Re: HTMLParser

Posted by so...@gmail.com.

On 5/12/05, Robert Goene <ro...@goene.nl> wrote:
> >> I am trying to extend the current HTMLParser of lenya 1.2.1 to support
> >> keywords.
> Is there an xml parser for lucene somewhere? Should be fairly easy. The
> documents that i am indexing are xhtml, so there is no need for a parser
> that can handle those illegal html files.

I am trying to understand the purpose of this, so let me know if this
answer is completely off-topic.  I believe your issue can be solved
without touching Java.

I do not think Lucene cares whether data is HTML or XML; it treats it
all as XML.  I have not tried it with poorly written HTML, since Lenya
always closes tags in the correct order, and I have only used Lucene
with Lenya.

Lucene can index data (removing all tags) into several fields which
can be used by search.  The default is to crawl a website for all HTML
pages, then index the entire page into a "content" field.  My version
of search indexes the XML documents in {pub}/content/live, keeps the
"content" field, and adds fields for "language", "title", and
"description".  Each field is configured using an XPATH expression.

So the easy answer should be:
1. Decide to index Lenya's XML or HTML.  If HTML, make certain the
keywords are displayed in the header so they can be accessed using
XPATH.
2. Configure Lucene to add keywords to a new field.  Create the index.
3. Change the Search page to allow selection by keywords.

solprovider
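
The idea of one XPath expression per index field can be sketched with the
JDK's XPath API alone. This is an illustration under assumptions, not the
ConfigurableIndexer's actual code or configuration format: the class name
XPathFieldExtractor is hypothetical, the field names and expressions are
supplied by the caller, and the sample paths mentioned below assume a
document without namespaces.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: maps field names to the text selected by an XPath.
final class XPathFieldExtractor {

    // fieldXPaths: field name -> XPath expression evaluated on the document.
    static Map<String, String> extract(String xml, Map<String, String> fieldXPaths) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> fields = new LinkedHashMap<>();
            for (Map.Entry<String, String> entry : fieldXPaths.entrySet()) {
                // evaluate(String, Object) returns the string value of the
                // first matching node, or "" when nothing matches.
                String value = xpath.evaluate(entry.getValue(), doc);
                fields.put(entry.getKey(), value.trim());
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not extract fields", e);
        }
    }
}
```

With expressions such as /html/head/title for "title" and
/html/head/meta[@name='keywords']/@content for "keywords", each map entry
would correspond to one field added to the index in step 2 above.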

Re: HTMLParser

Posted by Robert Goene <ro...@goene.nl>.



Gregor J. Rothfuss wrote:
> Robert Goene wrote:
> 
>> I am trying to extend the current HTMLParser of lenya 1.2.1 to support 
>> keywords.
> 
> 
> that is some of the nastiest code in lenya as you might have figured out 
> by now. if i recall correctly, that code is auto generated by a parser 
> generator and is almost illegible. i tried to document things a little 
> bit at

I removed the remark from my email that it looked like generated code, 
just in case it would insult someone :)

> 
> http://lenya.apache.org/apidocs/1.4/org/apache/lenya/lucene/html/HTMLParser.html 
> 
> 
> michi is apparently working on replacing that custom crawler with the 
> nutch codebase, which should hopefully be easier to deal with:
> 
> http://incubator.apache.org/nutch/apidocs/index.html
> 
> michi, why not do your experiments in the sandbox.. ?

Is there an XML parser for Lucene somewhere? It should be fairly easy. The 
documents that I am indexing are XHTML, so there is no need for a parser 
that can handle those illegal HTML files.



Re: HTMLParser

Posted by Michael Wechner <mi...@wyona.com>.

Gregor J. Rothfuss wrote: 

>
> michi is apparently working on replacing that custom crawler with the 
> nutch codebase, which should hopefully be easier to deal with:
>
> http://incubator.apache.org/nutch/apidocs/index.html
>
> michi, why not do your experiments in the sandbox.. ?


fine with me, will do so

Michi



-- 
Michael Wechner
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
michael.wechner@wyona.com                        michi@apache.org


Re: HTMLParser

Posted by "Gregor J. Rothfuss" <gr...@apache.org>.

Robert Goene wrote:

> I am trying to extend the current HTMLParser of lenya 1.2.1 to support 
> keywords.

that is some of the nastiest code in lenya as you might have figured out 
by now. if i recall correctly, that code is auto generated by a parser 
generator and is almost illegible. i tried to document things a little 
bit at

http://lenya.apache.org/apidocs/1.4/org/apache/lenya/lucene/html/HTMLParser.html

michi is apparently working on replacing that custom crawler with the 
nutch codebase, which should hopefully be easier to deal with:

http://incubator.apache.org/nutch/apidocs/index.html

michi, why not do your experiments in the sandbox.. ?
