Posted to java-user@lucene.apache.org by Hetan Shah <He...@Sun.COM> on 2004/12/11 01:38:39 UTC

Indexing HTML files give following message

java org.apache.lucene.demo.IndexHTML -create -index
/source/workarea/hs152827/newIndex ..
adding ../0/10037.html
adding ../0/10050.html
adding ../0/1006132.html
adding ../0/1013223.html
Parse Aborted: Encountered "\"" at line 5, column 1.
Was expecting one of:
    <ArgName> ...
    "=" ...
    <TagEnd> ...

And then the indexing hangs on this line. Earlier it used to go on and
index the remaining pages in the directory. Any idea why the indexer
would stop at this error?

Pointers are much needed and appreciated.
-H


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Indexing HTML files give following message

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,

This is probably due to some bad HTML.  The application you are using
is just a demo, and it uses a JavaCC-based HTML parser, which may not be
resilient to invalid HTML.  For Lucene in Action we developed a small
extensible indexing framework, and for HTML indexing we used two tools to
handle HTML parsing: JTidy and NekoHTML.  The code for the book
is freely available from http://www.manning.com.  NekoHTML knows how
to deal with some bad HTML, which is why I'm suggesting it.
The indexing framework could come in handy for those working on various
'desktop search' applications (Roosster, LDesktop (if that's really
happening), Lucidity, etc.).
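
As a rough illustration of indexing past a bad file, here is a minimal
sketch that isolates parsing per document, so one malformed page is
logged and skipped instead of stopping the whole run.  Everything in it
is hypothetical: HtmlTextExtractor, strictExtract and indexAll are
made-up names, and the toy extractor only stands in for a real parser
(the demo's JavaCC parser, JTidy, or NekoHTML).  It uses no Lucene
calls, and it would not help if the parser itself hangs rather than
throwing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TolerantHtmlIndexing {

    /** Hypothetical stand-in for a real HTML parser. */
    interface HtmlTextExtractor {
        String extractText(String html) throws Exception;
    }

    /** Extracted text per file name, plus the names of files that failed to parse. */
    static final class Result {
        final Map<String, String> indexed = new LinkedHashMap<>();
        final List<String> skipped = new ArrayList<>();
    }

    /** Toy strict extractor: rejects unbalanced angle brackets, otherwise strips tags. */
    static String strictExtract(String html) {
        long open = html.chars().filter(c -> c == '<').count();
        long close = html.chars().filter(c -> c == '>').count();
        if (open != close) {
            throw new IllegalArgumentException("unbalanced tag delimiters");
        }
        return html.replaceAll("<[^>]*>", "").trim();
    }

    /** Parses every document; a failure in one is recorded and the loop moves on. */
    static Result indexAll(Map<String, String> htmlByName, HtmlTextExtractor extractor) {
        Result result = new Result();
        for (Map.Entry<String, String> entry : htmlByName.entrySet()) {
            try {
                result.indexed.put(entry.getKey(), extractor.extractText(entry.getValue()));
            } catch (Exception ex) {
                System.err.println("skipping " + entry.getKey() + ": " + ex.getMessage());
                result.skipped.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("good.html", "<p>hello</p>");
        docs.put("bad.html", "<a href=\"x\"");   // tag never closed
        docs.put("also-good.html", "<b>world</b>");

        Result result = indexAll(docs, TolerantHtmlIndexing::strictExtract);
        System.out.println("indexed: " + result.indexed.keySet());
        System.out.println("skipped: " + result.skipped);
    }
}
```

In a real indexer the extractor would wrap whichever parser you pick,
and the extracted text would go into a Lucene Document rather than a
map.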

Otis


--- Hetan Shah <He...@Sun.COM> wrote:

> java org.apache.lucene.demo.IndexHTML -create -index
> /source/workarea/hs152827/newIndex ..
> adding ../0/10037.html
> adding ../0/10050.html
> adding ../0/1006132.html
> adding ../0/1013223.html
> Parse Aborted: Encountered "\"" at line 5, column 1.
> Was expecting one of:
>     <ArgName> ...
>     "=" ...
>     <TagEnd> ...
> 
> And then the indexing hangs on this line. Earlier it used to go on
> and
> index remaining pages in the directory. Any idea why would the
> indexer
> stop at this error.
> 
> Pointers are much needed and appreciated.
> -H

