You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Hetan Shah <He...@Sun.COM> on 2004/08/25 23:30:52 UTC

Time to index documents

Hello all,

Is there a way to reduce the indexing time taken when the indexer is 
indexing about 30,000 + files. It is roughly taking around 6-7 hours to 
do this. I am using IndexHTML class to create the index out of HTML files.

Another issue that I see is every once in a while I get the following 
output on the screen.

adding ../31/1104852.html
Parse Aborted: Encountered "\"" at line 7, column 1.
Was expecting one of:
     <ArgName> ...
     "=" ...
     <TagEnd> ...

Any suggestions on preventing this from happening?

Thanks in advance.
-H


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


RE: Time to index documents

Posted by Stephane James Vaucher <va...@cirano.qc.ca>.
Hetan,

If you are using a corpus with multiple editors, I suggest that you 
use a cleaner like tidy as there might be weird stuff appearing in the 
html.

sv

On Thu, 26 Aug 2004, Karthik N S wrote:

> Hi Hetan
> 
> 
>    Th's the  major Problem of non Standatrdized Tags for HTML Document's
>   u are Indexing ,resulting in lag time taken for Indexing process....
> 
> 
>    If u can Tweak the HTMLParser.jj file within  lucene.zip   '/demo/html'
> file
>    [U have to have some Knowledge of JAVACC for this].
> 
> 
> 
> Karthik
> 
> -----Original Message-----
> From: Hetan Shah [mailto:Hetan.Shah@Sun.COM]
> Sent: Thursday, August 26, 2004 3:01 AM
> To: Lucene Users List
> Subject: Time to index documents
> 
> 
> Hello all,
> 
> Is there a way to reduce the indexing time taken when the indexer is
> indexing about 30,000 + files. It is roughly taking around 6-7 hours to
> do this. I am using IndexHTML class to create the index out of HTML files.
> 
> Another issue that I see is every once in a while I get the following
> output on the screen.
> 
> adding ../31/1104852.html
> Parse Aborted: Encountered "\"" at line 7, column 1.
> Was expecting one of:
>      <ArgName> ...
>      "=" ...
>      <TagEnd> ...
> 
> Any suggestions on preventing this from happening?
> 
> Thanks in advance.
> -H
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


RE: Time to index documents

Posted by Karthik N S <ka...@controlnet.co.in>.
Hi Hetan


   Th's the  major Problem of non Standatrdized Tags for HTML Document's
  u are Indexing ,resulting in lag time taken for Indexing process....


   If u can Tweak the HTMLParser.jj file within  lucene.zip   '/demo/html'
file
   [U have to have some Knowledge of JAVACC for this].



Karthik

-----Original Message-----
From: Hetan Shah [mailto:Hetan.Shah@Sun.COM]
Sent: Thursday, August 26, 2004 3:01 AM
To: Lucene Users List
Subject: Time to index documents


Hello all,

Is there a way to reduce the indexing time taken when the indexer is
indexing about 30,000 + files. It is roughly taking around 6-7 hours to
do this. I am using IndexHTML class to create the index out of HTML files.

Another issue that I see is every once in a while I get the following
output on the screen.

adding ../31/1104852.html
Parse Aborted: Encountered "\"" at line 7, column 1.
Was expecting one of:
     <ArgName> ...
     "=" ...
     <TagEnd> ...

Any suggestions on preventing this from happening?

Thanks in advance.
-H


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Time to index documents

Posted by Stephane James Vaucher <va...@cirano.qc.ca>.
JGuru explanation: 
http://www.jguru.com/faq/view.jsp?EID=1074228

I have no sample code for neko, I think nutch uses it though. For tidy, 
you can look at ant in the sandbox:

http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/ant/src/main/org/apache/lucene/ant/HtmlDocument.java?rev=1.3&view=markup

HTH,
sv

On Wed, 25 Aug 2004, Hetan Shah wrote:

> Do you have any pointers for sample code for them?
> Would highly appreciate it.
> Thanks.
> -H
> 
> Stephane James Vaucher wrote:
> 
> > I don't think that the demo parser is meant as a production 
> > system component. You can look at Tidy or NekoHtml. They cleanup your html 
> > and are probably optimised.
> > 
> > sv
> > 
> > On Wed, 25 Aug 2004, Hetan Shah wrote:
> > 
> > 
> >>Hello all,
> >>
> >>Is there a way to reduce the indexing time taken when the indexer is 
> >>indexing about 30,000 + files. It is roughly taking around 6-7 hours to 
> >>do this. I am using IndexHTML class to create the index out of HTML files.
> >>
> >>Another issue that I see is every once in a while I get the following 
> >>output on the screen.
> >>
> >>adding ../31/1104852.html
> >>Parse Aborted: Encountered "\"" at line 7, column 1.
> >>Was expecting one of:
> >>     <ArgName> ...
> >>     "=" ...
> >>     <TagEnd> ...
> >>
> >>Any suggestions on preventing this from happening?
> >>
> >>Thanks in advance.
> >>-H
> >>
> >>
> >>---------------------------------------------------------------------
> >>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >>
> > 
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> > 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Time to index documents

Posted by Hetan Shah <He...@Sun.COM>.
Do you have any pointers for sample code for them?
Would highly appreciate it.
Thanks.
-H

Stephane James Vaucher wrote:

> I don't think that the demo parser is meant as a production 
> system component. You can look at Tidy or NekoHtml. They cleanup your html 
> and are probably optimised.
> 
> sv
> 
> On Wed, 25 Aug 2004, Hetan Shah wrote:
> 
> 
>>Hello all,
>>
>>Is there a way to reduce the indexing time taken when the indexer is 
>>indexing about 30,000 + files. It is roughly taking around 6-7 hours to 
>>do this. I am using IndexHTML class to create the index out of HTML files.
>>
>>Another issue that I see is every once in a while I get the following 
>>output on the screen.
>>
>>adding ../31/1104852.html
>>Parse Aborted: Encountered "\"" at line 7, column 1.
>>Was expecting one of:
>>     <ArgName> ...
>>     "=" ...
>>     <TagEnd> ...
>>
>>Any suggestions on preventing this from happening?
>>
>>Thanks in advance.
>>-H
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Time to index documents

Posted by Stephane James Vaucher <va...@cirano.qc.ca>.
I don't think that the demo parser is meant as a production 
system component. You can look at Tidy or NekoHtml. They cleanup your html 
and are probably optimised.

sv

On Wed, 25 Aug 2004, Hetan Shah wrote:

> Hello all,
> 
> Is there a way to reduce the indexing time taken when the indexer is 
> indexing about 30,000 + files. It is roughly taking around 6-7 hours to 
> do this. I am using IndexHTML class to create the index out of HTML files.
> 
> Another issue that I see is every once in a while I get the following 
> output on the screen.
> 
> adding ../31/1104852.html
> Parse Aborted: Encountered "\"" at line 7, column 1.
> Was expecting one of:
>      <ArgName> ...
>      "=" ...
>      <TagEnd> ...
> 
> Any suggestions on preventing this from happening?
> 
> Thanks in advance.
> -H
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org