Posted to java-user@lucene.apache.org by Jo...@dom.com on 2002/01/08 23:08:30 UTC

indexing big files

Question from a Lucene newbie: I'm trying to index a file structure that
happens to include a relatively large file (310 KB, 55,700 words), and
for some reason it appears to be hanging the whole indexing process.  Here's
a quick rundown:

1) I'm using a web crawler to retrieve files and copy them to my local disk.
2) For files like PDFs, I'm copying an .html equivalent of the file to
my disk (but leaving the .pdf extension).
3) Later, in a separate batch process, I run pretty much the standard
out-of-the-box "org.apache.lucene.IndexHTML" demo class (except that I've
added .pdf as a possible indexing type).
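
For illustration, the extension change in step 3 might look roughly like the
sketch below. This is not the demo's actual code: the class name
IndexableFileFilter and the method isIndexable are hypothetical, and the list
of extensions is an assumption based on the setup described in this message.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the Lucene demo: decides whether a file
// name should be handed to the HTML indexing path.
final class IndexableFileFilter {
    // ".pdf" is listed only because the crawler in this setup stores an HTML
    // rendering under the original .pdf name (an assumption of this sketch).
    private static final List<String> HTML_EXTENSIONS =
            Arrays.asList(".html", ".htm", ".pdf");

    // Returns true when the file name ends with one of the accepted
    // extensions, ignoring case.
    static boolean isIndexable(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (String ext : HTML_EXTENSIONS) {
            if (lower.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Print the decision for each file name given on the command line.
        for (String name : args) {
            System.out.println(name + " -> " + isIndexable(name));
        }
    }
}
```

Note that a check like this looks only at the name, so a real PDF binary that
slipped through under a .pdf name would still be parsed as HTML.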

That's about it.  No big deal.  The transformation from PDF to HTML is not
perfected yet either, so file size will definitely drop in the future, since
nonsense terms are currently being included in these files.  But for now,
what should I be looking at or altering to find out what is causing the hang?
Thanks!

Jon Wasson


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: indexing big files

Posted by Winton Davies <wd...@overture.com>.
My guess is garbage collection. Try allocating twice as much heap as before,
or more (the -Xmx option). Try running with -verbose:gc to see how much time
the JVM spends collecting.
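
One way to check the heap theory from inside the JVM is to print heap figures
while indexing runs. The sketch below is only an illustration: the class name
HeapReport and its methods are hypothetical, and where to call it in the
indexing loop is left open. It uses only the standard Runtime methods
maxMemory(), totalMemory() and freeMemory().

```java
// Hypothetical diagnostic helper, not part of Lucene: reports how much of
// the Java heap is in use so a near-full heap can be spotted during indexing.
final class HeapReport {
    private static final long MB = 1024L * 1024L;

    // Formats heap figures given in bytes as whole megabytes.
    // used = committed heap minus the free part of the committed heap.
    static String describe(long maxBytes, long totalBytes, long freeBytes) {
        long usedBytes = totalBytes - freeBytes;
        return "used=" + (usedBytes / MB) + "MB"
                + " committed=" + (totalBytes / MB) + "MB"
                + " max=" + (maxBytes / MB) + "MB";
    }

    // Reads the current figures from the running JVM.
    static String snapshot() {
        Runtime rt = Runtime.getRuntime();
        return describe(rt.maxMemory(), rt.totalMemory(), rt.freeMemory());
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```

Running it as, say, "java -Xmx256m -verbose:gc HeapReport" shows the raised
limit in the "max" figure; the 256m value is only an example. If "used" stays
close to "max" while the large file is being indexed, more heap is worth
trying.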

  Cheers,
  Winton


Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/
