Posted to user@nutch.apache.org by "Withanage, Dulip" <wi...@asia-europe.uni-heidelberg.de> on 2010/02/03 12:08:03 UTC

PDF Parsing

I parse a PDF collection using the web crawler.
Some PDFs are corrupt, and this makes the whole Lucene index unusable.
Does anybody have any idea how to get around this problem?


Best regards,

Dulip Withanage, M.Sc

Cluster of Excellence
Karl Jaspers Centre
Heidelberg
e-mail: withanage@asia-europe.uni-heidelberg.de




Re: PDF Parsing

Posted by Alexander Aristov <al...@gmail.com>.
Your problem has nothing to do with PDFs. Do you see any messages or exceptions
when you merge the indexes?
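
One way to narrow that down is to run Lucene's CheckIndex over each part index before merging it in. A minimal sketch, assuming a current Lucene API (not necessarily the version bundled with Nutch here) and a hypothetical index path passed as the first argument:

import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class CheckPartIndex {
    public static void main(String[] args) throws Exception {
        // args[0]: path to one part index, e.g. "part-0001/index" (hypothetical layout)
        try (FSDirectory dir = FSDirectory.open(Paths.get(args[0]));
             CheckIndex checker = new CheckIndex(dir)) {
            CheckIndex.Status status = checker.checkIndex();
            System.out.println(args[0] + (status.clean ? ": clean" : ": CORRUPT"));
        }
    }
}

The same check is available from the command line as java -cp <lucene-core jar> org.apache.lucene.index.CheckIndex <indexDir>.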

Best Regards
Alexander Aristov


On 4 February 2010 12:58, Withanage, Dulip <
withanage@asia-europe.uni-heidelberg.de> wrote:

> Thanks for the initial ideas.
> >> Are the PDFs really corrupt, or do they get corrupted when they are downloaded?
> Sorry for my false assumption at the beginning. I am absolutely new to
> both Lucene and Nutch.
> I think the index itself is not corrupt. It gets corrupted during the
> merge-crawl process.
>
> These are my steps:
> 1. I have a collection of around 2000 PDF documents on a web server.
> 2. I generate one index for every 100 documents.
> 3. Then I use a modified merge script to merge the indexes:
> http://wiki.apache.org/nutch/MergeCrawl
> 4. I add each directory one after the other to build a complete index.
> 5. The merged Lucene index becomes corrupt once the merge reaches an
> index directory of about 400 MB.
>
>
>
>
> -----Original Message-----
> From: Alexander Aristov [mailto:alexander.aristov@gmail.com]
> Sent: Wednesday, February 03, 2010 9:00 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: PDF Parsing
>
> Hi,
>
> Are the PDFs really corrupt, or do they get corrupted when they are
> downloaded? There is a parameter in Nutch that limits the downloaded
> content size; it simply truncates the files, and they end up corrupted.
> Check this setting.
>
> Best Regards
> Alexander Aristov
>
>
> On 3 February 2010 21:52, Ken Krugler <kk...@transpac.com> wrote:
>
> >
> > On Feb 3, 2010, at 3:08am, Withanage, Dulip wrote:
> >
> >> I parse a PDF collection using the web crawler.
> >> Some PDFs are corrupt, and this makes the whole Lucene index unusable.
> >> Does anybody have any idea how to get around this problem?
> >>
> >
> > How does it make the "whole Lucene index unusable"?
> >
> > Normally a corrupt PDF can cause an exception to be thrown during
> > parsing, or it can cause the parser to hang.
> >
> > It might output a bunch of garbage, but that shouldn't cause the index to
> > become invalid.
> >
> > -- Ken
> >
> >  Best regards,
> >>
> >> Dulip Withanage, M.Sc
> >>
> >>
> >> Cluster of Excellence
> >> Karl Jaspers Centre
> >> Heidelberg
> >> e-mail: withanage@asia-europe.uni-heidelberg.de
> >>
> >>
> >>
> >>
> > --------------------------------------------
> > Ken Krugler
> > +1 530-210-6378
> > http://bixolabs.com
> > e l a s t i c   w e b   m i n i n g
> >
> >
> >
> >
> >
>

RE: PDF Parsing

Posted by "Withanage, Dulip" <wi...@asia-europe.uni-heidelberg.de>.
Thanks for the initial ideas.
>> Are the PDFs really corrupt, or do they get corrupted when they are downloaded?
Sorry for my false assumption at the beginning. I am absolutely new to both Lucene and Nutch.
I think the index itself is not corrupt. It gets corrupted during the merge-crawl process.

These are my steps:
1. I have a collection of around 2000 PDF documents on a web server.
2. I generate one index for every 100 documents.
3. Then I use a modified merge script to merge the indexes: http://wiki.apache.org/nutch/MergeCrawl (a plain-Lucene sketch of this merge step follows below).
4. I add each directory one after the other to build a complete index.
5. The merged Lucene index becomes corrupt once the merge reaches an index directory of about 400 MB.
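
For reference, this is roughly what the merge step does in plain Lucene. It is only a sketch, assuming a current Lucene API (not the exact version the MergeCrawl script drives) and hypothetical paths: the first argument is the destination index, the remaining arguments are the part indexes.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class MergePartIndexes {
    public static void main(String[] args) throws Exception {
        try (Directory dest = FSDirectory.open(Paths.get(args[0]));
             IndexWriter writer = new IndexWriter(dest,
                     new IndexWriterConfig(new StandardAnalyzer()))) {
            Directory[] parts = new Directory[args.length - 1];
            for (int i = 1; i < args.length; i++) {
                parts[i - 1] = FSDirectory.open(Paths.get(args[i]));
            }
            writer.addIndexes(parts);  // copies the segments of every part index
            writer.forceMerge(1);      // optional: collapse the result into one segment
        }
    }
}

If one of the part indexes is already damaged, addIndexes will normally fail with a CorruptIndexException rather than silently producing a bad merged index, which helps pinpoint which part triggers the failure around 400 MB.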




-----Original Message-----
From: Alexander Aristov [mailto:alexander.aristov@gmail.com] 
Sent: Wednesday, February 03, 2010 9:00 PM
To: nutch-user@lucene.apache.org
Subject: Re: PDF Parsing

Hi,

Are the PDFs really corrupt, or do they get corrupted when they are downloaded?
There is a parameter in Nutch that limits the downloaded content size; it simply
truncates the files, and they end up corrupted. Check this setting.

Best Regards
Alexander Aristov


On 3 February 2010 21:52, Ken Krugler <kk...@transpac.com> wrote:

>
> On Feb 3, 2010, at 3:08am, Withanage, Dulip wrote:
>
>> I parse a PDF collection using the web crawler.
>> Some PDFs are corrupt, and this makes the whole Lucene index unusable.
>> Does anybody have any idea how to get around this problem?
>>
>
> How does it make the "whole Lucene index unusable"?
>
> Normally a corrupt PDF can cause an exception to be thrown during parsing,
> or it can cause the parser to hang.
>
> It might output a bunch of garbage, but that shouldn't cause the index to
> become invalid.
>
> -- Ken
>
>  Best regards,
>>
>> Dulip Withanage, M.Sc
>>
>>
>> Cluster of Excellence
>> Karl Jaspers Centre
>> Heidelberg
>> e-mail: withanage@asia-europe.uni-heidelberg.de
>>
>>
>>
>>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>

Re: PDF Parsing

Posted by Alexander Aristov <al...@gmail.com>.
Hi,

Are the PDFs really corrupt, or do they get corrupted when they are downloaded?
There is a parameter in Nutch that limits the downloaded content size; it simply
truncates the files, and they end up corrupted. Check this setting.
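
If the setting meant here is http.content.limit (Nutch's cap on fetched content size, which truncates anything larger than the limit), raising or disabling it in conf/nutch-site.xml would look roughly like this; check the property name and default against your own nutch-default.xml:

<property>
  <name>http.content.limit</name>
  <!-- -1 disables truncation; the small default cuts large PDFs short -->
  <value>-1</value>
</property>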

Best Regards
Alexander Aristov


On 3 February 2010 21:52, Ken Krugler <kk...@transpac.com> wrote:

>
> On Feb 3, 2010, at 3:08am, Withanage, Dulip wrote:
>
>> I parse a PDF collection using the web crawler.
>> Some PDFs are corrupt, and this makes the whole Lucene index unusable.
>> Does anybody have any idea how to get around this problem?
>>
>
> How does it make the "whole Lucene index unusable"?
>
> Normally a corrupt PDF can cause an exception to be thrown during parsing,
> or it can cause the parser to hang.
>
> It might output a bunch of garbage, but that shouldn't cause the index to
> become invalid.
>
> -- Ken
>
>  Best regards,
>>
>> Dulip Withanage, M.Sc
>>
>>
>> Cluster of Excellence
>> Karl Jaspers Centre
>> Heidelberg
>> e-mail: withanage@asia-europe.uni-heidelberg.de
>>
>>
>>
>>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>

Re: PDF Parsing

Posted by Ken Krugler <kk...@transpac.com>.
On Feb 3, 2010, at 3:08am, Withanage, Dulip wrote:

> I parse a PDF collection using the web crawler.
> Some PDFs are corrupt, and this makes the whole Lucene index unusable.
> Does anybody have any idea how to get around this problem?

How does it make the "whole Lucene index unusable"?

Normally a corrupt PDF can cause an exception to be thrown during  
parsing, or it can cause the parser to hang.

It might output a bunch of garbage, but that shouldn't cause the index  
to become invalid.
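
As an illustration of that failure mode, here is a minimal sketch using Apache Tika (which later Nutch versions use for PDF parsing; the list of files here is hypothetical). A corrupt or truncated PDF throws a parse exception that can be caught and skipped, so it never reaches the indexer:

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

import java.io.File;
import java.io.IOException;

public class SkipBrokenPdfs {
    public static void main(String[] args) {
        Tika tika = new Tika();
        for (String path : args) {  // hypothetical: PDF paths passed on the command line
            try {
                String text = tika.parseToString(new File(path));
                System.out.println(path + ": extracted " + text.length() + " chars");
                // only text that parsed cleanly would be handed on to the indexer
            } catch (IOException | TikaException e) {
                // broken PDF: log it and skip it instead of letting it abort the run
                System.err.println("skipping " + path + ": " + e);
            }
        }
    }
}

A parser that hangs is a different problem; that needs a timeout around the call rather than a catch block.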

-- Ken

> Best regards,
>
> Dulip Withanage, M.Sc
>
>
> Cluster of Excellence
> Karl Jaspers Centre
> Heidelberg
> e-mail: withanage@asia-europe.uni-heidelberg.de
>
>
>

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g