Posted to user@nutch.apache.org by kiran chitturi <ch...@gmail.com> on 2013/03/04 22:33:55 UTC

Nutch 1.6 : How to reparse Nutch segments ?

Hi!

I am trying to reparse Nutch segments, and it says 'Segment already parsed'
when I try to parse.

Is there any option or attribute such as '-reparse', like the 2.x series has?

Should I delete some directories so that I can reparse?

Please give me suggestions on how to reparse segments that are already
parsed.

Thanks,
-- 
Kiran Chitturi

Re: Nutch 1.6 : How to reparse Nutch segments ?

Posted by kiran chitturi <ch...@gmail.com>.
Thanks Tejas. Deleting the 'crawl_parse' directory worked for me today.




On Mon, Mar 4, 2013 at 11:15 PM, Tejas Patil <te...@gmail.com>wrote:

> Yes. After I deleted that directory, parse operation ran successfully. Even
> if its an empty directory, parse wont proceed normally.



-- 
Kiran Chitturi

Re: Nutch 1.6 : How to reparse Nutch segments ?

Posted by Tejas Patil <te...@gmail.com>.
Yes. After I deleted that directory, the parse operation ran successfully. Even
if it is an empty directory, parse won't proceed normally.


On Mon, Mar 4, 2013 at 8:07 PM, kiran chitturi <ch...@gmail.com>wrote:

> Thanks Tejas for the information.
>
> Did you try deleting 'crawl_parse' directory ? Since, the code checks for
> that directory, i will try deleting and reparsing.

Re: Nutch 1.6 : How to reparse Nutch segments ?

Posted by kiran chitturi <ch...@gmail.com>.
Thanks Tejas for the information.

Did you try deleting the 'crawl_parse' directory? Since the code checks for
that directory, I will try deleting it and reparsing.



On Mon, Mar 4, 2013 at 10:49 PM, Tejas Patil <te...@gmail.com>wrote:

> The code [0] checks if there is already a "crawl_parse" directory in the
> segment [lines 88-89].
>
>  88 if (fs.exists(new Path(out, CrawlDatum.PARSE_DIR_NAME)))
>  89   throw new IOException("Segment already parsed!");
>
> I am not sure what you guys meant by deleting the subsection of the
> directories. Did you mean deletion of the contents inside the old
> crawl_parse directory ? I tried that locally and it didn't work.
>
> [0] :
>
> http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java?view=markup



-- 
Kiran Chitturi

Re: Nutch 1.6 : How to reparse Nutch segments ?

Posted by Tejas Patil <te...@gmail.com>.
The code [0] checks whether there is already a "crawl_parse" directory in the
segment [lines 88-89]:

 88 if (fs.exists(new Path(out, CrawlDatum.PARSE_DIR_NAME)))
 89   throw new IOException("Segment already parsed!");

I am not sure what you guys meant by deleting a subsection of the
directories. Did you mean deleting the contents of the old
crawl_parse directory? I tried that locally and it didn't work.

[0] :
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java?view=markup
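
To make the behaviour of that check concrete, here is a minimal sketch that
imitates it on the local filesystem with java.nio instead of the Hadoop
FileSystem API that Nutch actually uses. The class name, the method names and
the temporary "segment" directory are all made up for illustration; only the
"crawl_parse" directory name and the exists-style check come from the code
quoted above. It shows why emptying the directory is not enough (the directory
itself still exists) and why removing it lets the check pass.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ReparseSketch {
    // Directory name from the thread; in Nutch it is CrawlDatum.PARSE_DIR_NAME.
    static final String PARSE_DIR_NAME = "crawl_parse";

    // Imitates the quoted check: only the existence of the directory matters,
    // so an empty crawl_parse still counts as "already parsed".
    public static boolean alreadyParsed(Path segment) {
        return Files.exists(segment.resolve(PARSE_DIR_NAME));
    }

    // Deletes crawl_parse and everything below it (children before parents).
    public static void clearParseMarker(Path segment) throws IOException {
        Path dir = segment.resolve(PARSE_DIR_NAME);
        if (!Files.exists(dir)) {
            return; // nothing to remove
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            paths = walk.sorted(Comparator.reverseOrder())
                        .collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for a real segment directory.
        Path segment = Files.createTempDirectory("segment");
        Files.createDirectory(segment.resolve(PARSE_DIR_NAME)); // left empty

        System.out.println("before delete: " + alreadyParsed(segment));
        clearParseMarker(segment);
        System.out.println("after delete:  " + alreadyParsed(segment));
    }
}
```

For a segment stored on HDFS the same idea would have to go through the
Hadoop filesystem tools rather than java.nio; this sketch only covers a local
directory.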


On Mon, Mar 4, 2013 at 4:20 PM, kiran chitturi <ch...@gmail.com>wrote:

> It took me close to 2 days to fetch 400k pages on my not so fast single
> machine. I do not want to refetch unless it very crucial.
>
> I will check and see if deleting any sub-directories is helpful
>
> Thanks!

Re: Nutch 1.6 : How to reparse Nutch segments ?

Posted by kiran chitturi <ch...@gmail.com>.
It took me close to 2 days to fetch 400k pages on my not-so-fast single
machine. I do not want to refetch unless it is really crucial.

I will check and see if deleting any sub-directories helps.

Thanks!


On Mon, Mar 4, 2013 at 5:54 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> This makes perfect sense Kiran. It is something I've encountered in the
> past and as my segments were not production critical I was easily able to
> delete and re-fetch them then parse out the stuff I wanted to.
> As I said, I think this is the only way to get I'm afraid.



-- 
Kiran Chitturi

Re: Nutch 1.6 : How to reparse Nutch segments ?

Posted by Lewis John Mcgibbney <le...@gmail.com>.
This makes perfect sense, Kiran. It is something I've encountered in the
past, and as my segments were not production critical I was easily able to
delete and re-fetch them, then parse out the stuff I wanted.
As I said, I think this is the only way to go, I'm afraid.

On Mon, Mar 4, 2013 at 2:25 PM, kiran chitturi <ch...@gmail.com>wrote:

> Yeah. I used parse-(tika|metatags) first in the configuration and now i
> want to use parse-(html|tika|metatags). This is due to the parse-metatags
> new patch upgrade.
>
> Thanks for the suggestions. It would be very helpful for reparsing segments
> for 1.x like 2.x has.
>
> Regards,
> Kiran.



-- 
*Lewis*

Re: Nutch 1.6 : How to reparse Nutch segments ?

Posted by kiran chitturi <ch...@gmail.com>.
Yeah. I used parse-(tika|metatags) first in the configuration and now I
want to use parse-(html|tika|metatags). This is due to the new
parse-metatags patch upgrade.
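
As a rough illustration of what that change selects, the sketch below treats
the two fragments as regular expressions matched against whole plugin ids,
which is how I understand the plugin.includes setting to work. The class name,
the method name and the list of candidate plugin ids are assumptions made for
this example, not something taken from a real configuration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ParsePluginPatternSketch {
    // Returns the plugin ids that the regular expression matches in full.
    public static List<String> selected(String regex, List<String> pluginIds) {
        Pattern pattern = Pattern.compile(regex);
        return pluginIds.stream()
                .filter(id -> pattern.matcher(id).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumed plugin ids, used only to exercise the two patterns.
        List<String> ids = Arrays.asList("parse-html", "parse-tika", "parse-metatags");

        // Old setting: parse-html is not selected.
        System.out.println(selected("parse-(tika|metatags)", ids));
        // New setting: parse-html is selected as well.
        System.out.println(selected("parse-(html|tika|metatags)", ids));
    }
}
```

The only difference between the two settings is that the second one also
matches parse-html, which is why the existing segments need to be parsed again.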

Thanks for the suggestions. It would be very helpful if 1.x had a reparse
option for segments, like 2.x has.

Regards,
Kiran.


On Mon, Mar 4, 2013 at 4:51 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Please don't go ahead and delete the parse directories just yet before you
> hear back from others.
> My suggestion would be to try and delete a subsection of the directories
> and see if this is possible.
> Have you changed some configuration and now want to parse out some more
> content/structure?



-- 
Kiran Chitturi

Re: Nutch 1.6 : How to reparse Nutch segments ?

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Please don't go ahead and delete the parse directories until you
hear back from others.
My suggestion would be to try deleting a subsection of the directories
and see whether reparsing is then possible.
Have you changed some configuration and now want to parse out some more
content/structure?


On Mon, Mar 4, 2013 at 1:33 PM, kiran chitturi <ch...@gmail.com>wrote:

> Hi!
>
> I am trying to reparse Nutch segments and it says 'Segment already parsed'
> when i try to parse.
>
> Is there any option of attribute as '-reparse' like 2.x series has ?
>
> Should i delete some directories so that i can reparse ?
>
> Please give me suggestions on how to reparse segments that are already
> parsed.
>
> Thanks,
> --
> Kiran Chitturi
>



-- 
*Lewis*