Posted to user@nutch.apache.org by Ayyanar Inbamohan <te...@yahoo.com> on 2005/08/31 10:41:50 UTC

parser for xsl, ppt and zip

Hi all,

Are there any parser plugins available for parsing xls, ppt and
zip files?

thanks in advance,
Ayyanar...


		
 

Re: parser for xsl, ppt and zip

Posted by Jérôme Charron <je...@gmail.com>.
>> Any parser plugins available for parsing xsl,ppt and
>> zip extension files.
>
> Some patches are available for xls, ppt and zip plugins
> (JIRA is currently down, so I can't give you URLs to the related
> issues and patches).
>
Ok, JIRA is up. Here is the link to the related issues:
Parser plugin for MS PowerPoint slides: http://issues.apache.org/jira/browse/NUTCH-21
Parser plugin for MS Excel files: http://issues.apache.org/jira/browse/NUTCH-52
Parser plugin for Zip files: http://issues.apache.org/jira/browse/NUTCH-53

Regards

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: parser for xsl, ppt and zip

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Ok, I need these parsers, but I need some other parsers too:
1. pdf and 2. rtf.

1. I have a problem with the pdf parser. The fetcher log shows only this 
message for pdf files:
050831 084230 fetch of http://www.renyi.hu/~szilard/hw2.pdf failed with: 
java.lang.NoClassDefFoundError: org/pdfbox/exceptions/InvalidPasswordException
Any ideas?

2. The problem with the RTF parser is that it is not under the 
Apache license.
I use RTF with an old nightly build and would like to upgrade it to 0.7, 
but I think this plugin needs to be kept separate from the nutch sources. 
Can I put it in JIRA?

Re: parser for xsl, ppt and zip

Posted by Michael Nebel <mi...@nebel.de>.
Hi Jérôme,

I think the ppt-parser is ready to go. Now!

But the xls-plugin won't work without further modifications. It's a 
"pre-Andrzej" version which still uses the ParseException. I'm not sure 
whether my "hacks" to get it running are OK. I still see many errors 
(null-pointer exceptions while crawling). OK - I see... I should update 
NUTCH-52 :-)

Regards

	Michael



>>The ppt-parser from Stephan Strittmatter (NUTCH-21) seems to work OK - I
>>would suggest adding it to the regular plugins. (Important: if you
>>download the plugin from JIRA, be careful to take the current version,
>>which is by now the second attachment, from 2 Aug 2005!)
>>
>>The xls-plugin from Rohit Kulkarni (NUTCH-52) still needs some work. The
>>latest changes concerning the ParseStatus are not integrated, so it
>>won't run under nutch-0.7. There are also some null-pointer problems.
>>
>>The zip-plugin from Rohit Kulkarni (NUTCH-53) seems to work for me, but
>>I have only run a few tests on it.
> 
> 
> Thanks, Michael, for this status report on the plugins.
> Since the best way to test and improve these plugins is for them to be 
> widely used, I think it's time to commit them.
> If there are no objections in the next few days, I will commit them next week.
> However, my first idea was to commit these patches to the trunk in order to 
> avoid introducing new bugs in the future 0.7.1 release. Committers 
> (especially Piotr, our release expert) and developers, what do you think 
> about this point? (trunk for 0.8, or 0.7 branch for 0.7.1)
> 
> Regards
> 
> Jérôme
> 


-- 
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/


Re: parser for xsl, ppt and zip

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
OK, I found the source of my problem with the pdf parser.
I had updated pdfbox to a nightly build to resolve pausing errors. In 
that case the plugin.xml needs to be updated too.
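
A NoClassDefFoundError like the one reported earlier in this thread generally means that no jar visible to the plugin contains the named class. One way to check this after swapping in a different pdfbox jar is sketched below in Python. It is only an illustration under assumptions: the helper names and the `<library name="..."/>` layout assumed for plugin.xml are not taken from this thread, so compare them against your own plugin.xml before relying on the result.

```python
import zipfile
import xml.etree.ElementTree as ET


def class_entry(class_name):
    """Turn 'a.b.C' or 'a/b/C' into the jar entry name 'a/b/C.class'."""
    return class_name.replace(".", "/") + ".class"


def jars_containing(class_name, jars):
    """Return the names of the jars that contain class_name.

    `jars` maps a jar name to a path or a file-like object; a jar is an
    ordinary zip archive, so zipfile can read it.
    """
    entry = class_entry(class_name)
    found = []
    for name, source in jars.items():
        with zipfile.ZipFile(source) as jar:
            if entry in jar.namelist():
                found.append(name)
    return found


def declared_libraries(plugin_xml_text):
    """List the jar names declared in a plugin descriptor.

    Assumption: libraries appear as <library name="..."/> elements
    (without an XML namespace) somewhere below the root element.
    """
    root = ET.fromstring(plugin_xml_text)
    return [lib.get("name") for lib in root.iter("library")]


def undeclared_or_missing(plugin_xml_text, jars_on_disk):
    """Compare the declared libraries with the jar files actually present."""
    declared = set(declared_libraries(plugin_xml_text))
    present = set(jars_on_disk)
    return {
        "declared_but_missing": sorted(declared - present),
        "present_but_undeclared": sorted(present - declared),
    }
```

If the new jar shows up under "present_but_undeclared" while the old jar name is "declared_but_missing", the descriptor still points at the replaced file, which would be consistent with the error above.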

Re: parser for xsl, ppt and zip

Posted by Jérôme Charron <je...@gmail.com>.
> 
> +1 on committing to the trunk. -0.5 to committing to Release-0.7 :-) See
> below.

OK for me. It seems we share the same release process.

> We have to work out a consistent policy on release engineering model,
> and a policy on committing new functionality, with regard to the releases.

That's right, it's one of the first steps towards QA.
Perhaps you could create a draft ReleasePolicy document on the Wiki with 
these rules, so that everybody can contribute.

Best Regards

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: parser for xsl, ppt and zip

Posted by Jérôme Charron <je...@gmail.com>.
> 
> I would vote for even stricter rules for the
> branch. We used to have a rule that required a bug in the bug tracking
> system and no new features (there were always exceptions, but they were
> rare and well documented), with all commits listed in the changelog with bug
> numbers.

Totally agree with Piotr. JIRA is currently underused (I think there is too 
much information on the mailing lists that is not recorded in the Wiki or 
JIRA).

> I do not think we need all these things, but listing all
> changes in the changelog - release notes will give users more confidence in
> upgrading.

I know CVS better than SVN (but the more I work on nutch, the more that will 
change), so here's a dumb svn question:
Can the svn log command be used to generate a changelog between two revisions?
(It could be an easy and consistent way to deliver a changelog with each 
release, and it would encourage detailed commit messages.)
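
For what it's worth, `svn log` accepts a revision range (`-r START:END`) and an `--xml` flag, so a changelog between two revisions can be derived from its output. A minimal Python sketch follows; the function names and the one-line-per-commit layout are made up for illustration, not an agreed format.

```python
import subprocess
import xml.etree.ElementTree as ET


def fetch_svn_log_xml(start, end, path="."):
    """Run `svn log --xml -r START:END` and return its output as text."""
    result = subprocess.run(
        ["svn", "log", "--xml", "-r", f"{start}:{end}", path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def changelog_lines(log_xml):
    """Turn svn's XML log into one changelog line per commit."""
    lines = []
    for entry in ET.fromstring(log_xml).iter("logentry"):
        revision = entry.get("revision")
        author = entry.findtext("author", default="unknown")
        message = (entry.findtext("msg") or "").strip()
        # Use only the first line of the commit message as the summary.
        summary = message.splitlines()[0] if message else "(no message)"
        lines.append(f"r{revision} ({author}): {summary}")
    return lines
```

Calling `changelog_lines(fetch_svn_log_xml(start, end))` with the two release revisions would give the raw list. Its quality depends entirely on the commit messages, which is the argument for keeping them detailed.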

Regards

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: parser for xsl, ppt and zip

Posted by Piotr Kosiorowski <pk...@gmail.com>.

Andrzej Bialecki wrote:
> 
> +1 on committing to the trunk. -0.5 to committing to Release-0.7 :-) See 

+1 for trunk, -1 to Release-0.7
Totally agree with Andrzej - I would vote for even stricter rules for 
the branch. We used to have a rule that required a bug in the bug tracking 
system and no new features (there were always exceptions, but they were 
rare and well documented), with all commits listed in the changelog with bug 
numbers. I do not think we need all these things, but listing all 
changes in the changelog - release notes will give users more confidence in 
upgrading.
Regards
Piotr

Re: [Nutch-general] Link Analysis in OC

Posted by Kelvin Tan <ke...@relevanz.com>.
No, and that's something that will be worked into the OC before it gets merged into SVN: support for both host-based and score-based fetchlist prioritization.

On Tue, 6 Sep 2005 19:17:42 -0700 (PDT), Michael Ji wrote:
> hi Kelvin:
>
> Does OC support Link Analysis directly?
>
> I guess we have to use updateDB and then use
> DistributeLinkAnalysisTool to generate the pageRank
> score for individual site.
>
> Will there be another scenario that we could get Link
> Analysis Score from OC?
>
> thanks,
>
> Michael Ji
>



Link Analysis in OC

Posted by Michael Ji <fj...@yahoo.com>.
hi Kelvin:

Does OC support Link Analysis directly? 

I guess we have to use updateDB and then use
DistributeLinkAnalysisTool to generate the pageRank
score for individual site.

Will there be another scenario that we could get Link
Analysis Score from OC?

thanks,

Michael Ji


Re: parser for xsl, ppt and zip

Posted by Ayyanar Inbamohan <te...@yahoo.com>.
Hi all,

I am using nutch 6.0. Will the zip, ppt and xls plugins
work in 6.0 or not?

thanks,
Ayyanar...

--- Doug Cutting <cu...@nutch.org> wrote:

> Andrzej Bialecki wrote:
> > I usually follow these rules, which I propose to discuss/modify/accept:
> > 
> > * New features are first committed to trunk (or CVS HEAD). This way we 
> > avoid losing new features somewhere on the branches, because as the time 
> > goes it would be more and more difficult to forward-port them from past 
> > branches to the trunk.
> > 
> > * If there are important features, which will benefit majority of users, 
> > these are back-ported to release branches afterwards. I believe this is 
> > the case with the ppt/xls plugins.
> > 
> > * other than that, the code in release branches is considered "stable", 
> > i.e. no new features are introduced except for fixing some minor issues.
> 
> +1
> 
> Doug
> 



		


Re: parser for xsl, ppt and zip

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> I usually follow these rules, which I propose to discuss/modify/accept:
> 
> * New features are first committed to trunk (or CVS HEAD). This way we 
> avoid losing new features somewhere on the branches, because as the time 
> goes it would be more and more difficult to forward-port them from past 
> branches to the trunk.
> 
> * If there are important features, which will benefit majority of users, 
> these are back-ported to release branches afterwards. I believe this is 
> the case with the ppt/xls plugins.
> 
> * other than that, the code in release branches is considered "stable", 
> i.e. no new features are introduced except for fixing some minor issues.

+1

Doug

Re: parser for xsl, ppt and zip

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jérôme Charron wrote:

> Thanks, Michael, for this status report on the plugins.
> Since the best way to test and improve these plugins is for them to be 
> widely used, I think it's time to commit them.
> If there are no objections in the next few days, I will commit them next week.
> However, my first idea was to commit these patches to the trunk in order to 
> avoid introducing new bugs in the future 0.7.1 release. Committers 
> (especially Piotr, our release expert) and developers, what do you think 
> about this point? (trunk for 0.8, or 0.7 branch for 0.7.1)

+1 on committing to the trunk. -0.5 to committing to Release-0.7 :-) See 
below.

We have to work out a consistent policy on release engineering model, 
and a policy on committing new functionality, with regard to the releases.

I usually follow these rules, which I propose to discuss/modify/accept:

* New features are first committed to trunk (or CVS HEAD). This way we 
avoid losing new features somewhere on the branches, because as the time 
goes it would be more and more difficult to forward-port them from past 
branches to the trunk.

* If there are important features, which will benefit majority of users, 
these are back-ported to release branches afterwards. I believe this is 
the case with the ppt/xls plugins.

* other than that, the code in release branches is considered "stable", 
i.e. no new features are introduced except for fixing some minor issues.

This is an important distinction, because users will expect the 
Release-* branches to work properly at all times - i.e. at any given 
moment they should be able to get the code, recompile it and have it 
work properly - of course, within the functional limits of the given 
release. This is not the case with trunk/ or HEAD, where active 
development occurs and where occasional breakage may happen and may 
even last for some time, which is acceptable there.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: parser for xsl, ppt and zip

Posted by Jérôme Charron <je...@gmail.com>.
> 
> The ppt-parser from Stephan Strittmatter (NUTCH-21) seems to work OK - I
> would suggest adding it to the regular plugins. (Important: if you
> download the plugin from JIRA, be careful to take the current version,
> which is by now the second attachment, from 2 Aug 2005!)
> 
> The xls-plugin from Rohit Kulkarni (NUTCH-52) still needs some work. The
> latest changes concerning the ParseStatus are not integrated, so it
> won't run under nutch-0.7. There are also some null-pointer problems.
> 
> The zip-plugin from Rohit Kulkarni (NUTCH-53) seems to work for me, but
> I have only run a few tests on it.

Thanks, Michael, for this status report on the plugins.
Since the best way to test and improve these plugins is for them to be 
widely used, I think it's time to commit them.
If there are no objections in the next few days, I will commit them next week.
However, my first idea was to commit these patches to the trunk in order to 
avoid introducing new bugs in the future 0.7.1 release. Committers 
(especially Piotr, our release expert) and developers, what do you think 
about this point? (trunk for 0.8, or 0.7 branch for 0.7.1)

Regards

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: parser for xsl, ppt and zip

Posted by Michael Nebel <mi...@nebel.de>.
Hi,

The ppt-parser from Stephan Strittmatter (NUTCH-21) seems to work OK - I 
would suggest adding it to the regular plugins. (Important: if you 
download the plugin from JIRA, be careful to take the current version, 
which is by now the second attachment, from 2 Aug 2005!)

The xls-plugin from Rohit Kulkarni (NUTCH-52) still needs some work. The 
latest changes concerning the ParseStatus are not integrated, so it 
won't run under nutch-0.7. There are also some null-pointer problems.

The zip-plugin from Rohit Kulkarni (NUTCH-53) seems to work for me, but 
I have only run a few tests on it.

Regards

	Michael



Jérôme Charron wrote:

>>Any parser plugins available for parsing xsl,ppt and
>>zip extension files.
> 
> 
> Some patches are available for xls, ppt and zip plugins
> (JIRA is currently down, so I can't give you URLs to the related issues 
> and patches).
> 
> If people are interested in these patches being committed to trunk, please 
> vote for them.
> http://issues.apache.org/jira/browse/Nutch
> 
> Regards
> 
> Jérôme
> 
> 


-- 
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/


RE: parser for xsl, ppt and zip

Posted by EM <em...@cpuedge.com>.
+1

-----Original Message-----
From: yoursoft@freemail.hu [mailto:yoursoft@freemail.hu] 
Sent: Wednesday, August 31, 2005 7:58 AM
To: nutch-user@lucene.apache.org
Subject: Re: parser for xsl, ppt and zip

+1 vote xls, ppt, zip.

Jérôme Charron wrote:

>>Any parser plugins available for parsing xsl,ppt and
>>zip extension files.
>
>Some patches are available for xls, ppt and zip plugins
>(JIRA is currently down, so I can't give you URLs to the related issues
>and patches).
>
>If people are interested in these patches being committed to trunk, please
>vote for them.
>http://issues.apache.org/jira/browse/Nutch
>
>Regards
>
>Jérôme
>




Re: parser for xsl, ppt and zip

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
+1 vote xls, ppt, zip.

Jérôme Charron wrote:

>>Any parser plugins available for parsing xsl,ppt and
>>zip extension files.
>
>Some patches are available for xls, ppt and zip plugins
>(JIRA is currently down, so I can't give you URLs to the related issues 
>and patches).
>
>If people are interested in these patches being committed to trunk, please 
>vote for them.
>http://issues.apache.org/jira/browse/Nutch
>
>Regards
>
>Jérôme
>


Re: parser for xsl, ppt and zip

Posted by Jérôme Charron <je...@gmail.com>.
> Any parser plugins available for parsing xsl,ppt and
> zip extension files.

Some patches are available for xls, ppt and zip plugins
(JIRA is currently down, so I can't give you URLs to the related issues 
and patches).

If people are interested in these patches being committed to trunk, please 
vote for them.
http://issues.apache.org/jira/browse/Nutch

Regards

Jérôme


-- 
http://motrech.free.fr/ 
http://www.frutch.org/