Posted to user@nutch.apache.org by Ayyanar Inbamohan <te...@yahoo.com> on 2005/09/06 13:42:37 UTC

nutch 7.0 not fetching powerpoint, plugin is present

Hi all,

I am using the PowerPoint plugin from JIRA, but when I
crawl my application, which links to .ppt files, Nutch 0.7
does not fetch the PowerPoint files at all.

I am crawling my local application:

http://localhost:8080/search_sample/index.html

I have put this URL in url.intranet.

I added some hrefs to PowerPoint files in index.html

and then started the crawl, but they are not being crawled.



Thanks in advance..

thanks,
Ayyanar....


	
		

Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Jérôme Charron <je...@gmail.com>.
> Funnily enough, I was thinking the other way around: could it be a
> requirement for someone that two plugins parse the same content-type?
> One plugin does some parts of the parsing, then hands over the page to
> another one, _à la_ Visitor.

It could be an interesting feature.
But for now, I am concentrating on the features we need, and as you point
out, there are many issues to solve: 

> - I looked quite briefly at the current code, so I could be wrong, but
> the parsers are put into cache (in ParserFactory) in a Hashtable which
> takes <content-type>+<extension> as a key. One of the plugins will
> overshadow the other, most likely the one which loaded last. That
> being said, we can end up in strange situations were a plugin handles
> pps whilst another one handles ppt.
> - Also, the way it now works, a plugin does the job on its own. But if
> two plugins were to do the parsing, wouldn't be the results deduped at
> some point anyway?
> - Last thing, I couldn't think of any convincing example.


Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Jérôme Charron <je...@gmail.com>.
> Jérôme Charron wrote:
> > I really don't like this solution to centralize this kind of information.
> > I think it's the plugin's responsibility to claim the
> > content-type/path-suffix it can handle.
> 
> However, what happens if more than one plugin claims that it can handle
> any given content-type? E.g. html parser may claim that it supports
> plaintext. but there is another plugin specifically for plaintext. Which
> of them wins?



That's the drawback of this solution (a plugin is chosen at random if several 
claim to handle a content-type).
The drawback of the "centralized" one is that it lets the administrator 
easily associate a parse plugin with a content-type that the plugin does not 
handle.

Jérôme


-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Jérôme Charron <je...@gmail.com>.
> 
> By now, I would have to extend the parse-html-plugin. Having the ability
> to define two plugins which handle one content-type would allow me to
> use the standard plugin and handle my own bugs.

Why not define an extension point in the parse-html plugin and provide a 
sub-plugin of the parse-html plugin for your needs?

Regards

Jérôme


-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Michael Nebel <mi...@nebel.de>.
Hi Sébastien,

 > Funnily enough, I was thinking the other way around: could it be a
 > requirement for someone that two plugins parse the same content-type?
 > One plugin does some parts of the parsing, then hands over the page to
 > another one, _à la_ Visitor.  But then there are several issues:
..
> - Last thing, I couldn't think of any convincing example.

Perhaps I can think of an example:

- I want to have a crawler which indexes normal HTTP pages
- for some sites, I also want to index some special meta tags/extract
   some well-known information

For now, I would have to extend the parse-html plugin. Having the ability 
to define two plugins which handle one content-type would allow me to 
use the standard plugin and handle my own bugs.

Regards

	Michael

-- 
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/


Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Sébastien LE CALLONNEC <sl...@yahoo.ie>.
Funnily enough, I was thinking the other way around: could it be a
requirement for someone that two plugins parse the same content-type? 
One plugin does some parts of the parsing, then hands over the page to
another one, _à la_ Visitor.  But then there are several issues: 

- I looked quite briefly at the current code, so I could be wrong, but
the parsers are put into a cache (in ParserFactory) in a Hashtable which
takes <content-type>+<extension> as a key.  One of the plugins will
overshadow the other, most likely the one which loaded last.  That
being said, we can end up in strange situations where one plugin handles
pps whilst another one handles ppt.
- Also, the way it now works, a plugin does the job on its own.  But if
two plugins were to do the parsing, wouldn't the results be deduped at
some point anyway?
- Last thing, I couldn't think of any convincing example.


So all in all, I had reached the same conclusion as Doug's, that is,
that the ParserFactory should probably handle all that.


Sébastien.

--- Andrzej Bialecki <ab...@getopt.org> a écrit :

> 
> However, what happens if more than one plugin claims that it can
> handle 
> any given content-type? E.g. html parser may claim that it supports 
> plaintext. but there is another plugin specifically for plaintext.
> Which 
> of them wins?
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 



	

	
		

Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jérôme Charron wrote:
> I really don't like this solution to centralize this kind of information.
> I think it's the plugin's responsibility to claim the 
> content-type/path-suffix it can handle.

However, what happens if more than one plugin claims that it can handle 
any given content-type? E.g. html parser may claim that it supports 
plaintext. but there is another plugin specifically for plaintext. Which 
of them wins?

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Jérôme Charron <je...@gmail.com>.
> This is possible now by simply configuring a catch-all plugin to match
> the empty suffix and removing the empty suffix from other plugins. So
> it seems the problem is not that this is currently impossible, but
> rather that it would be better to alter the configuration than the
> plugin definitions.
> 
> So we might have ParserFactory read a config file that maps content
> types and url suffixes to plugins. Folks can edit this file instead of
> modifying the plugin declarations.

I really don't like this solution of centralizing this kind of information.
I think it's the plugin's responsibility to claim the 
content-type/path-suffix it can handle.
The problems today are:
1. In some plugins (powerpoint for instance), the plugin.xml contains only 
one content-type and doesn't contain a path suffix (the original problem 
of this thread could easily be solved by adding the ppt pathSuffix to the 
plugin manifest).
2. What to do if no parse plugin matches either the content-type or the 
path-suffix? I think Andrzej's solution is the right one: provide a default 
parser (doing its best to extract textual content). If this parser is not 
activated, then simply ignore the content.

> It can also define default handlers
> for unknown content types and unknown suffixes. This could either
> augment or entirely replace the specifications in the plugins
> themselves. Does this make sense?

I really think that parse-plugins must specify the content-type and the 
path-suffix they can handle.
No?
Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> 3. implement a catch-all plugin, which is equivalent to a Unix command 
> strings(1) (I have an implementation of that which I can contribute). 
> And turn it off/on in the config, if it's off, then the unknown content 
> is skipped and logged, if it's on - then make the best effort to extract 
> text.

This is possible now by simply configuring a catch-all plugin to match 
the empty suffix and removing the empty suffix from other plugins.  So 
it seems the problem is not that this is currently impossible, but 
rather that it would be better to alter the configuration than the 
plugin definitions.

So we might have ParserFactory read a config file that maps content 
types and url suffixes to plugins.  Folks can edit this file instead of 
modifying the plugin declarations.  It can also define default handlers 
for unknown content types and unknown suffixes.  This could either 
augment or entirely replace the specifications in the plugins 
themselves.  Does this make sense?
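
A minimal sketch of the kind of mapping proposed above, assuming a
properties-style file. The key prefixes (type., suffix., default), the class
name and the parse-strings plugin id are all invented for illustration; none
of them is part of Nutch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch: not Nutch's ParserFactory, only the lookup idea.
class ParserMappingSketch {
    // Invented file format: "type.<mime>", "suffix.<ext>" and "default" keys.
    static final String DEMO_CONFIG = String.join("\n",
        "type.application/vnd.ms-powerpoint=parse-mspowerpoint",
        "type.application/powerpoint=parse-mspowerpoint",
        "suffix.ppt=parse-mspowerpoint",
        "default=parse-strings");

    private final Properties map = new Properties();

    ParserMappingSketch(String config) {
        try {
            map.load(new StringReader(config));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Content type first, then URL suffix, then the default handler (may be null).
    String pluginFor(String contentType, String urlPath) {
        if (contentType != null) {
            String byType = map.getProperty("type." + contentType);
            if (byType != null) return byType;
        }
        int dot = urlPath.lastIndexOf('.');
        if (dot >= 0) {
            String bySuffix = map.getProperty("suffix." + urlPath.substring(dot + 1));
            if (bySuffix != null) return bySuffix;
        }
        return map.getProperty("default");
    }

    public static void main(String[] args) {
        ParserMappingSketch m = new ParserMappingSketch(DEMO_CONFIG);
        System.out.println(m.pluginFor("application/powerpoint", "/search_sample/kmportal3.ppt"));
        System.out.println(m.pluginFor(null, "/search_sample/unknown.bin"));
    }
}
```

With such a file, supporting an extra MIME type reported by an old server
would be a one-line configuration change rather than a plugin edit.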

Doug

Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Ayyanar Inbamohan <te...@yahoo.com>.
Hi All,


As you all suggested:
1. I have added powerpoint to the MIME types,
2. I have also added the powerpoint plugin to the plugins list in
nutch-default.xml,
3. I have also added the content-type application/powerpoint to
plugin.xml,

but I am still getting the problem:


050908 105407 fetching http://localhost:8080/search_sample/kmportal3.ppt
050908 105407 fetching http://localhost:8080/search_sample/testpdf.pdf
050908 105407 fetching http://localhost:8080/search_sample/kmportal10.ppt
050908 105407 fetching http://localhost:8080/search_sample/kmportal2.ppt
050908 105407 fetching http://localhost:8080/search_sample/kmportal4.ppt
050908 105407 fetching http://localhost:8080/search_sample/kmportal6.ppt
050908 105407 fetching http://localhost:8080/search_sample/testexcel.xls
050908 105407 fetching http://localhost:8080/search_sample/javaCertStudyNotes.pdf
050908 105407 fetching http://localhost:8080/search_sample/kmportal7.ppt
050908 105408 fetching http://localhost:8080/search_sample/testdoc.doc
050908 105408 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal3.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050908 105408 fetching http://localhost:8080/search_sample/kmportal8.ppt
050908 105409 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal8.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050908 105409 fetching http://localhost:8080/search_sample/kmportal9.ppt
050908 105410 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal9.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050908 105410 fetching http://localhost:8080/search_sample/kmportal11.ppt
050908 105411 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal10.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050908 105411 fetching http://localhost:8080/search_sample/kmportal5.ppt
050908 105412 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal11.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050908 105413 fetching http://localhost:8080/search_sample/kmportal1.ppt
050908 105413 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal1.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050908 105415 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal5.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050908 105416 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal2.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050908 105417 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal4.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050908 105418 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal7.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint



thanks,
Ayyanar...

--- Jérôme Charron <je...@gmail.com> wrote:

> > 3. implement a catch-all plugin, which is
> equivalent to a Unix command
> > strings(1) (I have an implementation of that which
> I can contribute).
> > And turn it off/on in the config, if it's off,
> then the unknown content
> > is skipped and logged, if it's on - then make the
> best effort to extract
> > text.
> 
> Andrzej, I really like this solution... +1
> In such a case, other parse-plugin doesn't need
> anymore to check the 
> content-type: if they get some content, they assume
> it is of the good 
> content-type.
> 
> Regards
> 
> Jérôme
> 
> 
> -- 
> http://motrech.free.fr/
> http://www.frutch.org/
> 



Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Jérôme Charron <je...@gmail.com>.
> 3. implement a catch-all plugin, which is equivalent to a Unix command
> strings(1) (I have an implementation of that which I can contribute).
> And turn it off/on in the config, if it's off, then the unknown content
> is skipped and logged, if it's on - then make the best effort to extract
> text.

Andrzej, I really like this solution... +1
In such a case, the other parse plugins no longer need to check the 
content-type: if they get some content, they can assume it is of the right 
content-type.

Regards

Jérôme


-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jérôme Charron wrote:
>>I remember having played with that a wee bit, but the problem was that
>>the plugins themselves are riddled with pieces of code like the one
>>below, found in MSWordParser in release 0.7:
> 
> 
> Yes, it's true, each parse plugin checks in its code the content-type of the 
> provided content.
> As you notice it, there's a real synchro problem between the allowed 
> content-type specified in
> the plugin.xml file and the one checked within the code.
> I propose two solutions:
> 
> 1. No default behavior in the ParserFactory: ie if it doesn't found a 
> suitable plugin for a content-type, it must not parse the content (what is 
> exact behavior to have in such a case is to defined: throw an exception, 
> simply ignore the content....???)
> 
> 2. Provides in the plugin repository a way to retrieve the content-types 
> associated to a plugin: somethin like: 
> public static MimeType[] getAllowedMimeTypes(String pluginid);

Yes, that will definitely be needed sooner or later.

> 
> It's open for comments... and contributions too ;-)

3. implement a catch-all plugin, which is equivalent to a Unix command 
strings(1) (I have an implementation of that which I can contribute). 
And turn it off/on in the config, if it's off, then the unknown content 
is skipped and logged, if it's on - then make the best effort to extract 
text.
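
To make the strings(1) comparison concrete, here is a rough sketch of that
kind of best-effort extraction. It is not the implementation mentioned above;
the class name and the minimum run length are assumptions, and it only keeps
runs of printable ASCII, so text stored in other encodings would be missed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a strings(1)-like extractor, not a Nutch plugin.
class StringsSketch {
    // Collects runs of at least minLen printable ASCII bytes (tab included).
    static List<String> extract(byte[] data, int minLen) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF;
            boolean printable = (c >= 0x20 && c <= 0x7E) || c == '\t';
            if (printable) {
                run.append((char) c);
            } else {
                // A non-printable byte ends the current run.
                if (run.length() >= minLen) out.add(run.toString());
                run.setLength(0);
            }
        }
        if (run.length() >= minLen) out.add(run.toString());
        return out;
    }

    public static void main(String[] args) {
        byte[] data = {0, 1, 'H', 'e', 'l', 'l', 'o', 0, 'a', 'b', 0, 'W', 'o', 'r', 'l', 'd', '!'};
        System.out.println(extract(data, 4)); // prints [Hello, World!]
    }
}
```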


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Jérôme Charron <je...@gmail.com>.
> 
> I remember having played with that a wee bit, but the problem was that
> the plugins themselves are riddled with pieces of code like the one
> below, found in MSWordParser in release 0.7:

Yes, it's true, each parse plugin checks in its code the content-type of the 
provided content.
As you noticed, there's a real synchronization problem between the allowed 
content-type specified in
the plugin.xml file and the one checked within the code.
I propose two solutions:

1. No default behavior in the ParserFactory: i.e. if it doesn't find a 
suitable plugin for a content-type, it must not parse the content (the 
exact behavior in such a case remains to be defined: throw an exception, 
simply ignore the content...?)

2. Provide in the plugin repository a way to retrieve the content-types 
associated with a plugin, something like: 
public static MimeType[] getAllowedMimeTypes(String pluginid);

It's open for comments... and contributions too ;-)

> 2. Remember that powerpoint plugin is not part of the Nutch-0.7
> > release...
> Now, you'll have to find a better one than that, Jerome! :)

I would have tested
;-)

Regards 

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Sébastien LE CALLONNEC <sl...@yahoo.ie>.
--- Jérôme Charron <je...@gmail.com> a écrit :


> Yes, you are rigth, but my response was a short time solution.
> 1. A quick solution could be to checsk that a plugin can be
> associated to 
> many content-types (if so, there's just to add application/powerpoint
> in the 
> mspowerpoint plugin xml).


I remember having played with that a wee bit, but the problem was that
the plugins themselves are riddled with pieces of code like the one
below, found in MSWordParser in release 0.7:

  if (contentType != null && !contentType.startsWith("application/msword"))
    return new ParseStatus(ParseStatus.FAILED,
                           ParseStatus.FAILED_INVALID_FORMAT,
                           "Content-Type not application/msword: " + contentType)
        .getEmptyParse();

which means that, whatever you do, you're screwed - Excuse my French (:
- since the MIME type is hard-coded in the plugin.  It also means that
if you want to add a MIME type (say application/vnd.ms-word), you have
to edit the code.
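
One way around the hard-coded check, sketched here under assumptions, is to
test against a configurable set of accepted types. The class and method names
below are hypothetical and nothing here is taken from the Nutch code base;
like the snippet above, a missing content type is let through.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: accepted MIME types come from configuration, not code.
class MimeCheckSketch {
    private final Set<String> accepted;

    MimeCheckSketch(String... types) {
        accepted = new HashSet<>(Arrays.asList(types));
    }

    boolean accepts(String contentType) {
        if (contentType == null) return true; // same leniency as the quoted check
        // Drop parameters such as "; charset=..." before comparing.
        String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return accepted.contains(base);
    }

    public static void main(String[] args) {
        MimeCheckSketch ppt = new MimeCheckSketch(
            "application/vnd.ms-powerpoint", "application/powerpoint");
        System.out.println(ppt.accepts("application/powerpoint")); // true
        System.out.println(ppt.accepts("application/msword"));     // false
    }
}
```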


> 2. Remember that powerpoint plugin is not part of the Nutch-0.7
> release... 

Now, you'll have to find a better one than that, Jerome! :)


Slán agat,
Sebastien


	

	
		

Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Jérôme Charron <je...@gmail.com>.
> Is it not supposed to be the other way around, Nutch needing to be more
> complacent with old servers that return "application/powerpoint"? The
> thing is, there are some servers out there which _do_ return that MIME
> Type, and supposedly, one would want to index them as well... As we
> can't hack all those servers to fix the MIME Types, Nutch should IMHO
> accomodate those sites too.
> Or am I talking crap there?

Yes, you are right, but my response was a short-term solution.
1. A quick solution could be to check whether a plugin can be associated with 
several content-types (if so, one just has to add application/powerpoint to the 
mspowerpoint plugin XML).
2. Remember that the powerpoint plugin is not part of the Nutch-0.7 release... 

Regards

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Sébastien LE CALLONNEC <sl...@yahoo.ie>.
Hi there, 

Is it not supposed to be the other way around, with Nutch needing to be more
lenient with old servers that return "application/powerpoint"?  The
thing is, there are some servers out there which _do_ return that MIME
type, and presumably one would want to index them as well...  As we
can't hack all those servers to fix the MIME types, Nutch should IMHO
accommodate those sites too.

Or am I talking crap there?


Regards,
Sebastien

--- Jérôme Charron <je...@gmail.com> a écrit :

> It returns application/powerpoint for powerpoint content type, where
> as the 
> mspowerpoint plugin is called for content type 
> application/vnd.ms-powerpoint.
> Please, change your http server configuration and all should works
> fine.
> 
> Regards
> 
> Jérôme
> 



	

	
		

Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Jérôme Charron <je...@gmail.com>.
> 
> I have enabled the ppt extension from the
> crawl-urlfilter.txt, Now it is fetching the powerpoint
> files,
> But i am getting the following error, bcos ppt files
> content type is not taken by nutch..

Looking at the code, here is a copy of the comment in ParserFactory (the 
class that chooses which parser to use):
/*****
Content type has priority: the first plugin found whose
"contentType" attribute matches the beginning of the content's type is
used. If none match, then the first whose "pathSuffix" attribute matches
the end of the url's path is used. If neither of these match, then the
first plugin whose "pathSuffix" is the empty string is used.
*******/

In your case, it is the first plugin whose "pathSuffix" is the empty string 
that is used (more or less a random one).
I think your HTTP server is not configured correctly.
It returns application/powerpoint as the PowerPoint content type, whereas the 
mspowerpoint plugin is called for content type 
application/vnd.ms-powerpoint.
Please change your HTTP server configuration and everything should work fine.
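
The selection order described in that comment can be sketched roughly as
follows. This is an illustration, not the actual ParserFactory code: the Decl
class and the two sample declarations (including the empty pathSuffix on
parse-msword) are assumptions chosen to reproduce the error in the log.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the three-step selection order; not Nutch code.
class ParserSelectionSketch {
    // Stand-in for a plugin declaration; null means "attribute not declared".
    static final class Decl {
        final String id, contentType, pathSuffix;
        Decl(String id, String contentType, String pathSuffix) {
            this.id = id;
            this.contentType = contentType;
            this.pathSuffix = pathSuffix;
        }
    }

    // Assumed declarations, for illustration only.
    static List<Decl> demoPlugins() {
        return Arrays.asList(
            new Decl("parse-msword", "application/msword", ""),
            new Decl("parse-mspowerpoint", "application/vnd.ms-powerpoint", null));
    }

    static String select(List<Decl> plugins, String contentType, String urlPath) {
        // 1. First plugin whose contentType matches the beginning of the content's type.
        for (Decl p : plugins)
            if (p.contentType != null && contentType != null
                    && contentType.startsWith(p.contentType)) return p.id;
        // 2. First plugin whose non-empty pathSuffix matches the end of the URL path.
        for (Decl p : plugins)
            if (p.pathSuffix != null && !p.pathSuffix.isEmpty()
                    && urlPath.endsWith(p.pathSuffix)) return p.id;
        // 3. First plugin whose pathSuffix is the empty string.
        for (Decl p : plugins)
            if ("".equals(p.pathSuffix)) return p.id;
        return null;
    }

    public static void main(String[] args) {
        String path = "/search_sample/kmportal3.ppt";
        System.out.println(select(demoPlugins(), "application/powerpoint", path));
        System.out.println(select(demoPlugins(), "application/vnd.ms-powerpoint", path));
    }
}
```

Under these assumed declarations, application/powerpoint falls through to the
empty-suffix plugin, which would explain a "Content-Type not
application/msword" failure for a .ppt URL.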

Regards

Jérôme

Re: Re-indexing segments to add more field information

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hello,
I think it is enough to delete index.done file and index folder. I did 
it this way some time ago.
Regards
Piotr
Mike Berrow wrote:
> I would like to re-build the indexes I have in existing segments using 
> a custom index filter plug-in (adds more field information to assist with
> a custom sort).
> 
> Should that be just a matter of deleting the index.done files to let it
> go through?,  or is thre more to be done than that?
> Any known pitfalls?
> 
> Thanks much,
> 
> -- Mike Berrow
> 
> 
> 


Re-indexing segments to add more field information

Posted by Mike Berrow <mb...@pacbell.net>.
I would like to re-build the indexes I have in existing segments using 
a custom index filter plug-in (adds more field information to assist with
a custom sort).

Should that be just a matter of deleting the index.done files to let it
go through, or is there more to be done than that?
Any known pitfalls?

Thanks much,

-- Mike Berrow



Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Jérôme Charron <je...@gmail.com>.
> > I think you have activated the parse-mspowerpoint plugin, but not the
> > lib-jakarta-poi plugin.
> > Just activate the lib-jakarta-poi plugin and it must work.
> Thanks, that worked.


For information: 
If you download the latest code available in the trunk, the "manual" 
activation of the lib-jakarta-poi plugin
is no longer required (if the plugin.auto-activation property is set to 
true).
You just have to enable the plugins you want to use, and if they need some 
other plugins, those plugins
will be activated automatically.

Regards

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Renat Lumpau <rl...@gentoo.org>.
On Fri, Sep 09, 2005 at 11:50:37PM +0200, Jérôme Charron wrote:
> I think you have activated the parse-mspowerpoint plugin, but not the 
> lib-jakarta-poi plugin.
> Just activate the lib-jakarta-poi plugin and it must work.

Thanks, that worked.

-- Renat

Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Jérôme Charron <je...@gmail.com>.
> So in this case, the MIME type is correct, so the file should be passed
> to the parse-mspowerpoint plugin, but it's not. Now that the plugin has
> been committed, how do we actually make it work (yes, I've read 
> http://issues.apache.org/jira/browse/NUTCH-88 )?

I think you have activated the parse-mspowerpoint plugin, but not the 
lib-jakarta-poi plugin.
Just activate the lib-jakarta-poi plugin and it should work.
In fact, in your case, the parse-mspowerpoint plugin is not loaded because 
one of its dependencies is missing.
I am currently starting work on this issue, in order to auto-activate the 
plugins that are needed.
Since I have a lot of work at the moment, it should be available at the end 
of next week (I hope!!!)

Regards

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Renat Lumpau <rl...@gentoo.org>.
Folks,

I'm having a very similar problem with the latest svn version of Nutch
(Revision: 279844).

The crawler returns this message: fetch okay, but can't parse [URL
scrubbed]/pub/Presentation.ppt, reason: failed(2,203): Content-Type not
text/html: application/vnd.ms-powerpoint

So in this case, the MIME type is correct, so the file should be passed
to the parse-mspowerpoint plugin, but it's not. Now that the plugin has
been committed, how do we actually make it work (yes, I've read http://issues.apache.org/jira/browse/NUTCH-88 )?

Thanks,
Renat

Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Ayyanar Inbamohan <te...@yahoo.com>.
Hi Michael,

Same here, sorry for the delay; I was on leave yesterday.

I have added the plugins as follows:

<property>
  <name>plugin.includes</name>
 
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|msword|zip|mspowerpoint|msexcel)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin
directory names to
  include.  Any plugin not matching this expression is
excluded.  By
  default Nutch includes crawling just HTML and plain
text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
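
A quick way to see which plugin directories such a value selects is to run the
expression against candidate names. The sketch below assumes the expression
must match the whole directory name; that is an assumption about how Nutch
applies it, and the class name is hypothetical. Elsewhere in this thread a
missing lib-jakarta-poi plugin is named as a reason for parse-mspowerpoint not
loading, so that name is included in the check.

```java
import java.util.regex.Pattern;

// Hypothetical helper: checks plugin directory names against plugin.includes.
class PluginIncludesSketch {
    // The value from the configuration above, copied verbatim.
    static final String INCLUDES =
        "protocol-httpclient|urlfilter-regex|"
        + "parse-(text|html|js|pdf|msword|zip|mspowerpoint|msexcel)|"
        + "index-basic|query-(basic|site|url)";

    // Assumption: the whole directory name has to match the expression.
    static boolean included(String regex, String pluginDir) {
        return Pattern.compile(regex).matcher(pluginDir).matches();
    }

    public static void main(String[] args) {
        for (String dir : new String[] {
                "parse-mspowerpoint", "lib-jakarta-poi", "nutch-extensionpoints"}) {
            System.out.println(dir + " -> " + included(INCLUDES, dir));
        }
    }
}
```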



thanks,
Ayyanar

--- Michael Nebel <mi...@nebel.de> wrote:

> Hi Ayyanar,
> 
> sorry for the delay, but I've been out of office for
> some hours.
> 
> Have you activated the plugins? You need to extend
> the plugin.includes. 
> Mne look for example:
> 
> <property>
>    <name>plugin.includes</name>
> <value>nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf|rtf|rss|js|msexcel|mspowerpoint|zip)|index-(basic|more)|query-(basic|site|url)|language-identifier|clustering-carrot2</value>
>    <description>Regular expression naming plugin
> directory names to
>    include.  Any plugin not matching this expression
> is excluded.
>    In any case you need at least include the
> nutch-extensionpoints
>    plugin. By default Nutch includes crawling just
> HTML and plain text
>    via HTTP,  and basic indexing and search plugins.
> boost-urlpattern|
>    </description>
> </property>
> 
> Regards
> 
> 	Michael
> 
> 
> Ayyanar Inbamohan wrote:
> 
> > Hi Michael,
> > 
> > I have enabled the ppt extension from the
> > crawl-urlfilter.txt, Now it is fetching the
> powerpoint
> > files,
> > 
> > But i am getting the following error, bcos  ppt
> files
> > content type is not taken by nutch..
> > 
> > 
> > 
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/kmportal3.ppt
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/testpdf.pdf
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/kmportal10.ppt
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/testdoc.doc
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/kmportal2.ppt
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/kmportal4.ppt
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/kmportal6.ppt
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/testexcel.xls
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/javaCertStudyNotes.pdf
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/kmportal7.ppt
> > 050906 175342 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal3.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 050906 175342 fetching
> > http://localhost:8080/search_sample/kmportal8.ppt
> > 050906 175343 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal8.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 050906 175343 fetching
> > http://localhost:8080/search_sample/kmportal9.ppt
> > 050906 175344 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal9.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 050906 175344 fetching
> > http://localhost:8080/search_sample/kmportal11.ppt
> > 050906 175347 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal4.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 050906 175348 fetching
> > http://localhost:8080/search_sample/kmportal5.ppt
> > 050906 175348 fetching
> > http://localhost:8080/search_sample/kmportal1.ppt
> > 050906 175350 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal7.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 050906 175351 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal10.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 050906 175353 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal6.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 050906 175354 fetch okay, but can't parse
> > http://localhost:8080/search_sample/testexcel.xls,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/vnd.ms-excel
> > 050906 175355 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal11.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 050906 175356 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal5.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 050906 175358 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal1.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 050906 175359 fetch okay, but can't parse
> > http://localhost:8080/search_sample/kmportal2.ppt,
> > reason: failed(2,203): Content-Type not
> > application/msword: application/powerpoint
> > 
> > 
> > thanks,
> > Ayyanar..
> > 
> > --- Michael Nebel <mi...@nebel.de> wrote:
> > 
> > 
> >>Hi,
> >>
> >>have you checked the filters? (regex-urlfilter or
> >>crawl-urlfilter)? The 
> >>ending ".ppt" ist disabled by default.
> >>
> >>Regards
> >>
> >>	Michael
> >>
> >>Ayyanar Inbamohan wrote:
> >>
> >>
> >>>Hi all,
> >>>
> >>>I am using the powerpoint plugin from JIRA, and
> >>
> >>when i
> >>
> >>>crawl my application having link to the ppt,
> nutch
> >>
> >>7.0
> >>
> >>>is not at all fetching the powerpoint files.
> >>>
> >>>i am crawling my local appliation 
> >>>
> >>>http://localhost:8080/search_sample/index.html
> >>>
> >>>this url, i have given in the url.intranet, 
> >>>
> > >>>i gave some href to powerpoint file in index.html,
> > >>>
> > >>>and then started but it is not crawling
> >>>
> >>>
> >>>
> >>>Thanks in advance..
> >>>
> >>>thanks,
> >>>Ayyanar....
> >>>
> >>
> -- 
> Michael Nebel
> http://www.nebel.de/
> http://www.netluchs.de/
> 
> 



Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Michael Nebel <mi...@nebel.de>.
Hi Ayyanar,

sorry for the delay, but I've been out of office for some hours.

Have you activated the plugins? You need to extend plugin.includes.
Mine looks like this, for example:

<property>
   <name>plugin.includes</name>
   <value>nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf|rtf|rss|js|msexcel|mspowerpoint|zip)|index-(basic|more)|query-(basic|site|url)|language-identifier|clustering-carrot2</value>
   <description>Regular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.
   In any case you need at least include the nutch-extensionpoints
   plugin. By default Nutch includes crawling just HTML and plain text
   via HTTP,  and basic indexing and search plugins. boost-urlpattern|
   </description>
</property>
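
To see which plugin directories such a value would pick up, here is a rough Python sketch. It assumes the whole directory name has to match the expression, and the candidate names in the loop are only illustrative; Nutch itself does this check in Java, so treat it as an approximation rather than the actual implementation.

```python
import re

# The plugin.includes value from the property above, split for readability.
PLUGIN_INCLUDES = (
    r"nutch-extensionpoints|protocol-httpclient|urlfilter-regex|"
    r"parse-(text|html|msword|pdf|rtf|rss|js|msexcel|mspowerpoint|zip)|"
    r"index-(basic|more)|query-(basic|site|url)|"
    r"language-identifier|clustering-carrot2"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the whole plugin id matches the includes expression.

    Assumption: a full match is required, not a substring match.
    """
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    # Illustrative directory names; check your own plugins/ folder.
    for name in ["parse-mspowerpoint", "parse-msexcel", "parse-ext", "protocol-http"]:
        print(name, "included" if is_included(name) else "excluded")
```

If parse-mspowerpoint comes out as excluded for your own value, the parser is never loaded, however the URL filters are set.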

Regards

	Michael


Ayyanar Inbamohan wrote:

> Hi Michael,
> 
> I have enabled the ppt extension from the
> crawl-urlfilter.txt, Now it is fetching the powerpoint
> files,
> 
> But i am getting the following error, bcos  ppt files
> content type is not taken by nutch..
> 
> 
> 
> 050906 175342 fetching
> http://localhost:8080/search_sample/kmportal3.ppt
> 050906 175342 fetching
> http://localhost:8080/search_sample/testpdf.pdf
> 050906 175342 fetching
> http://localhost:8080/search_sample/kmportal10.ppt
> 050906 175342 fetching
> http://localhost:8080/search_sample/testdoc.doc
> 050906 175342 fetching
> http://localhost:8080/search_sample/kmportal2.ppt
> 050906 175342 fetching
> http://localhost:8080/search_sample/kmportal4.ppt
> 050906 175342 fetching
> http://localhost:8080/search_sample/kmportal6.ppt
> 050906 175342 fetching
> http://localhost:8080/search_sample/testexcel.xls
> 050906 175342 fetching
> http://localhost:8080/search_sample/javaCertStudyNotes.pdf
> 050906 175342 fetching
> http://localhost:8080/search_sample/kmportal7.ppt
> 050906 175342 fetch okay, but can't parse
> http://localhost:8080/search_sample/kmportal3.ppt,
> reason: failed(2,203): Content-Type not
> application/msword: application/powerpoint
> 050906 175342 fetching
> http://localhost:8080/search_sample/kmportal8.ppt
> 050906 175343 fetch okay, but can't parse
> http://localhost:8080/search_sample/kmportal8.ppt,
> reason: failed(2,203): Content-Type not
> application/msword: application/powerpoint
> 050906 175343 fetching
> http://localhost:8080/search_sample/kmportal9.ppt
> 050906 175344 fetch okay, but can't parse
> http://localhost:8080/search_sample/kmportal9.ppt,
> reason: failed(2,203): Content-Type not
> application/msword: application/powerpoint
> 050906 175344 fetching
> http://localhost:8080/search_sample/kmportal11.ppt
> 050906 175347 fetch okay, but can't parse
> http://localhost:8080/search_sample/kmportal4.ppt,
> reason: failed(2,203): Content-Type not
> application/msword: application/powerpoint
> 050906 175348 fetching
> http://localhost:8080/search_sample/kmportal5.ppt
> 050906 175348 fetching
> http://localhost:8080/search_sample/kmportal1.ppt
> 050906 175350 fetch okay, but can't parse
> http://localhost:8080/search_sample/kmportal7.ppt,
> reason: failed(2,203): Content-Type not
> application/msword: application/powerpoint
> 050906 175351 fetch okay, but can't parse
> http://localhost:8080/search_sample/kmportal10.ppt,
> reason: failed(2,203): Content-Type not
> application/msword: application/powerpoint
> 050906 175353 fetch okay, but can't parse
> http://localhost:8080/search_sample/kmportal6.ppt,
> reason: failed(2,203): Content-Type not
> application/msword: application/powerpoint
> 050906 175354 fetch okay, but can't parse
> http://localhost:8080/search_sample/testexcel.xls,
> reason: failed(2,203): Content-Type not
> application/msword: application/vnd.ms-excel
> 050906 175355 fetch okay, but can't parse
> http://localhost:8080/search_sample/kmportal11.ppt,
> reason: failed(2,203): Content-Type not
> application/msword: application/powerpoint
> 050906 175356 fetch okay, but can't parse
> http://localhost:8080/search_sample/kmportal5.ppt,
> reason: failed(2,203): Content-Type not
> application/msword: application/powerpoint
> 050906 175358 fetch okay, but can't parse
> http://localhost:8080/search_sample/kmportal1.ppt,
> reason: failed(2,203): Content-Type not
> application/msword: application/powerpoint
> 050906 175359 fetch okay, but can't parse
> http://localhost:8080/search_sample/kmportal2.ppt,
> reason: failed(2,203): Content-Type not
> application/msword: application/powerpoint
> 
> 
> thanks,
> Ayyanar..
> 
> --- Michael Nebel <mi...@nebel.de> wrote:
> 
> 
>>Hi,
>>
>>have you checked the filters? (regex-urlfilter or
>>crawl-urlfilter)? The 
>>ending ".ppt" is disabled by default.
>>
>>Regards
>>
>>	Michael
>>
>>Ayyanar Inbamohan wrote:
>>
>>
>>>Hi all,
>>>
>>>I am using the powerpoint plugin from JIRA, and when i
>>>crawl my application having link to the ppt, nutch 7.0
>>>is not at all fetching the powerpoint files.
>>>
>>>i am crawling my local appliation 
>>>
>>>http://localhost:8080/search_sample/index.html
>>>
>>>this url, i have given in the url.intranet, 
>>>
>>>i gave some href to powerpoint file in index.html,
>>>
>>>and then started but it is not crawling
>>>
>>>
>>>
>>>Thanks in advance..
>>>
>>>thanks,
>>>Ayyanar....
>>>
>>
-- 
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/


Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Ayyanar Inbamohan <te...@yahoo.com>.
Hi Michael,

I have enabled the ppt extension in
crawl-urlfilter.txt, and now it is fetching the PowerPoint
files.

But I am getting the following error, because the content type
of the ppt files is not accepted by Nutch:



050906 175342 fetching http://localhost:8080/search_sample/kmportal3.ppt
050906 175342 fetching http://localhost:8080/search_sample/testpdf.pdf
050906 175342 fetching http://localhost:8080/search_sample/kmportal10.ppt
050906 175342 fetching http://localhost:8080/search_sample/testdoc.doc
050906 175342 fetching http://localhost:8080/search_sample/kmportal2.ppt
050906 175342 fetching http://localhost:8080/search_sample/kmportal4.ppt
050906 175342 fetching http://localhost:8080/search_sample/kmportal6.ppt
050906 175342 fetching http://localhost:8080/search_sample/testexcel.xls
050906 175342 fetching http://localhost:8080/search_sample/javaCertStudyNotes.pdf
050906 175342 fetching http://localhost:8080/search_sample/kmportal7.ppt
050906 175342 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal3.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175342 fetching http://localhost:8080/search_sample/kmportal8.ppt
050906 175343 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal8.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175343 fetching http://localhost:8080/search_sample/kmportal9.ppt
050906 175344 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal9.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175344 fetching http://localhost:8080/search_sample/kmportal11.ppt
050906 175347 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal4.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175348 fetching http://localhost:8080/search_sample/kmportal5.ppt
050906 175348 fetching http://localhost:8080/search_sample/kmportal1.ppt
050906 175350 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal7.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175351 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal10.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175353 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal6.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175354 fetch okay, but can't parse http://localhost:8080/search_sample/testexcel.xls, reason: failed(2,203): Content-Type not application/msword: application/vnd.ms-excel
050906 175355 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal11.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175356 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal5.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175358 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal1.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint
050906 175359 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal2.ppt, reason: failed(2,203): Content-Type not application/msword: application/powerpoint


thanks,
Ayyanar..

--- Michael Nebel <mi...@nebel.de> wrote:

> Hi,
> 
> have you checked the filters? (regex-urlfilter or
> crawl-urlfilter)? The 
> ending ".ppt" is disabled by default.
> 
> Regards
> 
> 	Michael
> 
> Ayyanar Inbamohan wrote:
> 
> > Hi all,
> > 
> > I am using the powerpoint plugin from JIRA, and when i
> > crawl my application having link to the ppt, nutch 7.0
> > is not at all fetching the powerpoint files.
> > 
> > i am crawling my local appliation 
> > 
> > http://localhost:8080/search_sample/index.html
> > 
> > this url, i have given in the url.intranet, 
> > 
> > i gave some href to powerpoint file in index.html,
> > 
> > and then started but it is not crawling
> > 
> > 
> > 
> > Thanks in advance..
> > 
> > thanks,
> > Ayyanar....
> > 
> 
> -- 
> Michael Nebel
> http://www.nebel.de/
> http://www.netluchs.de/
> 
> 



	
		

Re: nutch 7.0 not fetching powerpoint, plugin is present

Posted by Michael Nebel <mi...@nebel.de>.
Hi,

have you checked the filters? (regex-urlfilter or crawl-urlfilter)? The 
ending ".ppt" is disabled by default.
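
As a rough illustration of how such a filter file behaves, here is a small Python sketch. It assumes the usual convention that the first matching rule decides ("+" accepts, "-" rejects) and that a URL matching no rule is dropped. The rules themselves are hypothetical and shortened, not a copy of the shipped crawl-urlfilter.txt, so compare them with your own file.

```python
import re

# Hypothetical rules in the style of crawl-urlfilter.txt (shortened list).
RULES = [
    r"-\.(gif|jpg|zip|ppt|xls|exe)$",  # skip URLs ending in these suffixes
    r"+^http://localhost:8080/",       # accept the local application
    r"-.",                             # skip everything else
]


def accepts(url, rules=RULES):
    """Apply the rules in order; the first pattern found in the URL decides."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched, so the URL is dropped


if __name__ == "__main__":
    ppt = "http://localhost:8080/search_sample/kmportal3.ppt"
    print(accepts(ppt))  # rejected while "ppt" is in the suffix rule

    # Taking "ppt" out of the suffix rule lets the same URL through.
    without_ppt = [r"-\.(gif|jpg|zip|xls|exe)$"] + RULES[1:]
    print(accepts(ppt, without_ppt))
```

So with a rule like the first one in place, the links in index.html are found but the .ppt URLs are thrown away before fetching.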

Regards

	Michael

Ayyanar Inbamohan wrote:

> Hi all,
> 
> I am using the powerpoint plugin from JIRA, and when i
> crawl my application having link to the ppt, nutch 7.0
> is not at all fetching the powerpoint files.
> 
> i am crawling my local appliation 
> 
> http://localhost:8080/search_sample/index.html
> 
> this url, i have given in the url.intranet, 
> 
> i gave some href to powerpoint file in index.html, 
> 
> and then started but it is not crawling
> 
> 
> 
> Thanks in advance..
> 
> thanks,
> Ayyanar....
> 

-- 
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/