Posted to user@nutch.apache.org by Joe Bell <jo...@prodeasystems.com> on 2009/12/09 17:27:32 UTC

Nutch 1.0 and Office 2007 documents

Hi,

 

I'm also curious as to whether anyone has had success with Nutch and
parsing Office 2007 documents (.pptx, .xlsx, .docx) - I get the same
errors as seen here -
http://old.nabble.com/How-to-successfully-crawl-and-index-office-2007-documents-in-Nutch-1.0-td26640949.html#a26640949

 

Is a separate plugin required to parse these documents (i.e., will
parse-msexcel, parse-mspowerpoint, etc. *not* work)?

 

I noticed the comment on the above thread ("docx should be parsed,A
plugin can be used to Parsed docx file. you get some help info from
parse-html plugin and so on.") but didn't find it really helpful.

 

Regards,

Joe




This message is confidential to Prodea Systems, Inc unless otherwise indicated 
or apparent from its nature. This message is directed to the intended recipient 
only, who may be readily determined by the sender of this message and its 
contents. If the reader of this message is not the intended recipient, or an 
employee or agent responsible for delivering this message to the intended 
recipient:(a)any dissemination or copying of this message is strictly 
prohibited; and(b)immediately notify the sender by return message and destroy 
any copies of this message in any form(electronic, paper or otherwise) that you 
have.The delivery of this message and its information is neither intended to be 
nor constitutes a disclosure or waiver of any trade secrets, intellectual 
property, attorney work product, or attorney-client communications. The 
authority of the individual sending this message to legally bind Prodea Systems  
is neither apparent nor implied,and must be independently verified.

Re: Nutch 1.0 and Office 2007 documents

Posted by Julien Nioche <li...@gmail.com>.
I have created a page, http://wiki.apache.org/nutch/TikaPlugin; feel free to
use it for your how-to.

J.

2009/12/14 Julien Nioche <li...@gmail.com>

>> If I manage to put it to work I will write here a mini how-to.
>
> The Nutch Wiki would be the right place for doing that. It would be nice to
> have a page there listing the differences between the capabilities of the
> Tika plugin and the existing Nutch parsing plugins as there might be
> differences between them (support for Office 2007 being potentially one of
> them)
>
> Note that the Tika plugin is VERY beta
>
> Julien
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com


-- 
DigitalPebble Ltd
http://www.digitalpebble.com

Re: Nutch 1.0 and Office 2007 documents

Posted by Julien Nioche <li...@gmail.com>.
>
>  If I manage to put it to work I will write here a mini how-to.
>

The Nutch Wiki would be the right place for doing that. It would be nice to
have a page there listing the differences between the capabilities of the
Tika plugin and the existing Nutch parsing plugins as there might be
differences between them (support for Office 2007 being potentially one of
them)

Note that the Tika plugin is VERY beta

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2009/12/14 Adilson Oliveira Cruz <ad...@gmail.com>

>  Hi,
>
>  Thanks for the reply. I will try to use Tika with Nutch to parse the
> documents. My current Nutch setup is working quite nice and I don't want to
> configure another Nutch instance.
>
>  If I manage to put it to work I will write here a mini how-to.
>
>  Best,
>
>  Adilson

Re: Nutch 1.0 and Office 2007 documents

Posted by Adilson Oliveira Cruz <ad...@gmail.com>.
 Hi,

 Thanks for the reply. I will try to use Tika with Nutch to parse the
documents. My current Nutch setup is working quite nicely and I don't want to
configure another Nutch instance.

 If I manage to get it to work, I will write a mini how-to here.

 Best,

 Adilson

On Mon, Dec 14, 2009 at 10:00 AM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> Hi,
>
> There is a Tika plugin in JIRA (
> https://issues.apache.org/jira/browse/NUTCH-766). According to Tika's page
> the support for the Office 2007 was imminent in POI (which Tika uses
> internally). The plan for Nutch is to progressively delegate the parsing to
> Tika; Nutch-766 has been implemented for this. I haven't checked whether
> Tika currently supports Office 2007 but I suggest that you try parsing docs
> at this format with Tika, if it does work then you'll get that automatically
> via Nutch-766
>
> Makes sense?
>
> Julien
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com

Re: Nutch 1.0 and Office 2007 documents

Posted by Julien Nioche <li...@gmail.com>.
Hi,

There is a Tika plugin in JIRA
(https://issues.apache.org/jira/browse/NUTCH-766). According to Tika's page,
support for Office 2007 was imminent in POI (which Tika uses internally). The
plan for Nutch is to progressively delegate parsing to Tika; NUTCH-766 has
been implemented for this. I haven't checked whether Tika currently supports
Office 2007, but I suggest that you try parsing docs in this format with Tika.
If it works, you'll get that automatically via NUTCH-766.
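
When trying a parser on these files, it can help to know what text should come out. The sketch below is not Tika and not the NUTCH-766 plugin. It is a stand-in that uses only the Python standard library to pull paragraph text out of a .docx, assuming the main body lives in word/document.xml. The function names and the sample document are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by .docx body parts.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(data: bytes) -> str:
    """Return the paragraph text of a .docx given as raw bytes, one line per paragraph."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # Text is stored in w:t elements nested inside the runs of a paragraph.
        runs = [t.text or "" for t in p.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_sample_docx(paragraphs) -> bytes:
    """Build a minimal in-memory .docx-like package (demo only, not a full document)."""
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", xml)
    return buf.getvalue()


if __name__ == "__main__":
    print(docx_text(make_sample_docx(["Hello Nutch", "Office 2007"])))
```

Comparing this baseline with what a parser returns for the same file shows quickly whether Office 2007 content is actually being extracted.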

Makes sense?

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2009/12/14 Adilson Oliveira Cruz <ad...@gmail.com>

>  Hi all,
>
>  Anyone successfully used nutch to index Office 2007 documents? I know that
> this question has already been asked, but considering the number of e-mails
> asking the same question, looks like that Nutch does not support Office 2007
> documents.
>
>  Best,
>
>  Adilson

Re: Nutch 1.0 and Office 2007 documents

Posted by Adilson Oliveira Cruz <ad...@gmail.com>.
 Hi all,

 Has anyone successfully used Nutch to index Office 2007 documents? I know that
this question has already been asked, but considering the number of e-mails
asking the same question, it looks like Nutch does not support Office 2007
documents.

 Best,

 Adilson
