Posted to user@nutch.apache.org by ahammad <ah...@gmail.com> on 2009/01/28 22:44:33 UTC

Indexing msword document properties

I have successfully gotten Nutch to index msword documents. If you go under
File>Properties, and under the "Custom" tab in MS Word, you can add some
properties to the file, sort of like HTML meta tags.

I have the msword parser, index-more and query-more plugins, as well as a
custom meta tag indexer/filter installed. My question is can Nutch read
document properties like the ones I described? Does it have the ability to
go that far in the document to extract the custom user-defined properties?

If so, has anybody successfully implemented this? If not, I would imagine
that we need to modify the index-more/query-more plugins to do that. Can
someone confirm this?

Anyone know of a good place to start looking? Any help will be appreciated.

Cheers.

-- 
View this message in context: http://www.nabble.com/Indexing-msword-document-properties-tp21715700p21715700.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Restarting Nutch

Posted by Hrishikesh Agashe <hr...@persistent.co.in>.
Hi,

I am planning to do a huge crawl using Nutch (billions of URLs) and so need
to understand whether Nutch can handle restarts after a crash.

On a single system, if I press Ctrl+C while Nutch is running and then restart
it, will Nutch be able to detect how far it got in the last run and resume
from that point? Or will it be treated as a fresh crawl?

Also, if I have 5 nodes running Nutch and doing the crawling and one of the
nodes fails, should that be considered a total failure of Nutch itself? Or
should I allow the other nodes to proceed? Will I lose the data gathered by
the failed node?

TIA,
--Hrishi



Re: Indexing msword document properties

Posted by Antony Bowesman <ad...@teamware.com>.
> Seems like my previous message never went through.
> 
> The Nutch msword parser does index _some_ metadata. If you go into
> File>Properties and under the Summary tab (in Microsoft Word), that metadata
> is indexed (like author, company etc.). However, you can add custom
> properties (File>Properties under the Custom tab) to any Word document. That
> metadata is not indexed.
> 
> As an example, I have a set of files that have some information relating to
> product types. In those files, there is a custom property called
> productType, which can contain values like fax, printer, monitor etc.
> 
> What I want to do is index those files so that I can search on the product
> type. For instance, if I search for "canon +productType:printer", I'll get
> only the documents that have to do with Canon printers. I already have a
> query filter in place that can do that, it's just a matter of getting the
> productType custom property into the index.
> 
> The POI parser that you wrote, does it have the ability to parse custom
> properties from Microsoft Word documents?

It didn't, but I just added it - it was trivial.  I'm using POI 3.5 and my 
parser is doing something like

     byte[] raw = content.getContent();
     // Let POI pick the extractor that matches the document format
     POITextExtractor extractor =
         ExtractorFactory.createExtractor(new ByteArrayInputStream(raw));
     text = extractor.getText();
     if (extractor instanceof POIOLE2TextExtractor)
     {
         // Binary OLE2 formats (.doc, .xls, .ppt)
         properties = getOLE2MetaData((POIOLE2TextExtractor)extractor);
     }
     else if (extractor instanceof POIXMLTextExtractor)
     {
         // Office 2007 OOXML formats (.docx, .xlsx, .pptx)
         properties = getXMLMetaData((POIXMLTextExtractor)extractor);
     }

I just tried getting custom properties from the OLE2 text extractor, which is 
based on the MSExtractor implementation:

     private Properties getOLE2MetaData(POIOLE2TextExtractor extractor)
     {
         Properties props = new Properties();
         SummaryInformation si = extractor.getSummaryInformation();
...
         DocumentSummaryInformation dsi = extractor.getDocSummaryInformation();
         // Either object may be null when the document carries no such data
         CustomProperties cp = (dsi == null) ? null : dsi.getCustomProperties();
         if (cp != null)
         {
             Iterator i = cp.keySet().iterator();
             while (i.hasNext())
             {
                 String name = (String)i.next();
                 Object value = cp.get(name);
                 if (value != null)
                 {
                     setProperty(props, name, value.toString());
                 }
             }
         }
         return props;
     }

This works nicely.  I didn't try the XML variant, but I guess that would be 
pretty similar.
Antony





Re: Indexing msword document properties

Posted by ahammad <ah...@gmail.com>.
Seems like my previous message never went through.

The Nutch msword parser does index _some_ metadata. If you go into
File>Properties and under the Summary tab (in Microsoft Word), that metadata
is indexed (like author, company etc.). However, you can add custom
properties (File>Properties under the Custom tab) to any Word document. That
metadata is not indexed.

As an example, I have a set of files that have some information relating to
product types. In those files, there is a custom property called
productType, which can contain values like fax, printer, monitor etc.

What I want to do is index those files so that I can search on the product
type. For instance, if I search for "canon +productType:printer", I'll get
only the documents that have to do with Canon printers. I already have a
query filter in place that can do that, it's just a matter of getting the
productType custom property into the index.
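
To make the intended behaviour concrete, here is a rough, self-contained
sketch of how a query such as "canon +productType:printer" could be matched
once productType is available as a field. It does not use the Nutch or Lucene
query APIs. The class FieldQuerySketch and its matches method are hypothetical
names, the matching is a plain case-insensitive substring/equality test, and
every clause is treated as required, which is a simplification.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: not Nutch's query filter API.
public class FieldQuerySketch {

    /** Returns true if the document satisfies every clause of the query. */
    public static boolean matches(String query, String text, Map<String, String> fields) {
        String lowerText = text.toLowerCase(Locale.ROOT);
        for (String clause : query.trim().split("\\s+")) {
            // A leading '+' marks a required clause; all clauses are required here.
            if (clause.startsWith("+")) {
                clause = clause.substring(1);
            }
            if (clause.isEmpty()) {
                continue;
            }
            int colon = clause.indexOf(':');
            if (colon > 0) {
                // field:value clause, checked against the indexed fields
                String field = clause.substring(0, colon);
                String wanted = clause.substring(colon + 1);
                String actual = fields.get(field);
                if (actual == null || !actual.equalsIgnoreCase(wanted)) {
                    return false;
                }
            } else if (!lowerText.contains(clause.toLowerCase(Locale.ROOT))) {
                // bare term, checked against the document text
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new HashMap<>();
        fields.put("productType", "printer");
        String query = "canon +productType:printer";

        System.out.println(matches(query, "Canon PIXMA setup guide", fields)); // true
        System.out.println(matches(query, "Epson setup guide", fields));       // false

        fields.put("productType", "fax");
        System.out.println(matches(query, "Canon PIXMA setup guide", fields)); // false
    }
}
```

The point of the sketch is only that the field clause can do nothing until the
custom property has actually been written into the index as its own field.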

The POI parser that you wrote, does it have the ability to parse custom
properties from Microsoft Word documents?

Thank you for your reply.

Cheers




-- 
View this message in context: http://www.nabble.com/Indexing-msword-document-properties-tp21715700p21832075.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Indexing msword document properties

Posted by Antony Bowesman <ad...@teamware.com>.
Nutch 0.9 already extracts the properties in MSExtractor.java and MSBaseParser 
puts them into the MetaData class.

I'm not using Nutch in its entirety, only the parsing framework, but I am 
indexing the document properties quite happily from MS documents.  I also wrote 
a new parser for Office 2007, using POI 3.5 and that is also getting the 
properties in a similar way.  Is the problem at a higher level in that Nutch is 
not indexing the MetaData?

Antony




Doğacan Güney wrote:
> On Fri, Jan 30, 2009 at 9:15 PM, ahammad <ah...@gmail.com> wrote:
>> Hello,
>>
>> I've been looking further into this and it seems like the only way to do it
>> is to modify the msword parser so that it reads in the custom properties
>> information. I'm attempting this, but so far I haven't been successful.
>>
>> The classes that I found that may be useful are
>> org.apache.poi.hpsf.DocumentSummaryInformation and
>> org.apache.poi.hpsf.CustomProperties. Not sure if there are other things
>> that I need.
>>
>> I'm currently trying to modify MSExtractor.java and MSBaseParser.java in the
>> lib-parsems plugin. Am I proceeding correctly with this or am I just wasting
>> my time?
>>
>> Does anybody have any other suggestions? This seems like it'll be a lot of work
>> with a very small chance of success. Any alternative methods would be nice.
>>
> 
> No, you are doing the right thing. Alternatively, if you know of a good Java
> library for extracting the information you are looking for, you can write
> your own parse-ms plugin as well.
> 
> Extract any metadata you want and put it in the parse data metadata. You can
> then read it during indexing and add it to your index.
> 
>> Thanks a lot.
>>
>> Cheers



Re: Indexing msword document properties

Posted by Doğacan Güney <do...@gmail.com>.
On Fri, Jan 30, 2009 at 9:15 PM, ahammad <ah...@gmail.com> wrote:
>
> Hello,
>
> I've been looking further into this and it seems like the only way to do it
> is to modify the msword parser so that it reads in the custom properties
> information. I'm attempting this, but so far I haven't been successful.
>
> The classes that I found that may be useful are
> org.apache.poi.hpsf.DocumentSummaryInformation and
> org.apache.poi.hpsf.CustomProperties. Not sure if there are other things
> that I need.
>
> I'm currently trying to modify MSExtractor.java and MSBaseParser.java in the
> lib-parsems plugin. Am I proceeding correctly with this or am I just wasting
> my time?
>
> Does anybody have any other suggestions? This seems like it'll be a lot of work
> with a very small chance of success. Any alternative methods would be nice.
>

No, you are doing the right thing. Alternatively, if you know of a good Java
library for extracting the information you are looking for, you can write your
own parse-ms plugin as well.

Extract any metadata you want and put it in the parse data metadata. You can
then read it during indexing and add it to your index.
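
The two-step flow described above (the parser stores metadata, the indexer
copies it into the index) can be sketched roughly as follows. This is a
stand-in, not the Nutch API: plain java.util maps play the role of the parse
data metadata and of the index document, and the class and method names
(CustomPropertyIndexingSketch, addToParseMetadata, toIndexFields) are
hypothetical. In a real setup the second step would presumably live in an
indexing filter plugin.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in: plain maps replace Nutch's metadata and document types.
public class CustomPropertyIndexingSketch {

    /** Parse step: copy the extracted custom properties into the parse metadata. */
    public static void addToParseMetadata(Map<String, String> parseMeta,
                                          Map<String, Object> customProps) {
        for (Map.Entry<String, Object> e : customProps.entrySet()) {
            if (e.getValue() != null) {
                parseMeta.put(e.getKey(), e.getValue().toString());
            }
        }
    }

    /** Index step: copy only the wanted metadata names into the index document. */
    public static Map<String, String> toIndexFields(Map<String, String> parseMeta,
                                                    Set<String> wanted) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (String name : wanted) {
            String value = parseMeta.get(name);
            if (value != null) {
                doc.put(name, value);
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        // Values as they might come out of a Word document's Custom tab
        Map<String, Object> custom = new LinkedHashMap<>();
        custom.put("productType", "printer");
        custom.put("internalCode", 42);

        Map<String, String> parseMeta = new LinkedHashMap<>();
        addToParseMetadata(parseMeta, custom);

        Map<String, String> doc =
            toIndexFields(parseMeta, Collections.singleton("productType"));
        System.out.println(doc); // {productType=printer}
    }
}
```

Restricting the index step to an explicit set of names is a design choice of
the sketch: it keeps arbitrary user-defined properties from turning into index
fields unless someone asked for them.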

> Thanks a lot.
>
> Cheers



-- 
Doğacan Güney

Re: Indexing msword document properties

Posted by ahammad <ah...@gmail.com>.
Hello,

I've been looking further into this and it seems like the only way to do it
is to modify the msword parser so that it reads in the custom properties
information. I'm attempting this, but so far I haven't been successful.

The classes that I found that may be useful are
org.apache.poi.hpsf.DocumentSummaryInformation and
org.apache.poi.hpsf.CustomProperties. Not sure if there are other things
that I need.

I'm currently trying to modify MSExtractor.java and MSBaseParser.java in the
lib-parsems plugin. Am I proceeding correctly with this or am I just wasting
my time?

Does anybody have any other suggestions? This seems like it'll be a lot of work
with a very small chance of success. Any alternative methods would be nice.

Thanks a lot.

Cheers




-- 
View this message in context: http://www.nabble.com/Indexing-msword-document-properties-tp21715700p21753762.html
Sent from the Nutch - User mailing list archive at Nabble.com.