Posted to user@nutch.apache.org by Antony Bowesman <ad...@teamware.com> on 2009/02/03 23:04:15 UTC

Re: Indexing msword document properties

Nutch 0.9 already extracts the properties in MSExtractor.java and MSBaseParser 
puts them into the MetaData class.

I'm not using Nutch in its entirety, only the parsing framework, but I am 
indexing the document properties quite happily from MS documents.  I also wrote 
a new parser for Office 2007, using POI 3.5 and that is also getting the 
properties in a similar way.  Is the problem at a higher level in that Nutch is 
not indexing the MetaData?

Antony




Doğacan Güney wrote:
> On Fri, Jan 30, 2009 at 9:15 PM, ahammad <ah...@gmail.com> wrote:
>> Hello,
>>
>> I've been looking further into this and it seems like the only way to do it
>> is to modify the msword parser so that it reads in the custom properties
>> information. I'm attempting this but so far, I wasn't successful.
>>
>> The classes that I found that may be useful are
>> org.apache.poi.hpsf.DocumentSummaryInformation and
>> org.apache.poi.hpsf.CustomProperties. Not sure if there are other things
>> that I need.
>>
>> I'm currently trying to modify MSExtractor.java and MSBaseParser.java in the
>> lib-parsems plugin. Am I proceeding correctly with this or am I just wasting
>> my time?
>>
>> Does anybody have any other suggestions? This seems like it'll be a lot of
>> work with a very small chance of success. Any alternative methods would be
>> nice.
>>
> 
> No, you are doing the right thing. Alternatively, if you know of a good Java
> library for extracting the information you are looking for, you can write
> your own parse-ms plugin as well.
> 
> Extract any metadata you want and put it in the parse data metadata. You can
> then read it during indexing and add it to your index.
> 
>> Thanks a lot.
>>
>> Cheers
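
The flow described in the quoted advice (extract metadata at parse time, then
read it back at indexing time) can be sketched roughly as follows. This is a
stand-alone illustration only: plain maps stand in for the Nutch parse metadata
and the index document, and the class name, method name, and whitelist are all
hypothetical, not Nutch API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetadataIndexingSketch {

    // Hypothetical whitelist: metadata names that should become index fields.
    static final List<String> INDEXED_PROPERTIES = Arrays.asList("productType");

    /**
     * Indexing stage: copy whitelisted entries from the metadata collected
     * at parse time into the fields of the document being indexed.
     */
    static Map<String, String> toIndexFields(Map<String, String> parseMetadata) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String name : INDEXED_PROPERTIES) {
            String value = parseMetadata.get(name);
            if (value != null) {
                // Only properties present in the document become fields.
                fields.put(name, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Parse stage (assumed): the parser stored these properties.
        Map<String, String> parseMetadata = new LinkedHashMap<>();
        parseMetadata.put("author", "someone");
        parseMetadata.put("productType", "printer");

        System.out.println(toIndexFields(parseMetadata));
    }
}
```

In a real plugin the second step would presumably live in an indexing filter,
with the whitelist taken from configuration rather than hard-coded.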



Restarting Nutch

Posted by Hrishikesh Agashe <hr...@persistent.co.in>.
Hi,

I am planning to do a huge crawl using Nutch (billions of URLs) and so need
to understand whether Nutch can handle restarts after a crash.

For a single system, if I press Ctrl+C while Nutch is running and then restart
it, can Nutch detect how far it got in the last run and resume from that
point? Or will it be treated as a fresh crawl?

Also, if I have 5 nodes running Nutch and crawling, and one of the nodes
fails, should that be considered a total failure of Nutch itself? Or should I
allow the other nodes to proceed? Will I lose the data gathered by the failed
node?

TIA,
--Hrishi



Re: Indexing msword document properties

Posted by Antony Bowesman <ad...@teamware.com>.
> Seems like my previous message never went through.
> 
> The Nutch msword parser does index _some_ metadata. If you go into
> File>Properties and under the Summary tab (in Microsoft Word), that metadata
> is indexed (like author, company etc.). However, you can add custom
> properties (File>Properties under the Custom tab) to any Word document. That
> metadata is not indexed.
> 
> As an example, I have a set of files that have some information relating to
> product types. In those files, there is a custom property called
> productType, which can contain values like fax, printer, monitor etc.
> 
> What I want is to index those files so that I can search on the product
> type. For instance, if I put "canon +productType:printer", I'll get only the
> documents that have to do with Canon printers. I already have a query filter
> in place that can do that; it's just a matter of getting the productType
> custom property into the index.
> 
> The POI parser that you wrote, does it have the ability to parse custom
> properties from Microsoft Word documents?

It didn't, but I just added it - it was trivial.  I'm using POI 3.5 and my 
parser is doing something like

     // Let POI pick the extractor that matches the document format.
     byte[] raw = content.getContent();
     POITextExtractor extractor =
         ExtractorFactory.createExtractor(new ByteArrayInputStream(raw));
     text = extractor.getText();
     if (extractor instanceof POIOLE2TextExtractor)
     {
         // Binary (OLE2) formats such as .doc
         properties = getOLE2MetaData((POIOLE2TextExtractor)extractor);
     }
     else if (extractor instanceof POIXMLTextExtractor)
     {
         // Office 2007 (OOXML) formats such as .docx
         properties = getXMLMetaData((POIXMLTextExtractor)extractor);
     }

I just tried getting custom properties from the OLE2 text extractor, which is 
based on the MSExtractor implementation

     private Properties getOLE2MetaData(POIOLE2TextExtractor extractor)
     {
         Properties props = new Properties();
         SummaryInformation si = extractor.getSummaryInformation();
...
         DocumentSummaryInformation dsi = extractor.getDocSummaryInformation();
         // Either may be null when the document has no custom properties.
         CustomProperties cp = (dsi != null) ? dsi.getCustomProperties() : null;
         if (cp != null)
         {
             Iterator i = cp.keySet().iterator();
             while (i.hasNext())
             {
                 String name = (String)i.next();
                 Object value = cp.get(name);
                 if (value != null)
                 {
                     setProperty(props, name, value.toString());
                 }
             }
         }
         return props;
     }

This works nicely.  I didn't try the XML variant, but I guess that would be 
pretty similar.
Antony
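
The copy loop above calls toString() on every value. Custom property values
are not always strings (they can also be booleans, numbers, or dates), so a
variant that normalises values before indexing might look like the sketch
below. It is an illustration under stated assumptions: a plain Map stands in
for POI's CustomProperties, the class and method names are hypothetical, and
the ISO 8601 date format is just one possible choice.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;
import java.util.TimeZone;

public class CustomPropertyCopySketch {

    /** Copy custom properties into Properties, skipping null names and values. */
    static Properties copy(Map<String, Object> custom) {
        Properties props = new Properties();
        if (custom == null) {
            return props; // document without custom properties
        }
        for (Map.Entry<String, Object> entry : custom.entrySet()) {
            if (entry.getKey() != null && entry.getValue() != null) {
                props.setProperty(entry.getKey(), asIndexString(entry.getValue()));
            }
        }
        return props;
    }

    /** Dates become UTC ISO 8601 strings; everything else uses toString(). */
    static String asIndexString(Object value) {
        if (value instanceof Date) {
            SimpleDateFormat format =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
            format.setTimeZone(TimeZone.getTimeZone("UTC"));
            return format.format((Date) value);
        }
        return value.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the custom properties read from a Word document.
        Map<String, Object> custom = new LinkedHashMap<>();
        custom.put("productType", "printer");
        custom.put("reviewed", Boolean.TRUE);
        custom.put("released", new Date(0L));

        System.out.println(copy(custom));
    }
}
```

Keeping the conversion in one helper makes it easier to change how a given
value type is indexed without touching the extraction loop.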





Re: Indexing msword document properties

Posted by ahammad <ah...@gmail.com>.
Seems like my previous message never went through.

The Nutch msword parser does index _some_ metadata. If you go into
File>Properties and under the Summary tab (in Microsoft Word), that metadata
is indexed (like author, company etc.). However, you can add custom
properties (File>Properties under the Custom tab) to any Word document. That
metadata is not indexed.

As an example, I have a set of files that have some information relating to
product types. In those files, there is a custom property called
productType, which can contain values like fax, printer, monitor etc.

What I want is to index those files so that I can search on the product
type. For instance, if I put "canon +productType:printer", I'll get only the
documents that have to do with Canon printers. I already have a query filter
in place that can do that; it's just a matter of getting the productType
custom property into the index.

The POI parser that you wrote, does it have the ability to parse custom
properties from Microsoft Word documents?

Thank you for your reply.

Cheers



