Posted to user@nutch.apache.org by David Spencer <da...@tropo.com> on 2005/05/06 19:29:57 UTC
Interesting use case for "numeric synonyms"

I just came across an interesting concept, "numeric synonyms"...I'm 
looking at the powerpoint contribution:

http://issues.apache.org/jira/browse/NUTCH-21

However initially I'm using the code in the context of Lucene, not 
Nutch, so I've changed it slightly.

I have 200 or so PPT files to test it on, and for around 20% of them it 
says there's no body (i.e. no text). A spot check shows this to be wrong, 
and sure enough the code gets exceptions, squelches them, has buffer 
overruns, etc. [but I'm not complaining - I know it's hard to reverse 
engineer MSFT formats].

PPTConstants.java has these definitions:
   public static final int PPT_MASTERSLIDE = 1024;
   public static final int PPT_SLIDEPERSISTANT_ATOM = 1011;
   public static final int PPT_DRAWINGGROUP_ATOM = 61448;
   public static final int PPT_TEXTCHAR_ATOM = 4000;
   public static final int PPT_TEXTBYTE_ATOM = 4008;
   public static final int PPT_USEREDIT_ATOM = 4085;

So I decided to look for other implementations of PowerPoint parsers, 
even in other languages. The obvious Google searches ("powerpoint file 
format") didn't work, and msdn.microsoft.com was of no help, so I decided 
to search Google for just the numbers above, i.e. "4000 4008 4085".

Now, I've used ppthtml from http://chicago.sourceforge.net/ before, but I 
had an old note that it sometimes goes into an infinite loop, so I try 
not to use it for indexing. But hey, it does the same work as the 
Nutch/PPT parser, yet Google didn't return it (or its source code) as a 
match. How can that be? Surely it uses the same constants...

I start reading ppthtml.c and see:

            switch (type) {
		case 0x0FA0:	/* Text String in unicode */
			...
		case 0x0FA8:	/* Text String in ASCII */
			...
		case 0x0FBA:	/* CString - unicode... */
			...

And sure enough, the first two hex values there match decimal values 
from PPTConstants.java above: 0x0FA0 is 4000 and 0x0FA8 is 4008 [the 
third one, 0x0FBA, is not covered by the Java code but doesn't seem to 
matter].
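
For anyone who wants to check the correspondence, here is a minimal sketch in plain Java. The class name RadixCheck and the helper toDecimal are made up for illustration; the only library call is Integer.decode, which accepts "0x" hex literals and leading-zero octal literals.

```java
final class RadixCheck {
    // Hypothetical helper: parse a C-style integer literal.
    // Integer.decode treats "0x..." as hex and a leading "0" as octal.
    static int toDecimal(String literal) {
        return Integer.decode(literal);
    }

    public static void main(String[] args) {
        // Hex case labels as they appear in ppthtml.c
        String[] cases = {"0x0FA0", "0x0FA8", "0x0FBA"};
        for (String c : cases) {
            System.out.println(c + " = " + toDecimal(c));
        }
        // Prints 4000, 4008 and 4026; only the first two are among
        // the PPTConstants.java values listed above.
    }
}
```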

So the point is: is there any prior art or discussion on covering this, 
so that a search for a number can find a match even if the number is 
represented in another base?

In Lucene-speak, this means that either when indexing, or parsing the 
query, the Analyzer expands a number like, say, 4000 to multiple tokens 
at the same offset:
	4000		- decimal, not changed
	0x0*FA0		- hex, "0*" for optional leading zeros
	00*7640		- leading zero usually means octal
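
The expansion itself could look something like the sketch below. This is plain Java with no Lucene dependency; the class NumericSynonyms and the method expand are hypothetical names, not anything from Lucene or Nutch. It also assumes the other side of the match has been lower-cased and had its leading zeros stripped, so it emits only the shortest form of each variant rather than handling the "0*" part. In a real Analyzer the extra variants would presumably be emitted with a position increment of 0 so that they sit at the same position as the original token.

```java
import java.util.ArrayList;
import java.util.List;

final class NumericSynonyms {
    // Hypothetical helper: return the token plus its hex and octal
    // spellings when the token is a plain decimal integer.
    static List<String> expand(String token) {
        List<String> variants = new ArrayList<>();
        variants.add(token); // decimal, not changed
        // Only expand plain decimals of at most 9 digits (fits in an int).
        // A leading zero is skipped because it might already mean octal.
        if (!token.matches("[1-9][0-9]{0,8}")) {
            return variants;
        }
        int value = Integer.parseInt(token);
        variants.add("0x" + Integer.toHexString(value));  // hex, lower case
        variants.add("0" + Integer.toOctalString(value)); // octal
        return variants;
    }

    public static void main(String[] args) {
        System.out.println(expand("4000")); // [4000, 0xfa0, 07640]
    }
}
```

Tokens that are not plain decimals (e.g. "0x0FA0" itself, or a word) come back unchanged, so a complete solution would also need the reverse direction.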

Hope this list is a reasonable place for this.

A related question: is the PowerPoint format documented anywhere? For 
the life of me I couldn't find out where the various constants came from.

thx,
  Dave