Posted to user@nutch.apache.org by David Spencer <da...@tropo.com> on 2005/05/06 19:29:57 UTC
Interesting use case for "numeric synonyms"
I just came across an interesting concept, "numeric synonyms"...I'm
looking at the powerpoint contribution:
http://issues.apache.org/jira/browse/NUTCH-21
However initially I'm using the code in the context of Lucene, not
Nutch, so I've changed it slightly.
I have 200 or so PPT files to test it on, and on around 20% it says
there's no body (i.e. no text). A spot check shows this to be wrong, and
sure enough the code gets exceptions, squelches them, has buffer
overruns, etc. [but I'm not complaining - I know it's hard to
reverse engineer MSFT formats].
PPTConstants.java has these definitions:
public static final int PPT_MASTERSLIDE = 1024;
public static final int PPT_SLIDEPERSISTANT_ATOM = 1011;
public static final int PPT_DRAWINGGROUP_ATOM = 61448;
public static final int PPT_TEXTCHAR_ATOM = 4000;
public static final int PPT_TEXTBYTE_ATOM = 4008;
public static final int PPT_USEREDIT_ATOM = 4085;
So I decided to look for other implementations of PowerPoint parsers, even
in other languages - the obvious Google searches didn't work
("powerpoint file format"), and msdn.microsoft.com was of no help, so I
decided to search for just the numbers above w/ Google i.e. "4000 4008
4085".
Now I've used ppthtml from http://chicago.sourceforge.net/ before, but I
had an old note that it sometimes goes into an infinite loop, so I try
not to use it for indexing. Still, it does the same work as the
Nutch/PPT parser, yet Google didn't return it (or its source code) as a
match. How can that be? Surely it uses the same constants...
I start reading ppthtml.c and see:
switch (type) {
case 0x0FA0: /* Text String in unicode */
...
case 0x0FA8: /* Text String in ASCII */
...
case 0x0FBA: /* CString - unicode... */
...
And sure enough, the first two hex values there (0x0FA0 = 4000,
0x0FA8 = 4008) match the decimal values above from PPTConstants.java
[the third one, 0x0FBA = 4026, is not covered by the Java code but
doesn't seem to matter].
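The hex/decimal correspondence is easy to check mechanically. A minimal sketch follows; the class and method names are made up for illustration and come from neither the Nutch patch nor ppthtml:

```java
/** Hypothetical helper: converts the hex case labels from ppthtml.c to decimal. */
final class AtomTypeCheck {

    private AtomTypeCheck() {}

    /** Parses a hex string such as "0FA0" (no "0x" prefix) into its int value. */
    static int fromHex(String hex) {
        return Integer.parseInt(hex, 16);
    }
}
```

Calling AtomTypeCheck.fromHex("0FA0") and AtomTypeCheck.fromHex("0FA8") gives the values of PPT_TEXTCHAR_ATOM and PPT_TEXTBYTE_ATOM respectively.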
So...the point is....is there any prior art or discussions on covering
this, so a search for a number can find a match even if the number is
represented in other bases?
In Lucene-speak, this means that either when indexing, or parsing the
query, the Analyzer expands a number like, say, 4000 to multiple tokens
at the same offset:
4000 - decimal, not changed
0x0*FA0 - hex, "0*" for optional leading zeros
00*7640 - leading zero usually means octal
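
To make the idea concrete, here is a rough sketch in plain Java of the two obvious variants: expanding a decimal token into its hex and octal spellings, or going the other way and folding every spelling down to one canonical decimal form. This is only an illustration under my own assumptions. The class NumericSynonyms and its methods are hypothetical, and the wiring into a Lucene Analyzer (emitting the extra tokens with a position increment of 0) is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Lucene or Nutch. */
final class NumericSynonyms {

    private NumericSynonyms() {}

    /**
     * Returns the token itself, followed by its hex and octal spellings
     * when the token is a plain decimal integer without leading zeros.
     * Any other token comes back alone.
     */
    static List<String> expand(String token) {
        List<String> out = new ArrayList<>();
        out.add(token);
        // At most 18 digits, so the value always fits in a long.
        if (!token.matches("[1-9][0-9]{0,17}")) {
            return out;
        }
        long n = Long.parseLong(token);
        out.add("0x" + Long.toHexString(n));  // lowercase hex, no padding
        out.add("0" + Long.toOctalString(n)); // single leading zero marks octal
        return out;
    }

    /**
     * Folds "0x0FA0", "007640" and "4000" to the same decimal string, so
     * leading zeros and letter case stop mattering. Tokens that do not
     * look like numbers are returned unchanged.
     */
    static String canonical(String token) {
        String t = token.toLowerCase();
        if (t.matches("0x[0-9a-f]{1,15}")) {
            return Long.toString(Long.parseLong(t.substring(2), 16));
        }
        if (t.matches("0[0-7]{1,20}")) {
            return Long.toString(Long.parseLong(t, 8));
        }
        if (t.matches("[0-9]{1,18}")) {
            return Long.toString(Long.parseLong(t, 10));
        }
        return token;
    }
}
```

Expanding keeps the original text searchable as written but cannot anticipate every amount of zero padding, which is what the "0*" above is hinting at. Canonicalizing on both the index and query side avoids that, at the cost of treating any zero-prefixed digit string as octal, which may be the wrong guess.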
Hope this list is a reasonable place for this.
A related question: is the PowerPoint format documented anywhere? For
the life of me I couldn't find out where the various constants came from.
thx,
Dave