You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Antonio Vazquez <av...@cystelcom.com> on 2001/11/29 15:26:08 UTC
Indexing other documents type than html and txt
Hi all,
I have a doubt. I know that lucene can index html and text documents, but
can it index other type of documents like pdf,docs, and xls documents? if it
can, how can I implement it? Perhaps can implement it like html and txt
indexing?
regards
Antonio
--
To unsubscribe, e-mail: <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>
Re: Indexing other documents type than html and txt
Posted by "Cecil, Paula New" <cn...@fuse.net>.
Here is another version of something I had posted earlier. It attempts to
read the "text" out of binary files. Not perfect and doesn't work at all on
PDF. It permits you use the "reader" form of a Field to index.
import java.util.*;
import java.io.*;
/**
<p>This class is designed to retrieve text from binary files.
The occasion for its development was to find a way generic way to
index typical office documents which are almost always in a
a proprietary and binary form.
<p>This class will <b>not</b> work with PDF files.
<p>You can exercise some control over the result by using the
<code>setCharArray()</code> method and the
<code>setShortestToken()</code> method.
<ul>
<li><code>setCharArray()</code>: allows you to override the default
characters to keep. All others are eliminated. The default "keepers"
are all ASCII character plus whitespace. This means that if a
text file is the input, it will pass thru unchanged (except that
consequtive blanks are squeezed to a single blank).
<li><code>setShortestToken()</code>: allows you only keep strings of
a minimum length. By default the length is zero, meaning that all
tokens are passed.
</ul>
<p>Note lastly that this class is only designed to work with ASCII.
It may not be difficult to change to support Unicode, but I do
not know how to do that.
*/
public class BinaryReader
extends java.io.FilterReader
{
// private vars
// for debugging
private int count=0;
private int rawcnt=0;
private int shortestToken = 0;
// default char set to keep, blank out everything else
private char[][] charArray = {
{'!', '~'},
{'\t', '\t'},
{'\r', '\r'},
{'\n', '\n'},
};
private String leftovers="";
private char charFilter(char c) {
for (int i=0; i < charArray.length; i++) {
if ( c >= charArray[i][0] && c <= charArray[i][1] ) {
return c;
}
}
return ' ';
}
public BinaryReader(Reader in) {
super(in);
}
/**
<p>This method may be used to override the ranges of characters
that are retained. All others are elminiated. The default is:
<code>
private char[][] charArray = {
{'!', '~'},
{'\t', '\t'},
{'\r', '\r'},
{'\n', '\n'},
};
</code>
<p>Note that the ranges are inclusive and that to pick our a
"single" character instead of a range, just make that character
both the min and max (as shown for the whitespace characters above).
@param char[][] - array of ranges to keep
*/
public void setCharArray( char[][] keepers ) {
// in each row, column 1 is min and column 2 is max
// to pick out a single character instead of a range
// just make it both min and max.
charArray = keepers;
}
/**
<p>This method may be used to eliminate "short strings" of text.
By default it takes even single letters, since the value is
initialized to zero. For example, if the
length 3 is used, the single and two letter strings will not
be returned.
<p><b>Warning: the test doesn't always work for strings that
begin a line of text (at least in DOS/Windows).</b>
@param int len - length of shortest strings to pass
*/
public void setShortestToken(int len) {
shortestToken = len;
}
/**
<p>Reads a single character and runs it through the filter. The
(int) character returned will either be -1 for end-of-file,
a blank (indicating it was filtered), or the character unchanged.
*/
public int read() throws IOException
{
int c = in.read();
if ( c != -1 ) return c;
rawcnt++;
count++;
return charFilter((char)c);
}
/**
<p>Reads from stream and populates the supplied char array.
@param char[] cbuf - character buffer to fill
@return int - number of characters actually placed into the buffer
*/
public int read(char[] cbuf) throws IOException
{
return read(cbuf, 0, cbuf.length);
}
/**
<p>Reads from stream and populates the supplied char array
using the offset and length provided.
@param char[] cbuf - character buffer to fill
@param int offset - offset to being filling array
@param int length - maximun characters to place into the array
@return int - number of characters actually placed into the buffer
*/
public int read(char[] cbuf, int off, int len)
throws IOException
{
char[] cb = new char[len];
int cnt = in.read(cb);
if ( cnt == -1 ) {
file://System.out.println("At end, rawcnt is "+rawcnt);
return cnt; // done
}
int cnt2=cnt;
int loc = off;
for ( int i=0; i < cnt; i++ ) {
cbuf[loc++] = charFilter(cb[i]);
}
char[] weeded = filter(new String(cbuf, off, cnt));
if ( weeded.length > -1 ) {
cnt2 = weeded.length;
// redo buffer
for (int i=0; i < cnt2; i++) {
cbuf[off+i] = weeded[i];
}
}
rawcnt += cnt;
count += cnt2;
return cnt2;
}
private char[] filter(String instring)
{
// record the buffer size (ie, size of incoming string)
int max = instring.length();
// combine leftovers into incoming string and reset leftovers
String s = leftovers+instring;
leftovers="";
StringBuffer sb = new StringBuffer(s.length());
StringTokenizer st = new StringTokenizer(s," ");
String tok=null;
while (st.hasMoreTokens()) {
tok = st.nextToken();
int toklength = tok.length();
if ( toklength < shortestToken ) {
// skip it
continue;
}
sb.append(tok);
sb.append(' ');
}
String t = sb.toString();
t = t.substring(0,t.length()); // remove the appended blank
if ( t.length() > max ) {
leftovers = t.substring(max);
t = t.substring(0,max);
}
return t.toCharArray();
}
/**
<p>Returns the number characters read from the Reader stream
@return int number of characters read
*/
public int getInputCount() { return rawcnt; }
/**
<p>Returns the number characters passed by the filter (ie, after the
binary characters are removed.
@return int number of characters not filtered out
*/
public int getOutputCount() { return count; }
/**
<p>A handy main() method to test or perform the filtering using the
defaults. Started from the command line, it takes two required
arguments: the input filename and the output filename.
<p>An optional third argument is an integer to set the
shortest tokens to pass the filter.
*/
public static void main(String[] args)
throws FileNotFoundException, IOException
{
if ( args.length < 2 ) {
System.out.println(
"Usage: java BinaryReader infile outfile [shortest]");
System.out.println("where 'shortest' is the shortest token passed.");
System.exit(0);
}
FileWriter fw = new FileWriter( args[1] );
FileReader fr = new FileReader( args[0] );
BufferedReader br = new BufferedReader(fr);
BinaryReader binr = new BinaryReader(br);
if ( args.length > 2 ) {
binr.setShortestToken(Integer.parseInt(args[2]));
}
char[] cb = new char[1024];
int cnt;
while ( (cnt = binr.read(cb)) != -1 ) {
fw.write(cb,0,cnt);
}
fw.close();
int ocnt = binr.getOutputCount();
int icnt = binr.getInputCount();
System.out.println("Input Character Count ="+icnt);
System.out.println("Output Character Count="+ocnt);
}
}
----- Original Message -----
From: Antonio Vazquez <av...@cystelcom.com>
To: Lucene Users List <lu...@jakarta.apache.org>
Sent: Thursday, November 29, 2001 6:26 AM
Subject: Indexing other documents type than html and txt
> Hi all,
> I have a doubt. I know that lucene can index html and text documents, but
> can it index other type of documents like pdf,docs, and xls documents? if
it
> can, how can I implement it? Perhaps can implement it like html and txt
> indexing?
>
> regards
>
> Antonio
>
>
> --
> To unsubscribe, e-mail:
<ma...@jakarta.apache.org>
> For additional commands, e-mail:
<ma...@jakarta.apache.org>
>
--
To unsubscribe, e-mail: <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>