You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by MM...@LEVI.com on 2003/05/01 07:25:39 UTC

RE: Using lucene with HSSF from Apache

Hi, 
I did it, but I use only lucene. You need to create an IndexWriter with
SimpleAnalyzer, an InputStream as new FileInputStream, create Document with
two Fields: one contains the file path and one contains the file's content).
That's all. 
Michel

-----Original Message-----
From: Shoba Ramachandran [mailto:shoba_duruvan@yahoo.com] 
Sent: Wednesday, April 30, 2003 6:10 PM
To: lucene-user@jakarta.apache.org
Subject: Using lucene with HSSF from Apache

Hi,

Has anyone tried to index xls and doc files?
I'm trying to do with HSSF from apache and using
lucene1.2

This code returns me binary and printing it out gives
junk chracters. File indexed like this returns nothing
upon search. 

public static byte[] parse(File file) throws Exception
  {
    POIFSFileSystem fs = new POIFSFileSystem(new
FileInputStream(file));
HSSFWorkbook wb = new HSSFWorkbook(fs);
byte[] xlsInfo = wb.getBytes();
    System.out.println("xls content :  "+
xlsInfo.toString());
return xlsInfo;
  }

Thanks in advance for your help
Shoba


__________________________________
Do you Yahoo!?
The New Yahoo! Search - Faster. Easier. Bingo.
http://search.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


RE: Using lucene with HSSF from Apache

Posted by Kelvin Tan <ke...@relevanz.com>.
Perhaps others can be more helpful here. I haven't had difficulty using a 
million terms, but perhaps its just my usage.

On Mon, 5 May 2003 07:38:05 -0700 (PDT), Shoba Ramachandran said:
>Thanks very much HTH,
>What would be the good number for max.no. of terms for
>a very huge document say 20MB PDF file.
>
>Thanks
>Shoba
>
>--- kelvin-lists@relevanz.com wrote:
>>It's doable, because if you open any ms office
>>document in your text
>>editor, you'll see that text is all there,
>>surrounded with binary
>>characters and other kinds of mumbo-jumbo. The
>>biggest minus is that
>>you'll exceed the max number of terms in a hurry,
>>even if you set it
>>to like a million, once you hit a reasonably large
>>file (>5MB).
>>
>>What I do is filter out all the unreadable stuff
>>using some regex
>>filters. Only minus is that its somewhat slower coz
>>of the line by
>>line processing, but I'm sure its _much_ faster than
>>attempting to
>>add all those nonsense data.
>>
>>HTH
>>
>>On Fri, 2 May 2003 08:30:15 -0700 (PDT), Shoba
>>Ramachandran wrote:
>>>Hi Michel,
>>>
>>>Are you able to index and search xls and doc files
>>>with just Lucene using SimpleAnalyzer????
>>>There is no need for POI?
>>>With Lucene, you are able to extract the xls
>>content
>>>as text?
>>>
>>>Let me try as you explained.
>>>Thanks very much for your reply.
>>>Shoba
>>>
>>>--- MMachado@LEVI.com wrote:
>>>>Hi,
>>>>I did it, but I use only lucene. You need to
>>create
>>>>an IndexWriter with
>>>>SimpleAnalyzer, an InputStream as new
>>>>FileInputStream, create Document with
>>>>two Fields: one contains the file path and one
>>>>contains the file's content).
>>>>That's all.
>>>>Michel
>>>>
>>>>-----Original Message-----
>>>>From: Shoba Ramachandran
>>>>[mailto:shoba_duruvan@yahoo.com]
>>>>Sent: Wednesday, April 30, 2003 6:10 PM
>>>>To: lucene-user@jakarta.apache.org
>>>>Subject: Using lucene with HSSF from Apache
>>>>
>>>>Hi,
>>>>
>>>>Has anyone tried to index xls and doc files?
>>>>I'm trying to do with HSSF from apache and using
>>>>lucene1.2
>>>>
>>>>This code returns me binary and printing it out
>>>>gives
>>>>junk chracters. File indexed like this returns
>>>>nothing
>>>>upon search.
>>>>
>>>>public static byte[] parse(File file) throws
>>>>Exception
>>>>{
>>>>POIFSFileSystem fs = new POIFSFileSystem(new
>>>>FileInputStream(file));
>>>>HSSFWorkbook wb = new HSSFWorkbook(fs);
>>>>byte[] xlsInfo = wb.getBytes();
>>>>System.out.println("xls content :  "+
>>>>xlsInfo.toString());
>>>>return xlsInfo;
>>>>}
>>>>
>>>>Thanks in advance for your help
>>>>Shoba
>>>>
>>>>
>>>>__________________________________
>>>>Do you Yahoo!?
>>>>The New Yahoo! Search - Faster. Easier. Bingo.
>>>>http://search.yahoo.com
>>>>
>>>>
>>
>>--------------------------------------------------------------------
>>-
>>
>>>>To unsubscribe, e-mail:
>>>>lucene-user-unsubscribe@jakarta.apache.org
>>>>For additional commands, e-mail:
>>>>lucene-user-help@jakarta.apache.org
>>>>
>>>>
>>
>>--------------------------------------------------------------------
>>-
>>
>>>>To unsubscribe, e-mail:
>>>>lucene-user-unsubscribe@jakarta.apache.org
>>>>For additional commands, e-mail:
>>>>lucene-user-help@jakarta.apache.org
>>>>
>>>
>>>
>>>__________________________________
>>>Do you Yahoo!?
>>>The New Yahoo! Search - Faster. Easier. Bingo.
>>>http://search.yahoo.com
>>>
>>
>>--------------------------------------------------------------------
>>-
>>
>>>To unsubscribe, e-mail:
>>lucene-user-unsubscribe@jakarta.apache.org
>>>For additional commands, e-mail:
>>lucene-user-help@jakarta.apache.org
>>>
>>
>>
>>
>>
>>
>---------------------------------------------------------------------
>>To unsubscribe, e-mail:
>>lucene-user-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail:
>>lucene-user-help@jakarta.apache.org
>>
>
>
>__________________________________
>Do you Yahoo!?
>The New Yahoo! Search - Faster. Easier. Bingo.
>http://search.yahoo.com
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org





---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


RE: Using lucene with HSSF from Apache

Posted by Shoba Ramachandran <sh...@yahoo.com>.
Thanks very much HTH,
What would be the good number for max.no. of terms for
a very huge document say 20MB PDF file.

Thanks
Shoba

--- kelvin-lists@relevanz.com wrote:
> It's doable, because if you open any ms office
> document in your text
> editor, you'll see that text is all there,
> surrounded with binary
> characters and other kinds of mumbo-jumbo. The
> biggest minus is that
> you'll exceed the max number of terms in a hurry,
> even if you set it
> to like a million, once you hit a reasonably large
> file (>5MB).
> 
> What I do is filter out all the unreadable stuff
> using some regex
> filters. Only minus is that its somewhat slower coz
> of the line by
> line processing, but I'm sure its _much_ faster than
> attempting to
> add all those nonsense data.
> 
> HTH
> 
> On Fri, 2 May 2003 08:30:15 -0700 (PDT), Shoba
> Ramachandran wrote:
> >Hi Michel,
> >
> >Are you able to index and search xls and doc files
> >with just Lucene using SimpleAnalyzer????
> >There is no need for POI?
> >With Lucene, you are able to extract the xls
> content
> >as text?
> >
> >Let me try as you explained.
> >Thanks very much for your reply.
> >Shoba
> >
> >--- MMachado@LEVI.com wrote:
> >>Hi,
> >>I did it, but I use only lucene. You need to
> create
> >>an IndexWriter with
> >>SimpleAnalyzer, an InputStream as new
> >>FileInputStream, create Document with
> >>two Fields: one contains the file path and one
> >>contains the file's content).
> >>That's all.
> >>Michel
> >>
> >>-----Original Message-----
> >>From: Shoba Ramachandran
> >>[mailto:shoba_duruvan@yahoo.com]
> >>Sent: Wednesday, April 30, 2003 6:10 PM
> >>To: lucene-user@jakarta.apache.org
> >>Subject: Using lucene with HSSF from Apache
> >>
> >>Hi,
> >>
> >>Has anyone tried to index xls and doc files?
> >>I'm trying to do with HSSF from apache and using
> >>lucene1.2
> >>
> >>This code returns me binary and printing it out
> >>gives
> >>junk chracters. File indexed like this returns
> >>nothing
> >>upon search.
> >>
> >>public static byte[] parse(File file) throws
> >>Exception
> >>{
> >>POIFSFileSystem fs = new POIFSFileSystem(new
> >>FileInputStream(file));
> >>HSSFWorkbook wb = new HSSFWorkbook(fs);
> >>byte[] xlsInfo = wb.getBytes();
> >>System.out.println("xls content :  "+
> >>xlsInfo.toString());
> >>return xlsInfo;
> >>}
> >>
> >>Thanks in advance for your help
> >>Shoba
> >>
> >>
> >>__________________________________
> >>Do you Yahoo!?
> >>The New Yahoo! Search - Faster. Easier. Bingo.
> >>http://search.yahoo.com
> >>
> >>
>
>---------------------------------------------------------------------
> 
> >>To unsubscribe, e-mail:
> >>lucene-user-unsubscribe@jakarta.apache.org
> >>For additional commands, e-mail:
> >>lucene-user-help@jakarta.apache.org
> >>
> >>
>
>---------------------------------------------------------------------
> 
> >>To unsubscribe, e-mail:
> >>lucene-user-unsubscribe@jakarta.apache.org
> >>For additional commands, e-mail:
> >>lucene-user-help@jakarta.apache.org
> >>
> >
> >
> >__________________________________
> >Do you Yahoo!?
> >The New Yahoo! Search - Faster. Easier. Bingo.
> >http://search.yahoo.com
> >
>
>---------------------------------------------------------------------
> 
> >To unsubscribe, e-mail:
> lucene-user-unsubscribe@jakarta.apache.org
> >For additional commands, e-mail:
> lucene-user-help@jakarta.apache.org
> >
> 
> 
> 
> 
>
---------------------------------------------------------------------
> To unsubscribe, e-mail:
> lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail:
> lucene-user-help@jakarta.apache.org
> 


__________________________________
Do you Yahoo!?
The New Yahoo! Search - Faster. Easier. Bingo.
http://search.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


RE: Using lucene with HSSF from Apache

Posted by ke...@relevanz.com.
It's doable, because if you open any ms office document in your text 
editor, you'll see that text is all there, surrounded with binary 
characters and other kinds of mumbo-jumbo. The biggest minus is that 
you'll exceed the max number of terms in a hurry, even if you set it 
to like a million, once you hit a reasonably large file (>5MB). 

What I do is filter out all the unreadable stuff using some regex 
filters. Only minus is that its somewhat slower coz of the line by 
line processing, but I'm sure its _much_ faster than attempting to 
add all those nonsense data.

HTH

On Fri, 2 May 2003 08:30:15 -0700 (PDT), Shoba Ramachandran wrote:
>Hi Michel,
>
>Are you able to index and search xls and doc files
>with just Lucene using SimpleAnalyzer????
>There is no need for POI?
>With Lucene, you are able to extract the xls content
>as text?
>
>Let me try as you explained.
>Thanks very much for your reply.
>Shoba
>
>--- MMachado@LEVI.com wrote:
>>Hi,
>>I did it, but I use only lucene. You need to create
>>an IndexWriter with
>>SimpleAnalyzer, an InputStream as new
>>FileInputStream, create Document with
>>two Fields: one contains the file path and one
>>contains the file's content).
>>That's all.
>>Michel
>>
>>-----Original Message-----
>>From: Shoba Ramachandran
>>[mailto:shoba_duruvan@yahoo.com]
>>Sent: Wednesday, April 30, 2003 6:10 PM
>>To: lucene-user@jakarta.apache.org
>>Subject: Using lucene with HSSF from Apache
>>
>>Hi,
>>
>>Has anyone tried to index xls and doc files?
>>I'm trying to do with HSSF from apache and using
>>lucene1.2
>>
>>This code returns me binary and printing it out
>>gives
>>junk chracters. File indexed like this returns
>>nothing
>>upon search.
>>
>>public static byte[] parse(File file) throws
>>Exception
>>{
>>POIFSFileSystem fs = new POIFSFileSystem(new
>>FileInputStream(file));
>>HSSFWorkbook wb = new HSSFWorkbook(fs);
>>byte[] xlsInfo = wb.getBytes();
>>System.out.println("xls content :  "+
>>xlsInfo.toString());
>>return xlsInfo;
>>}
>>
>>Thanks in advance for your help
>>Shoba
>>
>>
>>__________________________________
>>Do you Yahoo!?
>>The New Yahoo! Search - Faster. Easier. Bingo.
>>http://search.yahoo.com
>>
>>
>---------------------------------------------------------------------

>>To unsubscribe, e-mail:
>>lucene-user-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail:
>>lucene-user-help@jakarta.apache.org
>>
>>
>---------------------------------------------------------------------

>>To unsubscribe, e-mail:
>>lucene-user-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail:
>>lucene-user-help@jakarta.apache.org
>>
>
>
>__________________________________
>Do you Yahoo!?
>The New Yahoo! Search - Faster. Easier. Bingo.
>http://search.yahoo.com
>
>---------------------------------------------------------------------

>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


RE: Using lucene with HSSF from Apache

Posted by Shoba Ramachandran <sh...@yahoo.com>.
Hi Michel,

Are you able to index and search xls and doc files
with just Lucene using SimpleAnalyzer????
There is no need for POI?
With Lucene, you are able to extract the xls content
as text?

Let me try as you explained.
Thanks very much for your reply.
Shoba

--- MMachado@LEVI.com wrote:
> Hi, 
> I did it, but I use only lucene. You need to create
> an IndexWriter with
> SimpleAnalyzer, an InputStream as new
> FileInputStream, create Document with
> two Fields: one contains the file path and one
> contains the file's content).
> That's all. 
> Michel
> 
> -----Original Message-----
> From: Shoba Ramachandran
> [mailto:shoba_duruvan@yahoo.com] 
> Sent: Wednesday, April 30, 2003 6:10 PM
> To: lucene-user@jakarta.apache.org
> Subject: Using lucene with HSSF from Apache
> 
> Hi,
> 
> Has anyone tried to index xls and doc files?
> I'm trying to do with HSSF from apache and using
> lucene1.2
> 
> This code returns me binary and printing it out
> gives
> junk chracters. File indexed like this returns
> nothing
> upon search. 
> 
> public static byte[] parse(File file) throws
> Exception
>   {
>     POIFSFileSystem fs = new POIFSFileSystem(new
> FileInputStream(file));
> HSSFWorkbook wb = new HSSFWorkbook(fs);
> byte[] xlsInfo = wb.getBytes();
>     System.out.println("xls content :  "+
> xlsInfo.toString());
> return xlsInfo;
>   }
> 
> Thanks in advance for your help
> Shoba
> 
> 
> __________________________________
> Do you Yahoo!?
> The New Yahoo! Search - Faster. Easier. Bingo.
> http://search.yahoo.com
> 
>
---------------------------------------------------------------------
> To unsubscribe, e-mail:
> lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail:
> lucene-user-help@jakarta.apache.org
> 
>
---------------------------------------------------------------------
> To unsubscribe, e-mail:
> lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail:
> lucene-user-help@jakarta.apache.org
> 


__________________________________
Do you Yahoo!?
The New Yahoo! Search - Faster. Easier. Bingo.
http://search.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org