Posted to java-user@lucene.apache.org by Paul Feuer <pa...@gmail.com> on 2009/01/30 04:43:36 UTC

indexing binary files?

Hi -

I've looked at the FAQ and the Javadocs, and searched a little on
Google, but haven't been able to figure out whether Lucene can index
binary files.

Our binary files can get up into the 20-30 gigabyte range.

If it is possible, does anyone have pointers to which interfaces I should look at?

Thanks,

./paul

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: indexing binary files?

Posted by Yonik Seeley <ys...@gmail.com>.
Yes, that should work.  Stream the file, converting each record to a
Lucene Document.  All of the fields should probably be indexed only
(not stored) for size reasons, and then you could have a single stored
but not indexed field that would be the offset into your binary file.
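
A minimal sketch of that indexing pass, in plain Java rather than against Lucene's API. The record layout (an int length, a long timestamp, then a UTF-8 account name) and the names RecordScanner, Entry, scan and encode are invented for illustration; the thread never describes the real event format. The only point is that each record's starting offset is captured while parsing, so it can be the one stored value while the parsed fields are indexed. A real 25 GB file would be streamed rather than held in a ByteBuffer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical record layout, invented for this sketch:
//   int  payloadLength
//   long timestamp
//   byte[payloadLength - 8] account name (UTF-8)
class RecordScanner {

    /** What would be handed to the indexer for one event. */
    static final class Entry {
        final long offset;      // the only value that would be stored
        final String account;   // would be indexed, not stored
        final long timestamp;   // would be indexed, not stored

        Entry(long offset, String account, long timestamp) {
            this.offset = offset;
            this.account = account;
            this.timestamp = timestamp;
        }
    }

    /** Walks the buffer record by record, remembering where each record starts. */
    static List<Entry> scan(ByteBuffer data) {
        List<Entry> entries = new ArrayList<>();
        while (data.remaining() >= 4) {
            long offset = data.position();
            int payloadLength = data.getInt();
            long timestamp = data.getLong();
            byte[] name = new byte[payloadLength - 8];
            data.get(name);
            entries.add(new Entry(offset, new String(name, StandardCharsets.UTF_8), timestamp));
        }
        return entries;
    }

    /** Builds one record in the hypothetical layout (helper for the demo). */
    static byte[] encode(String account, long timestamp) {
        byte[] name = account.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 + name.length);
        buf.putInt(8 + name.length).putLong(timestamp).put(name);
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] a = encode("Microsoft", 1L);
        byte[] b = encode("Apache", 2L);
        ByteBuffer file = ByteBuffer.allocate(a.length + b.length);
        file.put(a).put(b);
        file.flip();
        for (Entry e : scan(file)) {
            System.out.println(e.account + " @ " + e.offset);
        }
    }
}
```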

-Yonik

On Fri, Jan 30, 2009 at 6:45 AM, Paul Feuer <pa...@gmail.com> wrote:
>
> The ~25 GB represents about 100 million events at an average of about 250 bytes each. The indexed and searchable values are ordinary things: small bits of text (usually 8-10 bytes), longs, ints, etc.
>
> Also, this 25 GB is a per-day size, which is why expanding the values in it to ASCII is problematic from a storage perspective.
>
> I've never used Lucene before, but looking at the Javadocs, I was hoping that I'd be able to implement some IndexOutput that would store the relevant offsets and then be able to search custom Fieldables in my document. (I'm writing this on the subway right now, so I don't have the docs in front of me, but I think that's what I was thinking last night.)
>
> ./paul
>


Re: indexing binary files?

Posted by Paul Feuer <pa...@gmail.com>.
The ~25 GB represents about 100 million events at an average of about 250 bytes each. The indexed and searchable values are ordinary things: small bits of text (usually 8-10 bytes), longs, ints, etc.

Also, this 25 GB is a per-day size, which is why expanding the values in it to ASCII is problematic from a storage perspective.

I've never used Lucene before, but looking at the Javadocs, I was hoping that I'd be able to implement some IndexOutput that would store the relevant offsets and then be able to search custom Fieldables in my document. (I'm writing this on the subway right now, so I don't have the docs in front of me, but I think that's what I was thinking last night.)

./paul 



Re: indexing binary files?

Posted by Michael McCandless <lu...@mikemccandless.com>.
You can also create a Lucene field using a Reader, if the String is  
really too large to materialize at once.  Such fields cannot be stored  
though.

But, if the String really is so large, I would worry about the end  
user's experience (normally you want a Document to be a rather bite- 
sized piece of content so users browsing through search results won't  
see a single monolithic result covering tons and tons of content).

Mike

Ganesh wrote:

> Use your parser to get the string out of the binary file and index  
> them using Lucene.
>
> Store the string as it is, if it is small otherwise store the path  
> and its offset position. The content could be later retrieved.
>
> Regards
> Ganesh
>
>


Re: indexing binary files?

Posted by Ganesh <em...@yahoo.co.in>.
Use your parser to get the strings out of the binary file and index them
using Lucene.

Store the string as is if it is small; otherwise store the file's path and
the record's offset. The content can then be retrieved later.
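
Retrieving the content later then comes down to a seek. The sketch below assumes a made-up layout (an int length followed by that many UTF-8 bytes) and hypothetical names (OffsetFetcher, append, fetch, demo); it uses only java.io.RandomAccessFile and is not tied to any Lucene API.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical layout for this sketch: each record is an int length
// followed by that many bytes of UTF-8 text.
class OffsetFetcher {

    /** Appends one record and returns the offset at which it starts. */
    static long append(RandomAccessFile file, String text) throws IOException {
        long offset = file.length();
        byte[] payload = text.getBytes(StandardCharsets.UTF_8);
        file.seek(offset);
        file.writeInt(payload.length);
        file.write(payload);
        return offset;
    }

    /** Jumps straight to a stored offset and reads back just that record. */
    static String fetch(RandomAccessFile file, long offset) throws IOException {
        file.seek(offset);
        byte[] payload = new byte[file.readInt()];
        file.readFully(payload);
        return new String(payload, StandardCharsets.UTF_8);
    }

    /** Writes three records to a temp file and fetches the middle one by offset. */
    static String demo() {
        try {
            File tmp = File.createTempFile("events", ".bin");
            tmp.deleteOnExit();
            try (RandomAccessFile file = new RandomAccessFile(tmp, "rw")) {
                append(file, "ACCOUNT=Apache");
                long wanted = append(file, "ACCOUNT=Microsoft");
                append(file, "ACCOUNT=Sun");
                return fetch(file, wanted);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Only the offset (and, with several files, the path) has to live in the index; the record itself is never copied.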

Regards
Ganesh


----- Original Message ----- 
From: "Paul Feuer" <pa...@gmail.com>
To: <ja...@lucene.apache.org>
Sent: Friday, January 30, 2009 10:00 AM
Subject: Re: indexing binary files?


> we have parsers for these files.
>
> to index them, do the string representations need to be stored (aside
> from sitting in the index file)? or can the reader simply provide the
> string in order to record the location of the record in the binary
> file?
>
> if i need to convert the binary file into text fields, the files will
> get VERY large.
>
> the binary data are well-formed events, so queries would be like
> "where ACCOUNT = 'Microsoft'"
>
> ./paul
>
>


Re: indexing binary files?

Posted by Paul Feuer <pa...@gmail.com>.
We have parsers for these files.

To index them, do the string representations need to be stored (aside
from sitting in the index file)? Or can the reader simply provide the
string in order to record the location of the record in the binary
file?

If I need to convert the binary file into text fields, the files will
get VERY large.

The binary data are well-formed events, so queries would be like
"where ACCOUNT = 'Microsoft'".

./paul


On Thu, Jan 29, 2009 at 11:00 PM, Anshum <an...@gmail.com> wrote:
> Hi Paul,
> Lucene is a 'text only' search lib, i.e. as long as you feed in anything as
> a string, you'd be able to use Lucene; otherwise I don't think there's a way.
> How do you even intend to search in those binary files? As in, what would
> be the keyword/phrase? Asking out of curiosity!
>
> --
> Anshum Gupta
> Naukri Labs!
> http://ai-cafe.blogspot.com
>
> The facts expressed here belong to everybody, the opinions to me. The
> distinction is yours to draw............
>
>


Re: indexing binary files?

Posted by Anshum <an...@gmail.com>.
Hi Paul,
Lucene is a 'text only' search lib, i.e. as long as you feed in anything as
a string, you'd be able to use Lucene; otherwise I don't think there's a way.
How do you even intend to search in those binary files? As in, what would
be the keyword/phrase? Asking out of curiosity!

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw............


On Fri, Jan 30, 2009 at 9:13 AM, Paul Feuer <pa...@gmail.com> wrote:

> Hi -
>
> I've looked on the FAQ, the Java Docs, and searched a little in
> google, but haven't been able to figure out if Lucene can index binary
> files.
>
> Our binary files can get up into the 20-30 gigabyte range.
>
> If it is possible, anyone have any pointers to what interfaces I should
> look at?
>
> Thanks,
>
> ./paul
>

Re: indexing binary files?

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Are these some type of parsable-into-text binary files that you have a  
parser handy for?

	Erik

On Jan 29, 2009, at 10:43 PM, Paul Feuer wrote:

> Hi -
>
> I've looked on the FAQ, the Java Docs, and searched a little in
> google, but haven't been able to figure out if Lucene can index binary
> files.
>
> Our binary files can get up into the 20-30 gigabyte range.
>
> If it is possible, anyone have any pointers to what interfaces I  
> should look at?
>
> Thanks,
>
> ./paul
>


Re: indexing binary files?

Posted by Shashi Kant <sk...@sloan.mit.edu>.
Hi Uwe, I was suggesting writing a custom tokenizer. In the worst case it
would emit one character per token. It might not be a very pretty solution,
but it should do the job.
What do you think?

Thanks
Shashi


On Fri, Jan 30, 2009 at 12:57 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

> Hi Shashi,
>
> What would be the point of this? Base64-encoded documents cannot be tokenized
> and searched. To do that, they must be indexed as plain text. If you want to
> store the original binary values as document data in the index, you could
> additionally store them as byte[] in raw binary form.
> You must differentiate between *indexed* and *stored* fields.
>
> But as Paul said, just *index* the text parts from the binary file using a
> parser and also *store* the offset value to get a pointer to the original
> data.
>
> Uwe
>

RE: indexing binary files?

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi Shashi,

What would be the point of this? Base64-encoded documents cannot be tokenized
and searched. To do that, they must be indexed as plain text. If you want to
store the original binary values as document data in the index, you could
additionally store them as byte[] in raw binary form.
You must differentiate between *indexed* and *stored* fields.

But as Paul said, just *index* the text parts from the binary file using a
parser and also *store* the offset value to get a pointer to the original
data.
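
To make the indexed/stored distinction concrete, here is a toy index in plain Java. It is deliberately not Lucene's API: ToyIndex, add and search are hypothetical names, and the postings map only mimics what an inverted index does. The field value is searchable through the postings but is not kept per document, while the offset is kept per document but is not searchable.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A toy stand-in for a search index, not Lucene's API.
class ToyIndex {
    // "Indexed" side: term -> ids of the documents containing it.
    private final Map<String, List<Integer>> postings = new HashMap<>();
    // "Stored" side: document id -> offset of the record in the binary file.
    private final List<Long> storedOffsets = new ArrayList<>();

    /** Adds one event: the field value goes into the postings, the offset is stored. */
    void add(String field, String value, long offset) {
        int docId = storedOffsets.size();
        storedOffsets.add(offset);
        postings.computeIfAbsent(field + ":" + value, k -> new ArrayList<>()).add(docId);
    }

    /** Exact-match lookup, returning the stored offsets of the matching events. */
    List<Long> search(String field, String value) {
        List<Long> hits = new ArrayList<>();
        for (int docId : postings.getOrDefault(field + ":" + value, new ArrayList<>())) {
            hits.add(storedOffsets.get(docId));
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("ACCOUNT", "Microsoft", 0L);
        index.add("ACCOUNT", "Apache", 250L);
        index.add("ACCOUNT", "Microsoft", 500L);
        System.out.println(index.search("ACCOUNT", "Microsoft"));
    }
}
```

A query such as "where ACCOUNT = 'Microsoft'" then yields only offsets, which are used to read the original events out of the binary file.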

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Shashi Kant [mailto:shashi_kant@yahoo.com]
> Sent: Friday, January 30, 2009 3:32 PM
> To: java-user@lucene.apache.org
> Subject: Re: indexing binary files?
> 
> Hi Paul, have you tried persisting the binaries in Base64 format and then
> indexing them?
> As you are aware, Base64 is a robust representation used in email
> attachments for example.
> 
> 
> Thanks
> Shashi
> 
> 
> 


Re: indexing binary files?

Posted by Paul Feuer <pa...@gmail.com>.
Expanding 25+ GB per day is not ideal. If it's possible to index the binary directly, as it sounds like it might be, we'll just do that.

I think what I was missing was AbstractField: I hadn't seen it, and it seems to have what I need (if Field is indeed used the way I assume it is).
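
For a rough sense of what the Base64 route would cost: padded Base64 emits 4 output bytes for every 3 input bytes, so an encoded copy is about a third larger than the binary. A small check of that arithmetic, where Base64Overhead and encodedLength are illustrative names and the 250-byte event size is the average quoted elsewhere in the thread:

```java
import java.util.Base64;

class Base64Overhead {

    /** Length in bytes of the padded Base64 encoding of n input bytes: 4 * ceil(n / 3). */
    static long encodedLength(long n) {
        return 4 * ((n + 2) / 3);
    }

    public static void main(String[] args) {
        // Cross-check the formula against the JDK encoder on a 250-byte event.
        byte[] event = new byte[250];
        int actual = Base64.getEncoder().encode(event).length;
        System.out.println(actual + " == " + encodedLength(250));

        long perDay = 25_000_000_000L; // ~25 GB per day
        System.out.println(encodedLength(perDay) + " bytes after Base64");
    }
}
```

At roughly 25 GB of events per day, that works out to about 33 GB of Base64 text.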

./paul 




Re: indexing binary files?

Posted by Shashi Kant <sh...@yahoo.com>.
Unless I am missing something, I'm not sure I see the issue here. You can convert to Base64 purely for indexing purposes and leave the original binary as-is.





Re: indexing binary files?

Posted by Paul Feuer <pa...@gmail.com>.
The binary events in the file are parsable by both our Java server-side processes and the clients of these processes, so we need to keep the data in the binary format.

./paul 



Re: indexing binary files?

Posted by Shashi Kant <sh...@yahoo.com>.
Hi Paul, have you tried persisting the binaries in Base64 format and then indexing them?
As you are aware, Base64 is a robust representation used in email attachments for example.


Thanks
Shashi


