Posted to java-user@lucene.apache.org by Ulrich Schinz <ul...@schinz.de> on 2005/06/23 12:41:39 UTC
getting text-snippets
Hi there!
First of all: I'm new to the list, my name is Uli. Hello to all!
I'm quite new to Lucene. I created several indices, some with
GermanAnalyzer, some with StandardAnalyzer...
I added Fields to my Documents with
doc.add(Field.Text("contents", new FileReader(f))); and
doc.add(Field.Keyword("filename", g.getCanonicalPath()));
In the search I get results with doc.get("filename"), which returns the
right filenames for documents matching the search query.
But if I call doc.get("contents"), nothing is returned...
The aim is: I want the filename to generate a link at the top of each
result. Below it I'd like a text snippet showing where the query term
occurred in the document, just like Google or other search
engines... I have seen this in Nutch as well, so it should be
possible, but I'm not sure how to get these text snippets...
Maybe someone can give me some hints on how to manage that.
thx in advance,
regards,
uli
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: getting text-snippets
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jun 23, 2005, at 10:17 AM, Ulrich Schinz wrote:
>> Field.Text(String, Reader) is not a stored field. This is why
>> doc.get("contents") is empty.
>>
>>
>
> OK, I read that in the Lucene javadoc... I don't understand what
> Field.Text(String, Reader, boolean) does... if I set the boolean to
> true, what is storeTermVector??
Term vectors are additional index storage that lets you find out how
many times each term occurred in a field. For your purposes, you don't
need that feature.
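To make the idea concrete: a term vector records, for one field of one document, which terms occur and how often. The plain-Java sketch below only illustrates that kind of information. It is not Lucene's API; the class name TermCounts and its simple letter/digit tokenizer are assumptions made up for this example, and a real analyzer such as GermanAnalyzer would tokenize and stem differently.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class, not part of Lucene.
class TermCounts {

    /** Counts how often each lower-cased token occurs in the given field text. */
    static Map<String, Integer> count(String fieldText) {
        Map<String, Integer> freq = new TreeMap<>();
        // Assumption: a token is any run of letters or digits.
        for (String token : fieldText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if the text starts with a separator
            }
            freq.merge(token, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(count("To be, or not to be"));
    }
}
```

For building result snippets this per-term count is not required, which matches the advice above.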
>> You have some options... change to using a stored field by reading
>> the file contents into a String and using Field.Text(String,
>> String) instead. Or, when rendering the results, go directly to
>> the file pointed to by doc.get("filename") and read its contents
>> then. There are pros/cons to both of these approaches.
>>
>>
>
> OK, I started to try this... but I also want to index PDF files, so
> I get an InputStream from pdftotext. If I try to convert that to a
> String it takes a really long time,
> and we have a lot of data to index....
> I tried different ways to get that done:
> 1.
> String ret = "";
> String[] cmd = {"/usr/bin/pdftotext", "test.pdf", "-"};
> byte[] buffer = new byte[80];
> Process child = Runtime.getRuntime().exec(cmd);
> InputStream is = child.getInputStream();
> BufferedInputStream bis = new BufferedInputStream(is, 80);
> int next;
> // read up to 80 bytes at a time and append each chunk to the result
> while ((next = bis.read(buffer, 0, 80)) != -1) {
>     ret += new String(buffer, 0, next);
> }
>
> Not exactly like that, but conceptually (the real code compiles :-) )
> 2.
> String ret = "";
> String[] cmd = {"/usr/bin/pdftotext", "test.pdf", "-"};
> Process child = Runtime.getRuntime().exec(cmd);
> InputStream is = child.getInputStream();
> int next;
> // read one byte at a time and append it to the result
> while ((next = is.read()) != -1) {
>     ret += (char) next;
> }
>
> But those versions are both really slow... it takes more than 20
> minutes (at minimum) to read a PDF file of 900 KB...
> Is there a way to make that faster???
You should consider using PDFBox for reading PDF files.
Erik
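Independently of which PDF tool is used, a likely contributor to the slowness in the loops quoted above is the repeated ret += ..., which copies the whole accumulated String on every iteration, combined with very small reads. A minimal sketch of the usual alternative follows, assuming a modern JDK (StringBuilder and try-with-resources did not exist in the Java versions current in 2005). The class and method names are hypothetical, and UTF-8 is only an assumed encoding for the tool's output.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration.
class StreamToString {

    /** Reads the whole stream into a String, decoding it with the given charset. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[8192]; // large chunks instead of 80 bytes or 1 byte
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n); // amortized append, no full copy per chunk
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for child.getInputStream() of the external process.
        InputStream in = new ByteArrayInputStream(
                "text extracted from a pdf".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(in, StandardCharsets.UTF_8));
    }
}
```

With a real process, the same method would be called with child.getInputStream() in place of the in-memory stream.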
Re: getting text-snippets
Posted by Ulrich Schinz <ul...@schinz.de>.
> Field.Text(String, Reader) is not a stored field. This is why
> doc.get("contents") is empty.
>
OK, I read that in the Lucene javadoc... I don't understand what
Field.Text(String, Reader, boolean) does... if I set the boolean to true,
what is storeTermVector??
>
> You have some options... change to using a stored field by reading
> the file contents into a String and using Field.Text(String,
> String) instead. Or, when rendering the results, go directly to
> the file pointed to by doc.get("filename") and read its contents
> then. There are pros/cons to both of these approaches.
>
OK, I started to try this... but I also want to index PDF files, so I
get an InputStream from pdftotext. If I try to convert that to a
String it takes a really long time,
and we have a lot of data to index....
I tried different ways to get that done:
1.
String ret = "";
String[] cmd = {"/usr/bin/pdftotext", "test.pdf", "-"};
byte[] buffer = new byte[80];
Process child = Runtime.getRuntime().exec(cmd);
InputStream is = child.getInputStream();
BufferedInputStream bis = new BufferedInputStream(is, 80);
int next;
// read up to 80 bytes at a time and append each chunk to the result
while ((next = bis.read(buffer, 0, 80)) != -1) {
    ret += new String(buffer, 0, next);
}
Not exactly like that, but conceptually (the real code compiles :-) )
2.
String ret = "";
String[] cmd = {"/usr/bin/pdftotext", "test.pdf", "-"};
Process child = Runtime.getRuntime().exec(cmd);
InputStream is = child.getInputStream();
int next;
// read one byte at a time and append it to the result
while ((next = is.read()) != -1) {
    ret += (char) next;
}
But those versions are both really slow... it takes more than 20
minutes (at minimum) to read a PDF file of 900 KB...
Is there a way to make that faster???
regards,
uli
Re: getting text-snippets
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jun 23, 2005, at 6:41 AM, Ulrich Schinz wrote:
> Hi there!
>
> First of all: I'm new to the list, my name is Uli. Hello to all!
>
> I'm quite new to Lucene. I created several indices, some
> with GermanAnalyzer, some with StandardAnalyzer...
> I added Fields to my Documents with
> doc.add(Field.Text("contents", new FileReader(f))); and
> doc.add(Field.Keyword("filename", g.getCanonicalPath()));
>
> In the search I get results with doc.get("filename"), which returns
> the right filenames for documents matching the search query.
> But if I call doc.get("contents"), nothing is returned...
Field.Text(String, Reader) is not a stored field. This is why
doc.get("contents") is empty.
> The aim is: I want the filename to generate a link at the top of each
> result. Below it I'd like a text snippet showing where the query term
> occurred in the document, just like Google or other search
> engines... I have seen this in Nutch as well, so it should be
> possible, but I'm not sure how to get these text snippets...
>
> Maybe someone can give me some hints on how to manage that.
You have some options... change to using a stored field by reading
the file contents into a String and using Field.Text(String, String)
instead. Or, when rendering the results, go directly to the file
pointed to by doc.get("filename") and read its contents then. There
are pros/cons to both of these approaches.
Erik
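Once the text is available as a String, whether from a stored field or re-read from the file named by doc.get("filename"), a snippet can be cut out around the first occurrence of the query term. The following is a rough sketch in plain Java under several assumptions: the names SnippetSketch, snippet and snippetFromFile are made up for this example, the file is assumed to be UTF-8, and matching is a literal case-insensitive search. With a stemming analyzer such as GermanAnalyzer the indexed term can differ from the word in the text, so a literal search may miss hits; Lucene's contrib highlighter package is aimed at that problem, and this sketch does not use it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration class, not part of Lucene.
class SnippetSketch {

    /** Index of the first case-insensitive occurrence of term in text, or -1. */
    static int indexOfIgnoreCase(String text, String term) {
        for (int i = 0; i + term.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, term, 0, term.length())) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Returns the first hit with up to `context` characters on each side,
     * marked with "..." where text was cut off, or null if the term is absent.
     */
    static String snippet(String text, String term, int context) {
        int hit = indexOfIgnoreCase(text, term);
        if (hit < 0) {
            return null;
        }
        int start = Math.max(0, hit - context);
        int end = Math.min(text.length(), hit + term.length() + context);
        StringBuilder sb = new StringBuilder();
        if (start > 0) {
            sb.append("...");
        }
        sb.append(text, start, end);
        if (end < text.length()) {
            sb.append("...");
        }
        return sb.toString();
    }

    /** Variant that re-reads the file at render time (encoding is an assumption). */
    static String snippetFromFile(Path file, String term, int context) throws IOException {
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return snippet(text, term, context);
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog";
        // prints ...brown fox jumps...
        System.out.println(snippet(text, "FOX", 6));
    }
}
```

Storing the contents in the index keeps rendering independent of the original files but makes the index larger; re-reading the file keeps the index small but needs the file to still be present and readable when results are shown.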