
Posted to dev@lucene.apache.org by blazingwolf7 <bl...@gmail.com> on 2008/07/04 10:19:21 UTC

Untokenized URL

Hi,

I am currently working on retrieving the url and contentLength of each document
found during the search. I want to retrieve them during the calculation of the
score so that I can influence the score in some other way.

I used the methods from TermDocs and TermEnum to get the information.
However, the url I retrieve is, as most will know, tokenized. It is broken
down into several parts and I would have to rejoin them. Can anyone help me
with this? I am stuck here wondering how to get back the whole url without
using a Reader.

Also, I tried to retrieve the contentLength, but the result returned is null.
Why is that? I opened the index using Luke and the contentLength is there,
but when I try to get it this way, the result is null.

Can anyone help me with both of these problems? Any help will be
appreciated. Thanks
-- 
View this message in context: http://www.nabble.com/Untokenized-URL-tp18275048p18275048.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

RE: Untokenized URL

Posted by blazingwolf7 <bl...@gmail.com>.

Thanks for the help



-- 
View this message in context: http://www.nabble.com/Untokenized-URL-tp18275048p18311983.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.



RE: Untokenized URL

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi,

Read here: http://wiki.apache.org/lucene-java/LuceneFAQ

I think this type of question is better suited to the Lucene Users mailing
list
(http://lucene.apache.org/java/docs/mailinglists.html#Java%20User%20List).
This list is for developers of Lucene itself, not for users asking for help
with how to implement something specific with Lucene.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de





RE: Untokenized URL

Posted by blazingwolf7 <bl...@gmail.com>.

Well, I am open to suggestions, except for using a reader. As for
Document.get() & Co., how does that work?



-- 
View this message in context: http://www.nabble.com/Untokenized-URL-tp18275048p18311247.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.



RE: Untokenized URL

Posted by Uwe Schindler <uw...@thetaphi.de>.

As Shai said before, you should index the field twice: once as a tokenized
field for your search, and once untokenized under a different name (e.g.
"field-untokenized"). Use the untokenized field for your TermEnum code and
the tokenized one for normal search queries.
If you want to retrieve the field contents with Document.get() & Co. instead
of TermEnum, you can add the field once with the flags Tokenized & Stored.
But this does not work with your TermEnum solution.
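
To make the difference concrete, here is a rough sketch in plain Java. It is a toy model, not Lucene's API: the ToyIndex class, its methods, and the contentLength value "1234" are all made up for illustration, and the tokenizer is a naive split on non-alphanumeric characters. It only tries to show why enumerating the terms of a tokenized field yields fragments, why an untokenized copy yields the whole url as one term, and why a field that is indexed but not stored comes back as null from a stored-field lookup.

```java
import java.util.*;

// Toy model only, NOT Lucene's API. All names here are hypothetical.
class ToyIndex {
    // field name -> sorted term dictionary (term text -> ids of docs containing it)
    private final Map<String, TreeMap<String, List<Integer>>> terms = new HashMap<>();
    // doc id -> stored field values (only fields added with store = true)
    private final Map<Integer, Map<String, String>> stored = new HashMap<>();

    void add(int docId, String field, String value, boolean store, boolean tokenize) {
        if (store) {
            stored.computeIfAbsent(docId, k -> new HashMap<>()).put(field, value);
        }
        // Tokenized: split into lower-cased pieces; untokenized: the whole value is one term.
        List<String> tokens = tokenize
                ? Arrays.asList(value.toLowerCase().split("[^a-z0-9]+"))
                : Collections.singletonList(value);
        for (String token : tokens) {
            if (token.isEmpty()) continue;
            terms.computeIfAbsent(field, k -> new TreeMap<>())
                 .computeIfAbsent(token, k -> new ArrayList<>())
                 .add(docId);
        }
    }

    // Terms of one field in sorted order, i.e. what a term enumeration would walk over.
    List<String> termsOf(String field) {
        return new ArrayList<>(terms.getOrDefault(field, new TreeMap<>()).keySet());
    }

    // Stored value, or null when the field was indexed but never stored.
    String get(int docId, String field) {
        return stored.getOrDefault(docId, Collections.emptyMap()).get(field);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        String url = "http://www.cnn.com";
        index.add(0, "url", url, false, true);               // searchable, split into tokens
        index.add(0, "url-untokenized", url, true, false);   // one term, also stored
        index.add(0, "contentLength", "1234", false, false); // indexed only, not stored

        System.out.println(index.termsOf("url"));             // fragments of the url
        System.out.println(index.termsOf("url-untokenized")); // the url as a single term
        System.out.println(index.get(0, "url-untokenized"));  // stored, so retrievable
        System.out.println(index.get(0, "contentLength"));    // not stored, so null
    }
}
```

The tokenized and untokenized copies live under different field names, so a search can keep using the first while term enumeration uses the second.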

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de





Re: Untokenized URL

Posted by blazingwolf7 <bl...@gmail.com>.

I am trying to retrieve the url and use it as a filter. The main problem is
that I don't want to use a reader to continuously retrieve the url for each
document found.

TermDocs termDocs = reader.termDocs();
TermEnum termEnum = reader.terms(new Term(field, ""));
do {
   Term term = termEnum.term();
   // stop when the enumeration is exhausted or moves on to another field
   if (term == null || !term.field().equals(field)) break;
   // term.text() is a single token of the url here, not the whole url
} while (termEnum.next());

I am using this code to retrieve the field containing the url, but it is
tokenized. Is there any way to untokenize it, or is there a better way to do
this?
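
One possible way to avoid a lookup per document, assuming the url is also indexed as a single untokenized term, is to walk the terms of that field once and fill an array from document id to url. The sketch below is plain Java over a toy map, not Lucene's API; the class UrlByDoc, the method build, and the second url are hypothetical. As far as I know, Lucene's FieldCache is built on a similar idea.

```java
import java.util.*;

// Toy model only, not Lucene's API; class and method names are hypothetical.
class UrlByDoc {
    // termToDocs: untokenized url term -> ids of the documents containing it.
    // maxDoc: one more than the largest document id.
    static String[] build(SortedMap<String, int[]> termToDocs, int maxDoc) {
        String[] urlOfDoc = new String[maxDoc];
        for (Map.Entry<String, int[]> entry : termToDocs.entrySet()) { // walk the terms
            for (int docId : entry.getValue()) {                        // walk that term's docs
                urlOfDoc[docId] = entry.getKey();
            }
        }
        return urlOfDoc; // entries stay null for documents without a url
    }

    public static void main(String[] args) {
        SortedMap<String, int[]> postings = new TreeMap<>();
        postings.put("http://www.cnn.com", new int[] {0, 2});
        postings.put("http://lucene.apache.org", new int[] {1});

        String[] urlOfDoc = build(postings, 4);
        System.out.println(Arrays.toString(urlOfDoc));
    }
}
```

The array is built once up front; during scoring, the url of a hit is then a plain array access by document id.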


Shai Erera wrote:
> 
> I think that the simplest solution will be to index the URL field twice,
> once as TOKENIZED and once as UN_TOKENIZED. Then you can look up the
> un_tokenized term.
> If you have a document in hand and only want to fetch its URL, then add
> the URL twice, once as Store.NO, Index.TOKENIZED and once as Store.YES /
> COMPRESS and Index.NO.
> 
> Perhaps I don't understand the entire scenario. When do you need to fetch
> the contentLength and URL? To what purpose?
> 
> On Sun, Jul 6, 2008 at 4:26 AM, blazingwolf7 <bl...@gmail.com>
> wrote:
> 
>>
>> No, I didn't store the contentLength; I only added it to the index.
>> I am still scratching my head, as I can't think of another way to
>> retrieve it without continuously using the reader.
>>
>> As for the url, I use doc.add(new Field("url", Store.NO, Index.TOKENIZED)).
>> I would like to keep it this way, with the url tokenized. I am looking
>> for a way to untokenize it. I retrieved it using a method that retrieves
>> the entire field and then extracts the information in it, but the problem
>> is that the url is broken down. I am seeking a way to reconstruct it in
>> its original format. Can it be done?
>>
>>
>> Shai Erera wrote:
>> >
>> > Hi
>> >
>> > Regarding the contentLength, when you add it to the document, do you
>> use
>> > *store* it as well (i.e., passing Store.YES or Store.COMPRESS)?
>> >
>> > Regarding the URL, how do you add it to the document? For example, if
>> you
>> > do
>> > doc.add(new Field("url", "http://www.cnn.com", Store.NO,
>> > Index.UN_TOKENIZED), it would create a token like "url:
>> http://www.cnn.com"
>> > without breaking it to its parts. Is that what you're looking for?
>> >
>> > Shai
>> >
>> > On Fri, Jul 4, 2008 at 11:19 AM, blazingwolf7 <bl...@gmail.com>
>> > wrote:
>> >
>> >>
>> >> Hi,
>> >>
>> >> I am currently working on retrieving url and contentLength of each
>> >> document
>> >> found during the search. I want to retrieve it during the calculation
>> of
>> >> score so that I can influence the score in some other way.
>> >>
>> >> I used the methods from TermDocs and TermEnum to get the information.
>> >> However, the url I retrieve as is know by most, is tokenized. It is
>> >> broken
>> >> down into several parts and I will have to rejoin them. Can anyone
>> help
>> >> me
>> >> with this? I am stuck here wondering how to get back the whole url
>> >> without
>> >> using a Reader.
>> >>
>> >> Also, I try to retrieve the contentLength, but the results return are
>> >> null.
>> >> Why is that? I opened the index using Luke and the contentLength is
>> there
>> >> but when I try to get it using this way, the results is null.
>> >>
>> >> Can anyone help me with both of these problems? Any help will be
>> >> appreciated. Thanks
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/Untokenized-URL-tp18275048p18275048.html
>> >> Sent from the Lucene - Java Developer mailing list archive at
>> Nabble.com.
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>
>> >>
>> >
>> >
>> > --
>> > Regards,
>> >
>> > Shai Erera
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Untokenized-URL-tp18275048p18298055.html
>> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
> 
> 
> -- 
> Regards,
> 
> Shai Erera
> 
> 

-- 
View this message in context: http://www.nabble.com/Untokenized-URL-tp18275048p18310348.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Untokenized URL

Posted by Shai Erera <se...@gmail.com>.

I think that the simplest solution will be to index the URL field twice,
once as TOKENIZED and once as UN_TOKENIZED. Then you can look up the
un_tokenized term.
If you have a document in hand and only want to fetch its URL, then add the
URL twice, once as Store.NO, Index.TOKENIZED and once as Store.YES /
COMPRESS and Index.NO.
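
The idea of indexing the URL twice can be sketched in plain Java as follows.
This is a toy model rather than Lucene code: the second field name
"url_exact", the splitting rule, and all class and method names are
assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of indexing one value under two fields; not Lucene's API.
class DualFieldSketch {
    // field name -> term -> ids of the documents containing that term
    private final Map<String, TreeMap<String, Set<Integer>>> index = new HashMap<>();

    private void addTerm(String field, String term, int docId) {
        TreeMap<String, Set<Integer>> terms = index.get(field);
        if (terms == null) {
            terms = new TreeMap<>();
            index.put(field, terms);
        }
        Set<Integer> docs = terms.get(term);
        if (docs == null) {
            docs = new TreeSet<>();
            terms.put(term, docs);
        }
        docs.add(docId);
    }

    // "url" receives the fragments (for searching); the hypothetical
    // "url_exact" field receives the whole string as a single term.
    void addDocument(int docId, String url) {
        for (String part : url.split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) {
                addTerm("url", part, docId);
            }
        }
        addTerm("url_exact", url, docId);
    }

    // Terms of one field, in dictionary order.
    Set<String> terms(String field) {
        TreeMap<String, Set<Integer>> terms = index.get(field);
        return terms == null ? new TreeSet<String>() : terms.keySet();
    }

    // Walks the untokenized field to find the whole URL of one document.
    String exactUrl(int docId) {
        TreeMap<String, Set<Integer>> terms = index.get("url_exact");
        if (terms == null) {
            return null;
        }
        for (Map.Entry<String, Set<Integer>> entry : terms.entrySet()) {
            if (entry.getValue().contains(docId)) {
                return entry.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        DualFieldSketch idx = new DualFieldSketch();
        idx.addDocument(0, "http://www.cnn.com");
        idx.addDocument(1, "http://www.example.org/news");
        System.out.println(idx.terms("url"));
        System.out.println(idx.exactUrl(1));
    }
}
```

The tokenized field stays searchable by fragment, while the second field
keeps each URL intact. The cost in this model is that every URL is kept
twice, once as fragments and once whole.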

Perhaps I don't understand the entire scenario. When do you need to fetch
the contentLength and URL? To what purpose?

On Sun, Jul 6, 2008 at 4:26 AM, blazingwolf7 <bl...@gmail.com> wrote:

>
> No, I didn't store the contentLength; I only added it to the index. I am
> still scratching my head, as I can't think of another way to retrieve it
> without continuously using the reader.
>
> As for the url, I use doc.add(new Field("url", url, Store.NO,
> Index.TOKENIZED)). I would like to keep it this way, with the url
> tokenized. I am looking for a way to untokenize it. I retrieved it using a
> method that retrieves the entire field and then extracts the information
> in it, but the problem is that the url is broken down. I am seeking a way
> to reconstruct it in its original format. Can it be done?


-- 
Regards,

Shai Erera

Re: Untokenized URL

Posted by blazingwolf7 <bl...@gmail.com>.

No, I didn't store the contentLength; I only added it to the index. I am
still scratching my head, as I can't think of another way to retrieve it
without continuously using the reader.

As for the url, I use doc.add(new Field("url", url, Store.NO,
Index.TOKENIZED)). I would like to keep it this way, with the url tokenized.
I am looking for a way to untokenize it. I retrieved it using a method that
retrieves the entire field and then extracts the information in it, but the
problem is that the url is broken down. I am seeking a way to reconstruct it
in its original format. Can it be done?


Shai Erera wrote:
> 
> Hi
> 
> Regarding the contentLength, when you add it to the document, do you
> *store* it as well (i.e., passing Store.YES or Store.COMPRESS)?
> 
> Regarding the URL, how do you add it to the document? For example, if you
> do doc.add(new Field("url", "http://www.cnn.com", Store.NO,
> Index.UN_TOKENIZED)), it would create a token like "url:http://www.cnn.com"
> without breaking it into its parts. Is that what you're looking for?
> 
> Shai

Re: Untokenized URL

Posted by Shai Erera <se...@gmail.com>.

Hi

Regarding the contentLength, when you add it to the document, do you
*store* it as well (i.e., passing Store.YES or Store.COMPRESS)?

Regarding the URL, how do you add it to the document? For example, if you do
doc.add(new Field("url", "http://www.cnn.com", Store.NO,
Index.UN_TOKENIZED)), it would create a token like "url:http://www.cnn.com"
without breaking it into its parts. Is that what you're looking for?
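
The question about storing points at a likely explanation for the null
contentLength: a value can be visible as a term in the index (as Luke shows)
and still be absent when a document's fields are fetched, if it was indexed
but not stored. The following is a minimal sketch of that distinction in
plain Java. It is a toy model, not Lucene's API, and every name and value in
it is hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of "indexed" versus "stored" field data; not Lucene's API.
class StoredVersusIndexedSketch {
    // Searchable terms, keyed as "field:value" (what an index browser lists).
    private final Set<String> indexedTerms = new HashSet<>();
    // Values kept verbatim per document, keyed as "docId/field".
    private final Map<String, String> storedValues = new HashMap<>();

    void addField(int docId, String field, String value, boolean store, boolean index) {
        if (index) {
            indexedTerms.add(field + ":" + value);
        }
        if (store) {
            storedValues.put(docId + "/" + field, value);
        }
    }

    // True when the value exists as a term of the field.
    boolean hasTerm(String field, String value) {
        return indexedTerms.contains(field + ":" + value);
    }

    // Returns null when the field was never stored for this document.
    String storedValue(int docId, String field) {
        return storedValues.get(docId + "/" + field);
    }

    public static void main(String[] args) {
        StoredVersusIndexedSketch idx = new StoredVersusIndexedSketch();
        idx.addField(0, "contentLength", "5120", false, true); // indexed only
        idx.addField(1, "contentLength", "2048", true, true);  // indexed and stored
        System.out.println(idx.hasTerm("contentLength", "5120"));
        System.out.println(idx.storedValue(0, "contentLength"));
        System.out.println(idx.storedValue(1, "contentLength"));
    }
}
```

In this model document 0 has a visible contentLength term but no stored
value, which matches the symptom described in the original question.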

Shai

On Fri, Jul 4, 2008 at 11:19 AM, blazingwolf7 <bl...@gmail.com>
wrote:

>
> Hi,
>
> I am currently working on retrieving the url and contentLength of each
> document found during the search. I want to retrieve them during the
> calculation of the score so that I can influence the score in some other
> way.
>
> I used the methods from TermDocs and TermEnum to get the information.
> However, the url I retrieve is, as most people know, tokenized. It is
> broken down into several parts and I would have to rejoin them. Can anyone
> help me with this? I am stuck here wondering how to get back the whole url
> without using a Reader.
>
> Also, I tried to retrieve the contentLength, but the results returned are
> null. Why is that? I opened the index using Luke and the contentLength is
> there, but when I try to get it this way, the result is null.
>
> Can anyone help me with both of these problems? Any help will be
> appreciated. Thanks


-- 
Regards,

Shai Erera