Posted to java-user@lucene.apache.org by Fanny Yeung <to...@hotmail.com> on 2002/05/13 14:48:50 UTC

Search on XML files

Hi,

Does anyone know how to build a query for a multiple-field search on XML 
files in the sample provided by isogen? Is that supported?

I would like to get all the results that contain the value 'Australia' 
in the tag 'country' AND the date '20020415' in the tag 'date'. I always get 
a hit count of 0. Is there a problem with my query string?

+(Australia AND tagname:country) AND +(20020415 AND tagname:date)

1. What should the query string be if I want to get records that 
contain (Australia and 20020415) or (HongKong and 20020315)?
2. What should the query string be if I want to get records that 
contain (Australia and 20020415) and (not (HongKong and 20020315))?

Since I am a newbie to Lucene, I am wondering whether I can use a filter to 
restrict the search results. In my case, I need to retrieve all the news 
between two dates (for example, 20020102 to 20020330). In addition, the 
results should contain only news that has been subscribed to. Should I 
use a filter to filter out the unsubscribed news, or should I build a query 
string that includes only the subscribed news? Which approach is better in 
terms of performance?

Thanks in advance.


Fanny



--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: Search on XML files

Posted by Brandon Jockman <br...@isogen.com>.
one minor correction... (!)

> > 2. What should the query string be if I want to get records that
> > contain (Australia and 20020415) and (not (HongKong and 20020315))?
>
> ((Australia +tagname:country) AND (+tagname:date +20020415))  AND
> !(( tagname:country HongKong) AND (tagname:date 20020315))

-B



Re: Search on XML files

Posted by Brandon Jockman <br...@isogen.com>.
Fanny,

The current implementation allows for searching on:

a.. the entire PCDATA content of an XML document.
b.. the PCDATA content within specific elements.
c.. processing instructions by name and content.
d.. attributes of elements by both name and value.
e.. elements/PIs with specific parent element types.
f.. elements/PIs at specific child locations within a parent element.
g.. elements/PIs with specific ancestor element types.
h.. elements/PIs with specifically ordered ancestor element type.

The original need we had for XML contextual searching was to find a specific
document that contained a particular element with particular content, and in
relationships to other element types.

Currently, searching for a document based on content of two separate
elements with a logical AND relationship is not provided. However, the OR
relationship should work just fine.

There is a stored field that contains all the text content of the document,
but that probably isn't enough for what you need.

Each Lucene document created from the same XML document has a 'docid' field.


You have two real options:

1. Write a query parser that extends the Lucene one, detects the
relationship, and performs more than one search, grouping the results by
document id.

Searching for X and Y would become:
1. Search for X -> Hits_X
2. Search for Y -> Hits_Y
3. Merge Hits_X and Hits_Y based on docid.
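
The three steps above can be sketched in plain Java. This is only an illustration of the merge step, not code from the isogen sample: the class name DocIdMerge and the method intersect are hypothetical, and each result list is assumed to have already been reduced to the 'docid' values of its hits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: merges two result lists on their 'docid' values. */
final class DocIdMerge {

    /** Returns the docids present in both lists, in the order they appear in hitsX. */
    static List<String> intersect(List<String> hitsX, List<String> hitsY) {
        Set<String> inY = new HashSet<>(hitsY);
        // LinkedHashSet drops repeated docids while keeping the order of hitsX.
        Set<String> merged = new LinkedHashSet<>();
        for (String docId : hitsX) {
            if (inY.contains(docId)) {
                merged.add(docId);
            }
        }
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        // Assumed inputs: docids collected from "Search for X" and "Search for Y".
        List<String> hitsX = Arrays.asList("news-1", "news-2", "news-3");
        List<String> hitsY = Arrays.asList("news-3", "news-1", "news-7");
        System.out.println(intersect(hitsX, hitsY)); // prints [news-1, news-3]
    }
}
```

Taking the intersection gives AND semantics. A union of the two lists would give OR, and removing the second list from the first would give AND NOT.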

-=-

2. Write a query parser that extends the Lucene one, detects that you
are searching for a document based on several elements, as opposed to a
single one, and converts the search from:

X AND Y

to:

(X AND docid:docidentifier) OR (Y AND docid:docidentifier)

..and then merge results based on docid.
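
A rough sketch of that final merge for option 2, in plain Java. It assumes each hit from the OR query can be reduced to its 'docid' plus the element type it matched (the 'tagname' field used in the queries in this thread). The names GroupByDocId, ElementHit and docsMatchingAll are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: groups element-level hits by docid. */
final class GroupByDocId {

    /** One element-level hit: the XML document it came from and the element type it matched. */
    static final class ElementHit {
        final String docId;
        final String tagName;

        ElementHit(String docId, String tagName) {
            this.docId = docId;
            this.tagName = tagName;
        }
    }

    /** Returns the docids that have at least one hit for every required element type. */
    static Set<String> docsMatchingAll(List<ElementHit> hits, Set<String> requiredTags) {
        Map<String, Set<String>> tagsByDoc = new LinkedHashMap<>();
        for (ElementHit hit : hits) {
            tagsByDoc.computeIfAbsent(hit.docId, k -> new HashSet<>()).add(hit.tagName);
        }
        Set<String> result = new LinkedHashSet<>();
        for (Map.Entry<String, Set<String>> entry : tagsByDoc.entrySet()) {
            if (entry.getValue().containsAll(requiredTags)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed hits from the OR query: news-1 matched both elements, news-2 only one.
        List<ElementHit> hits = Arrays.asList(
                new ElementHit("news-1", "country"),
                new ElementHit("news-2", "country"),
                new ElementHit("news-1", "date"));
        Set<String> required = new HashSet<>(Arrays.asList("country", "date"));
        System.out.println(docsMatchingAll(hits, required)); // prints [news-1]
    }
}
```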


You may also be able to leverage the search 'Filtering' mechanism, but I'm
not experienced with that...

<<<From FAQ>>>
16. What is filtering and how is it performed?
Filtering means imposing additional restrictions on the hit list to eliminate
hits that would otherwise be included in the search results. There are two
ways to filter hits:

  a.. Search Query - in this approach, you provide your custom filter object
when you call the search() method. This filter will be called exactly
once to evaluate every document that resulted in a non-zero score.
  b.. Selective Collection - in this approach, you perform the regular search
and, when you get back the hit list, collect only those hits that match your
filtering criteria. Your filter is called only for the hits returned by the
search method, which may be only a subset of the non-zero matches (useful
when evaluating your search filter is expensive).
<<< ... >>>
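
The "Selective Collection" approach could look roughly like the sketch below. It is plain Java and does not use the Lucene filter API. The names SelectiveCollection and collect are hypothetical, and the hit list is assumed to be a list of document ids that is already sorted by score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical helper: filters a hit list after the search has run. */
final class SelectiveCollection {

    /**
     * Walks the hit list in order, keeps the hits accepted by the filter,
     * and stops once 'wanted' results are collected, so the (possibly
     * expensive) filter is not evaluated for the remaining hits.
     */
    static <T> List<T> collect(List<T> hits, Predicate<T> filter, int wanted) {
        List<T> kept = new ArrayList<>();
        for (T hit : hits) {
            if (kept.size() >= wanted) {
                break;
            }
            if (filter.test(hit)) {
                kept.add(hit);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed data: ids returned by a search, and the ids the user subscribed to.
        List<String> hits = Arrays.asList("news-1", "news-2", "news-3", "news-4");
        Set<String> subscribed = new HashSet<>(Arrays.asList("news-2", "news-4"));
        System.out.println(collect(hits, subscribed::contains, 1)); // prints [news-2]
    }
}
```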

> 1. What should the query string be if I want to get records that
> contain (Australia and 20020415) or (HongKong and 20020315)?

((Australia +tagname:country) AND (+tagname:date +20020415)) OR ((HongKong
+tagname:country) AND (tagname:date +20020315))

> 2. What should the query string be if I want to get records that
> contain (Australia and 20020415) and (not (HongKong and 20020315))?

((Australia +tagname:country) AND (+tagname:date +20020415))  AND
(( tagname:country HongKong) AND (tagname:date 20020315))

Either of these queries will require the additional functionality outlined
in options 1 or 2 above.


Regards,

-Brandon

Brandon Jockman
ISOGEN International, LLC.
brandonj@isogen.com





Re: Search on XML files

Posted by Karl Øie <ka...@gan.no>.
> +(Australia AND tagname:country) AND +(20020415 AND tagname:date)

AND and + do the same thing, so:

	+(Australia AND tagname:country) +(20020415 AND tagname:date)

should return the set of documents in which everything is matched, basically the same as:

	Australia AND tagname:country AND 20020415 AND tagname:date


> 1. What should the query string be if I want to get records that
> contain (Australia and 20020415) or (HongKong and 20020315)?

	(Australia AND 20020415) (HongKong AND 20020315)


> 2. What should the query string be if I want to get records that
> contain (Australia and 20020415) and (not (HongKong and 20020315))?

	+(Australia AND 20020415) -(HongKong AND 20020315)


> Since I am a newbie to Lucene, I am wondering whether I can use a filter to
> restrict the search results. In my case, I need to retrieve all the news
> between two dates (for example, 20020102 to 20020330). In addition, the
> results should contain only news that has been subscribed to. Should
> I use a filter to filter out the unsubscribed news, or should I build a
> query string that includes only the subscribed news? Which approach is
> better in terms of performance?

for the date you should use the Field.Date field type, but I don't think there 
is a shortcut to also filter by subscription without doing something weird 
like adding, say, 1000 years to all dates that have been subscribed and then 
searching for dates between 30020102 -> 30020330....

if you don't think this is a good idea (chances are that your system is not 
going to be used for more than a millennium) you can just add a SUBSCRIBED 
field to the index and search for things that match date AND subscribe=???.
lucene is quite fast even for large indexes, so it should perform well 
either way....
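
The "date range AND subscribed" condition can be sketched in plain Java as below. This is not Lucene API. The names NewsFilter, inRange and matches are hypothetical, and dates are assumed to be fixed-width yyyymmdd strings as in the question, so comparing them as strings is the same as comparing them chronologically.

```java
/** Hypothetical helper: expresses "date in range AND subscribed" as a predicate. */
final class NewsFilter {

    /** Inclusive range check on fixed-width yyyymmdd strings. */
    static boolean inRange(String yyyymmdd, String from, String to) {
        return yyyymmdd.compareTo(from) >= 0 && yyyymmdd.compareTo(to) <= 0;
    }

    /** A news item qualifies only if it is subscribed and its date is in range. */
    static boolean matches(String date, boolean subscribed, String from, String to) {
        return subscribed && inRange(date, from, to);
    }

    public static void main(String[] args) {
        String from = "20020102";
        String to = "20020330";
        System.out.println(matches("20020315", true, from, to));  // prints true
        System.out.println(matches("20020415", true, from, to));  // prints false
        System.out.println(matches("20020315", false, from, to)); // prints false
    }
}
```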


happy hacking!


mvh karl øie

