Posted to java-user@lucene.apache.org by Christian Bongiorno <ch...@bongiorno.org> on 2009/05/04 19:16:10 UTC

multi-field index and search (Not MultiFieldQuery). Help setting up index and search

I am trying to build a search (I have been experimenting with Lucene),
and someone suggested contacting your team.

Background:
Currently the service I am working on applies taxes/duties to products for
international shipping by looking up something called an HTS code (a
universally recognized taxation code for duty/tariff). We already have
almost a million items classified by HTS code. As many as 50k items fall
into the same HTS code.

For purposes of HTS classification, the description is only important if no
other field exists. Taxation is based on things like material (leather,
cloth, etc.) and product (shoes/bags/toys). Color is of fair relevancy as
well (to a customs official, black boots or brown make no difference; it
wasn’t made here so it must be taxed).

The idea is to turn our entire existing knowledge base into an index, then
when we get a new item that needs classification, we “search” for the
“Document(hts)” that best matches by using the new item attributes for the
item to be classified as the search query.

The document structure, as I see it, should be:

Document(HTS) -> {{ASIN1: {Key,value},{Key,value},…}, {ASIN2:
{Key,value},{Key,value},…} …}

There are 1788 documents. Up to 50k ASINs and their attributes may fall into
a single document.

Some fields are straightforward and very good indicators of a match,
such as:

Material -> “leather”
Gender -> “women”

Others are fuzzier

Description -> “Stylish full calf leather boots. Sleek Italian leather,
designer”

So for a query of:
“Material” -> “Leather”
“Gender” -> “womAn”
“Description” -> “Short leather shoes, Made in Denmark”

I would expect a very high match here since the first 2 fields, which don’t
vary much, are good indicators for HTS.

I have searched through the archives and I don't see anything like what I am
looking for.

Basically, every item will have attributes which I am treating as
"Field(item.key, item.value)". I think that's the right approach, but a
multi-field query searches for your terms across all fields. That isn't
what I need. I know my fields and values very clearly, and that should give
me enormous leverage when querying, if I could build a query to do that.


Christian

-- 
Christian Bongiorno

Re: multi-field index and search (Not MultiFieldQuery). Help setting up index and search

Posted by Paul Elschot <pa...@xs4all.nl>.
Christian,

I suppose each ASIN represents a product by key,value pairs and
an HTS code?

In that case you may want to denormalize and index each ASIN as
a Lucene document. Then search for the most similar products in your queries
by key/value pairs, using your key as a Lucene field.
Such keys would likely not need a stored norm in the Lucene index.
The result of the query would be a series of HTS codes
(non unique), weighted by the score value. To get a score
for each HTS code, you might need your own HitCollector
and a field cache for the HTS codes.
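
To make the per-HTS scoring concrete, here is a minimal sketch in plain Java (no Lucene dependency) of the aggregation step only. It sums the scores of the individual item hits per HTS code and ranks the codes. The names `HtsScoreAggregator`, `Hit` and `rank`, and the codes `HTS-A` and `HTS-B`, are hypothetical and chosen for illustration. In a real setup the (HTS code, score) pairs would come from your own collector, and summing is only one possible way to combine the scores.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HtsScoreAggregator {

    /** One search hit: the HTS code of the matched item and the hit's score. */
    static final class Hit {
        final String htsCode;
        final float score;

        Hit(String htsCode, float score) {
            this.htsCode = htsCode;
            this.score = score;
        }
    }

    /** Sums the hit scores per HTS code and returns the codes, highest total first. */
    static List<Map.Entry<String, Double>> rank(List<Hit> hits) {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (Hit hit : hits) {
            totals.merge(hit.htsCode, (double) hit.score, Double::sum);
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(totals.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return ranked;
    }

    public static void main(String[] args) {
        // Hypothetical hits: two items classified under HTS-A, one under HTS-B.
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("HTS-A", 2.0f));
        hits.add(new Hit("HTS-B", 1.5f));
        hits.add(new Hit("HTS-A", 1.0f));
        for (Map.Entry<String, Double> entry : rank(hits)) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

Summing favours HTS codes with many similar items; taking the maximum or the average per code would be an equally simple alternative if that bias is unwanted.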

You'll probably need to use a custom (who are the users again?)
lucene similarity function to lower the weight for the description,
and to increase the influence of the coordination factor so more
matches in different keys have a bigger influence on the result.
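
The effect of these two adjustments can be illustrated with a toy model in plain Java. This is not Lucene's Similarity API or its scoring formula; it is only a sketch of the idea, and the class `WeightedFieldScore`, its parameters and the example weights are assumptions made for illustration. Each query field that matches the item contributes its weight, and the sum is multiplied by (matched fields / queried fields) raised to `coordPower`, so a larger power rewards items that match on more keys. For brevity the sketch compares whole values, which suits keyword fields; a free-text description would need token-level matching.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class WeightedFieldScore {

    /**
     * Toy score: the sum of the weights of the query fields whose value equals
     * the item's value (ignoring case), multiplied by
     * (matched fields / queried fields) raised to coordPower.
     * Fields without an explicit weight count as 1.0.
     */
    static double score(Map<String, String> query, Map<String, String> item,
                        Map<String, Double> weights, double coordPower) {
        if (query.isEmpty()) {
            return 0.0;
        }
        double weightSum = 0.0;
        int matched = 0;
        for (Map.Entry<String, String> clause : query.entrySet()) {
            String itemValue = item.get(clause.getKey());
            if (itemValue != null && itemValue.equalsIgnoreCase(clause.getValue())) {
                weightSum += weights.getOrDefault(clause.getKey(), 1.0);
                matched++;
            }
        }
        double coord = Math.pow((double) matched / query.size(), coordPower);
        return weightSum * coord;
    }

    public static void main(String[] args) {
        Map<String, String> query = new LinkedHashMap<>();
        query.put("material", "leather");
        query.put("gender", "women");

        Map<String, String> item = new LinkedHashMap<>();
        item.put("material", "Leather");
        item.put("gender", "men");

        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("material", 2.0); // hypothetical weight: material counts double

        System.out.println(score(query, item, weights, 2.0));
    }
}
```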

And have a look at Solr before starting to code this. The facets
there might be of help during interactive retrieval. Your application
is not really a web shop, but there are (at least) some overlaps.

Regards,
Paul Elschot




Re: multi-field index and search (Not MultiFieldQuery). Help setting up index and search

Posted by Erick Erickson <er...@gmail.com>.
Yes, SHOULD is what you want I think here.

Best
Erick


On Mon, May 4, 2009 at 6:41 PM, Christian Bongiorno <christian@bongiorno.org> wrote:

> You mean to use
> BooleanQuery bq = new BooleanQuery();
> bq.add(new TermQuery(new Term("key", "value")), BooleanClause.Occur.MUST);
> // above is Erick's suggestion.
>
> If so, doesn't that mean that if they don't all match I won't get a result?
> Wouldn't it be better to use Occur.SHOULD? The documentation doesn't give
> extra insight on that.
>
> As for fields where I expect looser matches, such as description, I should
> boost the other fields.
>
> Thanks again
>

Re: multi-field index and search (Not MultiFieldQuery). Help setting up index and search

Posted by Christian Bongiorno <ch...@bongiorno.org>.
You mean to use
BooleanQuery bq = new BooleanQuery();
bq.add(new TermQuery(new Term("key", "value")), BooleanClause.Occur.MUST);
// above is Erick's suggestion.

If so, doesn't that mean that if they don't all match I won't get a result?
Wouldn't it be better to use Occur.SHOULD? The documentation doesn't give
extra insight on that.

As for fields where I expect looser matches, such as description, I should
boost the other fields.
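
The difference between the two clause types can be sketched without Lucene. The following plain-Java toy model mirrors the matching rules of a boolean query: every MUST clause has to match, and when there are no MUST clauses at least one SHOULD clause has to match. The class `OccurSemantics`, its `Clause` type and the `matchCount` method are hypothetical names for illustration only, and the return value is a simple match count rather than a real relevance score.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OccurSemantics {

    enum Occur { MUST, SHOULD }

    /** A single field/value condition together with how strictly it applies. */
    static final class Clause {
        final String field;
        final String value;
        final Occur occur;

        Clause(String field, String value, Occur occur) {
            this.field = field;
            this.value = value;
            this.occur = occur;
        }
    }

    /**
     * Returns the number of clauses the item matches, or -1 if the item is not
     * a hit: a failed MUST clause rejects the item, and a query without MUST
     * clauses needs at least one matching SHOULD clause.
     */
    static int matchCount(List<Clause> clauses, Map<String, String> item) {
        int matched = 0;
        boolean hasMust = false;
        for (Clause clause : clauses) {
            boolean hit = clause.value.equals(item.get(clause.field));
            if (clause.occur == Occur.MUST) {
                hasMust = true;
                if (!hit) {
                    return -1;
                }
            }
            if (hit) {
                matched++;
            }
        }
        if (!hasMust && matched == 0) {
            return -1;
        }
        return matched;
    }

    public static void main(String[] args) {
        Map<String, String> item = new HashMap<>();
        item.put("material", "leather");
        item.put("gender", "men");

        List<Clause> strict = Arrays.asList(
                new Clause("material", "leather", Occur.MUST),
                new Clause("gender", "women", Occur.MUST));
        List<Clause> loose = Arrays.asList(
                new Clause("material", "leather", Occur.SHOULD),
                new Clause("gender", "women", Occur.SHOULD));

        System.out.println("MUST:   " + matchCount(strict, item));
        System.out.println("SHOULD: " + matchCount(loose, item));
    }
}
```

With SHOULD clauses a partially matching item is still returned, just with fewer matched clauses, which is the behaviour wanted when the best available HTS candidate should always be proposed.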

Thanks again

On Mon, May 4, 2009 at 1:32 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

> In the case of such queries with keywords (not analyzed tokens), I would
> directly create the appropriate TermQuery objects and combine them with a
> BooleanQuery. QueryParser is normally not meant for program-internal queries,
> but rather for queries the user has entered. For your use case, it seems
> better to just create the correct Query objects using standard instantiation
> from the Java code.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>


-- 
Christian Bongiorno

RE: multi-field index and search (Not MultiFieldQuery). Help setting up index and search

Posted by Uwe Schindler <uw...@thetaphi.de>.
In the case of such queries with keywords (not analyzed tokens), I would
directly create the appropriate TermQuery objects and combine them with a
BooleanQuery. QueryParser is normally not meant for program-internal queries,
but rather for queries the user has entered. For your use case, it seems
better to just create the correct Query objects using standard instantiation
from the Java code.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: multi-field index and search (Not MultiFieldQuery). Help setting up index and search

Posted by Erick Erickson <er...@gmail.com>.
MultiFieldQuery (MultiFieldQueryParser in the Lucene API) essentially (if I
have this right) forms a "cross product". I.e. it is NOT required to specify
specific values for discrete fields. MFQ helps form queries expressing
something like "does any term appear in any field in a hit" or "does every
term appear in some field of a hit, regardless of which field and not
necessarily the same field" (depending upon whether the default operator is
OR or AND). You can get something of the same effect by creating a special
field that is the concatenation of all the other fields and searching that
concatenated field with and/or (except that MFQ does interesting things with
boosting).

But if you know exactly what terms you require in which field, the
standard query parser is fine. For example, +material:leather +gender:female
will look for "leather" ONLY in material and "female" ONLY in gender.
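
As a small sketch of how such a query string could be assembled from known key/value pairs, the plain-Java helper below prefixes every field with `+` and wraps values containing whitespace in double quotes so that they are treated as a phrase. The class `FieldQueryString` and its method `required` are hypothetical names. The sketch assumes field names and values contain no query-syntax special characters, since it does no escaping, and it does not lowercase or otherwise analyze the values.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class FieldQueryString {

    /** Builds a query string such as "+material:leather +gender:female". */
    static String required(Map<String, String> attributes) {
        StringJoiner query = new StringJoiner(" ");
        for (Map.Entry<String, String> attribute : attributes.entrySet()) {
            String value = attribute.getValue();
            // Values with whitespace are quoted so they stay in one field as a phrase.
            if (value.chars().anyMatch(Character::isWhitespace)) {
                value = "\"" + value + "\"";
            }
            query.add("+" + attribute.getKey() + ":" + value);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("material", "leather");
        attributes.put("gender", "female");
        System.out.println(required(attributes));
    }
}
```

Dropping the `+` prefix would turn each clause into an optional one, which matters when partial matches should still be returned.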

HTH
Erick.

P.S. A tip for you: If anything I say contradicts something Paul says,
listen to Paul <G>...


On Mon, May 4, 2009 at 3:33 PM, Christian Bongiorno <christian@bongiorno.org> wrote:

> Yeah, you definitely got the idea. You're the second person to recommend
> putting each item in its own document and just storing the HTS code (which
> is easy for me). The HTS code actually comes with no extra info. I mean,
> there is info, but we don't store any of it.
>
> I will try as you and Paul have recommended. Once done, then I would need a
> MultiFieldQuery? Forgive me but the queries confuse me.
>
> Rebuilding my index will take some time, but I appreciate everyone's help.
>
> Christian
>
> On Mon, May 4, 2009 at 11:40 AM, Erick Erickson <erickerickson@gmail.com> wrote:
>
> > Hmmmm, tricky. Let's see if I understand your problem.
> >
> > Basically, you have a bunch of HTS codes that have had
> > some number of items arbitrarily assigned to them, and
> > you want to see if you can make Lucene behave as a kind
> > of expert system to help you classify the next item.
> >
> > I *think* you'd get better results by indexing each item
> > along with its HTS code as a separate document. Because
> > what you really want to ask is "given the attributes of my
> > new item, what other items are most similar to it?" and then
> > present the HTS codes from these items to the classifier
> > (perhaps a person?).
> >
> > I'm going to assume further that the HTS code has
> > some data associated with it that describes the
> > class, and that this needs to be available to
> > the user to see if your suggestions are appropriate.
> > You could either index the HTS codes in another index
> > OR index them in the same index but simply store
> > the data (don't index it) and the HTS documents won't
> > interfere with your searches on "similar items".
> >
> > Mostly, this is just trying to see if I understand what
> > you're trying to accomplish. This may be gibberish, but
> > it's a start <G>.
> >
> > Best
> > Erick
> >
> >
> > On Mon, May 4, 2009 at 1:16 PM, Christian Bongiorno <
> > christian@bongiorno.org
> > > wrote:
> >
> > > I am trying to build a search (have been experimenting with using
> Lucene)
> > > and someone suggested contacting your team
> > >
> > > Background:
> > > Currently the service I am working on applies taxing/duties to products
> > for
> > > international shipping by looking up something called an HTS code (a
> > > universally recognized taxation code for duty/tariff). We already have
> > > almost a million items classified by HTS code. As many as 50k items
> fall
> > > into the same HTS code.
> > >
> > > For purposes of HTS classification
> > > Description is only important if no other field exists. But taxation is
> > > based on things like material (leather, cloth, etc) and product
> > > (shoes/bags/toys). Color is of fair relevancy as well (to a customs
> > > official
> > > black boots or brown make no difference; it wasn’t made here so it must
> > be
> > > taxed)
> > >
> > > The idea is to turn our entire existing knowledge base into an index,
> > then
> > > when we get a new item that needs classification, we “search” for the
> > > “Document(hts)” that best matches by using the new item attributes for
> > the
> > > item to be classified as the search query.
> > >
> > > The document structure, as I see it, should be:
> > >
> > > Document(HTS) -> {{ASIN1: {Key,value},{Key,value},…}, {ASIN2:
> > > {Key,value},{Key,value},…} …}
> > >
> > > There are 1788 documents. Up to 50k ASINs and their attributes may fall
> > > into
> > > a single document.
> > >
> > > On some fields, they are straightforward and very good indicators of
> > match.
> > > Such as
> > >
> > > Material -> “leather”
> > > Gender -> “women”
> > >
> > > Others are fuzzier
> > >
> > > Description -> “Stylish full calf leather boots. Sleek Italian leather,
> > > designer”
> > >
> > > So for a query of:
> > > “Material” -> ”Leather”
> > > “Gender” -> ”womAn”
> > > “Description” -> ”Short leather shoes, Made in Denmark”
> > >
> > > I would expect a very high match here since the first 2 fields, which
> > don’t
> > > vary much, are good indicators for HTS.
> > >
> > > I have searched through the archives and I don't see anything like what
> I
> > > am
> > > looking for.
> > >
> > > Basically, every item will have attributes which I am treating as
> > > "Field(item.key, item.value)". I think that's the right approach but
> > > multi-field query queries your terms across all fields in the search.
> > That
> > > isn't what I need. I very clearly know my fields and values and that
> > should
> > > give me enormous leverage when querying if I could build a query to do
> > that
> > >
> > >
> > > Christian
> > >
> > > --
> > > Christian Bongiorno
> > >
> >
>
>
>
> --
> Christian Bongiorno
>

Re: multi-field index and search (Not MultiFieldQuery). Help setting up index and search

Posted by Christian Bongiorno <ch...@bongiorno.org>.
Yeah, you definitely got the idea. You're the second person to recommend
putting each item in its own document and just storing the HTS code (which is
easy for me). The HTS code actually comes with no extra info. I mean, there
is info, but we don't store any of it.

I will try as you and Paul have recommended. Once done, then I would need a
MultiFieldQuery? Forgive me but the queries confuse me.

Rebuilding my index will take some time, but I appreciate everyone's help

Christian

On Mon, May 4, 2009 at 11:40 AM, Erick Erickson <er...@gmail.com> wrote:

> Hmmmm, tricky. Let's see if I understand your problem.
>
> Basically, you have a bunch of HTS codes that have had
> some number of items arbitrarily assigned to them, and
> you want to see if you can make Lucene behave as a kind
> of expert system to help you classify the next item.
>
> I *think* you'd get better results by indexing each item
> along with its HTS code as a separate document. Because
> what you really want to ask is "given the attributes of my
> new item, which other items are 'most similar' to it?" and then
> present the HTS codes from these items to the classifier
> (perhaps a person?).
>
> I'm going to assume further that the HTS code has
> some data associated with it that describes the
> class, and that this needs to be available to
> the user to see if your suggestions are appropriate.
> You could either index the HTS codes in another index
> OR put them in the same index but simply store
> the data (don't index it), so the HTS documents won't
> interfere with your searches on "similar items".
>
> Mostly, this is just trying to see if I understand what
> you're trying to accomplish. This may be gibberish, but
> it's a start <G>.
>
> Best
> Erick



-- 
Christian Bongiorno

Re: multi-field index and search (Not MultiFieldQuery). Help setting up index and search

Posted by Erick Erickson <er...@gmail.com>.
Hmmmm, tricky. Let's see if I understand your problem.

Basically, you have a bunch of HTS codes that have had
some number of items arbitrarily assigned to them, and
you want to see if you can make Lucene behave as a kind
of expert system to help you classify the next item.

I *think* you'd get better results by indexing each item
along with its HTS code as a separate document. Because
what you really want to ask is "given the attributes of my
new item, which other items are 'most similar' to it?" and then
present the HTS codes from these items to the classifier
(perhaps a person?).
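
As a rough illustration of the one-document-per-item idea, here is a
sketch in plain Java that stands in for the index: each known item carries
its attributes plus its HTS code, the new item is scored against every
known item, and the codes of the best matches are returned as suggestions.
The class, the per-field weights, and the "HTS-A"-style codes are all
invented for the example; a real setup would let Lucene do the scoring.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for "one document per item, HTS code stored with it".
final class SimilarItemSketch {

    static final class Item {
        final String hts;
        final Map<String, String> attrs;
        Item(String hts, Map<String, String> attrs) { this.hts = hts; this.attrs = attrs; }
    }

    // Convenience: attrs("material", "leather", "gender", "women").
    static Map<String, String> attrs(String... kv) {
        Map<String, String> m = new HashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) m.put(kv[i], kv[i + 1]);
        return m;
    }

    // Sum of the weights of the fields on which both maps agree (ignoring case).
    static double score(Map<String, String> query, Map<String, String> item, Map<String, Double> weights) {
        double total = 0;
        for (Map.Entry<String, String> e : query.entrySet()) {
            String other = item.get(e.getKey());
            if (other != null && other.equalsIgnoreCase(e.getValue())) {
                total += weights.getOrDefault(e.getKey(), 1.0);
            }
        }
        return total;
    }

    // HTS codes of the k best-scoring items, best first, without repeats.
    static List<String> suggest(Map<String, String> query, List<Item> items,
                                Map<String, Double> weights, int k) {
        List<Item> ranked = new ArrayList<>(items);
        ranked.sort((a, b) -> Double.compare(score(query, b.attrs, weights),
                                             score(query, a.attrs, weights)));
        Set<String> codes = new LinkedHashSet<>();
        for (Item item : ranked.subList(0, Math.min(k, ranked.size()))) codes.add(item.hts);
        return new ArrayList<>(codes);
    }

    public static void main(String[] args) {
        List<Item> items = new ArrayList<>();
        items.add(new Item("HTS-A", attrs("material", "leather", "gender", "women")));
        items.add(new Item("HTS-B", attrs("material", "cloth", "gender", "women")));
        items.add(new Item("HTS-C", attrs("material", "leather", "gender", "men")));
        Map<String, Double> weights = new HashMap<>();
        weights.put("material", 2.0); // made-up weights: material counts double
        weights.put("gender", 1.0);
        // prints [HTS-A, HTS-C]
        System.out.println(suggest(attrs("material", "Leather", "gender", "women"), items, weights, 2));
    }
}
```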

I'm going to assume further that the HTS code has
some data associated with it that describes the
class, and that this needs to be available to
the user to see if your suggestions are appropriate.
You could either index the HTS codes in another index
OR put them in the same index but simply store
the data (don't index it), so the HTS documents won't
interfere with your searches on "similar items".

Mostly, this is just trying to see if I understand what
you're trying to accomplish. This may be gibberish, but
it's a start <G>.

Best
Erick

