Posted to solr-user@lucene.apache.org by Steve Reichgut <sr...@axtaweb.com> on 2010/03/02 02:07:54 UTC

Search Result differences Standard vs DisMax

***Sorry if this was sent twice. I had connection problems here and it 
didn't look like the first time it went out****

I have been testing results for some basic queries using both the
Standard and DisMax query parsers. The results aren't what I expected,
though, and I am wondering if I am misunderstanding how the DisMax query
parser works.

For example, let's say I am doing a basic search for "Apache Solr" 
across a single field = Field 1 using the Standard parser. My results 
are exactly what I expected. Any document that includes either "Apache" 
or "Solr" or "Apache Solr" in Field 1 is listed with priority given to 
those that include both words.

Now, if I do the same search for "Apache Solr" across multiple fields - 
Field 1, Field 2 - using DisMax, I would expect basically the same 
results. The results should include any document that has one or both 
words in Field 1 or Field 2.

When I run that query through DisMax, though, it returns only the documents
that contain BOTH words, which in my sample set is just 1 or 2 documents.
I thought that, by default, DisMax makes both words optional, so I am
confused as to why I am getting such a small subset.

Can anyone shed some light on what I am doing wrong, or on whether I am
misunderstanding how DisMax works?

Thanks,
Steve
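
For anyone comparing the two parsers, it may help to write the DisMax request
out explicitly. The sketch below builds the query string for the search
described above. The field names field1 and field2 and the helper name are
placeholders, not taken from the original setup. The mm (minimum-should-match)
parameter is included on the assumption that it is what decides how many of
the words must match; its effective default depends on the Solr version and
the handler configuration, so it is worth checking there.

```python
from urllib.parse import urlencode


def dismax_params(user_query, fields, min_should_match="1"):
    """Build request parameters for a DisMax search across several fields.

    `fields` and `min_should_match` are illustrative inputs; "1" asks for
    at least one of the query words to match.
    """
    return {
        "defType": "dismax",
        "q": user_query,
        "qf": " ".join(fields),   # fields to search, space separated
        "mm": min_should_match,   # minimum number of words that must match
    }


params = dismax_params("Apache Solr", ["field1", "field2"])
query_string = urlencode(params)
print(query_string)
```

Comparing the results with mm set to 1 against the results with mm left unset
should show whether that parameter accounts for the smaller result set.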

Re: Implementing hierarchical facet

Posted by Andy <an...@yahoo.com>.
This dynamicfield feature is great. Didn't know about it.

Thanks!

--- On Wed, 3/3/10, Geert-Jan Brits <gb...@gmail.com> wrote:

From: Geert-Jan Brits <gb...@gmail.com>
Subject: Re: Implementing hierarchical facet
To: solr-user@lucene.apache.org
Date: Wednesday, March 3, 2010, 5:04 AM

you could always define 1 dynamicfield and encode the hierarchy level in the
fieldname:

<dynamicField name="_loc_hier_*" type="string" stored="false" indexed="true"
omitNorms="true"/>
using:
&facet=on&facet.field={!key=Location}_loc_hier_city&fq=_loc_hier_country:<somecountryid>
...
adding cityarea later for instance would be as simple as:
&facet=on&facet.field={!key=Location}_loc_hier_cityarea&fq=_loc_hier_city:<somecityid>

Cheers,
Geert-Jan



Re: Implementing hierarchical facet

Posted by Geert-Jan Brits <gb...@gmail.com>.
you could always define 1 dynamicfield and encode the hierarchy level in the
fieldname:

<dynamicField name="_loc_hier_*" type="string" stored="false" indexed="true"
omitNorms="true"/>
using:
&facet=on&facet.field={!key=Location}_loc_hier_city&fq=_loc_hier_country:<somecountryid>
...
adding cityarea later for instance would be as simple as:
&facet=on&facet.field={!key=Location}_loc_hier_cityarea&fq=_loc_hier_city:<somecityid>

Cheers,
Geert-Jan
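
A small sketch of how an application might generate these requests for any
level, assuming every level is stored in a field named with the _loc_hier_
prefix from the dynamicField above. The function name and its arguments are
made up for illustration.

```python
FIELD_PREFIX = "_loc_hier_"  # matches the dynamicField pattern "_loc_hier_*"


def hier_facet_params(level, parent_level=None, parent_id=None, key="Location"):
    """Return (name, value) request parameters for faceting on one level.

    `level` and `parent_level` are level names such as "country" or "city";
    `parent_id` is the id the user selected on the parent level.
    """
    params = [
        ("facet", "on"),
        # {!key=...} renames the facet in the response, so the frontend
        # always sees it under the same name.
        ("facet.field", "{!key=%s}%s%s" % (key, FIELD_PREFIX, level)),
    ]
    if parent_level is not None:
        # Restrict the results to the value chosen on the parent level.
        params.append(("fq", "%s%s:%s" % (FIELD_PREFIX, parent_level, parent_id)))
    return params


city_request = hier_facet_params("city", parent_level="country", parent_id="31")
area_request = hier_facet_params("cityarea", parent_level="city", parent_id="7")
```

Adding a new level then only means indexing another _loc_hier_<level> field;
the schema and this helper stay the same.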


2010/3/3 Andy <an...@yahoo.com>

> Option 3 is probably the best match for my use case. Is there any trick to
> make it able to deal with arbitrary number of levels?
>

RE: DIH onError question

Posted by "Shah, Nirmal" <ns...@columnit.com>.
Thanks for your prompt reply.  I resolved the ERROR, and used "continue" to bypass any EXCEPTIONS.

Nirmal Shah
Remedy Consultant|Column Technologies|Cell: (630) 244-1648


-----Original Message-----
From: Noble Paul നോബിള്‍ नोब्ळ् [mailto:noble.paul@gmail.com] 
Sent: Tuesday, March 02, 2010 11:13 PM
To: solr-user@lucene.apache.org
Subject: Re: DIH onError question

onError only handles Exception (not Error or Throwable). In your case
it is a NoClassDefFoundError. An Error or Throwable is a symptom of a
larger problem. If you fix the NoClassDefFoundError it should be OK.


Re: DIH onError question

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
onError only handles Exception (not Error or Throwable). In your case
it is a NoClassDefFoundError. An Error or Throwable is a symptom of a
larger problem. If you fix the NoClassDefFoundError it should be OK.
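
One way to check whether the missing class is available at all is to scan the
jars that Solr loads. The sketch below is only a helper idea, not part of DIH:
the function name is made up, and the directory to scan is whatever lib
directory your installation uses.

```python
import zipfile
from pathlib import Path

# Class named in the NoClassDefFoundError, written as a jar entry path.
CLASS_ENTRY = "org/bouncycastle/jce/provider/BouncyCastleProvider.class"


def jars_containing(lib_dir, class_entry=CLASS_ENTRY):
    """Return the names of .jar files under lib_dir that contain class_entry."""
    matches = []
    for jar_path in sorted(Path(lib_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar_path) as jar:
                if class_entry in jar.namelist():
                    matches.append(jar_path.name)
        except zipfile.BadZipFile:
            # Skip files that are not valid archives.
            continue
    return matches
```

An empty result for the lib directories in use would suggest that the jar
providing the class is simply not on the classpath.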

On Wed, Mar 3, 2010 at 10:06 AM, Shah, Nirmal <ns...@columnit.com> wrote:
> Hi all,
>
> I am using Solr 1.5 from trunk.  I am getting the below error on a full
> load, and it is causing the import to fail and rollback.  I am not
> concerned about the error but rather that I cannot seem to tell the
> indexing to continue.  I have two entities, and I have tried all (4)
> combinations of "skip" and "continue" for their onError attributes.
>



-- 
-----------------------------------------------------
Noble Paul | Systems Architect| AOL | http://aol.com

DIH onError question

Posted by "Shah, Nirmal" <ns...@columnit.com>.
Hi all,

I am using Solr 1.5 from trunk.  I am getting the below error on a full
load, and it is causing the import to fail and rollback.  I am not
concerned about the error but rather that I cannot seem to tell the
indexing to continue.  I have two entities, and I have tried all (4)
combinations of "skip" and "continue" for their onError attributes.

SEVERE: Exception while processing: f document : null
org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:652)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:606)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:261)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:185)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:372)
Caused by: java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
	at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1108)
	at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:573)
	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:235)
	at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
	at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:124)
	at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:233)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:580)
	... 6 more
Mar 2, 2010 10:21:05 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:652)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:606)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:261)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:185)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:372)
Caused by: java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
	at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1108)
	at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:573)
	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:235)
	at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
	at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:124)
	at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:233)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:580)
	... 6 more
Mar 2, 2010 10:21:05 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback


My data-config file:
<dataConfig>
  <dataSource name="binaryFile" type="BinFileDataSource" />
  <document>
    <entity name="f" processor="FileListEntityProcessor"
            transformer="RegexTransformer,TemplateTransformer" baseDir="C:\Docs"
            fileName=".*pdf" recursive="true" rootEntity="false" pk="id"
            dataSource="binaryFile" onError="skip">
      <field column="id" sourceColName="fileAbsolutePath" regex="\\"
             replaceWith="/" />
      <entity dataSource="binaryFile" name="x"
              processor="TikaEntityProcessor" url="${f.fileAbsolutePath}"
              onError="continue">
        <field column="text" name="text" />
      </entity>
    </entity>
  </document>
</dataConfig>


Thanks,
Nirmal

Re: Implementing hierarchical facet

Posted by Andy <an...@yahoo.com>.
Thanks. I didn't know about the {!key=Location} trick.

Thanks everyone for your help. From what I could gather, there're 3 approaches:

1) SOLR-64
Pros:
- can have arbitrary levels of hierarchy without modifying schema
Cons:
- each combination of all the levels in the hierarchy will result in a separate filter cache entry. The number of entries could be huge, which would lead to poor performance

2) SOLR-792
Pros:
- each level of the hierarchy results in its own filter cache entries. Far fewer entries overall, so better performance.
Cons:
- Only 2 levels are supported

3) Separate fields for each hierarchy level
Pros:
- same as SOLR-792. Good performance
Cons:
- can only handle a fixed number of levels in the hierarchy. Adding any levels beyond that requires schema modification

Does that sound right?

Option 3 is probably the best match for my use case. Is there any trick to make it handle an arbitrary number of levels?

Thanks.

--- On Tue, 3/2/10, Geert-Jan Brits <gb...@gmail.com> wrote:

From: Geert-Jan Brits <gb...@gmail.com>
Subject: Re: Implementing hierarchical facet
To: solr-user@lucene.apache.org
Date: Tuesday, March 2, 2010, 8:02 PM

Using Solr 1.4: even fewer changes to the frontend:

&facet=on&facet.field={!key=Location}countryid
...
&facet=on&facet.field={!key=Location}cityid&fq=countryid:<somecountryid>
etc.

will consistently render the resulting facet under the name "Location".



Re: Implementing hierarchical facet

Posted by Geert-Jan Brits <gb...@gmail.com>.
Using Solr 1.4: even less changes to the frontend:

&facet=on&facet.field={!key=Location}countryid
...
&facet=on&facet.field={!key=Location}cityid&fq=countryid:<somecountryid>
etc.

This will consistently render the resulting facet under the name "Location".
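
A minimal Python sketch of picking up the renamed facet on the client side. It assumes the request also carries wt=json and that Solr returns facet counts in its default flat [value, count, value, count, ...] layout; the sample response and the helper name location_counts are made up for illustration.

```python
def location_counts(response):
    """Return (value, count) pairs for the facet published under the key "Location"."""
    flat = response["facet_counts"]["facet_fields"].get("Location", [])
    # The flat JSON layout alternates value, count, value, count, ...
    return list(zip(flat[0::2], flat[1::2]))


# Hypothetical response fragment for a request such as
# &facet=on&facet.field={!key=Location}cityid&fq=countryid:<somecountryid>&wt=json
sample = {"facet_counts": {"facet_fields": {"Location": ["101", 12, "205", 7]}}}

for value, count in location_counts(sample):
    print(value, count)
```

Because every level is published under the same key, the frontend code that renders "Location" does not need to know which field was actually faceted on.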


2010/3/3 Geert-Jan Brits <gb...@gmail.com>

> If it's a requirement to let Solr handle the facet hierarchy, please
> disregard this post. Otherwise, an alternative is to have your app control
> when to ask for which 'facet level' (e.g. country, state, city) in the
> hierarchy, as follows.
>
> each doc has 3 separate fields (indexed=true, stored=false):
> - countryid
> - stateid
> - cityid
>
> facet on country:
> &facet=on&facet.field=countryid
>
> facet on state (country selected; functionally you probably don't want to
> show states without the user having selected a country anyway)
> &facet=on&facet.field=stateid&fq=countryid:<somecountryid>
>
> facet on city (state selected, same functional analogy as above)
> &facet=on&facet.field=cityid&fq=stateid:<somestateid>
>
> or
>
> facet on city (country selected, same functional analogy as above)
> &facet=on&facet.field=cityid&fq=countryid:<somecountryid>
>
> grab the resulting facet and drop it under "Location"
>
> pros:
> - reusing fq's (good performance; I've never used hierarchical facets, but
> would be surprised if they were (much) faster than this method)
> - flexible (you get multiple hierarchies: country --> state --> city and
> country --> city)
>
> cons:
> - a little more application logic
>
> Hope that helps,
> Geert-Jan
>
>
>
>
>
> 2010/3/2 Andy <an...@yahoo.com>
>
> I read that a simple way to implement hierarchical facet is to concatenate
>> strings with a separator. Something like "level1>level2>level3" with ">" as
>> the separator.
>>
>> A problem with this approach is that the number of facet values will
>> greatly increase.
>>
>> For example I have a facet "Location" with the hierarchy
>> country>state>city. Using the above approach every single city will lead to
>> a separate facet value. With tens of thousands of cities in the world the
>> response from Solr will be huge. And then on the client side I'd have to
>> loop through all the facet values and combine those with the same country
>> into a single value.
>>
>> Ideally Solr would be "aware" of the hierarchy structure and send back
>> responses accordingly. So at level 1 Solr will send back facet values based
>> on country (100 or so values). Level 2 the facet values will be based on the
>> states within the selected country (a few dozen values). Next level will be
>> cities within that state. and so on.
>>
>> Is it possible to implement hierarchical facet this way using Solr?
>>
>>
>>
>>
>
>
>

Re: Implementing hierarchical facet

Posted by Geert-Jan Brits <gb...@gmail.com>.
If it's a requirement to let Solr handle the facet hierarchy, please
disregard this post. Otherwise, an alternative is to have your app control
when to ask for which 'facet level' (e.g. country, state, city) in the
hierarchy, as follows.

each doc has 3 separate fields (indexed=true, stored=false):
- countryid
- stateid
- cityid

facet on country:
&facet=on&facet.field=countryid

facet on state (country selected; functionally you probably don't want to
show states without the user having selected a country anyway)
&facet=on&facet.field=stateid&fq=countryid:<somecountryid>

facet on city (state selected, same functional analogy as above)
&facet=on&facet.field=cityid&fq=stateid:<somestateid>

or

facet on city (country selected, same functional analogy as above)
&facet=on&facet.field=cityid&fq=countryid:<somecountryid>

grab the resulting facet and drop it under "Location"

pros:
- reusing fq's (good performance; I've never used hierarchical facets, but
would be surprised if they were (much) faster than this method)
- flexible (you get multiple hierarchies: country --> state --> city and
country --> city)

cons:
- a little more application logic
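
The application logic could be sketched in Python roughly as follows. This is only an illustration: it facets on stateid once a country is selected and on cityid once a state is selected, and the function name, the q=*:* query and rows=0 are assumptions, not something the setup above requires.

```python
from urllib.parse import urlencode


def location_facet_params(country_id=None, state_id=None):
    """Pick the facet field and filter for the deepest level selected so far."""
    params = [("q", "*:*"), ("rows", "0"), ("facet", "on")]
    if state_id is not None:
        # A state is selected: list the cities inside it.
        params += [("facet.field", "cityid"), ("fq", f"stateid:{state_id}")]
    elif country_id is not None:
        # Only a country is selected: list its states.
        params += [("facet.field", "stateid"), ("fq", f"countryid:{country_id}")]
    else:
        # Nothing selected yet: list countries.
        params.append(("facet.field", "countryid"))
    return params


# Query string for the "country selected" step; the id 42 is a placeholder.
print(urlencode(location_facet_params(country_id="42")))
```

Keeping the level choice in one function means the country --> city shortcut can be added later as one more branch.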

Hope that helps,
Geert-Jan





2010/3/2 Andy <an...@yahoo.com>

> I read that a simple way to implement hierarchical facet is to concatenate
> strings with a separator. Something like "level1>level2>level3" with ">" as
> the separator.
>
> A problem with this approach is that the number of facet values will
> greatly increase.
>
> For example I have a facet "Location" with the hierarchy
> country>state>city. Using the above approach every single city will lead to
> a separate facet value. With tens of thousands of cities in the world the
> response from Solr will be huge. And then on the client side I'd have to
> loop through all the facet values and combine those with the same country
> into a single value.
>
> Ideally Solr would be "aware" of the hierarchy structure and send back
> responses accordingly. So at level 1 Solr will send back facet values based
> on country (100 or so values). Level 2 the facet values will be based on the
> states within the selected country (a few dozen values). Next level will be
> cities within that state. and so on.
>
> Is it possible to implement hierarchical facet this way using Solr?
>
>
>
>

RE: Implementing hierarchical facet

Posted by Peter S <pe...@hotmail.com>.
Hi Andy,

It sounds like you may want to have a look at tree faceting:

  https://issues.apache.org/jira/browse/SOLR-792

> Date: Mon, 1 Mar 2010 18:23:51 -0800
> From: angelflow@yahoo.com
> Subject: Implementing hierarchical facet
> To: solr-user@lucene.apache.org
> 
> I read that a simple way to implement hierarchical facet is to concatenate strings with a separator. Something like "level1>level2>level3" with ">" as the separator.
> 
> A problem with this approach is that the number of facet values will greatly increase.
> 
> For example I have a facet "Location" with the hierarchy country>state>city. Using the above approach every single city will lead to a separate facet value. With tens of thousands of cities in the world the response from Solr will be huge. And then on the client side I'd have to loop through all the facet values and combine those with the same country into a single value.
> 
> Ideally Solr would be "aware" of the hierarchy structure and send back responses accordingly. So at level 1 Solr will send back facet values based on country (100 or so values). Level 2 the facet values will be based on the states within the selected country (a few dozen values). Next level will be cities within that state. and so on.
> 
> Is it possible to implement hierarchical facet this way using Solr?
> 
> 
> 
> 
 		 	   		  

Re: Search Result differences Standard vs DisMax

Posted by Erick Erickson <er...@gmail.com>.
Look at your dismax request handler definition in solrconfig.xml. You should
be able to add something like this to its defaults:
<str name="mm">3</str>

Or, if you want to bend your mind, this is also possible (from the example
file):
<str name="mm">3&lt;-1 5&lt;-2 6&lt;90%</str>


See:
http://lucene.apache.org/solr/api/org/apache/solr/util/doc-files/min-should-match.html
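
To get a feel for what such a spec does, here is a small Python sketch of the rules described on that page. It is an illustration written for this thread, not Solr's own code: the function name is made up, it expects the spec with the XML escape &lt; already turned back into <, and it does not handle spaces around the < sign.

```python
def min_should_match(clause_count, spec):
    """Number of optional clauses that must match for a given mm spec."""
    result = clause_count
    spec = spec.strip()
    if "<" in spec:
        # Conditional form: "N<value" applies value only when there are
        # more than N clauses; with N or fewer, the result so far stands.
        for part in spec.split():
            upper, value = part.split("<", 1)
            if clause_count <= int(upper):
                return result
            result = min_should_match(clause_count, value)
        return result
    if spec.endswith("%"):
        percent = int(spec[:-1])
        if percent < 0:
            # Negative percentage: that share of clauses (rounded down) may be missing.
            result = clause_count - (clause_count * -percent) // 100
        else:
            # Positive percentage: that share of clauses (rounded down) is required.
            result = (clause_count * percent) // 100
    else:
        number = int(spec)
        # A negative integer counts the clauses that may be missing.
        result = clause_count + number if number < 0 else number
    # Never require more clauses than exist, and never fewer than zero.
    return max(0, min(clause_count, result))


for n in (2, 4, 7):
    print(n, min_should_match(n, "3<-1 5<-2 6<90%"))
```

Under these rules a two-word query such as "Apache Solr" still needs both words with either setting above (3 is capped at the number of clauses), whereas a value of 1 makes a single matching word enough.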

<G>...

Erick

On Mon, Mar 1, 2010 at 9:04 PM, Steve Reichgut <sr...@axtaweb.com> wrote:

> Thanks Joe. That was exactly the issue. When I added 'mm=1', I got exactly
> the results I was looking for. Where would I change the default value for
> the 'mm' parameter? Is it in solrconfig.xml?
>
> Steve
>
>
> On 3/1/2010 5:30 PM, Joe Calderon wrote:
>
>> What are you using for the mm parameter? If you set it to 1, only one word
>> has to match.
>> On 03/01/2010 05:07 PM, Steve Reichgut wrote:
>>
>>> ***Sorry if this was sent twice. I had connection problems here and it
>>> didn't look like the first time it went out****
>>>
>>> I have been testing out results for some basic queries using both the
>>> Standard and DisMax query parsers. The results, though, aren't what I expected,
>>> and I am wondering if I am misunderstanding how the DisMax query parser works.
>>>
>>> For example, let's say I am doing a basic search for "Apache Solr" across
>>> a single field = Field 1 using the Standard parser. My results are exactly
>>> what I expected. Any document that includes either "Apache" or "Solr" or
>>> "Apache Solr" in Field 1 is listed with priority given to those that include
>>> both words.
>>>
>>> Now, if I do the same search for "Apache Solr" across multiple fields -
>>> Field 1, Field 2 - using DisMax, I would expect basically the same results.
>>> The results should include any document that has one or both words in Field
>>> 1 or Field 2.
>>>
>>> When I run that query in DisMax though, it only returns the documents
>>> that have BOTH words included which in my sample set only includes 1 or 2
>>> documents. I thought that, by default, DisMax should make both words
>>> optional so I am confused as to why I am only getting such a small subset.
>>>
>>> Can anyone shed some light on what I am doing wrong or if I am
>>> misunderstanding how DisMax works?
>>>
>>> Thanks,
>>> Steve
>>>
>>
>>
>>
>

Re: Implementing hierarchical facet

Posted by Andy <an...@yahoo.com>.
Oops. Sorry about that.

I'll start a fresh one.

--- On Mon, 3/1/10, Chris Hostetter <ho...@fucit.org> wrote:

From: Chris Hostetter <ho...@fucit.org>
Subject: Re: Implementing hierarchical facet
To: solr-user@lucene.apache.org
Date: Monday, March 1, 2010, 11:36 PM


: Subject: Implementing hierarchical facet
: In-Reply-To: <4B...@axtaweb.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking




-Hoss




      

Re: Implementing hierarchical facet

Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: Implementing hierarchical facet
: In-Reply-To: <4B...@axtaweb.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking




-Hoss


Implementing hierarchical facet

Posted by Andy <an...@yahoo.com>.
I read that a simple way to implement hierarchical facet is to concatenate strings with a separator. Something like "level1>level2>level3" with ">" as the separator.

A problem with this approach is that the number of facet values will greatly increase.

For example I have a facet "Location" with the hierarchy country>state>city. Using the above approach every single city will lead to a separate facet value. With tens of thousands of cities in the world the response from Solr will be huge. And then on the client side I'd have to loop through all the facet values and combine those with the same country into a single value.
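
To make the client-side step concrete, here is a rough Python sketch of the loop I mean. The sample values are invented, and summing the counts only gives a correct per-country total if each document carries exactly one location.

```python
from collections import Counter


def rollup_by_top_level(facet_counts, separator=">"):
    """Sum the counts of concatenated facet values by their first path element."""
    totals = Counter()
    for path, count in facet_counts.items():
        top_level = path.split(separator, 1)[0]
        totals[top_level] += count
    return dict(totals)


# Invented facet values in the "country>state>city" form.
sample = {
    "USA>California>Los Angeles": 12,
    "USA>Texas>Austin": 5,
    "France>Ile-de-France>Paris": 9,
}
print(rollup_by_top_level(sample))
```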

Ideally Solr would be "aware" of the hierarchy structure and send back responses accordingly. So at level 1 Solr will send back facet values based on country (100 or so values). At level 2 the facet values will be based on the states within the selected country (a few dozen values). The next level will be cities within that state, and so on.

Is it possible to implement hierarchical facet this way using Solr?



      

Re: Search Result differences Standard vs DisMax

Posted by Steve Reichgut <sr...@axtaweb.com>.
Thanks Joe. That was exactly the issue. When I added 'mm=1', I got 
exactly the results I was looking for. Where would I change the default 
value for the 'mm' parameter? Is it in solrconfig.xml?

Steve

On 3/1/2010 5:30 PM, Joe Calderon wrote:
> What are you using for the mm parameter? If you set it to 1, only one
> word has to match.
> On 03/01/2010 05:07 PM, Steve Reichgut wrote:
>> ***Sorry if this was sent twice. I had connection problems here and 
>> it didn't look like the first time it went out****
>>
>> I have been testing out results for some basic queries using both the
>> Standard and DisMax query parsers. The results, though, aren't what I
>> expected, and I am wondering if I am misunderstanding how the DisMax
>> query parser works.
>>
>> For example, let's say I am doing a basic search for "Apache Solr" 
>> across a single field = Field 1 using the Standard parser. My results 
>> are exactly what I expected. Any document that includes either 
>> "Apache" or "Solr" or "Apache Solr" in Field 1 is listed with 
>> priority given to those that include both words.
>>
>> Now, if I do the same search for "Apache Solr" across multiple fields 
>> - Field 1, Field 2 - using DisMax, I would expect basically the same 
>> results. The results should include any document that has one or both 
>> words in Field 1 or Field 2.
>>
>> When I run that query in DisMax though, it only returns the documents 
>> that have BOTH words included which in my sample set only includes 1 
>> or 2 documents. I thought that, by default, DisMax should make both 
>> words optional so I am confused as to why I am only getting such a 
>> small subset.
>>
>> Can anyone shed some light on what I am doing wrong or if I am
>> misunderstanding how DisMax works?
>>
>> Thanks,
>> Steve
>
>


Re: Search Result differences Standard vs DisMax

Posted by Joe Calderon <ca...@gmail.com>.
What are you using for the mm parameter? If you set it to 1, only one
word has to match.
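
For instance, a request with mm passed explicitly could be built like this in Python. Treat it as a sketch: the host, the /select path and the field names field1 and field2 are placeholders for whatever your setup uses.

```python
from urllib.parse import urlencode


def dismax_url(query, fields, mm="1", base="http://localhost:8983/solr/select"):
    """Build a dismax query URL that needs only `mm` optional terms to match."""
    params = {
        "q": query,
        "defType": "dismax",
        "qf": " ".join(fields),  # fields to search across
        "mm": mm,                # minimum number of query terms that must match
    }
    return base + "?" + urlencode(params)


url = dismax_url("Apache Solr", ["field1", "field2"])
print(url)
```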
On 03/01/2010 05:07 PM, Steve Reichgut wrote:
> ***Sorry if this was sent twice. I had connection problems here and it 
> didn't look like the first time it went out****
>
> I have been testing out results for some basic queries using both the
> Standard and DisMax query parsers. The results, though, aren't what I
> expected, and I am wondering if I am misunderstanding how the DisMax query
> parser works.
>
> For example, let's say I am doing a basic search for "Apache Solr" 
> across a single field = Field 1 using the Standard parser. My results 
> are exactly what I expected. Any document that includes either 
> "Apache" or "Solr" or "Apache Solr" in Field 1 is listed with priority 
> given to those that include both words.
>
> Now, if I do the same search for "Apache Solr" across multiple fields 
> - Field 1, Field 2 - using DisMax, I would expect basically the same 
> results. The results should include any document that has one or both 
> words in Field 1 or Field 2.
>
> When I run that query in DisMax though, it only returns the documents 
> that have BOTH words included which in my sample set only includes 1 
> or 2 documents. I thought that, by default, DisMax should make both 
> words optional so I am confused as to why I am only getting such a 
> small subset.
>
> Can anyone shed some light on what I am doing wrong or if I am
> misunderstanding how DisMax works?
>
> Thanks,
> Steve