Posted to solr-user@lucene.apache.org by gi...@gmx.de on 2008/10/24 13:44:12 UTC

delta-import for XML files, Solr statistics

Hello,

I have some questions about DataImportHandler and Solr statistics...


1.)
I'm using the DataImportHandler for creating my Lucene index from XML files:

###
$ cat data-config.xml 
<dataConfig>
 <dataSource type="FileDataSource" />
  <document>
   <entity name="xmlFile"
        processor="FileListEntityProcessor"
        baseDir="/tmp/files"
        fileName="myDoc_.*\.xml"
        newerThan="'NOW-30DAYS'"
        recursive="false"
        rootEntity="false"
        dataSource="null">
    <entity name="myDoc"
          url="${xmlFile.fileAbsolutePath}"
          processor="XPathEntityProcessor"
          forEach="/myDoc">
          ...
</dataConfig>
###

No problems with this configuration - All works fine for full-imports, but...

===> What do 'rootEntity="false"' and 'dataSource="null"' mean?



2.)
The DataImportHandler documentation describes the index update process for SQL databases only...

My scenario:
- My application creates, deletes and modifies files in /tmp/files every night.
- delta-import / DataImportHandler should "mirror" _all_ these changes to my Lucene index (=> create, delete, update documents).

===> Is this possible with delta-import / DataImportHandler?
===> If not: Do you have any suggestions on how to do this?



3.)
My scenario:
- /tmp/files contains 682 'myDoc_.*\.xml' XML files. 
- Each XML file contains 12 XML elements (e.g. <title>foo</title>).
- DataImportHandler transfers only 5 of these 12 elements to the Lucene index.


I don't understand the output from 'solr/dataimport' (=> status):

###
<response>
 ...
 <lst name="statusMessages">
  <str name="Total Requests made to DataSource">0</str>
  <str name="Total Rows Fetched">1363</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2008-10-24 13:19:03</str>
  <str name="">
    Indexing completed. Added/Updated: 681 documents. Deleted 0 documents.
  </str>
  <str name="Committed">2008-10-24 13:19:05</str>
  <str name="Optimized">2008-10-24 13:19:05</str>
  <str name="Time taken ">0:0:2.648</str>
  </lst>
...
</response>
###

===> What is "Total Rows Fetched", i.e. what is a "row" in an XML file? An element? Why 1363?
===> Why does the "Added/Updated" counter show 681 and not 682?



4.)
And my last questions about Solr statistics/information...

===> Is it possible to get information (number of indexed documents, stored values of documents etc.) from the current Lucene index?
===> The admin web interface shows 'numDocs' and 'maxDoc' in 'statistics/core'. Is 'numDocs' = number of indexed documents? What does 'maxDoc' mean?


Thanks a lot!
gisto

Re: delta-import for XML files, Solr statistics

Posted by Akshay <ak...@gmail.com>.
On Fri, Oct 24, 2008 at 6:07 PM, <gi...@gmx.de> wrote:

> Thanks for your very fast response :-)
>
>
> > > 2.)
> > > The documentation from DataImportHandler describes the index update
> > > process for SQL databases only...
> > >
> > > My scenario:
> > > - My application creates, deletes and modifies files from /tmp/files
> > > every night.
> > > - delta-import / DataImportHandler should "mirror" _all_ this changes
> > > to my lucene index (=> create, delete, update documents).
> > The only Entityprocessor which supports delta is SqlEntityProcessor.
> > The XPathEntityProcessor has not implemented it , because we do not
> > know of a consistent way of finding deltas for XML. So ,
> > unfortunately,no delta support for XML. But that said you can
> > implement those methods in XPathEntityProcessor . The methods are
> > explained in EntityProcessor.java. if you have questions specific to
> > this I can help.Probably we can contribute it back
> > >
> > > ===> Is this possible with delta-import / DataImportHandler?
> > > ===> If not: Do you have any suggestions on how to do this?
>
> Ok so, at the moment I have to do a full-import to update my index. What
> happens with (user) queries while a full-import is running? Does Solr block
> these queries until the import is finished? Which configuration options
> control this behavior?


No, queries to Solr are not blocked during a full import.


>
>
>
> > > My scenario:
> > > - /tmp/files contains 682 'myDoc_.*\.xml' XML files.
> > > - Each XML file contains 12 XML elements (e.g. <title>foo</title>).
> > > - DataImportHandler transfer only 5 from this 12 elements to the lucene
> > > index.
> > >
> > >
> > > I don't understand the output from 'solr/dataimport' (=> status):
> > >
> > > ###
> > > <response>
> > >  ...
> > >  <lst name="statusMessages">
> > >  <str name="Total Requests made to DataSource">0</str>
> > >  <str name="Total Rows Fetched">1363</str>
> > >  <str name="Total Documents Skipped">0</str>
> > >  <str name="Full Dump Started">2008-10-24 13:19:03</str>
> > >  <str name="">
> > >    Indexing completed. Added/Updated: 681 documents. Deleted 0 documents.
> > >  </str>
> > >  <str name="Committed">2008-10-24 13:19:05</str>
> > >  <str name="Optimized">2008-10-24 13:19:05</str>
> > >  <str name="Time taken ">0:0:2.648</str>
> > >  </lst>
> > > ...
> > > </response>
> > >
> > > ===> Why shows the "Added/Updated" counter 681 and not 682?
> >
> > Added updated is the no:of docs . How do you know the number is not
> > accurate?
>
>
> /tmp/files$ ls myDoc_*.xml | wc -l
> 682
>
> But "Added/Updated" shows 681. Does this mean that one file has an XML
> error? But the statistic says "Total Documents Skipped" = 0?!


It might be the case that there is an extra line somewhere in one of the XML
files, a line like <?xml version="1.0" encoding="utf-8"?> or something.
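
One way to test that guess is to parse every file and report the ones that
fail. The following is a minimal sketch, not part of DataImportHandler: the
function name `find_unimportable` is hypothetical, and it assumes that a file
is importable when it is well-formed XML whose root element is `myDoc`
(matching forEach="/myDoc" in the config above).

```python
import glob
import os
import xml.etree.ElementTree as ET


def find_unimportable(base_dir, pattern="myDoc_*.xml", root_tag="myDoc"):
    """Return (path, reason) pairs for files that would not yield a /myDoc row.

    Hypothetical helper: it only checks well-formedness and the root tag.
    """
    problems = []
    for path in sorted(glob.glob(os.path.join(base_dir, pattern))):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError as exc:
            # Covers, for example, an XML declaration that is not on line 1.
            problems.append((path, "not well-formed: %s" % exc))
            continue
        if root.tag != root_tag:
            problems.append((path, "root element is <%s>" % root.tag))
    return problems


if __name__ == "__main__":
    # /tmp/files is the baseDir from the data-config.xml in this thread.
    for path, reason in find_unimportable("/tmp/files"):
        print(path, reason)
```

If exactly one file is reported, that would be consistent with 681 documents
out of 682 files; if none is reported, the cause lies elsewhere.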


>
>
>
>
> > > 4.)
> > > And my last questions about Solr statistics/informations...
> > >
> > > ===> Is it possible to get informations (number of indexed documents,
> > > stored values from documents etc.) from the current lucene index?
> > > ===> The admin webinterface shows 'numDocs' and 'maxDoc' in
> > > 'statistics/core'. Is 'numDocs' = number of indexed documents? What means
> > > 'maxDocs'?
>
> Do you have answers for this questions too?
>
> Bye,
> Simon



-- 
Regards,
Akshay Ukey.

Re: delta-import for XML files, Solr statistics

Posted by gi...@gmx.de.
Thanks for your very fast response :-)


> > 2.)
> > The documentation from DataImportHandler describes the index update
> > process for SQL databases only...
> >
> > My scenario:
> > - My application creates, deletes and modifies files from /tmp/files
> > every night.
> > - delta-import / DataImportHandler should "mirror" _all_ this changes to
> > my lucene index (=> create, delete, update documents).
> The only Entityprocessor which supports delta is SqlEntityProcessor.
> The XPathEntityProcessor has not implemented it , because we do not
> know of a consistent way of finding deltas for XML. So ,
> unfortunately,no delta support for XML. But that said you can
> implement those methods in XPathEntityProcessor . The methods are
> explained in EntityProcessor.java. if you have questions specific to
> this I can help.Probably we can contribute it back
> >
> > ===> Is this possible with delta-import / DataImportHandler?
> > ===> If not: Do you have any suggestions on how to do this?

Ok so, at the moment I have to do a full-import to update my index. What happens with (user) queries while a full-import is running? Does Solr block these queries until the import is finished? Which configuration options control this behavior?



> > My scenario:
> > - /tmp/files contains 682 'myDoc_.*\.xml' XML files.
> > - Each XML file contains 12 XML elements (e.g. <title>foo</title>).
> > - DataImportHandler transfer only 5 from this 12 elements to the lucene
> > index.
> >
> >
> > I don't understand the output from 'solr/dataimport' (=> status):
> >
> > ###
> > <response>
> >  ...
> >  <lst name="statusMessages">
> >  <str name="Total Requests made to DataSource">0</str>
> >  <str name="Total Rows Fetched">1363</str>
> >  <str name="Total Documents Skipped">0</str>
> >  <str name="Full Dump Started">2008-10-24 13:19:03</str>
> >  <str name="">
> >    Indexing completed. Added/Updated: 681 documents. Deleted 0 documents.
> >  </str>
> >  <str name="Committed">2008-10-24 13:19:05</str>
> >  <str name="Optimized">2008-10-24 13:19:05</str>
> >  <str name="Time taken ">0:0:2.648</str>
> >  </lst>
> > ...
> > </response>
> >
> > ===> Why shows the "Added/Updated" counter 681 and not 682?
> 
> Added updated is the no:of docs . How do you know the number is not
> accurate?


/tmp/files$ ls myDoc_*.xml | wc -l
682

But "Added/Updated" shows 681. Does this mean that one file has an XML error? But the statistic says "Total Documents Skipped" = 0?!

 

> > 4.)
> > And my last questions about Solr statistics/informations...
> >
> > ===> Is it possible to get informations (number of indexed documents,
> > stored values from documents etc.) from the current lucene index?
> > ===> The admin webinterface shows 'numDocs' and 'maxDoc' in
> > 'statistics/core'. Is 'numDocs' = number of indexed documents? What means 'maxDocs'?

Do you have answers to these questions too?

Bye,
Simon

Re: delta-import for XML files, Solr statistics

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
On Fri, Oct 24, 2008 at 5:14 PM,  <gi...@gmx.de> wrote:
> Hello,
>
> I have some questions about DataImportHandler and Solr statistics...
>
>
> 1.)
> I'm using the DataImportHandler for creating my Lucene index from XML files:
>
> ###
> $ cat data-config.xml
> <dataConfig>
>  <dataSource type="FileDataSource" />
>  <document>
>   <entity name="xmlFile"
>        processor="FileListEntityProcessor"
>        baseDir="/tmp/files"
>        fileName="myDoc_.*\.xml"
>        newerThan="'NOW-30DAYS'"
>        recursive="false"
>        rootEntity="false"
>        dataSource="null">
>    <entity name="myDoc"
>          url="${xmlFile.fileAbsolutePath}"
>          processor="XPathEntityProcessor"
>          forEach="/myDoc">
>          ...
> </dataConfig>
> ###
>
> No problems with this configuration - All works fine for full-imports, but...
>
> ===> What means 'rootEntity="false"' and 'dataSource="null"'?

It is a menace caused by 'sensible defaults'

An entity directly under <document> is a root entity. That means that
for each row emitted by the root entity, one document is created in
Solr/Lucene. But in this case we do not wish to make one document per
file; we wish to make one document per row emitted by the entity
'myDoc'. Because the entity 'xmlFile' has rootEntity="false", the
entity directly under it automatically becomes a root entity, and each
row emitted by that entity becomes a document.
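
The rule above can be sketched as a toy model. This is not DataImportHandler
code: `build_documents` and the sample data are hypothetical, and the way
fields are merged when the outer entity is the root is a simplification.

```python
def build_documents(files, outer_is_root):
    """Toy model of how rows turn into documents.

    files: mapping of file path -> list of rows (dicts) emitted by the
    nested 'myDoc' entity for that file.
    """
    documents = []
    for inner_rows in files.values():
        if outer_is_root:
            # 'xmlFile' is the root entity: one document per file, with
            # the fields of all nested rows collected into it.
            doc = {}
            for row in inner_rows:
                for key, value in row.items():
                    doc.setdefault(key, []).append(value)
            documents.append(doc)
        else:
            # rootEntity="false" on 'xmlFile': 'myDoc' becomes the root,
            # so every row it emits is a document of its own.
            documents.extend(dict(row) for row in inner_rows)
    return documents


# Hypothetical sample: two files, the second one yields two rows.
sample = {
    "/tmp/files/myDoc_1.xml": [{"title": "foo"}],
    "/tmp/files/myDoc_2.xml": [{"title": "bar"}, {"title": "baz"}],
}
print(len(build_documents(sample, outer_is_root=True)))   # 2
print(len(build_documents(sample, outer_is_root=False)))  # 3
```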

In most cases there is only one data source (a JdbcDataSource) and
all entities just use it. So it is overkill to ask users to write the
dataSource attribute, and we have chosen to implicitly assign the
unnamed data source to every entity. But in the case of
FileListEntityProcessor a data source is not necessary. It won't hurt
if you leave out dataSource="null"; setting it just means that we
won't create a DataSource instance for that entity.


>
>
>
> 2.)
> The documentation from DataImportHandler describes the index update process for SQL databases only...
>
> My scenario:
> - My application creates, deletes and modifies files from /tmp/files every night.
> - delta-import / DataImportHandler should "mirror" _all_ this changes to my lucene index (=> create, delete, update documents).
The only EntityProcessor which supports delta is SqlEntityProcessor.
The XPathEntityProcessor has not implemented it, because we do not
know of a consistent way of finding deltas for XML. So, unfortunately,
there is no delta support for XML. That said, you can implement those
methods in XPathEntityProcessor. The methods are explained in
EntityProcessor.java. If you have questions specific to this, I can
help. Probably we can contribute it back.
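
For the nightly create/modify/delete scenario, one possible way of finding
deltas outside DataImportHandler is to compare directory snapshots. This is
only a sketch under assumptions: `snapshot` and `diff_snapshots` are
hypothetical names, a change is assumed to show up as a different
modification time, and what to do with the result (re-index, delete by id) is
left open.

```python
import os


def snapshot(base_dir, prefix="myDoc_", suffix=".xml"):
    """Map each matching file name in base_dir to its modification time."""
    result = {}
    for name in os.listdir(base_dir):
        if name.startswith(prefix) and name.endswith(suffix):
            result[name] = os.path.getmtime(os.path.join(base_dir, name))
    return result


def diff_snapshots(previous, current):
    """Return (created, modified, deleted) file names between two snapshots."""
    created = sorted(set(current) - set(previous))
    deleted = sorted(set(previous) - set(current))
    modified = sorted(
        name
        for name in set(current) & set(previous)
        if current[name] != previous[name]
    )
    return created, modified, deleted
```

A nightly job could store the result of `snapshot("/tmp/files")`, compare it
with the previous night's snapshot, and only touch the documents that belong
to the reported files.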
>
> ===> Is this possible with delta-import / DataImportHandler?
> ===> If not: Do you have any suggestions on how to do this?
>
>
>
> 3.)
> My scenario:
> - /tmp/files contains 682 'myDoc_.*\.xml' XML files.
> - Each XML file contains 12 XML elements (e.g. <title>foo</title>).
> - DataImportHandler transfer only 5 from this 12 elements to the lucene index.
>
>
> I don't understand the output from 'solr/dataimport' (=> status):
>
> ###
> <response>
>  ...
>  <lst name="statusMessages">
>  <str name="Total Requests made to DataSource">0</str>
>  <str name="Total Rows Fetched">1363</str>
>  <str name="Total Documents Skipped">0</str>
>  <str name="Full Dump Started">2008-10-24 13:19:03</str>
>  <str name="">
>    Indexing completed. Added/Updated: 681 documents. Deleted 0 documents.
>  </str>
>  <str name="Committed">2008-10-24 13:19:05</str>
>  <str name="Optimized">2008-10-24 13:19:05</str>
>  <str name="Time taken ">0:0:2.648</str>
>  </lst>
> ...
> </response>
>
> ===> What is "Total Rows Fetched" rsp. what is a "row" in a XML file? An element? Why 1363?
> ===> Why shows the "Added/Updated" counter 681 and not 682?

"Rows fetched" makes a lot of sense with SqlEntityProcessor: it is the
number of rows fetched from the DB. It is the cumulative number of rows
given out by all entities put together. In your case it will be the
total files + total rows emitted from the XML.
Added/Updated is the number of docs. How do you know the number is not accurate?
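
As a worked check of that sum, using the 682 files from the question and
assuming that every file that was imported emitted exactly one 'myDoc' row
(the variable names are illustrative only):

```python
files_listed = 682  # rows emitted by 'xmlFile': one per file in /tmp/files
docs_added = 681    # rows emitted by 'myDoc': the "Added/Updated" counter

total_rows_fetched = files_listed + docs_added
print(total_rows_fetched)  # 1363, the "Total Rows Fetched" value
```

The numbers are consistent with one file producing no 'myDoc' row at all.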
>
>
>
> 4.)
> And my last questions about Solr statistics/informations...
>
> ===> Is it possible to get informations (number of indexed documents, stored values from documents etc.) from the current lucene index?
> ===> The admin webinterface shows 'numDocs' and 'maxDoc' in 'statistics/core'. Is 'numDocs' = number of indexed documents? What means 'maxDocs'?
>
>
> Thanks a lot!
> gisto



-- 
--Noble Paul