Posted to solr-user@lucene.apache.org by Pulkit Singhal <pu...@gmail.com> on 2011/10/02 00:17:00 UTC

Enabling the right logs for DIH

====
The Problem:
====
When using DIH with trunk 4.x, I am seeing some very odd numbers
with a particularly large XML file that I'm trying to import.
There are bound to be more rows than documents indexed in DIH
because of the forEach property, but my other xml files have maybe
1.5 times as many rows as indexed docs.
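
For context, here is a minimal sketch of the kind of data-config
involved (the file path, entity name, and xpaths below are
placeholders, not my actual config):

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- forEach tells XPathEntityProcessor which element marks one
         "row"; every match is counted as a row fetched, even when
         it never becomes a complete document -->
    <entity name="products"
            processor="XPathEntityProcessor"
            stream="true"
            url="/path/to/big-file.xml"
            forEach="/catalog/product">
      <field column="id"   xpath="/catalog/product/id"/>
      <field column="name" xpath="/catalog/product/name"/>
    </entity>
  </document>
</dataConfig>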

This particular funky file ends up with something like:
<str name="Total Rows Fetched">25614008</str>
<str name="Total Documents Processed">1048</str>
That's 25 million rows fetched before even a measly 1000 docs are indexed!
Something has to be wrong here.
I checked the xml for well-formedness in vim by running
":!xmllint --noout %", so I think there are no issues there.

====
The Question:
====
For those intimately familiar with the DIH code/behaviour: what is
the appropriate log level that will let me see each row and doc
printed to the log as it is fetched/created? I don't want to make
the logs explode, because then I won't be able to read through
them. Is there some gentle balance here that I can leverage?
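
I did come across LogTransformer on the DIH wiki, which looks like
it logs one line per row at a level you pick, so the rest of the
logging can stay where it is. Is that the intended route? A sketch
based on the wiki (entity and field names are placeholders):

<entity name="products"
        processor="XPathEntityProcessor"
        forEach="/catalog/product"
        transformer="LogTransformer"
        logTemplate="fetched row id=${products.id}"
        logLevel="info">
  <field column="id" xpath="/catalog/product/id"/>
</entity>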

Thanks!
- Pulkit

Re: Enabling the right logs for DIH

Posted by Erick Erickson <er...@gmail.com>.
Hmm, you know, I don't even know what
a "row" means when importing XML. But
let's talk about importing XML anyway.
As far as I know, unless you use XSLT
to perform a transformation, Solr
doesn't import XML except as
well-formed Solr add documents, in
some form like:
<add>
<doc>
   <field name="blah">value</field>
</doc>
</add>

If you're importing anything else, I don't think
Solr understands it at all... So what does
your "funky XML document" look like?
What, if any, errors are reported in your Solr
logs?
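
If your file isn't in that shape, one route (a sketch from memory,
so verify the attribute names against the DIH wiki) is to have
XPathEntityProcessor run an XSLT over the file first and then read
the result as standard Solr add XML:

<entity name="x"
        processor="XPathEntityProcessor"
        url="/path/to/your.xml"
        xsl="xslt/toSolrAddDoc.xsl"
        useSolrAddSchema="true"/>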

Also, it's surprisingly easy to debug Solr
while it runs. In IntelliJ, all it takes is
adding a "Remote" run configuration, and
the IDE will show you the parameters you
need to specify when you start your Solr.
From there you just invoke your Solr
instance with those parameters and connect
the debugger remotely. I took the entire
source tree for the Solr I was using and
compiled it (ant example), and it was
easy. So you might get more mileage out of
debugging Solr rather than logging, but
that's a guess.
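
For reference, the "remote" parameters IntelliJ hands you are the
standard JDWP agent flags, something like the following (the port
is arbitrary; use suspend=y if you want the JVM to wait for the
debugger before starting):

java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005 -jar start.jar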

Best
Erick
