Posted to solr-user@lucene.apache.org by harpax <a....@pan-sonic.de> on 2013/03/04 15:49:09 UTC

solr-dih does multiple queries for sub-entities

Hi,

I am trying to use the DIH to crawl some XML files, apply XPath to them, and
then query a db using the filename as the key. That works, but reading
~30,000 docs would take almost 3h. The DIH debug console showed me that way
too many db calls were made: 1 for the 1st doc, then 2, 3, 4, ...

I tried different attribute combinations (e.g. stripped the config down to
the minimum), but the result is still the same.

This problem has been asked about before:
http://lucene.472066.n3.nabble.com/DIH-multiple-queries-per-sub-entity-tt701038.html

thanks a lot!

regards
Arne

--
<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
    <dataSource 
        name="cr-db"
        jndiName="xyz"
        type="JdbcDataSource" />
    <dataSource 
        name="cr-xml" 
        type="FileDataSource" 
        encoding="utf-8" />


    <document name="doc">
        <entity 
            dataSource="cr-xml" 
            name="f" 
            processor="FileListEntityProcessor" 
            baseDir="/path/to/xml" 
            filename="*.xml" 
            recursive="true" 
            rootEntity="true" 
            onError="skip">
            <entity
                name="xml-data" 
                dataSource="cr-xml" 
                processor="XPathEntityProcessor" 
                forEach="/root" 
                url="${f.fileAbsolutePath}" 
                transformer="DateFormatTransformer" 
                onError="skip">
                <field column="id" xpath="/root/id" /> 

                <field column="A" xpath="/root/a" />
            </entity>

            <entity 
                name="db-data" 
                dataSource="cr-db"
                query="
                    SELECT  
                        id, b
                    FROM 
                        a_table
                    WHERE 
                        id = '${f.file}'">
                <field column="B" name="b" />
            </entity>
        </entity>
    </document>
</dataConfig>
--





--
View this message in context: http://lucene.472066.n3.nabble.com/solr-dih-does-multiple-queries-for-sub-entities-tp4044522.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: solr-dih does multiple queries for sub-entities

Posted by harpax <a....@pan-sonic.de>.
Thanks for the quick answer!

I am not quite sure whether that will help me, though. The relation
between the files and the db entries is 1:1, so I expect only one result
row for each call, and that cannot be cached because the key (the filename)
differs each time. I will try to implement it anyway.

What I don't quite understand is why the number of db calls increases with
the number of docs Solr has indexed. I added the verbose output for the 4th
doc below. Not only do the db calls add up to 4, but so do the
XPathEntityProcessor calls.


thanks again!
arne

--
"document#4",
[
  null,
  "----------- row #1-------------",
  "fileSize",
  3341,
  "fileLastModified",
  "2013-02-13T10:29:04Z",
  "fileAbsolutePath",
  "c:\\Reports\\report-1358168401817.xml",
  "fileDir",
  "c:\\Reports",
  "file",
  "report-1358168401817.xml",
  null,
  "---------------------------------------------",
  "entity:xml-data",
  [
    "query",
    "c:\\Reports\\report-1358168401817.xml",
    "query",
    "c:\\Reports\\report-1358168401817.xml",
    "query",
    "c:\\Reports\\report-1358168401817.xml",
    "query",
    "c:\\Reports\\report-1358168401817.xml",
    "time-taken",
    "0:0:0.0",
    "time-taken",
    "0:0:0.0",
    "time-taken",
    "0:0:0.0",
    "time-taken",
    "0:0:0.0",
    null,
    "----------- row #1-------------",
    "paniclog",
    "",
    "$forEach",
    "/Feedback",
    "crashlog",
    "\ntext\n",
    null,
    "---------------------------------------------"
  ],
  "entity:db-data",
  [
    "query",
    "      SELECT id, b FROM a_table WHERE id = 'report-1358168401817.xml'",
    "query",
    "      SELECT id, b FROM a_table WHERE id = 'report-1358168401817.xml'",
    "query",
    "      SELECT id, b FROM a_table WHERE id = 'report-1358168401817.xml'",
    "query",
    "      SELECT id, b FROM a_table WHERE id = 'report-1358168401817.xml'",
    "time-taken",
    "0:0:0.297",
    "time-taken",
    "0:0:0.297",
    "time-taken",
    "0:0:0.297",
    "time-taken",
    "0:0:0.297",
    null,
    "----------- row #1-------------",
    "ID",
    "report-1358168401817.xml",
    "B",
    "some data",
    "---------------------------------------------",
    null,
  ]
],
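
A rough back-of-the-envelope check, using only numbers quoted in this thread, may help put the counts above into perspective. It assumes the 0.297 s "time-taken" from the single debug sample above is representative of one db call (it may well not be), and it takes N = 30,000 from the first message. If document k really triggered k db calls, the totals would be:

--
\[
\sum_{k=1}^{N} k = \frac{N(N+1)}{2}, \qquad N = 30{,}000 \;\Rightarrow\; 450{,}015{,}000 \ \text{db calls}
\]
\[
450{,}015{,}000 \times 0.297\,\text{s} \approx 1.34 \times 10^{8}\,\text{s} \approx 4.2\ \text{years},
\qquad
30{,}000 \times 0.297\,\text{s} = 8{,}910\,\text{s} \approx 2.5\,\text{h}
\]
--

Under that assumption, the reported "almost 3h" is much closer to one db call per document than to a growing number of calls per document. This is only an inference from the figures above, not something confirmed in the thread.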






RE: solr-dih does multiple queries for sub-entities

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
You can cache the sub-entity; it will then retrieve all the data for that entity in one query.

See http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor for more information. That section focuses on caching data from SqlEntityProcessor. However, it is now also possible to cache data from other entity types, and to plug in other cache implementations if the default in-memory cache does not scale for you. See https://issues.apache.org/jira/browse/SOLR-2382 .

James Dyer
Ingram Content Group
(615) 213-4311
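
For illustration, here is a minimal sketch of how the caching suggestion above could be applied to the db-data entity from the original post. It follows the CachedSqlEntityProcessor pattern described on the wiki page linked above and has not been tested against this setup. It assumes that all of a_table fits in memory and that the column name used in "where" matches the case the driver returns (the debug output in this thread shows it as ID, so the key may need adjusting).

--
<entity
    name="db-data"
    dataSource="cr-db"
    processor="CachedSqlEntityProcessor"
    query="SELECT id, b FROM a_table"
    where="id=f.file">
    <!-- No WHERE clause in the SQL: the rows are fetched once and cached.
         The "where" attribute then looks up the cached row whose id column
         equals the file name supplied by the parent entity f. -->
    <field column="B" name="b" />
</entity>
--

The point of this design is that the key no longer has to be the same for every document: the whole table is read in one query, and each document only does an in-memory lookup by its own file name.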

