Posted to solr-user@lucene.apache.org by Brahim Abdesslam <br...@maecia.com> on 2009/09/25 15:18:23 UTC

DIH & RSS > 1.4 nightly 2009-09-25 > full-import&clean=false always clean and import command do nothing

Hello everybody,

we are using Solr to index some RSS feeds for a news aggregator application.

We've got some difficulties with the publication date of each item
because each site uses a homemade date format.
What we want is the exact amount of time between the publication date
and now.

So we decided to use a timestamp that stores the index time for each item.

The problem is:

    * when I do a full-import&clean=false, the index is always cleaned.
    * when I do a simple import, nothing seems to be done.

Here is the configuration :

    * Apache Solr 1.4 Nightly 2009-09-25
    * java version : build 1.6.0_15-b03
    * Java HotSpot Client VM : build 14.1-b02, mixed mode, sharing

=> data-config.xml

<?xml version="1.0" encoding="utf-8"?>
<dataConfig>
    <dataSource type="HttpDataSource" />
    <document>
        <entity name="flux_367"
                pk="link"
                url="http://www.capital.fr/rss2/feed/fil-bourse.xml"
                processor="XPathEntityProcessor"
                forEach="/rss/channel | /rss/channel/item"
                transformer="DateFormatTransformer, TemplateTransformer"
                onError="continue">
            <field column="source" template="368" commonField="true" />
            <field column="type" template="0" commonField="true" />
           
            <field column="title" xpath="/rss/channel/item/title" />
            <field column="link" xpath="/rss/channel/item/link" />
            <field column="description" xpath="/rss/channel/item/description" />
            <field column="date" xpath="/rss/channel/item/pubDate" dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss z" />
        </entity>
    </document>
</dataConfig>

=> schema.xml

[...]
<fields>
   <field name="source" type="text" indexed="true" stored="true" />
   <field name="title" type="text" indexed="true" stored="true" />
   <field name="link" type="string" indexed="true" stored="true" />
   <field name="description" type="html" indexed="true" stored="true" />
   <field name="date" type="date" indexed="true" stored="true" default="NOW" />
   <field name="type" type="sint" indexed="true" stored="true" />
   <field name="all_text" type="text" indexed="true" stored="false" multiValued="true" />
   <copyField source="source" dest="all_text" />
   <copyField source="title" dest="all_text" />
   <copyField source="description" dest="all_text" />
   <copyField source="date" dest="all_text" />
   <copyField source="type" dest="all_text" />
  
   <!-- Here, default is used to create a "timestamp" field indicating
        when each document was indexed.
   -->
   <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
 
 </fields>

 <uniqueKey>link</uniqueKey>
 
 <defaultSearchField>all_text</defaultSearchField>

 <solrQueryParser defaultOperator="OR"/>
[...]

- Tests :

=> command=full-import&clean=false

25-Sep-2009 14:58:21 org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
25-Sep-2009 14:58:21 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=6
25-Sep-2009 14:58:21 org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
25-Sep-2009 14:58:21 org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
25-Sep-2009 14:58:21 org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
25-Sep-2009 14:58:21 org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
        commit{dir=D:\srv\solr\index,segFN=segments_2s,version=1251453476028,generation=100,filenames=[segments_2s, _3u.cfs, _3u.cfx]
25-Sep-2009 14:58:21 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1251453476028
25-Sep-2009 14:58:22 org.apache.solr.handler.dataimport.DocBuilder finish
INFO: Import completed successfully

=> command=import

25-Sep-2009 14:59:20 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/dataimport params={command=import} status=0 QTime=0
25-Sep-2009 14:59:20 org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties

Any ideas or suggestions?
Thank you in advance!
-- 

Brahim Abdesslam
Director of Operations

* Maecia - /Web development/ *
Mob : +33 (0)6 82 87 31 27
Tel  : +33 (0)9 54 99 29 59
Fax : +33 (0)9 59 99 29 59

http://www.maecia.com


Re: DIH & RSS > 1.4 nightly 2009-09-25 > full-import&clean=false always clean and import command do nothing

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Sat, Sep 26, 2009 at 9:41 PM, Brahim Abdesslam <
brahim.abdesslam@maecia.com> wrote:

>
> On a Linux system, this command:
> curl http://192.168.0.14:8983/solr/dataimport?command=full-import&clean=false
> does not behave like this one:
> curl "http://192.168.0.14:8983/solr/dataimport?command=full-import&clean=false"
>
>
Ah, thanks for clearing that up.
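
A minimal sketch of why the quotes matter, assuming a POSIX shell such as bash or dash. An unquoted & ends the command and runs it in the background, so everything after it never reaches curl. The show_args function is a hypothetical stand-in for curl that only prints the arguments it receives, so no Solr server is needed to try it:

```shell
#!/bin/sh
# show_args is a hypothetical stand-in for curl: it prints each argument
# it receives on its own line.
show_args() {
    for arg in "$@"; do
        printf '%s\n' "$arg"
    done
}

# Unquoted: the shell cuts the line at '&', runs the first part in the
# background, and treats clean=false as a separate variable assignment.
# (The unquoted '?' is also a glob character, another reason to quote.)
unquoted=$(show_args http://192.168.0.14:8983/solr/dataimport?command=full-import&clean=false)

# Quoted: the whole URL reaches the command as a single argument.
quoted=$(show_args "http://192.168.0.14:8983/solr/dataimport?command=full-import&clean=false")

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:   %s\n' "$quoted"
```

In the unquoted case the command sees the URL only up to command=full-import, which is consistent with the params={command=full-import} entry in the Solr log.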


> But we still have a problem with... the famous timestamp: it is always
> updated for each item!
>
> To get the date and time when the item is indexed, we have this field in
> the schema.xml file:
>
> <field name="timestamp" type="date" indexed="true" stored="true"
> default="NOW" />
>
> Do you think all the items are still being updated every time?
>

Well, your full-import with clean=false may still be replacing all existing
documents with new ones, since a document added with an existing uniqueKey
(link in your schema) overwrites the old one. If so, the timestamp would
always be updated. So unless you can index only the new feed items (and not
re-index the existing documents), you will need to use the publication date.

-- 
Regards,
Shalin Shekhar Mangar.

Re: DIH & RSS > 1.4 nightly 2009-09-25 > full-import&clean=false always clean and import command do nothing

Posted by Brahim Abdesslam <br...@maecia.com>.
Shalin Shekhar Mangar wrote:
> On Fri, Sep 25, 2009 at 6:48 PM, Brahim Abdesslam <
> brahim.abdesslam@maecia.com> wrote:
>
>> we are using Solr to index some RSS feeds for a news aggregator application.
>>
>> We've got some difficulties with the publication date of each item because
>> each site uses a homemade date format.
>> What we want is the exact amount of time between the publication date
>> and now.
>
> The fact is that the RSS example is just that, an example. It was never
> meant for production use and it does not handle the variety of date formats
> found in the wild. If you want to index RSS feeds, it is best to use an RSS
> parser to extract out the values. You can use the PlainTextEntityProcessor
> to get the raw RSS feed and write a custom transformer which uses an RSS
> parsing library like ROME to extract the various fields.

Thanks, we will have a look at this if we can't get the timestamp method
working...

>> So we decided to use a timestamp that stores the index time for each item.
>>
>> The problem is:
>>
>>   * when I do a full-import&clean=false the index is always cleaned.
>>   * when I do a simple import, nothing seems to be done.
>
> == snip ==
>
>> - Tests :
>>
>> => command=full-import&clean=false
>>
>> 25-Sep-2009 14:58:21 org.apache.solr.handler.dataimport.SolrWriter
>> readIndexerProperties
>> INFO: Read dataimport.properties
>> 25-Sep-2009 14:58:21 org.apache.solr.core.SolrCore execute
>> INFO: [] webapp=/solr path=/dataimport params={command=full-import}
>> status=0 QTime=6
>>
>>     
>
> See the above parameters. It has only one param: command=full-import. There
> is no clean=false in there so I'm guessing the clean parameter never made it
> to Solr. Can you check again?
>   
You rock! I was running the command without double quotes.

On a Linux system, this command:
curl http://192.168.0.14:8983/solr/dataimport?command=full-import&clean=false
does not behave like this one:
curl "http://192.168.0.14:8983/solr/dataimport?command=full-import&clean=false"

But we still have a problem with... the famous timestamp: it is always
updated for each item!

To get the date and time when the item is indexed, we have this field in
the schema.xml file:

<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" />

Do you think all the items are still being updated every time?

Thank you very much, Shalin!


Re: DIH & RSS > 1.4 nightly 2009-09-25 > full-import&clean=false always clean and import command do nothing

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Fri, Sep 25, 2009 at 6:48 PM, Brahim Abdesslam <
brahim.abdesslam@maecia.com> wrote:

> Hello everybody,
>
> we are using Solr to index some RSS feeds for a news aggregator application.
>
> We've got some difficulties with the publication date of each item because
> each site uses a homemade date format.
> What we want is the exact amount of time between the publication date
> and now.
>
>
The fact is that the RSS example is just that, an example. It was never
meant for production use and it does not handle the variety of date formats
found in the wild. If you want to index RSS feeds, it is best to use an RSS
parser to extract out the values. You can use the PlainTextEntityProcessor
to get the raw RSS feed and write a custom transformer which uses an RSS
parsing library like ROME to extract the various fields.


> So we decided to use a timestamp that stores the index time for each item.
>
> The problem is:
>
>   * when I do a full-import&clean=false the index is always cleaned.
>   * when I do a simple import, nothing seems to be done.
>

== snip ==


>
> - Tests :
>
> => command=full-import&clean=false
>
> 25-Sep-2009 14:58:21 org.apache.solr.handler.dataimport.SolrWriter
> readIndexerProperties
> INFO: Read dataimport.properties
> 25-Sep-2009 14:58:21 org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/dataimport params={command=full-import}
> status=0 QTime=6
>

See the above parameters. It has only one param: command=full-import. There
is no clean=false in there so I'm guessing the clean parameter never made it
to Solr. Can you check again?
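
One way to check what actually reached Solr is to read the params={...} part of the request line in the log. A minimal sketch, assuming a POSIX shell with sed and using the request line from the log quoted above; the variable names are illustrative:

```shell
#!/bin/sh
# A DIH request line as logged by SolrCore, copied from the thread
# (the two wrapped lines joined back into one).
log_line='INFO: [] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=6'

# Extract the text between "params={" and the closing "}".
params=$(printf '%s\n' "$log_line" | sed -n 's/.*params={\([^}]*\)}.*/\1/p')

# Wrap the parameter string in '&' so that every parameter is delimited
# on both sides, then look for clean=false as a whole parameter.
case "&$params&" in
    *"&clean=false&"*) clean_seen=yes ;;
    *)                 clean_seen=no ;;
esac

printf 'params: %s\n' "$params"
printf 'clean=false received: %s\n' "$clean_seen"
```

If clean=false is missing from the logged params, the request was altered before it reached Solr, so the client side (here, the shell command line) is the place to look.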

-- 
Regards,
Shalin Shekhar Mangar.

Re: DIH & RSS > 1.4 nightly 2009-09-25 > full-import&clean=false always clean and import command do nothing

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Fri, Sep 25, 2009 at 6:48 PM, Brahim Abdesslam <
brahim.abdesslam@maecia.com> wrote:

>   * when I do a simple import, nothing seems to be done.
>

That was a bug. It is fixed in trunk now. Thanks!

-- 
Regards,
Shalin Shekhar Mangar.