Posted to solr-user@lucene.apache.org by Shalin Shekhar Mangar <sh...@gmail.com> on 2009/09/26 17:21:31 UTC

Re: DIH & RSS > 1.4 nightly 2009-09-25 > full-import&clean=false always clean and import command do nothing

On Fri, Sep 25, 2009 at 6:48 PM, Brahim Abdesslam <
brahim.abdesslam@maecia.com> wrote:

> Hello everybody,
>
> we are using Solr to index some RSS feeds for a news aggregator application.
>
> We have had some difficulty with the publication date of each item because
> each site uses its own homemade date format.
> What we want is the exact amount of time elapsed between the publication
> date and the current time.
>
>
The fact is that the RSS example is just that, an example. It was never
meant for production use and it does not handle the variety of date formats
found in the wild. If you want to index RSS feeds, it is best to use an RSS
parser to extract the values. You can use the PlainTextEntityProcessor
to get the raw RSS feed and write a custom transformer that uses an RSS
parsing library such as ROME to extract the various fields.
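
To illustrate the date handling a custom transformer would need, here is a
minimal standalone sketch in Python (not DIH or ROME code). It first tries the
RFC 822 format that RSS 2.0 specifies, then falls back to a list of other
patterns, and returns the time elapsed since publication. The function names,
the fallback formats, and the choice to treat dates without a zone as UTC are
all assumptions made for the example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical fallback patterns; extend with whatever the feeds really use.
FALLBACK_FORMATS = (
    "%Y-%m-%dT%H:%M:%S%z",  # e.g. 2009-09-25T14:58:21+0200
    "%Y-%m-%d %H:%M:%S",    # e.g. 2009-09-25 12:58:21 (no zone, assumed UTC)
    "%d/%m/%Y %H:%M",       # e.g. 25/09/2009 12:58 (no zone, assumed UTC)
)


def parse_pub_date(text):
    """Parse a feed date string and return an aware datetime in UTC."""
    text = text.strip()
    parsed = None
    try:
        # Standard RSS 2.0 dates, e.g. "Fri, 25 Sep 2009 14:58:21 +0200".
        parsed = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        # Not an RFC 822 date: try the fallback patterns in order.
        for fmt in FALLBACK_FORMATS:
            try:
                parsed = datetime.strptime(text, fmt)
                break
            except ValueError:
                continue
    if parsed is None:
        raise ValueError("unrecognized date format: %r" % text)
    if parsed.tzinfo is None:
        # Assumption: a date without zone information is already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def age(text, now):
    """Return the timedelta between the publication date and `now`."""
    return now - parse_pub_date(text)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(age("Fri, 25 Sep 2009 14:58:21 +0200", now))
```

Normalizing everything to UTC before subtracting keeps the elapsed time
independent of the zone each site happens to publish in.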


> So we decided to use a timestamp that stores the index time for each item.
>
> The problems are:
>
>   * when I do a full-import&clean=false, the index is always cleaned.
>   * when I do a simple import, nothing seems to be done.
>

== snip ==


>
> - Tests :
>
> => command=full-import&clean=false
>
> 25-Sep-2009 14:58:21 org.apache.solr.handler.dataimport.SolrWriter
> readIndexerProperties
> INFO: Read dataimport.properties
> 25-Sep-2009 14:58:21 org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/dataimport params={command=full-import}
> status=0 QTime=6
>

Look at the parameters above: there is only one, command=full-import. There
is no clean=false in there, so I'm guessing the clean parameter never made it
to Solr. Can you check again?

-- 
Regards,
Shalin Shekhar Mangar.

Re: DIH & RSS > 1.4 nightly 2009-09-25 > full-import&clean=false always clean and import command do nothing

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

On Sat, Sep 26, 2009 at 9:41 PM, Brahim Abdesslam <
brahim.abdesslam@maecia.com> wrote:

>
> On a Linux system, the command
> curl http://192.168.0.14:8983/solr/dataimport?command=full-import&clean=false
> does not behave like this one:
> curl "http://192.168.0.14:8983/solr/dataimport?command=full-import&clean=false"
>
>
Ah, thanks for clearing that up.
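
The difference comes from the shell, not from Solr: an unquoted & is a control
operator, so the shell ends the curl command there (running it in the
background) and treats clean=false as a separate word. The sketch below
approximates that splitting with Python's shlex module (Python 3.8 or later is
assumed for this combination of options) and then shows which query parameters
survive in each case. The helper names are made up for the illustration.

```python
import shlex
from urllib.parse import parse_qs, urlsplit

URL = "http://192.168.0.14:8983/solr/dataimport?command=full-import&clean=false"


def shell_tokens(command_line):
    """Approximate how a POSIX shell splits a command line into tokens."""
    lexer = shlex.shlex(command_line, posix=True, punctuation_chars=True)
    # Split words on whitespace only, but still break at shell operators.
    lexer.whitespace_split = True
    return list(lexer)


def query_params(url):
    """Return the query parameters a server would see for this URL."""
    return parse_qs(urlsplit(url).query)


# Without quotes, the URL is cut at the & operator.
unquoted = shell_tokens("curl " + URL)
# With double quotes, the whole URL reaches curl as one argument.
quoted = shell_tokens('curl "' + URL + '"')

if __name__ == "__main__":
    print(unquoted)
    print(query_params(unquoted[1]))
    print(query_params(quoted[1]))
```

Single quotes around the URL work just as well, and are the safer habit when
the query string also contains $ or other characters the shell expands.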


> But we still have a problem with... the famous timestamp: it is always
> updated for each item!
>
> To get the date and time at which each item is indexed, we have this field
> in schema.xml:
>
> <field name="timestamp" type="date" indexed="true" stored="true"
> default="NOW" />
>
> Do you think all the items are still being updated every time?
>

Well, your full-import with clean=false may still be replacing all existing
documents with new ones. If so, the timestamp would always be updated. So
unless you can index only the new feeds (and not re-index the existing
documents), you will need to use the publication date.
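
A toy model of that behaviour, in Python, may make the trade-off concrete. It
is not Solr code: ToyIndex, import_all and import_new_only are hypothetical
names, and the only properties modelled are that a document with an existing
key replaces the old one and that a missing timestamp defaults to the time of
the add.

```python
from datetime import datetime, timedelta, timezone


class ToyIndex:
    """Toy stand-in for an index with a unique key and a default timestamp."""

    def __init__(self):
        self.docs = {}

    def add(self, doc, now):
        # A document with an existing key replaces the old one, so the
        # defaulted timestamp is recomputed on every add.
        stored = dict(doc)
        stored.setdefault("timestamp", now)
        self.docs[stored["id"]] = stored


def import_all(index, items, now):
    """Re-add every feed item, as a full import without cleaning would."""
    for item in items:
        index.add(item, now)


def import_new_only(index, items, now):
    """Add only the items whose key is not in the index yet."""
    for item in items:
        if item["id"] not in index.docs:
            index.add(item, now)


if __name__ == "__main__":
    first_run = datetime(2009, 9, 25, 14, 0, tzinfo=timezone.utc)
    second_run = first_run + timedelta(hours=1)
    feed = [{"id": "item-1"}]

    index = ToyIndex()
    import_all(index, feed, first_run)
    import_all(index, feed, second_run)
    print(index.docs["item-1"]["timestamp"])  # time of the second run
```

With import_new_only, the timestamp of item-1 would stay at the first run,
which is the "index only the new feeds" option described above.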

-- 
Regards,
Shalin Shekhar Mangar.

Re: DIH & RSS > 1.4 nightly 2009-09-25 > full-import&clean=false always clean and import command do nothing

Posted by Brahim Abdesslam <br...@maecia.com>.

Shalin Shekhar Mangar wrote:
> On Fri, Sep 25, 2009 at 6:48 PM, Brahim Abdesslam <
> brahim.abdesslam@maecia.com> wrote:
>
>> we are using Solr to index some RSS feeds for a news aggregator application.
>>
>> We have had some difficulty with the publication date of each item because
>> each site uses its own homemade date format.
>> What we want is the exact amount of time elapsed between the publication
>> date and the current time.
>
>   
> The fact is that the RSS example is just that, an example. It was never
> meant for production use and it does not handle the variety of date formats
> found in the wild. If you want to index RSS feeds, it is best to use an RSS
> parser to extract the values. You can use the PlainTextEntityProcessor
> to get the raw RSS feed and write a custom transformer that uses an RSS
> parsing library such as ROME to extract the various fields.
>

Thanks, we will have a look at this if we can't get the timestamp method
working...

>> So we decided to use a timestamp that stores the index time for each item.
>>
>> The problems are:
>>
>>   * when I do a full-import&clean=false, the index is always cleaned.
>>   * when I do a simple import, nothing seems to be done.
>>
>
> == snip ==
>
>
>> - Tests :
>>
>> => command=full-import&clean=false
>>
>> 25-Sep-2009 14:58:21 org.apache.solr.handler.dataimport.SolrWriter
>> readIndexerProperties
>> INFO: Read dataimport.properties
>> 25-Sep-2009 14:58:21 org.apache.solr.core.SolrCore execute
>> INFO: [] webapp=/solr path=/dataimport params={command=full-import}
>> status=0 QTime=6
>>
>>     
>
> Look at the parameters above: there is only one, command=full-import. There
> is no clean=false in there, so I'm guessing the clean parameter never made it
> to Solr. Can you check again?
>

You rock! I was running the command without double quotes.

On a Linux system, the command
curl http://192.168.0.14:8983/solr/dataimport?command=full-import&clean=false
does not behave like this one:
curl "http://192.168.0.14:8983/solr/dataimport?command=full-import&clean=false"

But we still have a problem with... the famous timestamp: it is always
updated for each item!

To get the date and time at which each item is indexed, we have this field
in schema.xml:

<field name="timestamp" type="date" indexed="true" stored="true" 
default="NOW" />

Do you think all the items are still being updated every time?

Thank you very much, Shalin!