Posted to solr-user@lucene.apache.org by Pulkit Singhal <pu...@gmail.com> on 2011/09/18 01:20:03 UTC

Miscellaneous DIH related questions

My DIH's full-import logs end with trailing output saying that 1500
documents were added, which is correct because I have 16 sources,
one of them was down, and each source is supposed to give me 100
results:
(1500 adds)],optimize=} 0 0

But when I check my document count, I get only 1384 results:
INFO: [rss] webapp=/solr path=/select params={start=0&q=*:*&rows=0}
hits=1384 status=0 QTime=0

1) I think I may have duplicates based on the primary key for the data
coming in. Is there any other explanation than that?
2) Is there some way to get a log of how many documents were deleted?
Because an update does a delete then add, this would allow me to make
sure of what is going on.
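
A minimal sketch of one way to test the duplicate theory in (1), assuming the primary key values of the incoming items can be collected into a list before they reach Solr. If duplicates are the whole story, 1500 - 1384 = 116 adds would have overwritten an earlier document. The function names here are hypothetical and are not part of DIH.

```python
from collections import Counter


def summarize_keys(keys):
    """Return (total, unique, overwritten) for a list of primary key values.

    'overwritten' is how many adds would replace an earlier document
    that carried the same key.
    """
    counts = Counter(keys)
    total = len(keys)
    unique = len(counts)
    return total, unique, total - unique


def repeated_keys(keys):
    """Return {key: occurrences} for every key seen more than once."""
    return {key: n for key, n in Counter(keys).items() if n > 1}


if __name__ == "__main__":
    # Hypothetical keys; real ones would come from the feeds themselves.
    sample = ["a", "b", "a", "c", "b", "a"]
    print(summarize_keys(sample))
    print(repeated_keys(sample))
```

If the unique count over all feeds comes out at 1384, duplicates account for the full gap and no other explanation is needed.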

The sources I have are URL based; sometimes they appear to be down,
I suppose because the request gets denied:
SEVERE: Exception thrown while getting data
java.io.FileNotFoundException:
http://www.amazon.com/rss/tag/anime/popular/ref=tag_tdp_rss_pop_man?length=100
Caused by: java.io.FileNotFoundException:
http://www.amazon.com/rss/tag/anime/popular/ref=tag_tdp_rss_pop_man?length=100
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1434)

3) Is there some way to configure the datasource to retry 3 times or
something like that? I have increased the values for connectionTimeout
and readTimeout, but that doesn't help when the server simply
denies the request due to heavy load. I need to be able to retry at
those times. The onError attribute has only the abort, skip, and
continue options, none of which really lets me retry anything.

Thank You.
- Pulkit

Re: Miscellaneous DIH related questions

Posted by pu...@gmail.com.

No cron job, I'm just clicking the full-import button on the dataimport.jsp page.

1) Can you point me to the code in Solr where such retry functionality should be added? I might be able to contribute.
2) What is a good place to add the Java-based scheduling? Again, I'll test and share if I succeed.

- Pulkit

On Sep 18, 2011, at 12:37 AM, Gora Mohanty <go...@mimirtech.com> wrote:

> On Sun, Sep 18, 2011 at 4:50 AM, Pulkit Singhal <pu...@gmail.com> wrote:
> [...]
>> 3) Is there some way to configure the datasource to retry 3 times or
>> something like that? I have increased the values for connectionTimeout
>> and readTimeout, but that doesn't help when the server simply
>> denies the request due to heavy load. I need to be able to retry at
>> those times. The onError attribute has only the abort, skip, and
>> continue options, none of which really lets me retry anything.
> [...]
> 
> Don't think that there is a built-in feature for this, though it sounds like
> it would be useful.
> 
> I presume that you are scheduling your imports through cron, or
> something like that. One possibility then would be to have the script
> check the status of the import, and retry if needed.
> 
> Regards,
> Gora

Re: Miscellaneous DIH related questions

Posted by Gora Mohanty <go...@mimirtech.com>.

On Sun, Sep 18, 2011 at 4:50 AM, Pulkit Singhal <pu...@gmail.com> wrote:
[...]
> 3) Is there some way to configure the datasource to retry 3 times or
> something like that? I have increased the values for connectionTimeout
> and readTimeout, but that doesn't help when the server simply
> denies the request due to heavy load. I need to be able to retry at
> those times. The onError attribute has only the abort, skip, and
> continue options, none of which really lets me retry anything.
[...]

Don't think that there is a built-in feature for this, though it sounds like
it would be useful.

I presume that you are scheduling your imports through cron, or
something like that. One possibility then would be to have the script
check the status of the import, and retry if needed.
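
A rough sketch of such a script, under several assumptions: the handler is registered at /dataimport on a core named rss (the core name is taken from the log line in the original post; the host and port are guesses), the handler answers command=status with wt=json, and the response carries a top-level "status" field that reads "busy" while an import is running. The function names and the is_ok callback are hypothetical; what counts as a good import is left to the caller.

```python
import json
import time
import urllib.request

# Assumed host, port, core name and handler path; adjust for the real setup.
DIH_URL = "http://localhost:8983/solr/rss/dataimport"


def http_get_json(url):
    """Fetch a URL and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def import_with_retry(is_ok, base_url=DIH_URL, attempts=3, poll_secs=5.0,
                      get=http_get_json, sleep=time.sleep):
    """Run full-import up to `attempts` times.

    Returns the first status response accepted by is_ok(status),
    or None if every attempt was rejected.
    """
    for attempt in range(1, attempts + 1):
        # full-import returns right away, so the status has to be polled.
        get(base_url + "?command=full-import&wt=json")
        status = get(base_url + "?command=status&wt=json")
        while status.get("status") == "busy":
            sleep(poll_secs)
            status = get(base_url + "?command=status&wt=json")
        if is_ok(status):
            return status
        if attempt < attempts:
            # Wait a little longer before each new attempt.
            sleep(poll_secs * attempt)
    return None
```

One possible is_ok would compare a document count from the status messages against the expected total, for example `lambda s: int(s.get("statusMessages", {}).get("Total Documents Processed", 0)) >= 1600`; the exact key name should be checked against the real status output. Note that each retry reruns the whole import rather than only the source that failed.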

Regards,
Gora

Re: Miscellaneous DIH related questions

Posted by Erick Erickson <er...@gmail.com>.

For (2), look at your admin/stats page. The difference between maxDoc and
numDocs is the number of documents that have been deleted from your
index...
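
As a small sketch of that arithmetic, assuming the two counters have been read off the stats page (or from some other index-info response) into a dictionary with the keys numDocs and maxDoc. The function name and the sample numbers are hypothetical.

```python
def deleted_docs(index_info):
    """Return how many deleted documents are still held in the index.

    index_info is a dict with integer 'numDocs' (live documents) and
    'maxDoc' (live plus deleted-but-not-yet-purged documents).
    """
    num_docs = int(index_info["numDocs"])
    max_doc = int(index_info["maxDoc"])
    if max_doc < num_docs:
        raise ValueError("maxDoc should never be smaller than numDocs")
    return max_doc - num_docs


if __name__ == "__main__":
    # Hypothetical reading: 1500 adds into an empty index, 1384 live documents.
    print(deleted_docs({"numDocs": 1384, "maxDoc": 1500}))  # prints 116
```

Keep in mind that an optimize purges deleted documents, so after one the two counters match and the difference is zero; the numbers are only informative if read before the optimize runs.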

For (3), I don't have a clue.

Best
Erick

On Sat, Sep 17, 2011 at 7:20 PM, Pulkit Singhal <pu...@gmail.com> wrote:
> My DIH's full-import logs end with trailing output saying that 1500
> documents were added, which is correct because I have 16 sources,
> one of them was down, and each source is supposed to give me 100
> results:
> (1500 adds)],optimize=} 0 0
>
> But when I check my document count, I get only 1384 results:
> INFO: [rss] webapp=/solr path=/select params={start=0&q=*:*&rows=0}
> hits=1384 status=0 QTime=0
>
> 1) I think I may have duplicates based on the primary key for the data
> coming in. Is there any other explanation than that?
> 2) Is there some way to get a log of how many documents were deleted?
> Because an update does a delete then add, this would allow me to make
> sure of what is going on.
>
> The sources I have are URL based; sometimes they appear to be down,
> I suppose because the request gets denied:
> SEVERE: Exception thrown while getting data
> java.io.FileNotFoundException:
> http://www.amazon.com/rss/tag/anime/popular/ref=tag_tdp_rss_pop_man?length=100
> Caused by: java.io.FileNotFoundException:
> http://www.amazon.com/rss/tag/anime/popular/ref=tag_tdp_rss_pop_man?length=100
>        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1434)
>
> 3) Is there some way to configure the datasource to retry 3 times or
> something like that? I have increased the values for connectionTimeout
> and readTimeout, but that doesn't help when the server simply
> denies the request due to heavy load. I need to be able to retry at
> those times. The onError attribute has only the abort, skip, and
> continue options, none of which really lets me retry anything.
>
> Thank You.
> - Pulkit
>