Posted to common-user@hadoop.apache.org by Joachim Van den Bogaert <jo...@inqa.be> on 2013/04/14 11:11:14 UTC

Test methods IR task on real-time content

Hi all,

I was wondering whether anyone has ever applied information retrieval metrics to real-time big data
where the amount of data varies over time.

The main idea would be to test whether you can find relevant information for a given time frame in two data repositories:
a baseline repository and one with extra content. The question is how to do this in a fair way:
chances are that the repository with extra content will contain more relevant documents than the baseline. So how can you be sure that
finding more relevant documents reflects the quality of your search system and not simply the size of your data repository?
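
To make the concern concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the document IDs, the relevance judgments and the two result lists are hypothetical, and the helper names are my own. It only shows that the raw number of relevant hits can go up with the larger repository while precision and recall, computed against each repository's own set of relevant documents, go down.

```python
def relevant_found(retrieved, relevant):
    """Raw count of relevant documents among the retrieved ones."""
    return len(set(retrieved) & set(relevant))


def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return relevant_found(retrieved, relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of the repository's relevant documents that were retrieved."""
    return relevant_found(retrieved, relevant) / len(relevant) if relevant else 0.0


# Hypothetical relevance judgments for one query and one time frame.
baseline_relevant = {"b1", "b2", "b3", "b4"}
extended_relevant = baseline_relevant | {"e1", "e2", "e3", "e4"}

# Hypothetical result lists returned by the same search system.
baseline_retrieved = ["b1", "b2", "b3", "x1"]
extended_retrieved = ["b1", "b2", "e1", "e2", "x1", "x2", "x3", "x4"]

results = {}
for name, retrieved, relevant in [
    ("baseline", baseline_retrieved, baseline_relevant),
    ("extended", extended_retrieved, extended_relevant),
]:
    results[name] = {
        "found": relevant_found(retrieved, relevant),
        "precision": precision(retrieved, relevant),
        "recall": recall(retrieved, relevant),
    }
    print(name, results[name])
```

With these toy numbers the extended repository yields more relevant documents in absolute terms, yet both ratio-based metrics are lower, which is exactly why the raw count alone does not seem like a fair comparison.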

Regards,
Joachim