Posted to dev@datafu.apache.org by "jian wang (JIRA)" <ji...@apache.org> on 2014/08/23 17:10:10 UTC

[jira] [Comment Edited] (DATAFU-21) Probability weighted sampling without reservoir

    [ https://issues.apache.org/jira/browse/DATAFU-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075634#comment-14075634 ] 

jian wang edited comment on DATAFU-21 at 8/23/14 3:08 PM:
----------------------------------------------------------

Further investigation updates:

We still consider rejecting items by applying Maurer's lemma and accepting items by applying Bernstein's lemma,
but we switch to a different key associated with each item to simplify the computation and avoid discretization.

The new key I plan to use for each item is X(j) = U * f(w(j)), where U is a uniform random variable on (0, 1) and f(w(j)) is a function of w(j) whose value lies in (0, 1), so X(j) also lies in (0, 1). The f I choose is f(w(j)) = 1.0 / (1.0 + w(j) / upperBound), where upperBound is the upper limit of w(j), approximated as max{w(j), j = 1 to n}; thus 1/f(w(j)) = 1.0 + w(j) / upperBound. Note: f(w(j)) is in [0.5, 1), 1 / f(w(j)) is in (1, 2], and X(j) = U * f(w(j)) is in (0, 1).

The reason for using upperBound is to normalize the weights into the (0, 1] range, so that extremely large absolute weights cannot make f(w(j)) extremely small and thus 1/f(w(j)) extremely large.
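As a concrete illustration, here is a minimal Python sketch of the key generation (this is not the DataFu implementation; the function names are mine):

```python
import random

def f(weight, upper_bound):
    # f(w) = 1.0 / (1.0 + w / upperBound); for 0 < w <= upperBound
    # this lies in [0.5, 1), so 1/f(w) lies in (1, 2]
    return 1.0 / (1.0 + weight / upper_bound)

def item_key(weight, upper_bound, rng=random.random):
    # X(j) = U * f(w(j)) with U uniform on (0, 1); heavier items get
    # smaller keys on average, so they are more likely to fall below
    # an acceptance threshold
    return rng() * f(weight, upper_bound)
```

For example, an item whose weight equals upperBound has f(w) = 0.5, so its key is uniform on (0, 0.5).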

Apply Maurer's lemma:

Find q1 with 0 < q1 < 1 such that we reject items whose key is greater than q1.

let Y(j) = 1 if (X(j) < q1),  Y(j) = 0 otherwise {j = 1 to n}

Y(j) are independent random variables.

E(Y(j)) = Pr(X(j) < q1) * 1 + Pr(X(j) >= q1) * 0
        = Pr(U * f(w(j)) < q1) = Pr(U < q1 / f(w(j)))
        = q1 / f(w(j))

E(Y(j) ^ 2) = E(Y(j)) = q1 / f(w(j))

set Y = sum(Y(j), j = 1 to n), which is the number of items whose associated random key is less than q1,

we have E(Y) = sum(E(Y(j))) = q1 * sum(1/f(w(j)), j = 1 to n)

we want Pr{Y <= p * n} <= err, which can be rewritten as Pr{E(Y) - Y >= E(Y) - p * n} <= err,

apply Maurer's lemma, we get

      log(Pr{E(Y) - Y >= E(Y) - p * n}) <= - t * t / (2 * sum(E(Y(j) ^ 2), j = 1 to n)) <= log(err),

      where t = E(Y) - p * n = q1 * sum(1/f(w(j)), j = 1 to n) - p * n
                                              
Finally we get q1 >= { p * n - log(err) + sqrt(log(err) * log(err) - 2 * p * n * log(err)) } / { sum(1 / f(w(j)), j = 1 to n) }.

Using the f(w(j)) defined at the beginning of this comment, { sum(1 / f(w(j)), j = 1 to n) } = sum((1.0 + w(j) / upperBound), j = 1 to n) = n + sum(w(j), j = 1 to n) / upperBound

q1 >= { p * n - log(err) + sqrt(log(err) * log(err) - 2 * p * n * log(err)) } / { n + (sum(w(j), j = 1 to n) / upperBound) }
       = p * (n / { n + (sum(w(j), j = 1 to n) / upperBound) })
          + { - log(err) + sqrt(log(err) * log(err) - 2 * p * n * log(err)) } / { n + (sum(w(j), j = 1 to n) / upperBound) }

Since n / { n + (sum(w(j), j = 1 to n) / upperBound) } <= 1, replacing the first term with p only increases the value, so the resulting q1 still satisfies the bound. The final q1 we use is: p + ({ - log(err) + sqrt(log(err) * log(err) - 2 * p * n * log(err)) } / { n + (sum(w(j), j = 1 to n) / upperBound) })
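A small Python sketch of this rejection threshold (illustrative only; `weight_sum` stands for sum(w(j), j = 1 to n) and the function name is mine):

```python
import math

def rejection_threshold_q1(p, n, err, weight_sum, upper_bound):
    # q1 = p + (-log(err) + sqrt(log(err)^2 - 2*p*n*log(err)))
    #          / (n + weight_sum / upper_bound)
    # note log(err) < 0 for err < 1, so the sqrt argument is positive
    log_err = math.log(err)
    s = n + weight_sum / upper_bound  # equals sum(1 / f(w(j)), j = 1 to n)
    return p + (-log_err + math.sqrt(log_err * log_err - 2.0 * p * n * log_err)) / s
```

Since the correction term grows like sqrt(n) while the denominator grows like n, q1 tends to p from above as more items are processed.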

Apply Bernstein's lemma:

Find q2 with 0 < q2 < 1 such that we accept items whose key is less than q2.

let Z(j) = 1 if (X(j) < q2),  Z(j) = 0 otherwise {j = 1 to n}

Z(j) are independent random variables. 

E(Z(j)) = Pr{X(j) < q2} = Pr{U * f(w(j)) < q2} = Pr{U < q2 / f(w(j)) } = q2 / f(w(j))

E(Z(j) ^ 2) = E(Z(j))

Z(j) - E(Z(j)) = Z(j) - q2 / f(w(j)) <= M = 1

theta(j)^2 = E(Z(j) ^ 2) - E(Z(j)) * E(Z(j)) <= E(Z(j) ^ 2) = q2 / f(w(j))

set Z = sum(Z(j), j = 1 to n), which is the number of items whose associated random key is less than q2,

we have E(Z) = sum(E(Z(j)), j = 1 to n) = sum(q2 / f(w(j)), j = 1 to n) = q2 * sum(1/f(w(j)), j = 1 to n)

we want Pr{Z >= p * n} <= err, which can be rewritten as Pr{Z - E(Z) >= p * n - E(Z)} <= err,

let t = p * n - E(Z) = p * n - q2 * sum(1/f(w(j)), j = 1 to n) > 0.

apply Bernstein's lemma, we get

       log(Pr{Z - E(Z) >= p * n - E(Z)}) <= - t * t / (2 * sum(theta(j) ^ 2, j = 1 to n) + 2 * M * t / 3) <= log(err)

Finally we get q2 <= { p * n - 2 * log(err) / 3 - 2 * sqrt( log(err) * log(err) - 9 * p * n * log(err) / 2) / 3 } / sum(1 / f(w(j)), j = 1 to n).
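The matching acceptance threshold, in the same illustrative Python style (function name mine; `weight_sum` again stands for sum(w(j), j = 1 to n)):

```python
import math

def acceptance_threshold_q2(p, n, err, weight_sum, upper_bound):
    # q2 = (p*n - (2/3)*log(err) - (2/3)*sqrt(log(err)^2 - (9/2)*p*n*log(err)))
    #          / (n + weight_sum / upper_bound)
    log_err = math.log(err)
    s = n + weight_sum / upper_bound  # equals sum(1 / f(w(j)), j = 1 to n)
    return (p * n - 2.0 * log_err / 3.0
            - (2.0 / 3.0) * math.sqrt(log_err * log_err - 4.5 * p * n * log_err)) / s
```

For the same parameters q2 sits below p while q1 sits above it, so items with keys in [q2, q1] remain undecided after this pass.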

Observation:
                     (1) q1 decreases as more items are processed;
                     (2) q2 increases as more items are processed;
                     (3) Compared with a chosen baseline weighted sampling algorithm, the distribution of sampling probability relative to item weight is similar, but items with higher weights are sampled relatively more frequently under the baseline algorithm.

Here is a report of the simulation experiment: https://docs.google.com/document/d/1oCNr7yoIFjwnXx8-24mmV4jY_9X4fdIWJzNW1TTMnqI/edit#. Please provide feedback on the experimental results.

Conclusion: 
                   (1) We could reject items whose keys are greater than q1 and accept items whose keys are smaller than q2, given a global upper bound on item weights;

                   (2) At present we require the user to supply the global upper bound of item weights. We may revisit whether this is strictly necessary or could be handled another way;
 
                   (3) The sampling probability is relatively higher for items with higher weights than for lower-weight items, but the probability difference between higher-weight and lower-weight items is less pronounced than in our chosen baseline weighted sampling algorithm.
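Putting the pieces together, one accept/reject pass could be simulated as follows (again an illustrative Python sketch under the assumptions above, not the DataFu UDF; the thresholds are the q1/q2 formulas derived earlier):

```python
import math
import random

def weighted_sample_pass(weights, p, err, upper_bound, rng=random.random):
    # one streaming pass: accept keys below q2, reject keys above q1,
    # keep the undecided middle band [q2, q1] for a final selection step
    n = len(weights)
    s = n + sum(weights) / upper_bound       # sum(1 / f(w(j)), j = 1 to n)
    log_err = math.log(err)
    q1 = p + (-log_err + math.sqrt(log_err ** 2 - 2.0 * p * n * log_err)) / s
    q2 = (p * n - 2.0 * log_err / 3.0
          - (2.0 / 3.0) * math.sqrt(log_err ** 2 - 4.5 * p * n * log_err)) / s
    accepted, undecided = [], []
    for j, w in enumerate(weights):
        x = rng() * (1.0 / (1.0 + w / upper_bound))   # key X(j) = U * f(w(j))
        if x < q2:
            accepted.append(j)
        elif x <= q1:
            undecided.append((x, j))
    return accepted, undecided, q1, q2
```

Only the comparatively small undecided band needs a further decision (e.g. an exact selection against the target size p * n), which is what allows dropping the reservoir.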



> Probability weighted sampling without reservoir
> -----------------------------------------------
>
>                 Key: DATAFU-21
>                 URL: https://issues.apache.org/jira/browse/DATAFU-21
>             Project: DataFu
>          Issue Type: New Feature
>         Environment: Mac OS, Linux
>            Reporter: jian wang
>            Assignee: jian wang
>
> This issue is used to track investigation on finding a weighted sampler without using an internal reservoir. 
> At present, SimpleRandomSample implements a good acceptance-rejection sampling algorithm for probability random sampling. The weighted sampler could reuse it with a slight modification.
> One slight modification is: the present simple random sample generates a uniform random number in (0, 1) as the random variable used to accept or reject an item. The weighted sampler may generate this random variable based on the item's weight, such that it still lies in (0, 1) and the items' random variables remain independent of each other.
> We need to think further about and experiment with the correctness of this solution and how to implement it efficiently.



--
This message was sent by Atlassian JIRA
(v6.2#6252)