
AB TEST conversion test

최영민85 2018. 10. 19. 12:12

http://math.tut.fi/~ruohonen/S_1.pdf

Test: the Difference between Two Means

• https://abtestguide.com/calc/

Hypothesis testing steps:

1) Define the null hypothesis H0, e.g., the two samples belong to the same population, or there is no trend. Usually we would like to reject it.

2) Choose a test statistic for the given data, e.g., the mean or a trend, and a test level α, e.g., 5%.

3) Consider or create the null distribution: assume H0 is true, and obtain the distribution of the test statistic under H0.

4) Compare the test statistic to the null distribution. Obtain the probability p of observing a test statistic at least this extreme under the null distribution. If this p-value is less than the test level, p < α, then the null hypothesis is rejected.
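The steps above can be sketched with an exact permutation test, which builds the null distribution directly from the data. This is only an illustration: the sample values and the function name `permutation_p_value` are made up, and a permutation test is just one of several ways to obtain a null distribution.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(a, b):
    """Two-sided exact permutation test for a difference in means."""
    pooled = a + b
    n_a = len(a)
    observed = mean(a) - mean(b)  # test statistic on the real data

    # Null distribution: under H0 the group labels are exchangeable, so
    # every way of picking n_a values for group A is equally likely.
    extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = mean(group_a) - mean(group_b)
        # Small tolerance so that ties with the observed value are counted.
        if abs(diff) >= abs(observed) - 1e-9:
            extreme += 1
        total += 1
    return observed, extreme / total


# Hypothetical measurements for two groups (illustrative only).
group_a = [12.1, 11.8, 12.6, 12.9, 12.4]
group_b = [11.2, 11.5, 11.9, 11.1, 12.0]

alpha = 0.05  # test level
observed, p_value = permutation_p_value(group_a, group_b)
reject_h0 = p_value < alpha
print(f"observed difference={observed:.2f}, p={p_value:.4f}, reject H0: {reject_h0}")
```

With only five values per group the full enumeration covers 252 label assignments, so the p-value is exact rather than simulated.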

Distribution of the test statistic

case1: Difference between Two Means (z-test)

standard normal (Gauss) distribution, used if you know the variance of the population

case2: Difference between Two Means (sigma unknown, t-test)

Student's t distribution with n-1 d.o.f., used if you estimate the standard deviation from the sample

case3: Proportion
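For the proportion case, a minimal sketch of a two-sided z-test on two conversion rates could look like the following. It assumes samples large enough for the normal approximation and uses the pooled rate under H0; the function name and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under H0 both groups share one rate, so the variance uses the pooled rate.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 100/1000 conversions in control, 130/1000 in the variant.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z={z:.3f}, p={p_value:.4f}")
```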

Power

- probability that a statistical test will reject a false null hypothesis (H0) when the alternative hypothesis (H1) is true.

Critical Value

In hypothesis testing, a critical value is a point on the test distribution that is compared to the test statistic to determine whether to reject the null hypothesis.

• Example of test statistic: t-value

If the absolute value of your test statistic is greater than the critical value, you can declare statistical significance and reject the null hypothesis.

• Example: |t-value| > critical t-value
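As a small illustration of the comparison against a critical value, the sketch below uses the standard normal distribution, because the Python standard library has no t quantile function; for large samples the t critical value is close to it. The helper names `z_critical` and `is_significant` are made up for this example.

```python
from statistics import NormalDist


def z_critical(alpha, two_sided=True):
    """Critical value of the standard normal distribution for level alpha."""
    tail = alpha / 2 if two_sided else alpha
    return NormalDist().inv_cdf(1 - tail)


def is_significant(test_statistic, alpha=0.05):
    """Reject H0 when |test statistic| exceeds the two-sided critical value."""
    return abs(test_statistic) > z_critical(alpha)


critical = z_critical(0.05)
print(round(critical, 2))
print(is_significant(2.10), is_significant(1.50))
```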

α: the threshold value that we measure p-values against

• For results with a 95% level of confidence: α = 0.05

• = probability of a type I error

• p-value: probability, assuming H0 is true, of observing a statistic at least as extreme as the one observed

• Statistical significance: comparison between α and the p-value

• p-value < 0.05: reject H0; p-value ≥ 0.05: fail to reject H0

Type II error (β) is the failure to reject a false H0

• Direct relationship between power and type II error: Power = 1 – β

• Example: β = 0.2 gives Power = 1 – β = 0.8 (80%)

The effect size

It depends on the type of difference and the data

Easy example: comparison between 2 means

• The bigger the effect (the absolute difference), the bigger the power

• = the bigger the probability of picking up the difference
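The link between effect size and power can be sketched for the simplest setting: a two-sided z-test on two means with a known, equal standard deviation and equal group sizes. These are assumptions of the sketch, and the function name `power_two_means` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_means(delta, sigma, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample z-test with known, equal sigma."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Standard error of the difference between two sample means.
    se = sigma * sqrt(2 / n_per_group)
    shift = delta / se
    # Probability that the statistic lands in either rejection region under H1.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


for delta in (0.1, 0.3, 0.5):
    power = power_two_means(delta, sigma=1.0, n_per_group=100)
    print(delta, round(power, 3), "beta =", round(1 - power, 3))
```

With the sample size held fixed, the printed power grows as the absolute difference grows, and β shrinks accordingly.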

Multiple testing reference

https://web.williams.edu/Mathematics/sjmiller/public_html/BrownClasses/162/Handouts/StatsTests04.pdf (Section 4.2)

Determine Sample size

lecture11.ppt

When the hypotheses are as given below, the sample size is obtained by solving equations (1) and (2) simultaneously.

the difference of two means

Proportions of two groups

Derivation...
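Equations (1) and (2) are not reproduced in these notes, so the sketch below falls back on the common textbook result for a two-sided test with equal group sizes: n per group = (z_{1-α/2} + z_{1-β})² × (variance term) / (difference)². It may differ in detail from the formulas in the lecture, and the function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def _z_sum_squared(alpha, power):
    """(z_{1-alpha/2} + z_{power})^2 for a two-sided test."""
    std_normal = NormalDist()
    return (std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)) ** 2


def n_for_two_means(delta, sigma, alpha=0.05, power=0.8):
    """Per-group sample size to detect a mean difference delta (known sigma)."""
    return ceil(2 * sigma ** 2 * _z_sum_squared(alpha, power) / delta ** 2)


def n_for_two_proportions(p1, p2, alpha=0.05, power=0.8):
    """Per-group sample size to detect a difference between rates p1 and p2."""
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(variance_sum * _z_sum_squared(alpha, power) / (p1 - p2) ** 2)


print(n_for_two_means(delta=0.5, sigma=1.0))
print(n_for_two_proportions(0.10, 0.12))
```

Halving the detectable difference roughly quadruples the required sample size, because the difference enters squared in the denominator.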

Winner's Curse: Bias Estimation for Total Effects of Features in Online Controlled Experiments (review)

Winner's Curse.pdf

Blog : https://medium.com/airbnb-engineering/selection-bias-in-online-experimentation-c3d67795cceb

What Is The Winner’s Curse?

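- Airbnb ran six experiments sequentially over several months

- The bottom-up result adds up the lifts measured in the individual A/B tests

- The split-holdout result is the final lift measured before and after the six experiments (the aggregated total effect); it differs from the bottom-up result as shown in the graph below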

Are You Suffering From Winner’s Curse?

Winner's curse: a phenomenon in common value auctions where the winner tends to overpay for the item

To illustrate the winner's curse, consider an example: 10 experiments were run, and the standard deviation of each experiment's result is fixed at 1%.

Each experiment was run independently with a sufficiently large sample. The test result is the observed effect (test statistic), and the row below it is the actual true effect.

With a t-test at significance level 0.05, an observed effect greater than the critical value of 1.96 indicates a significant effect (shown in red).

The total observed effect (bottom-up) of the three selected experiments is 2.7% + 2.6% + 3.3% = 8.6%. The total true effect, however, is 1% + 1% + 4% = 6%, so the upward bias in this case is 2.6%.

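When there is a true effect, i.e., it is greater than 0, two cases are possible: the true effect is smaller than the t-test critical value of 1.96, or it is larger.

Because the observed effect can take any value from a normal distribution whose mean is the true effect, it can be either above or below 1.96 even when a true effect exists.

In Case A, only observed effects greater than 1.96 are selected, so an upward bias always occurs. In Case B, by contrast, an observed effect smaller than the true effect can still be above 1.96 and be selected, so the bias can be upward or downward. Indeed, in the example above the 10th test was selected with an observed effect of 3.3% although its true effect is 4%. If only the 10th test were selected, the total effect could also be biased downward.

The authors therefore argue that the winner's curse occurs on average, and the proof is as follows. In both cases, however, this is the expected result, because only the upper part of all possible outcomes (region a in the graph above) is cut out and averaged.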
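The claim that the bias appears on average can be checked with a small Monte Carlo sketch. The true effects used here are hypothetical (the notes only give three of the ten values), the standard deviation is 1% and the threshold is 1.96 as in the example, and the function name is made up.

```python
import random

CRITICAL = 1.96  # selection threshold on the observed effect, in %
SIGMA = 1.0      # standard deviation of each observed effect, in %

# Hypothetical true effects in % (illustrative values only).
TRUE_EFFECTS = [0.0] * 7 + [1.0, 1.0, 4.0]


def average_bottom_up_bias(true_effects, n_trials=20000, seed=0):
    """Average of (sum of observed - sum of true) over the selected experiments."""
    rng = random.Random(seed)
    total_bias = 0.0
    for _ in range(n_trials):
        for true in true_effects:
            observed = rng.gauss(true, SIGMA)
            if observed > CRITICAL:  # only the "winners" are summed up
                total_bias += observed - true
    return total_bias / n_trials


bias = average_bottom_up_bias(TRUE_EFFECTS)
print(f"average upward bias of the bottom-up total: {bias:.2f}%")
```

A single selected experiment can come out below its true effect, but averaged over many repetitions the bottom-up total overstates the true total.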

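The selection bias varies with the value of the true effect; the result is shown below.

When the true effect is below 1.96, the selection bias grows as the true effect approaches 1.96; when it is above 1.96, the bias gradually decreases.

What if we derive it? Plotting the derived expression gives the same curve.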
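One way to write such a derivation is the following. It assumes that the observed effect is X ~ N(a, σ²) with true effect a, that an experiment is selected when X > b (b = 1.96 when σ = 1), and that the selection bias is the overshoot X − a counted only when the experiment is selected. The notation is chosen here for illustration and is not necessarily the paper's.

```latex
\begin{aligned}
\text{bias}(a)
  &= \mathbb{E}\big[(X - a)\,\mathbf{1}\{X > b\}\big],
     \qquad X \sim \mathcal{N}(a, \sigma^2) \\
  &= \sigma \int_{(b-a)/\sigma}^{\infty} z\,\varphi(z)\,dz
     \qquad \big(z = (X - a)/\sigma\big) \\
  &= \sigma\,\varphi\!\left(\frac{b - a}{\sigma}\right),
     \qquad \text{since } \int_{c}^{\infty} z\,\varphi(z)\,dz = \varphi(c).
\end{aligned}
```

Because the standard normal density φ peaks at 0, this expression is largest when a = b and falls off on both sides, which agrees with the behaviour described above.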

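The smaller the p-value, the smaller the selection bias.

This can be seen by replacing 1.96 in the expression derived above with the critical value that corresponds to each p-value and plotting the result.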
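A minimal numeric sketch of that substitution, assuming the selection bias of one experiment is E[(X − a)·1{X > b}] = σ·φ((b − a)/σ) for X ~ N(a, σ²), with b the two-sided critical value for the chosen threshold. The function name `selection_bias` is hypothetical.

```python
from statistics import NormalDist


def selection_bias(true_effect, alpha, sigma=1.0):
    """sigma * phi((b - a) / sigma), with b the two-sided critical value."""
    std_normal = NormalDist()
    b = sigma * std_normal.inv_cdf(1 - alpha / 2)
    return sigma * std_normal.pdf((b - true_effect) / sigma)


# True effect of 0: the bias shrinks as the p-value threshold shrinks.
for alpha in (0.10, 0.05, 0.01):
    print(alpha, round(selection_bias(0.0, alpha), 4))
```

The loop covers the case of a zero true effect. For true effects well above the threshold the same expression can move in the other direction, so the sketch illustrates the small-effect case only.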

Okay, What Shall We Do Then?

- Reduce the variance of the observed effect, i.e., increase the sample size

- Repeat each experiment, average the results to estimate the true effect, and take that average as the actual effect

A Comparison of Approaches to Advertising Measurement: Review

Terminology

- Control group: not exposed to ads during the test period

- Test group: ads are requested during the test period

Ad effectiveness metrics : how we report the effectiveness of ad campaigns

Suppose that

- 0.8% of users in the control group purchased during a hypothetical study period

- 1.2% of users in the test group purchased during a hypothetical study period

=> Is it right to say “exposure to ads increased the share of consumers buying by 0.4 percentage points, or an increase in purchase likelihood of 50%”?

=> No, not all consumers who were assigned to the test group were exposed to ads during the study.

The incremental conversion rate (ICR) is the actual conversion rate minus the counterfactual conversion rate, 1.8% - 1.0% = 0.8 percentage points in our example.
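The arithmetic behind these figures can be spelled out as below. The rates are the ones quoted in the example; the variable names are chosen for illustration.

```python
# Rates from the example, in percent.
control_rate = 0.8         # control group purchase rate
test_rate = 1.2            # test group purchase rate (exposed or not)
actual_rate = 1.8          # actual conversion rate used for the ICR
counterfactual_rate = 1.0  # counterfactual conversion rate used for the ICR

# Group-level comparison: difference in percentage points and relative change.
group_diff = test_rate - control_rate
group_relative = group_diff / control_rate

# Incremental conversion rate: actual minus counterfactual.
icr = actual_rate - counterfactual_rate

print(f"test vs control: {group_diff:.1f} pp ({group_relative:.0%} relative)")
print(f"ICR: {icr:.1f} pp")
```

The 0.4 percentage point figure compares whole groups, while the ICR compares an actual rate with its counterfactual, which is why the two numbers differ.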

A Comparison of Approaches to Advertising Measurement: Review

Introduction

Facebook lift study : A/B test with two important differences

- The control group: scaled so that the test and control groups are the same size

- Reached audience: members of the test group who are shown the advert at least once during the test period

- Unreached audience: members who have not seen the advert during the test period

-> The activity of the unreached audience introduces variance that is not present in a standard A/B test

- Multi-Cell: the target population is split into multiple cells each with a control and test group of their own, as illustrated in Figure 1

-> used to compare two marketing strategies where the target audience exhibits a selection bias

Summary

•We derive the statistical power and required sample size for Facebook lift studies, bridging the gap between the online controlled experimental literature and the reality on measuring incrementality on Facebook.

•We generalise the results to multi-cell lift studies, where incrementalities under different strategies are compared against each other.

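How does Facebook calculate incrementality and lift?

The conversions of both the control and the test group include the reached (R) and the unreached (U) audiences.

The conversion rates of the reached (R) and unreached (U) audiences are assumed to be the same.

The reach r, the fraction of the test group that saw the ad, is assumed to be the same in the control group.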

In the control group the conversion rates are the same in the unreached and reached audiences

The incrementality is the difference in conversions between the test and scaled control groups and originates solely from the reached audiences

The test statistic is lift (L), defined as incrementality divided by the number of reached conversions in the scaled control.

Facebook’s Null Hypothesis Significance Test determines if there is a non-zero lift at 90% confidence level (two-tailed)
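The definitions above can be turned into a short calculation. This is a sketch under the stated assumptions (equal conversion rates in the reached and unreached control audiences, and the same reach r in both groups); the function name, the parameter names, and the numbers are hypothetical, and the significance test itself is not implemented.

```python
def facebook_lift(test_conversions, control_conversions, control_scale, reach):
    """Return (incrementality, lift) following the definitions in the text.

    control_scale: factor that scales the control group to the test group size.
    reach: fraction r of the group that is (or would have been) reached.
    """
    scaled_control = control_conversions * control_scale
    # Incrementality: test conversions minus scaled control conversions.
    incrementality = test_conversions - scaled_control
    # With equal conversion rates in reached and unreached control audiences,
    # the reached share of the scaled control conversions is r times the total.
    reached_control_conversions = reach * scaled_control
    lift = incrementality / reached_control_conversions
    return incrementality, lift


# Hypothetical study: the control group is half the size of the test group.
incrementality, lift = facebook_lift(
    test_conversions=1200, control_conversions=500, control_scale=2.0, reach=0.5
)
print(incrementality, lift)
```

Dividing by the reached conversions rather than by all scaled control conversions is what makes the lift larger when the reach is lower, for the same incrementality.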

Derivation of the lift distributions

Power and Minimum Sample size

While we have derived the necessary CMF to calculate power and sample size, we also explore the possibility to proceed by simulating the distribution for L using a large number of samples

Steps for the numerical CMF of L (under H0)

1. Estimates for E(C_C) and r can be taken from previous Facebook advertising results.