Multiple Testing, Post-Hoc Analysis, and Repeated-Measures ANOVA
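First, see this short article for the purpose, principles, and methods of multiple testing: Calculating the FDR in Multiple Testing.
Multiple testing matters because, in real scientific experiments, the study of a single phenomenon is often repeated hundreds or thousands of times. Even with the p-value threshold set at 0.01, Type I errors inevitably accumulate as the number of tests grows, so the testing procedure and the p-values need to be corrected.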
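To make the accumulation of Type I errors concrete, here is a minimal sketch in base R. It assumes the m tests are independent, in which case the probability of at least one false positive is 1 - (1 - alpha)^m. The simulated p-values and all variable names are illustrative only; p.adjust() is the base R function for correcting p-values.

```r
# Family-wise error rate for m independent tests at threshold alpha
# (independence is an assumption made for illustration).
alpha <- 0.01
m <- c(1, 10, 100, 1000)
fwer <- 1 - (1 - alpha)^m
print(data.frame(tests = m, fwer = round(fwer, 3)))

# Simulated p-values under a true null hypothesis (uniform on [0, 1]).
set.seed(1)
p_raw <- runif(1000)

# Two common corrections available in base R.
p_bonf <- p.adjust(p_raw, method = "bonferroni")  # family-wise error rate
p_bh   <- p.adjust(p_raw, method = "BH")          # false discovery rate

# Number of "significant" results before and after correction.
n_sig <- c(raw = sum(p_raw < alpha),
           bonferroni = sum(p_bonf < alpha),
           BH = sum(p_bh < alpha))
print(n_sig)
```

Under the independence assumption, 100 tests at a threshold of 0.01 already give a chance of about 63% of at least one false positive, which is why the raw p-values cannot be read one at a time.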
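The following borrowed example illustrates post hoc comparisons after an analysis of variance (ANOVA). It uses the lendingclub dataset.
LendingClub is a US peer-to-peer lending company headquartered in San Francisco, California. It was the first peer-to-peer lender to register its securities offerings with the US Securities and Exchange Commission (SEC) and to offer loan trading on a secondary market.
Lending Club lets borrowers create unsecured personal loans of between $1,000 and $40,000. The standard loan term is three years. Investors can search and browse the loan listings on the Lending Club website and choose the loans they want to invest in based on the information provided about the borrower, loan amount, loan grade, and loan purpose. Investors make money from interest. Lending Club makes money by charging borrowers an origination fee and charging investors a service fee.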
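The official website contains all data from 2007 to the present; interested readers can download and explore it themselves:
LendingClub Statistics
LendingClub official website
Because the full dataset is very large, we analyse only a subset of it here.
Download the Lendingclub data subset from DataCamp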
lendingclub <- read.csv(url("assets.datacamp/production/repositories/1793/datasets/e14dbe91a0840393e86e4fb9a7ec1b958842ae39/lendclub.csv"))
Use glimpse() from the dplyr package to inspect the basic structure of the data
glimpse means a quick look: the function gives a rough overview of the data.
library(dplyr)
glimpse(lendingclub)
Observations: 1,500
Variables: 12
$ member_id          <int> 55096114, 1555332, 1009151, 69524202, 72128084, 53906707, 610285, 48822267, 62159245, 1688398, 37887752,
$ loan_amnt          <int> 11000, 10000, 13000, 5000, 18000, 14000, 8000, 5000, 7500, 6900, 12000, 8000, 18250, 9325, 28000, 16000,
$ funded_amnt        <int> 11000, 10000, 13000, 5000, 18000, 14000, 8000, 5000, 7500, 6900, 12000, 8000, 18250, 9325, 28000, 16000,
$ term                <fct> 36 months, 36 months, 60 months, 36 months, 36 months, 60 months, 36 months, 36 months, 36 months, 36 months, ...
$ int_rate            <dbl> 12.69, 6.62, 10.99, 12.05, 5.32, 16.99, 13.11, 7.89, 16.55, 10.16, 12.39, 10.16, 19.89, 15.61, 6.24, 14.99, 16...
$ emp_length          <fct> 10+ years, 10+ years, 3 years, 10+ years, 10+ years, 3 years, 10+ years, 10+ years, < 1 year, n/a, 10+ years, ...
$ home_ownership      <fct> RENT, MORTGAGE, MORTGAGE, MORTGAGE, MORTGAGE, MORTGAGE, MORTGAGE, RENT, MORTGAGE, MORTGAGE, RENT, OWN, OWN, MO...
$ annual_inc          <dbl> 51000, 40000, 78204, 51000, 96000, 47000, 40000, 33000, 50000, 70000, 65000, 50000, 36500, 70000, 176000,
$ verification_status <fct> Not Verified, Verified, Not Verified, Not Verified, Not Verified, Not Verified, Not Verified, Source Verified,...
$ loan_status        <fct> Current, Fully Paid, Fully Paid, Current, Current, Current, Fully Paid, Current, Current, Fully Paid, Current,...
$ purpose            <fct> debt_consolidation, debt_consolidation, home_improvement, home_improvement, credit_card, home_improvement,
$ grade              <fct> C, A, B, C, A, D, C, A, D, B, C, B, E, C, A, C, D, B, C, D, D, A, B, D, B, B, C, A, B, B, B, A, C, A, B, C, B,...
The dataset contains 12 variables and 1,500 rows of borrower data.
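Use summarise() to compute the median loan amount (loan_amnt) and the means of the interest rate (int_rate) and annual income (annual_inc).
If the meaning of any variable in the dataset is unclear, you can download the Excel file describing the variables under DATA DICTIONARY on the official website.
summarise() comes from the dplyr package, which was loaded above.
This step also uses the pipe operator %>%, which passes the result of the previous step to the next function as its first argument. The operator is defined in the magrittr package and re-exported by dplyr.
lendingclub %>% summarise(median(loan_amnt), mean(int_rate), mean(annual_inc))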
  median(loan_amnt) mean(int_rate) mean(annual_inc)
1             13000        13.3147         75736.03
The median loan amount is $13,000, the mean interest rate is 13.31%, and the borrowers' mean annual income is $75,736.03.
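To show what the pipe does in a call like this one, here is a minimal sketch that compares a nested call with the piped form. The data frame toy and its column amount are hypothetical and are not part of the lendingclub dataset.

```r
library(dplyr)

# Hypothetical toy data, not taken from the lendingclub dataset.
toy <- data.frame(amount = c(1000, 2000, 6000))

# Nested call: the data frame is the first argument of summarise().
nested <- summarise(toy, mean_amount = mean(amount))

# Piped call: toy is passed in as the first argument of summarise().
piped <- toy %>% summarise(mean_amount = mean(amount))

# Both forms should produce the same one-row summary.
print(nested)
print(piped)
```

The piped form reads left to right, which becomes more useful as more steps are chained.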
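Use ggplot to draw a bar chart of loan purpose (purpose)
Because the category labels on the x-axis are long and would crowd together, coord_flip() is used to flip the coordinate axes.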
library(ggplot2)
ggplot(data = lendingclub, aes(x = purpose)) + geom_bar() + coord_flip()
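The chart shows that debt consolidation is the most common loan purpose, followed by credit card repayment.
Note: debt consolidation means using a single loan to pay off other loans and a series of credit cards.
Loan purpose categories:
wedding
vacation
small business
renewable energy
other
moving
medical
major purchase
house (home purchase)
home improvement (renovation)
debt consolidation
credit card
car
Because the categories are rather scattered, we can merge them. For example, credit card, debt consolidation, and medical are all debt-related, while car, major purchase, and vacation are all large purchases.
Use recode() to regroup the loan purposes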
lendingclub$purpose_recode <- lendingclub$purpose %>% recode(
  "credit_card" = "debt_related", "debt_consolidation" = "debt_related", "medical" = "debt_related",
  "car" = "big_purchase", "major_purchase" = "big_purchase", "vacation" = "big_purchase",
  "moving" = "life_change", "small_business" = "life_change", "wedding" = "life_change",
  "house" = "home_related", "home_improvement" = "home_related")
View the bar chart again
ggplot(data=lendingclub, aes(x = purpose_recode)) + geom_bar()
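As a quick check of how the recoding step behaves, here is a minimal sketch on a small hypothetical factor (the vector purpose below is made up and holds only a few of the levels). It assumes dplyr's recode(), where each named argument maps an old level to a new one and levels that are not listed are left unchanged.

```r
library(dplyr)

# Hypothetical toy factor with a few of the purpose levels.
purpose <- factor(c("car", "credit_card", "medical", "other", "vacation"))

# Named arguments map old level = new level; unlisted levels ("other") are kept.
purpose_recode <- purpose %>% recode(
  "credit_card" = "debt_related", "medical" = "debt_related",
  "car" = "big_purchase", "vacation" = "big_purchase")

# Cross-tabulate old against new values to check the mapping.
print(table(purpose, purpose_recode))
```

Cross-tabulating the original against the recoded variable is a cheap way to confirm that every level ended up in the intended group before modelling.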
With loan purpose (purpose_recode) as the independent variable and the funded amount (funded_amnt) as the dependent variable, fit a regression model.
purpose_model <- lm(funded_amnt ~ purpose_recode, data = lendingclub)
summary(purpose_model)
Call:
lm(formula = funded_amnt ~ purpose_recode, data = lendingclub)
Residuals:
    Min      1Q  Median      3Q     Max
 -14472   -6251   -1322    4678   25761
Coefficients:
                                Estimate Std. Error t value Pr(>|t|)
(Intercept)                       9888.1     1248.9   7.917 4.69e-15 ***
purpose_recodedebt_related        5433.5     1270.5   4.277 2.02e-05 ***
purpose_recodehome_related        4845.0     1501.0   3.228  0.00127 **
purpose_recodelife_change         4095.3     2197.2   1.864  0.06254 .
purpose_recodeother               -649.3     1598.3  -0.406  0.68461
purpose_recoderenewable_energy   -1796.4     4943.3  -0.363  0.71636
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
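As a reading aid for the coefficient table: assuming R's default treatment contrasts, the intercept of a model with a single factor predictor is the mean of the reference level (here big_purchase, the level with no row of its own), and every other coefficient is that group's mean minus the reference mean. A minimal sketch on made-up data (the data frame toy and its values are hypothetical):

```r
# Hypothetical data: three groups with known means (10, 15, and 8).
toy <- data.frame(
  group  = factor(rep(c("a", "b", "c"), each = 4)),
  amount = c(8, 10, 12, 10,  14, 16, 15, 15,  7, 9, 8, 8)
)

fit <- lm(amount ~ group, data = toy)
group_means <- tapply(toy$amount, toy$group, mean)

# Intercept = mean of the reference level "a";
# groupb / groupc = each group's mean minus the reference mean.
print(coef(fit))
print(group_means - group_means["a"])

# The overall F test of the same model is the one-way ANOVA table.
print(anova(fit))
```

Each t test in such a table compares one group with the reference level only, so it does not cover every pair of groups.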