Gene Expression Count Model Continuous Covariates
Gene expression associated with continuous (quantitative) traits (Limma/edgeR/correlation) 1 @annacotannacot-20795 Last seen 22 months ago Hi, I'm trying to figure out which is the best model to go with in an experiment, so I'd appreciate any advice people can give, as I'm essentionally a wet lab person. I'm aiming to find genes that associate with changes in triglyceride levels (TG) in the liver samples. 1) edgeR 2) Limma -Voom 3) Regular Pearson correlation analysis From all the approaches tested the edgeR gives noticeably longer list of significant results. And finally time for questions: Thank you very much for all the tips! @alun Last seen 16 hours ago The city by the bay My first question is which tool would be the best choice to answer my question (Limma, edgeR or Pearson correlation analysis) and why? Both edgeR and You're getting a lot more DE genes from edgeR, probably because you're using the older GLM approach. Try using the quasi-likelihood methods, i.e., While both of these work fine, if you want a recommendation for a single method, I would probably go with Are my assumptions correct? I'm not sure what assumptions you're referring to. The only obvious one is that you're assuming that log-gene expression is linear with respect to increasing triglyceride level. Whether or not that is reasonable is debatable. If you want to avoid that assumption, and if you have enough samples, you can use a spline basis matrix instead. However, this comes at the cost of reducing the interpretability of your results. Is it correct to use the untransformed variable (TG- in this case not normally distributed) as an independent variable, or should I rather used log2 transformed value? See my previous point. Is log-gene expression linear with respect to the triglyceride levels, or linear with respect to the log-levels? Ideally, you'd have some biological knowledge about the system that you could use to make this decision. For example, I would say that it would make sense to log-transform the covariate, because I could easily imagine that many expression responses will be linear with respect to concentration (and thus log-expression will be linear with log-concentration). It's a bit harder - though not inconceivable - to imagine genes that increase exponentially with concentration, in which case linearity of log-expression values with the untransformed levels would be more appropriate. If you can't make a reasonable decision a priori... well, just try them both. The better-fitting model should have lower average dispersion estimates and (in this simple case) probably more DE genes. Note that more DE genes doesn't always mean a better-fitting model; poor fits for models involving blocking factors can result in more false positives, for example. And, it must be said, choosing a model that gives you the lowest variance doesn't always return the "correct" model, due to natural stochasticity in the quality of the fit. Our hope is that a consistent effect over thousands of genes is a reliable indicator of the model's suitability. An additional complication is that you can only use one or the other in any single limma/edgeR run. Some genes might respond linearly while other genes might respond exponentially, but you have to use the same design matrix for both. If this is a concern, you should use splines; then none of this would be an issue, because given enough knots you can approximate any curve for any gene. The lack of normality in the covariate is completely irrelevant to edgeR and co. As far as I understand logFC in case of continuous trait the beta coefficient, right? In your case, the log-fold change in the log-change in expression per unit of increase in the triglyceride concentration. Or increase in the log-unit, if you decide to log-transform your covariates.
I used featureCounts to summarize the gene level.
> dge <- DGEList(counts=expr) > keep <- filterByExpr(dge) > dge <- dge[keep, , keep.lib.sizes=FALSE] > dge <- calcNormFactors(dge) > TG <- pheno_untranformed$triglycerides
> design <- model.matrix( ~TG) > dge <- estimateDisp(dge, design,robust=TRUE) > fit <- glmFit(dge, design) lrt <- glmLRT(fit) > diffTable <- topTags(lrt, n=15, adjust.method="BH", sort.by="PValue", p.value=1)
> design <- model.matrix( ~TG) > y <- voom(dge, design) > fit <- lmFit(y, design) > fit <- eBayes(fit) > diffTable2 <- topTable(fit, coef="TG", number=15)
voom-limma work fine here. I don't know what you mean by "regular pearson correlation analysis" - I assume you're testing for a non-zero correlation. edgeR and limma have a few advantages over such a simple approach:
design.glmQLFit and glmQLFTest, that are closer to the behaviour of voom and limma in theory and in practice.voom-limma because it is faster and plugs more easily into downstream methods, e.g., gene set testing.
Source: https://support.bioconductor.org/p/121169/
0 Response to "Gene Expression Count Model Continuous Covariates"
Post a Comment