Statistics for Genomic Data Science Quiz & Answers

Setting out to understand statistics for genomic data science can be both exhilarating and challenging. As we dig into the complexities of statistical analysis in this field, it is essential to have a solid foundation to build on. This blog post aims to give learners a comprehensive guide to the Coursera course 'Statistics for Genomic Data Science'.

Here you will find curated quizzes and their answers, which not only test your knowledge but also improve your learning experience. Whether you are a student, a professional, or an enthusiast in genomic research, this resource is designed to streamline your study and make sure you are well equipped to tackle the statistical challenges of genomics.

Module 1 Quiz

Q1. Susan asks Joe to share his data according to the data sharing plan discussed in the lectures. Which of the following are reasons the study may be reproducible, but not replicable?

Answer: The identified effect can be reproduced from Joe's code and data, but it may be due only to random variation and not appear in future studies.

Q2. Put the following code chunk at the top of an R markdown document called test.Rmd but set eval=TRUE

Answer: The plot is random the first time you knit the document. It is identical to the first time the second time you knit the document. After removing the folders test_cache and test_files, new random versions are generated.

Q3. Create a SummarizedExperiment object with the following code

library(Biobase)

library(GenomicRanges)

data(sample.ExpressionSet, package = "Biobase")

se = makeSummarizedExperimentFromExpressionSet(sample.ExpressionSet)

Answer: Get the genomic table with assay(se), get the phenotype table with colData(se), get the feature data with rowData(se). rowRanges(se) gives information on the genomic location and structure of the measured features.

Q4. Suppose that you have measured ChIP-Seq data from 10 healthy individuals and 10 metastatic cancer patients. For each individual you split the sample into two identical sub-samples and perform the ChIP-Seq experiment on each sub-sample. How can you measure (a) biological variability, (b) technical variability and (c) phenotype variability?

Answer: (a) By looking at the variation across samples from 10 different individuals with cancer, (b) by looking at the variability between the measurements on the two sub-samples from the same sample, and (c) by comparing the average measurements on the healthy individuals to the measurements on the individuals with cancer.
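The three comparisons in this answer can be sketched on simulated data. This is only an illustration under assumed numbers: the noise levels, the group shift, and all variable names are hypothetical and are not taken from the course data.

```r
# Simulated design: 10 healthy + 10 cancer individuals, 2 sub-samples each.
# All effect sizes below are made-up values for illustration only.
set.seed(1)
n_per_group <- 10
group <- rep(c("healthy", "cancer"), each = n_per_group)

# Per-individual signal: group mean plus biological noise
group_mean <- ifelse(group == "cancer", 12, 10)
individual <- group_mean + rnorm(2 * n_per_group, sd = 1.0)

# Two technical replicates (sub-samples) per individual
rep1 <- individual + rnorm(2 * n_per_group, sd = 0.2)
rep2 <- individual + rnorm(2 * n_per_group, sd = 0.2)
avg  <- (rep1 + rep2) / 2

# (a) biological variability: spread across individuals within one group
biological_var <- var(avg[group == "cancer"])

# (b) technical variability: spread between sub-samples of the same sample.
# For two replicates, (rep1 - rep2)^2 / 2 is the per-sample variance.
technical_var <- mean((rep1 - rep2)^2 / 2)

# (c) phenotype variability: difference between the group averages
phenotype_diff <- mean(avg[group == "cancer"]) - mean(avg[group == "healthy"])

print(c(biological = biological_var,
        technical  = technical_var,
        phenotype  = phenotype_diff))
```

Each quantity uses a different slice of the design, which is why the split into sub-samples is needed to separate technical noise from differences between people.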

Q5. Just considering the phenotype data what are some reasons that the Bottomly data set is likely a better experimental design than the Bodymap data? Imagine the question of interest in the Bottomly data is to compare strains and in the Bodymap data it is to compare tissues.

Answer: The covariates in the Bottomly data set (experiment number, lane number) are balanced with respect to strain. The covariates in the Bodymap data set (gender, age, number of technical replicates) are not balanced with respect to tissue.

Q6. What are some reasons why this plot is not useful for comparing the number of technical replicates by tissue (you may need to install the plotrix package).

Answer: The “mixture” category is split across multiple wedges.

Q7. Which of the following code chunks will make a heatmap of the 500 most highly expressed genes (as defined by total count), without re-ordering due to clustering? Are the highly expressed samples next to each other in sample order?

Answer: row_sums = rowSums(edata)

edata = edata[order(-row_sums),]

index = 1:500

heatmap(edata[index,],Rowv=NA,Colv=NA)

Q8. Make an MA-plot of the first sample versus the second sample using the log2 transform (hint: you may have to add 1 first) and the rlog transform from the DESeq2 package. How are the two MA-plots different? Which kind of genes appear most different in each plot?

Answer: The plots look pretty similar, but there are two strong diagonal stripes (corresponding to the zero count genes) in the log2 plot. In both cases, the genes in the middle of the expression distribution show the biggest differences, but the low abundance genes seem to show smaller differences with the rlog transform.
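To see where the diagonal stripes in the log2 version come from, here is a minimal sketch on simulated counts. The count distribution and variable names are assumptions made for illustration, and the rlog comparison is left out because it needs the DESeq2 package and the real data.

```r
# Simulated counts for two samples; the gamma/Poisson parameters are arbitrary.
set.seed(2)
n_genes <- 2000
mu <- rgamma(n_genes, shape = 0.5, scale = 20)
s1 <- rpois(n_genes, mu)
s2 <- rpois(n_genes, mu)

# log2 transform with a pseudocount of 1 to avoid log2(0)
y1 <- log2(s1 + 1)
y2 <- log2(s2 + 1)

# MA coordinates: A = average log expression, M = log difference
A <- (y1 + y2) / 2
M <- y1 - y2

# Genes with a zero count in exactly one sample lie on M = 2A or M = -2A,
# which is consistent with two diagonal stripes in the plot
one_zero <- xor(s1 == 0, s2 == 0)

plot(A, M, pch = 19, cex = 0.4,
     col = ifelse(one_zero, "red", "grey40"),
     xlab = "A (mean log2 count)", ylab = "M (log2 difference)")
abline(h = 0, lty = 2)
```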

Q9. Cluster the data in three ways: with no changes to the data; after filtering all genes with rowMeans less than 100; and after taking the log2 transform of the data without filtering. Color the samples by which study they came from (hint: consider using the function myplclust.R in the package rafalib available from CRAN and looking at the argument lab.col.) How do the methods compare in terms of how well they cluster the data by study? Why do you think that is?

Answer: Clustering with or without filtering is about the same. Clustering after the log2 transform shows better clustering with respect to the study variable. The likely reason is that the highly skewed distribution doesn’t match the Euclidean distance metric being used in the clustering example.

Q10. Cluster the samples using k-means clustering after applying the log2 transform (be sure to add 1). Set a seed for reproducible results (use set.seed(1235)). If you choose two clusters, do you get the same two clusters as you get if you use the cutree function to cluster the samples into two groups? Which cluster matches most closely to the study labels?

Answer: They produce different answers. The k-means clustering matches study better. Hierarchical clustering would look better if we went farther down the tree, but the top split doesn’t perfectly describe the study variable.
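The comparison between k-means and a two-group cut of a hierarchical tree can be sketched as follows. The count matrix and the study labels are simulated placeholders, not the quiz data, so the resulting tables will not match the quiz answer.

```r
set.seed(1235)
n_genes <- 200
# Two hypothetical studies of 10 samples each; study B is scaled upward
study <- rep(c("A", "B"), each = 10)
counts <- matrix(rnbinom(n_genes * 20, mu = 50, size = 2), nrow = n_genes)
counts[, study == "B"] <- counts[, study == "B"] * 3

ldata <- log2(counts + 1)

# kmeans clusters rows, so transpose to cluster samples
km <- kmeans(t(ldata), centers = 2)

# Hierarchical clustering on Euclidean distance, cut into two groups
hc <- hclust(dist(t(ldata)))
hc_groups <- cutree(hc, k = 2)

# Cross-tabulate each clustering against the study labels
print(table(kmeans = km$cluster, study = study))
print(table(cutree = hc_groups, study = study))
```

Cross-tabulating against the known labels is a simple way to judge which clustering tracks the study variable more closely.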

Module 2 Quiz

Q1. What percentage of variation is explained by the first principal component in the data set if you: 1) Do no transformations? 2) log2(data + 1) transform? 3) log2(data + 1) transform and subtract row means?

Answer: a. 0.89 b. 0.97 c. 0.35
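One way to compute these fractions is from the singular values of the data matrix. The sketch below uses a simulated count matrix and a hypothetical helper named pve1, so its output will not reproduce the numbers in the answer.

```r
set.seed(3)
# Simulated count matrix (genes in rows, samples in columns); hypothetical data
edata <- matrix(rnbinom(500 * 12, mu = 100, size = 1), nrow = 500)

# Fraction of variance explained by the first singular vector
pve1 <- function(x) {
  d <- svd(x)$d
  d[1]^2 / sum(d^2)
}

ldata <- log2(edata + 1)

raw      <- pve1(edata)
logged   <- pve1(ldata)
centered <- pve1(ldata - rowMeans(ldata))  # subtract each gene's mean

print(round(c(raw = raw, log2 = logged, log2_centered = centered), 3))
```

Without centering, the first singular vector mostly captures the overall mean level, which is why the fraction typically drops once row means are removed.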

Q2. Perform the log2(data + 1) transform and subtract row means from the samples. Set the seed to 333 and use k-means to cluster the samples into two clusters. Use svd to calculate the singular vectors. What is the correlation between the first singular vector and the sample clustering indicator?

Answer: 0.87
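A minimal sketch of this calculation, assuming a simulated matrix with an artificial group shift in place of the quiz data:

```r
set.seed(333)
# Hypothetical data: 300 genes, 16 samples, last 8 samples scaled for 50 genes
edata <- matrix(rnbinom(300 * 16, mu = 80, size = 2), nrow = 300)
edata[1:50, 9:16] <- edata[1:50, 9:16] * 4

ldata <- log2(edata + 1)
centered <- ldata - rowMeans(ldata)

# Cluster samples (columns) into two groups
km <- kmeans(t(centered), centers = 2)

# Right singular vectors (columns of v) describe patterns across samples
sv <- svd(centered)

# The sign of a singular vector and the cluster numbering are both arbitrary,
# so only the magnitude of the correlation is meaningful
r <- cor(sv$v[, 1], km$cluster)
print(abs(r))
```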

Q3. Fit a linear model relating the first gene’s counts to the number of technical replicates, treating the number of replicates as a factor. Plot the data for this gene versus the covariate. Can you think of why this model might not fit well?

Answer: There are very few samples with more than 2 replicates, so the estimates for those values will not be very good.

Q4. Fit a linear model relating the first gene’s counts to the age of the person and the sex of the samples. What is the value and interpretation of the coefficient for age?

Answer: -23.91. This coefficient means that for each additional year of age, the count goes down by an average of 23.91 for a fixed sex.
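The interpretation of the age coefficient can be illustrated with a small simulated regression. The covariates, the true effect sizes, and the variable names here are all assumptions for the sketch, not the quiz data.

```r
set.seed(4)
n <- 60
# Hypothetical covariates and a made-up true age effect of -2 counts per year
age <- round(runif(n, 20, 80))
sex <- factor(sample(c("F", "M"), n, replace = TRUE))
count <- 500 - 2 * age + 30 * (sex == "M") + rnorm(n, sd = 20)

fit <- lm(count ~ age + sex)

# The age coefficient is the expected change in count per extra year of age,
# comparing samples of the same sex
print(coef(fit)["age"])
```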

Q5. Perform the log2(data + 1) transform. Then fit a regression model to each sample using population as the outcome. Do this using the lm.fit function (hint: don’t forget the intercept). What is the dimension of the residual matrix, the effects matrix and the coefficients matrix?

Answer: Residual matrix: 129 X 52580

Effects matrix: 129 X 52580

Coefficients matrix: 2 X 52580
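The shapes in this answer follow from how lm.fit handles a matrix of outcomes. A small sketch with simulated data shows the same pattern at a reduced size; the 12 samples, 30 genes, and population labels are placeholders.

```r
set.seed(5)
n_samples <- 12
n_genes <- 30
# Hypothetical log-scale expression matrix: genes in rows, samples in columns
edata <- matrix(rnorm(n_genes * n_samples), nrow = n_genes)
population <- factor(rep(c("CEU", "YRI"), each = n_samples / 2))

# Design matrix with an intercept column plus one population indicator
mod <- model.matrix(~ population)

# lm.fit expects samples in rows, so the expression matrix is transposed
fit <- lm.fit(mod, t(edata))

print(dim(fit$residuals))     # samples x genes
print(dim(fit$effects))       # samples x genes
print(dim(fit$coefficients))  # design columns x genes
```

With 129 samples, 52580 genes, and a two-column design, the same rule gives the dimensions listed in the answer.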

Q6. Perform the log2(data + 1) transform. Then fit a regression model to each sample using population as the outcome. Do this using the lm.fit function (hint: don’t forget the intercept). What is the effects matrix?

Answer:

Q7. Fit many regression models to the expression data where age is the outcome variable using the lmFit function from the limma package (hint: you may have to subset the expression data to the samples without missing values of age to get the model to fit). What is the coefficient for age for the 1,000th gene? Make a plot of the data and fitted values for this gene. Does the model fit well?

Answer: -27.61. The model doesn’t fit well since there are two large outlying values and the rest of the values are near zero.

Q8. Fit many regression models to the expression data where age is the outcome variable and tissue.type is an adjustment variable using the lmFit function from the limma package (hint: you may have to subset the expression data to the samples without missing values of age to get the model to fit). What is wrong with this model?

Answer: Since tissue.type is a factor variable with many levels, this model has more coefficients to estimate per gene (18) than data points per gene (16).
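The problem of having more coefficients than data points can be reproduced on a toy design. The sketch below uses base R lm rather than limma, and a hypothetical factor with one level per sample, to show that the design matrix ends up with more columns than rows.

```r
set.seed(6)
n <- 8
# Hypothetical covariates: every sample has its own tissue level
age <- round(runif(n, 20, 70))
tissue <- factor(paste0("tissue", seq_len(n)))

mod <- model.matrix(~ age + tissue)

# 1 intercept + 1 age + (n - 1) tissue indicators = n + 1 columns,
# which exceeds the n available samples
print(dim(mod))
print(qr(mod)$rank)

# lm reports NA for coefficients it cannot estimate
y <- rnorm(n)
n_missing <- sum(is.na(coef(lm(y ~ age + tissue))))
print(n_missing)
```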

Q9. Why is it difficult to distinguish the study effect from the population effect in the Montgomery Pickrell dataset from ReCount?

Answer: The effects are difficult to distinguish because each study only measured one population.

Q10. Set the seed using the command set.seed(33353) then estimate a single surrogate variable using the sva function after log2(data + 1) transforming the expression data, removing rows with rowMeans less than 1, and treating age as the outcome (hint: you may have to subset the expression data to the samples without missing values of age to get the model to fit). What is the correlation between the estimated surrogate for batch and age? Is the surrogate more highly correlated with race or gender?

Answer: Correlation with age: 0.20

More highly correlated with race.

Module 3 Quiz

Q1. Load the example SNP data with the following code: Fit a linear model and a logistic regression model to the data for the 3rd SNP. What are the coefficients for the SNP variable? How are they interpreted? (hint: Don’t forget to recode the 0 values to NA for the SNP data)

Answer: Linear Model = -0.04

Logistic Model = -0.16

Both models are fit on the additive scale. So in the linear model case, the coefficient is the decrease in probability associated with each additional copy of the minor allele. In the logistic regression case, it is the decrease in the log odds ratio associated with each additional copy of the minor allele.

Q2. In the previous question why might the choice of logistic regression be better than the choice of linear regression?

Answer: If you included more variables it would be possible to get negative estimates for the probability of being a case from the linear model, but this would be prevented with the logistic regression model.
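The point about out-of-range probabilities can be checked on simulated case/control data. The genotype frequencies, the extra covariate, and the effect sizes below are assumptions chosen to make the contrast visible.

```r
set.seed(7)
n <- 500
# Hypothetical genotype (copies of the minor allele) and a strong covariate
snp <- rbinom(n, 2, 0.3)
covariate <- rnorm(n)
p_true <- plogis(-0.2 * snp + 2.5 * covariate)
status <- rbinom(n, 1, p_true)

linear   <- lm(status ~ snp + covariate)
logistic <- glm(status ~ snp + covariate, family = binomial)

# Linear fitted "probabilities" are unbounded and can leave [0, 1];
# logistic fitted probabilities always stay inside that interval
print(range(fitted(linear)))
print(range(fitted(logistic)))
```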

Q3. Load the example SNP data with the following code: Fit a logistic regression model on a recessive (need 2 copies of minor allele to confer risk) and additive scale for the 10th SNP. Make a table of the fitted values versus the case/control status. Does one model fit better than the other?

Answer: No, in all cases the fitted values are near 0.5 and there are about an equal number of cases and controls in each group. This is true regardless of whether you fit a recessive or additive model.

Q4. Load the example SNP data with the following code: What is the average effect size? What is the maximum? What is the minimum?

Answer: Average effect size = 0.007, minimum = -4.25, maximum = 3.90

Q5. Load the example SNP data with the following code: What is the correlation with the results from using snp.rhs.tests and chi.squared? Why does this make sense?

Answer: > 0.99. They are both testing for the same association using the same additive regression model on the logistic scale but using slightly different tests.

Q6. Load the Montgomery and Pickrell eSet: Do the log2(data + 1) transform and calculate F-statistics for the difference between studies/populations using genefilter::rowFtests and using genefilter::rowttests. Do you get the same statistic? Do you get the same p-value?

Answer: You get the same p-value but different statistics. This is because the F-statistic and t-statistic test the same thing when doing a two group test and one is a transform of the other.
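The relationship between the two statistics can be verified for a single gene with base R. This sketch uses simulated values and base functions (t.test and anova) instead of genefilter, and assumes an equal-variance t-test.

```r
set.seed(8)
# Hypothetical log-expression values for one gene in two studies
study <- factor(rep(c("Montgomery", "Pickrell"), each = 15))
y <- rnorm(30, mean = ifelse(study == "Pickrell", 1, 0))

# Two-sample t-test with equal variances
tt <- t.test(y ~ study, var.equal = TRUE)

# One-way ANOVA F-test on the same two groups
ft <- anova(lm(y ~ study))

t_stat <- unname(tt$statistic)
f_stat <- ft[["F value"]][1]
p_t <- tt$p.value
p_f <- ft[["Pr(>F)"]][1]

# With two groups, F equals t squared and the p-values agree
print(c(t_squared = t_stat^2, F = f_stat))
print(c(p_t = p_t, p_F = p_f))
```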

Q7. Load the Montgomery and Pickrell eSet: First test for differences between the studies using the DESeq2 package using the DESeq function. Then do the log2(data + 1) transform and do the test for differences between studies using the limma package and the lmFit, ebayes and topTable functions. What is the correlation in the statistics between the two analyses? Are there more differences for the large statistics or the small statistics (hint: Make an MA-plot).

Answer: 0.93. There are more differences for the small statistics.

Q8. Apply the Benjamini-Hochberg correction to the P-values from the two previous analyses. How many results are statistically significant at an FDR of 0.05 in each analysis?

Answer: DESeq = 1995 significant; limma = 2807 significant
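Counting significant results at a given FDR comes down to adjusting the p-values and thresholding. Here is a minimal sketch using base R's p.adjust on simulated p-values, with the adjustment also written out by hand; the mixture of null and non-null p-values is an assumption for illustration.

```r
set.seed(9)
# Hypothetical p-values: 900 null tests plus 100 with real signal
p <- c(runif(900), rbeta(100, 0.5, 20))

# Built-in Benjamini-Hochberg adjustment
p_bh <- p.adjust(p, method = "BH")

# The same adjustment by hand: p * m / rank, then enforce monotonicity
m <- length(p)
o <- order(p, decreasing = TRUE)
manual <- numeric(m)
manual[o] <- pmin(1, cummin(p[o] * m / (m:1)))

# Number of tests called significant at an FDR of 0.05
print(sum(p_bh < 0.05))
```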

Q9. Is the number of significant differences surprising for the analysis comparing studies from Question 8? Why or why not?

Answer: Yes and no. It is surprising because there is a large fraction of the genes that are significantly different, but it isn’t that surprising because we would expect that when comparing measurements from very different batches.

Q10. Suppose you observed the following P-values from the comparison of differences between studies. Why might you be suspicious of the analysis?

Answer: The p-values should have a spike near zero (the significant results) and be flat to the right hand side (the null results), so a distribution pushed toward one suggests something went wrong.
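The expected flat shape for null results can be seen by simulating tests where no true difference exists. The group sizes, the number of tests, and the helper one_null_p are hypothetical choices for this sketch.

```r
set.seed(10)
n_tests <- 2000
group <- factor(rep(c("a", "b"), each = 10))

# One two-group t-test on pure noise, so the null hypothesis is true
one_null_p <- function() {
  y <- rnorm(20)
  t.test(y[group == "a"], y[group == "b"])$p.value
}
null_p <- replicate(n_tests, one_null_p())

# A flat histogram is the expected shape when nothing is significant;
# real signal would add a spike near zero on top of this flat background
hist(null_p, breaks = 20, main = "Null p-values", xlab = "p-value")

# Rough check of flatness: about 5% of null p-values fall below 0.05
frac_below <- mean(null_p < 0.05)
print(frac_below)
```

A histogram that instead piles up near one does not match either the null or the signal pattern, which is the reason for suspicion described in the answer.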

Module 4 Quiz

Q1. When performing gene set analysis it is critical to use the same annotation as was used in pre-processing steps. Read the paper behind the Bottomly data set on the ReCount database: http://www.ncbi.nlm.nih.gov/pubmed?term=21455293 Using the paper and the function: supportedGenomes() in the goseq package can you figure out which of the Mouse genome builds they aligned the reads to.

Answer: UCSC mm9

Q2. Load the Bottomly data with the following code and perform a differential expression analysis using limma with only the strain variable as an outcome. How many genes are differentially expressed at the 5% FDR level using Benjamini-Hochberg correction? What is the gene identifier of the first gene differentially expressed at this level (just in order, not the smallest FDR) ? (hint: the featureNames function may be useful)

Answer: 223 at FDR 5%; ENSMUSG00000000402 first DE gene

Q3. Use the nullp and goseq functions in the goseq package to perform a gene ontology analysis. What is the top category that comes up as over represented? (hint: you will need to use the genome information on the genome from question 1 and the differential expression analysis from question 2.)

Answer: GO:0004888

Q4. Look up the GO category that was the top category from the previous question. What is the name of the category?

Answer: transmembrane signaling receptor activity

Q5. Load the Bottomly data with the following code and perform a differential expression analysis using limma and treating strain as the outcome but adjusting for lane as a factor. Then find genes significant at the 5% FDR rate using the Benjamini-Hochberg correction and perform the gene set analysis with goseq following the protocol from the first 4 questions. How many of the top 10 overrepresented categories are the same for the adjusted and unadjusted analysis?

Answer: 3