
Statistics for Genomic Data Science Quizzes & Answers – Coursera

Understanding statistics in genomic data science can be both exhilarating and challenging. As we delve into the complexities of statistical analysis within this field, it’s crucial to have a solid foundation to build upon. This blog post aims to provide learners with a comprehensive guide to the Coursera course 'Statistics for Genomic Data Science'.

Here, you’ll find curated quizzes and their corresponding answers that not only test your knowledge but also enhance your learning experience. Whether you are a student, professional, or enthusiast in the realm of genomic research, this resource is designed to optimize your study and ensure you’re well-equipped to tackle statistical challenges in genomics.

Module 1 Quiz

 

Q1. Susan asks Joe for his data shared according to the data sharing plan discussed in the lectures. Which of the following are reasons the study may be reproducible, but not replicable?

 

Answer: The identified effect can be reproduced from Joe’s code and data, but may be due only to random variation and not appear in future studies.

Q2. Put the following code chunk at the top of an R markdown document called test.Rmd but set eval=TRUE

 

Answer: The plot is random the first time you knit the document. The second time you knit the document, the plot is identical to the first. After removing the folders test_cache and test_files, knitting generates a new random version.

Q3. Create a SummarizedExperiment object with the following code:

 

library(Biobase)
library(GenomicRanges)
data(sample.ExpressionSet, package = "Biobase")
se = makeSummarizedExperimentFromExpressionSet(sample.ExpressionSet)

 

Answer: Get the genomic table with assay(se), get the phenotype table with colData(se), get the feature data with rowData(se). rowRanges(se) gives information on the genomic location and structure of the measured features.

Q4. Suppose that you have measured ChIP-Seq data from 10 healthy individuals and 10 metastatic cancer patients. For each individual you split the sample into two identical sub-samples and perform the ChIP-Seq experiment on each sub-sample. How can you measure (a) biological variability, (b) technical variability and (c) phenotype variability?

 

Answer: (a) By looking at variation across samples from 10 different individuals with cancer; (b) by looking at variability between the measurements on the two sub-samples from the same sample; and (c) by comparing the average measurements on the healthy individuals to the measurements on the individuals with cancer.
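To make the three comparisons concrete, here is a minimal sketch on made-up numbers. The measurements, the dictionary names, and the use of plain Python instead of the course's R are all assumptions for illustration, not part of the quiz.

```python
from statistics import mean, variance

# Hypothetical measurements: individual -> (sub-sample 1, sub-sample 2).
cancer = {"c1": (10.0, 10.2), "c2": (12.0, 11.8), "c3": (14.0, 14.2)}
healthy = {"h1": (5.0, 5.2), "h2": (6.0, 5.8), "h3": (7.0, 7.2)}

# (b) Technical variability: spread between sub-samples of the same individual.
all_pairs = list(cancer.values()) + list(healthy.values())
technical = mean(variance(pair) for pair in all_pairs)

# (a) Biological variability: spread across individuals within one group.
cancer_means = [mean(pair) for pair in cancer.values()]
healthy_means = [mean(pair) for pair in healthy.values()]
biological = variance(cancer_means)

# (c) Phenotype variability: difference between the group averages.
phenotype_difference = mean(cancer_means) - mean(healthy_means)

print(technical, biological, phenotype_difference)
```

In this toy data the sub-samples agree closely, so the technical variance is much smaller than the variance across individuals.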

Q5. Just considering the phenotype data what are some reasons that the Bottomly data set is likely a better experimental design than the Bodymap data? Imagine the question of interest in the Bottomly data is to compare strains and in the Bodymap data it is to compare tissues.

 

Answer: The covariates in the Bottomly data set (experiment number, lane number) are balanced with respect to strain. The covariates in the Bodymap data set (gender, age, number of technical replicates) are not balanced with respect to tissue.

Q6. What are some reasons why this plot is not useful for comparing the number of technical replicates by tissue (you may need to install the plotrix package)?

 

Answer: The “mixture” category is split across multiple wedges.

Q7. Which of the following code chunks will make a heatmap of the 500 most highly expressed genes (as defined by total count), without re-ordering due to clustering? Are the highly expressed samples next to each other in sample order?

 

Answer:

# Total count per gene (row)
row_sums = rowSums(edata)
# Sort genes from highest to lowest total count
edata = edata[order(-row_sums),]
index = 1:500
# Rowv=NA and Colv=NA switch off clustering, so the order is preserved
heatmap(edata[index,],Rowv=NA,Colv=NA)

Q8. Make an MA-plot of the first sample versus the second sample using the log2 transform (hint: you may have to add 1 first) and the rlog transform from the DESeq2 package. How are the two MA-plots different? Which kind of genes appear most different in each plot?

 

Answer: The plots look pretty similar, but there are two strong diagonal stripes (corresponding to the zero count genes) in the log2 plot. In both cases, the genes in the middle of the expression distribution show the biggest differences, but the low abundance genes seem to show smaller differences with the rlog transform.
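The diagonal stripes follow directly from the log2(count + 1) transform: when one sample has a zero count, its transformed value is 0, so M equals plus or minus twice A. The sketch below checks this on invented counts; the function name ma_values and the count vectors are hypothetical, and it is written in plain Python rather than the course's R.

```python
import math

def ma_values(x, y):
    """Return (A, M) pairs for two count vectors after a log2(count + 1) transform."""
    pairs = []
    for cx, cy in zip(x, y):
        lx, ly = math.log2(cx + 1), math.log2(cy + 1)
        pairs.append(((lx + ly) / 2, lx - ly))  # A = average, M = difference
    return pairs

# Hypothetical counts for six genes in two samples.
sample1 = [0, 0, 5, 120, 0, 40]
sample2 = [3, 17, 0, 100, 0, 44]

ma = ma_values(sample1, sample2)

# Genes with a zero count in exactly one of the two samples.
stripe = [
    (a, m)
    for (a, m), cx, cy in zip(ma, sample1, sample2)
    if (cx == 0) != (cy == 0)
]

# Every such gene lies on one of the lines M = 2A or M = -2A.
on_diagonal = all(abs(abs(m) - 2 * a) < 1e-12 for a, m in stripe)
print(on_diagonal)
```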

Q9. Cluster the data in three ways: (1) with no changes to the data; (2) after filtering all genes with rowMeans less than 100; (3) after taking the log2 transform of the data without filtering. Color the samples by which study they came from (hint: consider using the function myplclust.R in the package rafalib available from CRAN and looking at the argument lab.col). How do the methods compare in terms of how well they cluster the data by study? Why do you think that is?

 

Answer: Clustering with or without filtering is about the same. Clustering after the log2 transform shows better clustering with respect to the study variable. The likely reason is that the highly skewed distribution doesn’t match the Euclidean distance metric being used in the clustering example.

Q10. Cluster the samples using k-means clustering after applying the log2 transform (be sure to add 1). Set a seed for reproducible results (use set.seed(1235)). If you choose two clusters, do you get the same two clusters as you get if you use the cutree function to cluster the samples into two groups? Which cluster matches most closely to the study labels?

 

Answer: They produce different answers. The k-means clustering matches study better. Hierarchical clustering would look better if we went farther down the tree, but the top split doesn’t perfectly describe the study variable.

Module 2 Quiz

 

Q1. What percentage of variation is explained by the 1st principal component in the data set if you: a) do no transformations? b) log2(data + 1) transform? c) log2(data + 1) transform and subtract row means?

 

Answer: a. 0.89 b. 0.97 c. 0.35
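The drop after subtracting row means makes sense because, without centering, the first component mostly captures the average level of each gene. Here is a rough sketch of that effect on a tiny invented matrix. It uses power iteration in plain Python as a stand-in for R's svd, and the matrix toy and both function names are hypothetical.

```python
import math

def first_component_fraction(matrix, iterations=200):
    """Fraction of the total sum of squares captured by the first singular value."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 + 0.01 * j for j in range(n_cols)]  # arbitrary non-constant start
    for _ in range(iterations):
        # u = A v, then w = A^T u (one power-iteration step on A^T A)
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
    d1_squared = sum(x * x for x in u)                  # first singular value squared
    total = sum(x * x for row in matrix for x in row)   # sum of all squared singular values
    return d1_squared / total

def subtract_row_means(matrix):
    """Center each row (gene) at zero."""
    return [[x - sum(row) / len(row) for x in row] for row in matrix]

# Hypothetical genes-by-samples matrix with very different row means.
toy = [[10, 11, 10, 11], [20, 21, 21, 20], [5, 5, 6, 6]]

raw_fraction = first_component_fraction(toy)
centered_fraction = first_component_fraction(subtract_row_means(toy))
print(raw_fraction, centered_fraction)
```

On this toy matrix the uncentered fraction is close to 1, while the centered rows are orthogonal with equal length, so the first component explains only about a third.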

Q2. Perform the log2(data + 1) transform and subtract row means from the samples. Set the seed to 333 and use k-means to cluster the samples into two clusters. Use svd to calculate the singular vectors. What is the correlation between the first singular vector and the sample clustering indicator?

 

Answer: 0.87

Q3. Fit a linear model relating the first gene’s counts to the number of technical replicates, treating the number of replicates as a factor. Plot the data for this gene versus the covariate. Can you think of why this model might not fit well?

 

Answer: There are very few samples with more than 2 replicates, so the estimates for those values will not be very good.

Q4. Fit a linear model relating the first gene’s counts to the age of the person and the sex of the samples. What is the value and interpretation of the coefficient for age?

 

Answer: -23.91. This coefficient means that for each additional year of age, the count goes down by an average of 23.91 for a fixed sex.

Q5. Perform the log2(data + 1) transform. Then fit a regression model to each sample using population as the outcome. Do this using the lm.fit function (hint: don’t forget the intercept). What is the dimension of the residual matrix, the effects matrix and the coefficients matrix?

 

Answer: Residual matrix: 129 x 52580

Effects matrix: 129 x 52580

Coefficients matrix: 2 x 52580

Q6. Perform the log2(data + 1) transform. Then fit a regression model to each sample using population as the outcome. Do this using the lm.fit function (hint: don’t forget the intercept). What is the effects matrix?

 

Answer:

Q7. Fit many regression models to the expression data where age is the outcome variable using the lmFit function from the limma package (hint: you may have to subset the expression data to the samples without missing values of age to get the model to fit). What is the coefficient for age for the 1,000th gene? Make a plot of the data and fitted values for this gene. Does the model fit well?

 

Answer: -27.61. The model doesn’t fit well since there are two large outlying values and the rest of the values are near zero.

Q8. Fit many regression models to the expression data where age is the outcome variable and tissue.type is an adjustment variable using the lmFit function from the limma package (hint: you may have to subset the expression data to the samples without missing values of age to get the model to fit). What is wrong with this model?

 

Answer: Since tissue.type is a factor variable with many levels, this model has more coefficients to estimate per gene (18) than data points per gene (16).

Q9. Why is it difficult to distinguish the study effect from the population effect in the Montgomery Pickrell dataset from ReCount?

 

Answer: The effects are difficult to distinguish because each study only measured one population.
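A quick cross-tabulation shows what this kind of confounding looks like in practice. The sketch below is plain Python on invented annotations: the sample counts are made up, and the population labels CEU and YRI are assumed here for illustration rather than taken from the quiz.

```python
from collections import Counter

# Hypothetical sample annotations (three samples per study, invented for illustration).
study = ["Montgomery"] * 3 + ["Pickrell"] * 3
population = ["CEU"] * 3 + ["YRI"] * 3

# Cross-tabulate study against population.
table = Counter(zip(study, population))

# Collect the set of populations observed within each study.
populations_per_study = {
    s: {p for s2, p in zip(study, population) if s2 == s} for s in set(study)
}

# The variables are completely confounded when every study maps to one population.
confounded = all(len(pops) == 1 for pops in populations_per_study.values())
print(dict(table), confounded)
```

Because the off-diagonal cells of the table are empty, no model can separate a study effect from a population effect.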

Q10. Set the seed using the command set.seed(33353), then estimate a single surrogate variable using the sva function after log2(data + 1) transforming the expression data, removing rows with rowMeans less than 1, and treating age as the outcome (hint: you may have to subset the expression data to the samples without missing values of age to get the model to fit). What is the correlation between the estimated surrogate for batch and age? Is the surrogate more highly correlated with race or gender?

 

Answer: Correlation with age: 0.20

More highly correlated with race.

Module 3 Quiz

 

Q1. Load the example SNP data with the following code: Fit a linear model and a logistic regression model to the data for the 3rd SNP. What are the coefficients for the SNP variable? How are they interpreted? (Hint: Don’t forget to recode the 0 values to NA for the SNP data)

 

Answer: Linear Model = -0.04

Logistic Model = -0.16

Both models are fit on the additive scale. So in the linear model case, the coefficient is the decrease in probability associated with each additional copy of the minor allele. In the logistic regression case, it is the decrease in the log odds ratio associated with each additional copy of the minor allele.

Q2. In the previous question why might the choice of logistic regression be better than the choice of linear regression?

 

Answer: If you included more variables it would be possible to get negative estimates for the probability of being a case from the linear model, but this would be prevented with the logistic regression model.
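The point about negative probabilities can be seen with a few lines of arithmetic. In this sketch the intercepts are invented, the slopes simply reuse the magnitudes reported in the previous answer, and the function names are hypothetical; it is plain Python rather than a fitted R model.

```python
import math

def linear_probability(intercept, slope, copies):
    """Fitted value from a linear model: not constrained to [0, 1]."""
    return intercept + slope * copies

def logistic_probability(intercept, slope, copies):
    """Fitted value from a logistic model: the inverse logit always lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * copies)))

# Hypothetical intercepts; copies = number of minor-allele copies (0, 1 or 2).
linear_fits = [linear_probability(0.05, -0.04, c) for c in (0, 1, 2)]
logistic_fits = [logistic_probability(-3.0, -0.16, c) for c in (0, 1, 2)]

print(linear_fits)    # the last value falls below zero
print(logistic_fits)  # every value stays strictly between 0 and 1
```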

Q3. Load the example SNP data with the following code: Fit a logistic regression model on a recessive (need 2 copies of minor allele to confer risk) and additive scale for the 10th SNP. Make a table of the fitted values versus the case/control status. Does one model fit better than the other?

 

Answer: No, in all cases, the fitted values are near 0.5 and there are about an equal number of cases and controls in each group. This is true regardless of whether you fit a recessive or additive model.
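The two scales differ only in how the genotype is coded before it enters the regression. A minimal sketch of the two codings follows, assuming genotypes are already expressed as minor-allele copy counts (0, 1, 2) with None for a missing call; that encoding and the function names are assumptions, not the format of the course data.

```python
def additive(copies):
    """Additive coding: the predictor is the number of minor-allele copies."""
    return copies

def recessive(copies):
    """Recessive coding: 1 only when both copies are the minor allele."""
    return None if copies is None else int(copies == 2)

# Hypothetical genotypes for six samples; None marks a missing call.
genotypes = [0, 1, 2, None, 2, 1]

additive_codes = [additive(g) for g in genotypes]
recessive_codes = [recessive(g) for g in genotypes]

print(additive_codes)
print(recessive_codes)
```

Heterozygous samples count as one unit of risk under the additive coding and as zero under the recessive coding.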

Q4. Load the example SNP data with the following code: What is the average effect size? What is the max? What is the minimum?

 

Answer: Average effect size = 0.007, minimum = -4.25, maximum = 3.90

Q5. Load the example SNP data with the following code: What is the correlation with the results from using snp.rhs.tests and chi.squared? Why does this make sense?

 

Answer: > 0.99. They are both testing for the same association using the same additive regression model on the logistic scale but using slightly different tests.

Q6. Load the Montgomery and Pickrell eSet: Do the log2(data + 1) transform and calculate F-statistics for the difference between studies/populations using genefilter:rowFtests and using genefilter:rowttests. Do you get the same statistic? Do you get the same p-value?

 

Answer: You get the same p-value but different statistics. This is because the F-statistic and t-statistic test the same thing when doing a two group test and one is a transform of the other.
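The transform in question is a square: with two groups and a pooled variance, F equals t squared. The sketch below verifies this on invented values for a single gene, using hand-written formulas in plain Python instead of the genefilter functions; the data and function names are hypothetical.

```python
import math

def group_stats(values):
    """Return (n, mean, sum of squared deviations) for one group."""
    n = len(values)
    m = sum(values) / n
    return n, m, sum((x - m) ** 2 for x in values)

def pooled_t(a, b):
    """Two-sample t-statistic assuming equal variances."""
    na, ma, ssa = group_stats(a)
    nb, mb, ssb = group_stats(b)
    pooled_var = (ssa + ssb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def anova_f(a, b):
    """One-way ANOVA F-statistic for two groups (1 numerator degree of freedom)."""
    na, ma, ssa = group_stats(a)
    nb, mb, ssb = group_stats(b)
    grand = (sum(a) + sum(b)) / (na + nb)
    between = na * (ma - grand) ** 2 + nb * (mb - grand) ** 2
    within = (ssa + ssb) / (na + nb - 2)
    return between / within

# Hypothetical log2 expression values for one gene in two studies.
study_a = [5.1, 4.8, 5.6, 5.0]
study_b = [6.2, 6.0, 5.7, 6.5, 6.1]

t_stat = pooled_t(study_a, study_b)
f_stat = anova_f(study_a, study_b)
print(t_stat, f_stat, t_stat ** 2)
```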

Q7. Load the Montgomery and Pickrell eSet: First test for differences between the studies using the DESeq2 package using the DESeq function. Then do the log2(data + 1) transform and do the test for differences between studies using the limma package and the lmFit, ebayes and topTable functions. What is the correlation in the statistics between the two analyses? Are there more differences for the large statistics or the small statistics (hint: make an MA-plot)?

 

Answer: 0.93. There are more differences for the small statistics.

Q8. Apply the Benjamini-Hochberg correction to the P-values from the two previous analyses. How many results are statistically significant at an FDR of 0.05 in each analysis?

 

Answer: DESeq = 1995 significant; limma = 2807 significant
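For readers who want to see what the correction does to each P-value, here is a small sketch of the Benjamini-Hochberg adjustment in plain Python. The function name and the five P-values are invented for illustration; in the course itself you would normally call R's p.adjust with method "BH".

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Visit p-values from largest to smallest so a running minimum
    # enforces monotonicity of the adjusted values.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values from five tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
qvals = benjamini_hochberg(pvals)

# Count the results that are significant at an FDR of 0.05.
n_significant = sum(q <= 0.05 for q in qvals)
print(qvals, n_significant)
```

Note that two of the raw P-values fall below 0.05 but are no longer significant after adjustment.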

Q9. Is the number of significant differences surprising for the analysis comparing studies from Question 8? Why or why not?

 

Answer: Yes and no. It is surprising because there is a large fraction of the genes that are significantly different, but it isn’t that surprising because we would expect that when comparing measurements from very different batches.

Q10. Suppose you observed the following P-values from the comparison of differences between studies. Why might you be suspicious of the analysis?

 

Answer: The p-values should have a spike near zero (the significant results) and be flat to the right hand side (the null results), so the distribution pushed toward one suggests something went wrong.

Module 4 Quiz

 

Q1. When performing gene set analysis it is critical to use the same annotation as was used in pre-processing steps. Read the paper behind the Bottomly data set on the ReCount database: http://www.ncbi.nlm.nih.gov/pubmed?term=21455293 Using the paper and the function: supportedGenomes() in the goseq package can you figure out which of the Mouse genome builds they aligned the reads to.

 

Answer: UCSC mm9

Q2. Load the Bottomly data with the following code and perform a differential expression analysis using limma with only the strain variable as an outcome. How many genes are differentially expressed at the 5% FDR level using Benjamini-Hochberg correction? What is the gene identifier of the first gene differentially expressed at this level (just in order, not the smallest FDR) ? (hint: the featureNames function may be useful)

 

Answer: 223 at FDR 5%; ENSMUSG00000000402 first DE gene

Q3. Use the nullp and goseq functions in the goseq package to perform a gene ontology analysis. What is the top category that comes up as over represented? (hint: you will need to use the genome information on the genome from question 1 and the differential expression analysis from question 2.)

 

Answer: GO:0004888

Q4. Look up the GO category that was the top category from the previous question. What is the name of the category?

 

Answer: transmembrane signaling receptor activity

Q5. Load the Bottomly data with the following code and perform a differential expression analysis using limma and treating strain as the outcome but adjusting for lane as a factor. Then find genes significant at the 5% FDR rate using the Benjamini-Hochberg correction and perform the gene set analysis with goseq following the protocol from the first 4 questions. How many of the top 10 overrepresented categories are the same for the adjusted and unadjusted analysis?

 

Answer: 3


  • Helen Bassey

    Hi, I'm Helena, a blog writer who is passionate about publishing insightful content in the education niche. I believe education is the key to personal and social development, and I want to share my knowledge and experience with learners of all ages and backgrounds. On my blog you will find articles on topics such as learning strategies, online education, career guidance, and more. I also welcome feedback and suggestions from my readers, so feel free to leave a comment or contact me at any time. I hope you enjoy reading my blog and find it useful and inspiring.
