Paired Test Two Sample Test & Appropriate Hypothesis Test Worksheet

ST 352 R Tutorial Assignment 2: Two-Sample and Paired t-methods
40 points

Instructions
• Type all answers to questions in a separate document. Once completed, submit your document in Canvas (.doc, .docx, or .txt files only). ONCE AGAIN, TYPE ANSWERS ONLY TO THE QUESTIONS IN A SEPARATE DOCUMENT!! Your assignment will not be graded if answers are typed in this document and submitted.

The following topics are covered in R Tutorial Assignment 2:
• Two-sample t-methods in R, including how data need to be entered into an Excel spreadsheet to use the two-sample t-methods in R
• Paired t-methods, including how data need to be entered into an Excel spreadsheet to use the paired t-methods in R

Notes:

• Part I is a reminder of the activity you need to perform well ahead of when the assignment is due.
• Parts II through IV are the “tutorial” part of this assignment. There are no questions associated with Parts II through IV, but going through these parts on your own will be helpful in answering the questions in Part V. There will not be any code provided in Part V.
• Part V is the “assignment” part. Answer the questions in Part V on a separate document and submit that document.

Part I: Activity for Problem 1 in Part V below

For this activity, you will need to collect data that will be used in Problem 1 in Part V below. Make sure you allot time to complete this activity!!

Objective of activity: In Problem 1 below, you will be asked to determine if the average price of items at two grocery stores is different. To answer this question, you will take a sample of the same 15 items at both stores and use this information to perform the appropriate hypothesis test.

Note: You will collect the data online for this activity.

Steps for activity:
1. Decide which two grocery stores you want to compare. (Note: you will be collecting data by searching the Internet. You may want to first check to see if the stores you have chosen have prices of items online.)
2. Decide on 15 products to be part of your representative sample from each store. You will sample the exact same 15 products at both stores! For example, if you select a 12-ounce box of Kellogg’s Frosted Flakes at Store A, you will also select a 12-ounce box of Kellogg’s Frosted Flakes at Store B.
   • While you are asked to decide on these 15 items before visiting each store’s website, you may have to peruse each store’s website to see what items they have online, as not all items that are in the physical store may be online.
   • As best you can, try to obtain a representative sample from the stores.
3. Go to each store’s website and record the prices of each item at each store.

   • Before analyzing your data, you will have to enter the data in the correct format in an Excel spreadsheet to import into R. The correct format is discussed in Part II below. For now, you may just want to write down the name of the product and price in a notebook.
   • Note: If a price is on sale or reduced (e.g., a “club member” price), use the original (non-reduced) price if it is given. If only the sale price is given, then you will have to use that price.
4. Decide which method (the two-sample t-methods or the paired t-methods) is appropriate for this problem based on how the data were collected. It is important to think about this now before proceeding to the next step.
5. You are now ready to complete Problem 1 in the assignment below.

Part II: Entering data into an Excel spreadsheet

How the data are entered into an Excel spreadsheet depends on whether it is more appropriate to use the two-sample t-methods or the paired t-methods.

Two-sample t-methods

1. Decide which variable is the response variable and which is the explanatory variable.
2. Place the values of the response variable in the first column in the spreadsheet. Give the column a one-word name at the top of the column.
3. Place the categories of the explanatory variable in the second column so that the value of the response variable and the category the response came from are in the same row. Give the column a one-word name at the top of the column.

Example: A random sample of male college students and a random sample of female college students were selected. For each person, their sex and amount of time spent exercising (minutes per day) were recorded.

• Because the two random samples were independently taken, a two-sample t-test is appropriate to determine if the average time spent exercising per day is different between male and female college students.
• Identify the variables.
   o The response variable is time spent exercising (minutes per day).

   o The explanatory variable is sex.
• Therefore, time spent exercising each day goes in one column and sex goes in the other. (The spreadsheet below lists sex first; R identifies the columns by name, so the column order does not affect the analysis.)

sex      exercise
Male     30
Male     30
Female   45
Male     60
Male     0
Female   30
Female   30
Male     30
Male     40

Note that the first row holds the recorded data for individual 1: a male student who exercised 30 minutes per day. Row 2 is for individual 2 (another male student who exercised 30 minutes per day), and so on.

Paired t-methods

1. The values of the response variable for category 1 are recorded in the first column. (“Category 1” could be individual #1 for each paired set of observations, or it could be the “before” measurement for each individual.)
2. The values of the response variable for category 2 are recorded in the second column so that each value is in the same row as its matched value. (“Category 2” could be individual #2 for each paired set of observations, or it could be the “after” measurement for each individual.)

Example: Weights of 9 students (in pounds) before and after a month-long training schedule are recorded.

Is there evidence to indicate the mean weight after training is less than the mean weight before training?

• Because a before and after measurement was taken on each of the 9 students, this is a matched-pairs design.
   o Keep in mind that a matched-pairs design can also be constructed by matching two “individuals” on similar characteristics.
• The before measurements for each of the 9 students will go in one column in the spreadsheet. Give the column a one-word name (such as “before”).
• The after measurements for each of the 9 students will go in the second column. Give the column a one-word name (such as “after”).
• Although not necessary, the first column could be an identifier for each individual:

person   before   after
1        124      133
2        152      145
3        129      131
4        144      150
5        150      141
6        152      137
7        138      130
8        161      159
9        140      141

Part III: R code for two-sample t-methods

If the two-sample t-methods are more appropriate to use, follow the code below to explore the sample data, perform the t-test, and obtain the confidence interval for the difference in population means.

Example: Is there a difference in the average amount of time spent exercising each day between male and female OSU students? Use the appropriate variables in the studentsurvey data set to answer this question. The two variables to use are called sex and exercise.

Let’s assume that the daily amount of exercise for male and female students in the samples is representative of the daily amount of exercise for all OSU male and female students (although that could be debated).

• Exploring the data

Recall that when comparing a quantitative variable between two (or more) groups, side-by-side box-and-whisker plots work best to get an idea of the shape of the data in each sample, and also to get an idea of whether the null hypothesis will be rejected in favor of the alternative hypothesis.

R code for side-by-side box-and-whisker plots:

General code:

   boxplot(yyy ~ xxx, data = zzz, horizontal = TRUE, main = " ", xlab = " ")

   o Replace zzz with the name of the data set
   o Replace yyy with the name of the response variable
   o Replace xxx with the name of the explanatory variable
   o horizontal = TRUE is an optional argument that displays the box-and-whisker plots horizontally instead of vertically (which is the default)
   o Give the graph a title by typing in the title between the quotes in the main = argument.
   o Properly label the x-axis by typing in the label (with units) between the quotes in the xlab = argument

Specific code for this example:

   boxplot(exercise ~ sex, data = studentsurvey, horizontal = TRUE, main = "Comparison of Time Spent Exercising between Males and Females", xlab = "time spent exercising per day (minutes)")

R code for summary statistics:

If you have the mosaic package installed on your computer:

   require(mosaic)
   favstats(yyy ~ xxx, data = zzz)

If you do not have the mosaic package installed on your computer, the tapply command will give you summary statistics for each group:

   tapply(zzz$yyy, zzz$xxx, summary)

   o Replace zzz with the name of the data set
   o Replace yyy with the name of the response variable
   o Replace xxx with the name of the explanatory variable

Specific code for this example:

   favstats(exercise ~ sex, data = studentsurvey)

OR

   tapply(studentsurvey$exercise, studentsurvey$sex, summary)

Note: if using tapply and you want the standard deviation for each group, replace summary with sd:

   tapply(studentsurvey$exercise, studentsurvey$sex, sd)
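As a minimal sketch of the exploration commands above, the following builds a small stand-in data frame from the nine example rows shown in Part II and then draws the plot and computes the group summaries. The stand-in is an assumption made so the code runs without importing anything; it is not the real studentsurvey data set. The object names group_summary, group_sd, and group_n are introduced here for illustration only.

```r
# Hypothetical stand-in for studentsurvey, typed in from the nine
# example rows in Part II (NOT the real survey data).
studentsurvey <- data.frame(
  sex = c("Male", "Male", "Female", "Male", "Male",
          "Female", "Female", "Male", "Male"),
  exercise = c(30, 30, 45, 60, 0, 30, 30, 30, 40)
)

# Side-by-side box-and-whisker plots: response ~ explanatory
boxplot(exercise ~ sex, data = studentsurvey, horizontal = TRUE,
        main = "Comparison of Time Spent Exercising between Males and Females",
        xlab = "time spent exercising per day (minutes)")

# Summary statistics, standard deviation, and sample size for each group
group_summary <- tapply(studentsurvey$exercise, studentsurvey$sex, summary)
group_sd      <- tapply(studentsurvey$exercise, studentsurvey$sex, sd)
group_n       <- tapply(studentsurvey$exercise, studentsurvey$sex, length)

print(group_summary)
print(group_sd)
print(group_n)
```

With only nine rows the plots are rough, so treat this as a check that the code runs rather than as an analysis.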

• Two-sample t-test and confidence interval for difference in population means

The t.test() command is used.

General R code:

   t.test(yyy ~ xxx, data = zzz, mu = mmm, alternative = "aaa", conf.level = ccc)

   o Replace zzz with the name of the data set
   o Replace yyy with the name of the response variable
   o Replace xxx with the name of the explanatory variable
   o Replace mmm with the hypothesized value for the difference in population means (which will always be 0 for our purposes). The default is 0.
   o Replace aaa with either "less", "greater", or "two.sided", for alternatives of “less than”, “greater than”, or “not equal to”, respectively. The default is "two.sided".
      - The text used must be typed between double quotes.
      - If you are performing a one-sided test, it is important to be aware of how R is subtracting.

         • If the categories of the explanatory variable are coded with text, R subtracts alphabetically (μfemale – μmale, for example)
         • If the categories are coded numerically, R subtracts in numerical order (μ1 – μ2, for example)
   o Replace ccc with the desired level of confidence as a proportion. The default is 0.95.

Specific code for this example:

   t.test(exercise ~ sex, data = studentsurvey, mu = 0, alternative = "two.sided", conf.level = 0.95)

Note: R calculates the degrees of freedom using a rather lengthy formula (the test is called Welch’s two-sample t-test). This is the correct degrees of freedom and should be used, but you will not need to know how to get the degrees of freedom by hand using the formula. Instead, if doing such a problem by hand on an exam, use the conservative estimate: the smaller of the two sample sizes minus 1.

Important note about confidence intervals: If your alternative is "less" or "greater", one of the bounds of the confidence interval in the output will be “Inf” (or “-Inf”), meaning there is no finite value for that bound. We always want values for both bounds! Here’s what you need to do in this situation:

• Record the t-statistic, degrees of freedom, and p-value for the hypothesis test.
• Re-run the code with alternative = "two.sided"
• Record both bounds of the confidence interval (and ignore the p-value)

Note that the above holds for both the two-sample t-test problem above AND the paired t-test problem below!!

Part IV: R code for paired t-methods

If the data are paired (either “before/after” measurements taken on each individual, or individuals matched together in pairs based on similar characteristics), the paired t-methods should be used.

Example: Weights of 9 students (in pounds) before and after a month-long training schedule are given below. Is there evidence to indicate the mean weight after training is less than the mean weight before training?

Person   Before weight   After weight
1        124             133
2        152             145
3        129             131
4        144             150
5        150             141
6        152             137
7        138             130
8        161             159
9        140             141

The data are stored in the training data set on Canvas.

It is important to realize that the analysis of paired data is on the differences in measurements for each individual.
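That point can be checked directly in R. The sketch below types in the nine before and after weights from the example by hand (an assumption made so the code is self-contained; it does not import the training data set from Canvas). It shows that t.test() with paired = TRUE matches a one-sample t-test on the differences, and that a one-sided alternative leaves one confidence bound infinite until the code is re-run as two-sided. The object names paired_fit, onesample_fit, and two_sided_fit are hypothetical.

```r
# Weights typed in from the example table (stand-in for the training data set)
training <- data.frame(
  person = 1:9,
  before = c(124, 152, 129, 144, 150, 152, 138, 161, 140),
  after  = c(133, 145, 131, 150, 141, 137, 130, 159, 141)
)

# Differences computed as before weight - after weight
dif <- training$before - training$after
print(dif)

# Paired t-test; "greater" matches an alternative of mean difference > 0
paired_fit <- t.test(training$before, training$after,
                     paired = TRUE, alternative = "greater")

# One-sample t-test on the differences gives the same t, df, and p-value
onesample_fit <- t.test(dif, mu = 0, alternative = "greater")
print(all.equal(unname(paired_fit$statistic), unname(onesample_fit$statistic)))
print(all.equal(paired_fit$p.value, onesample_fit$p.value))

# One-sided output: the upper confidence bound is Inf
print(paired_fit$conf.int)

# Re-run as two-sided to record both confidence bounds (ignore this p-value)
two_sided_fit <- t.test(dif, mu = 0, alternative = "two.sided",
                        conf.level = 0.95)
print(two_sided_fit$conf.int)
```

Subtracting in the other order (after minus before) would flip the sign of every difference, so the alternative would then need to be "less".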

Below is the Excel spreadsheet for the above data with a column called “difference”. These differences for each individual were calculated as follows: before weight – after weight. We could have just as easily (and just as legitimately) subtracted after weight – before weight.

• If you are subtracting manually, it is important to remember how you subtracted in order to 1) state the hypotheses correctly (especially the alternative hypothesis if a one-sided test is performed), and 2) interpret the confidence interval for the difference in population means correctly.

person   before   after   dif
1        124      133     -9
2        152      145     7
3        129      131     -2
4        144      150     -6
5        150      141     9
6        152      137     15
7        138      130     8
8        161      159     2
9        140      141     -1

Exploring the data

Recall that when data are paired, we analyze the differences in the paired values.

Therefore, we have one column (or “vector”) of data that we analyze. Obtain a histogram, dotplot, or box-and-whisker plot of the differences.

The training data set does not have a column of differences. Therefore, you need to manually create a vector of differences in R after importing the data set:

   > dif …

… 0, where diff = before weight – after weight. Since the alternative hypothesis contains a greater than symbol, we use "greater" in the code.

Part V: The Assignment

Answer the questions for the following two problems on a separate document and submit that document.

Part of what you have to do for each of these two problems is to decide which inference method to use: the two-sample t-methods or the paired t-methods. Once you have decided, refer to Parts II through IV above for help with 1) how to enter your data into an Excel spreadsheet correctly to import into R, and 2) how to use R to explore and analyze the data.

Problem 1: 20 points total

In this problem, you will analyze the data you collected in Part I above to answer the following question: Is there a difference in the average prices between the two grocery stores you decided to compare?

Use your data to answer the following questions.

1. (2 points) Which of the following is the most appropriate hypothesis test to use in this problem: paired test or two-sample test? Why?
2. (3 points) State the null and alternative hypotheses in statistical notation. Define any parameters used.
3. Provide a properly labeled appropriate graphical display.
   a. (1 point) Include the graphical display here.
   b. (2 points) Do you feel it is appropriate to use the t-methods for this problem? Why or why not?
   c. (2 points) Do you feel the null hypothesis will be rejected based on your graphical display? Why or why not?
4. (2 points) Regardless of your answer to #3b above, run the code in R to obtain the t-statistic and p-value. Report the t-statistic with degrees of freedom and the p-value here.
5. (3 points) Based on your p-value, state a conclusion in the context of the problem.
6. (2 points) Do you feel that your sample is representative of all the products at these two stores, so that you can make a conclusion about all products? Explain.
7. (2 points) Which of the t…
