Chapter 1: The What and Why of Statistics
The following excerpt is taken from the U.S. Census Bureau:
"The American Community Survey (ACS) is a nationwide survey designed to provide communities a fresh look at how they are changing. It will replace the decennial long form in future censuses and is a critical element in the Census Bureau's reengineered 2010 census. The decennial census has two parts: 1) the short form, which counts the population; and 2) the long form, which obtains demographic, housing, social, and economic information from a 1-in-6 sample of households. Information from the long form is used for the administration of federal programs and the distribution of billions of federal dollars. Since this is done only once every 10 years, long-form information becomes out of date. Planners and other data users are reluctant to rely on it for decisions that are expensive and affect the quality of life of thousands of people. The American Community Survey is a way to provide the data communities need every year instead of once in ten years" (http://www.census.gov/acs/www/SBasics/What/What1.htm).
Using the lessons from Chapter 1, let's take a closer look at some of the ACS variables. Begin by visiting the U.S. Census Bureau's website: http://www.census.gov/. On the left side of your screen, click on the link that says, "American FactFinder." In the middle of your screen, locate the link, "get data," under the American Community Survey heading. You will see that the most recent ACS data file is automatically selected for you. Leave this selection as is. On the right of your screen, click on the link, "Detailed Tables."
At this point, examine the page in front of you. Begin by selecting your unit of analysis by clicking the down arrow in the box under the "Select a Geographic Type" heading. For starters, select, "State." When you do so, you will see that a list of states populates the box below. Select 10 states of interest to you. Do so by selecting one state and then, while holding the Ctrl key down on your keyboard, selecting the remaining 9 states. Once all 10 states are selected, click on the word "Add" below the box containing the state names. Then click "Next" at the bottom of your screen.
You should now have a list of variables in front of you. Scroll through these variables using the up and down arrows on the right side of the box containing the variables. Find several variables of interest to you. Select them, again, by selecting the first variable of interest and then, while holding the Ctrl key down on your keyboard, selecting the remaining variables of interest. Click "Add." Once the variables of interest have been added to the Current Table Selections box, click "Show Result."
Examine your results. What do you notice about the variables that you selected? What is the level of measurement (nominal, ordinal, or interval/ratio) for each, and how can you tell? Can you think of any hypotheses that you might be interested in testing using these data?
Chapter 2: Organization of Information: Frequency Distributions
The following web exercise utilizes data from the World Values Survey (WVS). Begin by visiting the following URL: http://www.worldvaluessurvey.org/. Familiarize yourself with the WVS by reading through some of the major findings that have resulted from this effort in recent years. (Hint: see the "Findings" link in the mid to upper left part of your screen).
For this exercise, we will use the WVS' online data analysis tool to construct frequency distributions. Begin by clicking on the link for "Online Data Analysis" on the left part of your screen. In what follows, you will see a list of countries on the screen in front of you. As you can see, the various samples (or waves) of the survey are listed at the top of the page. Thus, for example, you can see that respondents from Albania were surveyed in 1998 during the 1994-1999 wave. Scroll through the list of countries using the scroll bar on the right of your screen. Select five countries of interest, preferably within the same wave to promote cross comparability. Once you have selected your countries of interest, click on the red "Continue" tab located at the top right of your screen. Next, we need to select the variable(s) of interest for our analysis. On the screen in front of you, there are several classes of variables (e.g., Perceptions of Life, National Identity, etc.). Click once on "Perceptions of Life." A list of variables should populate the screen below the box containing the various classes of variables. Locate the set of variables associated with "Qualities that children can be encouraged to learn at home." Select the variable A030 ("Important child qualities: hard work (A030)").
Having made it this far, you should now see a new window in front of you with the "Question text" tab already selected. Take a look at the question. This was the question as it was asked on the survey. Next, select the "Marginals" tab. Here is where you should first begin to recognize the general form of a frequency distribution as it was introduced in Chapter 2. Regarding our variable of interest, A030, there were two response categories: "Not mentioned" and "Important." Within each of these rows, the columns for both the frequencies and the percentages have been generated for each of the five countries you selected, as well as across all five countries.
Taking this example a bit further, select the "Cross-tabs" tab. Within the dropdown menu for "Operations," select the option for "Absolute values." Within the dropdown menu for "Cross by," select the option for "Total." The frequency distribution below should now contain only the raw frequencies. Take five minutes and calculate by hand the percentage distribution column for each of the five countries. Set these calculations aside for the moment. Navigate up to the dropdown menu for "Operations" and select the option for "Show %/(Column)." The table below should now show the relevant percentages, as opposed to the raw frequencies. Confirm that these percentages are the same as what you calculated by hand.
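If you would like to check your hand calculations another way, the short Python sketch below converts raw frequencies into a percentage distribution. The frequencies shown are hypothetical placeholders, not WVS results; substitute the counts you see on your screen for each country.

```python
def percentage_distribution(frequencies):
    """Convert raw category frequencies into percentages of the column total."""
    total = sum(frequencies.values())
    return {category: 100 * count / total
            for category, count in frequencies.items()}


# Hypothetical frequencies for one country (placeholders, not WVS data).
example = {"Not mentioned": 450, "Important": 550}

for category, pct in percentage_distribution(example).items():
    print(f"{category}: {pct:.1f}%")
```

Each percentage is simply the category frequency divided by the column total, multiplied by 100, so the percentages for a country should sum to 100.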
If you feel relatively confident about the various steps involved in this exercise, repeat this exercise, varying the set of countries selected, the particular survey wave selected, and/or the variables selected.
Chapter 3: Graphic Presentation
The following web exercise utilizes data from the World Values Survey (WVS). Begin by visiting the following URL: http://www.worldvaluessurvey.org/. Familiarize yourself with the WVS by reading through some of the major findings that have resulted from this effort in recent years. (Hint: see the "Findings" link in the mid to upper left part of your screen).
For this exercise, we will use the WVS' online data analysis tool to construct a series of bar graphs. Begin by clicking on the link for "Online Data Analysis" on the left part of your screen. In what follows, you will see a list of countries on the screen in front of you. As you can see, the various samples (or waves) of the survey are listed at the top of the page. Thus, for example, you can see that respondents from Albania were surveyed in 1998 during the 1994-1999 wave. Scroll through the list of countries using the scroll bar on the right of your screen. Select three countries of interest, preferably within the same wave to promote cross comparability. Once you have selected your countries of interest, click on the red "Continue" tab located at the top right of your screen. Next, we need to select the variable(s) of interest for our analysis. On the screen in front of you, there are several classes of variables (e.g., Perceptions of Life, National Identity, etc.). Click once on "Religion and Morale." A list of variables should populate the screen below the box containing the various classes of variables. Locate the set of variables associated with "Justification of social behaviors." Select the variable F123 ("Justifiable: Suicide (F123)").
Having made it this far, you should now see a new window in front of you with the "Question text" tab already selected. Take a look at the question. This was the question as it was asked on the survey. Next, select the "Graphics" tab. You should now see a bar chart on the screen in front of you. Within the dropdown menu for "Cross by," select the option for "Total." Once you have done so, go ahead and examine the bar graph in front of you. How do the three countries you selected differ with respect to attitudes on suicide? Consider going back and selecting another variable of interest. What might you expect to find?
To take this exercise a bit further, notice the dropdown menu on the left of your screen for "Chart type." This dropdown menu contains seven possible chart types. Select one of these options and note how the results are displayed differently than in an ordinary bar graph. See if you can try each one of these options. Which do you like best? What are the advantages and disadvantages associated with each?
Chapter 4: Measures of Central Tendency
Perhaps the best way to get comfortable with measures of central tendency is simply to work through as many examples as possible. Related to this, one very helpful set of tools is that provided by Statistics Canada, Canada's central statistical agency, available at the following URL: http://www.statcan.ca/english/edu/power/ch11/first11.htm.
On the right side of your screen, you will see a menu that contains, among other things, links to modules for the mean, the median, and the mode. Begin by selecting the link for the mean. As you read through the materials provided, you will have the opportunity to view a number of examples as you go along, as well as to select hyperlinked terms so as to learn more about them. After you complete this module, return to the original URL as provided above and make your way through the modules for the median and the mode.
After you have completed the modules above, click the "Exercises" link from the table on the right of your screen. Work through each of the exercises that relate to the modules that you covered. As you go along, you have the opportunity to view the answers to the questions as you complete them by clicking on the various links associated with each question. This will allow you to gauge your level of comprehension as you work through the problems.
Once you have completed these exercises, examine those questions that were the most difficult for you and develop a strategy for addressing these difficulties.
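One possible strategy is to check your hand calculations against a short script. The Python sketch below uses the standard library's statistics module; the scores are hypothetical values chosen only for illustration.

```python
import statistics

# Hypothetical scores (illustration only).
scores = [2, 3, 3, 5, 7, 8, 9]

mean_value = statistics.mean(scores)      # sum of the scores divided by N
median_value = statistics.median(scores)  # middle score once sorted
mode_value = statistics.mode(scores)      # most frequently occurring score

print(f"mean = {mean_value:.2f}, median = {median_value}, mode = {mode_value}")
```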
Chapter 5: Measures of Variability
As we did with the web exercise for Chapter 4, we return to the webpage of Statistics Canada, Canada's central statistical agency, available at the following URL: http://www.statcan.ca/english/edu/power/ch12/first12.htm.
On the right side of your screen, you will see a menu that contains, among other things, links to modules for the range and quartiles, the variance and standard deviation, and many others. Begin by selecting the link for the range and quartiles. As you read through the materials provided, you will have the opportunity to view a number of examples as you go along, as well as to select hyperlinked terms so as to learn more about them. After you complete this module, return to the original URL as provided above and make your way through the module for the variance and standard deviation.
After you have completed the modules above, click the "Exercises" link from the table on the right of your screen. Work through each of the exercises that relate to the modules that you covered. As you go along, you have the opportunity to view the answers to the questions as you complete them by clicking on the various links associated with each question. This will allow you to gauge your level of comprehension as you work through the problems.
Once you have completed these exercises, examine those questions that were the most difficult for you and develop a strategy for addressing these difficulties.
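As with Chapter 4, a short script can help you verify your hand calculations. The Python sketch below computes the range, the variance, and the standard deviation for a set of hypothetical scores. Note that it reports both the population versions (dividing by N) and the sample versions (dividing by N - 1); use whichever matches the formula in your exercise.

```python
import statistics

# Hypothetical scores (illustration only).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

score_range = max(scores) - min(scores)

# Population formulas divide the sum of squared deviations by N.
population_variance = statistics.pvariance(scores)
population_sd = statistics.pstdev(scores)

# Sample formulas divide by N - 1 instead.
sample_variance = statistics.variance(scores)
sample_sd = statistics.stdev(scores)

print(f"range = {score_range}")
print(f"population variance = {population_variance}, sd = {population_sd}")
print(f"sample variance = {sample_variance:.3f}, sd = {sample_sd:.3f}")
```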
Chapter 6: The Normal Distribution
Begin by watching the following very helpful illustration of the normal distribution: http://www.ms.uky.edu/~mai/java/stat/GaltonMachine.html
Next, navigate to the following URL: http://psych-www.colorado.edu/~mcclella/java/normal/normz.html
Using the tools provided on this webpage, we can enter an observed value (see "Y"), the sample mean (see "Mean"), and the sample standard deviation (see "Std Dev") and automatically have the z-score and the area under the normal curve calculated for us. Let's try an example.
When the Scholastic Aptitude Test, later renamed the Scholastic Assessment Test (SAT), was first developed, scoring was originally designed to result in a mean of 500 per section, with a standard deviation of 100. Suppose that a student obtains a 540 on the math section of the test. What is the associated z-score? What is the associated area under the normal curve?
Enter the quantities above into their respective cells on the page in front of you. Once you have done so, hit the return key on your keyboard. The associated z-score is 0.4. At the bottom of the box you are working in, use the dropdown menu to select "Cumulative." Since the area under the curve is 0.655, we can easily see from the graph that this score falls at the 65.5th percentile, meaning that about 35% of scores fall above this particular score and about 65% of scores fall below it.
Repeat the above example on your own. Be sure to consider raw scores that are both above and below the mean score. In comparing the results for scores above and below the mean, how do the z-scores differ? How do the areas under the normal curve differ?
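To check your answers without the webpage, the same two quantities can be computed with a few lines of Python. This is a minimal sketch that uses the error function from the standard math module to obtain the cumulative area; the function names are our own and are not part of the website.

```python
import math


def z_score(y, mean, std_dev):
    """Number of standard deviations that y lies from the mean."""
    return (y - mean) / std_dev


def cumulative_area(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


z = z_score(540, 500, 100)
area = cumulative_area(z)
print(f"z = {z:.2f}, cumulative area = {area:.3f}")
# prints: z = 0.40, cumulative area = 0.655
```

Try a score below the mean, such as 460, and compare the sign of the z-score and the size of the cumulative area with the results above.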
Chapter 7: Sampling and Sampling Distributions
In Chapter 7, the notion of a probability sample was introduced. Begin by reading more about probability sampling and types of probability samples at the Statistics Canada website: http://www.statcan.ca/english/edu/power/ch13/probability/probability.htm.
Next, let's try a few sampling simulations. Navigate to the following URL: http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html.
Click the "Begin" button on the upper left of your screen. At this point, a new window will open for you. Maximize this window. The distribution at the top of your screen is a population distribution. The idea here is that we are going to draw several samples from this population. In doing so, we will see not only what the distribution of our sample looks like, but also how the sampling distribution takes shape.
Under the "Sample:" label, click the "Animated" button. You have just drawn one sample of size five from the population. This sample is plotted in the second graph from the top of your screen. What this simulation also did was to calculate the mean of this sample and plot this value in the third graph from the top of your screen. Thus far, we have only drawn one sample. Therefore, repeat the process of drawing a sample by clicking the "Animated" button until you have drawn 50 or more samples of size 5 from the population.
Once you have completed the task above, take a moment to examine the shape of the sampling distribution relative to the shape of the population distribution. On the left side of your screen, compare the mean values for the sampling distribution and the population distribution. Are they similar? Are they identical? Why or why not?
Having made it this far, notice that there are three additional buttons under the "Animated" button. These allow you to draw 5, 1,000, or 10,000 samples at a time. But, first, you need to clear the results of the previous simulation. Do this by clicking the "Clear lower 3" button at the top of your screen. You are now ready to draw multiple samples at once. Click the "1,000" button. In doing so, you have drawn 1,000 samples of size 5. Again, clear your work, and then try clicking the "10,000" button. You should see a marked difference in the shape of the sampling distribution. Take a moment to summarize these differences.
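The logic of this simulation can also be sketched in Python. The population below is a hypothetical set of scores from 0 to 32 and is not the population used by the website; the function name and the fixed random seed are likewise our own choices, made so that the results can be reproduced.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples random samples (with replacement) and return their means."""
    rng = random.Random(seed)
    return [statistics.mean(rng.choices(population, k=sample_size))
            for _ in range(n_samples)]


# Hypothetical population of scores (not the website's population).
population = list(range(33))

means = sample_means(population, sample_size=5, n_samples=1000)

print(f"population mean = {statistics.mean(population):.2f}")
print(f"mean of the 1,000 sample means = {statistics.mean(means):.2f}")
print(f"population standard deviation = {statistics.pstdev(population):.2f}")
print(f"standard deviation of the sample means = {statistics.stdev(means):.2f}")
```

Compare the two means and the two standard deviations. Then change n_samples to 10000, or sample_size to a larger value, and note what happens to the spread of the sample means.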
Chapter 8: Estimation
In Chapter 8, we introduced the notion of an inverse relationship between precision and confidence. That is, as a confidence interval gets narrower and consequently more precise, we can be less confident that the interval contains the true population parameter. Additionally, the material provided in Chapter 8 also highlighted the very important role of sample size in increasing the level of precision in our estimates.
The following exercise seeks to illustrate this relationship. Begin by navigating to the following URL: http://www.ruf.rice.edu/~lane/stat_sim/conf_interval/index.html. Click the "Begin" button at the upper left of your screen. A new window will have opened for you. Maximize this window. Read through the instructions provided to ensure you understand what this simulation does. With a sample size of 10 selected from the dropdown menu at the top of your screen, click the "Sample" button. The graph to your left uses red and blue bars to show you which of the 100 confidence intervals generated do not contain the population parameter. Likewise, the cumulative results record this information.
Having run only one simulation thus far, which has generated 100 confidence intervals, click the "Sample" button 9 more times. Of the 1,000 confidence intervals you have now generated for a sample size of 10, note how many "Did Not Contain 50," the population mean. Record this information and set it aside.
Next, click the "Clear" button at the top right of your screen. Now, use the dropdown menu and select 20 as your sample size. Click the "Sample" button 10 times. Once you have done so, note how many "Did Not Contain 50," the population mean. Compare this result to the result obtained for a sample size of 10. What do you notice? Do these results confirm the importance of sample size? If so, what is the evidence for this from the two simulations that you have run? If not, why not?
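A rough version of this experiment can be sketched in Python. Several assumptions are ours rather than the website's: the population is taken to be normal with a mean of 50 and a known standard deviation of 10, and each 95% interval is built as the sample mean plus or minus 1.96 standard errors. The website may construct its intervals differently.

```python
import random
import statistics


def simulate_intervals(sample_size, n_intervals=1000, mu=50.0,
                       sigma=10.0, z=1.96, seed=1):
    """Return (number of intervals that miss mu, width of each interval)."""
    rng = random.Random(seed)
    # Half-width of a 95% interval when sigma is assumed to be known.
    margin = z * sigma / sample_size ** 0.5
    misses = 0
    for _ in range(n_intervals):
        sample = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        sample_mean = statistics.mean(sample)
        # The interval misses mu when the sample mean is farther than
        # one margin away from mu.
        if abs(sample_mean - mu) > margin:
            misses += 1
    return misses, 2 * margin


for n in (10, 20):
    misses, width = simulate_intervals(n)
    print(f"n = {n}: {misses} of 1,000 intervals missed 50; width = {width:.2f}")
```

Under these assumptions, pay attention to two separate things as the sample size grows: the share of intervals that miss the population mean, and the width of each interval.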
Chapter 9: Testing Hypotheses
In testing hypotheses involving two samples, it is often expedient to avoid the use of hand calculations. That said, statistical packages such as SPSS often mask many of the important features of t-test calculations that are instructive for students. Toward reconciling these differences, navigate to the following webpage, which provides an interface for conducting a two-sample t-test: http://faculty.vassar.edu/lowry/tu.html.
Scroll down the page until you see a box labeled "Setup." Click the "Independent Samples" button. Next, we need to enter some data. In the student study questions for Chapter 4, recall that we used data from the Latin American Migration Project (http://lamp.opr.princeton.edu) on the duration of stay (in months) in the United States during respondents' first U.S. migration. The data are presented below.
Nicaragua: 4, 6, 6, 6, 12, 36, 36, 36, 36, 60, 72, 78, 96, 120, 126, 156, 162, 162, 186, 540
Guatemala: 1, 1, 12, 24, 24, 24, 36, 36, 42, 60, 78, 84, 102, 102, 102, 102, 132, 144
Locate the box labeled "Sample Entry." Let's consider Nicaragua as Sample A and Guatemala as Sample B. For each sample, enter the first value - for example, 4, if we start with Nicaragua - and then hit the return button on your keyboard. Enter the next value and then hit the return button. Do this until you have entered all values for both groups. Then click the "Calculate" button.
In the "Data Summary" box below, these cells should now be filled in for you. Take a moment to identify each of these quantities. (Note: some of the notation used on this website may differ from that used in Chapter 9). For instance, the first quantity is simply the sum, for each group, of each of the values you previously entered. Once you have identified each of the quantities in this box, move down to the box with "Meana-Meanb" in the upper left hand cell. What is the value of the calculated t-statistic? And how many degrees of freedom do you have?
To determine whether the t-statistic is statistically significant, examine the cell to the right of the "two-tailed" label. We can reject the null hypothesis if the value in this cell is less than or equal to .05 (which assumes that we set our alpha level to .05 at the outset of this test).
Finally, note the last box on the page. Here, confidence intervals are calculated around the mean for the first group, the mean for the second group, and the difference between the means for the two groups. If you feel you need a brief refresher on confidence intervals, we suggest that you revisit the material in Chapter 8 and then come back to the results here to see if they make more sense to you.
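To see what happens behind the "Calculate" button, the sketch below works through an independent-samples t-test in Python using the Nicaragua and Guatemala data. It assumes the pooled-variance form of the test (equal population variances); the website's calculation and the notation in Chapter 9 may differ in their details, so treat this as a cross-check rather than a replacement.

```python
import math
import statistics

nicaragua = [4, 6, 6, 6, 12, 36, 36, 36, 36, 60, 72, 78, 96, 120, 126,
             156, 162, 162, 186, 540]
guatemala = [1, 1, 12, 24, 24, 24, 36, 36, 42, 60, 78, 84, 102, 102,
             102, 102, 132, 144]


def pooled_t(sample_a, sample_b):
    """Return (t, df) for an independent-samples t-test with pooled variance."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Sum of squared deviations around each group's own mean.
    ss_a = sum((x - mean_a) ** 2 for x in sample_a)
    ss_b = sum((x - mean_b) ** 2 for x in sample_b)
    df = n_a + n_b - 2
    pooled_variance = (ss_a + ss_b) / df
    standard_error = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, df


t, df = pooled_t(nicaragua, guatemala)
print(f"sum A = {sum(nicaragua)}, sum B = {sum(guatemala)}")
print(f"t = {t:.3f}, df = {df}")
```

Compare the sums, the t-statistic, and the degrees of freedom printed here with the values in the "Data Summary" box and the box below it.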
Chapter 10: Relationships Between Two Variables: Cross Tabulation
The following web exercise extends the web exercise in Chapter 2. We present the full set of instructions from Chapter 2, as well as additional instructions at the end of this document.
The following web exercise utilizes data from the World Values Survey (WVS). Begin by visiting the following URL: http://www.worldvaluessurvey.org/. Familiarize yourself with the WVS by reading through some of the major findings that have resulted from this effort in recent years. (Hint: see the "Findings" link in the mid to upper left part of your screen).
For this exercise, we will use the WVS' online data analysis tool to construct frequency distributions. Begin by clicking on the link for "Online Data Analysis" on the left part of your screen. In what follows, you will see a list of countries on the screen in front of you. As you can see, the various samples (or waves) of the survey are listed at the top of the page. Thus, for example, you can see that respondents from Albania were surveyed in 1998 during the 1994-1999 wave. Scroll through the list of countries using the scroll bar on the right of your screen. Select three countries of interest, preferably within the same wave to promote cross comparability. Once you have selected your countries of interest, click on the red "Continue" tab located at the top right of your screen. Next, we need to select the variable(s) of interest for our analysis. On the screen in front of you, there are several classes of variables (e.g., Perceptions of Life, National Identity, etc.). Click once on "Perceptions of Life." A list of variables should populate the screen below the box containing the various classes of variables. Locate the set of variables associated with "Qualities that children can be encouraged to learn at home." Select the variable A030 ("Important child qualities: hard work (A030)"). After having done so, you should now see a new window in front of you with the "Question text" tab already selected. Take a look at the question. This was the question as it was asked on the survey.
To examine bivariate relationships, select the "Cross tabs" tab. Let's begin by simply examining frequencies (as opposed to percentages). In the dropdown menu to the right of the "Operations" label, select "Absolute Values." Next, we need to tell the interface which other variable we wish to use for our cross tabulation. In the dropdown menu to the right of the "Cross by" label, select "Age - respondent." You should now see a series of bivariate tables on the screen in front of you, one for each of the three countries you specified and one for the total sample.
Next, we need to percentage the table using the convention of column percentages as introduced in Chapter 10. To do this, in the dropdown menu to the right of the "Operations" label, select "Show %/(Column)." Take a moment to reexamine the bivariate tables below to ensure that each table has in fact been percentaged correctly.
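If you want to verify the column percentages yourself, the Python sketch below percentages a bivariate table down each column. The counts are hypothetical placeholders, with rows standing for the two response categories and columns standing for age groups; replace them with the absolute values from your own table.

```python
def column_percentages(table):
    """Percentage each cell on its column total (rows are lists of counts)."""
    column_totals = [sum(column) for column in zip(*table)]
    return [[100 * cell / total for cell, total in zip(row, column_totals)]
            for row in table]


# Hypothetical counts: rows = response categories, columns = age groups.
counts = [
    [30, 20],  # "Not mentioned"
    [70, 80],  # "Important"
]

for row in column_percentages(counts):
    print([f"{pct:.1f}%" for pct in row])
```

Because the table is percentaged within columns, each column (not each row) should sum to 100.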
Chapter 11: The Chi Square Test
The chi-square test is used to determine whether the relationship between two variables in a bivariate table is statistically significant. Two major components to the calculation of the chi-square test statistic are the observed and expected frequencies. Using the interface from the URL below, let's work through an example interactively.
http://people.ku.edu/~preacher/chisq/chisq.htm
Read through the set of descriptions and instructions on the page until you come to the table near the end of the webpage. Notice that the columns are labeled "Gp1," "Gp2," etc. And the rows are labeled "Cond. 1," "Cond. 2," etc. For our purposes, consider each of these to be a category of a variable. In other words, each of these labels could effectively be relabeled as "Category 1," "Category 2," etc.
Employing data from the World Values Survey (http://www.worldvaluessurvey.org/), suppose that we are interested in the relationship between age and views about marriage as an outdated institution. Among U.S. respondents in 1999, 258 persons between the ages of 15 and 29 disagreed that marriage is an outdated institution, while 51 persons agreed. Of those between the ages of 30 and 49, 463 disagreed and 47 agreed. And, among those 50 and older, 341 disagreed and 22 agreed.
Taking the data provided in the above paragraph, let's now plug these values into the table on the webpage where we started. Starting with the upper left cell, and moving to the right within that row, enter the three values associated with those who disagreed. Be sure to start with the youngest age group and work your way to the oldest. Thus, you should have entered: 258, 463, and 341. Drop down to the second row and enter the values associated with those who agreed. Once you have entered all 6 values, click the "Calculate" button at the bottom of the box.
As is evident, the marginals have been calculated for you. You can use these values and the formula provided in Chapter 11 to calculate the expected cell frequencies. In the bottom right corner of the box, note the values for the chi-square statistic and the degrees of freedom. You could use these two values and the table in the back of your textbook to determine if the relationship between these two variables is statistically significant. However, an easier way to assess statistical significance is to examine the "p-value." If this value is less than or equal to .05, then the calculated chi-square statistic is statistically significant.
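The same calculation can be sketched in Python using the observed frequencies given above. Expected frequencies are computed as (row total x column total) / N. The p-value line relies on a shortcut that holds only when there are exactly 2 degrees of freedom, as there are here; for other tables you would still need the chi-square table or a statistical package.

```python
import math

# Observed frequencies; columns are ages 15-29, 30-49, and 50 and older.
observed = [
    [258, 463, 341],  # disagreed
    [51, 47, 22],     # agreed
]


def chi_square(table):
    """Return (chi-square statistic, degrees of freedom, expected frequencies)."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in column_totals] for r in row_totals]
    statistic = sum((o - e) ** 2 / e
                    for observed_row, expected_row in zip(table, expected)
                    for o, e in zip(observed_row, expected_row))
    df = (len(table) - 1) * (len(column_totals) - 1)
    return statistic, df, expected


statistic, df, expected = chi_square(observed)

# Valid ONLY for df = 2: the upper-tail area of the chi-square
# distribution with 2 degrees of freedom is exp(-x / 2).
p_value = math.exp(-statistic / 2)

print(f"chi-square = {statistic:.2f}, df = {df}, p = {p_value:.6f}")
```

Compare the expected frequencies, the chi-square statistic, and the degrees of freedom with the values reported on the webpage.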
Chapter 12: Measures of Association for Nominal and Ordinal Variables
By now, you have probably had at least some exposure to one or more statistical software packages (e.g., SPSS). Despite its conveniences, the use of statistical software often comes with its own set of problems. One such problem that is particularly potent for students new to statistics is how to interpret the various quantities that are produced by SPSS when calculating the values of lambda and gamma for nominal and ordinal variables. To illustrate this, navigate to the following URL: Class11_LambdaNGamma.ppt
This presentation will take you through two SPSS examples, one using lambda and one using gamma, from start to finish. Many of the nuances of interpreting an SPSS output are covered. At the end of the presentation, make a list of three things that you did not know about calculating or interpreting lambda or gamma before you viewed these slides. Can you think of other information that was not included in the presentation that would also be useful to have included in this presentation?
Chapter 13: Regression and Correlation
Much scholarly work has gone into understanding the HIV/AIDS epidemic, particularly in Sub-Saharan Africa. Researchers have suggested that there may be a relationship between the prevalence of HIV/AIDS in a given country and education and literacy levels, particularly among women. The idea here is that education and literacy contribute to a greater understanding of prevention and, if infected, the care and maintenance of the infection. For this exercise we will explore this relationship using data from the Population Reference Bureau's Data Finder tool available on its website: http://www.prb.org/Datafinder.aspx.
Begin by following these steps. First, once you have navigated to the above URL, click on the link that says, "Let me choose multiple regions..." Second, select the following world regions: America - North, America - South, Asia, Europe, Africa - Sub-Saharan, and Oceania. Once you have checked each of these regions, click "Next". Third, locate and select the following variables. Click on the HIV/AIDS link and check "HIV Infected Adults Who are Women Population." Also, click on the Education link and check "Literacy Rate, Ages 15-24, 2000-04, Female." At the bottom of your screen, click "Get Report." You should now have the data of interest on the screen in front of you.
Begin by constructing a scatter diagram using these data with the literacy rate as your independent variable and the percentage of women infected with HIV as your dependent variable. Does there appear to be a relationship between the two variables? If so, what type of relationship? What direction does the relationship appear to have? Next, follow the example from Table 13.4 in your textbook to calculate the various quantities necessary to construct a linear regression equation, Pearson's correlation coefficient (r), and the coefficient of determination (r²). Based on your analysis, write up a summary of your results, characterizing the relationship between these two variables.
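Once you have worked through the calculations by hand, you may wish to check them with the Python sketch below. It computes the intercept (a), the slope (b), r, and r² from the deviations around the two means. The x and y values shown are hypothetical placeholders and are not Population Reference Bureau figures; replace them with the values from your report.

```python
import math


def linear_regression(x, y):
    """Return (a, b, r, r_squared) for the least-squares line Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    b = sxy / sxx
    a = mean_y - b * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r, r ** 2


# Hypothetical placeholder values (not PRB data).
literacy_rate = [60, 70, 80, 90, 95]
percent_women = [58, 57, 55, 50, 45]

a, b, r, r_squared = linear_regression(literacy_rate, percent_women)
print(f"Y = {a:.2f} + ({b:.3f})X, r = {r:.3f}, r squared = {r_squared:.3f}")
```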
Chapter 14: Analysis of Variance
A defining feature of ANOVA procedures as they were introduced in Chapter 14 was the attention given to two quantities: the sum of squares between and the sum of squares within. It is often difficult to understand these quantities without some sort of visual aid. This web exercise attempts to provide such an aid.
Navigate to the following URL: http://www.ruf.rice.edu/~lane/stat_sim/one_way/index.html.
Once there, click on the "Begin" button in the upper left of your screen. A new window will have opened for you. Maximize this window. You should see three bars on the screen in front of you, one for each of three groups. For the sake of simplicity, let's say that our sample consists of all students who took a statistics class in the fall of 2008 at the University of Wisconsin-Milwaukee. Group 1 consists of sophomores. Group 2 consists of juniors. And Group 3 consists of seniors.
Suppose that all students were given a statistics readiness assessment at the start of the semester worth 10 points. Underneath the "Choose dataset" label, use the dropdown menu and select "Sample set 1." Upon doing so, you will see that a small red cross appears on each of the group bars. Take this cross to be the mean score for each group on the statistics readiness assessment. As such, we can see that sophomores scored the lowest and seniors scored the highest.
Where ANOVA procedures are concerned, the pie chart on the right of your screen shows the relative contribution of the sum of squares between and the sum of squares within to the value of the F statistic. As you can see from the three group mean scores, there is much variation between groups. Suppose, however, that each group scored a 7.5 on the statistics readiness assessment. With your cursor, navigate to each of the small red crosses. Click and hold your mouse button, and then drag the cross up to a score of 7.5. (Note: you may have to click and drag multiple times to achieve this.) Do this for all three groups. Notice what happens in the pie chart on the right of your screen to the relative share of the sum of squares between.
Once you have completed the above warm-up, click the "Clear All" button in the top right of your screen. Then, use the dropdown menu to select another sample set, e.g., "Sample set 4." How do these results compare to what you tried earlier? Continue to repeat the steps above, trying different sample sets, as well as adjusting the location of the small red crosses with your cursor. Try as many different examples as possible to give you a better grasp of the relationship between the sum of squares between and the sum of squares within.
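The two quantities shown in the pie chart can also be computed directly. The Python sketch below does so for three hypothetical groups of readiness scores; the scores are invented for illustration and are not the values used in the website's sample sets.

```python
def sums_of_squares(groups):
    """Return (sum of squares between, sum of squares within)."""
    all_scores = [score for group in groups for score in group]
    grand_mean = sum(all_scores) / len(all_scores)
    group_means = [sum(group) / len(group) for group in groups]
    # Between: each group mean's squared distance from the grand mean,
    # weighted by the number of cases in the group.
    ss_between = sum(len(group) * (mean - grand_mean) ** 2
                     for group, mean in zip(groups, group_means))
    # Within: each score's squared distance from its own group mean.
    ss_within = sum((score - mean) ** 2
                    for group, mean in zip(groups, group_means)
                    for score in group)
    return ss_between, ss_within


# Hypothetical readiness scores (illustration only).
sophomores = [4, 5, 6]
juniors = [6, 7, 8]
seniors = [8, 9, 10]

ss_between, ss_within = sums_of_squares([sophomores, juniors, seniors])
print(f"SS between = {ss_between}, SS within = {ss_within}")
# prints: SS between = 24.0, SS within = 6.0
```

To mimic dragging every red cross to 7.5, give all three groups the same scores (for example, 6.5, 7.5, and 8.5) and observe what happens to the sum of squares between.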