AP CSP Day 41: Data Collection & Cleaning | Cycle 2
Survivorship bias is a data collection error where only successful or surviving cases are included, making outcomes appear better than they actually are. Self-selection bias (also called voluntary response bias) occurs when participants choose whether to be included, skewing the sample toward those with strong opinions. AP CSP Cycle 2 data collection questions present research scenarios and ask students to identify which type of bias affects the data and how it would distort conclusions. A key analytical skill is recognizing that the absence of data about failures or non-participants is itself meaningful information.
📚 Study the Concept First (Optional)
Bias in Data Collection: Harder Cases
Survivorship Bias
Survivorship bias occurs when data is collected only from cases that survived a process, making the process look more effective than it is. Studying only businesses that are still operating tells you nothing about the many that failed under the same conditions.
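The effect can be illustrated with a minimal Python sketch. The business records below are hypothetical numbers invented for illustration, not real data; the point is only that averaging the survivors gives a very different picture from averaging everyone.

```python
from statistics import mean

# Hypothetical records: (annual return in percent, still operating?)
# These values are made up purely to illustrate the bias.
businesses = [
    (12, True), (8, True), (15, True), (-20, False),
    (-35, False), (5, True), (-10, False), (-40, False),
]

# Biased sample: only businesses that survived
survivor_returns = [ret for ret, operating in businesses if operating]

# Full sample: every business that started, including failures
all_returns = [ret for ret, _ in businesses]

survivor_avg = mean(survivor_returns)
overall_avg = mean(all_returns)

print("Average return, survivors only:", survivor_avg)
print("Average return, all businesses:", overall_avg)
```

In this made-up data set the survivors average a positive return while the full group averages a loss, so a study limited to survivors would overstate how well the typical business performed.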
Voluntary Response Bias
Surveys where participation is voluntary over-represent people with strong opinions. Customers who had terrible or exceptional experiences are far more likely to complete a feedback form than customers who had average experiences.
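A small Python sketch can make this concrete. The ratings and the response pattern below are hypothetical, chosen so that customers with extreme experiences respond more often than the rest; `extreme_share` is an illustrative helper name, not part of any library.

```python
# Hypothetical customers: (satisfaction rating 1-5, filled out the form?)
# Invented data in which extreme ratings (1 or 5) respond more often.
customers = [
    (1, True), (3, False), (3, False), (4, False), (5, True),
    (3, False), (2, False), (5, True), (3, True), (4, False),
]

def extreme_share(ratings):
    """Fraction of ratings that are either the lowest (1) or highest (5)."""
    return sum(1 for r in ratings if r in (1, 5)) / len(ratings)

all_ratings = [rating for rating, _ in customers]
respondent_ratings = [rating for rating, responded in customers if responded]

print("Extreme ratings among all customers:", extreme_share(all_ratings))
print("Extreme ratings among respondents:  ", extreme_share(respondent_ratings))
```

With this sample, extreme ratings make up a much larger share of the completed forms than of the customer base as a whole, which is the distortion voluntary response bias describes.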
Practice Question
A school surveys students about study habits. The "hours studied per week" column contains: 5, 8, 3, [blank], 6, "a lot", 7, 4, 2, [blank]. A researcher wants to calculate the average hours studied. Which approach is most appropriate?
I. Delete all rows with missing or invalid data, then calculate the average of remaining values.
II. Replace blanks with 0 and "a lot" with the column median, then calculate the average.
III. Calculate the average using only the valid numeric entries, ignoring invalid ones.
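Both approaches I and III are valid data cleaning strategies. Approach I (deletion) works when invalid entries are few and removing them does not introduce significant bias. Approach III (selective inclusion) preserves valid data while excluding problematic entries. Approach II is not appropriate: replacing blanks with 0 treats a missing answer as zero hours studied, which pulls the average down, and substituting the median for "a lot" invents a value the student never reported. The best choice between I and III depends on how much data is affected and the research goals.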
A) Approach III is also valid, not just approach I.
B) Approach I is also valid, not just approach III.
D) Both approaches are established, acceptable data cleaning methods used by researchers.
A common misconception is that there is exactly one correct way to handle missing data. In practice, multiple strategies exist and the appropriate choice depends on context, sample size, and the nature of the missing data.
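The approaches from the practice question can be compared with a short Python sketch. It assumes the column arrives as a list of strings, with blanks as empty strings; `parse_hours` is a hypothetical helper written for this example. Because only one column is involved, deleting the invalid rows (approach I) and averaging only the valid entries (approach III) produce the same result here; they would differ only if the deleted rows carried other columns used elsewhere in the analysis.

```python
from statistics import mean, median

# The "hours studied per week" column from the question.
# Assumption: blanks are represented as empty strings.
raw = ["5", "8", "3", "", "6", "a lot", "7", "4", "2", ""]

def parse_hours(entry):
    """Return the entry as a float, or None if it is blank or non-numeric."""
    try:
        return float(entry)
    except ValueError:
        return None

parsed = [parse_hours(entry) for entry in raw]

# Approaches I and III: keep only the valid numeric entries
valid = [hours for hours in parsed if hours is not None]
avg_valid = mean(valid)

# Approach II: blanks become 0, other invalid text becomes the median
col_median = median(valid)
filled = []
for entry, hours in zip(raw, parsed):
    if hours is not None:
        filled.append(hours)
    elif entry.strip() == "":
        filled.append(0.0)          # treats "no answer" as zero hours
    else:
        filled.append(col_median)   # invents a value for "a lot"
avg_filled = mean(filled)

print("Average from valid entries (I / III):", avg_valid)
print("Average after approach II:           ", avg_filled)
```

The seven valid entries sum to 35, giving an average of 5.0 hours. Approach II produces 4.0 because the two blanks are counted as students who studied zero hours, which lowers the average without any evidence that those students actually studied less.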
Data cleaning questions on the AP exam often have correct answers that reflect context-dependent decisions. Look for answer choices that acknowledge multiple valid approaches rather than one absolute rule.
Keep Practicing!
Consistent daily practice is the key to AP CSP success.