AP CSP Day 41: Data Collection & Cleaning | Cycle 2

Key Concepts

Survivorship bias is a data collection error where only successful or surviving cases are included, making outcomes appear better than they actually are. Self-selection bias (also called voluntary response bias) occurs when participants choose whether to be included, skewing the sample toward those with strong opinions. AP CSP Cycle 2 data collection questions present research scenarios and ask students to identify which type of bias affects the data and how it would distort conclusions. A key analytical skill is recognizing that missing data about failures or non-participants is itself meaningful information.

Study the Concept First (Optional)

Bias in Data Collection: Harder Cases

Survivorship Bias

Survivorship bias occurs when data is collected only from cases that survived a process, making the process look more effective than it is. Studying only businesses that are still operating tells you nothing about the many that failed under the same conditions.
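To make the effect concrete, here is a minimal Python sketch. The ten profit figures are hypothetical values invented for illustration, not real data, and the names `businesses` and `mean` are assumptions of this example.

```python
# Hypothetical first-year profit (in thousands of dollars) for ten businesses,
# with a flag for whether each one is still operating today.
businesses = [
    {"profit": 40, "still_open": True},
    {"profit": 55, "still_open": True},
    {"profit": 30, "still_open": True},
    {"profit": 35, "still_open": True},
    {"profit": -20, "still_open": False},
    {"profit": -35, "still_open": False},
    {"profit": -10, "still_open": False},
    {"profit": -25, "still_open": False},
    {"profit": -15, "still_open": False},
    {"profit": -15, "still_open": False},
]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# A study that only reaches businesses still operating sees this subset.
survivor_mean = mean([b["profit"] for b in businesses if b["still_open"]])

# The full picture includes the businesses that closed.
overall_mean = mean([b["profit"] for b in businesses])

print(survivor_mean)  # 40.0
print(overall_mean)   # 4.0
```

In this made-up data set, the survivors alone suggest a typical profit of $40,000, while the full group averages $4,000. The six closed businesses never appear in a survivors-only study, and that is exactly why its result looks better than reality.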

Voluntary Response Bias

Surveys where participation is voluntary over-represent people with strong opinions. Customers who had terrible or exceptional experiences are far more likely to complete a feedback form than customers who had average experiences.

Common Trap: Assuming that a large sample size eliminates bias. A large biased sample gives a very precise estimate of the wrong value, so the result looks confident but is still biased. Sample size affects precision; sample selection method determines validity.
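A small simulation can illustrate why more responses do not fix a biased collection method. This is a sketch under stated assumptions: the rating distribution and the response rates below are invented for illustration, and `simulate_survey` is a hypothetical helper, not part of any library.

```python
import random


def simulate_survey(population_size, seed=0):
    """Simulate a voluntary feedback form for customers who rate 1 to 5.

    Returns (true mean rating, mean rating among respondents,
    number of respondents).
    """
    rng = random.Random(seed)

    # Assumed population: most customers had an average or good experience.
    ratings = [1, 2, 3, 4, 5]
    weights = [5, 10, 40, 30, 15]
    population = rng.choices(ratings, weights=weights, k=population_size)

    # Assumed response rates: unhappy customers respond far more often
    # than customers with an average experience.
    response_rate = {1: 0.60, 2: 0.20, 3: 0.02, 4: 0.02, 5: 0.10}
    respondents = [r for r in population if rng.random() < response_rate[r]]

    true_mean = sum(population) / len(population)
    respondent_mean = sum(respondents) / len(respondents)
    return true_mean, respondent_mean, len(respondents)


# Compare a small and a large population under the same collection method.
for size in (1_000, 200_000):
    true_mean, respondent_mean, n = simulate_survey(size)
    print(size, n, round(true_mean, 2), round(respondent_mean, 2))
```

Under these assumptions the true average rating is about 3.4, but the respondents' average settles near 2.4 no matter how many customers are surveyed. Collecting more responses narrows the spread around 2.4; it does not move the estimate toward 3.4.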
Exam Tip: For any described study, ask: who chose to participate? Who was excluded by the data collection method? Who survived the process being studied? Each question can reveal a different type of bias.
Big Idea 2: Data
Cycle 2 • Day 41 Practice • Hard Difficulty
Focus: Data Collection & Cleaning

Practice Question

A school surveys students about study habits. The "hours studied per week" column contains: 5, 8, 3, [blank], 6, "a lot", 7, 4, 2, [blank]. A researcher wants to calculate the average hours studied. Which approach is most appropriate?

I. Delete all rows with missing or invalid data, then calculate the average of remaining values.
II. Replace blanks with 0 and "a lot" with the column median, then calculate the average.
III. Calculate the average using only the valid numeric entries, ignoring invalid ones.

Why This Answer?

Both approaches I and III are valid data cleaning strategies. Approach I (deletion) works when invalid entries are few and removing them does not introduce significant bias. Approach III (selective inclusion) preserves valid data while excluding problematic entries. The best choice depends on how much data is affected and the research goals. Approach II is not valid: replacing a blank with 0 records a missing response as zero hours studied, which invents data and pulls the average down.
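The three approaches can be checked directly on the column from the question. This is a minimal Python sketch; it assumes blanks are stored as `None` and that "the column median" in approach II means the median of the valid numeric entries. The names `raw`, `is_valid`, and `fill_value` are assumptions of this example.

```python
from statistics import median

# The "hours studied per week" column, with blanks represented as None.
raw = [5, 8, 3, None, 6, "a lot", 7, 4, 2, None]


def is_valid(value):
    """True for real numeric entries (bool is excluded even though it is an int)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


# Approaches I and III: keep only the valid numeric entries.
# For this single column, deleting the bad rows and ignoring the bad
# entries leave the same seven values, so the average is the same.
valid = [v for v in raw if is_valid(v)]
mean_valid = sum(valid) / len(valid)

# Approach II: blanks become 0 and "a lot" becomes the median of valid entries.
column_median = median(valid)


def fill_value(value):
    if value is None:
        return 0
    if not is_valid(value):
        return column_median
    return value


filled = [fill_value(v) for v in raw]
mean_filled = sum(filled) / len(filled)

print(mean_valid)   # 5.0
print(mean_filled)  # 4.0
```

Approaches I and III both give 35 / 7 = 5.0 hours. Approach II gives 40 / 10 = 4.0 hours, because the two blanks are counted as students who studied zero hours. In a table with several columns, approaches I and III can differ: deleting whole rows also discards those students' valid answers in other columns.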

Why Not the Others?

A) Approach III is also valid, not just approach I. B) Approach I is also valid, not just approach III. D) Both approaches are established, acceptable data cleaning methods used by researchers.

Common Mistake
Watch Out!

Students think there is exactly one correct way to handle missing data. In practice, multiple strategies exist and the appropriate choice depends on context, sample size, and the nature of the missing data.

AP Exam Tip

Data cleaning questions on the AP exam often hinge on context-dependent decisions. Look for answers that acknowledge multiple valid approaches rather than one absolute rule.

Keep Practicing!

Consistent daily practice is the key to AP CSP success.
