AP CSP Day 11: Data Collection And Cleaning

Key Concepts

Data collection methods include surveys, sensors, manual entry, and automated logging, each introducing different potential sources of error. Data cleaning involves identifying and correcting issues like missing values, duplicates, and outliers before analysis. The AP CSP exam tests whether students can recognize how collection method and data quality affect the validity of conclusions drawn from a dataset. Understanding that biased or incomplete data leads to unreliable insights is a core Big Idea 2 concept.


Data Collection and Cleaning

Why Data Quality Matters

An algorithm is only as trustworthy as the data it processes. Flawed data collection produces flawed insights regardless of how sophisticated the analysis is. Garbage in, garbage out.

Common Data Problems

Missing values occur when responses are skipped. Duplicates inflate counts. Outliers (extreme values) can skew averages dramatically. Inconsistent formatting (like '01/05/2024' vs 'January 5, 2024') prevents accurate comparison.
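To make the effect of duplicates and outliers concrete, here is a minimal Python sketch. The respondent ids, the hour values, and the cutoff of 40 are all made up for illustration; they are not taken from any real survey.

```python
from statistics import mean, median

# Hypothetical "hours of exercise per week" responses as (respondent_id, hours).
# Respondent 3 was entered twice, and respondent 5 reported an extreme value.
responses = [
    (1, 2.0),
    (2, 3.0),
    (3, 4.0),
    (3, 4.0),   # duplicate row
    (4, 3.0),
    (5, 80.0),  # outlier
]

# Duplicates inflate counts: there are 6 rows but only 5 distinct respondents.
unique = dict(responses)  # keeps one entry per respondent id
print(len(responses), len(unique))

hours = list(unique.values())

# The single outlier drags the mean far above what most people reported,
# while the median stays close to the typical response.
print(mean(hours), median(hours))

# Dropping values above an assumed cutoff (40 is an arbitrary choice here)
# shows how much one entry was skewing the average.
typical = [h for h in hours if h < 40]
print(mean(typical))
```

Whether an outlier should be dropped, corrected, or kept depends on whether it is an entry error or a genuine extreme value; the sketch only shows how strongly it can affect an average.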

Common Trap: Assuming that more data always means better analysis. A large dataset with systematic collection bias produces confidently wrong conclusions.
Exam Tip: When evaluating a described dataset, always ask: who was included, who was excluded, and how was data entered? Each answer can reveal a quality problem.
Big Idea 2: Data
Cycle 1 • Day 11 Practice • Medium Difficulty
Focus: Data Collection & Cleaning

Practice Question

A researcher collects survey data about exercise habits. Some respondents left the "hours per week" field blank, and one entry reads "lots" instead of a number. Which of the following best describes the issue with this data?

Why This Answer?

Blank entries represent missing data and "lots" is an invalid non-numeric entry in a field that expects a number. These are data quality issues that require cleaning — such as removing incomplete rows or converting entries to a consistent format — before meaningful analysis can occur.
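One possible way to handle this kind of field is sketched below in Python. The function name `clean_hours`, the sample entries, and the rule that a valid answer lies between 0 and 168 (the number of hours in a week) are assumptions made for illustration, not part of the question.

```python
def clean_hours(raw_entries):
    """Split raw text entries into usable numbers and rejected entries."""
    valid = []
    rejected = []
    for entry in raw_entries:
        try:
            value = float(entry.strip())
        except ValueError:
            # Blank strings and words like "lots" cannot be read as numbers.
            rejected.append(entry)
            continue
        # Assumed sanity check: a week has 168 hours.
        if 0 <= value <= 168:
            valid.append(value)
        else:
            rejected.append(entry)
    return valid, rejected


# Hypothetical raw survey column, as typed by respondents.
raw = ["3", "", "lots", "5.5", " 2 "]
valid, rejected = clean_hours(raw)
print(valid)     # usable numeric answers
print(rejected)  # entries that need follow-up or removal
```

Keeping the rejected entries in a separate list, instead of silently discarding them, makes it possible to report how much data was lost to cleaning.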

Why Not the Others?

A) The problem is data quality, not data set size. Nothing in the question suggests the set is too small.
C) The question design may have contributed, but the immediate issue is that existing data needs cleaning.
D) Data with quality issues can still be useful after proper cleaning and handling.

Common Mistake
Watch Out!

Students often think imperfect data is completely unusable rather than recognizing that data cleaning is a standard and expected step in the analysis process.

AP Exam Tip

Data cleaning involves handling missing values, correcting invalid entries, and standardizing formats. Imperfect data is not worthless — it just needs preparation.
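Standardizing formats can be sketched with two date styles such as '01/05/2024' and 'January 5, 2024'. The Python example below is a minimal illustration: the function name `standardize_date` is hypothetical, the list of accepted formats is an assumption, and '01/05/2024' is read as month/day/year. It also assumes English month names.

```python
from datetime import datetime

# Assumed set of formats that appear in the data; extend as needed.
KNOWN_FORMATS = ["%m/%d/%Y", "%B %d, %Y"]


def standardize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None


# Both spellings of the same day map to one comparable value.
print(standardize_date("01/05/2024"))
print(standardize_date("January 5, 2024"))
print(standardize_date("sometime last week"))  # unrecognized, stays flagged as None
```

Once every date is stored in a single format, entries can be sorted and compared directly, and anything returned as None can be reviewed by hand.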

Keep Practicing!

Consistent daily practice is the key to AP CSP success.
