AP CSP Big Idea 2: Data — Complete 2025 Study Guide
Exam Weight: 17–22% • Approximately 12–15 Questions • 2025–2026 AP Exam
Binary Representation
All data stored in a computer is represented as bits — binary digits that are either 0 or 1. Different types of data use different encoding schemes to convert real-world information into sequences of bits.
Numbers
Numbers are stored in binary (base 2). Each bit position represents a power of 2, read from right to left: 1, 2, 4, 8, 16, 32, 64, 128.
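The place-value rule above can be sketched in a few lines of Python. This is only an illustration (the AP exam uses its own pseudocode), and the function name `binary_to_decimal` is made up for this example.

```python
def binary_to_decimal(bits: str) -> int:
    """Convert a string of 0s and 1s to its decimal value."""
    total = 0
    # Walk the bits right to left so position 0 is the 1s place,
    # position 1 is the 2s place, position 2 is the 4s place, and so on.
    for position, bit in enumerate(reversed(bits)):
        if bit == "1":
            total += 2 ** position
    return total


print(binary_to_decimal("0101"))      # 5
print(binary_to_decimal("1011"))      # 11
print(binary_to_decimal("10000000"))  # 128
```

Python's built-in `int("1011", 2)` gives the same result; the loop is written out here so each power of 2 is visible.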
| Binary | Calculation (right to left) | Decimal |
|---|---|---|
| 0101 | 1 + 0 + 4 + 0 | 5 |
| 1010 | 0 + 2 + 0 + 8 | 10 |
| 1111 | 1 + 2 + 4 + 8 | 15 |
| 10000000 | 128 | 128 |
Example: 1011 = 1 + 2 + 0 + 8 = 11.
Text: ASCII and Unicode
Each character is assigned a number, which is then stored in binary. ASCII defines 128 characters (basic Latin letters, digits, punctuation, and control codes) using 7 bits each, usually stored in one 8-bit byte. Unicode assigns numbers to characters from virtually every written language, so its encodings may use more bits per character. The letter 'A' = 65 in ASCII = 01000001 in binary.
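To see the character-to-number-to-bits chain in action, here is a small Python sketch. `ord` and `format` are standard Python built-ins; the helper name `char_to_bits` is just an illustrative choice.

```python
def char_to_bits(ch: str) -> str:
    """Return the 8-bit binary string for a single ASCII character."""
    code = ord(ch)              # character -> number, e.g. 'A' -> 65
    return format(code, "08b")  # number -> 8 binary digits, zero-padded


print(ord("A"))           # 65
print(char_to_bits("A"))  # 01000001
```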
Images
Images are made of pixels. Each pixel stores a color value. In color images, each pixel stores red, green, and blue values. More bits per pixel means more colors and a larger file. More pixels (higher resolution) means more detail and a larger file.
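A rough way to see why more pixels and more bits per pixel both increase file size is to multiply them out. The sketch below assumes an uncompressed image with no header or metadata, which real file formats do not guarantee; the function name and sample dimensions are made up for illustration.

```python
def image_size_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Estimate the raw (uncompressed) size of an image in bytes."""
    total_bits = width * height * bits_per_pixel
    return total_bits // 8  # 8 bits per byte


# 24 bits per pixel = 8 bits each for red, green, and blue.
print(2 ** 24)                         # 16777216 possible colors
print(image_size_bytes(100, 100, 24))  # 30000
print(image_size_bytes(200, 200, 24))  # 120000
```

Doubling both the width and the height quadruples the pixel count, so the estimated size is four times larger.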
Sound
Sound is recorded by sampling — measuring the amplitude of a sound wave at regular intervals. Higher sampling rate (more samples per second) and greater bit depth (more bits per sample) produce better quality audio but larger files.
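The same kind of estimate works for audio. This sketch assumes uncompressed audio and uses CD-style settings (44,100 samples per second, 16 bits per sample, 2 channels) purely as sample values; the function name is hypothetical.

```python
def audio_size_bytes(sample_rate: int, bit_depth: int,
                     channels: int, seconds: int) -> int:
    """Estimate the raw (uncompressed) size of an audio clip in bytes."""
    total_bits = sample_rate * bit_depth * channels * seconds
    return total_bits // 8


# One minute of stereo audio at 44,100 samples/second and 16 bits/sample.
print(audio_size_bytes(44100, 16, 2, 60))  # 10584000
# Halving the sampling rate halves the size (and lowers the quality).
print(audio_size_bytes(22050, 16, 2, 60))  # 5292000
```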
Overflow and Roundoff Errors
Digital systems store numbers using a fixed number of bits, which creates two important limitations.
Overflow Error
An overflow error occurs when a calculation produces a number too large to be stored in the available bits. The result wraps around or becomes incorrect.
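Python's own integers grow as needed, so they do not overflow. The sketch below simulates a fixed-width unsigned value by keeping only the lowest 8 bits, which is one common way overflow shows up (wraparound). The function name `add_unsigned` is made up for this example.

```python
def add_unsigned(a: int, b: int, bits: int = 8) -> int:
    """Add two values as if the result had to fit in `bits` unsigned bits."""
    number_of_values = 2 ** bits         # 8 bits can hold 256 values: 0 to 255
    return (a + b) % number_of_values    # anything past the top wraps around


print(add_unsigned(200, 50))   # 250 (fits, so no overflow)
print(add_unsigned(255, 1))    # 0   (256 does not fit in 8 bits)
print(add_unsigned(200, 100))  # 44  (300 wraps around to 300 - 256)
```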
Roundoff Error
A roundoff error occurs when a number cannot be represented exactly in binary, so it is stored as the closest possible approximation. This happens frequently with decimal fractions.
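A classic demonstration: 0.1 and 0.2 have no exact binary representation, so their stored approximations do not add up to exactly 0.3. The snippet uses standard Python floats and `math.isclose` from the standard library.

```python
import math

total = 0.1 + 0.2

print(total)                     # 0.30000000000000004
print(total == 0.3)              # False
print(math.isclose(total, 0.3))  # True: compare with a tolerance instead
```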
Data Compression: Lossless vs. Lossy
Compression reduces file size by encoding data more efficiently. There are two fundamentally different types, and the exam frequently tests which is appropriate in a given scenario.
| Type | What Happens | Original Recoverable? | Common Formats |
|---|---|---|---|
| Lossless | Removes redundancy without discarding any data; original can be perfectly reconstructed | Yes — perfectly | ZIP, PNG, FLAC |
| Lossy | Permanently removes data the human senses struggle to perceive; smaller file but original is gone | No — data is permanently lost | JPEG, MP3, MP4 |
When Lossless is Required
Use lossless compression whenever losing any data is unacceptable:
- Medical images (an altered pixel could hide a tumor)
- Legal or financial documents (any change invalidates the record)
- Executable programs (a single bit error corrupts the software)
- Scientific data (exact values are required for accurate analysis)
Run-Length Encoding (Lossless Example)
Instead of storing AAAAAABBB, store 6A3B. This works well when large regions of data repeat — such as solid-color areas in a simple graphic.
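A minimal run-length encoding sketch is shown below. It assumes the input contains no digit characters, since digits are used to store the counts; the function names are illustrative. Decoding returns exactly the original string, which is what makes the technique lossless.

```python
from itertools import groupby


def rle_encode(text: str) -> str:
    """Replace each run of repeated characters with count + character."""
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(text))


def rle_decode(encoded: str) -> str:
    """Rebuild the original string from count + character pairs."""
    pieces = []
    count = ""
    for ch in encoded:
        if ch.isdigit():
            count += ch                      # counts may have several digits
        else:
            pieces.append(ch * int(count))   # expand the run
            count = ""
    return "".join(pieces)


print(rle_encode("AAAAAABBB"))  # 6A3B
print(rle_decode("6A3B"))       # AAAAAABBB
print(rle_encode("ABC"))        # 1A1B1C (longer than the original)
```

The last line shows the trade-off: when data has few repeats, run-length encoding can make it larger.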
Metadata
Metadata is data about data. It describes properties of a file without being the file's actual content.
| File Type | Metadata Examples |
|---|---|
| Photo | Date taken, GPS coordinates, camera model, image dimensions |
| Email | Sender, recipient, timestamp, subject line, server routing path |
| Document | Author, date created, last modified, word count, software version |
| Music | Artist, album name, track number, duration, genre |
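To make the content-versus-metadata distinction concrete, the sketch below creates a throwaway text file and then reads both its content and some of its file-system metadata using Python's standard `os.stat`. The file text and variable names are made up for illustration.

```python
import os
import tempfile
from datetime import datetime

# Create a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("Hello, AP CSP!")
    path = f.name

# The content is the data itself.
with open(path) as f:
    content = f.read()

# The metadata describes the file without being part of its content.
info = os.stat(path)
metadata = {
    "size_bytes": info.st_size,
    "last_modified": datetime.fromtimestamp(info.st_mtime),
}

print(content)                    # Hello, AP CSP!
print(metadata["size_bytes"])     # 14
print(metadata["last_modified"])  # depends on when you run it

os.remove(path)  # clean up the temporary file
```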
Processing Data to Extract Knowledge
Raw data is not the same as knowledge. Programs process large datasets to find patterns and insights that would be impossible to see by examining individual records.
The Data Pipeline
- Collect — gather from sensors, surveys, transactions, web activity
- Store — save in structured formats like databases or spreadsheets
- Clean — remove duplicates, fix errors, handle missing values
- Analyze — apply programs to find patterns and build models
- Visualize — display results in charts, maps, and graphs to communicate findings
The amount of data generated today is too large for manual analysis. Computational tools process millions of records in seconds, identifying correlations across many variables simultaneously.
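The pipeline stages can be sketched on a tiny, entirely hypothetical dataset. Real pipelines use databases and dedicated tools; the records, names, and helper functions here are invented just to show the clean, analyze, and visualize steps.

```python
# Collect / Store: hypothetical survey records of (name, hours of sleep).
raw_records = [
    ("Ana", 7.0),
    ("Ben", 6.0),
    ("Ana", 7.0),   # duplicate entry
    ("Cy", None),   # missing value
    ("Dee", 8.0),
]


def clean(records):
    """Remove duplicates and records with missing values."""
    seen = set()
    cleaned = []
    for record in records:
        _, hours = record
        if hours is None or record in seen:
            continue
        seen.add(record)
        cleaned.append(record)
    return cleaned


def average_hours(records):
    """Analyze: compute the mean hours of sleep."""
    return sum(hours for _, hours in records) / len(records)


cleaned = clean(raw_records)
print(len(cleaned))            # 3
print(average_hours(cleaned))  # 7.0

# Visualize: a simple text bar chart, one '#' per hour.
for name, hours in cleaned:
    print(f"{name:4} {'#' * int(hours)}")
```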
Limitations and Bias in Data
Data must be interpreted, and that interpretation can be wrong. The AP exam tests your ability to identify when data is being misused or when conclusions go beyond what the data actually supports.
Correlation vs. Causation
Two variables can be correlated — they change together — without one causing the other. This is the single most tested concept in Big Idea 2.
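A small numeric sketch can show how a strong correlation appears without any causal link. The numbers below are invented: ice cream sales and sunburn cases both rise on hotter days, so they track each other even though neither causes the other. The `pearson` helper computes the standard Pearson correlation coefficient by hand.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient: near 1 means the values rise together."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical daily data, ordered from the coolest day to the hottest.
ice_cream_sales = [20, 35, 50, 70, 90]
sunburn_cases = [1, 3, 4, 7, 9]

r = pearson(ice_cream_sales, sunburn_cases)
print(r > 0.9)  # True: strongly correlated, yet ice cream does not cause sunburn
```

The calculation only measures how closely the two lists move together. It says nothing about why, which is the reason a third factor (here, temperature) has to be considered.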
Sources of Bias
| Bias Type | What It Is | Example |
|---|---|---|
| Collection bias | Data collection method excludes certain groups | An online survey only reaches people with internet access |
| Incomplete data | Important variables are missing from the dataset | A loan model does not include income as a factor |
| Algorithmic bias | A model trained on biased historical data reproduces and amplifies those biases | A hiring algorithm discriminates against women because women were historically underrepresented in the training data |
| Confirmation bias | Analysts search only for evidence that confirms existing beliefs | Only reporting the trials where a drug appeared to work |
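Collection bias can be illustrated with a toy population. The ten people below are invented; each record notes whether the person follows a fitness account and whether they exercise at least 5 days per week. Surveying only the fitness followers gives a much higher rate than surveying everyone.

```python
# Hypothetical population: (follows_fitness_account, exercises_5_days_a_week)
population = [
    (True, True), (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False),
    (False, False), (False, False), (False, False),
]


def exercise_rate(people):
    """Fraction of the given people who exercise at least 5 days per week."""
    frequent = sum(1 for _, exercises in people if exercises)
    return frequent / len(people)


fitness_followers = [person for person in population if person[0]]

print(exercise_rate(population))         # 0.4  (everyone)
print(exercise_rate(fitness_followers))  # 0.75 (biased sample)
```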
Key Vocabulary
| Term | Definition |
|---|---|
| Bit | The smallest unit of data; a single binary digit, either 0 or 1 |
| Byte | 8 bits; the standard unit for measuring file and memory sizes |
| Overflow error | Occurs when a calculation result exceeds the maximum value representable by the allocated bits |
| Roundoff error | Occurs when a number cannot be represented exactly in binary and is stored as an approximation |
| Lossless compression | Reduces file size while allowing perfect reconstruction of the original data |
| Lossy compression | Reduces file size by permanently discarding data; original cannot be fully recovered |
| Metadata | Data that describes other data; properties of a file rather than the file's content itself |
| Sampling | Measuring the amplitude of a sound wave at regular intervals to convert it to digital data |
| Resolution | The number of pixels in an image; higher resolution means more detail and larger file size |
| Correlation | A statistical relationship where two variables tend to change together |
| Causation | One variable directly causes changes in another; a stronger claim than correlation |
| Bias | A systematic error in data collection or analysis that skews results toward a particular outcome |
| Run-length encoding | A lossless compression technique that replaces repeated values with a count and a single value |
AP Exam-Style Practice Questions
Predict your answer before revealing it.
Question 1
A hospital stores MRI scans and wants to reduce storage costs by compressing the image files. Which compression method is required, and why?
- (A) Lossy compression, because the file sizes will be significantly smaller and storage costs will be lower
- (B) Lossy compression, because the human eye cannot detect the missing pixels in a medical image
- (C) Lossless compression, because any loss of image data could result in missed diagnoses or altered findings
- (D) Lossless compression, because it always produces smaller files than lossy compression
Answer: C
Medical images require exact, unaltered data. A missing or altered pixel in an MRI scan could cause a physician to miss a tumor or misread a finding. This is a scenario where data accuracy is critical, so lossless compression is required. Eliminate (A) and (B): the fact that humans might not notice the difference is irrelevant when medical accuracy is the priority. Eliminate (D): this is factually false — lossless compression typically produces larger files than lossy.
Question 2
A researcher studying exercise habits surveys 500 adults by posting a link to the survey on a fitness-focused social media platform. The survey finds that 80% of respondents exercise at least 5 days per week. Which of the following best describes a limitation of this data?
- (A) The sample size of 500 is too small to draw any conclusions about exercise habits
- (B) The sample is biased because people who follow fitness social media are more likely to exercise, making the results unrepresentative of the general population
- (C) The data is affected by roundoff errors because percentages cannot be stored exactly in binary
- (D) The survey cannot be trusted because it was posted online rather than administered in person
Answer: B
This is collection bias. The survey was posted on a fitness platform, which means it primarily reached people who are already interested in fitness. The sample is not representative of the general population. The 80% statistic may be accurate for that specific community but cannot be generalized to all adults. Eliminate (A): 500 is a reasonable sample size; the problem is who was sampled, not how many. Eliminate (C): roundoff errors are a computing concept, not relevant to survey methodology. Eliminate (D): online surveys are a valid data collection method.
Question 3 — Spot the Error
A student compresses a text file using a lossy compression algorithm to save storage space. Later, the student decompresses the file and notices some words are missing or garbled. Which of the following best explains what happened?
- (A) The student experienced a roundoff error because the text characters could not be stored exactly in binary
- (B) The student chose lossless compression, which sometimes discards data it identifies as redundant
- (C) The student used lossy compression, which permanently removed data from the file; the original text cannot be recovered
- (D) The file was too large for the compression algorithm, causing an overflow error during the compression process
Answer: C
Lossy compression permanently discards data. For media like photos and music, the discarded data is usually imperceptible. For text, even small amounts of lost data cause obvious and unacceptable corruption. Text files must always use lossless compression. Eliminate (A): roundoff errors affect numerical calculations, not text storage. Eliminate (B): this describes lossless compression incorrectly — lossless compression by definition does NOT discard data. Eliminate (D): overflow errors relate to numerical computation exceeding bit limits, not file compression.
Question 4 — I, II, and III format
A social media app automatically attaches metadata to every photo a user uploads. Which of the following are potential privacy concerns related to this metadata?
I. The GPS coordinates embedded in the photo could reveal the user's home address or daily location patterns.
II. The timestamp metadata could allow someone to determine when the user is typically away from home.
III. The image resolution metadata makes it easier to compress the photo and reduce its quality.
- (A) I only
- (B) I and II only
- (C) II and III only
- (D) I, II, and III
Answer: B — I and II only
Statement I is true: GPS metadata embedded in photos is a well-documented privacy risk. Posting a photo taken at home reveals your home's exact location. Statement II is true: timestamps across many photos create a pattern of when and where someone is at predictable times, enabling location prediction. Statement III is false: image resolution is a technical characteristic of the image, not a privacy concern. It does not facilitate surveillance or reveal personal information. Since III is false, eliminate (C) and (D).
Question 5
A study finds that cities with more coffee shops have higher rates of college graduation. A journalist writes a headline stating: "Coffee Shops Drive Educational Achievement." Which of the following best explains why this conclusion is not supported by the data?
- (A) The data contains bias because coffee shop owners were not included in the study
- (B) The sample size was too small to establish a meaningful relationship between the two variables
- (C) The data shows correlation between coffee shops and graduation rates, but does not establish that one causes the other; both may be driven by a third factor such as urbanization or income level
- (D) The journalist correctly interpreted the data; high correlation is sufficient to conclude causation
Answer: C
Correlation does not imply causation. Both coffee shops and college graduates tend to cluster in wealthier, more urban areas — a third variable (affluence, urban density) likely drives both trends independently. Eliminate (A): the study's coverage of coffee shop owners is not the issue. Eliminate (B): sample size is not mentioned as a problem. Eliminate (D): this is the exact misconception the question is testing — high correlation never proves causation on its own.
Prefer email? Reach me directly at tanner@apcsexamprep.com