AP CSP Big Idea 2 Data

AP CSP Big Idea 2: Data — Complete 2025 Study Guide

Exam Weight: 17–22% • Approximately 12–15 Questions • 2025–2026 AP Exam

What You Will Learn: Big Idea 2 covers how all data — numbers, text, images, and sound — is stored as binary, how data is compressed to reduce file sizes, and how large datasets are processed to extract useful knowledge. You will also evaluate limitations and biases in data-driven conclusions. These concepts appear in both conceptual questions and scenario-based analysis on the AP exam.

Binary Representation
Overflow and Roundoff Errors
Data Compression: Lossless vs. Lossy
Metadata
Processing Data to Extract Knowledge
Limitations and Bias in Data
Key Vocabulary
AP Exam-Style Practice Questions

Binary Representation

All data stored in a computer is represented as bits — binary digits that are either 0 or 1. Different types of data use different encoding schemes to convert real-world information into sequences of bits.

Numbers

Numbers are stored in binary (base 2). Each bit position represents a power of 2, read from right to left: 1, 2, 4, 8, 16, 32, 64, 128.

Binary	Calculation	Decimal
`0101`	1+0+4+0	5
`1010`	0+2+0+8	10
`1111`	1+2+4+8	15
`10000000`	128	128

Binary Conversion Shortcut: Write place values right to left: 1, 2, 4, 8, 16, 32, 64, 128. For each bit position that is 1, add its place value. For 1011: 1+2+0+8 = 11.
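The shortcut can be sketched in a few lines of Python. This is only an illustration of the place-value method; the function name `binary_to_decimal` is an assumption made for this example, not something the AP exam requires.

```python
def binary_to_decimal(bits: str) -> int:
    """Add the place value of every position that holds a 1."""
    total = 0
    place_value = 1  # the rightmost bit is worth 1
    for bit in reversed(bits):
        if bit == "1":
            total += place_value
        place_value *= 2  # each step to the left doubles the place value
    return total

print(binary_to_decimal("1011"))      # 11
print(binary_to_decimal("10000000"))  # 128
print(int("1011", 2))                 # 11, Python's built-in conversion agrees
```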

Text: ASCII and Unicode

Each character is assigned a number, which is then stored in binary. ASCII covers basic Latin characters with 7-bit codes, usually stored in one 8-bit byte per character. Unicode uses more bits to represent characters from virtually every written language. The letter 'A' = 65 in ASCII = 01000001 in binary.
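The character-to-number-to-binary chain can be checked with Python's built-in `ord`, `chr`, and `format`. The helper name `char_to_bits` is an assumption for this sketch, and it only handles characters whose code fits in 8 bits.

```python
def char_to_bits(ch: str) -> str:
    """Look up the character's number, then write that number as 8 binary digits."""
    return format(ord(ch), "08b")

print(ord("A"))                  # 65
print(char_to_bits("A"))         # 01000001
print(chr(int("01000001", 2)))   # A
```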

Images

Images are made of pixels. Each pixel stores a color value. In color images, each pixel stores red, green, and blue values. More bits per pixel means more colors and a larger file. More pixels (higher resolution) means more detail and a larger file.
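A rough way to see why both resolution and bits per pixel drive file size is to compute the size of an uncompressed image. The dimensions below are hypothetical, and real image files are usually compressed, so treat this as an upper-bound sketch.

```python
def image_size_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Uncompressed size: one color value per pixel, 8 bits per byte."""
    return width * height * bits_per_pixel // 8

# Hypothetical 1920 x 1080 image
print(image_size_bytes(1920, 1080, 24))  # 6220800
print(image_size_bytes(1920, 1080, 8))   # 2073600, fewer bits per pixel
print(image_size_bytes(960, 540, 24))    # 1555200, fewer pixels
```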

Sound

Sound is recorded by sampling — measuring the amplitude of a sound wave at regular intervals. Higher sampling rate (more samples per second) and greater bit depth (more bits per sample) produce better quality audio but larger files.
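The same kind of estimate works for uncompressed audio. The sampling rate, bit depth, and channel count below are assumed values chosen for illustration; the `channels` parameter is an addition not discussed above.

```python
def audio_size_bytes(sample_rate: int, bit_depth: int, seconds: int, channels: int = 1) -> int:
    """Uncompressed size: samples per second x bits per sample x channels x duration."""
    return sample_rate * bit_depth * channels * seconds // 8

# One minute of audio at an assumed 44,100 samples per second, 16 bits per sample, 2 channels
print(audio_size_bytes(44100, 16, 60, channels=2))  # 10584000
# Doubling the sampling rate doubles the size
print(audio_size_bytes(88200, 16, 60, channels=2))  # 21168000
```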

AP Tip: The exam does not require binary arithmetic beyond simple conversions. Know that increasing bits per pixel, increasing image resolution, or increasing audio sampling rate all increase file size.

Overflow and Roundoff Errors

Digital systems store numbers using a fixed number of bits, which creates two important limitations.

Overflow Error

An overflow error occurs when a calculation produces a number too large to be stored in the available bits. The result wraps around or becomes incorrect.

Example: A system uses 8 bits to store unsigned (non-negative) integers, so the maximum is 255 (2⁸ − 1). Adding 1 to 255 causes overflow: the stored result wraps back to 0 instead of giving 256.
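Python integers grow as needed and do not overflow on their own, so the sketch below simulates an 8-bit unsigned system by keeping only the lowest 8 bits of each result. The function name `add_8bit` is an assumption for this example.

```python
BITS = 8
MAX_VALUE = 2**BITS - 1  # 255, the largest value 8 bits can hold

def add_8bit(a: int, b: int) -> int:
    """Add two values, then keep only the lowest 8 bits (wraparound)."""
    return (a + b) % (2**BITS)

print(MAX_VALUE)           # 255
print(add_8bit(255, 1))    # 0, wrapped around instead of 256
print(add_8bit(200, 100))  # 44, not 300
```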

Roundoff Error

A roundoff error occurs when a number cannot be represented exactly in binary, so it is stored as the closest possible approximation. This happens frequently with decimal fractions.

Example: The decimal value 0.1 cannot be expressed exactly in binary. Computing 0.1 + 0.2 in a program may produce 0.30000000000000004 rather than exactly 0.3.
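This behavior is easy to reproduce in Python. A common workaround, shown here as one option rather than the only one, is to compare with a tolerance using `math.isclose` instead of `==`.

```python
import math

total = 0.1 + 0.2
print(total)                     # 0.30000000000000004
print(total == 0.3)              # False, the stored values are approximations
print(math.isclose(total, 0.3))  # True, equal within a small tolerance
```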

Common Mistake: Overflow = number too large for the allocated bits. Roundoff = number cannot be expressed precisely (like a fraction). These are distinct. A number like 0.1 causes roundoff, not overflow.

Data Compression: Lossless vs. Lossy

Compression reduces file size by encoding data more efficiently. There are two fundamentally different types, and the exam frequently tests which is appropriate in a given scenario.

Type	What Happens	Original Recoverable?	Common Formats
Lossless	Removes redundancy without discarding any data; original can be perfectly reconstructed	Yes — perfectly	ZIP, PNG, FLAC
Lossy	Permanently removes data the human senses struggle to perceive; smaller file but original is gone	No — data is permanently lost	JPEG, MP3, MP4

Key Idea: After lossy compression, the discarded information is gone forever. Decompressing a lossy file gives an approximation of the original, not the original itself. Lossless decompression always reproduces the exact original data.
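The difference can be sketched in Python. The lossless half uses the standard library's `zlib` module; the "lossy" half is a deliberately simple stand-in (rounding sample values down to the nearest 10) and is not how JPEG or MP3 actually work. The sample values are made up for illustration.

```python
import zlib

# Lossless: the round trip gives back exactly the original bytes.
original = b"AAAAAABBB" * 100
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)
print(len(compressed) < len(original))  # True, the file got smaller
print(restored == original)             # True, nothing was lost

# Toy lossy step: keep only the tens, discard the ones digit.
samples = [12, 17, 23, 38, 41]          # hypothetical sample values
lossy = [(s // 10) * 10 for s in samples]
print(lossy)                            # [10, 10, 20, 30, 40]
print(lossy == samples)                 # False, the discarded digits cannot be recovered
```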

When Lossless is Required

Use lossless compression whenever losing any data is unacceptable:

Medical images (an altered pixel could hide a tumor)
Legal or financial documents (any change invalidates the record)
Executable programs (a single bit error corrupts the software)
Scientific data (exact values are required for accurate analysis)

Run-Length Encoding (Lossless Example)

Instead of storing AAAAAABBB, store 6A3B. This works well when large regions of data repeat — such as solid-color areas in a simple graphic.
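A minimal run-length encoder and decoder might look like the following. It assumes the input contains no digit characters (otherwise counts and data would be ambiguous); the function names are chosen for this sketch.

```python
from itertools import groupby

def rle_encode(text: str) -> str:
    """Replace each run of repeated characters with its count followed by the character."""
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(text))

def rle_decode(encoded: str) -> str:
    """Rebuild the original text from count/character pairs."""
    result = []
    count = ""
    for ch in encoded:
        if ch.isdigit():
            count += ch            # counts may have more than one digit
        else:
            result.append(ch * int(count))
            count = ""
    return "".join(result)

print(rle_encode("AAAAAABBB"))  # 6A3B
print(rle_decode("6A3B"))       # AAAAAABBB
```

Because decoding reproduces the input exactly, this is a lossless technique. Note that text with few repeats (for example `ABC` becomes `1A1B1C`) can come out longer than the original.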

AP Tip: When a question asks which compression type to use, ask: "Does losing any data matter here?" Medical, legal, executable = lossless. Photos for casual sharing, streaming music = lossy is acceptable.

Metadata

Metadata is data about data. It describes properties of a file without being the file's actual content.

File Type	Metadata Examples
Photo	Date taken, GPS coordinates, camera model, image dimensions
Email	Sender, recipient, timestamp, subject line, server routing path
Document	Author, date created, last modified, word count, software version
Music	Artist, album name, track number, duration, genre

Privacy Concern: Metadata can reveal sensitive information even when the file content seems harmless. A photo shared online may include GPS coordinates exposing your home address. The AP exam frequently tests the privacy implications of metadata collection.

Key Idea: Metadata is automatically attached to files by devices and software. Users often do not realize how much personal information is embedded in ordinary files they create and share.
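One way to picture the split between content and metadata is as a record with two parts. Everything in the sketch below is a hypothetical placeholder (the field names, the values, and the `strip_sensitive` helper); real photo formats store metadata in their own structures.

```python
# All values are placeholders, not taken from a real file.
photo = {
    "content": b"<pixel data>",
    "metadata": {
        "date_taken": "2025-01-01 12:00",
        "gps": (0.0, 0.0),
        "camera_model": "ExampleCam",
        "dimensions": (4000, 3000),
    },
}

def strip_sensitive(file_record, sensitive_keys=("gps", "date_taken")):
    """Return a copy with the same content but without the sensitive metadata fields."""
    kept = {k: v for k, v in file_record["metadata"].items() if k not in sensitive_keys}
    return {"content": file_record["content"], "metadata": kept}

shared = strip_sensitive(photo)
print(sorted(shared["metadata"]))             # ['camera_model', 'dimensions']
print(shared["content"] == photo["content"])  # True, the image itself is unchanged
```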

Processing Data to Extract Knowledge

Raw data is not the same as knowledge. Programs process large datasets to find patterns and insights that would be impossible to see by examining individual records.

The Data Pipeline

Collect — gather from sensors, surveys, transactions, web activity
Store — save in structured formats like databases or spreadsheets
Clean — remove duplicates, fix errors, handle missing values
Analyze — apply programs to find patterns and build models
Visualize — display results in charts, maps, and graphs to communicate findings
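The clean and analyze steps above can be sketched on a tiny dataset. The records, field names, and helper functions are invented for this illustration; a real pipeline would read from a database or file.

```python
# Hypothetical raw records collected from users
raw_records = [
    {"user": "u1", "minutes": 30},
    {"user": "u2", "minutes": None},  # missing value
    {"user": "u1", "minutes": 30},    # duplicate of the first record
    {"user": "u3", "minutes": 50},
]

def clean(records):
    """Drop records with missing values and exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        key = (record["user"], record["minutes"])
        if record["minutes"] is None or key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned

def analyze(records):
    """Compute the average number of minutes across the cleaned records."""
    return sum(r["minutes"] for r in records) / len(records)

cleaned = clean(raw_records)
print(len(cleaned))      # 2
print(analyze(cleaned))  # 40.0
```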

The amount of data generated today is too large for manual analysis. Computational tools process millions of records in seconds, identifying correlations across many variables simultaneously.

Example: A streaming service records which songs each user skips, pauses, replays, and saves. By processing millions of users' listening behaviors, the service discovers that fans of artist A also tend to like artist B — even without any programmer explicitly encoding that relationship.
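A very simplified version of that discovery is counting how often two artists are saved by the same user. The data and names below are made up, and real recommendation systems are far more sophisticated; this only shows how a pattern can emerge from counting rather than from a hand-written rule.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: which artists each user has saved
saved = {
    "user1": {"A", "B"},
    "user2": {"A", "B"},
    "user3": {"A", "B", "C"},
    "user4": {"C"},
}

pair_counts = Counter()
for artists in saved.values():
    # Count every pair of artists that appears together for one user
    for pair in combinations(sorted(artists), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))  # [(('A', 'B'), 3)]
```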

Limitations and Bias in Data

Data must be interpreted, and that interpretation can be wrong. The AP exam tests your ability to identify when data is being misused or when conclusions go beyond what the data actually supports.

Correlation vs. Causation

Two variables can be correlated — they change together — without one causing the other. This is the single most tested concept in Big Idea 2.

Classic Example: Ice cream sales and drowning rates both increase in summer. They are correlated, but eating ice cream does not cause drowning. A third factor — hot weather — drives both independently.
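The numbers below are invented purely to illustrate the point: two variables that both follow temperature end up strongly correlated with each other, even though neither causes the other. The `pearson` helper is a plain implementation of the standard correlation coefficient written for this sketch.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient: +1 means the values rise together perfectly."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up monthly values; both columns rise as the weather gets hotter
ice_cream_sales = [110, 140, 210, 240]
drownings = [2, 3, 5, 6]

r = pearson(ice_cream_sales, drownings)
print(r > 0.9)  # True: strongly correlated, yet neither causes the other
```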

Common Mistake: When a question shows two things that increase together, do not assume one causes the other. The correct answer almost always says "the data suggests a relationship" rather than "the data proves that X causes Y."

Sources of Bias

Bias Type	What It Is	Example
Collection bias	Data collection method excludes certain groups	An online survey only reaches people with internet access
Incomplete data	Important variables are missing from the dataset	A loan model does not include income as a factor
Algorithmic bias	A model trained on biased historical data reproduces and amplifies those biases	A hiring algorithm discriminates against women because women were historically underrepresented in the training data
Confirmation bias	Analysts search only for evidence that confirms existing beliefs	Only reporting the trials where a drug appeared to work
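Collection bias can be demonstrated with a small made-up population. In this sketch each person is a pair of hypothetical flags (follows a fitness platform, exercises at least 5 days a week); surveying only the platform's followers gives a very different answer than surveying everyone.

```python
# Hypothetical population: (follows_fitness_platform, exercises_5_days_a_week)
population = [(True, True)] * 3 + [(False, True)] * 1 + [(False, False)] * 6

def exercise_rate(people):
    """Fraction of the given people who exercise at least 5 days a week."""
    return sum(1 for _, exercises in people if exercises) / len(people)

# A survey posted on the fitness platform only reaches its followers
platform_sample = [person for person in population if person[0]]

print(exercise_rate(population))       # 0.4
print(exercise_rate(platform_sample))  # 1.0
```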

Key Vocabulary

Term	Definition
Bit	The smallest unit of data; a single binary digit, either 0 or 1
Byte	8 bits; the standard unit for measuring file and memory sizes
Overflow error	Occurs when a calculation result exceeds the maximum value representable by the allocated bits
Roundoff error	Occurs when a number cannot be represented exactly in binary and is stored as an approximation
Lossless compression	Reduces file size while allowing perfect reconstruction of the original data
Lossy compression	Reduces file size by permanently discarding data; original cannot be fully recovered
Metadata	Data that describes other data; properties of a file rather than the file's content itself
Sampling	Measuring the amplitude of a sound wave at regular intervals to convert it to digital data
Resolution	The number of pixels in an image; higher resolution means more detail and larger file size
Correlation	A statistical relationship where two variables tend to change together
Causation	One variable directly causes changes in another; a stronger claim than correlation
Bias	A systematic error in data collection or analysis that skews results toward a particular outcome
Run-length encoding	A lossless compression technique that replaces repeated values with a count and a single value

AP Exam-Style Practice Questions

Predict your answer before reading the explanation.

Question 1

A hospital stores MRI scans and wants to reduce storage costs by compressing the image files. Which compression method is required, and why?

(A) Lossy compression, because the file sizes will be significantly smaller and storage costs will be lower
(B) Lossy compression, because the human eye cannot detect the missing pixels in a medical image
(C) Lossless compression, because any loss of image data could result in missed diagnoses or altered findings
(D) Lossless compression, because it always produces smaller files than lossy compression

Answer: C

Medical images require exact, unaltered data. A missing or altered pixel in an MRI scan could cause a physician to miss a tumor or misread a finding. This is a scenario where data accuracy is critical, so lossless compression is required. Eliminate (A) and (B): the fact that humans might not notice the difference is irrelevant when medical accuracy is the priority. Eliminate (D): this is factually false — lossless compression typically produces larger files than lossy.

Question 2

A researcher studying exercise habits surveys 500 adults by posting a link to the survey on a fitness-focused social media platform. The survey finds that 80% of respondents exercise at least 5 days per week. Which of the following best describes a limitation of this data?

(A) The sample size of 500 is too small to draw any conclusions about exercise habits
(B) The sample is biased because people who follow fitness social media are more likely to exercise, making the results unrepresentative of the general population
(C) The data is affected by roundoff errors because percentages cannot be stored exactly in binary
(D) The survey cannot be trusted because it was posted online rather than administered in person

Answer: B

This is collection bias. The survey was posted on a fitness platform, which means it primarily reached people who are already interested in fitness. The sample is not representative of the general population. The 80% statistic may be accurate for that specific community but cannot be generalized to all adults. Eliminate (A): 500 is a reasonable sample size; the problem is who was sampled, not how many. Eliminate (C): roundoff errors are a computing concept, not relevant to survey methodology. Eliminate (D): online surveys are a valid data collection method.

Question 3 — Spot the Error

A student compresses a text file using a lossy compression algorithm to save storage space. Later, the student decompresses the file and notices some words are missing or garbled. Which of the following best explains what happened?

(A) The student experienced a roundoff error because the text characters could not be stored exactly in binary
(B) The student chose lossless compression, which sometimes discards data it identifies as redundant
(C) The student used lossy compression, which permanently removed data from the file; the original text cannot be recovered
(D) The file was too large for the compression algorithm, causing an overflow error during the compression process

Answer: C

Lossy compression permanently discards data. For media like photos and music, the discarded data is usually imperceptible. For text, even small amounts of lost data cause obvious and unacceptable corruption. Text files must always use lossless compression. Eliminate (A): roundoff errors affect numerical calculations, not text storage. Eliminate (B): this describes lossless compression incorrectly — lossless compression by definition does NOT discard data. Eliminate (D): overflow errors relate to numerical computation exceeding bit limits, not file compression.

Question 4 — I, II, and III format

A social media app automatically attaches metadata to every photo a user uploads. Which of the following are potential privacy concerns related to this metadata?

I. The GPS coordinates embedded in the photo could reveal the user's home address or daily location patterns.
II. The timestamp metadata could allow someone to determine when the user is typically away from home.
III. The image resolution metadata makes it easier to compress the photo and reduce its quality.

(A) I only
(B) I and II only
(C) II and III only
(D) I, II, and III

Answer: B — I and II only

Statement I is true: GPS metadata embedded in photos is a well-documented privacy risk. Posting a photo taken at home reveals your home's exact location. Statement II is true: timestamps across many photos create a pattern of when and where someone is at predictable times, enabling location prediction. Statement III is false: image resolution is a technical characteristic of the image, not a privacy concern. It does not facilitate surveillance or reveal personal information. Since III is false, eliminate (C) and (D).

Question 5

A study finds that cities with more coffee shops have higher rates of college graduation. A journalist writes a headline stating: "Coffee Shops Drive Educational Achievement." Which of the following best explains why this conclusion is not supported by the data?

(A) The data contains bias because coffee shop owners were not included in the study
(B) The sample size was too small to establish a meaningful relationship between the two variables
(C) The data shows correlation between coffee shops and graduation rates, but does not establish that one causes the other; both may be driven by a third factor such as urbanization or income level
(D) The journalist correctly interpreted the data; high correlation is sufficient to conclude causation

Answer: C

Correlation does not imply causation. Both coffee shops and college graduates tend to cluster in wealthier, more urban areas — a third variable (affluence, urban density) likely drives both trends independently. Eliminate (A): the study's coverage of coffee shop owners is not the issue. Eliminate (B): sample size is not mentioned as a problem. Eliminate (D): this is the exact misconception the question is testing — high correlation never proves causation on its own.

Continue Studying: Big Idea 2 accounts for 17–22% of the exam. Master binary conversion, know lossless vs. lossy cold, and practice identifying correlation vs. causation. These three topics cover the majority of BI2 questions.
