AP CSP Big Idea 2: Data — Complete 2025 Study Guide

Exam Weight: 17–22% • Approximately 12–15 Questions • 2025–2026 AP Exam

What You Will Learn: Big Idea 2 covers how all data — numbers, text, images, and sound — is stored as binary, how data is compressed to reduce file sizes, and how large datasets are processed to extract useful knowledge. You will also evaluate limitations and biases in data-driven conclusions. These concepts appear in both conceptual questions and scenario-based analysis on the AP exam.

Binary Representation

All data stored in a computer is represented as bits — binary digits that are either 0 or 1. Different types of data use different encoding schemes to convert real-world information into sequences of bits.

Numbers

Numbers are stored in binary (base 2). Each bit position represents a power of 2, read from right to left: 1, 2, 4, 8, 16, 32, 64, 128.

Binary | Calculation (place values, right to left) | Decimal
0101 | 1 + 0 + 4 + 0 | 5
1010 | 0 + 2 + 0 + 8 | 10
1111 | 1 + 2 + 4 + 8 | 15
10000000 | 128 | 128
Binary Conversion Shortcut: Write place values right to left: 1, 2, 4, 8, 16, 32, 64, 128. For each bit position that is 1, add its place value. For 1011: 1+2+0+8 = 11.
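
The shortcut above can be sketched in a few lines of Python. This is only an illustration of the place-value method; the function name binary_to_decimal is made up for this guide, and Python's built-in int(bits, 2) does the same job.

```python
def binary_to_decimal(bits: str) -> int:
    """Add the place value of every position that holds a 1."""
    total = 0
    place_value = 1  # the rightmost bit is worth 1
    for bit in reversed(bits):  # walk the bits right to left
        if bit == "1":
            total += place_value
        place_value *= 2  # each step to the left doubles the place value
    return total


for example in ["0101", "1010", "1111", "10000000", "1011"]:
    print(example, "->", binary_to_decimal(example))
```

Running it on 1011 adds 1 + 2 + 8 and gives 11, matching the hand calculation.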

Text: ASCII and Unicode

Each character is assigned a number, which is then stored in binary. ASCII covers basic Latin characters using 7 bits per character, usually stored in an 8-bit byte. Unicode uses more bits to represent characters from virtually every written language. The letter 'A' = 65 in ASCII = 01000001 in binary.
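
You can check the 'A' example yourself. The sketch below uses Python's built-in ord and chr; the helper names char_to_bits and bits_to_char are invented for illustration, and the 8-bit width is an assumption that fits ASCII characters.

```python
def char_to_bits(ch: str, width: int = 8) -> str:
    """Look up the character's number, then write it as a padded binary string."""
    code = ord(ch)  # 'A' -> 65
    return format(code, f"0{width}b")


def bits_to_char(bits: str) -> str:
    """Reverse the process: binary string -> number -> character."""
    return chr(int(bits, 2))


print(ord("A"))                  # 65
print(char_to_bits("A"))         # 01000001
print(bits_to_char("01000001"))  # A
```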

Images

Images are made of pixels. Each pixel stores a color value. In color images, each pixel stores red, green, and blue values. More bits per pixel means more colors and a larger file. More pixels (higher resolution) means more detail and a larger file.
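
To see why both factors grow the file, here is a rough size estimate for an uncompressed image. It is a simplified sketch: the function names are hypothetical, and real image files also carry headers, metadata, and usually compression.

```python
def image_size_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Uncompressed size: one color value per pixel, ignoring headers."""
    total_bits = width * height * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


def color_count(bits_per_pixel: int) -> int:
    """Each extra bit doubles the number of colors a pixel can store."""
    return 2 ** bits_per_pixel


base = image_size_bytes(100, 100, 24)         # 30,000 bytes
more_pixels = image_size_bytes(200, 200, 24)  # 120,000 bytes
fewer_bits = image_size_bytes(100, 100, 8)    # 10,000 bytes

print(base, more_pixels, fewer_bits)
print(color_count(8), color_count(24))  # 256 vs 16,777,216 colors
```

Doubling the width and height quadruples the size, while cutting bits per pixel from 24 to 8 cuts the size to a third and reduces the available colors.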

Sound

Sound is recorded by sampling — measuring the amplitude of a sound wave at regular intervals. Higher sampling rate (more samples per second) and greater bit depth (more bits per sample) produce better quality audio but larger files.
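
The same kind of estimate works for uncompressed audio. The sketch below is an approximation under stated assumptions: the function name is made up, the settings are example values, and real audio files add headers and are often compressed.

```python
def audio_size_bytes(sample_rate_hz: int, bit_depth: int,
                     seconds: int, channels: int = 1) -> int:
    """Uncompressed size: samples per second x bits per sample x duration."""
    total_bits = sample_rate_hz * bit_depth * seconds * channels
    return total_bits // 8


lower_quality = audio_size_bytes(8000, 8, 10)     # 80,000 bytes
higher_quality = audio_size_bytes(44100, 16, 10)  # 882,000 bytes

print(lower_quality, higher_quality)
```

Raising either the sampling rate or the bit depth multiplies the file size for the same ten seconds of sound.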

AP Tip: The exam does not require binary arithmetic beyond simple conversions. Know that increasing bits per pixel, increasing image resolution, or increasing audio sampling rate all increase file size.

Overflow and Roundoff Errors

Digital systems store numbers using a fixed number of bits, which creates two important limitations.

Overflow Error

An overflow error occurs when a calculation produces a number too large to be stored in the available bits. The result wraps around or becomes incorrect.

Example: A system uses 8 bits to store integers, so the maximum is 255 (2^8 − 1). Adding 1 to 255 causes overflow: the stored result wraps back to 0 instead of giving 256.
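
Python integers do not overflow on their own, so the sketch below simulates an 8-bit unsigned value by keeping only the lowest 8 bits. The helper name add_unsigned_8bit is hypothetical, and wraparound is just one way a real system might behave on overflow.

```python
BITS = 8
MAX_VALUE = 2 ** BITS - 1  # 255, the largest value 8 bits can hold


def add_unsigned_8bit(a: int, b: int) -> int:
    """Add two values, then discard anything that does not fit in 8 bits."""
    return (a + b) % (2 ** BITS)


print(MAX_VALUE)                     # 255
print(add_unsigned_8bit(255, 1))     # 0, not 256
print(add_unsigned_8bit(200, 100))   # 44, not 300
```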

Roundoff Error

A roundoff error occurs when a number cannot be represented exactly in binary, so it is stored as the closest possible approximation. This happens frequently with decimal fractions.

Example: The decimal value 0.1 cannot be expressed exactly in binary. Computing 0.1 + 0.2 in a program may produce 0.30000000000000004 rather than exactly 0.3.
Common Mistake: Overflow = number too large for the allocated bits. Roundoff = number cannot be expressed precisely (like a fraction). These are distinct. A number like 0.1 causes roundoff, not overflow.
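
The roundoff example is easy to reproduce. This short sketch shows the inexact sum and one common way to compare such values, using math.isclose from the Python standard library.

```python
import math

total = 0.1 + 0.2

print(total)                     # 0.30000000000000004
print(total == 0.3)              # False: the stored values are approximations
print(math.isclose(total, 0.3))  # True: equal within a tiny tolerance

is_exact = (total == 0.3)
is_close = math.isclose(total, 0.3)
```

Notice that nothing here is too large to store. The error comes from precision, which is why this is roundoff and not overflow.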

Data Compression: Lossless vs. Lossy

Compression reduces file size by encoding data more efficiently. There are two fundamentally different types, and the exam frequently tests which is appropriate in a given scenario.

Type | What Happens | Original Recoverable? | Common Formats
Lossless | Removes redundancy without discarding any data; original can be perfectly reconstructed | Yes, perfectly | ZIP, PNG, FLAC
Lossy | Permanently removes data the human senses struggle to perceive; smaller file but original is gone | No, data is permanently lost | JPEG, MP3, MP4
Key Idea: After lossy compression, the discarded information is gone forever. Decompressing a lossy file gives an approximation of the original, not the original itself. Lossless decompression always reproduces the exact original data.
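
The difference can be seen in a small experiment. The lossless half uses Python's real zlib module. The lossy half is a toy stand-in invented for this guide (it simply drops the low bits of each value); it is not how JPEG or MP3 work, but it shows the same one-way loss.

```python
import zlib

# Made-up sample data: 400 bytes with a repeating pattern
original = bytes([10, 11, 12, 13, 200, 201, 202, 203] * 50)

# Lossless: compress, then decompress, and get the exact original back
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)


# Toy lossy scheme: keep only a 4-bit code (0 to 15) for each 8-bit value
def lossy_encode(data: bytes, step: int = 16) -> list:
    return [value // step for value in data]


def lossy_decode(codes: list, step: int = 16) -> bytes:
    return bytes(code * step for code in codes)


approximation = lossy_decode(lossy_encode(original))

print(len(original), len(compressed))  # compressed is smaller
print(restored == original)            # True: nothing was lost
print(approximation == original)       # False: detail is gone for good
print(original[4], approximation[4])   # 200 became 192
```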

When Lossless is Required

Use lossless compression whenever losing any data is unacceptable:

  • Medical images (an altered pixel could hide a tumor)
  • Legal or financial documents (any change invalidates the record)
  • Executable programs (a single bit error corrupts the software)
  • Scientific data (exact values are required for accurate analysis)

Run-Length Encoding (Lossless Example)

Instead of storing AAAAAABBB, store 6A3B. This works well when large regions of data repeat — such as solid-color areas in a simple graphic.
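
Here is a minimal sketch of run-length encoding using the count-then-character format from the example. The function names are made up, and the sketch assumes the text itself contains no digits, since digits are used for the counts.

```python
def rle_encode(text: str) -> str:
    """Replace each run of repeated characters with <count><character>."""
    if not text:
        return ""
    parts = []
    current, count = text[0], 1
    for ch in text[1:]:
        if ch == current:
            count += 1
        else:
            parts.append(f"{count}{current}")
            current, count = ch, 1
    parts.append(f"{count}{current}")  # flush the final run
    return "".join(parts)


def rle_decode(encoded: str) -> str:
    """Rebuild the original text exactly, which is what makes this lossless."""
    result = []
    digits = ""
    for ch in encoded:
        if ch.isdigit():
            digits += ch  # counts can have more than one digit
        else:
            result.append(ch * int(digits))
            digits = ""
    return "".join(result)


print(rle_encode("AAAAAABBB"))              # 6A3B
print(rle_decode("6A3B"))                   # AAAAAABBB
print(rle_encode("ABC"))                    # 1A1B1C
```

The last line shows the flip side: when data has no repeats, run-length encoding can make it longer.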

AP Tip: When a question asks which compression type to use, ask: "Does losing any data matter here?" Medical, legal, executable = lossless. Photos for casual sharing, streaming music = lossy is acceptable.

Metadata

Metadata is data about data. It describes properties of a file without being the file's actual content.

File Type | Metadata Examples
Photo | Date taken, GPS coordinates, camera model, image dimensions
Email | Sender, recipient, timestamp, subject line, server routing path
Document | Author, date created, last modified, word count, software version
Music | Artist, album name, track number, duration, genre
Privacy Concern: Metadata can reveal sensitive information even when the file content seems harmless. A photo shared online may include GPS coordinates exposing your home address. The AP exam frequently tests the privacy implications of metadata collection.
Key Idea: Metadata is automatically attached to files by devices and software. Users often do not realize how much personal information is embedded in ordinary files they create and share.
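
One way to picture the split between content and metadata is a record with two parts. Everything in this sketch is hypothetical: the photo record, its field names, and its values are invented for illustration and do not come from any real file format or library.

```python
# Hypothetical photo record: the content is the pixels, the rest is metadata
photo = {
    "content": b"(pixel data)",
    "metadata": {
        "date_taken": "2025-01-01 08:00",
        "gps": (0.0, 0.0),
        "camera_model": "ExampleCam",
        "dimensions": (4000, 3000),
    },
}

# Fields this sketch treats as privacy-sensitive (an assumption, not a standard)
SENSITIVE_FIELDS = {"gps", "date_taken"}


def strip_sensitive(record: dict) -> dict:
    """Return a copy that keeps the content but drops sensitive metadata."""
    cleaned = {
        key: value
        for key, value in record["metadata"].items()
        if key not in SENSITIVE_FIELDS
    }
    return {"content": record["content"], "metadata": cleaned}


shared = strip_sensitive(photo)
print(sorted(shared["metadata"]))  # ['camera_model', 'dimensions']
```

The image itself is unchanged after stripping, which is the point: metadata travels with the file but is separate from what the file shows.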

Processing Data to Extract Knowledge

Raw data is not the same as knowledge. Programs process large datasets to find patterns and insights that would be impossible to see by examining individual records.

The Data Pipeline

  1. Collect — gather from sensors, surveys, transactions, web activity
  2. Store — save in structured formats like databases or spreadsheets
  3. Clean — remove duplicates, fix errors, handle missing values
  4. Analyze — apply programs to find patterns and build models
  5. Visualize — display results in charts, maps, and graphs to communicate findings
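
The five steps above can be sketched on a tiny, made-up dataset. The sensor readings, function names, and text "chart" below are all invented for illustration; a real pipeline would use far more data and dedicated tools.

```python
# 1. Collect and 2. Store: hypothetical (sensor, hour, temperature) readings
readings = [
    ("north", 1, 20.0),
    ("north", 1, 20.0),  # duplicate record
    ("north", 2, 22.0),
    ("south", 1, None),  # missing value
    ("south", 2, 26.0),
    ("south", 3, 28.0),
]


# 3. Clean: drop duplicates and rows with missing values
def clean(rows):
    seen = set()
    result = []
    for row in rows:
        if row in seen or row[2] is None:
            continue
        seen.add(row)
        result.append(row)
    return result


# 4. Analyze: average temperature per sensor
def average_by_sensor(rows):
    groups = {}
    for sensor, _hour, temp in rows:
        groups.setdefault(sensor, []).append(temp)
    return {sensor: sum(temps) / len(temps) for sensor, temps in groups.items()}


cleaned = clean(readings)
averages = average_by_sensor(cleaned)

# 5. Visualize: a simple text bar chart
for sensor, avg in averages.items():
    print(f"{sensor:<6} {'#' * round(avg)} {avg:.1f}")
```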

The amount of data generated today is too large for manual analysis. Computational tools process millions of records in seconds, identifying correlations across many variables simultaneously.

Example: A streaming service records which songs each user skips, pauses, replays, and saves. By processing millions of users' listening behaviors, the service discovers that fans of artist A also tend to like artist B — even without any programmer explicitly encoding that relationship.
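
A very simplified version of this idea is counting how often two artists appear in the same user's saved list. The users, artists, and counting approach below are assumptions made for illustration; they are not a description of how any real streaming service works.

```python
from collections import Counter
from itertools import combinations

# Made-up data: which artists each user has saved
saved_artists = {
    "user1": {"A", "B"},
    "user2": {"A", "B", "C"},
    "user3": {"A", "B"},
    "user4": {"C"},
    "user5": {"A", "D"},
}

# Count every pair of artists that shows up together for the same user
pair_counts = Counter()
for artists in saved_artists.values():
    for pair in combinations(sorted(artists), 2):
        pair_counts[pair] += 1

top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)  # ('A', 'B') 3
```

No line of the program says that fans of A like B. The pattern emerges from the data once enough records are processed.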

Limitations and Bias in Data

Data must be interpreted, and that interpretation can be wrong. The AP exam tests your ability to identify when data is being misused or when conclusions go beyond what the data actually supports.

Correlation vs. Causation

Two variables can be correlated, meaning they change together, without one causing the other. This is one of the most frequently tested concepts in Big Idea 2.

Classic Example: Ice cream sales and drowning rates both increase in summer. They are correlated, but eating ice cream does not cause drowning. A third factor — hot weather — drives both independently.
Common Mistake: When a question shows two things that increase together, do not assume one causes the other. The correct answer almost always says "the data suggests a relationship" rather than "the data proves that X causes Y."
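
The ice cream example can be imitated with made-up numbers. In this sketch, both quantities are computed from temperature alone and never from each other, yet their correlation comes out near 1. The data and formulas are invented for illustration; pearson is a hand-written version of the standard correlation coefficient.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient: +1 means the values rise together."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical data: temperature is the hidden third factor
temperatures = [5, 10, 15, 20, 25, 30]
ice_cream_sales = [20 + 8 * t for t in temperatures]  # depends only on temperature
drownings = [1 + t // 5 for t in temperatures]        # depends only on temperature

r = pearson(ice_cream_sales, drownings)
print(round(r, 3))
```

A strong correlation appears even though neither variable was used to compute the other, which is why correlation alone cannot establish causation.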

Sources of Bias

Bias Type | What It Is | Example
Collection bias | Data collection method excludes certain groups | An online survey only reaches people with internet access
Incomplete data | Important variables are missing from the dataset | A loan model does not include income as a factor
Algorithmic bias | A model trained on biased historical data reproduces and amplifies those biases | A hiring algorithm discriminates against women because women were historically underrepresented in the training data
Confirmation bias | Analysts search only for evidence that confirms existing beliefs | Only reporting the trials where a drug appeared to work
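
Collection bias is easy to demonstrate with numbers. The population below is entirely made up: 1,000 hypothetical people, 100 of whom follow fitness accounts. Surveying only that group gives a very different answer than measuring everyone.

```python
# Hypothetical population of 1,000 people (all counts are invented)
population = (
    [{"follows_fitness": True, "exercises_often": True}] * 80
    + [{"follows_fitness": True, "exercises_often": False}] * 20
    + [{"follows_fitness": False, "exercises_often": True}] * 180
    + [{"follows_fitness": False, "exercises_often": False}] * 720
)


def exercise_rate(people) -> float:
    """Fraction of the given group that exercises often."""
    return sum(person["exercises_often"] for person in people) / len(people)


# A survey posted on a fitness platform only reaches the followers
survey_sample = [p for p in population if p["follows_fitness"]]

print(f"Survey sample:    {exercise_rate(survey_sample):.0%}")  # 80%
print(f"Whole population: {exercise_rate(population):.0%}")     # 26%
```

The survey math is correct for the people it reached. The error comes from who was left out.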

Key Vocabulary

Term | Definition
Bit | The smallest unit of data; a single binary digit, either 0 or 1
Byte | 8 bits; the standard unit for measuring file and memory sizes
Overflow error | Occurs when a calculation result exceeds the maximum value representable by the allocated bits
Roundoff error | Occurs when a number cannot be represented exactly in binary and is stored as an approximation
Lossless compression | Reduces file size while allowing perfect reconstruction of the original data
Lossy compression | Reduces file size by permanently discarding data; original cannot be fully recovered
Metadata | Data that describes other data; properties of a file rather than the file's content itself
Sampling | Measuring the amplitude of a sound wave at regular intervals to convert it to digital data
Resolution | The number of pixels in an image; higher resolution means more detail and larger file size
Correlation | A statistical relationship where two variables tend to change together
Causation | One variable directly causes changes in another; a stronger claim than correlation
Bias | A systematic error in data collection or analysis that skews results toward a particular outcome
Run-length encoding | A lossless compression technique that replaces repeated values with a count and a single value

AP Exam-Style Practice Questions

Predict your answer before revealing it.

Question 1

A hospital stores MRI scans and wants to reduce storage costs by compressing the image files. Which compression method is required, and why?

  • (A) Lossy compression, because the file sizes will be significantly smaller and storage costs will be lower
  • (B) Lossy compression, because the human eye cannot detect the missing pixels in a medical image
  • (C) Lossless compression, because any loss of image data could result in missed diagnoses or altered findings
  • (D) Lossless compression, because it always produces smaller files than lossy compression

Answer: C

Medical images require exact, unaltered data. A missing or altered pixel in an MRI scan could cause a physician to miss a tumor or misread a finding. This is a scenario where data accuracy is critical, so lossless compression is required. Eliminate (A) and (B): the fact that humans might not notice the difference is irrelevant when medical accuracy is the priority. Eliminate (D): this is factually false — lossless compression typically produces larger files than lossy.

Question 2

A researcher studying exercise habits surveys 500 adults by posting a link to the survey on a fitness-focused social media platform. The survey finds that 80% of respondents exercise at least 5 days per week. Which of the following best describes a limitation of this data?

  • (A) The sample size of 500 is too small to draw any conclusions about exercise habits
  • (B) The sample is biased because people who follow fitness social media are more likely to exercise, making the results unrepresentative of the general population
  • (C) The data is affected by roundoff errors because percentages cannot be stored exactly in binary
  • (D) The survey cannot be trusted because it was posted online rather than administered in person

Answer: B

This is collection bias. The survey was posted on a fitness platform, which means it primarily reached people who are already interested in fitness. The sample is not representative of the general population. The 80% statistic may be accurate for that specific community but cannot be generalized to all adults. Eliminate (A): 500 is a reasonable sample size; the problem is who was sampled, not how many. Eliminate (C): roundoff errors are a computing concept, not relevant to survey methodology. Eliminate (D): online surveys are a valid data collection method.

Question 3 — Spot the Error

A student compresses a text file using a lossy compression algorithm to save storage space. Later, the student decompresses the file and notices some words are missing or garbled. Which of the following best explains what happened?

  • (A) The student experienced a roundoff error because the text characters could not be stored exactly in binary
  • (B) The student chose lossless compression, which sometimes discards data it identifies as redundant
  • (C) The student used lossy compression, which permanently removed data from the file; the original text cannot be recovered
  • (D) The file was too large for the compression algorithm, causing an overflow error during the compression process

Answer: C

Lossy compression permanently discards data. For media like photos and music, the discarded data is usually imperceptible. For text, even small amounts of lost data cause obvious and unacceptable corruption. Text files must always use lossless compression. Eliminate (A): roundoff errors affect numerical calculations, not text storage. Eliminate (B): this describes lossless compression incorrectly — lossless compression by definition does NOT discard data. Eliminate (D): overflow errors relate to numerical computation exceeding bit limits, not file compression.

Question 4 — I, II, and III format

A social media app automatically attaches metadata to every photo a user uploads. Which of the following are potential privacy concerns related to this metadata?

I. The GPS coordinates embedded in the photo could reveal the user's home address or daily location patterns.
II. The timestamp metadata could allow someone to determine when the user is typically away from home.
III. The image resolution metadata makes it easier to compress the photo and reduce its quality.

  • (A) I only
  • (B) I and II only
  • (C) II and III only
  • (D) I, II, and III

Answer: B — I and II only

Statement I is true: GPS metadata embedded in photos is a well-documented privacy risk. Posting a photo taken at home reveals your home's exact location. Statement II is true: timestamps across many photos reveal when someone is routinely away from home, making their schedule predictable. Statement III is false: image resolution is a technical characteristic of the image, not a privacy concern. It does not facilitate surveillance or reveal personal information. Since III is false, eliminate (C) and (D).

Question 5

A study finds that cities with more coffee shops have higher rates of college graduation. A journalist writes a headline stating: "Coffee Shops Drive Educational Achievement." Which of the following best explains why this conclusion is not supported by the data?

  • (A) The data contains bias because coffee shop owners were not included in the study
  • (B) The sample size was too small to establish a meaningful relationship between the two variables
  • (C) The data shows correlation between coffee shops and graduation rates, but does not establish that one causes the other; both may be driven by a third factor such as urbanization or income level
  • (D) The journalist correctly interpreted the data; high correlation is sufficient to conclude causation

Answer: C

Correlation does not imply causation. Both coffee shops and college graduates tend to cluster in wealthier, more urban areas — a third variable (affluence, urban density) likely drives both trends independently. Eliminate (A): the study's coverage of coffee shop owners is not the issue. Eliminate (B): sample size is not mentioned as a problem. Eliminate (D): this is the exact misconception the question is testing — high correlation never proves causation on its own.

Continue Studying: Big Idea 2 accounts for 17–22% of the exam. Master binary conversion, know lossless vs. lossy cold, and practice identifying correlation vs. causation. These three topics cover the majority of BI2 questions.