Skip to content

Data and Databases

Data and Databases

This topic covers data representation, data types, database concepts, SQL, normalisation, and data Integrity.

Data Representation

Text Encoding (OL/HL)

ASCII: 7-bit code, 128 characters (uppercase, lowercase, digits, symbols).

Unicode: supports all world languages. UTF-8 is variable-length (1—4 bytes), backward Compatible with ASCII.

Example (OL): The letter ‘A’ in ASCII is 6510=01000001265_{10} = 01000001_2.

Worked Example (OL). What is stored in binary for the word “Cat”?

C = 67 = 01000011, a = 97 = 01100001, t = 116 = 01110100.

“Cat” = 01000011 01100001 01110100 (3 bytes).

Example (HL). The difference between uppercase and lowercase ASCII codes is 32. To convert ‘A’ To ‘a’, add 32 to the ASCII code: 65 + 32 = 97.

Images (OL/HL)

Bitmap images: composed of pixels. Each pixel has a colour value.

  • Colour depth: number of bits per pixel.
  • 1 bit: black and white (2 colours).
  • 8 bits: 256 colours.
  • 24 bits: true colour (16,777,21616,777,216 colours).

Image file size (bits): width ×\times height ×\times colour depth.

Example (OL): A 800×600800 \times 600 image with 24-bit colour depth.

\mathrm{Size = 800 \times 600 \times 24 = 11520000\mathrm{ bits = 1440000\mathrm{ bytes \approx 1.37\mathrm{ MB

Worked Example (OL). An image is 1024 ×\times 768 with 16-bit colour depth.

Size = 1024×768×16=125829121024 \times 768 \times 16 = 12582912 bits = 15728641572864 bytes 1.5\approx 1.5 MB.

Vector images: composed of mathematical descriptions of shapes. Scale without losing quality. Smaller file sizes for simple images.

Bitmap vs Vector comparison:

FeatureBitmapVector
RepresentationGrid of pixelsMathematical descriptions
ScalingLoses quality when enlargedNo quality loss at any size
File sizeDepends on resolutionDepends on complexity
Best forPhotographsLogos, icons, diagrams

Sound (HL)

Sound is sampled (analog to digital conversion).

  • Sample rate: number of samples per second (e.g., 44,100 Hz for CD quality).
  • Sample resolution (bit depth): number of bits per sample (e.g., 16-bit).
  • Channels: mono (1) or stereo (2).

File size (bits): sample rate ×\times sample resolution ×\times duration ×\times channels.

Example (HL): 3 minutes of stereo audio at 44,100 Hz, 16-bit.

\mathrm{Size = 44100 \times 16 \times 180 \times 2 = 254016000\mathrm{ bits \approx 30.3\mathrm{ MB

Worked Example (HL). A 5-minute mono recording at 22050 Hz with 8-bit depth.

Size = 22050×8×300=5292000022050 \times 8 \times 300 = 52920000 bits = 66150006615000 bytes 6.31\approx 6.31 MB.

Nyquist theorem (HL). The sample rate must be at least twice the highest frequency to avoid Aliasing. Human hearing goes up to about 20,000 Hz, so 44,100 Hz is sufficient for CD quality.

Compression (OL/HL)

Lossless compression: no data is lost; original can be reconstructed exactly (e.g., PNG, FLAC, ZIP).

Lossy compression: some data is discarded; original cannot be reconstructed (e.g., JPEG, MP3, MPEG).

Run-Length Encoding (RLE) is a simple lossless method: replace repeated values with the value and Count.

Example (OL): AAABBCCCC becomes 3A2B4C.

Worked Example (OL). Compress WWWWWWBBBBWWWWW using RLE.

W6 B4 W5 = 6 runs. Original: 15 bytes. Compressed: 12 bytes (assuming 1 byte per count, 1 byte per Value).

When RLE is ineffective. For data with no repeats, e.g., ABCDEFGH, RLE produces A1B1C1D1E1F1G1H1 = 16 bytes, which is larger than the original 8 bytes.

Data Types (OL/HL)

TypeDescriptionExample
IntegerWhole numbers42, -7
Real/FloatDecimal numbers3.14, -0.001
CharacterSingle character’A’, ‘z’
StringSequence of characters”Hello”
BooleanTrue or falsetrue, false
DateCalendar date2026-04-14

Worked Example (OL). Which data type would you use for each of the following?

  • A person’s age: Integer
  • A product price: Real/Float
  • A phone number: String (contains leading zeros)
  • Whether a student has passed: Boolean
  • A student’s date of birth: Date

Database Concepts (OL/HL)

Flat File vs Relational Database

Flat file: data stored in a single table. Problems: data redundancy, inconsistency, difficulty Of updating.

Relational database: data stored in multiple related tables. Advantages: reduced redundancy, Data integrity, easier maintenance.

FeatureFlat FileRelational Database
TablesSingleMultiple, related
RedundancyHighMinimised by normalisation
Data integrityNot enforcedEnforced by constraints
Concurrent accessNot supportedSupported with locking
QueryingManual (sequential)SQL (declarative, powerful)

Key Terms

TermDefinition
EntityA thing/object about which data is stored (e.g., Student)
AttributeA property of an entity (e.g., Name, Date of Birth)
Primary keyUniquely identifies each record in a table
Foreign keyA primary key from another table, establishing a relationship
RecordA single row in a table
FieldA single column in a table

Worked Example (OL). In a Student table, which field would be the best primary key?

StudentID is the best choice because it is unique and never changes. Name is not unique (two Students can share a name). Date of Birth is not unique.

Entity-Relationship Diagrams (HL)

An ER diagram shows entities, attributes, and relationships.

  • One-to-one (1:1): each record in one table relates to exactly one record in another.
  • One-to-many (1:M): one record in a table relates to many in another (e.g., one teacher has many students).
  • Many-to-many (M:N): many records in one table relate to many in another (e.g., students and courses). Resolved using a junction table.

Worked Example (HL). A school has Teachers and Subjects. Each teacher teaches one subject, but a Subject can be taught by many teachers. What is the relationship?

Teacher (1) to (M) Subject. The foreign key is SubjectCode in the Teacher table.

SQL (Structured Query Language) (HL)

Data Definition Language (DDL)

Create a table:

CREATE TABLE Students (
StudentID INT PRIMARY KEY,
FirstName VARCHAR(50) NOT NULL,
LastName VARCHAR(50) NOT NULL,
DateOfBirth DATE,
Grade CHAR(2)
);

Add a column:

ALTER TABLE Students ADD COLUMN Email VARCHAR(100);

Delete a table:

DROP TABLE Students;

Data Manipulation Language (DML)

Insert data:

INSERT INTO Students (StudentID, FirstName, LastName, DateOfBirth, Grade)
VALUES (1, 'John', 'Smith', '2008-03-15', 'HL');

Update data:

UPDATE Students
SET Grade = 'OL'
WHERE StudentID = 3;

Delete data:

DELETE FROM Students
WHERE StudentID = 5;

Worked Example. Insert three students into the Students table.

INSERT INTO Students (StudentID, FirstName, LastName, DateOfBirth, Grade)
VALUES
(1, 'Alice', 'Murphy', '2007-06-20', 'HL'),
(2, 'Bob', 'Kelly', '2008-01-15', 'OL'),
(3, 'Clara', 'O\'Brien', '2007-09-03', 'HL');

Queries (DQL)

Select all columns:

SELECT * FROM Students;

Select specific columns:

SELECT FirstName, LastName, Grade FROM Students;

Where clause:

SELECT * FROM Students
WHERE Grade = 'HL';

Comparison operators: =``<>``<``>``<=``>=.

Logical operators: AND``OR``NOT.

Pattern matching:

SELECT * FROM Students
WHERE FirstName LIKE 'J%';

Ordering results:

SELECT * FROM Students
ORDER BY LastName ASC;

Aggregate functions:

SELECT COUNT(*) FROM Students;
SELECT AVG(Mark) FROM Results;
SELECT MAX(Mark), MIN(Mark) FROM Results;

Group by:

SELECT Grade, COUNT(*)
FROM Students
GROUP BY Grade;

Having clause (filters groups):

SELECT Grade, AVG(Mark)
FROM Results
GROUP BY Grade
HAVING AVG(Mark) > 70;

Joins:

SELECT Students.FirstName, Students.LastName, Results.Mark
FROM Students
INNER JOIN Results ON Students.StudentID = Results.StudentID;

Worked Example. Find all HL students whose average mark is above 75.

SELECT Students.FirstName, Students.LastName, AVG(Results.Mark) AS AverageMark
FROM Students
INNER JOIN Results ON Students.StudentID = Results.StudentID
WHERE Students.Grade = 'HL'
GROUP BY Students.StudentID, Students.FirstName, Students.LastName
HAVING AVG(Results.Mark) > 75;

Views (HL)

A view is a virtual table based on a query:

CREATE VIEW HighAchievers AS
SELECT FirstName, LastName, Mark
FROM Students
INNER JOIN Results ON Students.StudentID = Results.StudentID
WHERE Mark >= 80;

Normalisation (HL)

Normalisation reduces data redundancy and dependency.

First Normal Form (1NF)

  • No repeating groups or arrays.
  • Each cell contains a single value.
  • Each record is unique (has a primary key).

Second Normal Form (2NF)

  • Must be in 1NF.
  • No partial dependencies (all non-key attributes depend on the entire primary key).

Third Normal Form (3NF)

  • Must be in 2NF.
  • No transitive dependencies (non-key attributes do not depend on other non-key attributes).

Example (HL):

Unnormalised table:

StudentIDNameSubject1Teacher1Subject2Teacher2
1JohnMathsMs LeePhysicsMr Kelly

1NF: Remove repeating groups.

StudentIDNameSubjectTeacher
1JohnMathsMs Lee
1JohnPhysicsMr Kelly

2NF: Separate into Students and Enrolments tables.

Students: StudentID, Name.

Enrolments: StudentID, Subject, Teacher.

3NF: Remove transitive dependency (Teacher depends on Subject, not StudentID).

Students: StudentID, Name.

Subjects: Subject, Teacher.

Enrolments: StudentID, Subject.

Why normalisation matters:

  • Update anomaly: Without normalisation, if Ms Lee changes her name, every row must be updated.
  • Insertion anomaly: A new subject with no students cannot be added.
  • Deletion anomaly: If John leaves, the information about Maths and its teacher is lost.

Data Integrity (HL)

Types of Integrity

  • Entity integrity: no null values in primary key.
  • Referential integrity: every foreign key value must match a primary key value in the related table (or be null).
  • Domain integrity: values in a column must be of the correct data type and within the valid range.

Worked Example (HL). Explain referential integrity with an example.

If a Student record with StudentID = 5 exists in the Enrolments table, you cannot delete StudentID = 5 from the Students table without first deleting the enrolments. Otherwise, the Enrolments table Would reference a non-existent student.

Advanced SQL Techniques (HL)

Window Functions

SELECT FirstName, LastName, Subject, Mark,
RANK() OVER (PARTITION BY Subject ORDER BY Mark DESC) AS SubjectRank
FROM Results;

Common Table Expressions

WITH SubjectAverages AS (
SELECT Subject, AVG(Mark) AS AvgMark
FROM Results
GROUP BY Subject
)
SELECT Subject, AvgMark
FROM SubjectAverages
ORDER BY AvgMark DESC;

Stored Procedures

CREATE PROCEDURE AddStudent(
IN pName VARCHAR(50),
IN pGrade CHAR(2)
)
BEGIN
INSERT INTO Students (FirstName, LastName, Grade)
VALUES (pName, '', pGrade);
END;

SQL Injection Prevention

Vulnerable:

query = f"SELECT * FROM Users WHERE name = '{user_input}'"

Secure (parameterised):

cursor.execute("SELECT * FROM Users WHERE name = ?", (user_input,))

Database Design Patterns

Many-to-Many Resolution:

A many-to-many relationship always requires a junction (link) table.

Example: Students and Courses (M:N) requires an Enrolments table.

Referential Integrity Constraints

-- ON DELETE CASCADE: delete related records
CREATE TABLE Enrolment (
StudentID INT,
CourseCode VARCHAR(10),
PRIMARY KEY (StudentID, CourseCode),
FOREIGN KEY (StudentID) REFERENCES Students(StudentID) ON DELETE CASCADE,
FOREIGN KEY (CourseCode) REFERENCES Courses(CourseCode) ON DELETE CASCADE
);

Data Representation Additional Topics

Representing Text in Detail

Unicode and UTF-8 encoding:

CharacterUnicodeUTF-8 Bytes
AU+004101000001 (1 byte)
EuroU+20AC11100010 10000010 10101100 (3 bytes)
EmojiU+1F6004 bytes

Why UTF-8 is preferred: Backward-compatible with ASCII, variable-length (efficient for English Text), and supports all Unicode characters.

Image Compression Techniques

JPEG compression: Uses discrete cosine transform (DCT) to discard high-frequency information That the human eye is less sensitive to. The quality can be adjusted (higher quality = larger file).

PNG compression: Uses lossless deflate compression. Best for images with sharp edges and text (like screenshots).

Audio Compression

MP3 compression: Uses psychoacoustic modelling to discard sounds that the human ear cannot Perceive (e.g., quiet sounds masked by louder ones at similar frequencies). Typical bitrate: 128 Kbps (acceptable quality) to 320 kbps (high quality).

FLAC compression: Lossless compression of audio data. File sizes are larger than MP3 but the Original can be perfectly reconstructed.

Additional Practice Questions

  1. A bitmap image has resolution 1600 x 900 and file size 3 MB. Calculate the colour depth.

  2. Write SQL that uses a subquery to find students who are NOT enrolled in any course.

  3. Write SQL using a CTE to find the subject with the highest average mark.

  4. Explain the difference between DELETE, DROP, and TRUNCATE in SQL.

  5. A school database needs tables for Teacher, Subject, and Classroom. Design the schema with appropriate primary and foreign keys.

  6. Write SQL to find all students whose marks are above the class average in every subject.

  7. Explain why UTF-8 is the most widely used character encoding on the web.

  8. Write SQL to create a view that shows each teacher and the number of subjects they teach.

  9. Explain three types of SQL injection attacks and how to prevent each.

  10. A music streaming service stores songs as MP3 files at 128 kbps. A 4-minute song at CD quality (44,100 Hz, 16-bit, stereo) would be approximately 30 MB. Estimate the MP3 file size and explain the compression ratio.

Additional SQL Practice

Complex Queries

Self-join: A table joined with itself to find relationships within the same data.

SELECT s1.FirstName, s1.LastName
FROM Students s1
JOIN Students s2 ON s1.DateOfBirth = s2.DateOfBirth
WHERE s1.StudentID < s2.StudentID;

This finds pairs of students who share the same birthday.

EXISTS and NOT EXISTS:

SELECT FirstName, LastName
FROM Students s
WHERE EXISTS (
SELECT 1 FROM Results r
WHERE r.StudentID = s.StudentID AND r.Mark >= 90
);

CASE expressions:

SELECT FirstName, Grade,
CASE
WHEN Grade = 'HL' THEN 'Higher Level'
WHEN Grade = 'OL' THEN 'Ordinary Level'
ELSE 'Unknown'
END AS LevelDescription
FROM Students;

Aggregate Functions in Detail

FunctionDescriptionIgnores NULL?
COUNT(*)Count all rowsNo
COUNT(col)Count non-NULL values in columnYes
SUMSum of valuesYes
AVGAverage of valuesYes
MINSmallest valueYes
MAXLargest valueYes

Worked Example. Find the student with the highest mark in each subject.

SELECT r.Subject, s.FirstName, s.LastName, r.Mark
FROM Results r
JOIN Students s ON r.StudentID = s.StudentID
WHERE r.Mark = (
SELECT MAX(r2.Mark)
FROM Results r2
WHERE r2.Subject = r.Subject
);

Data Representation Additional Topics

Run-Length Encoding Practice

Worked Example. Compress: AAAAABBBCCCCCCDD

A5 B3 C6 D2 = 8 runs. Original: 16 bytes. Compressed: 16 bytes. No improvement (break-even).

Worked Example. Compress: XXXXXXXXXXYYYYYZZZZZ

X10 Y5 Z5 = 3 runs. Original: 20 bytes. Compressed: 6 bytes. Compression ratio: 20/6 = 3.33:1.

Huffman Coding (HL)

Huffman coding is a lossless compression technique that assigns shorter codes to more frequent Characters and longer codes to less frequent characters.

Example: For the string “BCCABBDDAECCBBAEDDCC”

CharacterFrequencyHuffman Code
C600
B401
D410
A3110
E2111

Original: 19 characters x 8 bits = 152 bits. Compressed: (6x2) + (4x2) + (4x2) + (3x3) + (2x3) = 12 + 8 + 8 + 9 + 6 = 43 bits.

Additional Practice Questions

  1. Write SQL to find all students who scored above 80 in at least 3 different subjects.

  2. Write SQL using a CASE expression to convert numerical marks to letter grades: A (90-100), B (80-89), C (70-79), D (60-69), F (below 60).

  3. Explain how Huffman coding achieves compression. Why is it lossless?

  4. Write SQL to create a view that shows each student’s name, their average mark, and a count of how many subjects they are enrolled in.

  5. Compress the string “MMMMMMNNNOOOOOPP” using RLE. Calculate the compression ratio.

  6. Write SQL using NOT EXISTS to find products that have never been ordered.

  7. Explain the difference between COUNT(*) and COUNT(column). Give an example where they produce different results.

  8. Write SQL to find the top 3 students by total marks across all subjects, using a CTE.

Ethical Considerations in Database Design (HL)

Data Protection Principles

  • Lawfulness: Data must be collected and processed legally.
  • Purpose limitation: Data collected for one purpose should not be used for another.
  • Data minimisation: Only collect data that is necessary.
  • Accuracy: Data must be kept accurate and up to date.
  • Storage limitation: Data should not be kept longer than necessary.
  • Security: Appropriate measures must protect data.

Right to Be Forgotten

Under GDPR, individuals have the right to request deletion of their personal data.

DELETE FROM Students WHERE StudentID = 42;
DELETE FROM Results WHERE StudentID = 42;

Anonymisation vs Pseudonymisation

  • Anonymisation: Permanently removes identifying information. Data cannot be linked back to individuals.
  • Pseudonymisation: Replaces identifying information with artificial identifiers. Data can be linked back using a key.

Worked Example. A hospital wants to share patient data for research. Explain why Pseudonymisation might be preferred over anonymisation.

Pseudonymisation allows the hospital to re-link data to patients if needed (e.g., for follow-up Studies). Anonymisation would make this impossible. However, pseudonymised data still carries a Small re-identification risk.

Additional Practice Questions

  1. Explain the difference between anonymisation and pseudonymisation. When might each be used?

  2. Write SQL to implement the “right to be forgotten” for a specific student, ensuring all related records are deleted across all tables.

  3. Explain why data minimisation is important in database design. Give an example of collecting unnecessary data.

  4. A company stores customer data including names, addresses, purchase history, and payment details. Explain which data protection principles apply and how they should be implemented.

Worked Examples

See the examples integrated throughout the sections above.

Common Pitfalls

  1. Image file size — remember to convert from bits to bytes to MB.
  2. SQL syntax — keywords are case-insensitive but conventionally uppercase; strings in single quotes.
  3. JOIN vs WHERE — use explicit JOIN syntax rather than implicit joins.
  4. Normalisation — identify functional dependencies carefully; 3NF requires no transitive dependencies.
  5. Lossy vs lossless — know which formats use which type of compression.
  6. Forgetting WHERE in UPDATE/DELETE — without WHERE, all rows are affected.
  7. NULL in aggregates — COUNT(*) counts all rows; COUNT(column) counts non-NULL values.
  8. Confusing HAVING and WHERE — WHERE filters rows before grouping; HAVING filters groups.

Practice Questions

Ordinary Level

  1. Calculate the file size of a 1024×7681024 \times 768 image with 16-bit colour depth.
  2. Explain the difference between lossy and lossless compression with examples.
  3. Define entity, attribute, primary key, and foreign key.
  4. Give an advantage of a relational database over a flat file.

Higher Level

  1. Calculate the file size of 5 minutes of stereo audio at 48,000 Hz, 24-bit resolution.

  2. Normalise the following table to 3NF: Order(OrderNo, CustomerName, CustomerAddress, ProductName, Quantity, Price).

  3. Write an SQL query to find all students whose average mark is above 75, grouped by grade level.

  4. Explain the difference between entity integrity and referential integrity with examples.

  5. Write SQL to create tables for a library database with Books, Members, and Loans. Include primary keys and foreign keys.

  6. Write SQL to find the most popular book (most loans) in the library.

  7. Explain the three types of anomalies that can occur in an unnormalised database.

  8. Write SQL to create a view showing all members who have overdue loans (loan date more than 14 days ago).

  9. A bitmap image has resolution 1600 ×\times 900 and file size 3 MB. Calculate the colour depth.

  10. Write SQL that uses a subquery to find students who are NOT enrolled in any course.

Summary

This topic covers the essential chemistry of data and databases, including key reactions, underlying theories, and practical applications.

Key concepts include:

  • key chemical principles and theories
  • mathematical relationships in chemistry
  • practical techniques and apparatus
  • applications of chemistry in industry
  • environmental and ethical considerations

Mastery of these concepts requires both theoretical understanding and the ability to apply knowledge to unfamiliar contexts, particularly in calculation and practical questions.