What is surrogate key?
In database design, a surrogate key is a unique identifier for a row in a database table. Unlike natural keys, which are derived from business data, surrogate keys are artificial, system-generated values with no inherent business meaning.
What is a Surrogate Key?
A surrogate key is a primary key that has no real-world meaning or intrinsic value. It is typically a simple integer, often auto-incrementing (like IDENTITY in SQL Server or SERIAL in PostgreSQL), that the database system generates for each new record. Its sole purpose is to uniquely identify a row within a table, serving as a stable and unchanging reference point.
Characteristics of Surrogate Keys
- Artificial/Meaningless: They do not convey any information about the entity they identify.
- System-Generated: Values are typically assigned automatically by the database management system.
- Unique: Each value is guaranteed to be unique within its table.
- Immutable: Once assigned, the value never changes.
- Non-Volatile: They are not subject to business rule changes or data modifications.
- Simple: Often a single, fixed-size data type (e.g.,
INT,BIGINT).
Why Use Surrogate Keys?
- Stability: Natural keys (e.g., email, social security number) can change, leading to complex updates across related tables. Surrogate keys never change.
- Simplicity of Joins: Joining tables using single, fixed-size integer keys is often more efficient and simpler than using multi-column or variable-length natural keys.
- Decoupling: They decouple the primary key from business logic, making the database design more robust to changes in business rules.
- Guaranteed Uniqueness: Ensures every row has a unique identifier, even if natural key candidates are not always unique or might change.
- Avoids Composite Keys: Reduces the need for multi-column primary keys, which can be cumbersome.
- Data Warehousing: Facilitates data warehousing by providing stable keys for historical tracking and dimension tables.
Example of a Surrogate Key
Consider a Customers table. While a customer might have a unique email address or national ID, these could potentially change. Using a surrogate key ensures a stable identifier for each customer record.
CREATE TABLE Customers (
customer_id INT PRIMARY KEY IDENTITY(1,1), -- Surrogate Key (SQL Server syntax)
first_name VARCHAR(50),
last_name VARCHAR(50),
email VARCHAR(100) UNIQUE,
registration_date DATE
);
-- Or in PostgreSQL:
-- CREATE TABLE Customers (
-- customer_id SERIAL PRIMARY KEY,
-- first_name VARCHAR(50),
-- last_name VARCHAR(50),
-- email VARCHAR(100) UNIQUE,
-- registration_date DATE
-- );
Surrogate Keys vs. Natural Keys
| Feature | Surrogate Key | Natural Key |
|---|---|---|
| Meaning | No inherent business meaning; artificial | Derived from business data; has real-world meaning |
| Generation | System-generated (e.g., auto-increment) | Exists in the real world; user-provided or derived |
| Stability | Highly stable; never changes | Can change over time (e.g., email address, product code) |
| Uniqueness | Guaranteed unique within the table by the system | May require multiple columns for uniqueness; can conflict |
| Use Case | Preferred primary key for most transactional tables | Often used as unique constraints or as primary keys when highly stable |
Potential Drawbacks
- Extra Column: Adds an additional column to every table that uses one.
- Obscures Business Meaning: Queries often require joining with another table to see the 'meaningful' identifier.
- Data Migration: Can complicate data migration between systems if surrogate key values need to be preserved or remapped.
- No Global Uniqueness: A surrogate key for a customer in one database might not be the same for the 'same' customer in another database, requiring a 'natural' key for global identification.
Conclusion
Despite minor drawbacks, surrogate keys are widely adopted in modern database design due to the significant advantages they offer in terms of data integrity, performance, and flexibility. They provide a robust and stable foundation for relational database systems, simplifying management and protecting against the volatility of real-world data.