The whole truth behind candidate keys

A reader of my tutorial on the 3 Normal Forms recently asked:

The question is: why was customer_id not used as part of the composite primary key (order_id, customer_id , item_id) in 1NF in the first place?

It’s a good question. The questioner identified one of the normalization issues that I simplified in order to make (I hope) my tutorial more readable.

Going back to the tutorial, we said that every table row has to have a column or group of columns that uniquely identifies it, right? We called this the Primary Key. Formal database jargon refers to this as a “candidate key”. There could be more than one candidate key in a table; the one that we actually select becomes the Primary Key.

In truth, that’s only part of the story. The candidate key has to have two properties: First is Uniqueness (as we just stated); the second property is called Irreducibility (or Minimality).

Let’s consider some examples of the three columns the questioner mentioned: order_id, customer_id, item_id:

 order_id  customer_id  item_id
125 56 563
125 56 851
125 56 652
126 2 563

Does this set of columns satisfy the Uniqueness property? Yes, it does. Every combination of {order_id, customer_id, item_id} is unique in our table, and this makes sense according to the invoicing rules of our imaginary business: Every item ordered by a customer appears once and only once on an invoice, and invoices are never duplicated (if a customer makes a duplicate order, we issue a new invoice).

The trouble is, it is redundant to say this. We don’t call {order_id, customer_id, item_id} the candidate key because customer_id isn’t required to satisfy the uniqueness requirement; each row is already uniquely identified by {order_id, item_id}. This is what is meant by the Irreducibility property: If a set of columns can be reduced to a smaller set of columns that still satisfies the Uniqueness property, then it does not qualify as a candidate key.

Here’s a somewhat more formal definition of these two properties (quoted from C.J. Date’s An Introduction to Database Systems, 7th Ed., p. 258):

Let K be a set of attributes of relvar R. Then K is a candidate key for R if and only if it possesses both of the following properties:

  1. Uniqueness: No legal value of R ever contains two distinct tuples with the same value for K.
  2. Irreducibility: No proper subset of K has the uniqueness property.

If you’re unfamiliar with database set theory terminology, here relvar (“relation variable” in case you’re curious) means “table” and tuple means “row”. A proper subset is a subset that does not include the whole set itself (in other words if you have a set of things {a,b,c} then {a} is a subset, {a,b} is a subset, {a,b,c} is a subset, and so forth; a proper subset includes {a}, {a,b}, etc., but does not include {a,b,c}).

I don’t think it’s necessary to get into this kind of detail in the tutorial itself, but it’s worthwhile mentioning and I’m glad someone asked.

This entry was posted in Databases. Bookmark the permalink.

2 Responses to The whole truth behind candidate keys

  1. Sana says:

    can we use customer-id and order-id as composite primary key?

  2. EagerMind says:

    Sana, in short, no you may not. Think about if the same customer (e.g. ’56’) ordered the same item (e.g. ‘563’) on a later invoice (i.e. a subsequent unique order). A simple sfw (Select-From-Where) query where ‘customer-id’ and ‘order-id’ were being used as the primary key would yield both invoices where customer ’56’ had ordered item ‘563’. Unfortunately, this primary key would not uniquely identify each tuple (row).

Leave a Reply

Your email address will not be published. Required fields are marked *