Abstracting databases


I read one of Jeff Atwood’s latest posts, Maybe Normalization Isn’t Normal. It takes a bit to get to his point, as the post rambles, but I agree with its gist: "normalization does not scale". I take that with a grain of salt, however, as it requires a lot of data before you really need to worry about it. So, today, I discuss the when and why of normalization. But more importantly, why we abstract.

Normalization

First, I believe you should design everything with a bit of sensibility. Normalize to the nth degree? Yes, it is a bit of a waste sometimes, but it helps you identify your objects. Note that you do not have to actually write out the tables as you identify objects. Also note that this is not going to be your production database. Let’s start with a small example. We want to make a Person table, as a person is an object we will deal with.
 
Person
-----
PersonId
FirstName
MiddleName    NULL
LastName
 
Now, we need the user to log in. The question to ask here is whether you will ever have people in your system who do not log in. If yes, then the user table is separate from the person table.
 
User
-----
UserId
PersonId
LoginName
Password
 
In the case where everyone either logs in or has the potential to log in, you might wish to stick all of this into a single table. NOTE: User is a keyword in SQL Server and some other databases, so WebsiteUser, or something similar, may be better. What about phone numbers or addresses? Two questions to ask:
  1. Can they be shared by multiple people?
  2. Can a user have more than one phone number of a particular type?

If you answer yes to either, you will end up duplicating data if you stick it in the person table. That is generally not good, as duplicated data easily gets out of sync.

The idea here is to identify the object and then figure out the context.
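To make this concrete, here is a rough sketch of the normalized design in T-SQL. Treat it as illustrative only: I am assuming SQL Server, and the names and column sizes are mine, not gospel.

-- Sketch only: a normalized person/user design.
CREATE TABLE Person (
    PersonId   int IDENTITY(1,1) PRIMARY KEY,
    FirstName  nvarchar(50) NOT NULL,
    MiddleName nvarchar(50) NULL,
    LastName   nvarchar(50) NOT NULL
);

-- WebsiteUser sidesteps the reserved word User.
CREATE TABLE WebsiteUser (
    UserId    int IDENTITY(1,1) PRIMARY KEY,
    PersonId  int NOT NULL REFERENCES Person (PersonId),
    LoginName nvarchar(50) NOT NULL UNIQUE,
    Password  varbinary(64) NOT NULL -- store a hash, never plain text
);

-- Numbers live once, in their own table...
CREATE TABLE PhoneNumber (
    PhoneNumberId int IDENTITY(1,1) PRIMARY KEY,
    Number        varchar(20) NOT NULL
);

-- ...and a link table connects people to numbers, so a number can be
-- shared and a person can have several of the same type.
CREATE TABLE PersonPhoneNumber (
    PersonId      int NOT NULL REFERENCES Person (PersonId),
    PhoneNumberId int NOT NULL REFERENCES PhoneNumber (PhoneNumberId),
    PhoneType     varchar(10) NOT NULL, -- e.g. 'Home', 'Work'
    PRIMARY KEY (PersonId, PhoneNumberId)
);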

Abstraction

Now suppose we opted for radical normalization (phone numbers, addresses, etc. in their own database tables), à la Jeff’s example.

[Image: social network database example, normalized]

(c) Jeff Atwood, CodingHorror.com

As we grow, we find this does not scale. Per Jeff, here is what happens:

Normalization certainly delivers in terms of limiting duplication. Every entity is represented once, and only once — so there’s almost no risk of inconsistencies in the data. But this design also requires a whopping six joins to retrieve a single user’s information.

select * from Users u
inner join UserPhoneNumbers upn
on u.user_id = upn.user_id
inner join UserScreenNames usn
on u.user_id = usn.user_id
inner join UserAffiliations ua
on u.user_id = ua.user_id
inner join Affiliations a
on a.affiliation_id = ua.affiliation_id
inner join UserWorkHistory uwh
on u.user_id = uwh.user_id
inner join Affiliations wa
on uwh.affiliation_id = wa.affiliation_id

Those six joins aren’t doing anything to help your system’s performance, either. Full-blown normalization isn’t merely difficult to understand and hard to work with — it can also be quite slow.

So, the solution is to not normalize? Perhaps, but that is not really the point. If you place the query in a stored procedure, or at least separate out the query object, you can recode a small bit and refactor the database to your heart’s content.

This gets to my point: it really should not matter what your schema is. I mean this at a high level, of course, as changing schemas is a royal pain in the rear end. You should look for the schema that gets you to scale, but separating the schema from the code that consumes it means a lot.

In addition, you do not have to grab all of the data with six joins, as the end user is not editing the entire user record and all of its history at the same time. At least not in a SANE application. I put SANE in caps because far TOO MANY applications are INSANE.
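For example, a profile-edit screen might need only the core user row and phone numbers. Here is a sketch against Jeff’s schema (I am guessing at his column names), which needs one join, not six:

-- Sketch: fetch only what one screen needs (column names assumed).
CREATE PROCEDURE GetUserWithPhones
    @user_id int
AS
select u.*, upn.phone_number
from Users u
left join UserPhoneNumbers upn
    on u.user_id = upn.user_id
where u.user_id = @user_id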

How do you abstract? Well, if you go with code generation, you are going to have to regenerate some objects. If you are changing schemas, I hope you have set up your ORM so it maps to business objects rather than leaving it on autogen. But you can also abstract via stored procedures.

Now, let me take a moment to explain the stored procedure. It has gotten a lot of bad press lately, largely from ORM tool vendors, who often embed dynamic code in their objects. While it is true, as they state, that sprocs can perform much worse than generated code, this is not the normal case for well-designed stored procedures. And you get two benefits: a declarative security model (you can easily state who uses what without writing code) and abstraction of the database schema.
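To illustrate the declarative side, you grant execute rights on the procedure and never expose the tables at all. A sketch (the role name, and the GetUserWithPhones proc from above, are mine):

-- Sketch: the application role can run the proc but cannot touch tables.
CREATE ROLE WebAppRole;
GRANT EXECUTE ON dbo.GetUserWithPhones TO WebAppRole;
DENY SELECT, INSERT, UPDATE, DELETE ON dbo.Users TO WebAppRole;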

I am not suggesting that you have to go with sprocs to abstract, but they are an easy way to get a newbie crew into abstraction. I am suggesting, however, that you require some form of abstraction to separate the schema from the business objects. If editing the DAL is your method, then go for it. If you autogen and have a means of mapping (NHibernate, for example), then go for it. In fact, there is nothing to say you can’t use multiple solutions to further abstract.

Back to Jeff’s article. He then states that a single table is more efficient:
[Image: denormalized schema]
(c) Jeff Atwood

This works — queries are now blindingly simple (select * from users), and probably blindingly fast, as well. But you’ll have a bunch of gaping blank holes in your data, along with a slew of awkwardly named field arrays. And all those pesky data integrity problems the database used to enforce for you? Those are all your job now. Congratulations on your demotion!

Now here is the big question: should this really matter? I am not asking whether performance matters; it does. I am not asking whether schema choices should be made sensibly; they should. What I am asking is why this is such a big deal.

If you properly separate concerns and abstract your database schema (by whatever means you are comfortable with), you can change out the bits rather quickly. With a stored procedure, I start with this:

CREATE PROCEDURE Proc1
AS
select * from Users u
inner join UserPhoneNumbers upn
on u.user_id = upn.user_id
inner join UserScreenNames usn
on u.user_id = usn.user_id
inner join UserAffiliations ua
on u.user_id = ua.user_id
inner join Affiliations a
on a.affiliation_id = ua.affiliation_id
inner join UserWorkHistory uwh
on u.user_id = uwh.user_id
inner join Affiliations wa
on uwh.affiliation_id = wa.affiliation_id

And when perf starts to suck, I simply run the following:

ALTER PROCEDURE Proc1
AS
select * from users

This works in theory more than in practice, of course, since the first query returns a very different shape (many joined rows per user) than the second, but you get the idea.
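To make the swap truly drop-in, the procedure should return an explicit column list rather than select *, so the result shape (the contract) stays identical no matter which schema sits underneath. A sketch, with column names guessed from Jeff’s denormalized table:

-- Sketch: same columns out, regardless of the schema behind the proc.
ALTER PROCEDURE Proc1
AS
select user_id, name, phone_number_1, phone_number_2
from Users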

Personal Feelings

I feel it is "sh*t all stupid" (from Vacation) to denormalize multiple rows of history into a user table. I realize this was not Jeff’s point. There are some concepts you simply DO NOT blend. I also believe that moving addresses into the user table can make sense, even if you end up with some duplicated information. I say this with a caveat, of course: if changing an address requires the same change in other places, you should keep it in a separate table. Trying to be supercoder to catch all dupes is also stupid.

By the way, if you like to play supercoder and have not adopted TDD, try it. Proper unit tests will convince you that you are not a coder god, no matter how good you are. I don’t know how many times I thought I had quickly coded something, only to find I missed one line or fudged another. Tests illuminate that skill may reduce mistakes, but it never eliminates them. Stepping off the TDD soapbox now.

My point here is that normalization or denormalization need not ruin your day if you follow these practices:

  1. Determine business objects first
  2. Use the objects to map out tables, but allow the tables to flow so they make sense
  3. Create contracts between layers. This can simply be creating interfaces on your objects, so new object implementations do not change calls
    As an unwritten rule, this means DO NOT BREAK CONTRACTS
  4. Abstract your database schema (your choice here, cowboy; see the sketch after this list)
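For item 4, a view can serve as the seam just as well as a sproc: callers bind to the view and never see the base tables, so the tables underneath can change freely. A sketch (names assumed again):

-- Sketch: callers query the view; re-point it when the schema changes.
CREATE VIEW UserProfile
AS
select u.user_id, u.name, upn.phone_number
from Users u
left join UserPhoneNumbers upn
    on u.user_id = upn.user_id;

When you later denormalize, you ALTER VIEW to select from the new table, and nothing upstream has to change.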

Now you find that you are the next Google and perf is suffering. Cool. Denormalize for speed and rewrite only your abstraction code. Much quicker.

Ah, but some of you probably realize that abstraction slows things down a bit too? Great. If you are really the next Google, you have a success problem on your hands, which means you have beaucoup d’argent (money) to work with to solve it. Hope you have lots of stock options.

Peace and Grace,
Greg


3 Responses to Abstracting databases

  1. Kelly says:

    Very good reading…..thank you.

  2. Eric says:

    Conventional online transaction processing applications, where data entry originates and normalization is essential, do not typically accumulate large amounts of historical data or join all related tables into one query. The scenario where normalization ceases to scale would describe a data warehousing application.

  3. Gregory says:

    I agree with you, no name. I also realize we can over-normalize at times. If you find that you are shoving things into multiple tables simply because in the real world they are different objects, you may be taking things a bit too far.
     
    I will disagree with the oversimplification of data warehouses, at least in part. There are transactional applications that do end up with large amounts of data that will not be easily "fixed" through data warehousing scenarios.
     
    Peace and Grace,
    Greg
