June 28th, 2002, 08:16 PM
Normalization: Always Better?
I think my title already explain what I want to ask.
All I want to know is: Is there any possibility that an unnormalized database is faster, more efficient than a normalized one?
Or: Is it a certain that a normalized database always better than an unnormalized one?
June 28th, 2002, 08:22 PM
Normalized databases can be slower do to possible need to do extensive joins.
June 28th, 2002, 09:16 PM
That topic has generated lots of discussion at comp.databases.theory, and www.dbdebunk.com, as well as many other locations for serious database talk.
Those who advise "denormalization for performance" generally approach it two ways: 1) denormalize your tables that cause the most performance hit, and just make sure (or hope) that your application code maintains data integrity, or 2) use denormalization in databases that are mainly for "data warehousing" or in other situations where the databases are read-only, in which case you don't need to worry about integrity.
Those who argue against the "denormalization for performance" concept point out that if you are using approach #1 above, then your database might perform better, but you will lose all that advantage by handling constraints in your application code. (And, you stand more chance of missing certain key constraints and corrupting your database). As far as situation #2, for data warehousing, or "OLAP", the question always arises "what happens when you want to write new data to the system?". If you aren't careful, you could be sorry. Also, your denormalized database causes many more design problems when you want to restructure it, or add new queries.
At www.dbdebunk.com, Fabian Pascal and others argue that the only reason normalized, relational databases have a performance hit is because of the vendors' bad/incomplete implementation of the relational data theory (meaning SQL). Theoretically, the relating of tables in joins is only a logical operation, and the database engine itself should be able to do whatever it needs in physical storage to optimize those queries. (for a very rudimentary example of this concept, Oracle has "materialized views", in which joins are propagated to a separate table, which can be queried without joins. Oracle's back end maintains the integrity of this data) The basic idea is that normalization is a logical concept, which the database presents to the user. The user should be completely shielded from physical storage considerations, and the database engine itself can essentially operate in any way needed to optimize the back end.
Unfortunately, we are left for the moment with DB products that don't really provide much in the way of this sort of optimization. So, the question we have to ask is "what is the trade-off". I personally agree that denormalization for performance brings many hidden costs, such as extra application code, bugs, etc..., so they should be avoided whenever possible. Also, normalization has many faces. Often you can get better performance simply by rethinking your design as a whole, without denormalizing. In situation #2, I suppose I can see a little more justification for denormalization. But for a live system, with constant updates, I would never trust a denormalized system. (been there, done that)
The only time I've ever noticed a performance increase in de-normalizing was for aggregate-type queries. It is hard to, on the fly, take a 4+ million row and SUM( * ) or COUNT or AVG etc. WHERE some_composite_key = XYZ AND other_things
So we store the de-normalized version in a table with the counts and update them when a new row is inserted. The UPDATE cost is far less than the SELECT since we're updating a single row in a small table (< 40,000 rows).
The other problem situation I've noticed is on JOIN type queries
For example, this forum software stores forums and moderators in separate tables. If you were to JOIN the moderators and forum to each other in order to grab the mods of all the forums and the forum info in one query you will have a LOT of I/O. Why? Because the join will pull back lots of duplicate forum information, e.g.:
So if you have 10 moderators you're going to have all the forum info duplicated, like this:
FROM forum f
INNER JOIN forum_moderator fm ON fm.forumid = f.forumid
INNER JOIN user u ON u.userid = fm.userid
Which leads to a LOT of I/O back to your client application.
forumid, forum_title, etc. ... username, userid
1, 'Forum title', ... 'MattR', 1
1, 'Forum title', ... 'Joebob', 2
1, 'Forum title', ... 'Hippy', 23
1, 'Forum title', ... 'Admin', 234
1, 'Forum title', ... 'UserA', 3461
1, 'Forum title', ... 'UserB', 331
1, 'Forum title', ... 'Happy Gilmore', 97121
It's not necessarily a good candidate for de-normilization but if you say stored the forum moderators in a single row (horribly denormalized by forum_moderator( username1, userid1, username2, userid2, ... )) you could bring back the mods in a single result set.
If you are not registering transactions but analyzing a huge amount data, at various aggregation levels, then denormalization is better. Take a look at http://publib-b.boulder.ibm.com/Redb...4?OpenDocument (Data Modeling Techniques for Data Warehousing) it is very interesting
Actually, I have developed a web translator which uses a database for its dictionary and other assisting data. I didn't normalized that dictionary, because I thought that would be faster. Also, I won't do many changes on the data on that dictionary.
Lately, I began to think, "Is there any faster or more efficient way to develop my dictionary?"
That's the background of my question ...
Well, one of the real benefits of normalization is that you have much more complete ways to manipulate your data, while producing less anomalies. Even when if normalization slows your server down, it can speed up development of your application, because you can push more logic onto the database. I almost always consider that a fair trade-off. If you really take advantage of referential integrity, views, stored procedures, etc... you can do some incredible manipulation of your data.
May I ask what database management system you are using for this project, and what programming language?
July 11th, 2002, 07:46 AM
DBMS: MySQL, programming language: PHP
July 11th, 2002, 08:14 AM
I want to add some details, but I feel difficult to explain this, so if there's any confusing part, just ask me, and I'll try to explain better.
My application is accessing the dictionary twice, first, when it needs to get the word type (noun,verb,etc) for each word, second, when it needs to get the meaning of each word. Each access has different query. So, I don't need to have any type of connection (between word_type_table and word_meaning_table) to access the data. (That's because I access them seperately.)
Pfew ... I have done my best to explain ...
August 11th, 2002, 07:00 PM
I noticed that there have been no other posts here, but I was wondering if you found any of the answers you were looking for.
If you are creating a dictionary, I would think that normalization, especially the M:M relationship (many-to-many) would be very useful, not just for data integrity but for functionality of your system. (for example, the M:M relationship would allow you to associate any word with any other word, as many times as you want, to provide a nice "related words" list for each definition)
But it's really impossible for me to provide any more help without seeing at least some representation of your database structure. Also, I must mention that there are a lot of considerations to good database design which are not available in MySQL, and normalization is only part of the question: MySQL still does not have triggers, views, stored procedures, and subqueries, among many others. All of these allow you to have a lot more control over your data. Views are especially useful because they allow you to access data as if it is not normalized, even though it is normalized in the base tables.
Recommendation: a DBMS like PostgreSQL offers arbitrary data constraints, which allow you to specify exactly what is allowed to be entered in any one field, and you can have updates to one table affect other areas with triggers, query rewrite RULE's, etc... A couple of other good things to note about PostgreSQL:
1. there is a full-text indexing module that is available as an add-on, which could greatly speed up queries for a dictionary-type system
2. PostgreSQL allows queries to use Regular Expressions, which is very helpful in searching through text information.
3. There is a great library of extra applications, routines, tutorials etc... available at http://techdocs.postgresql.org/
August 12th, 2002, 04:20 AM
We have had one case where we had to de-normalize a normalized database. The information stored within the database was too 'generic' and neede too much 'inner joins' on the same table. We built triggers on the tables to fill other tables which we could index properly.
Due to the overnormalized database, there wasn't an index anyone could come up with to improve performance.
I always propose to build tables for speed and understanding. If you 'could' store two types of data in 1 table for normalisation's sake, I always advise against it. It's the race between hands-on developpers and theoretical developpers.
August 13th, 2002, 07:21 PM
From all postings and resources I've got from this forum, I finally build my database without being normalized before. The main reason for this is because the program has to make decision between the two queries I've explained in my last post. So, in ways I could think of, I didn't find one to make the two queries to only one query (because of the existence of decision making between two queries).
But, I still think that my program can be improved, only I still don't know how.
Thanks for all helps here ...
August 13th, 2002, 10:46 PM
(sigh...) I am so tired of hearing this kind of thing. . I am a hands-on developer. I don't have a masters degree or PH.D. My applications are not research projects. But I have seen first-hand the results of developing with a non-normalized database, and I have almost always been sorry later. From your quote above, I actually doubt you understand what normalization really means. And the description of your problem with "overnormalization" sounds more like bad database design in general. Normalization is not about storing more types of data in one table. That is a design decision. It's about eliminating the potential for your system to contain ambiguous data, and storing your data in a logical, intelligently accessible method.
Ambiguous data may be easy to handle when you first write an application, but what about the fourth or fifth rewrite, by different developers? What happens when you want another application to access the same data? How do you resolve the inconsistencies? I'm not saying normalization is the best answer; I am saying normalization is the only thing that has provided an answer at all. Everything else ends up being a hack, sooner or later. You may not see this when designing smaller software, but when you want to extend this software, and tie in with larger systems, then you start seeing the problems. (again, I speak from experience)
Yes, there are times when I do not normalize everything, because of some specific performance issue, but I only do that when necessary, and only because of deficiencies in SQL DBMS's. Other than that, views provide all the "denormalized-looking" interfaces I need for other developers. (In postgreSQL, it is even possible to make view updateable)
Is proven theory something to be ignored, just because it is too hard to understand, or doesn't make you feel good? Shouldn't software should be approached like any other type of engineering? Would you like it if the engineer designing a bridge decided that the theory of physics wasn't "hands-on" enough? Would you like it if the pilot of your airplane decided that the theory of aerodynamics didn't matter for this particular landing, because he wanted to optimize for performance?
jagerkiss -- I am not attacking you here. You do what you have to do to complete your project. But, I do feel that you don't completely understand normalization or relational data in general. I just hope I have created a bit of curiosity in you, because the reward of understanding it fully is that you will think more clearly about every application, and possibly save yourself a lot of time and difficulty in the future.
August 14th, 2002, 04:36 AM
I am very thankful for every info I've got here. Frankly, I haven't found an example of really complicated case of normalization which I could learn from. All I have found before only simple case, such as address book, or a simple warehouse data.
I don't feel being attacked, so it's all right. It's maybe true that I do not fully understand about normalization. If I have time in future, I really like to learn more about this.