July 21st, 2003, 05:46 PM
optimizing db queries...again
after learning the method that queries are executed, i was thinking of a common situation and a possible faster solution.
consider teh following query
imagine both tables have 10,000 tuples. also imagine that ther are only 3 tuples where t1.id and t2.id are equal and t2.whatever = 5
select t1.id, t2.id
from table1 t1, table2 t2
where t1.id = t2.id and t2.whatever = 5
i learned that the program will execute the query in this manner.
1. cross multiply tables t1 and t2, resulting in a table with 10,000 x 10,000 tuples
2. select only the tuples where t1.id = t2.id
3. select only columns t1.id and t2.id
i imagine that the biggest operation would be cross multiplying the two tables
but whati f we rewrote the query as such
here the program would execute the subquery first. if t2.whatever was an index this would be fast. next the outer query would be executed. every tuple in t1 would be compared with the result of the subquery. if the subquery had 10 tuples, this would only result in 10 x 10,000 operations--a significant improvement
from table1 t1
where id in (select id from table 2 t2 where t2.whatever = 5)
i am just making assumptions off the way i learned queries are executed.
pklease let me know if i am wrong; i just want to learn the best way to write queries
July 22nd, 2003, 12:06 PM
You might want to try taking a look at the EXPLAIN and EXPLAIN ANALYZE commands to be used with your SQL statement. These will show you how the query planner will execute your query with the cost of each step and (in the case of EXPLAIN ANALYZE) it should give you the time it will take to run the query.
BTW, the performance of IN is notoriously slow in postgres versions prior to 7.4 (which is due to be released shortly).
PostgreSQL, it's what's for dinner...
July 22nd, 2003, 12:20 PM
i tried searching for explain anaylize--couldnt find a coherent example.
could you give me a quick example?
July 22nd, 2003, 12:33 PM
At the psql prompt just add EXPLAIN or EXPLAIN ANALYZE before your query and that should output the query plan and estimates.
So for example:
EXPLAIN SELECT * FROM table
PostgreSQL, it's what's for dinner...
July 22nd, 2003, 12:35 PM
Bcyde's comments are on track, although allow me to expand on this a little:
This is a very good question, and brings up a clear case of logical/physical separation. You should be careful not to assume that any one method is exactly what the DBMS does "under the covers". With relational-type systems, there is a big difference between your logical perception of what the system does (querying, returning sets) and how the system implements those queries and returned sets internally.
So, when making assumptions about SQL queries for performance, your logical perceptions of the SQL/relational queries don't necessarily apply to the physical performance of the task.
Yes, logically speaking, in query #1 the DBMS will cross-multiply both tables before winnowing out the result set. But, (and here's the key) physically speaking, the DBMS is free to optimize that query in any way the DBMS programmers decide. This is the great strength of the relational data model (although most SQL systems don't fully take advantage of this).
So, my point is, you cannot make the assumption that one query is faster than the other without knowing the following:
a) Which DBMS is being used (they can all have different methods of physical optimization)
b) What the relevant indexes are (queries taking advantage of indexes are almost always faster)
c) How the DBMS handles its physical storage. (generally, this isn't something you should be concerned with when using a DBMS, but physical storage can also be done in any way desired by the DBMS developers to optimize for querying)
d) what the query optimizer is in fact doing. Most SQL systems allow you to do something like "EXPLAIN [my_query_definition]", which provides some fairly clear, concise information about how the DBMS is performing that query. Some systems such as PostgreSQL even have a query optimizer that "learns" from repeated queries and continually optimizes storage and index use. (called the Genetic Query Optimizer in PostgreSQL)
e) How you have optimized and tweaked the DBMS on your system. There are many things you can do to affect performance in most DBMSs, including memory usage, caching, file placement, partitioning, etc...
So, really, until you do this sort of homework, related to the DBMS you are using, it is best not to make the sort of assumption you are making about the advisability of one query over another. Personally, I think it's always best to start with queries that make the best logical use of the relational model. I tend to avoid subqueries unless they are necessary.
So, the best answer I can give you is: learn more about the particular DBMS you are using, and experiment with your queries until you find the ones that perform the best.
July 22nd, 2003, 12:51 PM
"called the Genetic Query Optimizer in PostgreSQL"
ahh, good 'ol genetic algorithms and AI =)
i didnt know that postgresql was so money.
July 22nd, 2003, 01:17 PM
By the way, sad.machine, you might be interested in a few websites which go into deeper theoretical explanation of the relational model of data management:
- http://www.thethirdmanifesto.com/ (C.J Date and High Darwen are two of the best minds in the database theory community
- http://www.dbdebunk.com/ (this site contains some excellent articles by C.J. Date, as well as critiques and articles by Fabian Pascal)
- More articles by Date at IntelligentEnterprise.com
- comp.databases.theory -- some fascinating discussions go on in this newsgroup
-- And finally, E.F. Codd, In his own words. (You need a free registration at BRCommunity.com to see this page). A very nice collection of quotes from interviews, some nice insight into his thinking. Notice, many of these quotes were all back in the early 80s, and predicted very well the possibilities of the future.
July 22nd, 2003, 01:27 PM
o nooo!!!!...its bringing back memories of my rdb theory class =)
thanks for taking the time to help me out, rycamor.
ill take a gander.
July 22nd, 2003, 01:45 PM
another related question. in terms of postgresql, does the # of columns in a row affect the query?
for example in my user table, i kept the user's picture as a blob. i felt this was a better way to do it thant o have a seperate table for pics with id's and to have to join the two tablse for each query.
July 22nd, 2003, 01:59 PM
While practical consideration for performance is an important issue I think it's important to take the approach of keeping everything normalized as possible, especially for projects you will end up maintaining and using as a basis to build off of.
There are many advocates of denormalizing the db structure for performance (these are the ones who talk about practice vs theory, and also sometimes confuse SQL databases as being full relational databases), however this usually comes at a price in the form of redundant data, and difficulties with updating data. Sometimes the performance gains as seen by denormalizing are only perceived as well, (I'll try and find some PG mailing list links where this is the case).
Not sure if you're denormalizing your db or not, but if so I'd recommend trying to take the more theoretical/relationally correct approach unless it seems to be dog slow. I'm pretty sure PG can handle joins pretty well . Also, over time hardware becomes cheaper, newer PG versions will get more optimized and you can speed up apps in this way, however throwing more resources at a database that is poorly designed won't necessarily make it easier to fix its inherent problems.
PostgreSQL, it's what's for dinner...
July 22nd, 2003, 02:23 PM
i forgot all the normal form definitions....so i forgot what form mine is in..but i know what you are talkign about.
i don't think im denomralizing...because this data isn't redundant.
say my user table had a category column with a catID that was a key to a category table. lets say there were 4 categories with other attributes and every user was one of these 4 categories. then i would see the benefits of relations --no redundancy, scalability, ease of updates etc.
but in the case of a pic, i am thinking there is no point since it is unique for each user...unless some weirdo steals someone else' pic and uses it as his/her own =P
July 22nd, 2003, 02:39 PM
There are quite a few choices in database layout that don't directly have nothing to do with normalization. The question sad.machine posted about pic placement is one of those.
This is probably more a question of preference, and of how you plan to use the data. For example, if you plan to be accessing the user data without the picture for a significant amount of the queries, then it might be better for performance to keep the pic info in another table. Thus at the price of a (very small) performance hit for the joined query, you get a performance advantage when you don't include it.
But anyway, internally PostgreSQL saves BLOBs as files, and only uses pointers in the actual table data, so you don't have much of a performance hit either way.
Another good rationale for having pics in a separate table is if a significant number of your user accounts will NOT have pics stored until a later date. Keeping them in a separate table allows you to avoid setting that column NULL in the record while waiting for a pic.
By the way, BLOBs in PostgreSQL don't play by all the rules of standard data. (they involve additional operations for full deletion, security, etc...) So, if you really want your image data to operate under all the constraints of your database, you might consider using type BYTEA, which allows you to store binary data directly in the column. (if you are doing this, you definitely should consider keeping it in a separate table, for performance).
July 22nd, 2003, 02:53 PM
wow. thanks guys. i didnt realize db'ing was so complicated. i see it takes a lot of experience to use a db to its fullest.
what do you mean by "operate uner all the constraints of your database"?
July 22nd, 2003, 03:42 PM
i thank you very much! it looks like i have a lot to learn. ill try this at home using strict mode, but im sure it will work.
much thanks to both of you. i learned a lot