The Shed is going Social! Join us on FaceBook and Twitter and chime in on the conversation.
|
 |
|
Dev Shed Forums
> Databases
> PostgreSQL Help
|
PG query runs 15 times longer than MS SQL Server
Discuss PG query runs 15 times longer than MS SQL Server in the PostgreSQL Help forum on Dev Shed. PG query runs 15 times longer than MS SQL Server PostgreSQL Help forum discussing administration, SQL syntax, or other PostgreSQL-related topics. PostgreSQL provides enterprise level database functionality at open source prices.
|
|
 |
|
|
|
|

Dev Shed Forums Sponsor:
|
|
|

December 17th, 2012, 11:44 AM
|
|
Registered User
|
|
Join Date: Dec 2012
Posts: 11
Time spent in forums: 4 h 48 m 30 sec
Reputation Power: 0
|
|
|
PG query runs 15 times longer than MS SQL Server
Hi. We are evaluating PG and were very impressed by its performance when inserting 30 million rows as you can see in my previous thread. It required slightly more work on our part but was well worth it.
When we went to see how PG performs in a pretty simple query, we were disappointed. Our target was 1.5 minutes but it ran for 23 minutes. We compared PG to SQL Server.
Both SQL Server and PG were subjected first to their ways of "clearing cache". MS via 4 special cache commands. PG via start and stop to the service.
We'd like to give PG a fair shot. Here is a little background.
Our fact table has 30 million rows in it dating back to Jan 2005. Of its many foreign keys (there are no constraints) , one particular one can be used to join to a product dimension in a star like fashion.
The fact table has 3 indexes. One is a serial PK. Two are important dates to our business activity. You can see one of these dates in the query (modified slightly for privacy reasons) below. The purpose of the query is to compare 2012 months to 2011 months across 3 amounts. My apologies for not knowing which flavor of code tags to wrap around my query. I chose Language Syntax Highlighter.
Please let me know what particular needs PG has for a query like this perform competitively. Intermediate tables perhaps?
Code:
select coalesce(x."Product",y."Product") AS "Product",
2012 AS "YR",
coalesce(x."MO",y."MO") AS "MO",
x."Amt1",
y."Amt1" AS "PriorAmt1",
case when x."Amt1" is null
then null
else (x."Amt1" - case when y."Amt1" is null
then '0.0000'::money
else y."Amt1" end
) / x."Amt1" end AS "Amt1Growth",
x."Amt2" AS "Amt2",
y."Amt2" AS "PriorAmt2",
case when x."Amt2" is null
then null
else (x."Amt2" - case when y."Amt2" is null
then '0.0000'::money
else y."Amt2" end
) / x."Amt2" end AS "Amt2Growth",
x."Amt3" AS "Amt3",
y."Amt3" AS "PriorAmt3",
case when x."Amt3" is null
then null
else (x."Amt3" - case when y."Amt3" is null
then '0.0000'::money
else y."Amt3" end
) / x."Amt3" end AS "Amt3Growth"
from
(
SELECT
"Product"
,sum("Amt3") AS "Amt3"
,Sum("Amt1") AS "Amt1"
,Sum("Amt2") AS "Amt2"
,date_part('month',"FK_DateCalendarYear") AS "MO"
,date_part('year',"FK_DateCalendarYear") AS "YR"
FROM "dbo"."Fact" a
left join "dbo"."DimProduct" b
on (a."FK_DimProduct" = b."PK_DimProduct")
where "Product_Company" = 'XX' and
date_part('year',"FK_DateCalendarYear") = 2012
group by "Product",
date_part('month',"FK_DateCalendarYear"),
date_part('year',"FK_DateCalendarYear")
) AS x
FULL OUTER JOIN
(
SELECT
"Product"
,sum("Amt3") AS "Amt3"
,Sum("Amt1") AS "Amt1"
,Sum("Amt2") AS "Amt2"
,date_part('month',"FK_DateCalendarYear") AS "MO"
,date_part('year',"FK_DateCalendarYear") AS "YR"
FROM "dbo"."Fact" a
left join "dbo"."DimProduct" b
on (a."FK_DimProduct" = b."PK_DimProduct")
where "Product_Company" = 'XX' and
date_part('year',"FK_DateCalendarYear") = 2011
group by "Product",
date_part('month',"FK_DateCalendarYear"),
date_part('year',"FK_DateCalendarYear")
) AS y
on (x."Product" = y."Product" and
x."MO" = y."MO")
order by 1,2,3
|

December 17th, 2012, 12:07 PM
|
|
Contributing User
|
|
Join Date: Oct 2003
Location: Germany
|
|
|
Please format your SQL statement properly and read this: https://wiki.postgresql.org/wiki/Slow_Query_Questions
__________________
I will not read nor answer questions where the SQL code is messy and not formatted properly using [code] tags.
http://forums.devshed.com/misc.php?do=bbcode#code
Tips on how to ask better questions:
http://tkyte.blogspot.de/2005/06/how-to-ask-questions.html
http://wiki.postgresql.org/wiki/SlowQueryQuestions
http://catb.org/esr/faqs/smart-questions.html
|

December 17th, 2012, 12:20 PM
|
 |
Contributing User
|
|
Join Date: May 2004
Location: Superior, CO, USA
|
|
|
It's a little difficult to determine from this query what the issue is. But there are some things I would do.
The first is, when is the last time you ran a reindex? I run a "reindexdb -a" (this is a command line command) once a day to make sure the indexes are up to date.
Next, how much memory are you giving the database server? I have found that increasing shared_buffers and work_mem have improved the performance of my database server though, of course, your mileage may vary.
Lastly, I use the combination of log_min_duration_statement (in postgresql.conf) and "Explain analyze" through pgAdminIII to help me understand what the optimizer is doing. You can turn on explain on the command line but I like the graphical output of pgAdmin. Obviously if you have any table scans (which, given the amount of time that this is taking it sounds like you might) you'd want to examine the query to determine why they are occurring.
I've used DB2, Oracle, Sybase, Informix (though I hate admitting that one out loud), and MS/SQL in various jobs over the years before being introduced to PostgreSQL. What I've seen is that, with the proper tuning, PostgreSQL can easily keep up with a commercial DB. I will say that the default configuration as shipped with PostgreSQL assumes a rather low performance machine so changing of the default memory parameters will be needed.
__________________
Need Java help? Want to help people who do? Sit down with a cup of Java at the hotjoe forums.
|

December 17th, 2012, 01:19 PM
|
|
Registered User
|
|
Join Date: Dec 2012
Posts: 11
Time spent in forums: 4 h 48 m 30 sec
Reputation Power: 0
|
|
Thx folks. We're still trying to figure out how to format the query so you can more easily read it. That particular thing isnt clearly explained. We'll get there. I'll change the tags there to code tags and add a little indentation.
But basically this is two subselects participating in a full outer join followed by a sort. I suspect PG isnt using the date index like we'd hoped but we'll see.
In the mean time, some of the items we can answer immediately per the guidelines are:
-there are no large objects, just integers, dates, dollar amts and booleans
-there can be nulls in dimproduct_scd_original_id, logid, adjustedPKinteger, Amt7, Amt8, Amt9 , Amt10
-there has never been an update or delete run thus far, just initial insert, so it only grew once (from nothing)
-there are 3 indexes as mentioned before but one is the pk, 2 are dates
-there are no triggers or called functions except for what you see in the query
-we run 9.2
-we are on an Intel(R) Xeon(R) CPU E5620 @2.4 GHz 2.4 Ghz (two processors), running WS2008R2 64 bit
-this is the first time the query was ever run
-we have 36 gig of ram, hundreds of gigs of san
-the table is unlogged
-here is what looks like the code version of the explain graphic...there are slight mods to protect our privacy
Code:
"Sort (cost=2969616.15..2969616.29 rows=54 width=72)"
" Sort Key: (COALESCE(b."Product", y."Product")), (COALESCE((date_part('month'::text, (a."FK_DateCalendarYear")::timestamp without time zone)), y."MO"))"
" -> Hash Full Join (cost=2969611.76..2969614.60 rows=54 width=72)"
" Hash Cond: ((b."Product" = y."Product") AND ((date_part('month'::text, (a."FK_DateCalendarYear")::timestamp without time zone)) = y."MO"))"
" -> HashAggregate (cost=1484804.66..1484805.74 rows=54 width=32)"
" -> Hash Join (cost=31.50..1484674.49 rows=8678 width=32)"
" Hash Cond: (a."FK_DimProduct" = b."PK_DimProduct")"
" -> Seq Scan on "Fact" a (cost=0.00..1483910.26 rows=149112 width=32)"
" Filter: (date_part('year'::text, ("FK_DateCalendarYear")::timestamp without time zone) = 2012::double precision)"
" -> Hash (cost=30.81..30.81 rows=55 width=8)"
" -> Seq Scan on "DimProduct" b (cost=0.00..30.81 rows=55 width=8)"
" Filter: ("Product_Company" = 'XX'::bpchar)"
" -> Hash (cost=1484806.28..1484806.28 rows=54 width=36)"
" -> Subquery Scan on y (cost=1484804.66..1484806.28 rows=54 width=36)"
" -> HashAggregate (cost=1484804.66..1484805.74 rows=54 width=32)"
" -> Hash Join (cost=31.50..1484674.49 rows=8678 width=32)"
" Hash Cond: (a."FK_DimProduct" = b."PK_DimProduct")"
" -> Seq Scan on "Fact" a (cost=0.00..1483910.26 rows=149112 width=32)"
" Filter: (date_part('year'::text, ("FK_DateCalendarYear")::timestamp without time zone) = 2011::double precision)"
" -> Hash (cost=30.81..30.81 rows=55 width=8)"
" -> Seq Scan on "DimProduct" b (cost=0.00..30.81 rows=55 width=8)"
" Filter: ("Product_Company" = 'XX'::bpchar)"
-here are the creates with slight mods to protect our privacy
Code:
CREATE TABLE dbo."Fact"
(
"PK_Fact" integer NOT NULL DEFAULT nextval('dbo."Fact_PK_Fact_seq"'::regclass),
"FK_Dim01" integer,
"FK_DimProduct" integer,
"FK_Dim01" integer,
"FK_EffDate" date,
"FK_ExpiryDate" date,
"FK_TransactionDate" date,
"FK_Dim01" integer,
"FK_Dim01" integer,
"FK_Dim01" integer,
"FK_Dim01" integer,
"FK_Dim01" integer,
"FK_Dim01" integer,
"FK_Dim01" integer,
"FK_Dim01" integer,
"FK_Dim01" integer,
"FK_Dim01" integer,
"FK_Dim01" integer,
"FK_Dim01" integer,
"FK_Dim01" integer,
"FK_Dim01" integer,
"FK_EffDate1" date,
"FK_ExpiryDate1" date,
"Amt3" money,
"Amt1" money,
"Amt2" money,
"LogId" integer,
"FK_Dim01" integer,
"FK_Dim01" integer,
"FK_Dim01" integer,
"FK_Dim01" integer,
"FK_Dim01" integer,
"Flag1" character(1),
"Days" integer,
"Amt4" money,
"Amt5" money,
"Adj1" boolean,
"adjustedPK" integer,
"fk_DateAnalysis" date,
"FK_Date1" date,
"FK_DateCalendarYear" date,
"FK_Date2" date,
"Amt6" money,
"Amt7" money,
"Amt8" money,
"Amt9" money,
"Amt10" money,
CONSTRAINT "PK_Unit" PRIMARY KEY ("PK_Unit")
)
WITH (
OIDS=FALSE
);
ALTER TABLE dbo."Fact"
OWNER TO postgres;
CREATE INDEX "IX_Fact_FK_DateCalendarYear"
ON dbo."Fact"
USING btree
("FK_DateCalendarYear");
CREATE INDEX "IX_Fact_FK_TransactionDate"
ON dbo."Fact"
USING btree
("FK_TransactionDate");
CREATE TABLE dbo."DimProduct"
(
"PK_DimProduct" integer NOT NULL DEFAULT nextval('dbo."DimProduct_PK_DimProduct_seq"'::regclass),
"BizKey01" character varying(50),
"Product_Company" character(2) NOT NULL,
"BizKey01" character(3) NOT NULL,
"BizKey01" character(3),
"BizKey01" character(2) NOT NULL,
"BizKey01" character varying(50) NOT NULL,
"BizKey01" character varying(50) NOT NULL,
"BizKey01" character varying(50) NOT NULL,
"BizKey01" character varying(50) NOT NULL,
"BizKey01" character varying(50),
"DimProduct_SCD_Original_ID" integer,
"DimProduct_SCD_Status" character varying(25),
"DimProduct_SCD_Start_Date" date,
"DimProduct_SCD_End_Date" date,
CONSTRAINT "PK_DimProduct" PRIMARY KEY ("PK_DimProduct")
)
WITH (
OIDS=FALSE
);
ALTER TABLE dbo."DimProduct"
OWNER TO postgres;
-some of the mem settings in our config:
shared_buffers 8GB
work_mem 64MG
maintenance_workmem 1GB
effective_cache_size 18GB
|

December 17th, 2012, 01:28 PM
|
|
Contributing User
|
|
Join Date: Oct 2003
Location: Germany
|
|
|
Thanks for the update.
23 minutes for the (small) number of rows shown in the plan seems way too much.
An explain (analyze on, buffers on, verbose on) would have been interesting to show the difference between the estimates and the actuals and to also show the steps that took the longest.
You can upload the execution plan to http://explain.depesz.com which makes things a little easier to read. That site also has an option to anonymize the execution plan.
Last edited by shammat : December 17th, 2012 at 01:33 PM.
|

December 17th, 2012, 01:47 PM
|
|
Registered User
|
|
Join Date: Dec 2012
Posts: 11
Time spent in forums: 4 h 48 m 30 sec
Reputation Power: 0
|
|
|
thx. We'll do that but I should mention that 1.4 million records are involved in the 1st subquery for 2012 and company XX and 1.8 million records in the 2nd for 2011 and company XX. That's before they are summarized by product, month and year.
Without the company qualifier, its 2.8 million and 3.4 million.
|

December 17th, 2012, 01:53 PM
|
|
Contributing User
|
|
Join Date: Oct 2003
Location: Germany
|
|
Quote: | Originally Posted by db042188 thx. We'll do that but I should mention that 1.4 million records are involved in the 1st subquery for 2012 and company XX and 1.8 million records in the 2nd for 2011 and company XX. That's before they are summarized by product, month and year. | Well, the execution plan does not reflect that.
This line:
Seq Scan on "Fact" a (cost=0.00..1483910.26 rows=149112 width=32)"
means that PostgreSQL assumes that there are 149112 rows in the "Fact" table.
And this line:
Seq Scan on "DimProduct" b (cost=0.00..30.81 rows=55 width=8)"
means that PostgreSQL assumes there are 55 rows in "DimProduct".
|

December 17th, 2012, 01:59 PM
|
|
Registered User
|
|
Join Date: Dec 2012
Posts: 11
Time spent in forums: 4 h 48 m 30 sec
Reputation Power: 0
|
|
|
terrible assumption on the fact table.
Good assumption on dimProduct for company XX.
|

December 17th, 2012, 02:12 PM
|
|
Contributing User
|
|
Join Date: Oct 2003
Location: Germany
|
|
Quote: | Originally Posted by db042188 terrible assumption on the fact table. | That indicates that the statistics are out of date and that you should run a analyze on that table.
This is something that would be visible when running explain (analyze on, buffers on, verbose on) instead of just explain
|

December 17th, 2012, 03:27 PM
|
|
Registered User
|
|
Join Date: Dec 2012
Posts: 11
Time spent in forums: 4 h 48 m 30 sec
Reputation Power: 0
|
|
I may have done some unnecessary steps but here is what I did:
1) with my db hilighted in pgAdminIII, I issued a command as follows:
Code:
SET track_io_timing=On;
The feedback I got was "Query returned successfully with no results in 10ms"
2) Ran my query with this in front:
Code:
EXPLAIN (ANALYZE on, VERBOSE on, COSTS on, BUFFERS on, TIMING on )
3) on explain.depesz.com, posted the results under item "Yrm" anonymized. I see a lot of red and some very large numbers associated with one of my tables but will wait to hear from the community on what that means. In the mean time I'll try to figure out how the ANALYSE command gets issued from pgAdminIII so that my stats can get updated.
|

December 17th, 2012, 04:31 PM
|
|
Contributing User
|
|
Join Date: Oct 2003
Location: Germany
|
|
Quote: | Originally Posted by db042188 3) on explain.depesz.com, posted the results under item "Yrm" anonymized. | Step #5 seems to be your problem:
Code:
Seq Scan on charlie_tango quebec_three (cost=0.000..1483910.260 rows=149112 width=32) (actual time=587883.023..821874.081 rows=3371596 loops=1)
...
Filter: (echo_tango('quebec_four'::text, (quebec_three."six")::timestamp without time zone) = 2011::double precision)
Rows Removed by Filter: 26450956
Buffers: shared hit=68 read=961950
PostgreSQL assumes that the step will return 149112 rows but in reality it's returning 3371596 rows. That could either indicate really, really bad statistics or due to the lack of an usable index, PostgreSQL cannot estimate the result for that expression properly.
I guess that the filter condition from execution plan (echo_tango('quebec_four'::text, (quebec_three."six")::timestamp without time zone) = 2011::double precision)) "maps" to this condition from your initial post: date_part('year',"FK_DateCalendarYear") = 2011.
You do have an index on the column FK_DateCalendarYear, but that cannot be used due to the function applied on it. You need to create an index on that expression:
Code:
create index idx_year_value on
"Fact"(date_part('year',"FK_DateCalendarYear"));
then the WHERE condition should be able to use the index.
Quote: | In the mean time I'll try to figure out how the ANALYSE command gets issued from pgAdminIII so that my stats can get updated. | I don't use pgAdmin, but it is a regular SQL statement, just run it wherever you can execute SQL queries.
|

December 18th, 2012, 01:24 PM
|
|
Registered User
|
|
Join Date: Dec 2012
Posts: 11
Time spent in forums: 4 h 48 m 30 sec
Reputation Power: 0
|
|
Thx. We are quite impressed.
We changed two things prior to our last test.
Rather than create a new index, we changed the query slightly to use a
Code:
between 'yyyy-01-01' and 'yyyy-12-31'
clause on both date criteria.
We ran the command
Code:
vacuum analyse "dbo"."Fact"
from pgAdminIII. We were disappointed that the vacuum analyse ran for a half hour but no big deal.
We restarted the PG service and the query ran for 10 seconds. MS SQL came in consistently around a minute and a half on what is essentially the same query.
We saw evidence in a subsequent EXPLAIN that the original INDEX was finally enlisted in the plan. And we dont even know anything about EXPLAIN plans.
If I was smarter I would have taken more care in determining whether that INDEX use was due to the query change alone or the ANALYSE as well.
We will recommend to mgt that PostgreSQL is worth investigating from a relational perspective.
We dont know yet what its cube add in does but we will be researching that next.
|
Developer Shed Advertisers and Affiliates
| Thread Tools |
Search this Thread |
|
|
|
| Display Modes |
Rate This Thread |
Linear Mode
|
|
Posting Rules
|
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is Off
|
|
|
|
|