#1
    Registered User, Devshed Newbie (joined Dec 2012, 11 posts)

    PG query runs 15 times longer than MS SQL Server


    Hi. We are evaluating PG and were very impressed by its performance when inserting 30 million rows as you can see in my previous thread. It required slightly more work on our part but was well worth it.

    When we went to see how PG performs on a pretty simple query, we were disappointed. Our target was 1.5 minutes, but it ran for 23 minutes. We compared PG to SQL Server.

    Both SQL Server and PG first had their caches cleared in their respective ways: MS SQL Server via four special cache commands, PG via a stop and restart of the service.

    We'd like to give PG a fair shot. Here is a little background.

    Our fact table has 30 million rows in it dating back to Jan 2005. Of its many foreign key columns (there are no constraints), one in particular can be used to join to a product dimension in star-like fashion.

    The fact table has 3 indexes. One is a serial PK. Two are important dates to our business activity. You can see one of these dates in the query below (modified slightly for privacy reasons). The purpose of the query is to compare 2012 months to 2011 months across 3 amounts. My apologies for not knowing which flavor of code tags to wrap around my query; I chose Language Syntax Highlighter.

    Please let me know what particular needs PG has for a query like this to perform competitively. Intermediate tables perhaps?

    Code:
    select coalesce(x."Product", y."Product") AS "Product",
           2012 AS "YR",
           coalesce(x."MO", y."MO") AS "MO",

           x."Amt1",
           y."Amt1" AS "PriorAmt1",
           case when x."Amt1" is null
                then null
                else (x."Amt1" - case when y."Amt1" is null
                                      then '0.0000'::money
                                      else y."Amt1"
                                 end) / x."Amt1"
           end AS "Amt1Growth",

           x."Amt2" AS "Amt2",
           y."Amt2" AS "PriorAmt2",
           case when x."Amt2" is null
                then null
                else (x."Amt2" - case when y."Amt2" is null
                                      then '0.0000'::money
                                      else y."Amt2"
                                 end) / x."Amt2"
           end AS "Amt2Growth",

           x."Amt3" AS "Amt3",
           y."Amt3" AS "PriorAmt3",
           case when x."Amt3" is null
                then null
                else (x."Amt3" - case when y."Amt3" is null
                                      then '0.0000'::money
                                      else y."Amt3"
                                 end) / x."Amt3"
           end AS "Amt3Growth"
    from
    (
        SELECT "Product",
               sum("Amt3") AS "Amt3",
               sum("Amt1") AS "Amt1",
               sum("Amt2") AS "Amt2",
               date_part('month', "FK_DateCalendarYear") AS "MO",
               date_part('year', "FK_DateCalendarYear") AS "YR"
        FROM "dbo"."Fact" a
        left join "dbo"."DimProduct" b
          on (a."FK_DimProduct" = b."PK_DimProduct")
        where "Product_Company" = 'XX'
          and date_part('year', "FK_DateCalendarYear") = 2012
        group by "Product",
                 date_part('month', "FK_DateCalendarYear"),
                 date_part('year', "FK_DateCalendarYear")
    ) AS x
    FULL OUTER JOIN
    (
        SELECT "Product",
               sum("Amt3") AS "Amt3",
               sum("Amt1") AS "Amt1",
               sum("Amt2") AS "Amt2",
               date_part('month', "FK_DateCalendarYear") AS "MO",
               date_part('year', "FK_DateCalendarYear") AS "YR"
        FROM "dbo"."Fact" a
        left join "dbo"."DimProduct" b
          on (a."FK_DimProduct" = b."PK_DimProduct")
        where "Product_Company" = 'XX'
          and date_part('year', "FK_DateCalendarYear") = 2011
        group by "Product",
                 date_part('month', "FK_DateCalendarYear"),
                 date_part('year', "FK_DateCalendarYear")
    ) AS y
      on (x."Product" = y."Product" and
          x."MO" = y."MO")
    order by 1, 2, 3
#2
    Contributing User, Devshed Frequenter (Germany, joined Oct 2003, 2,785 posts)
    Please format your SQL statement properly and read this: https://wiki.postgresql.org/wiki/Slow_Query_Questions
    I will not read nor answer questions where the SQL code is messy and not formatted properly using [code] tags.
    http://forums.devshed.com/misc.php?do=bbcode#code

    Tips on how to ask better questions:
    http://tkyte.blogspot.de/2005/06/how-to-ask-questions.html
    http://wiki.postgresql.org/wiki/SlowQueryQuestions
    http://catb.org/esr/faqs/smart-questions.html
#3
    Contributing User, Devshed Regular (Superior, CO, USA, joined May 2004, 2,418 posts)
    It's a little difficult to determine from this query what the issue is. But there are some things I would do.

    The first is: when was the last time you ran a reindex? I run "reindexdb -a" (a command-line command) once a day to make sure the indexes are up to date.

    Next, how much memory are you giving the database server? I have found that increasing shared_buffers and work_mem has improved the performance of my database server, though, of course, your mileage may vary.

    Lastly, I use the combination of log_min_duration_statement (in postgresql.conf) and "Explain analyze" through pgAdminIII to help me understand what the optimizer is doing. You can run EXPLAIN on the command line, but I like the graphical output of pgAdmin. Obviously, if you have any table scans (which, given the amount of time this is taking, it sounds like you might), you'd want to examine the query to determine why they are occurring.

    I've used DB2, Oracle, Sybase, Informix (though I hate admitting that one out loud), and MS/SQL in various jobs over the years before being introduced to PostgreSQL. What I've seen is that, with the proper tuning, PostgreSQL can easily keep up with a commercial DB. I will say that the default configuration as shipped with PostgreSQL assumes a rather low-performance machine, so changing the default memory parameters will be needed.
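
    The parameters mentioned above live in postgresql.conf. As an illustrative sketch only (the values below are assumptions that must be sized to the actual machine, not recommendations), the relevant lines might look like:

```
# postgresql.conf -- illustrative values for a dedicated server with ample RAM
shared_buffers = 4GB               # PostgreSQL's own buffer cache
work_mem = 64MB                    # per sort/hash operation, per backend
maintenance_work_mem = 1GB         # used by VACUUM, CREATE INDEX, REINDEX
effective_cache_size = 12GB        # planner hint: total cache (OS + PG)
log_min_duration_statement = 1000  # log any statement slower than 1000 ms
```

    Note that changing shared_buffers requires a server restart; most of the other parameters take effect on a configuration reload.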
#4
    Registered User, Devshed Newbie (joined Dec 2012, 11 posts)
    Thx folks. We're still trying to figure out how to format the query so you can more easily read it; that particular thing isn't clearly explained. We'll get there. I'll change the tags there to code tags and add a little indentation.

    But basically this is two subselects participating in a full outer join, followed by a sort. I suspect PG isn't using the date index like we'd hoped, but we'll see.

    In the meantime, some of the items we can answer immediately per the guidelines are:

    -there are no large objects, just integers, dates, dollar amts and booleans
    -there can be nulls in dimproduct_scd_original_id, logid, adjustedPKinteger, Amt7, Amt8, Amt9 , Amt10
    -there has never been an update or delete run thus far, just initial insert, so it only grew once (from nothing)
    -there are 3 indexes as mentioned before but one is the pk, 2 are dates
    -there are no triggers or called functions except for what you see in the query
    -we run 9.2
    -we are on an Intel(R) Xeon(R) CPU E5620 @ 2.40 GHz (two processors), running WS2008R2 64-bit
    -this is the first time the query was ever run
    -we have 36 gig of ram, hundreds of gigs of san
    -the table is unlogged
    -here is what looks like the code version of the explain graphic...there are slight mods to protect our privacy
    Code:
    "Sort  (cost=2969616.15..2969616.29 rows=54 width=72)"
    "  Sort Key: (COALESCE(b."Product", y."Product")), (COALESCE((date_part('month'::text, (a."FK_DateCalendarYear")::timestamp without time zone)), y."MO"))"
    "  ->  Hash Full Join  (cost=2969611.76..2969614.60 rows=54 width=72)"
    "        Hash Cond: ((b."Product" = y."Product") AND ((date_part('month'::text, (a."FK_DateCalendarYear")::timestamp without time zone)) = y."MO"))"
    "        ->  HashAggregate  (cost=1484804.66..1484805.74 rows=54 width=32)"
    "              ->  Hash Join  (cost=31.50..1484674.49 rows=8678 width=32)"
    "                    Hash Cond: (a."FK_DimProduct" = b."PK_DimProduct")"
    "                    ->  Seq Scan on "Fact" a  (cost=0.00..1483910.26 rows=149112 width=32)"
    "                          Filter: (date_part('year'::text, ("FK_DateCalendarYear")::timestamp without time zone) = 2012::double precision)"
    "                    ->  Hash  (cost=30.81..30.81 rows=55 width=8)"
    "                          ->  Seq Scan on "DimProduct" b  (cost=0.00..30.81 rows=55 width=8)"
    "                                Filter: ("Product_Company" = 'XX'::bpchar)"
    "        ->  Hash  (cost=1484806.28..1484806.28 rows=54 width=36)"
    "              ->  Subquery Scan on y  (cost=1484804.66..1484806.28 rows=54 width=36)"
    "                    ->  HashAggregate  (cost=1484804.66..1484805.74 rows=54 width=32)"
    "                          ->  Hash Join  (cost=31.50..1484674.49 rows=8678 width=32)"
    "                                Hash Cond: (a."FK_DimProduct" = b."PK_DimProduct")"
    "                                ->  Seq Scan on "Fact" a  (cost=0.00..1483910.26 rows=149112 width=32)"
    "                                      Filter: (date_part('year'::text, ("FK_DateCalendarYear")::timestamp without time zone) = 2011::double precision)"
    "                                ->  Hash  (cost=30.81..30.81 rows=55 width=8)"
    "                                      ->  Seq Scan on "DimProduct" b  (cost=0.00..30.81 rows=55 width=8)"
    "                                            Filter: ("Product_Company" = 'XX'::bpchar)"
    -here are the creates with slight mods to protect our privacy

    Code:
    CREATE TABLE dbo."Fact"
    (
      "PK_Fact" integer NOT NULL DEFAULT nextval('dbo."Fact_PK_Fact_seq"'::regclass),
      "FK_Dim01" integer,
      "FK_DimProduct" integer,
      "FK_Dim01" integer,
      "FK_EffDate" date,
      "FK_ExpiryDate" date,
      "FK_TransactionDate" date,
      "FK_Dim01" integer,
      "FK_Dim01" integer,
      "FK_Dim01" integer,
      "FK_Dim01" integer,
      "FK_Dim01" integer,
      "FK_Dim01" integer,
      "FK_Dim01" integer,
      "FK_Dim01" integer,
      "FK_Dim01" integer,
      "FK_Dim01" integer,
      "FK_Dim01" integer,
      "FK_Dim01" integer,
      "FK_Dim01" integer,
      "FK_Dim01" integer,
      "FK_EffDate1" date,
      "FK_ExpiryDate1" date,
      "Amt3" money,
      "Amt1" money,
      "Amt2" money,
      "LogId" integer,
      "FK_Dim01" integer,
      "FK_Dim01" integer,
      "FK_Dim01" integer,
      "FK_Dim01" integer,
      "FK_Dim01" integer,
      "Flag1" character(1),
      "Days" integer,
      "Amt4" money,
      "Amt5" money,
      "Adj1" boolean,
      "adjustedPK" integer,
      "fk_DateAnalysis" date,
      "FK_Date1" date,
      "FK_DateCalendarYear" date,
      "FK_Date2" date,
      "Amt6" money,
      "Amt7" money,
      "Amt8" money,
      "Amt9" money,
      "Amt10" money,
      CONSTRAINT "PK_Unit" PRIMARY KEY ("PK_Unit")
    )
    WITH (
      OIDS=FALSE
    );
    ALTER TABLE dbo."Fact"
      OWNER TO postgres;
    
    
    CREATE INDEX "IX_Fact_FK_DateCalendarYear"
      ON dbo."Fact"
      USING btree
      ("FK_DateCalendarYear");
    
    
    CREATE INDEX "IX_Fact_FK_TransactionDate"
      ON dbo."Fact"
      USING btree
      ("FK_TransactionDate");
    
    
    
    CREATE TABLE dbo."DimProduct"
    (
      "PK_DimProduct" integer NOT NULL DEFAULT nextval('dbo."DimProduct_PK_DimProduct_seq"'::regclass),
      "BizKey01" character varying(50),
      "Product_Company" character(2) NOT NULL,
      "BizKey01" character(3) NOT NULL,
      "BizKey01" character(3),
      "BizKey01" character(2) NOT NULL,
      "BizKey01" character varying(50) NOT NULL,
      "BizKey01" character varying(50) NOT NULL,
      "BizKey01" character varying(50) NOT NULL,
      "BizKey01" character varying(50) NOT NULL,
      "BizKey01" character varying(50),
      "DimProduct_SCD_Original_ID" integer,
      "DimProduct_SCD_Status" character varying(25),
      "DimProduct_SCD_Start_Date" date,
      "DimProduct_SCD_End_Date" date,
      CONSTRAINT "PK_DimProduct" PRIMARY KEY ("PK_DimProduct")
    )
    WITH (
      OIDS=FALSE
    );
    ALTER TABLE dbo."DimProduct"
      OWNER TO postgres;
    -some of the mem settings in our config:

    shared_buffers 8GB
    work_mem 64MB
    maintenance_work_mem 1GB
    effective_cache_size 18GB
#5
    Contributing User, Devshed Frequenter (Germany, joined Oct 2003, 2,785 posts)
    Thanks for the update.

    23 minutes for the (small) number of rows shown in the plan seems way too much.

    An explain (analyze on, buffers on, verbose on) would have been interesting to show the difference between the estimates and the actuals and to also show the steps that took the longest.

    You can upload the execution plan to http://explain.depesz.com which makes things a little easier to read. That site also has an option to anonymize the execution plan.
    Last edited by shammat; December 17th, 2012 at 01:33 PM.
#6
    Registered User, Devshed Newbie (joined Dec 2012, 11 posts)
    thx. We'll do that, but I should mention that 1.4 million records are involved in the 1st subquery (2012, company XX) and 1.8 million records in the 2nd (2011, company XX). That's before they are summarized by product, month and year.

    Without the company qualifier, it's 2.8 million and 3.4 million.
#7
    Contributing User, Devshed Frequenter (Germany, joined Oct 2003, 2,785 posts)
    Originally Posted by db042188
    thx. We'll do that but I should mention that 1.4 million records are involved in the 1st subquery for 2012 and company XX and 1.8 million records in the 2nd for 2011 and company XX. That's before they are summarized by product, month and year.
    Well, the execution plan does not reflect that.

    This line:
    Seq Scan on "Fact" a (cost=0.00..1483910.26 rows=149112 width=32)"
    means that PostgreSQL estimates that this scan of the "Fact" table will return 149112 rows.

    And this line:
    Seq Scan on "DimProduct" b (cost=0.00..30.81 rows=55 width=8)"
    means that PostgreSQL estimates that this scan of "DimProduct" will return 55 rows.
#8
    Registered User, Devshed Newbie (joined Dec 2012, 11 posts)
    terrible assumption on the fact table.

    Good assumption on dimProduct for company XX.
#9
    Contributing User, Devshed Frequenter (Germany, joined Oct 2003, 2,785 posts)
    Originally Posted by db042188
    terrible assumption on the fact table.
    That indicates that the statistics are out of date and that you should run ANALYZE on that table.
    This is something that would be visible when running explain (analyze on, buffers on, verbose on) instead of just explain.
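
    For reference, ANALYZE is an ordinary SQL statement, so it can be issued from pgAdmin III's query window or psql like any other query (table name taken from the schema posted earlier):

```sql
-- refresh planner statistics for the fact table only
ANALYZE "dbo"."Fact";

-- or refresh statistics for every table in the current database
ANALYZE;
```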
#10
    Registered User, Devshed Newbie (joined Dec 2012, 11 posts)
    I may have done some unnecessary steps but here is what I did:

    1) with my db highlighted in pgAdminIII, I issued a command as follows:

    Code:
    SET track_io_timing=On;
    The feedback I got was "Query returned successfully with no results in 10ms"

    2) Ran my query with this in front:

    Code:
    EXPLAIN (ANALYZE on, VERBOSE on, COSTS on, BUFFERS on, TIMING on )
    3) on explain.depesz.com, posted the results under item "Yrm", anonymized. I see a lot of red and some very large numbers associated with one of my tables, but I will wait to hear from the community on what that means. In the meantime I'll try to figure out how the ANALYSE command gets issued from pgAdminIII so that my stats can get updated.
#11
    Contributing User, Devshed Frequenter (Germany, joined Oct 2003, 2,785 posts)
    Originally Posted by db042188
    3) on explain.depesz.com, posted the results under item "Yrm" anonymized.
    Step #5 seems to be your problem:
    Code:
    Seq Scan on charlie_tango quebec_three  (cost=0.000..1483910.260 rows=149112 width=32) (actual time=587883.023..821874.081 rows=3371596 loops=1)
    ...
    Filter: (echo_tango('quebec_four'::text, (quebec_three."six")::timestamp without time zone) = 2011::double precision)
    Rows Removed by Filter: 26450956
    Buffers: shared hit=68 read=961950
    PostgreSQL assumes that the step will return 149112 rows, but in reality it returns 3371596 rows. That could indicate either really, really bad statistics, or that, due to the lack of a usable index, PostgreSQL cannot estimate the result for that expression properly.

    I guess that the filter condition from the execution plan (echo_tango('quebec_four'::text, (quebec_three."six")::timestamp without time zone) = 2011::double precision) "maps" to this condition from your initial post: date_part('year',"FK_DateCalendarYear") = 2011.

    You do have an index on the column FK_DateCalendarYear, but it cannot be used due to the function applied to it. You need to create an index on that expression:
    Code:
    create index idx_year_value on 
        "Fact"(date_part('year',"FK_DateCalendarYear"));
    then the WHERE condition should be able to use the index.

    In the meantime I'll try to figure out how the ANALYSE command gets issued from pgAdminIII so that my stats can get updated.
    I don't use pgAdmin, but it is a regular SQL statement, just run it wherever you can execute SQL queries.
#12
    Registered User, Devshed Newbie (joined Dec 2012, 11 posts)
    Thx. We are quite impressed.

    We changed two things prior to our last test.

    Rather than create a new index, we changed the query slightly to use a
    Code:
    between 'yyyy-01-01' and 'yyyy-12-31'
    clause on both date criteria.
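
    As a sketch of that rewrite (column names as posted earlier; the point is that the indexed column now appears bare in the comparison, so the existing btree index on "FK_DateCalendarYear" can be used):

```sql
where "Product_Company" = 'XX'
  and "FK_DateCalendarYear" between date '2012-01-01' and date '2012-12-31'
```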

    We ran the command
    Code:
    vacuum analyse "dbo"."Fact"
    from pgAdminIII. We were disappointed that the vacuum analyse ran for half an hour, but no big deal.

    We restarted the PG service and the query ran for 10 seconds. MS SQL Server came in consistently around a minute and a half on what is essentially the same query.

    We saw evidence in a subsequent EXPLAIN that the original index was finally used in the plan. And we don't even know anything about EXPLAIN plans.

    If I were smarter, I would have taken more care in determining whether that index use was due to the query change alone or the ANALYSE as well.

    We will recommend to mgt that PostgreSQL is worth investigating from a relational perspective.

    We don't know yet what its cube add-in does, but we will be researching that next.
