#1 - dan-erik (Registered User)

    Performance declining... temp table alternative? or other suggestions...


    I have a pretty large dataset in my pgdb (largest table = 39 million rows) that I am working with. The db is installed on my Windows 7 desktop, because that's the only real option for me currently. When working with the data I thought some of the operations were exceptionally slow, so I have done my best to read through performance-boosting articles etc. After a bit of tweaking, the queries execute faster, but the one I'm doing now is expected to finish after more than 5 days, and that is a long time to wait for a result. So I was hoping to get some tips on how I can improve performance.

    My main table is raw data from fish telemetry loggers (I work with fish research at a university). Every time a fish is detected at a logger, I get a row with date/time, fish-ID, site-name and signal strength. In total 39 million rows.

    The function I am currently running is working through all those rows, one fish at a time and one site at a time. So for fish 1 and site 1, I scan rows until there is a time-gap between detections of at least 5 minutes. When such a gap is found, I store the time-span (first and last detection) and min/max/median signal strength as well as total number of detections and detections per minute.

    For the first few fishes, this is what my timing looks like:
    Code:
    Tag-ID  Totalrows ProcessTime(s) RowsPerSec
    1           5520       0.722   7645
    2           2636       0.473   5573
    3        1283671   44682.820     29
    4           1362     121.327     11
    5         577684   51730.610     11
    6            736      73.698     10
    So, as you can see, the number of rawdata rows processed per second drops radically once the fish with over a million rows is processed, and then performance seems to keep declining with each fish after that. Since I have 100 fish to look at, I hope this doesn't continue.

    My function involves a temporary table to figure out the max/min/median of each 5-minute group. Is that what is causing the performance issues? Any tips on how to get the same performance on the really large queries as I get on the small ones would be greatly appreciated.

    Pasting my function below. I have tried to find ways on the net to make it faster, but this is the best I can do so far.

    I'm calling the function with:
    SELECT Telemetry2009.fDoGordyFilteringDetails(tagid, 5, 'A1') FROM tFishTags GROUP BY tagid;

    Code:
    CREATE OR REPLACE FUNCTION Telemetry2009.fDoGordyFilteringDetails(thistag INTEGER, gaptime INTEGER, thisantenna VARCHAR(10))
    	RETURNS INTEGER AS
    $BODY$
    DECLARE
    	tm_row		RECORD;
    	tm_cursor	CURSOR FOR SELECT * FROM Telemetry2009.rawdata
    					WHERE tagid = thistag
    					  AND antenna = thisantenna
    					  AND date_time > TIMESTAMP '2009-07-06 12:00:00'
    					ORDER BY date_time ASC;
    	maxtime		TIMESTAMP WITH TIME ZONE;
    	thisgap		DOUBLE PRECISION;
    	save		RECORD;
    	medians		INTEGER;
    	i		INTEGER;
    BEGIN
    	-- Create a temporary table for data in the function.
    	--
    	-- Exception handling is the fastest way to check if the temporary table already exists.
    	--
    	-- Temporary tables are dropped at the end of each transaction;
    	-- i.e. SELECT fDoGordyFiltering(tagid) FROM rawdata GROUP BY tagid; keeps the temporary 
    	-- table alive until the query is completed and returns the tagid's, then the temporary 
    	-- table is dropped. So we need to create the table for the first tagid, and the following
    	-- tagid's will just delete previous rows from the table.
    	BEGIN
    		DELETE FROM tempdata;
    	EXCEPTION 
    		WHEN OTHERS THEN
    			-- This is a copy of the Telemetry2009.rawdata table structure, but will 
    			-- only hold the rows we encounter until 'gaptime' is exceeded in a gap.
    			CREATE TEMP TABLE tempdata
    			(
    				date_time TIMESTAMP WITH TIME ZONE,
    				antenna VARCHAR(10),
    				tagid INTEGER,
    				signal INTEGER
    			);
    			-- Another, more dynamic, way would be 
    			-- CREATE TEMP TABLE tempdata AS SELECT * FROM Telemetry2009.rawdata LIMIT 1;
    			-- DELETE FROM tempdata;
    			-- But I think it would be a lot slower.
    	END;
    
    	i := 0; -- just to keep track of progress
    
    	-- Loop through all the rows in rawdata for this tag and antenna, ordered by date_time.
    	FOR tm_row IN tm_cursor LOOP
    		i := i + 1; -- progress tracker
    		IF i > 100000 THEN
    			RAISE NOTICE '\n\n100,000 rows processed at %\n', CLOCK_TIMESTAMP();
    			i := 0;
    		END IF;
    		SELECT MAX(date_time) INTO maxtime FROM tempdata;
    		-- If "time of latest row + gaptime" > "current row time" then we should start a
    		-- new group. Otherwise, we'll just save current row in the temporary table.
    		--
    		-- Remember that if this is the first time through the loop, maxtime will be NULL.
    		IF maxtime IS NOT NULL THEN
    			thisgap := EXTRACT(MONTH	FROM (tm_row.date_time - maxtime)) * 60 * 24 * 30 + 
    					EXTRACT(DAY 	FROM (tm_row.date_time - maxtime)) * 60 * 24 + 
    					EXTRACT(HOUR 	FROM (tm_row.date_time - maxtime)) * 60 + 
    					EXTRACT(MINUTE	FROM (tm_row.date_time - maxtime)) * 1 + 
    					EXTRACT(SECOND	FROM (tm_row.date_time - maxtime)) / 60; 
    			IF thisgap > gaptime THEN
    				SELECT MIN(date_time) AS fd, MAX(date_time) AS ld, COUNT(*) AS c, 
    					MAX(signal) AS maxs, MIN(signal) AS mins INTO save FROM tempdata;
    				-- We need to calculate a new timegap that is within the gaptime.
    				-- (previous gaptime was including the detection that lies outside
    				--  of gaptime, which could be months away).
    				IF (save.ld - save.fd) < INTERVAL '1 minute' THEN 
    					-- DetPerMin >= 1
    					thisgap := 1; 
    				ELSE 
    					thisgap := ROUND((EXTRACT(MONTH	FROM (save.ld - save.fd)) * 60 * 24 * 30) + 
    							(EXTRACT(DAY 	FROM (save.ld - save.fd)) * 60 * 24) + 
    							(EXTRACT(HOUR 	FROM (save.ld - save.fd)) * 60) + 
    							(EXTRACT(MINUTE	FROM (save.ld - save.fd)) * 1) + 
    							(EXTRACT(SECOND	FROM (save.ld - save.fd)) / 60)); 
    				END IF;
    				-- Store median signal.
    				-- Nifty little trick using window functions. Totally stolen code.
    				SELECT ROUND(AVG(signal)) INTO medians 
    					FROM (SELECT signal, ROW_NUMBER() OVER
    							(ORDER BY SIGNAL), COUNT(*) OVER ()
    						FROM tempdata) t
    						WHERE ROW_NUMBER BETWEEN FLOOR((COUNT::FLOAT8 - 1) / 2 + 1)
    								 AND	 CEIL((COUNT::FLOAT8 - 1) / 2 + 1);
    				/* My first attempt at doing median signal:
    				--
    				-- Loop through tempdata to find the middle, then store that signal
    				-- as median signal. It may be a less than perfect way to calculate
    				-- median signal, but it is good enough I think.
    				DECLARE
    					med_row		RECORD;
    					med_cursor	CURSOR FOR SELECT signal FROM tempdata ORDER BY signal;
    					i		INTEGER := 0;
    				BEGIN
    					FOR med_row IN med_cursor LOOP
    						i := i + 1;
    						medians := med_row.signal;
    						-- count(tempdata) / 2 = middle so exit 
    						EXIT WHEN i = FLOOR(save.c / 2);
    					END LOOP;
    				END;
    				*/
    				-- Save this group into a row in the GordyRAW table.
    				INSERT INTO Telemetry2009.tGordyRAW VALUES
    				(
    					thisantenna,			-- Antenna
    					thistag,			-- TagID
    					save.fd,			-- FirstDetection
    					save.ld,			-- LastDetection
    					save.c,				-- NumDetections
    					ROUND(save.c / thisgap),	-- DetPerMin
    					save.maxs,			-- MaxSignal
    					save.mins,			-- MinSignal
    					medians				-- MedianSignal
    				);
    				-- Clear the temporary table since a new time period is started when
    				-- gaptime is exceeded.
    				DELETE FROM tempdata;
    			END IF;
    		END IF;
    		-- Save data for filtering.
    		INSERT INTO tempdata VALUES (tm_row.date_time, tm_row.antenna, tm_row.tagid, tm_row.signal);
    	END LOOP;
    	RETURN thistag;
    END;
    $BODY$
    LANGUAGE 'plpgsql' VOLATILE
    COST 100;
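    (As an aside, the minute arithmetic in the gap calculations above can be written a lot more compactly with EXTRACT(EPOCH ...), which returns the length of an interval in seconds; since subtracting two timestamps never produces a months component, the result should be the same. A minimal sketch of the equivalent expression:)

    Code:
    -- equivalent to the MONTH/DAY/HOUR/MINUTE/SECOND arithmetic above:
    -- minutes elapsed between the previous and the current detection
    thisgap := EXTRACT(EPOCH FROM (tm_row.date_time - maxtime)) / 60.0;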
    Some of the contents of my conf file (I just removed what I didn't think was important for my question):

    Code:
    # -----------------------------
    # PostgreSQL configuration file
    # -----------------------------
    #
    #------------------------------------------------------------------------------
    # RESOURCE USAGE (except WAL)
    #------------------------------------------------------------------------------
    
    # - Memory -
    
    shared_buffers = 1500MB			# min 128kB
    					# (change requires restart)
    #temp_buffers = 8MB			# min 800kB
    #max_prepared_transactions = 0		# zero disables the feature
    					# (change requires restart)
    # Note:  Increasing max_prepared_transactions costs ~600 bytes of shared memory
    # per transaction slot, plus lock space (see max_locks_per_transaction).
    # It is not advisable to set max_prepared_transactions nonzero unless you
    # actively intend to use prepared transactions.
    work_mem = 2MB				# min 64kB
    maintenance_work_mem = 256MB		# min 1MB
    #max_stack_depth = 2MB			# min 100kB
    
    # - Kernel Resource Usage -
    
    #max_files_per_process = 1000		# min 25
    					# (change requires restart)
    #shared_preload_libraries = ''		# (change requires restart)
    
    # - Cost-Based Vacuum Delay -
    
    #vacuum_cost_delay = 0ms		# 0-100 milliseconds
    #vacuum_cost_page_hit = 1		# 0-10000 credits
    #vacuum_cost_page_miss = 10		# 0-10000 credits
    #vacuum_cost_page_dirty = 20		# 0-10000 credits
    #vacuum_cost_limit = 200		# 1-10000 credits
    
    # - Background Writer -
    
    #bgwriter_delay = 200ms			# 10-10000ms between rounds
    #bgwriter_lru_maxpages = 100		# 0-1000 max buffers written/round
    #bgwriter_lru_multiplier = 2.0		# 0-10.0 multipler on buffers scanned/round
    
    # - Asynchronous Behavior -
    
    #effective_io_concurrency = 1		# 1-1000. 0 disables prefetching
    
    
    #------------------------------------------------------------------------------
    # QUERY TUNING
    #------------------------------------------------------------------------------
    
    # - Planner Method Configuration -
    
    #enable_bitmapscan = on
    #enable_hashagg = on
    #enable_hashjoin = on
    #enable_indexscan = on
    #enable_mergejoin = on
    #enable_nestloop = on
    #enable_seqscan = on
    #enable_sort = on
    #enable_tidscan = on
    
    # - Planner Cost Constants -
    
    #seq_page_cost = 1.0			# measured on an arbitrary scale
    #random_page_cost = 4.0			# same scale as above
    #cpu_tuple_cost = 0.01			# same scale as above
    #cpu_index_tuple_cost = 0.005		# same scale as above
    #cpu_operator_cost = 0.0025		# same scale as above
    effective_cache_size = 3000MB
    
    # - Genetic Query Optimizer -
    
    #geqo = on
    #geqo_threshold = 12
    #geqo_effort = 5			# range 1-10
    #geqo_pool_size = 0			# selects default based on effort
    #geqo_generations = 0			# selects default based on effort
    #geqo_selection_bias = 2.0		# range 1.5-2.0
    
    # - Other Planner Options -
    
    #default_statistics_target = 100	# range 1-10000
    #constraint_exclusion = partition	# on, off, or partition
    #cursor_tuple_fraction = 0.1		# range 0.0-1.0
    #from_collapse_limit = 8
    #join_collapse_limit = 8		# 1 disables collapsing of explicit 
    					# JOIN clauses
    
    
    
    # These settings are initialized by initdb, but they can be changed.
    lc_messages = 'Swedish_Sweden.1252'			# locale for system error message
    					# strings
    lc_monetary = 'Swedish_Sweden.1252'			# locale for monetary formatting
    lc_numeric = 'Swedish_Sweden.1252'			# locale for number formatting
    lc_time = 'Swedish_Sweden.1252'				# locale for time formatting
    
    # default configuration for text search
    default_text_search_config = 'pg_catalog.swedish'
    
    # - Other Defaults -
    
    #dynamic_library_path = '$libdir'
    #local_preload_libraries = ''
    
    
    #------------------------------------------------------------------------------
    # LOCK MANAGEMENT
    #------------------------------------------------------------------------------
    
    deadlock_timeout = 10s
    #max_locks_per_transaction = 64		# min 10
    					# (change requires restart)
    # Note:  Each lock table slot uses ~270 bytes of shared memory, and there are
    # max_locks_per_transaction * (max_connections + max_prepared_transactions)
    # lock table slots.
    
    
    #------------------------------------------------------------------------------
    # VERSION/PLATFORM COMPATIBILITY
    #------------------------------------------------------------------------------
    
    # - Previous PostgreSQL Versions -
    
    #add_missing_from = off
    #array_nulls = on
    #backslash_quote = safe_encoding	# on, off, or safe_encoding
    #default_with_oids = off
    #escape_string_warning = on
    #regex_flavor = advanced		# advanced, extended, or basic
    #sql_inheritance = on
    #standard_conforming_strings = off
    #synchronize_seqscans = on
    
    # - Other Platforms and Clients -
    
    #transform_null_equals = off
    
    
    #------------------------------------------------------------------------------
    # CUSTOMIZED OPTIONS
    #------------------------------------------------------------------------------
    
    #custom_variable_classes = ''		# list of custom variable class names
    My machine is an HP z800 with 6GB RAM and a quad-core processor (not that it matters, the PostgreSQL processes only use one core).

    Typically, when running this query, about 2GB of RAM is utilized and 25% of CPU resources (100% of one core), according to the Task Manager performance monitor.

    I have also found the median calculation to be rather slow, but I don't know how I could possibly make that one faster. Both iterating through a cursor (which I tried first) and using window functions (which I tried later) seem to take up a lot of resources.
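    (For what it's worth: newer PostgreSQL releases, 9.4 and later, ship a built-in ordered-set aggregate that turns the median into a one-liner. That is not available on the 8.4 series used in this thread, but for future reference, a minimal sketch:)

    Code:
    -- PostgreSQL 9.4+ only: built-in continuous-percentile aggregate (median)
    SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY signal) AS median_signal
    FROM tempdata;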
#2 - pgFrank (So help me Codd)
    Why is work_mem so low? Only 2MB while working with a large dataset; that will be a problem. Did you try to EXPLAIN all the queries and optimize them?
#3 - dan-erik
    Originally Posted by pgFrank
    Why is work_mem so low? Only 2MB while working with a large dataset; that will be a problem. Did you try to EXPLAIN all the queries and optimize them?
    Work_mem was set to 1MB by default. I was going to set it higher, but then I read all these warnings about how that could flood memory if you make a lot of queries, and I wasn't sure whether my function making 39 million SELECTs would also result in 39 million * 2MB... So I figured that doubling the default would still be a good improvement.

    I tried running EXPLAIN ANALYZE in front of my query that uses the function, but then I realized it would take 5 days to complete, so I just gave up on that idea. Especially since I don't even know what the output of that command is (or how to interpret it).

    I have a background in Oracle programming, where I was only writing code and never had to worry about database performance or administration, so I'm kind of fumbling in the dark here. I do have enough sense to index my tables though, and I've restarted the run now with a different index combination on the main table to see if that makes a difference.
#4 - shammat (Contributing User)
    Originally Posted by dan-erik
    Work_mem was set to 1MB by default. I was going to set it higher, but then I read all these warnings about how that could flood memory if you make a lot of queries, and I wasn't sure whether my function making 39 million SELECTs would also result in 39 million * 2MB... So I figured that doubling the default would still be a good improvement.
    work_mem is per connection, not per query or even per row.
    So the worst case is that max_connections * work_mem can be used.
    If you are the only user on the database, you don't have to worry about it.

    I tried running EXPLAIN ANALYZE in front of my query that uses the function, but then I realized it would take 5 days to complete,
    Just use EXPLAIN (without ANALYZE) and it will not actually run the query, but simply produce the estimated execution plan.
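    For example, with literal values substituted for the function parameters (the tag and antenna below are just placeholders):

    Code:
    EXPLAIN SELECT * FROM Telemetry2009.rawdata
            WHERE tagid = 1
              AND antenna = 'A1'
              AND date_time > TIMESTAMP '2009-07-06 12:00:00'
            ORDER BY date_time ASC;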

    Especially since I don't even know what the output of that command is (or how to interpret it).
    It's the same concept as Oracle's EXPLAIN PLAN.
    It's one of the most important tools when tuning queries. You should get used to it (in PostgreSQL just as well as in Oracle)

    Do read the manual:
    http://www.postgresql.org/docs/current/static/sql-explain.html
    http://www.postgresql.org/docs/current/static/using-explain.html
    http://wiki.postgresql.org/wiki/Using_EXPLAIN

    I do have enough sense to index my tables though, and I've restarted the run now with a different index combination on the main table to see if that makes a difference.
    That's precisely what the execution plan will tell you: whether the index will actually be used, and if so how much it will improve the query.

    Did you see:
    http://wiki.postgresql.org/wiki/Performance_Optimization
#5 - dan-erik
    Originally Posted by shammat
    work_mem is per connection, not per query or even per row.
    If you are the only user on the database, you don't have to worry about it.
    Ah, that makes more sense. I will feel a bit more free to play around with it then.

    Originally Posted by shammat
    Just use EXPLAIN (without ANALYZE) and it will not actually run the query, but simply produce the estimated execution plan.
    My cursor SELECT produces this:
    "Sort (cost=10413.71..10420.63 rows=2770 width=19)"
    " Sort Key: date_time"
    " -> Bitmap Heap Scan on rawdata (cost=69.16..10255.32 rows=2770 width=19)"
    " Recheck Cond: (((antenna)::text = 'EA'::text) AND (tagid = 1100))"
    " Filter: (date_time > '2009-07-06 12:00:00'::timestamp without time zone)"
    " -> Bitmap Index Scan on irawdata_antenna_tagid (cost=0.00..68.47 rows=2770 width=0)"
    " Index Cond: (((antenna)::text = 'EA'::text) AND (tagid = 1100))"

    It only mentions index once, which is a bit odd because I also have an index on date_time, which should be used in the filter condition, shouldn't it?

    This still doesn't make much sense to me. It reports 2770 rows, but there are (like I said) 39 million in the table. Is "cost" measured in seconds? So it takes a maximum of 10240 seconds to run the query? Or how should I interpret this?

    Originally Posted by shammat
    It's the same concept as Oracle's EXPLAIN PLAN.
    It's one of the most important tools when tuning queries. You should get used to it (in PostgreSQL just as well as in Oracle)
    Well, like I said, I never had to worry about performance before. I was working with a huge government database and no matter how complicated my queries were, they bounced back within a few seconds at the most. Performance tuning wasn't part of the development culture there, they just had really powerful servers. But I don't have that luxury now, so I really appreciate your tips here.

    Originally Posted by shammat
    Did you see:
    http://wiki.postgresql.org/wiki/Performance_Optimization
    I did look through the wiki page you mention and the links, but it isn't very easy to make sense of everything since this is my first look at these types of matters. Those guides give hints like "one third of available memory" etc, but all the different settings I tried from there still left me waiting for hours to see the first output of "100'000 rows inserted". Which is why I'm here asking.
#6 - shammat
    Originally Posted by dan-erik
    It only mentions index once, which is a bit odd because I also have an index on date_time, which should be used in the filter condition, shouldn't it?
    That's the filter condition you are seeing.

    Postgres chooses the index based on statistics it collects from the data and then decides which access path is the most effective (i.e. the one with the lowest cost).

    This still doesn't make much sense to me. It reports 2770 rows, but there are (like I said) 39 million in the table.
    The 2770 is the result after applying the criteria. This means that the statistics that Postgres collected indicated that your where condition will yield 2770 rows.

    If that is way off from what you expect, you need to update your statistics using "ANALYZE" (check out the manual). Given the size of your database, this might take some time...
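    For example:

    Code:
    -- refresh the planner statistics for the raw data table
    ANALYZE Telemetry2009.rawdata;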

    Is "cost" measured in seconds? So it takes a maximum of 10240 seconds to run the query?
    I think the formula for how the cost is calculated is in one of the links I posted. One of the most important factors that goes into that number is disk I/O (because that is very "expensive").

    Well, like I said, I never had to worry about performance before. I was working with a huge government database and no matter how complicated my queries were, they bounced back within a few seconds at the most.
    What kind of computer are you using with Postgres?
    The most important part when tuning a database is disk performance. If you are using a desktop computer with just a single disk, the disk will become the bottleneck very quickly, especially with larger tables.

    The disk limits the speed; Postgres will not be able to return the data quicker than the disk is able to read it. A lot of memory can help, but only from the second access to the data onwards (once it has been read into the cache).

    but all the different settings I tried from there still left me waiting for hours to see the first output of "100'000 rows inserted". Which is why I'm here asking.
    Inserting 100000 rows should not take "hours". On my laptop this can be done in a couple of seconds. But as you are doing a lot of processing for each row that you insert, this might well be the best you can get.

    I just had a very quick look at your procedure and I see a lot of looping and manual aggregation in there.

    Maybe you should step back and re-think the inserting process. In most cases, optimizing the algorithm will yield a much better result than trying to optimize the configuration.
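    For instance, the 5-minute grouping could be done in a single set-based query with window functions instead of a cursor and a temporary table. A rough, untested sketch against the rawdata columns from your function (median and detections-per-minute left out for brevity):

    Code:
    SELECT tagid, antenna,
           MIN(date_time) AS first_detection,
           MAX(date_time) AS last_detection,
           COUNT(*)       AS num_detections,
           MAX(signal)    AS max_signal,
           MIN(signal)    AS min_signal
    FROM (
        -- running count of gaps over 5 minutes gives each burst its own group number
        SELECT t.*,
               SUM(CASE WHEN prev_time IS NULL
                          OR date_time - prev_time > INTERVAL '5 minutes'
                        THEN 1 ELSE 0 END)
                   OVER (PARTITION BY tagid, antenna ORDER BY date_time) AS grp
        FROM (
            -- time of the previous detection for the same tag and antenna
            SELECT date_time, antenna, tagid, signal,
                   LAG(date_time) OVER (PARTITION BY tagid, antenna
                                        ORDER BY date_time) AS prev_time
            FROM Telemetry2009.rawdata
            WHERE date_time > TIMESTAMP '2009-07-06 12:00:00'
        ) t
    ) g
    GROUP BY tagid, antenna, grp;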
#7 - dan-erik
    Originally Posted by shammat
    What kind of computer are you using with Postgres?
    The most important part when tuning a database is disk performance. If you are using a desktop computer with just a single disk, the disk will become the bottleneck very quickly, especially with larger tables.
    I have an HP z800 workstation set up with 6 GB RAM and two disks. I use an SSD (OCZ something, the fastest one I could find) for Windows and program binaries (including the Postgres binaries). Then I have a standard 7200 rpm hard drive for data storage, which is also where the actual database files for Postgres are stored. Both of them are on a standard SATA2 interface (there is a RAID hardware thing in there, but with the SSD I didn't expect to gain much from using RAID for the data other than maybe protection against data loss, and I run backups regularly so that's not a problem for me).

    Originally Posted by shammat
    Inserting 100000 rows should not take "hours".

    Maybe if you stepped back and re-think the process of inserting. In most of the cases optimizing the algorithm will yield a much better result than trying to optimize the configuration
    I did two things after your comments:
    1) Changed work_mem from 2MB to 1500MB
    2) Replaced one of the SELECT min(), max(), count() queries with three separate queries using ORDER BY and LIMIT 1, since I understood that is faster with an index. Then I made tempdata into a regular table (still called tempdata) and indexed it (I assumed it wouldn't be possible to index a temporary table; see the note just below).
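    (Side note: PostgreSQL does in fact allow indexes on temporary tables, so the switch to a regular table is not strictly needed for that. A minimal sketch, assuming the tempdata definition from the function above; the index name is just an example:)

    Code:
    -- indexes can be created on temporary tables just like on regular ones
    CREATE INDEX idx_tempdata_date_time ON tempdata (date_time);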

    What took hours yesterday (the first 100'000 rows from one antenna for one of the fish) now takes roughly 10 seconds. So, a great improvement! However, after 600'000 rows it takes a lot longer for the next 100'000 (still waiting, and it's been 17 minutes so far). I assume this is because it ran out of work_mem...

    Yet... Utilized memory is only 1.9 GB according to the Windows system monitor. So it doesn't look like work_mem made much of a difference to actual memory usage? Makes no sense to me at all.
#8 - shammat
    Originally Posted by dan-erik
    Yet... Utilized memory is only 1.9 GB according to the Windows system monitor. So it doesn't look like work_mem made much of a difference to actual memory usage? Makes no sense to me at all.
    As far as I know the memory will only be allocated as long as the query is running. Once it is finished (or the connection is closed) the memory is released.
#9 - dan-erik
    Originally Posted by shammat
    As far as I know the memory will only be allocated as long as the query is running. Once it is finished (or the connection is closed) the memory is released.
    I changed work_mem to 2GB (the maximum; the server won't start with a higher value).

    Physical memory usage is at 2.06 GB (that is, including Windows, Firefox, pgAdmin III, etc) when running the query. So it seems to me that work_mem isn't allocated at all, especially since memory usage only goes up by about 0.1 GB when I execute the query.

    Same result as before, things run pretty fast up until 600'000 rows, then it grinds to a halt. Currently I've been waiting for 25 minutes after the message of 600'000 rows, and still no result (I print a NOTICE every 100'000 rows processed).

    EDIT: I just noticed that the first antenna is actually completed at 600'000 rows, which took 3 minutes. So what I'm waiting for is... DELETE FROM tempdata? maybe...
#10 - shammat
    I changed work_mem to 2GB
    Don't.
    Leave the memory to the OS file cache or to shared_buffers.
    I don't think a work_mem larger than a couple of MB will actually help.

    It seems that under Windows, Postgres doesn't profit as much from large shared_buffers as on other platforms. So with 3GB total memory, I'd recommend 1GB for shared_buffers and maybe 50MB of work_mem (maybe even 100MB, but I don't think you'll see any improvement after that).

    Make sure you adjust the "effective_cache_size" as well. That will also heavily influence the execution plan calculated by Postgres.
    I guess 1GB is a reasonable assumption for your system.
    Check out the Task Manager; it reports the amount of memory used for the file system cache.
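    Roughly, something like this in postgresql.conf (just a sketch of the values suggested above; measure and adjust):

    Code:
    shared_buffers = 1GB            # instead of the current 1500MB
    work_mem = 50MB                 # maybe up to 100MB
    effective_cache_size = 1GB      # roughly the Windows file system cache size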

    Originally Posted by dan-erik
    So what I'm waiting for is... DELETE FROM tempdata? maybe...
    As you want to delete all rows from the table, use TRUNCATE TABLE instead.

    That is a lot faster than DELETE ...
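    In the function, that would mean, for example:

    Code:
    -- instead of: DELETE FROM tempdata;
    TRUNCATE TABLE tempdata;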

    Comments on this post

    • pabloj agrees
#11 - dan-erik
    Originally Posted by shammat
    As you want to delete all rows from the table, use TRUNCATE TABLE instead.

    That is a lot faster than DELETE ...
    After a long while of experimenting with different settings (the original query would have been done by now, I think, if I had just let it run through without tweaking anything), I have gotten some results.

    Using TRUNCATE TABLE instead of DELETE FROM makes a big difference. Some of the other changes I made, replacing max() and min() with ORDER BY and LIMIT 1, also contributed to the speed-up.

    Changing work_mem and shared_buffers and other variables doesn't make as much of a difference as one would think. No matter how hard I try, I can't force Postgres to use all the RAM that is available. Currently, my query is making the hard drive burn red hot from all the swap-file action, and yet I have 3 GB of free RAM!

    One interesting thing is that shared_buffers does make a difference to speed, but it must not be set higher than the "Cached" figure under the Performance tab (physical memory) in the Windows Task Manager. When the computer starts, "Cached" may be as low as 600 MB, so if I set shared_buffers any higher than that, Postgres won't start at boot because it cannot allocate the memory for shared_buffers. However, once Windows has been up for a while, "Cached" is a larger number (currently it's around 2800 MB), which means I can set shared_buffers to 2800 MB and get better performance in my queries.

    I'm starting to think I should try this on Linux as well and see what the difference is. I have a feeling it is all Microsoft's fault that things are acting strange and that everything takes forever.
#12 - shammat
    A large value for shared_buffers does not help as much as you would expect. As I said, in your case I would not go beyond 1GB.

    I think there were even some performance tests showing that beyond 512MB there is not much to gain. Some flavors of Linux seem to profit from larger values, but my understanding is that the Postgres developers regard the filesystem cache as more effective than the shared buffers.

    In order to get good execution plans, it is absolutely necessary to adjust the effective_cache_size setting. My laptop has 3GB and things improved a lot when I raised that figure to 768MB.

    You might want to take these figures to the performance mailing list. Maybe the developers can give you better recommendations.
#13 - pabloj (Modding: Oracle MsSQL Firebird)
    Originally Posted by dan-erik
    ...
    Using TRUNCATE TABLE instead of DELETE FROM makes a big difference. ...
    But do you know why? Because it can't be rolled back.
    And why can't it be rolled back? Because nothing gets written to the transaction log, so the db doesn't keep track of what the values were.
#14 - shammat
    Originally Posted by pabloj
    Because it can't be rolled back.
    Not true.

    In Postgres a TRUNCATE can be rolled back. I think you are confusing this with Oracle.
    Code:
    c:\>psql
    psql (8.4.2)
    Type "help" for help.
    
    postgres=> select * from t;
     id
    ----
     42
    (1 row)
    
    postgres=> begin;
    BEGIN
    postgres=> truncate table t;
    TRUNCATE TABLE
    postgres=> select * from t;
     id
    ----
    (0 rows)
    
    postgres=> rollback;
    ROLLBACK
    postgres=> select * from t;
     id
    ----
     42
    (1 row)
    
    postgres=>

    Comments on this post

    • pgFrank agrees : Exactly! TRUNCATE is transaction-safe in PostgreSQL
