July 17th, 2013, 04:20 AM
Need help in db design
We have analytics database where each month a table will be created by cron like web_analytics_mm_yy (ex:- web_analytics_05_13, web_analytics_06_13, web_analytics_07_13 and so on) respective months data will be inserted to respective tables. but the problem is how to fetch the data from multiple tables in current db setup.
Alternatively If we have "month" and "year" columns in same table, Don't we have a problem in fetching the records back? cos each month we will have 1 million records inserted. after couple of years data records will be huge and performance will be affected. So we have taut of splitting the table month wise and fetching the records.
July 17th, 2013, 11:32 AM
you would either issue one query for each table, or a single UNION query which has subselects for each table
Originally Posted by deepak.fugo
no, not if the table is indexed properly
Originally Posted by deepak.fugo
July 17th, 2013, 07:29 PM
The UNION suggestion by r937 is one way, but requires that you write a query that includes a phrase for every table that might need to be included in the final result. So if you have three tables you'd need three phrases. A procedural language can help with this, but only a bit.
Better solutions exist but they depend on what database backend you are using. Postgres, Oracle and DB2 have partitioning features which make this an easy thing to deal with and prevent performance problems from creeping up on you -- but I don't know which DB you are using.
With Postgres this is easy to deal with. Here is an example:
Create a master table, and then each child (monthly) table inherits the master table's layout. Each child table/partition would also need exclusion constraint(s) based on whatever the criteria for partition is (in your case, the date range). With exclusions in place Postgres can make a query that ignores any tables that couldn't possibly meet your query criteria.
CREATE TABLE web_analytics
(id uuid DEFAULT uuid_generate_v1() PRIMARY KEY,
ip inet NOT NULL,
mac macaddr NOT NULL,
url VARCHAR(2048) NOT NULL,
dtg timestamptz NOT NULL);
CREATE TABLE web_analytics_07_2013
(PRIMARY KEY (id),
CHECK (dtg >= DATE '2013-07-01' AND dtg < DATE '2013-08-01' ))
CREATE INDEX web_analytics_07_2013_dtg ON web_analytics_07_2013 (dtg);
CREATE OR REPLACE FUNCTION partition_web_analytics()
RETURNS TRIGGER AS
t_string := 'web_analytics_' || date_part('month', NEW.dtg) || '_' || date_part('year', NEW.dtg);
'INSERT INTO ' || t_string || ' VALUES ($1, $2, $3, $4, $5)'
USING NEW.id, NEW.ip, NEW.mac, NEW.url, NEW.dtg;
$$ LANGUAGE plpgsql;
CREATE TRIGGER insert_web_analytics
BEFORE INSERT ON web_analytics
FOR EACH ROW EXECUTE PROCEDURE partition_web_analytics();
The child table creation with CHECK constraints that guarantee only the correct month's data will be inserted is enough for the DB to ignore any tables that couldn't possibly hold the data you are querying, assuming that the query filter over date in some way. The trigger let's you do "INSERT INTO web_analytics..." and only hit the correct table.
Of course, your analytics model probably looks a lot more complicated than this (and I didn't proof any of the above, so there's almost definitely a typo or "duh!" bug lurking), and you'd probably want to write a function that creates the table for each month as you go -- but this the basics of how to do what you're asking if you happen to be using Postgres. There are external modules which make this less verbose, and Oracle and DB2 have different utilities for partitioning, some of them offering more convenience in exchange for less control. If you're doing all this in MySQL/Maria I'd recommend leaving MySQL on the web end and use something with heavier data-shoveling tools like Postgres for analytics work.