#1
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2013
    Posts
    1
    Rep Power
    0

    Schema for analytics table


    Hi

    We use Postgres for analytics (star schema).
    Every few seconds we get reports on ~500 metric types.
    The simplest schema would be:

    (timestamp, metric_type, value)
    78930890, FOO, 80.9
    78930890, ZOO, 20
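
    In Postgres DDL that would look roughly like this (names and types are just illustrative; the sample rows above use an epoch-style integer timestamp):

    CREATE TABLE metric_samples (
        ts          timestamptz      NOT NULL,   -- could also be a bigint epoch, as in the sample rows
        metric_type text             NOT NULL,   -- e.g. 'FOO', 'ZOO'
        value       double precision NOT NULL
    );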

    Our DBA has come up with a suggestion to flatten all reports of the same 5 seconds to:

    (timestamp, metric1, metric2, ..., metric500)
    78930890, 80.9, 20, ..., ...
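
    The flattened version would be roughly (only a handful of the ~500 metric columns shown, names illustrative):

    CREATE TABLE metric_samples_wide (
        ts      timestamptz NOT NULL,
        metric1 double precision,
        metric2 double precision,
        -- ... one column per metric, up to metric500
        metric500 double precision
    );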

    Some developers push back on this, saying it adds huge complexity to development (batching data so it is written in one shot) and to maintainability (just looking at the table or adding fields becomes more complex).

    Is the DBA's model standard practice in such systems, or only a last resort once the original model is clearly not scalable enough?
#2
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2013
    Posts
    158
    Rep Power
    11
    "Is the DBA's model standard practice in such systems, or only a last resort once the original model is clearly not scalable enough?"
    What's not scalable about the original model? And for that matter: what does "scalable" even mean? What's scalable about a table with 500 columns?

    I would definitely use the original model for capturing the data; that is simply the fastest way of making sure you have safely stored the data.
    That table will have lots of records, which might frighten your DBA a bit, but he should know about things like partitioning and inheritance...
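
    For example, a rough sketch of the classic inheritance-based partitioning on the narrow table (table names and the monthly split are just assumptions):

    -- parent table holds no data itself; the children carry the rows
    CREATE TABLE metrics (
        ts          timestamptz      NOT NULL,
        metric_type text             NOT NULL,
        value       double precision NOT NULL
    );

    -- one child per month; the CHECK constraint lets the planner
    -- skip partitions that cannot match a query's time range
    CREATE TABLE metrics_2013_11 (
        CHECK (ts >= '2013-11-01' AND ts < '2013-12-01')
    ) INHERITS (metrics);

    CREATE INDEX ON metrics_2013_11 (ts, metric_type);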

    It is certainly true that reporting from an EAV table is not particularly fast, and you will probably have to pre-process the data into some sort of intermediary table before you can really work with it, but that's a whole different matter.
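
    Something along these lines, as a rough sketch (bucket size, names and aggregates are just an example, assuming the metrics table sketched above):

    -- roll the raw rows up into a per-minute reporting table
    CREATE TABLE metrics_1min AS
    SELECT date_trunc('minute', ts) AS bucket,
           metric_type,
           avg(value) AS avg_value,
           min(value) AS min_value,
           max(value) AS max_value,
           count(*)   AS samples
    FROM   metrics
    GROUP  BY 1, 2;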

    Batching the data and inserting it all once every five seconds sounds like a recipe for disaster. Not just from a complexity point of view: the fact that the timestamp is rounded off by up to five seconds sounds unacceptable. If one metric comes in 0.1 seconds early and the next 0.1 seconds late, then your flattening will say that there was no data for ten seconds, which is not true. (And what are you going to do when you get two datapoints for the same metric within five seconds?)

    Keep your storage simple and use it to generate source data for your reporting tools; don't try to store your data in reporting tables.
