
February 15th, 2006, 01:42 AM
Contributing User
Join Date: Jul 2005
Location: Bay Area, California
You probably want a mixture of both log files (at each server) and a central repository for all important exceptions from your (distributed) services. Over the last 3 months I have been enhancing our home-grown system at work to better support the rapid growth of the company.
On the local side, we wrapped Log4J and each service has its own set of log files. These hold a lot of information: whatever developers thought could be useful when debugging problems. It's verbose, but of course very useful. Whenever a more serious problem occurs, we also log to a server for our Operations group. We maintain a table of the different error types defined (name, error level, notification group, description). The developer passes the alert ID, a message, and the runtime exception. The sending portion is handled in a separate thread, knows the server to connect to, and uses plain sockets so that we don't lose messages when another component (e.g. JMS) goes down. All information is stored in a DB and we can then monitor all our servers in a nifty web console. This design is very easy to scale horizontally and we've begun load testing it with SLAMD.
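To make the alert path concrete, here is a rough sketch of how such a sender could look. It is not our actual code: the class names (AlertType, AlertSender), the tab-separated wire format, the queue size of 1000, and the one-second retry delay are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical row of the error-type table: name, error level, notification group, description. */
final class AlertType {
    final String name, level, group, description;

    AlertType(String name, String level, String group, String description) {
        this.name = name;
        this.level = level;
        this.group = group;
        this.description = description;
    }
}

/** Hypothetical sender: callers enqueue alerts, a separate thread ships them over a plain socket. */
final class AlertSender implements Runnable {
    private final String host;
    private final int port;
    private final Map<Integer, AlertType> types;
    // Bounded queue (size is an assumption) so a dead server cannot exhaust memory.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(1000);

    AlertSender(String host, int port, Map<Integer, AlertType> types) {
        this.host = host;
        this.port = port;
        this.types = types;
    }

    /** Called by developers; never blocks. Returns false for an unknown ID or a full queue. */
    boolean alert(int alertId, String message, Throwable error) {
        AlertType type = types.get(alertId);
        if (type == null) {
            return false;
        }
        return queue.offer(format(alertId, type, message, error));
    }

    /** One tab-separated line per alert (assumed wire format). */
    static String format(int alertId, AlertType type, String message, Throwable error) {
        String cause = (error == null) ? "" : error.toString();
        return alertId + "\t" + type.name + "\t" + type.level + "\t" + type.group
                + "\t" + clean(message) + "\t" + clean(cause);
    }

    private static String clean(String s) {
        return (s == null) ? "" : s.replaceAll("[\\t\\r\\n]+", " ");
    }

    int pending() {
        return queue.size();
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                String line = queue.take();
                try (Socket socket = new Socket(host, port);
                     PrintWriter out = new PrintWriter(
                             new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8), true)) {
                    out.println(line);
                } catch (IOException e) {
                    // Server unreachable: put the alert back if there is room, then back off.
                    queue.offer(line);
                    Thread.sleep(1000);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

A service would start it once, for example with new Thread(sender, "alert-sender"), mark the thread as a daemon, and then call sender.alert(id, message, exception) wherever it currently logs a serious error. Opening one socket per alert keeps the sketch short; a long-lived connection would likely be the better choice under load.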
This design allows production to know what types of problems our users are encountering, dig down and determine trends, and easily filter out junk alerts. We also have a dashboard that monitors internal/external service providers, either through test queries or recent activity, so that if a provider goes down we can attempt to resolve the issue quickly.
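The provider check on the dashboard boils down to "seen recently, or answers a test query". A minimal sketch of that rule, with the class name ProviderCheck, the idle threshold, and the test-query callback all being illustrative assumptions rather than our real code:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.BooleanSupplier;

/** Hypothetical health rule for a service provider shown on the dashboard. */
final class ProviderCheck {
    /**
     * A provider counts as up if it was active within maxIdle;
     * otherwise we fall back to running a test query against it.
     */
    static boolean isUp(Instant lastActivity, Instant now, Duration maxIdle, BooleanSupplier testQuery) {
        if (lastActivity != null
                && Duration.between(lastActivity, now).compareTo(maxIdle) <= 0) {
            return true; // recent real traffic, no need to probe
        }
        try {
            return testQuery.getAsBoolean();
        } catch (RuntimeException e) {
            return false; // a failing probe means the provider is treated as down
        }
    }
}
```

Checking recent activity first avoids sending test queries to providers that are obviously healthy.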
We haven't had any performance problems, it uses relatively little memory, and it offers a good balance between local detail and central visibility. Since it is a simple, non-critical internal service, it's been a great test bed for trying out new technologies. If we like something and get a feel for it, we then push it out to user-level services. It's a bit amusing that our console was AJAXized before the user applications...