At one point we asked ourselves: is it a good idea to give Oracle Grid Control for monitoring databases to our control room workers (non-DBAs, non-administrators), who are basically incident reporters and incident managers? We decided to try Nagios (with graphing) instead.
As I said, the purpose of this job was to create a simple tool that gives control room workers a really simplified red/green colored overview.
OEM Grid Control is a super tool, but not so super for simple monitoring; it's more like a powerful administration tool. In my view, the other minus of OEM is its dependency on agents. My experience has shown that Grid agents are fragile, and from time to time they just won't respond. Probably every Grid administrator has seen the message "Agent is not responding", only for the agent to be up again a minute or two later. And what's more frustrating: you can't ignore these messages and you can't turn them off, because if a server dies, the last thing it sends is "Agent is not responding". There will be no DB errors, because there is no agent anymore. I was once responsible for an EM environment with 600-700 databases, and it was impossible to rely on Grid monitoring alone.
The other bad thing: Grid is not for non-DBAs, and it's too sophisticated for simple monitoring.
Things should be easy.
I struggled for about a week to get Nagios working properly, probably because I'm just not much of a Unix/Linux administrator. The idea was to combine our shell scripts with a graphical interface (Nagios — although it could have been any monitoring tool, Nagios has the biggest community, it's free, and you can plug basically anything into it, which means Nagios is as powerful as you are on the shell) and draw some graphs to illustrate the situation and trends. Our scripts' output was email, but I'm not a big fan of reading tons of emails, especially when I want a quick overview; email notifications are maybe good for storing historical data.
The shell scripts run on the Nagios server and check Oracle with simple SQL queries, so there is no agent dependency at all. (I'm not saying that Grid Control is bad; Oracle's concept is just a bit different. Not bad, it just doesn't work as it should.) So our scripts, which previously just sent emails, are going to feed data to Nagios. For example:
/usr/local/nagios/libexec/check_oracle --tablespace DB_NAME dbsnmp password ALL 95 90
The argument ALL means all tablespaces; the thresholds are critical at 95 and warning at 90.
The command executes sqlplus, runs a tablespace free-space query, formats the answer for Nagios and sends it to standard output:
OK - tablespaces SYSAUX=53.60% USERS=5.10% SYSTEM=49.90% ALPEPS_DATA=52.10% ALPEPS_INDEX=.10% | SYSAUX_U=561152Kbytes;943718;996147;0;1048576 USERS_U=1683456Kbytes;30198067;31875737;0;33553408 SYSTEM_U=523264Kbytes;943718;996147;0;1048576 ALPEPS_DATA_U=5460992Kbytes;9437184;9961472;0;10485760 ALPEPS_INDEX_U=1024Kbytes;4718592;4980736;0;5242880
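Our plugin wraps sqlplus, but the Nagios-facing part can be sketched roughly like this. This is a minimal sketch, not our actual script: the function name and the "NAME PCT_USED" stdin format are illustrative assumptions, and a real plugin would pipe a sqlplus free-space query into it and handle decimal percentages.

```shell
#!/bin/sh
# Sketch of a check_oracle-style plugin core (illustrative, not the real one).
# Reads "TABLESPACE PCT_USED" lines on stdin; in real life this input would
# come from a sqlplus query against DBA_DATA_FILES/DBA_FREE_SPACE.
check_tablespaces() {
    warn=$1; crit=$2
    status=0; msg=""; perf=""
    while read ts pct; do
        [ -z "$ts" ] && continue
        msg="$msg $ts=$pct%"
        # Performance data: value;warn;crit;min;max (Nagios perfdata format)
        perf="$perf $ts=$pct%;$warn;$crit;0;100"
        # Integer compare for simplicity; keep the worst status seen so far
        if [ "$pct" -ge "$crit" ]; then
            status=2
        elif [ "$pct" -ge "$warn" ] && [ "$status" -lt 2 ]; then
            status=1
        fi
    done
    case $status in
        0) label=OK ;;
        1) label=WARNING ;;
        2) label=CRITICAL ;;
    esac
    echo "$label - tablespaces$msg |$perf"
    return $status
}
```

The exit code (0/1/2) is what turns the service green/yellow/red in Nagios; the text before the pipe is what operators see, and everything after it is performance data for the graphing tool.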
Nagios will show this output in the web console as (last line):
The other thing I did is graphing. For graphing I used a free tool, PNP4Nagios. I had to use the 4.xx version instead of 6.xx because of the older PHP version on our Nagios server, but it works just as well.
In the standard output, everything after the pipe is for the graphing tool ("performance data" in Nagios terminology). It gets stored in an RRD database, and PNP4Nagios reads the graph-drawing information from there. That means that, to implement this, I had to install Nagios, RRDtool and PNP4Nagios. There were no RPMs for RHEL 5.4, so I built them all from source. PNP4Nagios is quite cool: you see little orange symbols next to the Nagios services, and if you click one, you get all the graphs you have for that service (I snapped only the first two graphs, because the page is quite long; it contains historical data for each tablespace in 5 graphs: 4 hours, 24 hours, 1 week, 1 month, 1 year):
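On the Nagios side, a check like this is hooked up with an ordinary command and service definition. The snippet below is only an illustration (the host name, template name and argument order are my assumptions, not my actual config); the important part is `process_perf_data`, which hands everything after the pipe to the perfdata processor:

```
# Illustrative Nagios object definitions (names are examples only).
# $USER1$ expands to the plugin directory, e.g. /usr/local/nagios/libexec.
define command {
    command_name    check_oracle_tablespace
    command_line    $USER1$/check_oracle --tablespace $ARG1$ dbsnmp $ARG2$ ALL $ARG3$ $ARG4$
}

define service {
    use                  generic-service
    host_name            dbhost1
    service_description  Oracle tablespaces
    check_command        check_oracle_tablespace!DB_NAME!password!95!90
    process_perf_data    1   ; pass the part after the pipe to PNP4Nagios
}
```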
OK, these graphs were not so interesting because there are no big changes. Thanks to the PNP4Nagios guys, the graphs are scalable, and you can select specific time ranges, etc. Maybe ping is a bit more interesting to check:
Or maybe not :) In the same way you can check the DB user count, specific wait events, Flash Recovery Area growth, DB locks, etc. (basically whatever).
The data is stored in a round-robin database, a circular buffer, which means the storage footprint remains the same over time. No management needed.
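To see why no management is needed, it helps to look at how an RRD file is defined. The command below is a hedged sketch (the file name, step and archive sizes are my examples, not what PNP4Nagios actually generates): every archive gets a fixed number of rows at creation time, and old samples are simply overwritten in place.

```
# Illustrative RRD definition: one GAUGE data source sampled every 5 minutes,
# plus two fixed-size round-robin archives, so the file never grows.
rrdtool create tablespace.rrd --step 300 \
    DS:pctused:GAUGE:600:0:100 \
    RRA:AVERAGE:0.5:1:2880 \
    RRA:AVERAGE:0.5:12:2880
# RRA:AVERAGE:0.5:1:2880  = 2880 five-minute samples (~10 days)
# RRA:AVERAGE:0.5:12:2880 = 2880 one-hour averages (~4 months)
```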
By the way, one simple Oracle check that goes over sqlplus and runs a query in the database automatically also checks:
- is the DB alive
- is the listener alive
- is it possible to create a session
- is the network alive
- is the server itself alive
If something goes wrong and it's not possible to open a sqlplus session, something is already really bad. And the cool thing is that you immediately see the sqlplus error message in Nagios, for example: maximum sessions exceeded, database shutdown in progress, no listener, etc.
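A sketch of how surfacing those errors can work. The function name and the stdin convention are illustrative assumptions; a real plugin would pipe the output of `sqlplus -S user/pass@db` into it.

```shell
#!/bin/sh
# Illustrative sketch: pass the raw sqlplus error straight into the Nagios
# status line, so the operator sees the cause (ORA-/TNS- message) immediately.
classify_sqlplus_output() {
    out=$(cat)
    # Any ORA- or TNS- line means the check itself could not run
    err=$(printf '%s\n' "$out" | grep -E 'ORA-|TNS-' | head -n 1)
    if [ -n "$err" ]; then
        echo "CRITICAL - $err"
        return 2
    fi
    echo "OK - query returned: $out"
    return 0
}
```

So a dead listener, a full session pool or a shutdown in progress all show up verbatim in the web console, with no extra logic per failure mode.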
The power of visualization and simplicity is the key for end users. I'm not going to go through the various Nagios features, because it's just a monitoring tool. But it is powerful: you can implement anything in it that you are able to write in a shell. It's easy, dynamic, configurable, handy and synoptic. OK, maybe only the setup wasn't that easy :)
I suggest this tool for visualizing little shell monitoring scripts and as a backup Oracle monitoring tool next to Grid Control, at least until I see something better. So far I haven't seen another monitoring product where I could implement my own shell monitoring scripts so easily.
The next thing I'm going to try is AppMon, and then maybe OpenNMS and Zabbix.