A while back I blogged about how to approach error logging in production multi-node environments so that you stay sane. Later on, I looked at a way to deal with application-level metrics using statsd.
The last piece of the puzzle is monitoring the health of the nodes themselves. This typically includes CPU and memory usage, disk space, and application health. Diamond collectors are a great fit for exactly that.