Keep access logs: log both when a service receives a request and when it finishes one.
Record request duration.
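For example (the format and ids here are illustrative, not a standard), a request could produce one line on receive and one on finish, with the duration carried on the finish line:

```
1700000000.123 recv GET /api/users id=req-42
1700000000.287 done GET /api/users id=req-42 status=200 duration_ms=164
```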
Always rotate logs.
Ingest logs into a central store if possible.
Ingest exceptions into a central store if possible.
Always use UTC everywhere in infra.
Make sure all (semantic) lines in a log file contain a timestamp.
Include thread ids if it makes sense to.
It's useful to log the unix timestamp alongside the human-readable time, because the unix timestamp is trivially sortable.
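A minimal sketch of that in shell (the log format and message are made up; `date -u +%s` is the unix timestamp, `date -u` keeps everything in UTC):

```shell
#!/bin/sh
# Log helper that prefixes each message with both the unix timestamp
# (trivially sortable) and a human-readable UTC time.
log() {
    printf '%s %s %s\n' "$(date -u +%s)" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*"
}

log "service started"
```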
Use head/tail to test a command before running it on a large log file.
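For instance, to sanity-check a pipeline before unleashing it on a multi-gigabyte file (a tiny `access.log` is faked here so the example is self-contained):

```shell
#!/bin/sh
# Fake a tiny access log; in real life access.log is your huge file.
printf '1.2.3.4 GET /a\n1.2.3.4 GET /b\n5.6.7.8 GET /a\n' > access.log

# Test the pipeline on only the first 1000 lines: requests per client IP,
# busiest first. Once the output looks right, drop the head and run it
# over the whole file.
head -n 1000 access.log | awk '{ print $1 }' | sort | uniq -c | sort -rn
```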
If you find yourself going to logs for time-series data, it is definitely time to use a time-series database. If you can't do that, at least write a `/private/stats` handler that displays in-memory histograms/counters/gauges of relevant data.
Know the difference between stderr and stdout and how to manipulate them on the command line (2>/dev/null is invaluable, 2>&1 is useful), use them appropriately for script output.
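A quick sketch of the convention: stdout (fd 1) is for results a pipe should see, stderr (fd 2) is for humans:

```shell
#!/bin/sh
# fd 1 carries data for the next command in the pipe;
# fd 2 carries progress, warnings, and errors for the operator.
echo "one line of real output"
echo "progress: 50% done" >&2

# 2>/dev/null  - throw diagnostics away
# 2>&1         - fold stderr into stdout (e.g. to capture both in one log)
```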
Use atop; it makes debugging machine-level/resource problems tenfold easier.
Have a general knowledge of log files (sometimes /var/log/syslog will tell you exactly what your problem is, often in red text).
If you keep around a list of relevant hostnames, you can ssh to each one in a loop and run a command across the whole fleet. This needs to be used carefully and deliberately: this is the style of command that can test your backups, and it has caused multiple _major_ outages. With it, you can find a needle in a haystack across an entire fleet of machines quickly and trivially. If you need to do more complex things, `bash -c` can be the command sent to ssh.
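A sketch of the pattern, assuming a `hosts.txt` with one hostname per line (the filename, log path, and request id are all placeholders):

```shell
#!/bin/sh
# Grep every machine in hosts.txt for a request id, prefixing each match
# with the host it came from. This loop is read-only; the same shape with
# a destructive command is exactly how fleets get broken, so dry-run first.
fleet_grep() {
    pattern=$1
    while read -r host; do
        # -n stops ssh from eating the rest of hosts.txt on the loop's stdin
        ssh -n "$host" "grep '$pattern' /var/log/app/*.log" | sed "s/^/$host: /"
    done < hosts.txt
}

# usage: fleet_grep 'req-12345'
```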
I've had an unreasonable amount of success opening log files in vim and using it to explore and operate on them. You can run command-line actions one at a time (:!$bash_cmd), and you can trivially undo (or redo) anything you do to the logs. You also get searching, sorting, line jumping, page up/down, diffing, jumping to the top or bottom of the file, and a status bar that tells you how far into the file you are and how many lines it has, without ever running wc -l.
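A few of the stock vim moves that make this work (no plugins needed):

```vim
:sort                  " sort the whole file
:g/healthz/d           " delete every line matching a pattern (u undoes it)
:%!sort -u             " filter the whole buffer through an external command
/req-12345             " search; n jumps to the next match
gg  G  42G             " top of file, bottom of file, line 42
                       " Ctrl-G shows position and line count, no wc -l needed
```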
Lastly, it's great to think of the command line in terms of map and reduce: `sed` is a mapping command, `grep` is a reducing command, and `awk` is frequently used for either.
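A toy pipeline in those terms (input is inlined so it runs anywhere):

```shell
#!/bin/sh
# grep reduces (drops rows), sed maps (rewrites each surviving row),
# and awk here reduces again, aggregating down to a single count.
printf 'GET /a 200\nGET /b 500\nPOST /a 200\n' |
    grep ' 200$' |
    sed 's|^GET |get |' |
    awk '{ n++ } END { print n }'
```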
Some of these are KPIs (Key Performance Indicators). What we did at a previous job was to run a system like Etsy's statsd [1] (it's an easy system to implement), and it made it easy to add statistics like request latency, error counts, just about anything that could be measured, without excessive overhead in the source code.
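Part of why statsd is so easy to implement and to call is that its wire format is just `name:value|type` in a UDP datagram. A sketch (the host, port, and metric names are assumptions; 8125 is the conventional statsd port):

```shell
#!/bin/sh
# Fire-and-forget one StatsD datagram; "c" marks a counter, "ms" a timer.
statsd_send() {
    printf '%s:%s|%s' "$1" "$2" "$3" | nc -u -w0 127.0.0.1 8125
}

# usage:
#   statsd_send api.request.latency 142 ms
#   statsd_send api.errors 1 c
```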
Can Amazon do this? They use UTC and your local browser's time seemingly at random depending on the AWS service, and it drives me nuts. They usually (not always) put the timezone next to it, but why can't they just mandate that it either is or is not UTC?! (The worst case: the Lambda console is UTC but CloudWatch isn't, so you think you haven't received a request in hours when in fact you have.)