Last night I started trying to figure out how to store cumulative web data. It’s a harder problem than it looks, as most things are. The biggest problem with it seems to be storing it compactly.
Compressed raw logs take a lot of time to process. It’s not so bad if your site isn’t terribly busy or if you’ve got a lot of time on your hands. Granted, my site isn’t busy at all and I could wait a while for results, but I’m hoping someone else will find the program useful.
Intermediate storage is of course the key, but that’s where things get really tricky. It’s easy enough to store sums and averages of various numbers, but without storing stats for every 404, how can you report on the top x of them? I’m halfway to an answer, but it’s probably not going to be pretty. It’s certainly not going to be elegant. But it’s also not going to be Webalizer, which it doesn’t make sense to try to modify to suit my needs.
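One (inelegant) direction, just as a sketch: keep a bounded table of counters and prune the rarest entries whenever it grows past a cap. The names here (`top_404s`, `cap`) are made up for illustration, and the trade-off is real — a pruned URL that reappears restarts from zero, so only the heavy hitters stay reliable:

```python
from collections import Counter

def top_404s(paths, cap=1000, n=10):
    """Approximate top-n 404 paths using at most `cap` counters.

    Pruning loses counts for rare URLs, so results near the bottom
    of the top-n are approximate; frequent URLs survive intact.
    """
    counts = Counter()
    for path in paths:
        counts[path] += 1
        if len(counts) > cap:
            # Drop everything below the most common `cap // 2`
            # entries to make room for new arrivals.
            for url, _ in counts.most_common()[cap // 2:]:
                del counts[url]
    return counts.most_common(n)
```

Something like this would let the intermediate files store only the pruned counter table per period instead of every 404 ever seen.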
Another idea for a nifty tool is a web log playback tool — given an access_log, it replays the log, making the same requests at the same time offsets as in the file. The traffic won’t come from all over the place (it all originates from one machine), but it could still help debug some performance corner cases, for example.
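A rough sketch of what that could look like, assuming the Common Log Format and GET-only replay (the function names and the `speedup` knob are my inventions, not an existing tool):

```python
import re
import time
import urllib.request
from datetime import datetime

# Matches the timestamp, method, and path of a Common Log Format line,
# e.g.: 127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /x HTTP/1.0" 200 2326
LINE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)')

def parse_offsets(lines):
    """Yield (seconds since first request, path) for each parsable line."""
    start = None
    for line in lines:
        m = LINE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        if start is None:
            start = ts
        yield (ts - start).total_seconds(), m.group("path")

def replay(lines, base_url, speedup=1.0):
    """Re-issue each request at its original offset (scaled by speedup)."""
    begin = time.monotonic()
    for offset, path in parse_offsets(lines):
        delay = offset / speedup - (time.monotonic() - begin)
        if delay > 0:
            time.sleep(delay)
        urllib.request.urlopen(base_url + path)  # GETs only in this sketch
```

POSTs, request headers, and concurrent connections from overlapping clients would all need more work, but even this much could reproduce the timing of a real traffic burst.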