

 

(last improvements regarding analog patches and webalizer adjustment: 30 Nov 2002, 12 Jan and 3 May 2003)

General Remarks

This is a rather specialized topic, primarily for webmasters, though it can sometimes be useful for other purposes too, for example for scrutinizing squid logs: where the most traffic goes, and so on...

Some further remarks: the tools presented here are primarily designed for use with the common Apache web server, which has a market share of over 60% of all web servers, but if some rules are respected, they can be used for the rare birds among web servers too. Whether you can trust the logs of shitty proprietary web servers like M$ IIS at all must be doubted heavily, though! On the other hand, if they are configured to log in CLF (Common Log Format), you can even use the tools on squid logs, for example, at least after customizing the log format strings (especially in the case of analog). The same holds true for the log analysis itself: only free, open source products are unaffected by commercial interests, which often fake statistics like these in the first place... Therefore only two open source tools are presented here. And finally, if you have any influence on it, choose the (combined) CLF of NCSA (the precursor of Apache) in the web server configuration: it's the easiest and most complete format for analysis. Even if you like to query specialized questions with UNIX/GNU tools, it's feasible with this format (though splitting into several logs may ease that up, I have to admit).
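For illustration, a minimal Apache setup for the combined CLF might look like this (the log path is just an example, not from my configuration):

    # define the combined NCSA log format and use it for the access log
    LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
    CustomLog /var/log/apache/access_log combined

Each resulting log line then carries host, user, timestamp, request, status, bytes, referer and user agent in one place, which is exactly what both tools below expect.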

Another thing to consider is wrong information sent, intentionally or sometimes unintentionally, by user agents; for example, the Junkbuster proxy makes any computer running it and browsing through it appear as an old Mac 68k model with a special Netscape 3.01 Gold browser, while most of these are supposed to be LINUX machines in reality, running dated Netscape versions, new Mozilla or Opera browsers, or some exotic ones (as on mine, for example...). But this is not encountered often enough to skew the statistics on a global scale. The bigger problem is missing information; see also below! By the way, I have compared a number of important results of both tools with each other and with applications of UNIX/GNU tools, and found no notable deviations. They both seem to be free from unintended faults too... Keep in mind, when you compare numbers, that web server responses with code 304 (file unchanged) are generally, and correctly, counted as successful queries just like the usual 200 (transmission completed successfully)!
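If you want to make such a cross-check yourself, a sketch with standard UNIX/GNU tools could look like this (assuming a (combined) CLF access_log, where the status code is the ninth whitespace-separated field):

    # count successful queries the way the tools do: 200 and 304 together
    awk '$9 == 200 || $9 == 304' access_log | wc -l

The total should match the number of successful requests reported by analog and webalizer for the same period.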

Analog

This may be the favorite web server log analyzing tool of all at the moment. It features a highly tunable configuration file, for which I provide you with my own currently used configuration for analog 5.1n, the same for analog 5.21, and for analog 5.31, as a useful example, at least I hope so. Which values are best always depends on your special case, and which parts you wish to see depends on your interests; so I mainly point out that I have chosen the rather interesting OS category to be sorted by file requests instead of pageviews, because otherwise robots are clearly overestimated in importance (they usually poll only [HTML] pages and no images or other files). Generally speaking, the summarizing pie charts show only the categories with a large enough percentage to clearly show up, which gives you a good overall impression. Analog produces valid(!) HTML 2.0, rather old but universally usable and sufficient for this purpose, and it's rather fast in processing even large logs. Analog generally focuses on long-term statistics for arbitrary time periods, but you can of course break the time intervals down as you like. The search strings of engines, by which a page was found, are listed as single search words. Regarding the OS statistics I have to make some remarks and can give you an improvement too... It may also help to give some insight into the whole complex of web server log analysis.
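Before that, to give you an idea of the configuration files linked above, a stripped-down analog configuration might contain entries like the following (the file names are placeholders and the exact directive set depends on the analog version; this is only a sketch, not my full configuration):

    HOSTNAME "my site"                # title of the report
    LOGFILE /var/log/apache/access_log
    OUTFILE webstats.html
    OSREP ON                          # enable the operating system report
    OSSORTBY REQUESTS                 # sort by file requests, not pageviews (robots!)
    SEARCHENGINE http://*google*/* q  # needed to extract search words at all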

It is not the fault of analog that it is unable to discriminate between different OS versions on Apple Mac hardware: the browsers there simply send no information about it, so at the moment not even the dated proprietary Apple Macintosh OS <=9.x can, in most cases, be separated from the most modern UNIX system: the new Mac OS X >=10. This is rather disappointing, because the UNIX category thus lacks the eventual leader Mac OS X, and therefore LINUX is now on top in front of Solaris there, with both (or all three) together rendering all others (HP-UX, IRIX, AIX, OSF) meaningless as browsing platforms (at least on my own site).

The remainder of my remarks and patches is only useful for the UNIX friends among you: I changed the file tree.c of the analog sources in three points, to get more accurate values for the OS category. The Konqueror browser is virtually endemic to LINUX (it's a KDE component!), so the cases where that information is lacking can easily be attributed to LINUX. The NetPositive browser is endemic to BeOS, so its likewise mostly missing OS designation is handled by another special else branch. And there is an anonymizer service running under the name SilentSurf on RedHat LINUX servers, which leaves an X11 entry in the log but of course masks the truly used OS; so I have assigned these few requests to the unknown OS group. Finally, Wget (with the possible exception of cygwin) and TeleportPro are endemic to the worlds of UNIX respectively the evil M$ win, therefore they are classified that way, just as OmniWeb and the Darwin kernel are used as the only hints for telling apart Mac OS X from the older proprietary Apple OS in these rare cases.

If you are interested, get the analog sources and then either replace the file src/tree.c with my own version (with analog 5.* you can do so without problem) or, better, patch it with the contents of my own contribution (for older ones see below, but I propose using the current version for various reasons, one being that I don't backport my patches to those older versions). Most easily it is done this way, after gunzip and pax (or the obsolete tar) application in the usual way: patch src/tree.c {path}tree.patch (the whole sequence is sketched below). Hint: to make full use of it, you should always get the newest analog version and apply my newest corresponding patch, because I don't backport patches invented later. Finally (re-)compile it on the machine where you let it run, after choosing the appropriate options in the Makefile if necessary (OS dependencies are commented). Analog comes with the GPL >= 2; I think this hint is sufficient, because my code fragment is useless without the source distribution download from above.
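For orientation, the whole procedure might look like this on the shell (the version number and archive name are only examples, and {path} stays whatever directory you saved my patch to):

    gunzip analog-5.31.tar.gz
    pax -r -f analog-5.31.tar   # or, with the obsolete tar: tar xf analog-5.31.tar
    cd analog-5.31
    patch src/tree.c {path}tree.patch
    # check the OS dependent options in the Makefile, then:
    make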

Webalizer

This one is the main alternative to analog. The output is valid HTML 4.0 Transitional, and webalizer is also licensed under the GPL >= 2. Instead of pie charts (with the exception of the top level domain/country distribution) it works largely with colored tables, sorted by hits or pageviews; configuration is a major topic here as well. Opposed to the continually working analog, webalizer automatically breaks the statistics into monthly ones, over one year at most; then the oldest are replaced by the current ones. Rather simple and effective is the way webalizer analyzes browser signatures: if you take some care, you get accurate numbers even about not so often used browsers, but you have to configure it for that. This can be seen in my currently used example configuration, which has some personal preferences in it, but shows you two very important items too: the browser entries have to be in exactly this order, otherwise many Opera entries and all entries of Gecko engine based browsers (Mozilla/Netscape 6/Galeon etc.) will be masked out by the prevailing M$ IE and Netscape <= 4.x browsers.
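A sketch of the decisive part of such a webalizer configuration, using the standard GroupAgent/HideAgent directives (the comments are mine, and the list is shortened compared to a real file):

    # order matters: the more specific signatures must come first
    GroupAgent Opera        # would otherwise vanish behind MSIE (compatible mode)
    HideAgent  Opera
    GroupAgent Gecko        # Mozilla, Netscape 6, Galeon etc.
    HideAgent  Gecko
    GroupAgent MSIE
    HideAgent  MSIE
    GroupAgent Mozilla      # what is left here really is old Netscape <= 4.x
    HideAgent  Mozilla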

Important hint: due to an inadequate default setting for the maximum user agent string length, many Mozilla (derivative) browsers are not counted as such by the binary distributed webalizer, but as (old) Netscape (compatible) ones! (The reason is that the keyword Gecko sits at or near the end of the pretty long user agent string of Mozilla family browsers; the same can happen to Opera browsers in some configuration/version/OS combinations, but less often.) The solution is to get the sources, go into webalizer.h and change the constant MAXAGENT from the too low value 64 to 128 (maybe 96 is sufficient, but better safe than sorry). Then compile it, and from then on use only your self-adjusted and self-compiled version!
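The change itself is a one-liner in webalizer.h (the exact surroundings may differ between webalizer versions):

    /* maximum length of a stored user agent string; the default of 64
       cuts off the trailing "Gecko/..." token of Mozilla family browsers */
    #define MAXAGENT 128   /* was 64; 96 may suffice, but better safe than sorry */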

And for accurate search engine statistics it's vital to configure all major ones; most important is the following: version 1.3 of webalizer, still widespread, couldn't list the search strings exactly, so version 2 is much improved in this respect and features an additional search engine section for their CGI call patterns to support it (sketched below). This important new section is also included in the configuration offered above. Opposed to analog, webalizer lists entire search clauses (though you can configure analog this way too), not just single words out of the strings. The bad news is that for migrating from version 1.3 to 2 of webalizer you have to throw away your history and reanalyze all logs, because the history formats are not compatible. This is especially a problem with incremental application, which is the usual one on huge log files caused by much traffic. So if you try it for the first time, you should always start with version 2! Meanwhile you must do so anyway, because webalizer 1.3 no longer works due to the one billion seconds problem (elapsed since time 0 of the UNIX epoch)...
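In webalizer 2 this section consists of SearchEngine lines mapping each engine to its CGI query parameter; a few typical entries (the selection here is only exemplary, not my complete list):

    # SearchEngine <hostname part> <CGI parameter carrying the search string>
    SearchEngine google.      q=
    SearchEngine yahoo.com    p=
    SearchEngine altavista.   q=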

By the way, you can find some additional shell scripts for a more comprehensive statistics output of webalizer on my site too.

Older patch/tree.c versions for analog, no longer supported: analog 5.1, analog 5.21, analog 5.22, analog 5.23, analog 5.24, analog 5.30, analog 5.31


 


remarks etc. to: stefan.urbat@apastron.lb.shuttle.de

(URL:  http://www.lb.shuttle.de/apastron/linWebAn.htm)