Minutes From Cache Logs Analysis B.o.F
5th Caching Workshop, Lisbon: 24/5/00
-------------------------------------

Chair: Jens-S. Vöckler

Present: John Denholm, Matija Grabnar, Serge Krashnov, George Neisser,
         Henrik Nordström, Alexei Novikov, Wojtek Sylwestrzak,
         Duane Wessels

Agenda
------

- Log Anonymization
- Standard Log Format Descriptor
- Nature of Log Statistics

Anonymization
-------------

The principal objective is to provide logs suitable for engineering and
academic study without compromising confidential information contained in
the cache logs. Issues raised included -

- Different countries have different standards. Any anonymization would
  have to demonstrably protect confidentiality to whatever degree the
  applicable law requires. So different levels of anonymization would be
  needed to allow maximum information to be conveyed under different
  laws. Each level and style of anonymization would need its properties
  documented.

- Some sites include user information in the URL, both in POST data and
  in some dynamically generated page names. Such information would also
  have to be removed.

- Techniques would have to be proof against someone with access to site
  logs from a website accessed during the sample period. Such a person
  would have a de-anonymized sample set to compare against, and this must
  not permit them to reverse engineer the anonymization process.

Solutions and ideas included -

- Fields such as source ip, URL and sitename could be MD5 encoded. By MD5
  encoding the field data plus a secret string, secrecy should be
  preserved even if someone has access to site logs. While they may be
  able to cross-reference hit times and determine which encoded site is
  theirs, a 1kB or greater secret string renders reverse engineering the
  encoding non-trivial. Similarly, such a person could determine some
  user identities based on site logs, but they should be unable to
  reverse engineer source ip encoding to obtain all user identities.

- Cross-logfile comparability is also a concern.
  Since log files are vast, the interval of comparability needs to be
  defined beforehand (e.g. a year or just a month), or the secret needs
  to be made sufficiently longer.

- Site should be encoded separately from the URL. This permits site by
  site analysis as well as within-site analysis.

- File extensions are a matter of analysis - a separate encoding for
  those could be optionally provided.

> If it is just the file extension {1,6}, it might not even have to
> be encoded at all...

Standardising Format by use of Meta-Data
----------------------------------------

Currently most log files do not come with meta-data headers describing
the fields. Some kind of openly agreed standard would be advisable to
enable automated discovery of log file format. Most or all
vendors/authors are quite happy to provide this information for log
analysis, so there should be no big issues in placing this information at
the top of each log file.

- Hash-based, comment style information...

    # Format...

> After the BOF, there was a mail that the W3C already thinks along those
> lines, too. Being aware that it is considered bad style to publish an email
> w/o checking back with the author, it is believed that the author intended
> to make things publicly known:
>
> Date: Tue, 23 May 2000 17:12:35 -0400
> From: Hussein Suleman
> To: voeckler@rvs.uni-hannover.de, martin@net.lut.ac.uk
> Cc: ...
> Subject: meta-format for log files ...
>
> hi
>
> i received a copy of the following message from Balachander
> Krishnamurthy - i dont know if anyone else has responded yet, but we've
> been devising an approach to solve the problem of disparate log file
> formats (where we is the Web Characterization Activity working group of
> the W3C)
>
> we use an XML definition for entries in a log file. its still somewhat
> under development - as we try to define more formats using this, the
> format definition specification language may be refined.
>
> this definition gets fed into a data file validator which can then check
> any general log file for errors, optionally converting the data to a
> different format.
>
> for more details about the project and examples, go to the Web
> Characterization Repository at:
> http://purl.org/net/repository
>
> within the next month or so, we expect to publish a technical report on
> this format-driven validation process. there will also be a
> demonstration and short-paper presentation about this at ACM Digital
> Libraries 2000 in 2 weeks ...
>
> ttfn
>
> hussein
> >Martin Hamilton wrote:
> >
> >> It's been a bugbear of mine for a long time that although we have
> >> (more or less) a standard log file format for proxy caches (well,
> >> OK, plain vanilla Common Logfile and Squid), there isn't a commonly
> >> accepted format for summaries of those log files.
> >
> >Considering native log format style, now supported by major cache
> >vendors, there might be even more interesting things to report. Yes,
> >I know about the offer from squid-dev to implement them. I am (still)
> >not concrete but thinking in terms of what Adrian Cockroft did with
> >the SUN web servers... And talking about that, some meta definition
> >of log file formats is missing, too, e.g. a generally accepted style
> >to describe what a log file does report - so you can parse a variety
> >of logs if you have the meta description, not limited to squid native
> >format.
>
> --
> ======================================================================
> hussein suleman - hussein@vt.edu - vt cs - http://purl.org/net/hussein
> ======================================================================

- Many fields would be standard (src ip, timestamp), whereas others, such
  as cache return type (e.g. TCP_HIT, TCP_HIT_VERIFY), may be vendor
  specific, and the meta-data should reflect that, e.g. (#timestamp
  Sqd-code). Some fields can also be reduced from such a string to a
  single number giving the code.
  A list of conversions would be quoted in the meta-data. This would save
  space in log files by avoiding needless repetition and simplify
  parsing. Standards concerning representation of data do exist and need
  to be verified for applicability.

- Fields should be separated by a single space in all instances, no
  matter how nice extra formatting makes them look. Cache logs are vast,
  and as such making them human-readable is superfluous. Scripts to both
  search logs and present more human-readable results are trivial. Coping
  with extra formatting is an unnecessary cost, and it requires much more
  complex code to guarantee a correct read of the log line. Conversion
  tools need not be limited to scripts, but could also encompass gui
  based tools as soon as cache maintenance becomes a regular part of an
  admin's work day.

- Some URLs contain white spaces - this is clearly a problem for
  analysis. As it cannot be guaranteed that a log line will not be
  corrupt, verifying the format of white space separated fields is made
  more complex by white space within URLs. This would be best solved by
  vendors converting white space within URLs into a non-breaking space.

> Traditionally, a "+" sign was used to encode plain space. "%20" would
> also do, of course, as well as the &nbsp; representation chr(160), which
> looks like a space to a human, but not to any parser.

- Some vendors have a limit to the length of line they will print out -
  8k has been encountered. An over-long line is frequently not terminated
  by a carriage return / newline. This causes a format violation for
  analysis. All vendors should limit log lines to (2^N)-2 chars (e.g.
  8190 chars) and guarantee to put a CR/newline at the end of the line
  (yielding 8k chars). Maximum log line length could also be specified in
  the meta-data. HTTP/1.1 requires no limit, but in practice parsing
  requirements call for a sensible maximum.

> From RFC 2616, 3.2.1:
>
> The HTTP protocol does not place any a priori limit on the length of
> a URI.
> Servers MUST be able to handle the URI of any resource they
> serve, and SHOULD be able to handle URIs of unbounded length if they
> provide GET-based forms that could generate such URIs. A server
> SHOULD return 414 (Request-URI Too Long) status if a URI is longer
> than the server can handle (see section 10.4.15).
>
> Note: Servers ought to be cautious about depending on URI lengths
> above 255 bytes, because some older client or proxy
> implementations might not properly support these lengths.

- The issue of configurability and a basic set of options was briefly
  raised, without any results.

Nature of Cache Behaviour
-------------------------

(It's a hit - but what KIND of hit?)

> That is an item for the Squid FAQ, too.

Different vendors measure their hit rates in slightly different ways, and
report their hits in slightly different ways. Standard interpretations
should be agreed. Hit types include -

- Validating a user agent IMS
- Validating a store object with upstream authority
- A full hit - returning stored object
- A forced validation of a stored object
- A stale hit (if IMS fails)
- A prefetch hit
- A partial hit (e.g. part of an object obtained by byte ranges)

- All vendors should provide exact details of how all these (and any
  other) are represented in their logs. Most already do.

> All log file processors should tell how those HITs are combined. Only
> Jesalfa (http://www.linofee.org/~elkner/webtools/jesalfa/) shows
> different hits separately.

- Also under debate is the nature of byte hit rate. An IMS has a byte
  saving equal to the size of the object minus the bytes required for the
  IMS reply. The size of the object is not quoted in all (or perhaps any)
  vendors' logs, so the byte saving associated with this cannot be
  determined. The concept of Hit Bytes and Miss Bytes, with IMS
  contributing to both, might be useful.

- Some fields not routinely included in cache logs are DNS RTT, DNS cache
  hit/miss, peer response RTT, and some other useful metrics.
  While these would swell logs considerably, they are excellent
  performance measurement tools for cache administrators. Vendors should
  provide a log format option to show all such information. Encoding the
  field values as numbers with string values in the meta-data would
  reduce the size increase.

> Adrian Cockcroft "Sun Performance and Tuning, 2nd ed" drilled a few holes
> into the Netscape 2.5 Proxy Cache (the book was published 1998) and had
> it report (among the usual suspects):
>
>  9. status to client (outbound)
> 10. length to client (outbound)
> 11. status from server (inbound)
> 12. length from server (inbound)
> 13. length from client (for POST)
> 14. length to server (for POST)
> 15. client header request length
> 16. proxy header response length
> 17. proxy header request length
> 18. server header response length
> 23. cache finish status {written,refreshed,error,no-check,up-to-date,
>     host-not-available,cl-mismatch,do-not-cache,non-cacheable}
> 24. total transfer time
> 25. dns lookup time
> 26. initial wait time
> 27. full wait time

Further discussion centered on the statistical methods used, and the
(foggy) idea to define some standard meaning to not only the object and
volume hit ratios, but also other statistical material ("Never trust a
statistic you haven't forged yourself.") Different vendors talk about
similar issues differently. The methods used, or a set of procedures,
need some standardization. There is also a need for a commonly acceptable
taxonomy. The issue of standardized computational methods and comparable
results was raised on WREC - the need to have a standard for log file
summaries.

Some discussion also occurred on monitoring activity - a number of
solutions exist using SNMP, or cache-generated output fetched regularly
and parsed by scripts to produce graphs.

- A useful technique is to grab regular information and display an
  average of the last half-hour, hour, or other interval.
  By gathering stats regularly and averaging over a period of time, spike
  behaviour is prevented from skewing data.

- On that note, it was deemed unfeasible to measure time for timestamps
  more accurately than the nearest millisecond. Squid may not even have
  the timestamp to the nearest millisecond, because the timestamp is only
  updated on calls to select() (Thanks Duane!) - other vendors may have
  similar procedures that prevent greater accuracy.

- Another issue raised was to let the cache itself do some of the summing
  otherwise done by external log processors. A cache could possibly
  accumulate a sum for a given interval, and keep it available during the
  next one or two such intervals for asynchronous retrieval. Access to
  these counters would need a specific API, too. The mentioned approach
  has some drawbacks, and is not ideal from the viewpoint of software
  engineering.
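
A minimal sketch of the interval-counter idea in the last item: sums are
accumulated per fixed interval, and closed intervals stay retrievable for a
couple of further intervals. It is an illustration only, not any vendor's
API. The class and method names (IntervalCounters, add, closed), the
300-second interval and the retention of two closed intervals are all
assumptions, as is the simplification that timestamps arrive in
non-decreasing order.

```python
class IntervalCounters:
    """Accumulate per-interval sums; keep the last few closed intervals.

    Hypothetical sketch: names and defaults are illustrative assumptions.
    """

    def __init__(self, interval_seconds=300, keep_closed=2):
        self.interval = interval_seconds
        self.keep_closed = keep_closed
        self._sums = {}  # interval start time -> {counter name: sum}

    def _start_of(self, timestamp):
        # Align a timestamp to the start of the interval containing it.
        return int(timestamp // self.interval) * self.interval

    def _oldest_kept(self, current_start):
        # Start time of the oldest closed interval still retained.
        return current_start - self.keep_closed * self.interval

    def add(self, timestamp, counter, value=1):
        """Add `value` to `counter` in the interval containing `timestamp`."""
        start = self._start_of(timestamp)
        bucket = self._sums.setdefault(start, {})
        bucket[counter] = bucket.get(counter, 0) + value
        # Drop intervals that have fallen out of the retention window.
        limit = self._oldest_kept(start)
        for old in [s for s in self._sums if s < limit]:
            del self._sums[old]

    def closed(self, now):
        """Return copies of the retained intervals that ended before `now`."""
        current = self._start_of(now)
        limit = self._oldest_kept(current)
        return {s: dict(c) for s, c in self._sums.items()
                if limit <= s < current}


# Usage: two samples fall into the interval starting at 900, one into 1200.
counters = IntervalCounters(interval_seconds=300, keep_closed=2)
counters.add(1000, "requests")
counters.add(1010, "hit_bytes", 2048)
counters.add(1310, "requests")
print(counters.closed(now=1320))
```

An external processor polling closed() once per interval would see each
finished interval at least once, even if a poll is late by one interval,
which is the point of keeping the sums for "the next one or two" intervals.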