diff options
Diffstat (limited to 'academic/ent/ent.1')
-rw-r--r-- | academic/ent/ent.1 | 230 |
1 files changed, 230 insertions, 0 deletions
diff --git a/academic/ent/ent.1 b/academic/ent/ent.1 new file mode 100644 index 0000000000000..8480b8f616d9e --- /dev/null +++ b/academic/ent/ent.1 @@ -0,0 +1,230 @@ +.TH ENT "1" "July 2007" "ent" "http://www.fourmilab.ch/random/" +.SH NAME +\fBent\fR \- pseudorandom number sequence test +.PP +This page describes a program, \fBent\fR, which applies various tests to +sequences of bytes stored in files and reports the results of those tests. +The program is useful for those evaluating pseudorandom number generators +for encryption and statistical sampling applications, compression +algorithms, and other applications where the information density of a file +is of interest. +.SH SYNOPSIS +\fBent\fR [ \-bcftu ] [ \fIinfile\fR ] +.SH DESCRIPTION +\fBent\fR performs a variety of tests on the stream of bytes in \fIinfile\fR (or +standard input if no \fIinfile\fR is specified) and produces output as follows +on the standard output stream: +.PP +.nf +Entropy = 7.980627 bits per character. + +Optimum compression would reduce the size +of this 51768 character file by 0 percent. + +Chi square distribution for 51768 samples is 1542.26, and randomly +would exceed this value 0.01 percent of the times. + +Arithmetic mean value of data bytes is 125.93 (127.5 = random). +Monte Carlo value for Pi is 3.169834647 (error 0.90 percent). +Serial correlation coefficient is 0.004249 (totally uncorrelated = 0.0). +.fi +.PP +The values calculated are as follows: +.PP +Entropy +.PP +The information density of the contents of the file, expressed as +a number of bits per character. The results above, which resulted +from processing an image file compressed with JPEG, indicate that +the file is extremely dense in information -- essentially random. +Hence, compression of the file is unlikely to reduce its size. By +contrast, the C source code of the program has entropy of about +4.9 bits per character, indicating that optimal compression of the +file would reduce its size by 38%. \fB[Hamming, pp. 104-108]\fR +.PP +Chi-square Test +.PP +The chi-square test is the most commonly used test for the +randomness of data, and is extremely sensitive to errors in +pseudorandom sequence generators. The chi-square distribution is +calculated for the stream of bytes in the file and expressed as an +absolute number and a percentage which indicates how frequently a +truly random sequence would exceed the value calculated. We +interpret the percentage as the degree to which the sequence +tested is suspected of being non-random. If the percentage is +greater than 99% or less than 1%, the sequence is almost certainly +not random. If the percentage is between 99% and 95% or between 1% +and 5%, the sequence is suspect. Percentages between 90% and 95% +and 5% and 10% indicate the sequence is "almost suspect". Note +that our JPEG file, while very dense in information, is far from +random as revealed by the chi-square test. +.PP +Applying this test to the output of various pseudorandom sequence +generators is interesting. The low-order 8 bits returned by the +standard Unix rand() function, for example, yields: +.PP +.nf +Chi square distribution for 500000 samples is 0.01, and randomly +would exceed this value 99.99 percent of the times. +.fi +.PP +While an improved generator \fB[Park & Miller]\fR reports: +.PP +.nf +Chi square distribution for 500000 samples is 212.53, and +randomly would exceed this value 95.00 percent of the times. +.fi +.PP +Thus, the standard Unix generator (or at least the low-order bytes +it returns) is unacceptably non-random, while the improved +generator is much better but still sufficiently non-random to +cause concern for demanding applications. Contrast both of these +software generators with the chi-square result of a genuine random +sequence created by timing radioactive decay events. +.PP +.nf +Chi square distribution for 32768 samples is 237.05, and +randomly would exceed this value 75.00 percent of the times. +.fi +.PP +See \fB[Knuth, pp. 35-40]\fR for more information on the chi-square +test. An interactive chi-square calculator is available at this +site. +.PP +Arithmetic Mean +.PP +This is simply the result of summing the all the bytes (bits if +the \fB\-b\fR option is specified) in the file and dividing by the file +length. If the data are close to random, this should be about +127.5 (0.5 for \fB\-b\fR option output). If the mean departs from this +value, the values are consistently high or low. +.PP +Monte Carlo Value for Pi +.PP +Each successive sequence of six bytes is used as 24 bit X and Y +co-ordinates within a square. If the distance of the +randomly-generated point is less than the radius of a circle +inscribed within the square, the six-byte sequence is considered a +"hit". The percentage of hits can be used to calculate the value +of Pi. For very large streams (this approximation converges very +slowly), the value will approach the correct value of Pi if the +sequence is close to random. A 32768 byte file created by +radioactive decay yielded: +.PP +.nf +Monte Carlo value for Pi is 3.139648438 (error 0.06 percent). +.fi +.PP +Serial Correlation Coefficient +.PP +This quantity measures the extent to which each byte in the file +depends upon the previous byte. For random sequences, this value +(which can be positive or negative) will, of course, be close to +zero. A non-random byte stream such as a C program will yield a +serial correlation coefficient on the order of 0.5. Wildly +predictable data such as uncompressed bitmaps will exhibit serial +correlation coefficients approaching 1. See \fB[Knuth, pp. 64-65]\fR for +more details. +.SH OPTIONS +.IP \fB\-b\fR +The input is treated as a stream of bits rather than of 8-bit +bytes. Statistics reported reflect the properties of the +bitstream. +.IP \fB\-c\fR +Print a table of the number of occurrences of each possible byte +(or bit, if the \fB\-b\fR option is also specified) value, and the +fraction of the overall file made up by that value. Printable +characters in the ISO 8859-1 Latin1 character set are shown along +with their decimal byte values. In non-terse output mode, values +with zero occurrences are not printed. +.IP \fB\-f\fR +Fold upper case letters to lower case before computing statistics. +Folding is done based on the ISO 8859-1 Latin1 character set, with +accented letters correctly processed. +.IP \fB\-t\fR +Terse mode: output is written in Comma Separated Value (CSV) +format, suitable for loading into a spreadsheet and easily read by +any programming language. See Terse Mode Output Format below for +additional details. +.IP \fB\-u\fR +Print how-to-call information. +.SH FILES +If no \fIinfile\fR is specified, \fBent\fR obtains its input from standard input. +Output is always written to standard output. +.SH TERSE MODE OUTPUT FORMAT +Terse mode is selected by specifying the \fB\-t\fR option on the command line. +Terse mode output is written in Comma Separated Value (CSV) format, which +can be directly loaded into most spreadsheet programs and is easily read +by any programming language. Each record in the CSV file begins with a +record type field, which identifies the content of the following fields. +If the \fB\-c\fR option is not specified, the terse mode output will consist of +two records, as follows: +.PP +.nf +0,File-bytes,Entropy,Chi-square,Mean,Monte-Carlo-Pi,Serial-Correlation +1,file_length,entropy,chi_square,mean,Pi_value,correlation +.fi +.PP +where the italicised values in the type 1 record are the numerical values +for the quantities named in the type 0 column title record. If the \fB\-b\fR +option is specified, the second field of the type 0 record will be +"File-bits", and the file_length field in type 1 record will be given in +bits instead of bytes. If the \fB\-c\fR option is specified, additional records +are appended to the terse mode output which contain the character counts: +.PP +.nf +2,Value,Occurrences,Fraction +3,v,count,fraction +. . . +.fi +.PP +If the \fB\-b\fR option is specified, only two type 3 records will appear for the +two bit values v=0 and v=1. Otherwise, 256 type 3 records are included, +one for each possible byte value. The second field of a type 3 record +indicates how many bytes (or bits) of value v appear in the input, and +fraction gives the decimal fraction of the file which has value v (which +is equal to the count value of this record divided by the file_length +field in the type 1 record). +.SH BUGS +Note that the "optimal compression" shown for the file is computed from +the byte- or bit-stream entropy and thus reflects compressibility based on +a reading frame of the chosen width (8-bit bytes or individual bits if the +\fB\-b\fR option is specified). Algorithms which use a larger reading frame, such +as the Lempel-Ziv \fB[Lempel & Ziv]\fR algorithm, may achieve greater +compression if the file contains repeated sequences of multiple bytes. +.SH SEE ALSO +\fIIntroduction to Probability and Statistics\fR +.br +http://www.fourmilab.ch/rpkp/experiments/statistics.html +.PP +\fB[Hamming]\fR +.br +Hamming, Richard W. \fICoding and Information Theory.\fR Englewood +Cliffs NJ: Prentice-Hall, 1980. +.PP +\fB[Knuth]\fR +.br +Knuth, Donald E. \fIThe Art of Computer Programming, Volume 2 / +Seminumerical Algorithms\fR. Reading MA: Addison-Wesley, 1969. ISBN +0-201-89684-2. +.PP +\fB[Lempel & Ziv]\fR +.br +Ziv J. and A. Lempel. "A Universal Algorithm for Sequential Data +Compression". \fIIEEE Transactions on Information Theory\fR \fB23\fR, 3, +pp. 337-343. +.PP +\fB[Park & Miller]\fR +.br +Park, Stephen K. and Keith W. Miller. "Random Number Generators: +Good Ones Are Hard to Find". \fICommunications of the ACM\fR, October +1988, p. 1192. +.SH COPYING +This software is in the public domain. Permission to use, copy, modify, +and distribute this software and its documentation for any purpose and +without fee is hereby granted, without any conditions or restrictions. +This software is provided "as is" without express or implied warranty. +.SH AUTHOR +John Walker +.br +October 20th, 1998 |