aboutsummaryrefslogtreecommitdiff
path: root/academic/ent/ent.1
diff options
context:
space:
mode:
authorB. Watson <yalhcru@gmail.com>2010-05-12 23:27:37 +0200
committerRobby Workman <rworkman@slackbuilds.org>2010-05-12 23:27:37 +0200
commit9f18c3b12b208e5529f4ffe908191251e6ced3d1 (patch)
tree746c1337ae91ed603b4b77263d325ef7823a8d67 /academic/ent/ent.1
parent16ca523fd1b171026fa0397ca26f1c1d6e966e7b (diff)
academic/ent: Added to 12.2 repository
Diffstat (limited to 'academic/ent/ent.1')
-rw-r--r--academic/ent/ent.1230
1 files changed, 230 insertions, 0 deletions
diff --git a/academic/ent/ent.1 b/academic/ent/ent.1
new file mode 100644
index 000000000000..8480b8f616d9
--- /dev/null
+++ b/academic/ent/ent.1
@@ -0,0 +1,230 @@
+.TH ENT "1" "July 2007" "ent" "http://www.fourmilab.ch/random/"
+.SH NAME
+\fBent\fR \- pseudorandom number sequence test
+.PP
+This page describes a program, \fBent\fR, which applies various tests to
+sequences of bytes stored in files and reports the results of those tests.
+The program is useful for those evaluating pseudorandom number generators
+for encryption and statistical sampling applications, compression
+algorithms, and other applications where the information density of a file
+is of interest.
+.SH SYNOPSIS
+\fBent\fR [ \-bcftu ] [ \fIinfile\fR ]
+.SH DESCRIPTION
+\fBent\fR performs a variety of tests on the stream of bytes in \fIinfile\fR (or
+standard input if no \fIinfile\fR is specified) and produces output as follows
+on the standard output stream:
+.PP
+.nf
+Entropy = 7.980627 bits per character.
+
+Optimum compression would reduce the size
+of this 51768 character file by 0 percent.
+
+Chi square distribution for 51768 samples is 1542.26, and randomly
+would exceed this value 0.01 percent of the times.
+
+Arithmetic mean value of data bytes is 125.93 (127.5 = random).
+Monte Carlo value for Pi is 3.169834647 (error 0.90 percent).
+Serial correlation coefficient is 0.004249 (totally uncorrelated = 0.0).
+.fi
+.PP
+The values calculated are as follows:
+.PP
+Entropy
+.PP
+The information density of the contents of the file, expressed as
+a number of bits per character. The results above, which resulted
+from processing an image file compressed with JPEG, indicate that
+the file is extremely dense in information -- essentially random.
+Hence, compression of the file is unlikely to reduce its size. By
+contrast, the C source code of the program has entropy of about
+4.9 bits per character, indicating that optimal compression of the
+file would reduce its size by 38%. \fB[Hamming, pp. 104-108]\fR
+.PP
+Chi-square Test
+.PP
+The chi-square test is the most commonly used test for the
+randomness of data, and is extremely sensitive to errors in
+pseudorandom sequence generators. The chi-square distribution is
+calculated for the stream of bytes in the file and expressed as an
+absolute number and a percentage which indicates how frequently a
+truly random sequence would exceed the value calculated. We
+interpret the percentage as the degree to which the sequence
+tested is suspected of being non-random. If the percentage is
+greater than 99% or less than 1%, the sequence is almost certainly
+not random. If the percentage is between 99% and 95% or between 1%
+and 5%, the sequence is suspect. Percentages between 90% and 95%
+and 5% and 10% indicate the sequence is "almost suspect". Note
+that our JPEG file, while very dense in information, is far from
+random as revealed by the chi-square test.
+.PP
+Applying this test to the output of various pseudorandom sequence
+generators is interesting. The low-order 8 bits returned by the
+standard Unix rand() function, for example, yields:
+.PP
+.nf
+Chi square distribution for 500000 samples is 0.01, and randomly
+would exceed this value 99.99 percent of the times.
+.fi
+.PP
+While an improved generator \fB[Park & Miller]\fR reports:
+.PP
+.nf
+Chi square distribution for 500000 samples is 212.53, and
+randomly would exceed this value 95.00 percent of the times.
+.fi
+.PP
+Thus, the standard Unix generator (or at least the low-order bytes
+it returns) is unacceptably non-random, while the improved
+generator is much better but still sufficiently non-random to
+cause concern for demanding applications. Contrast both of these
+software generators with the chi-square result of a genuine random
+sequence created by timing radioactive decay events.
+.PP
+.nf
+Chi square distribution for 32768 samples is 237.05, and
+randomly would exceed this value 75.00 percent of the times.
+.fi
+.PP
+See \fB[Knuth, pp. 35-40]\fR for more information on the chi-square
+test. An interactive chi-square calculator is available at this
+site.
+.PP
+Arithmetic Mean
+.PP
+This is simply the result of summing the all the bytes (bits if
+the \fB\-b\fR option is specified) in the file and dividing by the file
+length. If the data are close to random, this should be about
+127.5 (0.5 for \fB\-b\fR option output). If the mean departs from this
+value, the values are consistently high or low.
+.PP
+Monte Carlo Value for Pi
+.PP
+Each successive sequence of six bytes is used as 24 bit X and Y
+co-ordinates within a square. If the distance of the
+randomly-generated point is less than the radius of a circle
+inscribed within the square, the six-byte sequence is considered a
+"hit". The percentage of hits can be used to calculate the value
+of Pi. For very large streams (this approximation converges very
+slowly), the value will approach the correct value of Pi if the
+sequence is close to random. A 32768 byte file created by
+radioactive decay yielded:
+.PP
+.nf
+Monte Carlo value for Pi is 3.139648438 (error 0.06 percent).
+.fi
+.PP
+Serial Correlation Coefficient
+.PP
+This quantity measures the extent to which each byte in the file
+depends upon the previous byte. For random sequences, this value
+(which can be positive or negative) will, of course, be close to
+zero. A non-random byte stream such as a C program will yield a
+serial correlation coefficient on the order of 0.5. Wildly
+predictable data such as uncompressed bitmaps will exhibit serial
+correlation coefficients approaching 1. See \fB[Knuth, pp. 64-65]\fR for
+more details.
+.SH OPTIONS
+.IP \fB\-b\fR
+The input is treated as a stream of bits rather than of 8-bit
+bytes. Statistics reported reflect the properties of the
+bitstream.
+.IP \fB\-c\fR
+Print a table of the number of occurrences of each possible byte
+(or bit, if the \fB\-b\fR option is also specified) value, and the
+fraction of the overall file made up by that value. Printable
+characters in the ISO 8859-1 Latin1 character set are shown along
+with their decimal byte values. In non-terse output mode, values
+with zero occurrences are not printed.
+.IP \fB\-f\fR
+Fold upper case letters to lower case before computing statistics.
+Folding is done based on the ISO 8859-1 Latin1 character set, with
+accented letters correctly processed.
+.IP \fB\-t\fR
+Terse mode: output is written in Comma Separated Value (CSV)
+format, suitable for loading into a spreadsheet and easily read by
+any programming language. See Terse Mode Output Format below for
+additional details.
+.IP \fB\-u\fR
+Print how-to-call information.
+.SH FILES
+If no \fIinfile\fR is specified, \fBent\fR obtains its input from standard input.
+Output is always written to standard output.
+.SH TERSE MODE OUTPUT FORMAT
+Terse mode is selected by specifying the \fB\-t\fR option on the command line.
+Terse mode output is written in Comma Separated Value (CSV) format, which
+can be directly loaded into most spreadsheet programs and is easily read
+by any programming language. Each record in the CSV file begins with a
+record type field, which identifies the content of the following fields.
+If the \fB\-c\fR option is not specified, the terse mode output will consist of
+two records, as follows:
+.PP
+.nf
+0,File-bytes,Entropy,Chi-square,Mean,Monte-Carlo-Pi,Serial-Correlation
+1,file_length,entropy,chi_square,mean,Pi_value,correlation
+.fi
+.PP
+where the italicised values in the type 1 record are the numerical values
+for the quantities named in the type 0 column title record. If the \fB\-b\fR
+option is specified, the second field of the type 0 record will be
+"File-bits", and the file_length field in type 1 record will be given in
+bits instead of bytes. If the \fB\-c\fR option is specified, additional records
+are appended to the terse mode output which contain the character counts:
+.PP
+.nf
+2,Value,Occurrences,Fraction
+3,v,count,fraction
+. . .
+.fi
+.PP
+If the \fB\-b\fR option is specified, only two type 3 records will appear for the
+two bit values v=0 and v=1. Otherwise, 256 type 3 records are included,
+one for each possible byte value. The second field of a type 3 record
+indicates how many bytes (or bits) of value v appear in the input, and
+fraction gives the decimal fraction of the file which has value v (which
+is equal to the count value of this record divided by the file_length
+field in the type 1 record).
+.SH BUGS
+Note that the "optimal compression" shown for the file is computed from
+the byte- or bit-stream entropy and thus reflects compressibility based on
+a reading frame of the chosen width (8-bit bytes or individual bits if the
+\fB\-b\fR option is specified). Algorithms which use a larger reading frame, such
+as the Lempel-Ziv \fB[Lempel & Ziv]\fR algorithm, may achieve greater
+compression if the file contains repeated sequences of multiple bytes.
+.SH SEE ALSO
+\fIIntroduction to Probability and Statistics\fR
+.br
+http://www.fourmilab.ch/rpkp/experiments/statistics.html
+.PP
+\fB[Hamming]\fR
+.br
+Hamming, Richard W. \fICoding and Information Theory.\fR Englewood
+Cliffs NJ: Prentice-Hall, 1980.
+.PP
+\fB[Knuth]\fR
+.br
+Knuth, Donald E. \fIThe Art of Computer Programming, Volume 2 /
+Seminumerical Algorithms\fR. Reading MA: Addison-Wesley, 1969. ISBN
+0-201-89684-2.
+.PP
+\fB[Lempel & Ziv]\fR
+.br
+Ziv J. and A. Lempel. "A Universal Algorithm for Sequential Data
+Compression". \fIIEEE Transactions on Information Theory\fR \fB23\fR, 3,
+pp. 337-343.
+.PP
+\fB[Park & Miller]\fR
+.br
+Park, Stephen K. and Keith W. Miller. "Random Number Generators:
+Good Ones Are Hard to Find". \fICommunications of the ACM\fR, October
+1988, p. 1192.
+.SH COPYING
+This software is in the public domain. Permission to use, copy, modify,
+and distribute this software and its documentation for any purpose and
+without fee is hereby granted, without any conditions or restrictions.
+This software is provided "as is" without express or implied warranty.
+.SH AUTHOR
+John Walker
+.br
+October 20th, 1998