Diffstat (limited to 'academic/ent')
-rw-r--r-- | academic/ent/README         |   9
-rw-r--r-- | academic/ent/ent.1          | 230
-rw-r--r-- | academic/ent/ent.SlackBuild |  61
-rw-r--r-- | academic/ent/ent.info       |   8
-rw-r--r-- | academic/ent/slack-desc     |  19
5 files changed, 327 insertions, 0 deletions
diff --git a/academic/ent/README b/academic/ent/README
new file mode 100644
index 0000000000000..7a5602d720f15
--- /dev/null
+++ b/academic/ent/README
@@ -0,0 +1,9 @@
+ent - A pseudorandom number sequence test program
+
+ent applies various tests to sequences of bytes stored in files and
+reports the results of those tests. The program is useful for evaluating
+pseudorandom number generators for encryption and statistical sampling
+applications, compression algorithms, and other applications where the
+information density of a file is of interest. Tests performed include
+Entropy, Chi-square, Arithmetic Mean, Monte Carlo value for Pi, and
+Serial Correlation Coefficient.
diff --git a/academic/ent/ent.1 b/academic/ent/ent.1
new file mode 100644
index 0000000000000..8480b8f616d9e
--- /dev/null
+++ b/academic/ent/ent.1
@@ -0,0 +1,230 @@
+.TH ENT "1" "July 2007" "ent" "http://www.fourmilab.ch/random/"
+.SH NAME
+\fBent\fR \- pseudorandom number sequence test
+.PP
+This page describes a program, \fBent\fR, which applies various tests to
+sequences of bytes stored in files and reports the results of those tests.
+The program is useful for those evaluating pseudorandom number generators
+for encryption and statistical sampling applications, compression
+algorithms, and other applications where the information density of a file
+is of interest.
+.SH SYNOPSIS
+\fBent\fR [ \-bcftu ] [ \fIinfile\fR ]
+.SH DESCRIPTION
+\fBent\fR performs a variety of tests on the stream of bytes in \fIinfile\fR (or
+standard input if no \fIinfile\fR is specified) and produces output as follows
+on the standard output stream:
+.PP
+.nf
+Entropy = 7.980627 bits per character.
+
+Optimum compression would reduce the size
+of this 51768 character file by 0 percent.
+
+Chi square distribution for 51768 samples is 1542.26, and randomly
+would exceed this value 0.01 percent of the times.
+
+Arithmetic mean value of data bytes is 125.93 (127.5 = random).
+Monte Carlo value for Pi is 3.169834647 (error 0.90 percent).
+Serial correlation coefficient is 0.004249 (totally uncorrelated = 0.0).
+.fi
+.PP
+The values calculated are as follows:
+.PP
+Entropy
+.PP
+The information density of the contents of the file, expressed as
+a number of bits per character. The results above, which resulted
+from processing an image file compressed with JPEG, indicate that
+the file is extremely dense in information -- essentially random.
+Hence, compression of the file is unlikely to reduce its size. By
+contrast, the C source code of the program has entropy of about
+4.9 bits per character, indicating that optimal compression of the
+file would reduce its size by 38%. \fB[Hamming, pp. 104-108]\fR
+.PP
+Chi-square Test
+.PP
+The chi-square test is the most commonly used test for the
+randomness of data, and is extremely sensitive to errors in
+pseudorandom sequence generators. The chi-square distribution is
+calculated for the stream of bytes in the file and expressed as an
+absolute number and a percentage which indicates how frequently a
+truly random sequence would exceed the value calculated. We
+interpret the percentage as the degree to which the sequence
+tested is suspected of being non-random. If the percentage is
+greater than 99% or less than 1%, the sequence is almost certainly
+not random. If the percentage is between 99% and 95% or between 1%
+and 5%, the sequence is suspect. Percentages between 90% and 95%
+and 5% and 10% indicate the sequence is "almost suspect". Note
+that our JPEG file, while very dense in information, is far from
+random as revealed by the chi-square test.
+.PP
+Applying this test to the output of various pseudorandom sequence
+generators is interesting. The low-order 8 bits returned by the
+standard Unix rand() function, for example, yields:
+.PP
+.nf
+Chi square distribution for 500000 samples is 0.01, and randomly
+would exceed this value 99.99 percent of the times.
+.fi
+.PP
+While an improved generator \fB[Park & Miller]\fR reports:
+.PP
+.nf
+Chi square distribution for 500000 samples is 212.53, and
+randomly would exceed this value 95.00 percent of the times.
+.fi
+.PP
+Thus, the standard Unix generator (or at least the low-order bytes
+it returns) is unacceptably non-random, while the improved
+generator is much better but still sufficiently non-random to
+cause concern for demanding applications. Contrast both of these
+software generators with the chi-square result of a genuine random
+sequence created by timing radioactive decay events.
+.PP
+.nf
+Chi square distribution for 32768 samples is 237.05, and
+randomly would exceed this value 75.00 percent of the times.
+.fi
+.PP
+See \fB[Knuth, pp. 35-40]\fR for more information on the chi-square
+test. An interactive chi-square calculator is available at this
+site.
+.PP
+Arithmetic Mean
+.PP
+This is simply the result of summing the all the bytes (bits if
+the \fB\-b\fR option is specified) in the file and dividing by the file
+length. If the data are close to random, this should be about
+127.5 (0.5 for \fB\-b\fR option output). If the mean departs from this
+value, the values are consistently high or low.
+.PP
+Monte Carlo Value for Pi
+.PP
+Each successive sequence of six bytes is used as 24 bit X and Y
+co-ordinates within a square. If the distance of the
+randomly-generated point is less than the radius of a circle
+inscribed within the square, the six-byte sequence is considered a
+"hit". The percentage of hits can be used to calculate the value
+of Pi. For very large streams (this approximation converges very
+slowly), the value will approach the correct value of Pi if the
+sequence is close to random. A 32768 byte file created by
+radioactive decay yielded:
+.PP
+.nf
+Monte Carlo value for Pi is 3.139648438 (error 0.06 percent).
+.fi
+.PP
+Serial Correlation Coefficient
+.PP
+This quantity measures the extent to which each byte in the file
+depends upon the previous byte. For random sequences, this value
+(which can be positive or negative) will, of course, be close to
+zero. A non-random byte stream such as a C program will yield a
+serial correlation coefficient on the order of 0.5. Wildly
+predictable data such as uncompressed bitmaps will exhibit serial
+correlation coefficients approaching 1. See \fB[Knuth, pp. 64-65]\fR for
+more details.
+.SH OPTIONS
+.IP \fB\-b\fR
+The input is treated as a stream of bits rather than of 8-bit
+bytes. Statistics reported reflect the properties of the
+bitstream.
+.IP \fB\-c\fR
+Print a table of the number of occurrences of each possible byte
+(or bit, if the \fB\-b\fR option is also specified) value, and the
+fraction of the overall file made up by that value. Printable
+characters in the ISO 8859-1 Latin1 character set are shown along
+with their decimal byte values. In non-terse output mode, values
+with zero occurrences are not printed.
+.IP \fB\-f\fR
+Fold upper case letters to lower case before computing statistics.
+Folding is done based on the ISO 8859-1 Latin1 character set, with
+accented letters correctly processed.
+.IP \fB\-t\fR
+Terse mode: output is written in Comma Separated Value (CSV)
+format, suitable for loading into a spreadsheet and easily read by
+any programming language. See Terse Mode Output Format below for
+additional details.
+.IP \fB\-u\fR
+Print how-to-call information.
+.SH FILES
+If no \fIinfile\fR is specified, \fBent\fR obtains its input from standard input.
+Output is always written to standard output.
+.SH TERSE MODE OUTPUT FORMAT
+Terse mode is selected by specifying the \fB\-t\fR option on the command line.
+Terse mode output is written in Comma Separated Value (CSV) format, which
+can be directly loaded into most spreadsheet programs and is easily read
+by any programming language. Each record in the CSV file begins with a
+record type field, which identifies the content of the following fields.
+If the \fB\-c\fR option is not specified, the terse mode output will consist of
+two records, as follows:
+.PP
+.nf
+0,File-bytes,Entropy,Chi-square,Mean,Monte-Carlo-Pi,Serial-Correlation
+1,file_length,entropy,chi_square,mean,Pi_value,correlation
+.fi
+.PP
+where the italicised values in the type 1 record are the numerical values
+for the quantities named in the type 0 column title record. If the \fB\-b\fR
+option is specified, the second field of the type 0 record will be
+"File-bits", and the file_length field in type 1 record will be given in
+bits instead of bytes. If the \fB\-c\fR option is specified, additional records
+are appended to the terse mode output which contain the character counts:
+.PP
+.nf
+2,Value,Occurrences,Fraction
+3,v,count,fraction
+. . .
+.fi
+.PP
+If the \fB\-b\fR option is specified, only two type 3 records will appear for the
+two bit values v=0 and v=1. Otherwise, 256 type 3 records are included,
+one for each possible byte value. The second field of a type 3 record
+indicates how many bytes (or bits) of value v appear in the input, and
+fraction gives the decimal fraction of the file which has value v (which
+is equal to the count value of this record divided by the file_length
+field in the type 1 record).
+.SH BUGS
+Note that the "optimal compression" shown for the file is computed from
+the byte- or bit-stream entropy and thus reflects compressibility based on
+a reading frame of the chosen width (8-bit bytes or individual bits if the
+\fB\-b\fR option is specified). Algorithms which use a larger reading frame, such
+as the Lempel-Ziv \fB[Lempel & Ziv]\fR algorithm, may achieve greater
+compression if the file contains repeated sequences of multiple bytes.
+.SH SEE ALSO
+\fIIntroduction to Probability and Statistics\fR
+.br
+http://www.fourmilab.ch/rpkp/experiments/statistics.html
+.PP
+\fB[Hamming]\fR
+.br
+Hamming, Richard W. \fICoding and Information Theory.\fR Englewood
+Cliffs NJ: Prentice-Hall, 1980.
+.PP
+\fB[Knuth]\fR
+.br
+Knuth, Donald E. \fIThe Art of Computer Programming, Volume 2 /
+Seminumerical Algorithms\fR. Reading MA: Addison-Wesley, 1969. ISBN
+0-201-89684-2.
+.PP
+\fB[Lempel & Ziv]\fR
+.br
+Ziv J. and A. Lempel. "A Universal Algorithm for Sequential Data
+Compression". \fIIEEE Transactions on Information Theory\fR \fB23\fR, 3,
+pp. 337-343.
+.PP
+\fB[Park & Miller]\fR
+.br
+Park, Stephen K. and Keith W. Miller. "Random Number Generators:
+Good Ones Are Hard to Find". \fICommunications of the ACM\fR, October
+1988, p. 1192.
+.SH COPYING
+This software is in the public domain. Permission to use, copy, modify,
+and distribute this software and its documentation for any purpose and
+without fee is hereby granted, without any conditions or restrictions.
+This software is provided "as is" without express or implied warranty.
+.SH AUTHOR
+John Walker
+.br
+October 20th, 1998
diff --git a/academic/ent/ent.SlackBuild b/academic/ent/ent.SlackBuild
new file mode 100644
index 0000000000000..8e2c12dbb93ba
--- /dev/null
+++ b/academic/ent/ent.SlackBuild
@@ -0,0 +1,61 @@
+#!/bin/sh
+
+# Slackware build script for ent
+
+# Written by B. Watson (yalhcru@gmail.com)
+
+PRGNAM=ent
+ARCHIVE=random.zip
+VERSION=${VERSION:-20080128}
+ARCH=${ARCH:-i486}
+BUILD=${BUILD:-1}
+TAG=${TAG:-_SBo}
+
+CWD=$(pwd)
+TMP=${TMP:-/tmp/SBo}
+PKG=$TMP/package-$PRGNAM
+OUTPUT=${OUTPUT:-/tmp}
+
+if [ "$ARCH" = "i486" ]; then
+  SLKCFLAGS="-O2 -march=i486 -mtune=i686"
+elif [ "$ARCH" = "i686" ]; then
+  SLKCFLAGS="-O2 -march=i686 -mtune=i686"
+elif [ "$ARCH" = "x86_64" ]; then
+  SLKCFLAGS="-O2 -fPIC"
+fi
+
+set -e
+
+rm -rf $PKG
+mkdir -p $TMP $PKG $OUTPUT
+cd $TMP
+rm -rf $PRGNAM
+mkdir -p $PRGNAM
+cd $PRGNAM
+unzip $CWD/$ARCHIVE
+chown -R root:root .
+find . \
+ \( -perm 777 -o -perm 775 -o -perm 711 -o -perm 555 -o -perm 511 \) \
+ -exec chmod 755 {} \; -o \
+ \( -perm 666 -o -perm 664 -o -perm 600 -o -perm 444 -o -perm 440 -o -perm 400 \) \
+ -exec chmod 644 {} \;
+
+make CC=gcc CFLAGS="$SLKCFLAGS"
+strip $PRGNAM
+
+mkdir -p $PKG/usr/bin $PKG/usr/man/man1 $PKG/usr/doc/$PRGNAM-$VERSION
+cp $PRGNAM $PKG/usr/bin
+
+# The man page comes with the SlackBuild, not the source
+# (because I converted the HTML page to troff, hope it's OK)
+# ## Sure, that's fine :) --rworkman
+gzip -9c $CWD/$PRGNAM.1 > $PKG/usr/man/man1/$PRGNAM.1.gz
+
+cp -a $PRGNAM.html entitle.gif $PKG/usr/doc/$PRGNAM-$VERSION
+cat $CWD/$PRGNAM.SlackBuild > $PKG/usr/doc/$PRGNAM-$VERSION/$PRGNAM.SlackBuild
+
+mkdir -p $PKG/install
+cat $CWD/slack-desc > $PKG/install/slack-desc
+
+cd $PKG
+/sbin/makepkg -l y -c n $OUTPUT/$PRGNAM-$VERSION-$ARCH-$BUILD$TAG.tgz
diff --git a/academic/ent/ent.info b/academic/ent/ent.info
new file mode 100644
index 0000000000000..9508eea838433
--- /dev/null
+++ b/academic/ent/ent.info
@@ -0,0 +1,8 @@
+PRGNAM="ent"
+VERSION="20080128"
+HOMEPAGE="http://www.fourmilab.ch/random/"
+DOWNLOAD="http://www.fourmilab.ch/random/random.zip"
+MD5SUM="8104a83af1ea5b280da96c92da18eae4"
+MAINTAINER="B. Watson"
+EMAIL="yalhcru@gmail.com"
+APPROVED="rworkman"
diff --git a/academic/ent/slack-desc b/academic/ent/slack-desc
new file mode 100644
index 0000000000000..59d68635ed1d3
--- /dev/null
+++ b/academic/ent/slack-desc
@@ -0,0 +1,19 @@
+# HOW TO EDIT THIS FILE:
+# The "handy ruler" below makes it easier to edit a package description. Line
+# up the first '|' above the ':' following the base package name, and the '|'
+# on the right side marks the last column you can put a character in. You must
+# make exactly 11 lines for the formatting to be correct. It's also
+# customary to leave one space after the ':'.
+
+   |-----handy-ruler------------------------------------------------------|
+ent: ent - A pseudorandom number sequence test program
+ent:
+ent: ent applies various tests to sequences of bytes stored in files
+ent: and reports the results of those tests. The program is useful
+ent: for evaluating pseudorandom number generators for encryption and
+ent: statistical sampling applications, compression algorithms, and other
+ent: applications where the information density of a file is of interest.
+ent: Tests performed include Entropy, Chi-square, Arithmetic Mean, Monte
+ent: Carlo value for Pi, and Serial Correlation Coefficient.
+ent:
+ent:
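
Reviewer note: the ent.1 page added by this patch describes several statistics in prose. The sketch below is a rough Python illustration of three of them (entropy, arithmetic mean, Monte Carlo Pi) plus the "optimum compression" figure. It is not taken from the ent source in random.zip. All function names are hypothetical, the big-endian packing of the 24-bit coordinates is an assumption, and the compression formula (8 - H) / 8 is inferred from the man page's example of 4.9 bits per character giving roughly 38%.

```python
import math
import os
from collections import Counter


def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte-value histogram, in bits per byte."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def optimum_compression_percent(entropy: float) -> float:
    """Size reduction implied by an entropy figure, relative to 8-bit bytes.

    Assumption: inferred from the man page example (4.9 bits -> about 38%).
    """
    return 100.0 * (8.0 - entropy) / 8.0


def arithmetic_mean(data: bytes) -> float:
    """Sum of all byte values divided by the length (about 127.5 if random)."""
    if not data:
        raise ValueError("need at least one byte")
    return sum(data) / len(data)


def monte_carlo_pi(data: bytes) -> float:
    """Estimate Pi from successive six-byte groups used as 24-bit X and Y.

    A point is a "hit" when its distance from the origin is less than the
    radius, as the man page words it. Byte order is an assumption.
    """
    radius = 2**24 - 1
    hits = tries = 0
    for i in range(0, len(data) - 5, 6):
        x = int.from_bytes(data[i:i + 3], "big")
        y = int.from_bytes(data[i + 3:i + 6], "big")
        tries += 1
        if x * x + y * y < radius * radius:
            hits += 1
    if tries == 0:
        raise ValueError("need at least six bytes")
    # The points fall in one quadrant, so hits/tries approximates Pi/4.
    return 4.0 * hits / tries


if __name__ == "__main__":
    sample = os.urandom(60000)  # stand-in for a file's contents
    h = entropy_bits_per_byte(sample)
    print("Entropy = %.6f bits per byte" % h)
    print("Optimum compression: %.0f percent" % optimum_compression_percent(h))
    print("Arithmetic mean = %.4f" % arithmetic_mean(sample))
    print("Monte Carlo Pi = %.9f" % monte_carlo_pi(sample))
```

The chi-square percentage and the serial correlation coefficient are left out on purpose: the man page does not give enough detail to reproduce them without guessing at the original implementation.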