author | Serban Udrea <S.Udrea@gsi.de> | 2010-04-08 23:25:08 -0500 |
---|---|---|
committer | Robby Workman <rworkman@slackbuilds.org> | 2010-05-15 10:25:39 +0200 |
commit | bc9d95f65940e1bc59515f368898724c3a2ca9b9 (patch) | |
tree | d6f4fa780e855eff6fa1fffb7d62fa869b2e1c18 | |
parent | 875ae3ca0352790a2d4ee699f4c6870e8e12f501 (diff) |
libraries/atlas: Added (BLAS implementation)
-rw-r--r-- | libraries/atlas/AMD64K10h64SSE3.tgz | bin | 0 -> 11038 bytes |
-rw-r--r-- | libraries/atlas/README | 11 |
-rw-r--r-- | libraries/atlas/README.SLACKWARE | 100 |
-rwxr-xr-x | libraries/atlas/atlas.SlackBuild | 229 |
-rw-r--r-- | libraries/atlas/atlas.info | 10 |
-rw-r--r-- | libraries/atlas/atlas.patch | 5072 |
-rw-r--r-- | libraries/atlas/slack-desc | 19 |
7 files changed, 5441 insertions, 0 deletions
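The README.SLACKWARE added in this commit states that MAX_MALLOC is given in bytes (256MB by default in the SlackBuild, 64MB in the ATLAS source) and that SYS_DESTDIR must be an absolute path. The sketch below shows one way to prepare such values before calling the script. The helper names `mb_to_bytes` and `is_clean_absolute_path` are illustrative assumptions and do not appear in atlas.SlackBuild; the path check is only intended to behave roughly like the script's own grep test (leading `/`, no `/../` component, no trailing `/..`).

```shell
#!/bin/sh
# Hypothetical helpers, not part of atlas.SlackBuild.

# Convert a size in megabytes to bytes, the unit MAX_MALLOC expects.
mb_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# Succeed only for paths that start with "/" and contain no "/../"
# component or trailing "/..", similar to the SYS_DESTDIR check in the script.
is_clean_absolute_path() {
  case "$1" in
    */../*|*/..) return 1 ;;
    /*)          return 0 ;;
    *)           return 1 ;;
  esac
}

# 256MB is the default the SlackBuild uses for MAX_MALLOC.
MAX_MALLOC=$(mb_to_bytes 256)
echo "MAX_MALLOC=$MAX_MALLOC"
```

A build could then be started with something like `MAX_MALLOC=$(mb_to_bytes 512) SYS_DESTDIR=/opt/atlas sh atlas.SlackBuild`, where `/opt/atlas` is only a placeholder; as the README notes, these variables are set on the command line and the script itself does not need editing.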
diff --git a/libraries/atlas/AMD64K10h64SSE3.tgz b/libraries/atlas/AMD64K10h64SSE3.tgz Binary files differnew file mode 100644 index 0000000000000..727f3748dbe01 --- /dev/null +++ b/libraries/atlas/AMD64K10h64SSE3.tgz diff --git a/libraries/atlas/README b/libraries/atlas/README new file mode 100644 index 0000000000000..d7d3b30931446 --- /dev/null +++ b/libraries/atlas/README @@ -0,0 +1,11 @@ +ATLAS (Automatically Tuned Linear Algebra Software) is an ongoing +research effort focusing on applying empirical techniques in order to +provide portable performance. At present, it provides C and Fortran77 +interfaces to a portably efficient BLAS implementation, as well as a few +routines from LAPACK. + +This requires blas, and it conflicts with cblas (only one of atlas +and cblas may be installed at any given time). Take care with LAPACK +(see notes 3 & 4 in README.SLACKWARE). + +You need to read over README.SLACKWARE *before* building this. diff --git a/libraries/atlas/README.SLACKWARE b/libraries/atlas/README.SLACKWARE new file mode 100644 index 0000000000000..3d7d9be243a3d --- /dev/null +++ b/libraries/atlas/README.SLACKWARE @@ -0,0 +1,100 @@ +ATLAS (Automatically Tuned Linear Algebra Software) is an ongoing +research effort focusing on applying empirical techniques in order to +provide portable performance. At present, it provides C and Fortran77 +interfaces to a portably efficient BLAS implementation, as well as a few +routines from LAPACK. + +IMPORTANT NOTES: + +1) Please note that the present SlackBuild for ATLAS does by no means + try to take into account all configuration/build issues of ATLAS. + Nevertheless, the relevant patches mentioned in the ATLAS Errata + are applied. + +2) The script takes advantage of the fact that the compilers shipped with + Slackware should be OK. It also assumes that you are installing on an x86 + or x86_64 platform. 
If you decide to use other compilers or install on + another platform, you are unfortunately on your own and welcome to suggest + improvements or patches to this SlackBuild. Moreover, there is no "post + install" tuning performed. + +3) ATLAS does not conflict with the reference netlib BLAS (see also note 6). + Nevertheless, if ATLAS got installed successfully you should consider removing + netlib BLAS and (re)compiling every BLAS dependent package (starting with + LAPACK) against ATLAS. Otherwise you may not have much gain from installing + ATLAS and may even get into problems (see next note). + +4) There is a strong interaction between ATLAS and LAPACK. If you want to install + ATLAS just for testing and avoid problems with LAPACK you are urged to make + use of the SYS_DESTDIR variable as explained later. Otherwise consider the + following: + a) It is not recommended to install LAPACK just along ATLAS, i.e. without building + it against ATLAS. Moreover, if LAPACK is already installed you have to first + remove it and later on build it against ATLAS. + b) If ATLAS+LAPACK doesn't work for you, just stick with (netlib) BLAS+LAPACK. + Netlib BLAS is also available as a SlackBuild. + c) If ATLAS+LAPACK is installed you have to recompile and reinstall LAPACK after + each ATLAS upgrade. + +5) ATLAS conflicts with cblas. + +6) You have to have netlib BLAS installed before you install ATLAS. As stated + above, you should consider removing it from your system afterwards. + +INSTALLATION DETAILS: + +1) Make sure CPU throttling is off before starting the install. This is + important, since ATLAS has to tune itself. + +2) For the same reason, keep the load on the system as low as possible + while building ATLAS. + +3) There are a few extra variables which you may want or need + to give appropriate values when calling the atlas.SlackBuild: + MAX_MALLOC, REF_BLAS, USE_ARCH_DEFAULTS, SYS_DESTDIR and + DEFAULT_DOCS. + + MAX_MALLOC is for adjusting the maximal size IN BYTES(!) 
that ATLAS + is allowed to allocate. According to the ATLAS errata, a too small + value may strongly reduce threaded performance. The default value + within this SlackBuild corresponds to 256MB. (The default value in + the ATLAS source corresponds to 64MB.) + + REF_BLAS defaults to the full path to the netlib BLAS library as + installed from the appropriate SlackBuilds.org script. If you have + the netlib BLAS elsewhere, you have to set the appropriate + value to this variable. + + USE_ARCH_DEFAULTS defaults to "yes", which means that the library + will be optimized by trying to take into account former builds done + on a similar machine. Thus ATLAS will use predefined optimizations + if available. This may reduce (much) the compilation time but may + not give you the best result if you don't use the same compiler + version (gcc 4.2) as the ATLAS author. + Please note that with this variable set to "no", or if there are no + known optimizations for your machine ATLAS compilation lasts for + about three hours! Take a nap :-) + + SYS_DESTDIR is set by default to "/usr" and is the system destination + directory. When installing the package produced by this SlackBuild, + ATLAS's files will be written to $SYS_DESTDIR/include, + $SYS_DESTDIR/include/atlas and $SYS_DESTDIR/lib (or lib64). + Documentation files are written to /usr/doc/atlas-$VERSION if not + otherwise stated (see below). + You may want to change the value of SYS_DESTDIR to avoid conflicts (see + IMPORTANT NOTES above). IMPORTANT: SYS_DESTDIR has to have an absolute + path as value. + + DEFAULT_DOCS has the default value "yes", which means that docs go + to /usr/doc/atlas-$VERSION, but you may want to let the docs to + go to $SYS_DESTDIR/doc/atlas-$VERSION. For this, just set this + variable to something like "no". + + All these settings may be done the usual way on the command line when + calling this SlackBuild, you do not have to edit the script. 
+ +If you also installed the LAPACK linked against ATLAS, consider the following: +"IMPORTANT: If you are actually updating this library, i.e. ATLAS, you MUST also +rebuild and reinstall LAPACK, even if there is no update available for LAPACK! +Otherwise you end up with an broken/incomplete LAPACK library! + diff --git a/libraries/atlas/atlas.SlackBuild b/libraries/atlas/atlas.SlackBuild new file mode 100755 index 0000000000000..595b20b2b6798 --- /dev/null +++ b/libraries/atlas/atlas.SlackBuild @@ -0,0 +1,229 @@ +#!/bin/sh + +# Slackware build script for ATLAS + +# Copyright 2010 Serban Udrea <s.udrea@gsi.de> +# All rights reserved. +# +# Redistribution and use of this script, with or without modification, +# is permitted provided that the following conditions are met: +# +# 1. Redistributions of this script must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# +# THIS SOFTWARE IS PROVIDED BY THE AUTHOR ''AS IS'' AND ANY EXPRESS OR +# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED +# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +# DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, +# INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES +# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +# HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, +# STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING +# IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +# POSSIBILITY OF SUCH DAMAGE. 
+ +PRGNAM=atlas +VERSION=${VERSION:-3.8.3} +ARCH=${ARCH:-i486} +BUILD=${BUILD:-1} +TAG=${TAG:-_SBo} + +CWD=$(pwd) +TMP=${TMP:-/tmp/SBo} +PKG=$TMP/package-$PRGNAM +OUTPUT=${OUTPUT:-/tmp} + +if [ "$ARCH" = "i486" ]; then + SLKCFLAGS="-O2 -march=i486 -mtune=i686" + LIBDIRSUFFIX="" + BITSize="32" # Specifically for ATLAS +elif [ "$ARCH" = "i686" ]; then + SLKCFLAGS="-O2 -march=i686 -mtune=i686" + LIBDIRSUFFIX="" + BITSize="32" # Specifically for ATLAS +elif [ "$ARCH" = "x86_64" ]; then + SLKCFLAGS="-O2 -fPIC" + LIBDIRSUFFIX="64" + BITSize="64" # Specifically for ATLAS +fi + +# You may change this to adjust the maximal size IN BYTES(!) that ATLAS +# is allowed to allocate. According to the ATLAS errata, a too small +# value may strongly reduce threaded performance. The default value +# here is 256MB. (The default value in the ATLAS source is 64MB.) +# +MAX_MALLOC=${MAX_MALLOC:-268435456} + +# If you don't want to use architectural defaults set the following to +# something like "no". +USE_ARCH_DEFAULTS=${USE_ARCH_DEFAULTS:-yes} + +# The path to a reference BLAS library. By default it is assumed that you +# have installed the netlib BLAS reference using the appropriate slackbuild +# from slackbuilds.org. If this is not the case, you have to run this script +# with another value for REF_BLAS. +REF_BLAS=${REF_BLAS:-/usr/lib${LIBDIRSUFFIX}/libblas.a} + +# Let's do a little check (that we deal with a regular file we can read). +[ -f "$REF_BLAS" -a -r "$REF_BLAS" ] || \ +{ echo "ERROR: Wrong path to reference BLAS library, exiting! " && exit 1; } + +# This is the system destination directory. When installing the +# package produced by this script, ATLAS's files will be written to +# $SYS_DESTDIR/include, $SYS_DESTDIR/include/atlas, $SYS_DESTDIR/lib +# or $SYS_DESTDIR/lib64 ond appropriate platforms, etc. +# Nevertheless, by default the documentation files go to +# /usr/doc/$PRGNAM-$VERSION. You may change this through the variable +# DEFAULT_DOCS, see below. 
+# +SYS_DESTDIR=${SYS_DESTDIR:-/usr} + +# Check if SYS_DESTDIR is an absolute path. If not, exit with error. +# NOTE: The $ is used because echo adds a \n at the end of the string. +echo $SYS_DESTDIR | grep -vE '/\.\./|/\.\.$' | grep -qE '^/' || \ +{ echo "ERROR: The system destination directory has no absolute path!" \ +&& echo " The value of SYS_DESTDIR is $SYS_DESTDIR" \ +&& echo " Please set it properly! " \ +&& exit 1; } + +# You may want to have the documentation files installed under +# $SYS_DESTDIR/doc/$PRGNAM-$VERSION not /usr/doc/$PRGNAM-$VERSION. +# To achieve this just set the following variable to something like +# "no". +# +DEFAULT_DOCS=${DEFAULT_DOCS:-yes} + +# The build directory to be created within the source directory of +# ATLAS. +BLDdir=BuildDir + +# Get the CPU frequency for good timing. +CPU_FREQ="$(cat /proc/cpuinfo |grep "cpu MHz"| head -n 1| cut -d ":" -s -f2| tr -d [:blank:])" + +set -e # Exit on most errors + +rm -rf $PKG +mkdir -p $TMP $PKG $OUTPUT + +cd $TMP +rm -rf $PRGNAM-$VERSION +tar xvf $CWD/${PRGNAM}${VERSION}.tar.bz2 +mv ATLAS $PRGNAM-$VERSION +cd $PRGNAM-$VERSION + +chown -R root:root . + +find . \ + \( -perm 777 -o -perm 775 -o -perm 711 -o -perm 555 -o -perm 511 \) \ + -exec chmod 755 {} \; -o \ + \( -perm 666 -o -perm 664 -o -perm 600 -o -perm 444 -o -perm 440 -o -perm 400 \) \ + -exec chmod 644 {} \; + +# Make changes as suggested in the atlas errata. +cat $CWD/atlas.patch | sed -e s%XXX_MaxMalloc_XXX%$MAX_MALLOC% | patch -p1 + +# If architectural defaults are to be used, copy the file mentioned in the errata +# to the architectural defaults directory. +case "$USE_ARCH_DEFAULTS" in + [yY]|[yY][eE]|[yY][eE][sS]) cp "$CWD/AMD64K10h64SSE3.tgz" CONFIG/ARCHS; USE_ARCH_DEFAULTS="1" ;; + *) USE_ARCH_DEFAULTS="0" ;; +esac + +mkdir -p $BLDdir +cd $BLDdir + +# Configure atlas. +# IMPORTANT: Here we assume that we are on a x86 machine (be it 32 or 64 bits) +# and gcc or icc is the compiler to be used. 
This should be presently +# a reasonable assumption with Slackware. Under other circumstances +# "-DPentiumCPS=$CPU_FREQ" has to be exchanged with "-DWALL". +# +../configure -Si archdef "$USE_ARCH_DEFAULTS" -b "$BITSize" -D c \ +-DPentiumCPS="$CPU_FREQ" -Fa alg -fPIC + +# NOTES ON THE FLAGS FOR CONFIGURE +# +# -Si archdef "$USE_ARCH_DEFAULTS" means that we ignore or not architectural defaults depending +# upon the value of "$USE_ARCH_DEFAULTS". +# -b "$BITSize" tells ATLAS about the platform's bitsize, 32 or 64. +# -D c -DPentiumCPS="$CPU_FREQ" is for achieving good timing on x86 platforms with gcc or icc. +# -Fa alg -fPIC is for beeing able to create dynamic libs too. + +# The next two variables are set and their values are finally saved +# for using them to compile lapack. +# Remember the compiler name. +ATLAS_COMPILER="$(grep "F77 =" Make.inc | cut -d "=" -f1 --complement)" + +# Remember the fortran compilation flags. +ATLAS_F77FLAGS="$(grep "F77FLAGS =" Make.inc | cut -d "=" -f1 --complement)" + +# Set the path to the reference BLAS. +sed -i -e '/^ \+BLASlib/s%BLASlib = .*%BLASlib = '"$REF_BLAS"% \ + Make.inc + +make +make check + +# If parallel libraries have been compiled check them too. +if [ -f lib/libptcblas.a ]; then + make ptcheck + PARALLEL_LIBS="yes" # We will use this when creating dynamic libs. +fi + +# Install the static libs created during the build process. +make install DESTDIR=$PKG$SYS_DESTDIR + +# Go to the ATLAS $BLDdir/lib directory and try to create and install +# the dynamic libraries. +# NOTE: The test for the presence of static parallel libs and the command to actually build the +# shared parallel libs are connected by a logical OR to make sure that the subshell +# does not exit with non-zero error code just because static parallel libs didn't +# get built. Therefore the test is successful if the variable PARALLEL_LIBS is unset or +# empty, i.e. when no static parallel libs got built. 
+( cd lib && make shared && \ + { [ "${PARALLEL_LIBS}1" = "1" ] || make ptshared; } && \ + cp -p *.so "$PKG$SYS_DESTDIR/lib" +) + +find $PKG | xargs file | grep -e "executable" -e "shared object" | grep ELF \ + | cut -f 1 -d : | xargs strip --strip-unneeded 2> /dev/null || true + +# This is probably the easiest way to make sure that we install in the +# proper place. +if [ ! -z $LIBDIRSUFFIX ]; then + mv $PKG$SYS_DESTDIR/lib $PKG$SYS_DESTDIR/lib${LIBDIRSUFFIX} +fi + +# Create the doc directory for atlas and populate it. +case "$DEFAULT_DOCS" in + [nN]|[nN][oO]) DOC_DIR="$PKG$SYS_DESTDIR/doc/$PRGNAM-$VERSION" ;; + *) DOC_DIR="$PKG/usr/doc/$PRGNAM-$VERSION" ;; +esac + +mkdir -p $DOC_DIR +cp -a ../INSTALL.txt ../README ../doc $DOC_DIR + +# The following makefiles may be needed to merge atlas and lapack. +mkdir $DOC_DIR/MAKEFILES +cp -p Make.inc $DOC_DIR/MAKEFILES +cp -p lib/Makefile $DOC_DIR/MAKEFILES/Makefile.lib + +# Create a file with the build flags for atlas. Needed to merge +# ATLAS and LAPACK. The LAPACK SlackBuild will just have to source +# this file to find out the compiler used for ATLAS and the build +# flags. +echo "ATLAS_COMPILER=\"$ATLAS_COMPILER\"" > "$DOC_DIR/SETTINGS" +echo "ATLAS_F77FLAGS=\"$ATLAS_F77FLAGS\"" >> "$DOC_DIR/SETTINGS" +sed -i -e s'%=" %="%' "$DOC_DIR/SETTINGS" # Remove the extra space after the "=" sign +echo "ATLAS_NOOPT=\"-O0\" #Eventually add more options within the quotes." >> "$DOC_DIR/SETTINGS" + +# Add the Slackbuild script and README.SLACKWARE to the docs. 
+cat $CWD/$PRGNAM.SlackBuild > $DOC_DIR/$PRGNAM.SlackBuild +cat $CWD/README.SLACKWARE > $DOC_DIR/README.SLACKWARE + +mkdir -p $PKG/install +cat $CWD/slack-desc > $PKG/install/slack-desc + +cd "$PKG" +/sbin/makepkg -l y -c n $OUTPUT/$PRGNAM-$VERSION-$ARCH-$BUILD$TAG.${PKGTYPE:-tgz} diff --git a/libraries/atlas/atlas.info b/libraries/atlas/atlas.info new file mode 100644 index 0000000000000..d223fcc27d504 --- /dev/null +++ b/libraries/atlas/atlas.info @@ -0,0 +1,10 @@ +PRGNAM="atlas" +VERSION="3.8.3" +HOMEPAGE="http://math-atlas.sourceforge.net/" +DOWNLOAD="http://downloads.sourceforge.net/math-atlas/atlas3.8.3.tar.bz2" +MD5SUM="6c13be94a87178e7582111c08e9503bc" +DOWNLOAD_x86_64="" +MD5SUM_x86_64="" +MAINTAINER="Serban Udrea" +EMAIL="S.Udrea@gsi.de" +APPROVED="rworkman" diff --git a/libraries/atlas/atlas.patch b/libraries/atlas/atlas.patch new file mode 100644 index 0000000000000..dea4dcc0b2eeb --- /dev/null +++ b/libraries/atlas/atlas.patch @@ -0,0 +1,5072 @@ +diff -rupN ATLAS/CONFIG/src/backend/archinfo_x86.c atlas-3.8.3/CONFIG/src/backend/archinfo_x86.c +--- ATLAS/CONFIG/src/backend/archinfo_x86.c 2009-02-18 19:47:37.000000000 +0100 ++++ atlas-3.8.3/CONFIG/src/backend/archinfo_x86.c 2009-11-12 13:47:23.777451677 +0100 +@@ -320,7 +320,7 @@ enum MACHTYPE Chip2Mach(enum CHIP chip, + iret = IntP4; + break; + case 3: +- case 4: ++ case 4: ; case 6: + iret = IntP4E; + break; + default: +diff -rupN ATLAS/include/atlas_lvl3.h atlas-3.8.3/include/atlas_lvl3.h +--- ATLAS/include/atlas_lvl3.h 2009-02-18 19:47:35.000000000 +0100 ++++ atlas-3.8.3/include/atlas_lvl3.h 2009-11-12 13:52:49.308496090 +0100 +@@ -126,7 +126,7 @@ + #define CPAT Mjoin(C_ATL_, PRE); + + #ifndef ATL_MaxMalloc +- #define ATL_MaxMalloc 67108864 ++ #define ATL_MaxMalloc XXX_MaxMalloc_XXX + #endif + + typedef void (*MAT2BLK)(int, int, const TYPE*, int, TYPE*, const SCALAR); +diff -rupN ATLAS/src/blas/gemm/ATL_cmmJITcp.c atlas-3.8.3/src/blas/gemm/ATL_cmmJITcp.c +--- ATLAS/src/blas/gemm/ATL_cmmJITcp.c 
2009-02-18 19:47:44.000000000 +0100 ++++ atlas-3.8.3/src/blas/gemm/ATL_cmmJITcp.c 2009-11-12 12:44:34.816529051 +0100 +@@ -268,7 +268,8 @@ static void Mjoin(PATL,mmK) + { + NBmm0 = NBmm1 = NBmmX = Mjoin(PATLU,pKBmm); + if (SCALAR_IS_ZERO(beta)) +- Mjoin(PATL,gezero)(M, N, C, ldc); ++ /* Mjoin(PATL,gezero)(M, N, C, ldc); */ ++ { Mjoin(PATLU,gezero)(M, N, pC, ldpc); Mjoin(PATLU,gezero)(M, N, pC+ipc, ldpc); } + } + if (nblk) + { +diff -rupN ATLAS/src/blas/gemm/ATL_gereal2cplx.c atlas-3.8.3/src/blas/gemm/ATL_gereal2cplx.c +--- ATLAS/src/blas/gemm/ATL_gereal2cplx.c 2009-02-18 19:47:44.000000000 +0100 ++++ atlas-3.8.3/src/blas/gemm/ATL_gereal2cplx.c 2009-11-12 12:49:49.331651677 +0100 +@@ -43,7 +43,53 @@ void Mjoin(PATL,gereal2cplx) + const int ldc2 = (ldc-M)<<1; + int i, j; + +- if (ialp == ATL_rzero && ibet == ATL_rzero) ++/* ++ * Cannot read C if BETA is 0 ++ */ ++ if (rbet == ATL_rzero && ibet == ATL_rzero) ++ { ++ if (ialp == ATL_rzero) /* alpha is a real number */ ++ { ++ if (ralp == ATL_rone) /* alpha = 1.0 */ ++ { ++ for (j=0; j < N; j++, R += ldr, I += ldi, C += ldc2) ++ { ++ for (i=0; i < M; i++, C += 2) ++ { ++ *C = R[i]; ++ C[1] = I[i]; ++ } ++ } ++ } ++ else ++ { ++ for (j=0; j < N; j++, R += ldr, I += ldi, C += ldc2) ++ { ++ for (i=0; i < M; i++, C += 2) ++ { ++ *C = ralp * R[i]; ++ C[1] = ralp * I[i]; ++ } ++ } ++ } ++ } ++ else /* alpha is a complex number */ ++ { ++ for (j=0; j < N; j++, R += ldr, I += ldi, C += ldc2) ++ { ++ for (i=0; i < M; i++, C += 2) ++ { ++ ra = R[i]; ia = I[i]; ++ C[0] = ralp * ra - ialp * ia; ++ C[1] = ralp * ia + ialp * ra; ++ } ++ } ++ } ++ } ++/* ++ * If alpha and beta are both real numbers ++ */ ++ else if (ialp == ATL_rzero && ibet == ATL_rzero) + { + if (ralp == ATL_rone && rbet == ATL_rone) + { +diff -rupN ATLAS/tune/blas/gemm/CASES/ATL_smm14x1x84_sseCU.c atlas-3.8.3/tune/blas/gemm/CASES/ATL_smm14x1x84_sseCU.c +--- ATLAS/tune/blas/gemm/CASES/ATL_smm14x1x84_sseCU.c 2009-02-18 19:48:26.000000000 +0100 ++++ 
atlas-3.8.3/tune/blas/gemm/CASES/ATL_smm14x1x84_sseCU.c 2009-11-12 12:35:50.453038827 +0100 +@@ -27,6 +27,13 @@ + * POSSIBILITY OF SUCH DAMAGE. + * + */ ++#if KB > 84 ++ #error "KB cannot exceed 84!" ++#endif ++#if (KB/4)*4 != KB ++ #error "KB must be a multiple of 4!" ++#endif ++ + #ifndef ATL_GAS_x8664 + #error "This kernel requires x86-64 assembly!" + #endif +@@ -58,25 +65,25 @@ + * Integer register usage shown be these defines + */ + #define pA %rcx +-#define pA10 %rbx +-#define ldab %rbp +-#define mldab %rdx ++#define pA10 %rbx ++#define ldab %rbp ++#define mldab %rdx + #define mldab5 %rax + #define pB %rdi + #define pC %rsi + #define incCn %r10 + #define stM %r9 + #define stN %r11 +-#define pfA %r8 +-#define pA5 pA +-#define pB0 pB ++#define pfA %r8 ++#define pA5 pA ++#define pB0 pB + #if MB == 0 +- #define stM0 %r12 +- #define incAm %r13 ++ #define stM0 %r12 ++ #define incAm %r13 + #endif + /* rax used in 32/64 conversion */ + +-#define NBso (KB*4) ++#define NBso (KB*4) + #define MBKBso (MB*KB*4) + #define NB2so (NBso+NBso) + #define NB3so (NBso+NBso+NBso) +@@ -95,22 +102,22 @@ + /* + * SSE2 register usage shown be these defines + */ +-#define rA0 %xmm0 +-#define rB0 %xmm1 +-#define rC0 %xmm2 +-#define rC1 %xmm3 +-#define rC2 %xmm4 +-#define rC3 %xmm5 +-#define rC4 %xmm6 +-#define rC5 %xmm7 +-#define rC6 %xmm8 +-#define rC7 %xmm9 +-#define rC8 %xmm10 +-#define rC9 %xmm11 +-#define rC10 %xmm12 +-#define rC11 %xmm13 +-#define rC12 %xmm14 +-#define rC13 %xmm15 ++#define rA0 %xmm0 ++#define rB0 %xmm1 ++#define rC0 %xmm2 ++#define rC1 %xmm3 ++#define rC2 %xmm4 ++#define rC3 %xmm5 ++#define rC4 %xmm6 ++#define rC5 %xmm7 ++#define rC6 %xmm8 ++#define rC7 %xmm9 ++#define rC8 %xmm10 ++#define rC9 %xmm11 ++#define rC10 %xmm12 ++#define rC11 %xmm13 ++#define rC12 %xmm14 ++#define rC13 %xmm15 + /* + * Prefetch defines + */ +@@ -127,99 +134,99 @@ + #if MB != 0 + #define incAm $MBKBso-NB14so+176 + #endif +- .text ++ .text + .global ATL_asmdecor(ATL_USERMM) + 
ATL_asmdecor(ATL_USERMM): + /* + * Save callee-saved iregs + */ +- movq %rbp, -8(%rsp) +- movq %rbx, -16(%rsp) ++ movq %rbp, -8(%rsp) ++ movq %rbx, -16(%rsp) + #if MB == 0 +- movq %r12, -32(%rsp) +- movq %r13, -40(%rsp) ++ movq %r12, -32(%rsp) ++ movq %r13, -40(%rsp) + #endif + #ifdef BETAX + #define BOF -56 +- movss %xmm1, BOF(%rsp) +- movss %xmm1, BOF+4(%rsp) +- movss %xmm1, BOF+8(%rsp) +- movss %xmm1, BOF+12(%rsp) ++ movss %xmm1, BOF(%rsp) ++ movss %xmm1, BOF+4(%rsp) ++ movss %xmm1, BOF+8(%rsp) ++ movss %xmm1, BOF+12(%rsp) + #endif + /* + * pA already comes in right reg + * Initialize pB = B; pC = C; NBso = NB * sizeof; + */ +- movq %rsi, stN +- movq %rdi, %rax +- movq 16(%rsp), pC +- prefC((pC)) +- prefC(64(pC)) +- movq %r9, pB +- prefB((pB)) +- prefB(64(pB)) +- movq %rax, stM ++ movq %rsi, stN ++ movq %rdi, %rax ++ movq 16(%rsp), pC ++ prefC((pC)) ++ prefC(64(pC)) ++ movq %r9, pB ++ prefB((pB)) ++ prefB(64(pB)) ++ movq %rax, stM + /* + * stM = pA + NBNBso; stN = pB + NBNBso; + */ + #if MB == 0 +- movq stM, pfA +- imulq $NBso, pfA +- prefB(128(pB)) +- movq pfA, incAm +- addq pA5, pfA +- addq $176-NB14so, incAm ++ movq stM, pfA ++ imulq $NBso, pfA ++ prefB(128(pB)) ++ movq pfA, incAm ++ addq pA5, pfA ++ addq $176-NB14so, incAm + #else +- movq $MBKBso, pfA +- addq pA5, pfA +- prefB(128(pB)) ++ movq $MBKBso, pfA ++ addq pA5, pfA ++ prefB(128(pB)) + #endif + /* + * convert ldc to 64 bits, and then set incCn = (ldc - MB)*sizeof + */ +- movl 24(%rsp), %eax +- cltq +- movq %rax, incCn +- subq stM, incCn +- addq $14, incCn ++ movl 24(%rsp), %eax ++ cltq ++ movq %rax, incCn ++ subq stM, incCn ++ addq $14, incCn + #ifdef SREAL +- shl $2, incCn ++ shl $2, incCn + #else +- shl $3, incCn +- prefC(128(pC)) +- prefC(192(pC)) ++ shl $3, incCn ++ prefC(128(pC)) ++ prefC(192(pC)) + #endif + /* + * Find M/14 if MB is not set + */ + #if MB == 0 +- cmp $84, stM +- jne MB_LT84 +-/* movq $84/14, stM */ +- movq $6, stM ++ cmp $84, stM ++ jne MB_LT84 ++/* movq $84/14, stM */ ++ movq 
$6, stM + MBFOUND: +- subq $1, stM +- movq stM, stM0 ++ subq $1, stM ++ movq stM, stM0 + #endif +- addq $120, pA5 +- addq $120, pB0 +- movq $KB*4, ldab +- movq $-KB*5*4, mldab5 +- movq $-KB*4, mldab +- subq mldab5, pA5 +- lea KB*4(pA5, ldab,4), pA10 +-/* movq $NB, stN */ ++ addq $120, pA5 ++ addq $120, pB0 ++ movq $KB*4, ldab ++ movq $-KB*5*4, mldab5 ++ movq $-KB*4, mldab ++ subq mldab5, pA5 ++ lea KB*4(pA5, ldab,4), pA10 ++/* movq $NB, stN */ + + UNLOOP: + #if MB == 0 +- movq stM0, stM +- cmp $0, stM +- je MLAST ++ movq stM0, stM ++ cmp $0, stM ++ je MLAST + #else + #ifdef ATL_DivAns +- movq $ATL_DivAns-1, stM ++ movq $ATL_DivAns-1, stM + #else +- movq $MB/14-1, stM ++ movq $MB/14-1, stM + #endif + #endif + #if MB == 0 || MB > 14 +@@ -227,992 +234,992 @@ UMLOOP: + /* + * rC[0-13] = pC[0-13] * beta + */ +- ALIGN16 ++ ALIGN16 + /*UKLOOP: */ + #ifdef BETA1 +- movaps 0-120(pA10,mldab5,2), rC0 +- movaps 0-120(pB0), rB0 +- mulps rB0, rC0 +- addss (pC), rC0 +- movaps 0-120(pA5, mldab,4), rC1 +- mulps rB0, rC1 +- addss CMUL(4)(pC), rC1 +- movaps 0-120(pA10, mldab,8), rC2 +- mulps rB0, rC2 +- addss CMUL(8)(pC), rC2 +- movaps 0-120(pA5, mldab,2), rC3 +- mulps rB0, rC3 +- addss CMUL(12)(pC), rC3 +- movaps 0-120(pA5, mldab), rC4 +- mulps rB0, rC4 +- addss CMUL(16)(pC), rC4 +- movaps 0-120(pA5), rC5 +- mulps rB0, rC5 +- addss CMUL(20)(pC), rC5 +- movaps 0-120(pA5, ldab), rC6 +- mulps rB0, rC6 +- addss CMUL(24)(pC), rC6 +- movaps 0-120(pA5, ldab,2), rC7 +- mulps rB0, rC7 +- addss CMUL(28)(pC), rC7 +- movaps 0-120(pA10, mldab,2), rC8 +- mulps rB0, rC8 +- addss CMUL(32)(pC), rC8 +- movaps 0-120(pA5,ldab,4), rC9 +- mulps rB0, rC9 +- addss CMUL(36)(pC), rC9 +- movaps 0-120(pA10), rC10 +- mulps rB0, rC10 +- addss CMUL(40)(pC), rC10 +- movaps 0-120(pA10,ldab), rC11 +- mulps rB0, rC11 +- addss CMUL(44)(pC), rC11 +- movaps 0-120(pA10,ldab,2), rC12 +- mulps rB0, rC12 +- addss CMUL(48)(pC), rC12 +- movaps 0-120(pA5,ldab,8), rC13 +- mulps rB0, rC13 +- addss CMUL(52)(pC), rC13 ++ movaps 
0-120(pA10,mldab5,2), rC0 ++ movaps 0-120(pB0), rB0 ++ mulps rB0, rC0 ++ addss (pC), rC0 ++ movaps 0-120(pA5, mldab,4), rC1 ++ mulps rB0, rC1 ++ addss CMUL(4)(pC), rC1 ++ movaps 0-120(pA10, mldab,8), rC2 ++ mulps rB0, rC2 ++ addss CMUL(8)(pC), rC2 ++ movaps 0-120(pA5, mldab,2), rC3 ++ mulps rB0, rC3 ++ addss CMUL(12)(pC), rC3 ++ movaps 0-120(pA5, mldab), rC4 ++ mulps rB0, rC4 ++ addss CMUL(16)(pC), rC4 ++ movaps 0-120(pA5), rC5 ++ mulps rB0, rC5 ++ addss CMUL(20)(pC), rC5 ++ movaps 0-120(pA5, ldab), rC6 ++ mulps rB0, rC6 ++ addss CMUL(24)(pC), rC6 ++ movaps 0-120(pA5, ldab,2), rC7 ++ mulps rB0, rC7 ++ addss CMUL(28)(pC), rC7 ++ movaps 0-120(pA10, mldab,2), rC8 ++ mulps rB0, rC8 ++ addss CMUL(32)(pC), rC8 ++ movaps 0-120(pA5,ldab,4), rC9 ++ mulps rB0, rC9 ++ addss CMUL(36)(pC), rC9 ++ movaps 0-120(pA10), rC10 ++ mulps rB0, rC10 ++ addss CMUL(40)(pC), rC10 ++ movaps 0-120(pA10,ldab), rC11 ++ mulps rB0, rC11 ++ addss CMUL(44)(pC), rC11 ++ movaps 0-120(pA10,ldab,2), rC12 ++ mulps rB0, rC12 ++ addss CMUL(48)(pC), rC12 ++ movaps 0-120(pA5,ldab,8), rC13 ++ mulps rB0, rC13 ++ addss CMUL(52)(pC), rC13 + #else +- movaps 0-120(pA10,mldab5,2), rC0 +- movaps 0-120(pB0), rC13 +- mulps rC13, rC0 +- movaps 0-120(pA5, mldab,4), rC1 +- mulps rC13, rC1 +- movaps 0-120(pA10, mldab,8), rC2 +- mulps rC13, rC2 +- movaps 0-120(pA5, mldab,2), rC3 +- mulps rC13, rC3 +- movaps 0-120(pA5, mldab), rC4 +- mulps rC13, rC4 +- movaps 0-120(pA5), rC5 +- mulps rC13, rC5 +- movaps 0-120(pA5, ldab), rC6 +- mulps rC13, rC6 +- movaps 0-120(pA5, ldab,2), rC7 +- mulps rC13, rC7 +- movaps 0-120(pA10, mldab,2), rC8 +- mulps rC13, rC8 +- movaps 0-120(pA5,ldab,4), rC9 +- mulps rC13, rC9 +- movaps 0-120(pA10), rC10 +- mulps rC13, rC10 +- movaps 0-120(pA10,ldab), rC11 +- mulps rC13, rC11 +- movaps 0-120(pA10,ldab,2), rC12 +- mulps rC13, rC12 +- mulps 0-120(pA5,ldab,8), rC13 ++ movaps 0-120(pA10,mldab5,2), rC0 ++ movaps 0-120(pB0), rC13 ++ mulps rC13, rC0 ++ movaps 0-120(pA5, mldab,4), rC1 ++ mulps rC13, rC1 ++ 
movaps 0-120(pA10, mldab,8), rC2 ++ mulps rC13, rC2 ++ movaps 0-120(pA5, mldab,2), rC3 ++ mulps rC13, rC3 ++ movaps 0-120(pA5, mldab), rC4 ++ mulps rC13, rC4 ++ movaps 0-120(pA5), rC5 ++ mulps rC13, rC5 ++ movaps 0-120(pA5, ldab), rC6 ++ mulps rC13, rC6 ++ movaps 0-120(pA5, ldab,2), rC7 ++ mulps rC13, rC7 ++ movaps 0-120(pA10, mldab,2), rC8 ++ mulps rC13, rC8 ++ movaps 0-120(pA5,ldab,4), rC9 ++ mulps rC13, rC9 ++ movaps 0-120(pA10), rC10 ++ mulps rC13, rC10 ++ movaps 0-120(pA10,ldab), rC11 ++ mulps rC13, rC11 ++ movaps 0-120(pA10,ldab,2), rC12 ++ mulps rC13, rC12 ++ mulps 0-120(pA5,ldab,8), rC13 + #endif + + #if KB > 4 +- movaps 16-120(pA10,mldab5,2), rA0 +- movaps 16-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 16-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 16-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 16-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 16-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 16-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 16-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 16-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 16-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 16-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 16-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 16-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 16-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 16-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 16-120(pA10,mldab5,2), rA0 ++ movaps 16-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 16-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 16-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 16-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 16-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps 
rA0, rC4 ++ movaps 16-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 16-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 16-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 16-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 16-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 16-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 16-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 16-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 16-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 8 +- movaps 32-120(pA10,mldab5,2), rA0 +- movaps 32-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 32-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 32-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 32-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 32-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 32-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 32-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 32-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 32-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 32-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 32-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 32-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 32-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 32-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 32-120(pA10,mldab5,2), rA0 ++ movaps 32-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 32-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 32-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 32-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 32-120(pA5, mldab), rA0 ++ mulps rB0, rA0 
++ addps rA0, rC4 ++ movaps 32-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 32-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 32-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 32-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 32-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 32-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 32-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 32-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 32-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 12 +- movaps 48-120(pA10,mldab5,2), rA0 +- movaps 48-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 48-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 48-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 48-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 48-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 48-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 48-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 48-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 48-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 48-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 48-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 48-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 48-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 48-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 48-120(pA10,mldab5,2), rA0 ++ movaps 48-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 48-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 48-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 48-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 48-120(pA5, mldab), rA0 ++ mulps 
rB0, rA0 ++ addps rA0, rC4 ++ movaps 48-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 48-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 48-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 48-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 48-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 48-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 48-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 48-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 48-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 16 +- movaps 64-120(pA10,mldab5,2), rA0 +- movaps 64-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 64-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 64-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 64-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 64-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 64-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 64-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 64-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 64-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 64-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 64-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 64-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 64-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 64-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 64-120(pA10,mldab5,2), rA0 ++ movaps 64-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 64-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 64-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 64-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 64-120(pA5, mldab), rA0 
++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 64-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 64-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 64-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 64-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 64-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 64-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 64-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 64-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 64-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 20 +- movaps 80-120(pA10,mldab5,2), rA0 +- movaps 80-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 80-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 80-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 80-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 80-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 80-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 80-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 80-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 80-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 80-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 80-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 80-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 80-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 80-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 80-120(pA10,mldab5,2), rA0 ++ movaps 80-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 80-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 80-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 80-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 80-120(pA5, 
mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 80-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 80-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 80-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 80-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 80-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 80-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 80-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 80-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 80-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 24 +- movaps 96-120(pA10,mldab5,2), rA0 +- movaps 96-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 96-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 96-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 96-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 96-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 96-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 96-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 96-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 96-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 96-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 96-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 96-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 96-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 96-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 96-120(pA10,mldab5,2), rA0 ++ movaps 96-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 96-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 96-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 96-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 
96-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 96-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 96-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 96-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 96-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 96-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 96-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 96-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 96-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 96-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 28 +- movaps 112-120(pA10,mldab5,2), rA0 +- movaps 112-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 112-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 112-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 112-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 112-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 112-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 112-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 112-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 112-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 112-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 112-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 112-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 112-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 112-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 112-120(pA10,mldab5,2), rA0 ++ movaps 112-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 112-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 112-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 112-120(pA5, mldab,2), rA0 ++ mulps rB0, 
rA0 ++ addps rA0, rC3 ++ movaps 112-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 112-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 112-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 112-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 112-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 112-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 112-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 112-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 112-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 112-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + #ifndef SREAL +- pref2((pfA)) +- pref2(64(pfA)) ++ pref2((pfA)) ++ pref2(64(pfA)) + #endif + + #if KB > 32 +- movaps 128-120(pA10,mldab5,2), rA0 +- movaps 128-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 128-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 128-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 128-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 128-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 128-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 128-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 128-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 128-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 128-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 128-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 128-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 128-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 128-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 128-120(pA10,mldab5,2), rA0 ++ movaps 128-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 128-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps 
rA0, rC1 ++ movaps 128-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 128-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 128-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 128-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 128-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 128-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 128-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 128-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 128-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 128-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 128-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 128-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 36 +- movaps 144-120(pA10,mldab5,2), rA0 +- movaps 144-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 144-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 144-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 144-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 144-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 144-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 144-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 144-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 144-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 144-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 144-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 144-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 144-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 144-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 144-120(pA10,mldab5,2), rA0 ++ movaps 144-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 144-120(pA5, 
mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 144-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 144-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 144-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 144-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 144-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 144-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 144-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 144-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 144-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 144-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 144-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 144-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 40 +- movaps 160-120(pA10,mldab5,2), rA0 +- movaps 160-120(pB0), rB0 +- mulps rB0, rA0 +- addq $176, pB0 +- addps rA0, rC0 +- movaps 160-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 160-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 160-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 160-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 160-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 160-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 160-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 160-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 160-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 160-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 160-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 160-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addq $176, pA10 +- addps rA0, rC12 +- mulps 160-120(pA5,ldab,8), rB0 +- addps rB0, rC13 +- addq $176, pA5 ++ movaps 
160-120(pA10,mldab5,2), rA0 ++ movaps 160-120(pB0), rB0 ++ mulps rB0, rA0 ++ addq $176, pB0 ++ addps rA0, rC0 ++ movaps 160-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 160-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 160-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 160-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 160-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 160-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 160-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 160-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 160-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 160-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 160-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 160-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addq $176, pA10 ++ addps rA0, rC12 ++ mulps 160-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 ++ addq $176, pA5 + #else +- addq $176, pB0 +- addq $176, pA10 +- addq $176, pA5 ++ addq $176, pB0 ++ addq $176, pA10 ++ addq $176, pA5 + #endif + + #if KB > 44 +- movaps 0-120(pA10,mldab5,2), rA0 +- movaps 0-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 0-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 0-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 0-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 0-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 0-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 0-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 0-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 0-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 0-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 0-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 
0-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 0-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 0-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 0-120(pA10,mldab5,2), rA0 ++ movaps 0-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 0-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 0-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 0-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 0-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 0-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 0-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 0-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 0-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 0-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 0-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 0-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 0-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 0-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 48 +- movaps 16-120(pA10,mldab5,2), rA0 +- movaps 16-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 16-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 16-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 16-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 16-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 16-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 16-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 16-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 16-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 16-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 16-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 
16-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 16-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 16-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 16-120(pA10,mldab5,2), rA0 ++ movaps 16-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 16-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 16-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 16-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 16-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 16-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 16-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 16-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 16-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 16-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 16-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 16-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 16-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 16-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 52 +- movaps 32-120(pA10,mldab5,2), rA0 +- movaps 32-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 32-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 32-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 32-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 32-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 32-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 32-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 32-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 32-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 32-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 32-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 
+- movaps 32-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 32-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 32-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 32-120(pA10,mldab5,2), rA0 ++ movaps 32-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 32-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 32-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 32-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 32-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 32-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 32-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 32-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 32-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 32-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 32-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 32-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 32-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 32-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 56 +- movaps 48-120(pA10,mldab5,2), rA0 +- movaps 48-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 48-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 48-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 48-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 48-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 48-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 48-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 48-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 48-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 48-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 48-120(pA10), rA0 +- mulps rB0, rA0 +- addps 
rA0, rC10 +- movaps 48-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 48-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 48-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 48-120(pA10,mldab5,2), rA0 ++ movaps 48-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 48-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 48-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 48-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 48-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 48-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 48-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 48-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 48-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 48-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 48-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 48-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 48-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 48-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 60 +- movaps 64-120(pA10,mldab5,2), rA0 +- movaps 64-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 64-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 64-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 64-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 64-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 64-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 64-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 64-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 64-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 64-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 64-120(pA10), rA0 +- mulps rB0, rA0 
+- addps rA0, rC10 +- movaps 64-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 64-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 64-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 64-120(pA10,mldab5,2), rA0 ++ movaps 64-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 64-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 64-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 64-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 64-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 64-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 64-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 64-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 64-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 64-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 64-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 64-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 64-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 64-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 64 +- movaps 80-120(pA10,mldab5,2), rA0 +- movaps 80-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 80-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 80-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 80-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 80-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 80-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 80-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 80-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 80-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 80-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 80-120(pA10), rA0 +- mulps 
rB0, rA0 +- addps rA0, rC10 +- movaps 80-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 80-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 80-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 80-120(pA10,mldab5,2), rA0 ++ movaps 80-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 80-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 80-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 80-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 80-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 80-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 80-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 80-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 80-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 80-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 80-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 80-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 80-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 80-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 68 +- movaps 96-120(pA10,mldab5,2), rA0 +- movaps 96-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 96-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 96-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 96-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 96-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 96-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 96-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 96-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 96-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 96-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 96-120(pA10), rA0 
+- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 96-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 96-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 96-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 96-120(pA10,mldab5,2), rA0 ++ movaps 96-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 96-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 96-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 96-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 96-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 96-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 96-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 96-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 96-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 96-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 96-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 96-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 96-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 96-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 72 +- movaps 112-120(pA10,mldab5,2), rA0 +- movaps 112-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 112-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 112-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 112-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 112-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 112-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 112-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 112-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 112-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 112-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- 
movaps 112-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 112-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 112-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 112-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 112-120(pA10,mldab5,2), rA0 ++ movaps 112-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 112-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 112-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 112-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 112-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 112-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 112-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 112-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 112-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 112-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 112-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 112-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 112-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 112-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 76 +- movaps 128-120(pA10,mldab5,2), rA0 +- movaps 128-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 128-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 128-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 128-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 128-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 128-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 128-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 128-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 128-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 128-120(pA5,ldab,4), rA0 
+- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 128-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 128-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 128-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 128-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 128-120(pA10,mldab5,2), rA0 ++ movaps 128-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 128-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 128-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 128-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 128-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 128-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 128-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 128-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 128-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 128-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 128-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 128-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 128-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 128-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 80 +- movaps 144-120(pA10,mldab5,2), rA0 +- movaps 144-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 144-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 144-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 144-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 144-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 144-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 144-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 144-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 144-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, 
rC8 +- movaps 144-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 144-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 144-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 144-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 144-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 144-120(pA10,mldab5,2), rA0 ++ movaps 144-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 144-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 144-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 144-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 144-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 144-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 144-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 144-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 144-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 144-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 144-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 144-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 144-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 144-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + /*UKLOOP */ +@@ -1220,234 +1227,234 @@ UMLOOP: + * Get these bastard things summed up correctly + */ + +- /* rC0 = c0a c0b c0c c0d */ +- /* rC1 = c1a c1b c1c c1d */ +- /* rC2 = c2a c2b c2c c2d */ +- /* rC3 = c3a c3b c3c c3d */ ++ /* rC0 = c0a c0b c0c c0d */ ++ /* rC1 = c1a c1b c1c c1d */ ++ /* rC2 = c2a c2b c2c c2d */ ++ /* rC3 = c3a c3b c3c c3d */ + /* */ +- movaps rC2, rB0 /* rB0 = c2a c2b c2c c2d */ +- prefC((pC)) +- prefC(64(pC)) +- movaps rC0, rA0 /* rA0 = c0a c0b c0c c0d */ +- unpckhps rC3, rB0 /* rB0 = c2c c3c c2d c3d */ +- unpckhps rC1, rA0 /* rA0 = c0c c1c c0d c1d */ +- unpcklps rC3, rC2 /* rC2 = c2a c3a c2b c3b */ +- 
movlhps rB0, rC3 /* rC3 = c3a c3b c2c c3c */ +- unpcklps rC1, rC0 /* rC0 = c0a c1a c0b c1b */ +- movhlps rA0, rC3 /* rC3 = c0d c1d c2c c3c */ +- movlhps rC2, rA0 /* rA0 = c0c c1c c2a c3a */ +- movhlps rC0, rB0 /* rB0 = c0b c1b c2d c3d */ +- addps rA0, rC3 /* rC3 = c0cd c1cd c2ac c3ac */ +- movlhps rC0, rC1 /* rC1 = c1a c1b c0a c1a */ +- movhlps rC1, rC2 /* rC2 = c0a c1a c2b c3b */ +- movaps rC4, rA0 /* rA0 = c4a c4b c4c c4d */ +- addps rB0, rC2 /* rC2 = c0ab c1ab c2bd c3bd */ +- movaps rC6, rB0 /* rB0 = c6a c6b c6c c6d */ +- addps rC2, rC3 /* rC3 = c0abcd c1abcd c2bdac c3bdac */ +- +- +- /* rC4 = c4a c4b c4c c4d */ +- /* rC5 = c5a c5b c5c c5d */ +- /* rC6 = c6a c6b c6c c6d */ +- /* rC7 = c7a c7b c7c c7d */ +- /* rC8 = c08a c08b c08c c08d */ +- /* rC9 = c09a c09b c09c c09d */ +- /* rC10 = c10a c10b c10c c10d */ +- /* rC11 = c11a c11b c11c c11d */ +- /* rC12 = c12a c12b c12c c12d */ +- /* rC13 = c13a c13b c13c c13d */ ++ movaps rC2, rB0 /* rB0 = c2a c2b c2c c2d */ ++ prefC((pC)) ++ prefC(64(pC)) ++ movaps rC0, rA0 /* rA0 = c0a c0b c0c c0d */ ++ unpckhps rC3, rB0 /* rB0 = c2c c3c c2d c3d */ ++ unpckhps rC1, rA0 /* rA0 = c0c c1c c0d c1d */ ++ unpcklps rC3, rC2 /* rC2 = c2a c3a c2b c3b */ ++ movlhps rB0, rC3 /* rC3 = c3a c3b c2c c3c */ ++ unpcklps rC1, rC0 /* rC0 = c0a c1a c0b c1b */ ++ movhlps rA0, rC3 /* rC3 = c0d c1d c2c c3c */ ++ movlhps rC2, rA0 /* rA0 = c0c c1c c2a c3a */ ++ movhlps rC0, rB0 /* rB0 = c0b c1b c2d c3d */ ++ addps rA0, rC3 /* rC3 = c0cd c1cd c2ac c3ac */ ++ movlhps rC0, rC1 /* rC1 = c1a c1b c0a c1a */ ++ movhlps rC1, rC2 /* rC2 = c0a c1a c2b c3b */ ++ movaps rC4, rA0 /* rA0 = c4a c4b c4c c4d */ ++ addps rB0, rC2 /* rC2 = c0ab c1ab c2bd c3bd */ ++ movaps rC6, rB0 /* rB0 = c6a c6b c6c c6d */ ++ addps rC2, rC3 /* rC3 = c0abcd c1abcd c2bdac c3bdac */ ++ ++ ++ /* rC4 = c4a c4b c4c c4d */ ++ /* rC5 = c5a c5b c5c c5d */ ++ /* rC6 = c6a c6b c6c c6d */ ++ /* rC7 = c7a c7b c7c c7d */ ++ /* rC8 = c08a c08b c08c c08d */ ++ /* rC9 = c09a c09b c09c c09d */ ++ /* 
rC10 = c10a c10b c10c c10d */ ++ /* rC11 = c11a c11b c11c c11d */ ++ /* rC12 = c12a c12b c12c c12d */ ++ /* rC13 = c13a c13b c13c c13d */ + /* */ +- movaps rC10, rC0 /* rC0 = c10a c10b c10c c10d */ +- prefC(128(pC)) ++ movaps rC10, rC0 /* rC0 = c10a c10b c10c c10d */ ++ prefC(128(pC)) + #ifdef SREAL +- pref2((pfA)) ++ pref2((pfA)) + #else +- prefC(192(pC)) ++ prefC(192(pC)) + #endif +- movaps rC8 , rC1 /* rC1 = c08a c08b c08c c08d */ +- movaps rC12, rC2 /* rC2 = c12a c12b c12c c12d */ +- unpckhps rC7, rB0 /* rB0 = c6c c7c c6d c7d */ +- unpckhps rC5, rA0 /* rA0 = c4c c5c c4d c5d */ +- unpcklps rC7, rC6 /* rC6 = c6a c7a c6b c7b */ +- unpckhps rC11, rC0 /* rC0 = c10c c11c c10d c11d */ +- unpckhps rC9 , rC1 /* rC1 = c08c c09c c08d c09d */ +- movlhps rB0, rC7 /* rC7 = c7a c7b c6c c7c */ +- unpcklps rC5, rC4 /* rC4 = c4a c5a c4b c5b */ +- movhlps rA0, rC7 /* rC7 = c4d c5d c6c c7c */ +- movlhps rC6, rA0 /* rA0 = c4c c5c c6a c7a */ +- unpcklps rC11, rC10 /* rC10 = c10a c11a c10b c11b */ +- movhlps rC4, rB0 /* rB0 = c4b c5b c6d c7d */ +- movlhps rC0, rC11 /* rC11 = c11a c11b c10c c11c */ +- addps rA0, rC7 /* rC7 = c4cd c5cd c6ac c7ac */ ++ movaps rC8 , rC1 /* rC1 = c08a c08b c08c c08d */ ++ movaps rC12, rC2 /* rC2 = c12a c12b c12c c12d */ ++ unpckhps rC7, rB0 /* rB0 = c6c c7c c6d c7d */ ++ unpckhps rC5, rA0 /* rA0 = c4c c5c c4d c5d */ ++ unpcklps rC7, rC6 /* rC6 = c6a c7a c6b c7b */ ++ unpckhps rC11, rC0 /* rC0 = c10c c11c c10d c11d */ ++ unpckhps rC9 , rC1 /* rC1 = c08c c09c c08d c09d */ ++ movlhps rB0, rC7 /* rC7 = c7a c7b c6c c7c */ ++ unpcklps rC5, rC4 /* rC4 = c4a c5a c4b c5b */ ++ movhlps rA0, rC7 /* rC7 = c4d c5d c6c c7c */ ++ movlhps rC6, rA0 /* rA0 = c4c c5c c6a c7a */ ++ unpcklps rC11, rC10 /* rC10 = c10a c11a c10b c11b */ ++ movhlps rC4, rB0 /* rB0 = c4b c5b c6d c7d */ ++ movlhps rC0, rC11 /* rC11 = c11a c11b c10c c11c */ ++ addps rA0, rC7 /* rC7 = c4cd c5cd c6ac c7ac */ + #ifdef BETAX + #ifdef SREAL +- movups (pC), rA0 +- movlhps rC4, rC5 /* rC5 = c5a c5b c4a 
c5a */ +- movups 16(pC), rC4 +- unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */ +- movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */ +- movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */ +- movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */ +- movups 32(pC), rC5 +- movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */ +- unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */ +- addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */ +- movlps 48(pC), rC1 +- addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */ +- movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */ +- unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */ +- movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */ +- addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */ +- pref2(64(pfA)) +- mulps BOF(%rsp), rA0 +- addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */ +- mulps BOF(%rsp), rC4 +- addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */ +- mulps BOF(%rsp), rC5 +- addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */ +- mulps BOF(%rsp), rC1 ++ movups (pC), rA0 ++ movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */ ++ movups 16(pC), rC4 ++ unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */ ++ movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */ ++ movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */ ++ movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */ ++ movups 32(pC), rC5 ++ movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */ ++ unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */ ++ addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */ ++ movlps 48(pC), rC1 ++ addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */ ++ movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */ ++ unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */ ++ movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */ ++ addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */ ++ pref2(64(pfA)) ++ mulps BOF(%rsp), rA0 ++ addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */ ++ mulps BOF(%rsp), rC4 ++ addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */ ++ mulps BOF(%rsp), rC5 ++ addps 
rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */ ++ mulps BOF(%rsp), rC1 + + /* */ + +- movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */ +- addps rA0, rC3 +- addq $68, pfA +- addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */ +- addps rC4, rC7 +- addps rC5, rC11 +- addps rC1, rC12 ++ movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */ ++ addps rA0, rC3 ++ addq $68, pfA ++ addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */ ++ addps rC4, rC7 ++ addps rC5, rC11 ++ addps rC1, rC12 + #else /* BETA = X, complex type */ +- movups (pC), rA0 +- movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */ +- movups 16(pC), rC4 +- unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */ +- shufps $0x88, rC4, rA0 /* rA0 = c0 c1 c2 c3 */ +- movups 32(pC), rC4 /* rC4 = c4 X c5 X */ +- movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */ +- movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */ +- movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */ +- movups 48(pC), rC5 /* rC5 = c6 X c7 X */ +- movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */ +- unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */ +- addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */ +- shufps $0x88, rC5, rC4 /* rC4 = c4 c5 c6 c7 */ +- movups 64(pC), rC5 /* rC5 = c8 X c9 X */ +- movups 80(pC), rC1 /* rC1 = c10 X c11 X */ +- addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */ +- shufps $0x88, rC1, rC5 /* rC5 = c8 c9 c10 c11 */ +- movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */ +- movss 96(pC), rC1 +- unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */ +- movss 104(pC), rB0 +- movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */ +- unpcklps rB0, rC1 +- addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */ +- prefC(256(pC)) +- mulps BOF(%rsp), rA0 +- addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */ +- mulps BOF(%rsp), rC4 +- addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */ +- mulps BOF(%rsp), rC5 +- addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */ +- mulps BOF(%rsp), rC1 ++ movups (pC), rA0 ++ movlhps rC4, rC5 /* rC5 = c5a c5b c4a 
c5a */ ++ movups 16(pC), rC4 ++ unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */ ++ shufps $0x88, rC4, rA0 /* rA0 = c0 c1 c2 c3 */ ++ movups 32(pC), rC4 /* rC4 = c4 X c5 X */ ++ movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */ ++ movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */ ++ movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */ ++ movups 48(pC), rC5 /* rC5 = c6 X c7 X */ ++ movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */ ++ unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */ ++ addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */ ++ shufps $0x88, rC5, rC4 /* rC4 = c4 c5 c6 c7 */ ++ movups 64(pC), rC5 /* rC5 = c8 X c9 X */ ++ movups 80(pC), rC1 /* rC1 = c10 X c11 X */ ++ addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */ ++ shufps $0x88, rC1, rC5 /* rC5 = c8 c9 c10 c11 */ ++ movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */ ++ movss 96(pC), rC1 ++ unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */ ++ movss 104(pC), rB0 ++ movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */ ++ unpcklps rB0, rC1 ++ addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */ ++ prefC(256(pC)) ++ mulps BOF(%rsp), rA0 ++ addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */ ++ mulps BOF(%rsp), rC4 ++ addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */ ++ mulps BOF(%rsp), rC5 ++ addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */ ++ mulps BOF(%rsp), rC1 + + /* */ + +- movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */ +- addps rA0, rC3 +- prefC(192(pC)) +- addq $68, pfA +- addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */ +- addps rC4, rC7 +- addps rC5, rC11 +- addps rC1, rC12 ++ movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */ ++ addps rA0, rC3 ++ prefC(192(pC)) ++ addq $68, pfA ++ addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */ ++ addps rC4, rC7 ++ addps rC5, rC11 ++ addps rC1, rC12 + #endif + + #else +- movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */ +- unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */ +- movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */ +- movlhps rC10, rC1 /* rC1 
= c08c c09c c10a c11a */ +- movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */ +- movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */ +- unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */ +- addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */ +- addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */ +- movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */ +- unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */ +- movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */ +- addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */ ++ movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */ ++ unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */ ++ movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */ ++ movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */ ++ movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */ ++ movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */ ++ unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */ ++ addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */ ++ addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */ ++ movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */ ++ unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */ ++ movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */ ++ addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */ + #ifdef SREAL +- pref2(64(pfA)) ++ pref2(64(pfA)) + #else +- prefC(256(pC)) ++ prefC(256(pC)) + #endif +- addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */ +- addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */ +- addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */ ++ addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */ ++ addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */ ++ addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */ + + /* */ + +- movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */ ++ movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */ + #ifndef SREAL +- prefC(192(pC)) ++ prefC(192(pC)) + #endif +- addq $68, pfA +- addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */ ++ addq $68, pfA ++ addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */ + + #endif + /* + * Write 
results back to C; pC += 14; + */ + #ifdef SREAL +- movups rC3, (pC) +- movups rC7, 16(pC) +- movups rC11, 32(pC) +- movlps rC12, 48(pC) +- addq $56, pC ++ movups rC3, (pC) ++ movups rC7, 16(pC) ++ movups rC11, 32(pC) ++ movlps rC12, 48(pC) ++ addq $56, pC + #else +- movss rC3, (pC) +- movss rC7, 32(pC) +- movhlps rC3, rC0 +- movhlps rC7, rC6 +- movss rC0, 16(pC) +- movss rC6, 48(pC) +- shufps $0x55, rC3, rC3 +- shufps $0x55, rC7, rC7 +- movss rC3, 8(pC) +- movss rC7, 40(pC) +- shufps $0x55, rC0, rC0 +- shufps $0x55, rC6, rC6 +- movss rC0, 24(pC) +- movss rC6, 56(pC) +- +- movss rC11, 64(pC) +- movhlps rC11, rC2 +- movss rC12, 96(pC) +- movss rC2, 80(pC) +- shufps $0x55, rC11, rC11 +- shufps $0x55, rC12, rC12 +- movss rC11, 72(pC) +- shufps $0x55, rC2, rC2 +- movss rC12, 104(pC) +- movss rC2, 88(pC) ++ movss rC3, (pC) ++ movss rC7, 32(pC) ++ movhlps rC3, rC0 ++ movhlps rC7, rC6 ++ movss rC0, 16(pC) ++ movss rC6, 48(pC) ++ shufps $0x55, rC3, rC3 ++ shufps $0x55, rC7, rC7 ++ movss rC3, 8(pC) ++ movss rC7, 40(pC) ++ shufps $0x55, rC0, rC0 ++ shufps $0x55, rC6, rC6 ++ movss rC0, 24(pC) ++ movss rC6, 56(pC) ++ ++ movss rC11, 64(pC) ++ movhlps rC11, rC2 ++ movss rC12, 96(pC) ++ movss rC2, 80(pC) ++ shufps $0x55, rC11, rC11 ++ shufps $0x55, rC12, rC12 ++ movss rC11, 72(pC) ++ shufps $0x55, rC2, rC2 ++ movss rC12, 104(pC) ++ movss rC2, 88(pC) + +- addq $112, pC ++ addq $112, pC + #endif + /* + * Write results back to C + */ +- addq $NB14so-176, pA5 +- addq $NB14so-176, pA10 +- subq $176, pB0 ++ addq $NB14so-176, pA5 ++ addq $NB14so-176, pA10 ++ subq $176, pB0 + /* + * pC += 14; pA += 14*NB; pB -= NB; + */ + /* + * while (pA != stM); + */ +- subq $1, stM +- jne UMLOOP ++ subq $1, stM ++ jne UMLOOP + #endif + + /* +@@ -1459,994 +1466,994 @@ MLAST: + #endif + /*UKLOOP: */ + #ifdef BETA1 +- movaps 0-120(pA10,mldab5,2), rC0 +- movaps 0-120(pB0), rB0 +- mulps rB0, rC0 +- addss (pC), rC0 +- movaps 0-120(pA5, mldab,4), rC1 +- mulps rB0, rC1 +- addss CMUL(4)(pC), rC1 +- movaps 
0-120(pA10, mldab,8), rC2 +- mulps rB0, rC2 +- addss CMUL(8)(pC), rC2 +- movaps 0-120(pA5, mldab,2), rC3 +- mulps rB0, rC3 +- addss CMUL(12)(pC), rC3 +- movaps 0-120(pA5, mldab), rC4 +- mulps rB0, rC4 +- addss CMUL(16)(pC), rC4 +- movaps 0-120(pA5), rC5 +- mulps rB0, rC5 +- addss CMUL(20)(pC), rC5 +- movaps 0-120(pA5, ldab), rC6 +- mulps rB0, rC6 +- addss CMUL(24)(pC), rC6 +- movaps 0-120(pA5, ldab,2), rC7 +- mulps rB0, rC7 +- addss CMUL(28)(pC), rC7 +- movaps 0-120(pA10, mldab,2), rC8 +- mulps rB0, rC8 +- addss CMUL(32)(pC), rC8 +- movaps 0-120(pA5,ldab,4), rC9 +- mulps rB0, rC9 +- addss CMUL(36)(pC), rC9 +- movaps 0-120(pA10), rC10 +- mulps rB0, rC10 +- addss CMUL(40)(pC), rC10 +- movaps 0-120(pA10,ldab), rC11 +- mulps rB0, rC11 +- addss CMUL(44)(pC), rC11 +- movaps 0-120(pA10,ldab,2), rC12 +- mulps rB0, rC12 +- addss CMUL(48)(pC), rC12 +- movaps 0-120(pA5,ldab,8), rC13 +- mulps rB0, rC13 +- addss CMUL(52)(pC), rC13 ++ movaps 0-120(pA10,mldab5,2), rC0 ++ movaps 0-120(pB0), rB0 ++ mulps rB0, rC0 ++ addss (pC), rC0 ++ movaps 0-120(pA5, mldab,4), rC1 ++ mulps rB0, rC1 ++ addss CMUL(4)(pC), rC1 ++ movaps 0-120(pA10, mldab,8), rC2 ++ mulps rB0, rC2 ++ addss CMUL(8)(pC), rC2 ++ movaps 0-120(pA5, mldab,2), rC3 ++ mulps rB0, rC3 ++ addss CMUL(12)(pC), rC3 ++ movaps 0-120(pA5, mldab), rC4 ++ mulps rB0, rC4 ++ addss CMUL(16)(pC), rC4 ++ movaps 0-120(pA5), rC5 ++ mulps rB0, rC5 ++ addss CMUL(20)(pC), rC5 ++ movaps 0-120(pA5, ldab), rC6 ++ mulps rB0, rC6 ++ addss CMUL(24)(pC), rC6 ++ movaps 0-120(pA5, ldab,2), rC7 ++ mulps rB0, rC7 ++ addss CMUL(28)(pC), rC7 ++ movaps 0-120(pA10, mldab,2), rC8 ++ mulps rB0, rC8 ++ addss CMUL(32)(pC), rC8 ++ movaps 0-120(pA5,ldab,4), rC9 ++ mulps rB0, rC9 ++ addss CMUL(36)(pC), rC9 ++ movaps 0-120(pA10), rC10 ++ mulps rB0, rC10 ++ addss CMUL(40)(pC), rC10 ++ movaps 0-120(pA10,ldab), rC11 ++ mulps rB0, rC11 ++ addss CMUL(44)(pC), rC11 ++ movaps 0-120(pA10,ldab,2), rC12 ++ mulps rB0, rC12 ++ addss CMUL(48)(pC), rC12 ++ movaps 0-120(pA5,ldab,8), 
rC13 ++ mulps rB0, rC13 ++ addss CMUL(52)(pC), rC13 + #else +- movaps 0-120(pA10,mldab5,2), rC0 +- movaps 0-120(pB0), rC13 +- mulps rC13, rC0 +- movaps 0-120(pA5, mldab,4), rC1 +- mulps rC13, rC1 +- movaps 0-120(pA10, mldab,8), rC2 +- mulps rC13, rC2 +- movaps 0-120(pA5, mldab,2), rC3 +- mulps rC13, rC3 +- movaps 0-120(pA5, mldab), rC4 +- mulps rC13, rC4 +- movaps 0-120(pA5), rC5 +- mulps rC13, rC5 +- movaps 0-120(pA5, ldab), rC6 +- mulps rC13, rC6 +- movaps 0-120(pA5, ldab,2), rC7 +- mulps rC13, rC7 +- movaps 0-120(pA10, mldab,2), rC8 +- mulps rC13, rC8 +- movaps 0-120(pA5,ldab,4), rC9 +- mulps rC13, rC9 +- movaps 0-120(pA10), rC10 +- mulps rC13, rC10 +- movaps 0-120(pA10,ldab), rC11 +- mulps rC13, rC11 +- movaps 0-120(pA10,ldab,2), rC12 +- mulps rC13, rC12 +- mulps 0-120(pA5,ldab,8), rC13 ++ movaps 0-120(pA10,mldab5,2), rC0 ++ movaps 0-120(pB0), rC13 ++ mulps rC13, rC0 ++ movaps 0-120(pA5, mldab,4), rC1 ++ mulps rC13, rC1 ++ movaps 0-120(pA10, mldab,8), rC2 ++ mulps rC13, rC2 ++ movaps 0-120(pA5, mldab,2), rC3 ++ mulps rC13, rC3 ++ movaps 0-120(pA5, mldab), rC4 ++ mulps rC13, rC4 ++ movaps 0-120(pA5), rC5 ++ mulps rC13, rC5 ++ movaps 0-120(pA5, ldab), rC6 ++ mulps rC13, rC6 ++ movaps 0-120(pA5, ldab,2), rC7 ++ mulps rC13, rC7 ++ movaps 0-120(pA10, mldab,2), rC8 ++ mulps rC13, rC8 ++ movaps 0-120(pA5,ldab,4), rC9 ++ mulps rC13, rC9 ++ movaps 0-120(pA10), rC10 ++ mulps rC13, rC10 ++ movaps 0-120(pA10,ldab), rC11 ++ mulps rC13, rC11 ++ movaps 0-120(pA10,ldab,2), rC12 ++ mulps rC13, rC12 ++ mulps 0-120(pA5,ldab,8), rC13 + #endif + + #if KB > 4 +- movaps 16-120(pA10,mldab5,2), rA0 +- movaps 16-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 16-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 16-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 16-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 16-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 16-120(pA5), rA0 +- mulps rB0, rA0 
+- addps rA0, rC5 +- movaps 16-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 16-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 16-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 16-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 16-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 16-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 16-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 16-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 16-120(pA10,mldab5,2), rA0 ++ movaps 16-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 16-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 16-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 16-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 16-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 16-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 16-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 16-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 16-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 16-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 16-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 16-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 16-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 16-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 8 +- movaps 32-120(pA10,mldab5,2), rA0 +- movaps 32-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 32-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 32-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 32-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 32-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 32-120(pA5), rA0 +- mulps 
rB0, rA0 +- addps rA0, rC5 +- movaps 32-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 32-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 32-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 32-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 32-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 32-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 32-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 32-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 32-120(pA10,mldab5,2), rA0 ++ movaps 32-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 32-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 32-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 32-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 32-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 32-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 32-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 32-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 32-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 32-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 32-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 32-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 32-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 32-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 12 +- movaps 48-120(pA10,mldab5,2), rA0 +- movaps 48-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 48-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 48-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 48-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 48-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 48-120(pA5), rA0 
+- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 48-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 48-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 48-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 48-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 48-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 48-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 48-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 48-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 48-120(pA10,mldab5,2), rA0 ++ movaps 48-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 48-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 48-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 48-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 48-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 48-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 48-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 48-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 48-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 48-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 48-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 48-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 48-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 48-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 16 +- movaps 64-120(pA10,mldab5,2), rA0 +- movaps 64-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 64-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 64-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 64-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 64-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 
64-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 64-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 64-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 64-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 64-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 64-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 64-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 64-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 64-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 64-120(pA10,mldab5,2), rA0 ++ movaps 64-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 64-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 64-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 64-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 64-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 64-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 64-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 64-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 64-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 64-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 64-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 64-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 64-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 64-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 20 +- movaps 80-120(pA10,mldab5,2), rA0 +- movaps 80-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 80-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 80-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 80-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 80-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 
+- movaps 80-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 80-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 80-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 80-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 80-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 80-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 80-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 80-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 80-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 80-120(pA10,mldab5,2), rA0 ++ movaps 80-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 80-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 80-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 80-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 80-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 80-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 80-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 80-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 80-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 80-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 80-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 80-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 80-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 80-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 24 +- movaps 96-120(pA10,mldab5,2), rA0 +- movaps 96-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 96-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 96-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 96-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 96-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps 
rA0, rC4 +- movaps 96-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 96-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 96-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 96-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 96-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 96-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 96-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 96-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 96-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 96-120(pA10,mldab5,2), rA0 ++ movaps 96-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 96-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 96-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 96-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 96-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 96-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 96-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 96-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 96-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 96-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 96-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 96-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 96-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 96-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 28 +- movaps 112-120(pA10,mldab5,2), rA0 +- movaps 112-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 112-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 112-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 112-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 112-120(pA5, mldab), rA0 +- mulps 
rB0, rA0 +- addps rA0, rC4 +- movaps 112-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 112-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 112-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 112-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 112-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 112-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 112-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 112-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 112-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 112-120(pA10,mldab5,2), rA0 ++ movaps 112-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 112-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 112-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 112-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 112-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 112-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 112-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 112-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 112-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 112-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 112-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 112-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 112-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 112-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 32 +- movaps 128-120(pA10,mldab5,2), rA0 +- movaps 128-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 128-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 128-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 128-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- 
movaps 128-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 128-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 128-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 128-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 128-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 128-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 128-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 128-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 128-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 128-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 128-120(pA10,mldab5,2), rA0 ++ movaps 128-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 128-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 128-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 128-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 128-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 128-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 128-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 128-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 128-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 128-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 128-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 128-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 128-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 128-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 36 +- movaps 144-120(pA10,mldab5,2), rA0 +- movaps 144-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 144-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 144-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 144-120(pA5, mldab,2), rA0 
+- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 144-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 144-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 144-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 144-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 144-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 144-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 144-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 144-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 144-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 144-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 144-120(pA10,mldab5,2), rA0 ++ movaps 144-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 144-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 144-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 144-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 144-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 144-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 144-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 144-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 144-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 144-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 144-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 144-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 144-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 144-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif +- prefB((pB,ldab)) +- prefB(64(pB,ldab)) ++ prefB((pB,ldab)) ++ prefB(64(pB,ldab)) + + #if KB > 40 +- movaps 160-120(pA10,mldab5,2), rA0 +- movaps 160-120(pB0), rB0 +- mulps rB0, rA0 +- addq $176, pB0 +- addps rA0, rC0 +- movaps 160-120(pA5, mldab,4), rA0 +- 
mulps rB0, rA0 +- addps rA0, rC1 +- movaps 160-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 160-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 160-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 160-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 160-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 160-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 160-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 160-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 160-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 160-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 160-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addq $176, pA10 +- addps rA0, rC12 +- mulps 160-120(pA5,ldab,8), rB0 +- addps rB0, rC13 +- addq $176, pA5 ++ movaps 160-120(pA10,mldab5,2), rA0 ++ movaps 160-120(pB0), rB0 ++ mulps rB0, rA0 ++ addq $176, pB0 ++ addps rA0, rC0 ++ movaps 160-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 160-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 160-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 160-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 160-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 160-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 160-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 160-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 160-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 160-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 160-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 160-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addq $176, pA10 ++ addps rA0, rC12 ++ mulps 160-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 ++ addq $176, pA5 + #else +- addq $176, pB0 +- addq 
$176, pA10 +- addq $176, pA5 ++ addq $176, pB0 ++ addq $176, pA10 ++ addq $176, pA5 + #endif + + #if KB > 44 +- movaps 0-120(pA10,mldab5,2), rA0 +- movaps 0-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 0-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 0-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 0-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 0-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 0-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 0-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 0-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 0-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 0-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 0-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 0-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 0-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 0-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 0-120(pA10,mldab5,2), rA0 ++ movaps 0-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 0-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 0-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 0-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 0-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 0-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 0-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 0-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 0-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 0-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 0-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 0-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 0-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ 
addps rA0, rC12 ++ mulps 0-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 48 +- movaps 16-120(pA10,mldab5,2), rA0 +- movaps 16-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 16-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 16-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 16-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 16-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 16-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 16-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 16-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 16-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 16-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 16-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 16-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 16-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 16-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 16-120(pA10,mldab5,2), rA0 ++ movaps 16-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 16-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 16-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 16-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 16-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 16-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 16-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 16-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 16-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 16-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 16-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 16-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 16-120(pA10,ldab,2), rA0 ++ mulps 
rB0, rA0 ++ addps rA0, rC12 ++ mulps 16-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 52 +- movaps 32-120(pA10,mldab5,2), rA0 +- movaps 32-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 32-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 32-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 32-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 32-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 32-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 32-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 32-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 32-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 32-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 32-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 32-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 32-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 32-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 32-120(pA10,mldab5,2), rA0 ++ movaps 32-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 32-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 32-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 32-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 32-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 32-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 32-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 32-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 32-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 32-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 32-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 32-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 32-120(pA10,ldab,2), rA0 
++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 32-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 56 +- movaps 48-120(pA10,mldab5,2), rA0 +- movaps 48-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 48-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 48-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 48-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 48-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 48-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 48-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 48-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 48-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 48-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 48-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 48-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 48-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 48-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 48-120(pA10,mldab5,2), rA0 ++ movaps 48-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 48-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 48-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 48-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 48-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 48-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 48-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 48-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 48-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 48-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 48-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 48-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 
48-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 48-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 60 +- movaps 64-120(pA10,mldab5,2), rA0 +- movaps 64-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 64-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 64-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 64-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 64-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 64-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 64-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 64-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 64-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 64-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 64-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 64-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 64-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 64-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 64-120(pA10,mldab5,2), rA0 ++ movaps 64-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 64-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 64-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 64-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 64-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 64-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 64-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 64-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 64-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 64-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 64-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 64-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 
++ movaps 64-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 64-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif +- prefB(128-176(pB,ldab)) +- prefB(192-176(pB,ldab)) ++ prefB(128-176(pB,ldab)) ++ prefB(192-176(pB,ldab)) + + #if KB > 64 +- movaps 80-120(pA10,mldab5,2), rA0 +- movaps 80-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 80-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 80-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 80-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 80-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 80-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 80-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 80-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 80-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 80-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 80-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 80-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 80-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 80-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 80-120(pA10,mldab5,2), rA0 ++ movaps 80-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 80-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 80-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 80-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 80-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 80-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 80-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 80-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 80-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 80-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 
80-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 80-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 80-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 80-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 68 +- movaps 96-120(pA10,mldab5,2), rA0 +- movaps 96-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 96-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 96-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 96-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 96-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 96-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 96-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 96-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 96-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 96-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 96-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 96-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 96-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 96-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 96-120(pA10,mldab5,2), rA0 ++ movaps 96-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 96-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 96-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 96-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 96-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 96-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 96-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 96-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 96-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 96-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 
++ movaps 96-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 96-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 96-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 96-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 72 +- movaps 112-120(pA10,mldab5,2), rA0 +- movaps 112-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 112-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 112-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 112-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 112-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 112-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 112-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 112-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 112-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 112-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 112-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 112-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 112-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 112-120(pA5,ldab,8), rB0 +- prefC((pC)) +- prefC((pC,incCn)) +- addps rB0, rC13 ++ movaps 112-120(pA10,mldab5,2), rA0 ++ movaps 112-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 112-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 112-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 112-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 112-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 112-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 112-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 112-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 112-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 
++ movaps 112-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 112-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 112-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 112-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 112-120(pA5,ldab,8), rB0 ++ prefC((pC)) ++ prefC((pC,incCn)) ++ addps rB0, rC13 + #else +- prefC((pC)) +- prefC((pC,incCn)) ++ prefC((pC)) ++ prefC((pC,incCn)) + #endif + + #if KB > 76 +- movaps 128-120(pA10,mldab5,2), rA0 +- movaps 128-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 128-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 128-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 128-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 128-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 128-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 128-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 128-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 128-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 128-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 128-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 128-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 128-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 128-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 128-120(pA10,mldab5,2), rA0 ++ movaps 128-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 128-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 128-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 128-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 128-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 128-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 128-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps 
rA0, rC6 ++ movaps 128-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 128-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 128-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 128-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 128-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 128-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 128-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + #if KB > 80 +- movaps 144-120(pA10,mldab5,2), rA0 +- movaps 144-120(pB0), rB0 +- mulps rB0, rA0 +- addps rA0, rC0 +- movaps 144-120(pA5, mldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC1 +- movaps 144-120(pA10, mldab,8), rA0 +- mulps rB0, rA0 +- addps rA0, rC2 +- movaps 144-120(pA5, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC3 +- movaps 144-120(pA5, mldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC4 +- movaps 144-120(pA5), rA0 +- mulps rB0, rA0 +- addps rA0, rC5 +- movaps 144-120(pA5, ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC6 +- movaps 144-120(pA5, ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC7 +- movaps 144-120(pA10, mldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC8 +- movaps 144-120(pA5,ldab,4), rA0 +- mulps rB0, rA0 +- addps rA0, rC9 +- movaps 144-120(pA10), rA0 +- mulps rB0, rA0 +- addps rA0, rC10 +- movaps 144-120(pA10,ldab), rA0 +- mulps rB0, rA0 +- addps rA0, rC11 +- movaps 144-120(pA10,ldab,2), rA0 +- mulps rB0, rA0 +- addps rA0, rC12 +- mulps 144-120(pA5,ldab,8), rB0 +- addps rB0, rC13 ++ movaps 144-120(pA10,mldab5,2), rA0 ++ movaps 144-120(pB0), rB0 ++ mulps rB0, rA0 ++ addps rA0, rC0 ++ movaps 144-120(pA5, mldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC1 ++ movaps 144-120(pA10, mldab,8), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC2 ++ movaps 144-120(pA5, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC3 ++ movaps 144-120(pA5, mldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC4 ++ movaps 144-120(pA5), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC5 ++ movaps 
144-120(pA5, ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC6 ++ movaps 144-120(pA5, ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC7 ++ movaps 144-120(pA10, mldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC8 ++ movaps 144-120(pA5,ldab,4), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC9 ++ movaps 144-120(pA10), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC10 ++ movaps 144-120(pA10,ldab), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC11 ++ movaps 144-120(pA10,ldab,2), rA0 ++ mulps rB0, rA0 ++ addps rA0, rC12 ++ mulps 144-120(pA5,ldab,8), rB0 ++ addps rB0, rC13 + #endif + + /*UKLOOP */ +@@ -2454,202 +2461,202 @@ MLAST: + * Get these bastard things summed up correctly + */ + +- /* rC0 = c0a c0b c0c c0d */ +- /* rC1 = c1a c1b c1c c1d */ +- /* rC2 = c2a c2b c2c c2d */ +- /* rC3 = c3a c3b c3c c3d */ ++ /* rC0 = c0a c0b c0c c0d */ ++ /* rC1 = c1a c1b c1c c1d */ ++ /* rC2 = c2a c2b c2c c2d */ ++ /* rC3 = c3a c3b c3c c3d */ + /* */ +- movaps rC2, rB0 /* rB0 = c2a c2b c2c c2d */ +- prefC(64(pC,incCn)) +- prefB(256-176(pB,ldab)) +- movaps rC0, rA0 /* rA0 = c0a c0b c0c c0d */ +- unpckhps rC3, rB0 /* rB0 = c2c c3c c2d c3d */ +- unpckhps rC1, rA0 /* rA0 = c0c c1c c0d c1d */ +- unpcklps rC3, rC2 /* rC2 = c2a c3a c2b c3b */ +- movlhps rB0, rC3 /* rC3 = c3a c3b c2c c3c */ +- unpcklps rC1, rC0 /* rC0 = c0a c1a c0b c1b */ +- movhlps rA0, rC3 /* rC3 = c0d c1d c2c c3c */ +- movlhps rC2, rA0 /* rA0 = c0c c1c c2a c3a */ +- movhlps rC0, rB0 /* rB0 = c0b c1b c2d c3d */ +- addps rA0, rC3 /* rC3 = c0cd c1cd c2ac c3ac */ +- movlhps rC0, rC1 /* rC1 = c1a c1b c0a c1a */ +- movhlps rC1, rC2 /* rC2 = c0a c1a c2b c3b */ +- movaps rC4, rA0 /* rA0 = c4a c4b c4c c4d */ +- addps rB0, rC2 /* rC2 = c0ab c1ab c2bd c3bd */ +- movaps rC6, rB0 /* rB0 = c6a c6b c6c c6d */ +- addps rC2, rC3 /* rC3 = c0abcd c1abcd c2bdac c3bdac */ +- +- +- /* rC4 = c4a c4b c4c c4d */ +- /* rC5 = c5a c5b c5c c5d */ +- /* rC6 = c6a c6b c6c c6d */ +- /* rC7 = c7a c7b c7c c7d */ +- /* rC8 = c08a c08b c08c c08d */ +- /* rC9 = c09a c09b c09c c09d */ +- /* 
rC10 = c10a c10b c10c c10d */ +- /* rC11 = c11a c11b c11c c11d */ +- /* rC12 = c12a c12b c12c c12d */ +- /* rC13 = c13a c13b c13c c13d */ ++ movaps rC2, rB0 /* rB0 = c2a c2b c2c c2d */ ++ prefC(64(pC,incCn)) ++ prefB(256-176(pB,ldab)) ++ movaps rC0, rA0 /* rA0 = c0a c0b c0c c0d */ ++ unpckhps rC3, rB0 /* rB0 = c2c c3c c2d c3d */ ++ unpckhps rC1, rA0 /* rA0 = c0c c1c c0d c1d */ ++ unpcklps rC3, rC2 /* rC2 = c2a c3a c2b c3b */ ++ movlhps rB0, rC3 /* rC3 = c3a c3b c2c c3c */ ++ unpcklps rC1, rC0 /* rC0 = c0a c1a c0b c1b */ ++ movhlps rA0, rC3 /* rC3 = c0d c1d c2c c3c */ ++ movlhps rC2, rA0 /* rA0 = c0c c1c c2a c3a */ ++ movhlps rC0, rB0 /* rB0 = c0b c1b c2d c3d */ ++ addps rA0, rC3 /* rC3 = c0cd c1cd c2ac c3ac */ ++ movlhps rC0, rC1 /* rC1 = c1a c1b c0a c1a */ ++ movhlps rC1, rC2 /* rC2 = c0a c1a c2b c3b */ ++ movaps rC4, rA0 /* rA0 = c4a c4b c4c c4d */ ++ addps rB0, rC2 /* rC2 = c0ab c1ab c2bd c3bd */ ++ movaps rC6, rB0 /* rB0 = c6a c6b c6c c6d */ ++ addps rC2, rC3 /* rC3 = c0abcd c1abcd c2bdac c3bdac */ ++ ++ ++ /* rC4 = c4a c4b c4c c4d */ ++ /* rC5 = c5a c5b c5c c5d */ ++ /* rC6 = c6a c6b c6c c6d */ ++ /* rC7 = c7a c7b c7c c7d */ ++ /* rC8 = c08a c08b c08c c08d */ ++ /* rC9 = c09a c09b c09c c09d */ ++ /* rC10 = c10a c10b c10c c10d */ ++ /* rC11 = c11a c11b c11c c11d */ ++ /* rC12 = c12a c12b c12c c12d */ ++ /* rC13 = c13a c13b c13c c13d */ + /* */ +- movaps rC10, rC0 /* rC0 = c10a c10b c10c c10d */ +- movaps rC8 , rC1 /* rC1 = c08a c08b c08c c08d */ +- movaps rC12, rC2 /* rC2 = c12a c12b c12c c12d */ +- unpckhps rC7, rB0 /* rB0 = c6c c7c c6d c7d */ +- unpckhps rC5, rA0 /* rA0 = c4c c5c c4d c5d */ +- unpcklps rC7, rC6 /* rC6 = c6a c7a c6b c7b */ +- unpckhps rC11, rC0 /* rC0 = c10c c11c c10d c11d */ +- unpckhps rC9 , rC1 /* rC1 = c08c c09c c08d c09d */ +- movlhps rB0, rC7 /* rC7 = c7a c7b c6c c7c */ +- unpcklps rC5, rC4 /* rC4 = c4a c5a c4b c5b */ +- movhlps rA0, rC7 /* rC7 = c4d c5d c6c c7c */ +- movlhps rC6, rA0 /* rA0 = c4c c5c c6a c7a */ +- unpcklps rC11, rC10 /* 
rC10 = c10a c11a c10b c11b */
+- movhlps rC4, rB0 /* rB0 = c4b c5b c6d c7d */
+- movlhps rC0, rC11 /* rC11 = c11a c11b c10c c11c */
+- addps rA0, rC7 /* rC7 = c4cd c5cd c6ac c7ac */
++ movaps rC10, rC0 /* rC0 = c10a c10b c10c c10d */
++ movaps rC8 , rC1 /* rC1 = c08a c08b c08c c08d */
++ movaps rC12, rC2 /* rC2 = c12a c12b c12c c12d */
++ unpckhps rC7, rB0 /* rB0 = c6c c7c c6d c7d */
++ unpckhps rC5, rA0 /* rA0 = c4c c5c c4d c5d */
++ unpcklps rC7, rC6 /* rC6 = c6a c7a c6b c7b */
++ unpckhps rC11, rC0 /* rC0 = c10c c11c c10d c11d */
++ unpckhps rC9 , rC1 /* rC1 = c08c c09c c08d c09d */
++ movlhps rB0, rC7 /* rC7 = c7a c7b c6c c7c */
++ unpcklps rC5, rC4 /* rC4 = c4a c5a c4b c5b */
++ movhlps rA0, rC7 /* rC7 = c4d c5d c6c c7c */
++ movlhps rC6, rA0 /* rA0 = c4c c5c c6a c7a */
++ unpcklps rC11, rC10 /* rC10 = c10a c11a c10b c11b */
++ movhlps rC4, rB0 /* rB0 = c4b c5b c6d c7d */
++ movlhps rC0, rC11 /* rC11 = c11a c11b c10c c11c */
++ addps rA0, rC7 /* rC7 = c4cd c5cd c6ac c7ac */
+ #ifdef BETAX
+ #ifdef SREAL
+- movups (pC), rA0
+- movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */
+- movups 16(pC), rC4
+- unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */
+- movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */
+- movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */
+- movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */
+- movups 32(pC), rC5
+- movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */
+- unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */
+- addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */
+- movlps 48(pC), rC1
+- addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */
+- movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */
+- unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */
+- movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */
+- addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */
+- mulps BOF(%rsp), rA0
+- addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */
+- mulps BOF(%rsp), rC4
+- addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */
+- mulps BOF(%rsp), rC5
+- addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */
+- mulps BOF(%rsp), rC1
++ movups (pC), rA0
++ movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */
++ movups 16(pC), rC4
++ unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */
++ movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */
++ movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */
++ movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */
++ movups 32(pC), rC5
++ movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */
++ unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */
++ addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */
++ movlps 48(pC), rC1
++ addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */
++ movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */
++ unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */
++ movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */
++ addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */
++ mulps BOF(%rsp), rA0
++ addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */
++ mulps BOF(%rsp), rC4
++ addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */
++ mulps BOF(%rsp), rC5
++ addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */
++ mulps BOF(%rsp), rC1
+
+ /* */
+
+- movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */
+- addps rA0, rC3
+- addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */
+- addps rC4, rC7
+- addps rC5, rC11
+- prefB(320-176(pB,ldab))
+- addps rC1, rC12
++ movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */
++ addps rA0, rC3
++ addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */
++ addps rC4, rC7
++ addps rC5, rC11
++ prefB(320-176(pB,ldab))
++ addps rC1, rC12
+ #else /* BETA = X, complex type */
+- movups (pC), rA0
+- movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */
+- movups 16(pC), rC4
+- unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */
+- shufps $0x88, rC4, rA0 /* rA0 = c0 c1 c2 c3 */
+- movups 32(pC), rC4 /* rC4 = c4 X c5 X */
+- movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */
+- movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */
+- movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */
+- movups 48(pC), rC5 /* rC5 = c6 X c7 X */
+- movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */
+- unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */
+- addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */
+- shufps $0x88, rC5, rC4 /* rC4 = c4 c5 c6 c7 */
+- movups 64(pC), rC5 /* rC5 = c8 X c9 X */
+- movups 80(pC), rC1 /* rC1 = c10 X c11 X */
+- addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */
+- shufps $0x88, rC1, rC5 /* rC5 = c8 c9 c10 c11 */
+- movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */
+- movss 96(pC), rC1
+- unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */
+- movss 104(pC), rB0
+- movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */
+- unpcklps rB0, rC1
+- addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */
+- mulps BOF(%rsp), rA0
+- addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */
+- mulps BOF(%rsp), rC4
+- addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */
+- mulps BOF(%rsp), rC5
+- addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */
+- mulps BOF(%rsp), rC1
++ movups (pC), rA0
++ movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */
++ movups 16(pC), rC4
++ unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */
++ shufps $0x88, rC4, rA0 /* rA0 = c0 c1 c2 c3 */
++ movups 32(pC), rC4 /* rC4 = c4 X c5 X */
++ movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */
++ movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */
++ movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */
++ movups 48(pC), rC5 /* rC5 = c6 X c7 X */
++ movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */
++ unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */
++ addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */
++ shufps $0x88, rC5, rC4 /* rC4 = c4 c5 c6 c7 */
++ movups 64(pC), rC5 /* rC5 = c8 X c9 X */
++ movups 80(pC), rC1 /* rC1 = c10 X c11 X */
++ addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */
++ shufps $0x88, rC1, rC5 /* rC5 = c8 c9 c10 c11 */
++ movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */
++ movss 96(pC), rC1
++ unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */
++ movss 104(pC), rB0
++ movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */
++ unpcklps rB0, rC1
++ addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */
++ mulps BOF(%rsp), rA0
++ addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */
++ mulps BOF(%rsp), rC4
++ addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */
++ mulps BOF(%rsp), rC5
++ addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */
++ mulps BOF(%rsp), rC1
+
+ /* */
+
+- movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */
+- addps rA0, rC3
+- addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */
+- addps rC4, rC7
+- addps rC5, rC11
+- prefB(320-176(pB,ldab))
+- addps rC1, rC12
++ movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */
++ addps rA0, rC3
++ addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */
++ addps rC4, rC7
++ addps rC5, rC11
++ prefB(320-176(pB,ldab))
++ addps rC1, rC12
+ #endif
+
+ #else
+- movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */
+- unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */
+- movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */
+- movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */
+- movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */
+- movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */
+- unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */
+- addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */
+- addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */
+- movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */
+- unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */
+- movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */
+- addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */
+- addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */
+- addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */
+- addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */
++ movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */
++ unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */
++ movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */
++ movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */
++ movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */
++ movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */
++ unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */
++ addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */
++ addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */
++ movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */
++ unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */
++ movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */
++ addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */
++ addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */
++ addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */
++ addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */
+
+ /* */
+
+- movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */
+- prefB(320-176(pB,ldab))
+- addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */
++ movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */
++ prefB(320-176(pB,ldab))
++ addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */
+
+ #endif
+ /*
+ * Write results back to C; pC += 14;
+ */
+ #ifdef SREAL
+- movups rC3, (pC)
+- movups rC7, 16(pC)
+- movups rC11, 32(pC)
+- movlps rC12, 48(pC)
+-/* addq $56, pC */
++ movups rC3, (pC)
++ movups rC7, 16(pC)
++ movups rC11, 32(pC)
++ movlps rC12, 48(pC)
++/* addq $56, pC */
+ #else
+- movss rC3, (pC)
+- movss rC7, 32(pC)
+- movhlps rC3, rC0
+- movhlps rC7, rC6
+- movss rC0, 16(pC)
+- movss rC6, 48(pC)
+- shufps $0x55, rC3, rC3
+- shufps $0x55, rC7, rC7
+- movss rC3, 8(pC)
+- movss rC7, 40(pC)
+- shufps $0x55, rC0, rC0
+- shufps $0x55, rC6, rC6
+- movss rC0, 24(pC)
+- movss rC6, 56(pC)
+-
+- movss rC11, 64(pC)
+- movhlps rC11, rC2
+- movss rC12, 96(pC)
+- movss rC2, 80(pC)
+- shufps $0x55, rC11, rC11
+- shufps $0x55, rC12, rC12
+- movss rC11, 72(pC)
+- shufps $0x55, rC2, rC2
+- movss rC12, 104(pC)
+- movss rC2, 88(pC)
++ movss rC3, (pC)
++ movss rC7, 32(pC)
++ movhlps rC3, rC0
++ movhlps rC7, rC6
++ movss rC0, 16(pC)
++ movss rC6, 48(pC)
++ shufps $0x55, rC3, rC3
++ shufps $0x55, rC7, rC7
++ movss rC3, 8(pC)
++ movss rC7, 40(pC)
++ shufps $0x55, rC0, rC0
++ shufps $0x55, rC6, rC6
++ movss rC0, 24(pC)
++ movss rC6, 56(pC)
++
++ movss rC11, 64(pC)
++ movhlps rC11, rC2
++ movss rC12, 96(pC)
++ movss rC2, 80(pC)
++ shufps $0x55, rC11, rC11
++ shufps $0x55, rC12, rC12
++ movss rC11, 72(pC)
++ shufps $0x55, rC2, rC2
++ movss rC12, 104(pC)
++ movss rC2, 88(pC)
+
+-/* addq $112, pC */
++/* addq $112, pC */
+ #endif
+ /*
+ * Write results back to C
+@@ -2660,55 +2667,55 @@ MLAST:
+ /*
+ * while (pA != stM);
+ */
+-/* subq $1, stM */
+-/* jne UMLOOP */
++/* subq $1, stM */
++/* jne UMLOOP */
+ /*
+ * pC += 14; pA += 14*NB; pB -= NB;
+ */
+-/* subq $MBKBso-NB14so+176, pA5 */
+-/* subq $MBKBso-NB14so+176, pA10 */
+- subq incAm, pA5
+- subq incAm, pA10
+- addq $NBso-176, pB0
++/* subq $MBKBso-NB14so+176, pA5 */
++/* subq $MBKBso-NB14so+176, pA10 */
++ subq incAm, pA5
++ subq incAm, pA10
++ addq $NBso-176, pB0
+ /*
+ * while (pA != stM);
+ */
+-/* subq $1, stM */
+-/* jne UMLOOP */
++/* subq $1, stM */
++/* jne UMLOOP */
+ /*
+ * pC += incCn; pA -= NBNB; pB += NB;
+ */
+- addq incCn, pC
++ addq incCn, pC
+ /*
+ * while (pB != stN);
+ */
+- sub $1, stN
+- jne UNLOOP
++ sub $1, stN
++ jne UNLOOP
+
+ /*
+ * Restore callee-saved iregs
+ */
+ DONE:
+- movq -8(%rsp), %rbp
+- movq -16(%rsp), %rbx
++ movq -8(%rsp), %rbp
++ movq -16(%rsp), %rbx
+ #if MB == 0
+- movq -32(%rsp), %r12
+- movq -40(%rsp), %r13
++ movq -32(%rsp), %r12
++ movq -40(%rsp), %r13
+ #endif
+- ret
++ ret
+ #if MB == 0
+ MB_LT84:
+- cmp $70, stM
+- jne MB_LT70
+-/* movq $70/14, stM */
+- movq $5, stM
+- jmp MBFOUND
++ cmp $70, stM
++ jne MB_LT70
++/* movq $70/14, stM */
++ movq $5, stM
++ jmp MBFOUND
+ MB_LT70:
+- cmp $56, stM
+- jne MB_LT56
+-/* movq $56/14, stM */
+- movq $4, stM
+- jmp MBFOUND
++ cmp $56, stM
++ jne MB_LT56
++/* movq $56/14, stM */
++ movq $4, stM
++ jmp MBFOUND
+ MB_LT56:
+ cmp $42, stM
+ jne MB_LT42
+diff -rupN ATLAS/tune/blas/level1/scalsrch.c atlas-3.8.3/tune/blas/level1/scalsrch.c
+--- ATLAS/tune/blas/level1/scalsrch.c 2009-02-18 19:48:25.000000000 +0100
++++ atlas-3.8.3/tune/blas/level1/scalsrch.c 2009-11-12 13:45:48.141174024 +0100
+@@ -747,7 +747,7 @@ void GenMainRout(char pre, int n, int *i
+ /*
+ * Handle all special alpha cases
+ */
+- fprintf(fpout, "%sif ( SCALAR_IS_ZERO(alpha) )\n", spc);
++ /* fprintf(fpout, "%sif ( SCALAR_IS_ZERO(alpha) )\n", spc);
+ fprintf(fpout, "%s{\n", spc);
+ if (pre == 'c' || pre == 'z')
+ {
+@@ -756,7 +756,7 @@ void GenMainRout(char pre, int n, int *i
+ }
+ else fprintf(fpout, "%s Mjoin(PATL,set)(N, ATL_rzero, X, incx);\n", spc);
+ fprintf(fpout, "%s return;\n", spc);
+- fprintf(fpout, "%s}\n", spc);
++ fprintf(fpout, "%s}\n", spc); */
+ GenAlphCase(pre, spc, fpout, 1, n, ix, iy, ia, ib);
+ GenAlphCase(pre, spc, fpout, -1, n, ix, iy, ia, ib);
+ if (pre == 'c' || pre == 'z')
diff --git a/libraries/atlas/slack-desc b/libraries/atlas/slack-desc
new file mode 100644
index 0000000000000..bed245ddbe18e
--- /dev/null
+++ b/libraries/atlas/slack-desc
@@ -0,0 +1,19 @@
+# HOW TO EDIT THIS FILE:
+# The "handy ruler" below makes it easier to edit a package description. Line
+# up the first '|' above the ':' following the base package name, and the '|'
+# on the right side marks the last column you can put a character in. You
+# must make exactly 11 lines for the formatting to be correct. It's also
+# customary to leave one space after the ':'.
+
+     |-----handy-ruler------------------------------------------------------|
+atlas: ATLAS (Automatically Tuned Linear Algebra Software)
+atlas:
+atlas: This is ATLAS (Automatically Tuned Linear Algebra Software), an
+atlas: ongoing research effort focusing on applying empirical techniques in
+atlas: order to provide portable performance. At present, it provides C and
+atlas: Fortran77 interfaces to a portably efficient BLAS implementation as
+atlas: well as a few routines from LAPACK.
+atlas:
+atlas: Homepage: http://math-atlas.sourceforge.net/
+atlas:
+atlas: |