author | Willy Sudiarto Raharjo <willysr@slackbuilds.org> | 2022-03-08 00:11:39 +0700 |
---|---|---|
committer | Willy Sudiarto Raharjo <willysr@slackbuilds.org> | 2022-03-08 00:11:39 +0700 |
commit | a36d012001a96f7d97959cc7a2a279090bf4631c | |
tree | e58c2a76ce34ec2a66e27ea29aa2e5e91bf4cc9b /libraries | |
parent | c946ccd4de52a7ba91af30c87f13f809d517ce4a |
libraries/atlas: Removed (use OpenBLAS).
Signed-off-by: Willy Sudiarto Raharjo <willysr@slackbuilds.org>
Diffstat (limited to 'libraries')
-rw-r--r-- | libraries/atlas/AMD64K10h64SSE3.tgz | bin | 11038 -> 0 bytes | |||
-rw-r--r-- | libraries/atlas/README | 15 | ||||
-rw-r--r-- | libraries/atlas/README.SLACKWARE | 135 | ||||
-rw-r--r-- | libraries/atlas/TimingResults.txt | 62 | ||||
-rw-r--r-- | libraries/atlas/atlas.SlackBuild | 438 | ||||
-rw-r--r-- | libraries/atlas/atlas.info | 10 | ||||
-rw-r--r-- | libraries/atlas/atlas.patch | 5072 | ||||
-rw-r--r-- | libraries/atlas/slack-desc | 19 |
8 files changed, 0 insertions, 5751 deletions
diff --git a/libraries/atlas/AMD64K10h64SSE3.tgz b/libraries/atlas/AMD64K10h64SSE3.tgz
Binary files differ
deleted file mode 100644
index 727f3748dbe0..000000000000
--- a/libraries/atlas/AMD64K10h64SSE3.tgz
+++ /dev/null
diff --git a/libraries/atlas/README b/libraries/atlas/README
deleted file mode 100644
index f8b90a9b834d..000000000000
--- a/libraries/atlas/README
+++ /dev/null
@@ -1,15 +0,0 @@
-ATLAS (Automatically Tuned Linear Algebra Software) is an ongoing
-research effort focusing on applying empirical techniques in order to
-provide portable performance. At present, it provides C and Fortran77
-interfaces to a portably efficient BLAS implementation, as well as a few
-routines from LAPACK. Nevertheless, by default, this SlackBuild also
-builds a full LAPACK linked with ATLAS. If you are really sure that you
-don't want this, set LAPACK_SOURCE to the empty string when running this
-script.
-
-This conflicts with cblas and lapack (not to be confused with lapack-atlas).
-Nevertheless, it should be possible to avoid these conflicts by proper use
-of the SYS_DESTDIR variable.
-
-The impatient may just switch CPU throttling off and run the script, but
-you are advised to read over README.SLACKWARE *in advance*.
diff --git a/libraries/atlas/README.SLACKWARE b/libraries/atlas/README.SLACKWARE
deleted file mode 100644
index 826d5ddcf5c5..000000000000
--- a/libraries/atlas/README.SLACKWARE
+++ /dev/null
@@ -1,135 +0,0 @@
-IMPORTANT NOTES
-
-1) The present SlackBuild for ATLAS does by no means try to take into account
-   all configuration/build issues of ATLAS. Nevertheless, any relevant patches
-   mentioned in the ATLAS Errata are applied.
-
-2) The script mostly assumes that you are installing on an x86 or x86_64
-   platform and use gcc for compilation. If you decide to use other compilers or
-   install on another platform, you are unfortunately on your own and welcome to
-   suggest improvements or patches to this SlackBuild. There is one small
-   exception to this: the USE_DWALL variable, see below.
-
-3) There is no "post install" tuning performed by this script.
-
-4) ATLAS does not conflict with the reference netlib BLAS. Nevertheless, if
-   ATLAS got installed successfully you should consider removing netlib BLAS and
-   (re)compiling every BLAS/LAPACK dependent package. Otherwise you may not have
-   much gain from installing ATLAS.
-
-5) There is a strong interaction between ATLAS and LAPACK. By default ATLAS
-   implements an optimized subset of LAPACK and creates the corresponding static
-   library. Nevertheles, provided that the full LAPACK source is available,
-   ATLAS builds a complete LAPACK library linked against its optimized BLAS
-   implementation. This is what the atlas SlackBuild does by default. You may
-   decide that you don't what this, then make use of the LAPACK_SOURCE variable
-   (see below).
-
-
-INSTALLATION DETAILS
-
-1) Make sure CPU throttling is off before starting the install. This is
-   important, since ATLAS has to tune itself. As with Slackware 14.2 you
-   can run /etc/rc.d/rc.cpufreq as root with "performance" as command line
-   argument. To reset, run it again with what gets set at boot time (by
-   default "ondemand") as command line argument.
-
-2) For the same reason, keep the extra load on the system as low as possible
-   while building ATLAS.
-
-
-GENERIC SETUP VARIABLES
-
-1) SYS_DESTDIR is set by default to "/usr" and is the system destination
-   directory. When installing the package produced by this SlackBuild,
-   ATLAS's and LAPACK's files will be written to $SYS_DESTDIR/include,
-   $SYS_DESTDIR/include/atlas and $SYS_DESTDIR/lib (or lib64).
-   Documentation files are written to /usr/doc/atlas-$VERSION if not
-   otherwise stated (see below).
-   You may want to change the value of SYS_DESTDIR to avoid conflicts. If
-   you do so, you have to make sure that these libraries and corresponding
-   headers are found by the compiler or the configuration software used
-   to build code depending on them.
-   IMPORTANT: SYS_DESTDIR has to have an absolute path as value.
-
-2) DEFAULT_DOCS has the default value "yes", which means that docs go
-   to /usr/doc/atlas-$VERSION, but you may want to let the docs go
-   to $SYS_DESTDIR/doc/atlas-$VERSION. For this, just set this
-   variable to "no".
-
-
-SETUP VARIABLES FOR ATLAS
-
-1) USE_ARCH_DEFAULTS defaults to "yes", which means that the library
-   will be optimized by trying to take into account former builds done
-   on a similar machine. Thus ATLAS will use predefined optimizations
-   if available. This may reduce (much) the compilation time but may
-   not give you the best result if you don't use the same gcc compiler
-   version as the ATLAS author.
-   Please note that with this variable set to "no", or if there are no
-   known optimizations for your machine ATLAS compilation may last for
-   many hours! Take a nap :-)
-   NOTE: On the machine of this SlackBuild's author setting
-         USE_ARCH_DEFAULTS to "no" provided libraries with definitely
-         better performance. Compilation took about six hours.
-
-2) ARCH_DEF_DIR has different meanings, depending on the value of
-   USE_ARCH_DEFAULTS:
-   a) If USE_ARCH_DEFAULTS is "yes" and you have some custom architectural
-      defaults, then you may set this to the absolute path of the directory
-      containing the file with your custom defaults.
-   b) If USE_ARCH_DEFAULTS is "no" and you would like to create custom
-      architectural defaults then set this to the absolute path of the
-      directory which should contain the file with the custom defaults.
-      NOTE: Since this file is supposed to survive an upgrade, it doesn't
-            get included in the Slackware package. You have to remove it
-            by hand, if needed. A file named "ARCH_DEF_DIR" gets written
-            to the documentation directory, to remind you where the created
-            architectural defaults are. Make a backup of it, since it may
-            get deleted with an upgrade.
-   ARCH_DEF_DIR defaults to the empty string, which means that neither your
-   custom defaults are used nor custom defaults are created.
-
-3) USE_DWALL defaults to "no" which should be OK for x86 or x86_64 and the gcc
-   compiler. If you are on another architecture than x86 and/or don't use gcc
-   you need to set it to "yes".
-
-4) L2_CACHE_SIZE provides the size of the level 2 cache in bytes. By default it
-   is deduced from /proc/cpuinfo but you can just set the value manually, if you
-   wish or need so.
-
-5) NUM_THREADS allows you to set the maximum number of threads. By default it
-   is "-1", which means autodection. In this case it gets set equal to the
-   number of available processors.
-
-6) USE_PROCESSORS is by default the empty string, which means that any of the
-   available processors may be used. Nevertheless, under some circumstances,
-   one may want to specify the processor IDs, e.g. "0 2 4". Please consult
-   atlas_install.pdf, p. 13 for more informations.
-   NOTES: a) This is incompatible with the autodetection of the number of
-             threads. Therefore NUM_THREADS must be greater than 1.
-          b) Write just the processor IDs to this string, the script takes
-             care of the rest. Take care to have NUM_THREADS equal to the
-             amount of processor IDs.
-
-7) SHARED_SWITCH is set by default to ask for building shared libs along with
-   the static ones. Set this to the empty string, if you don't want to have
-   shared libs.
-
-
-SETUP VARIABLES FOR LAPACK
-
-1) LAPACK_SOURCE set this variable to the empty string, if you don't want for a
-   full LAPACK library to get build.
-
-2) TEST_LAPACK set this variable to "yes" if you would like to run the LAPACK
-   tests. You will find the results of the tests in the documentation directory.
-   This has no relevance, if you didn't allow for a full LAPACK build.
-
-3) LAPACK_TIMER sets the timer to be used for LAPACK. If you stay with
-   gfortran, presently the default compiler on Slackware, you can leave the
-   value as is. Otherwise, set it to "NONE" or read LAPACK's make.inc.example
-   for more informations.
-   This has no relevance, if you didn't allow for a full LAPACK build.
-
-
diff --git a/libraries/atlas/TimingResults.txt b/libraries/atlas/TimingResults.txt
deleted file mode 100644
index 4cc33c00935f..000000000000
--- a/libraries/atlas/TimingResults.txt
+++ /dev/null
@@ -1,62 +0,0 @@
-MACHINE: Intel Core2 Duo T9600 @ 2.80GHz
-COMPILER: gcc 5.3.0 (as shipped with Slackware Linux 14.2)
-
-The times labeled Reference are for ATLAS as installed by the authors.
-NAMING ABBREVIATIONS:
-   kSelMM : selected matmul kernel (may be hand-tuned)
-   kGenMM : generated matmul kernel
-   kMM_NT : worst no-copy kernel
-   kMM_TN : best no-copy kernel
-   BIG_MM : large GEMM timing (usually N=1600); estimate of asymptotic peak
-   kMV_N  : NoTranspose matvec kernel
-   kMV_T  : Transpose matvec kernel
-   kGER   : GER (rank-1 update) kernel
-Kernel routines are not called by the user directly, and their
-performance is often somewhat different than the total
-algorithm (eg, dGER perf may differ from dkGER)
-
-
-AFTER A PARTIAL SEARCH, ARCH IDENTIFIED AS Core232SSE3
-======================================================
-
-Reference clock rate=2493Mhz, new rate=2801Mhz
-   Refrenc : % of clock rate achieved by reference install
-   Present : % of clock rate achieved by present ATLAS install
-
-                    single precision                  double precision
-            ********************************  *******************************
-                 real            complex           real            complex
-            ---------------  ---------------  ---------------  ---------------
-Benchmark   Refrenc Present  Refrenc Present  Refrenc Present  Refrenc Present
-=========   ======= =======  ======= =======  ======= =======  ======= =======
-  kSelMM      578.5   363.2    564.7   577.7    334.6   352.5    325.1   336.5
-  kGenMM      156.3   101.2    156.5   102.0    159.9   159.2    161.7    97.3
-  kMM_NT      134.3   125.8    133.0   127.1    151.6   140.7    151.2   152.9
-  kMM_TN      154.8   101.3    152.6   101.1    142.4    90.8    149.7    94.2
-  BIG_MM      554.0   350.7    554.6   352.2    318.9   330.7    312.3   324.5
-  kMV_N        63.6    71.7    106.8    62.5     29.7    40.3     56.5    71.8
-  kMV_T        64.7    74.7    108.0    79.3     32.5    44.9     60.5    65.8
-  kGER         45.9    37.9     88.6    61.2     22.1    19.7     45.5    44.5
-
-
-AFTER A FULL SEARCH
-===================
-
-Reference clock rate=2493Mhz, new rate=2801Mhz
-   Refrenc : % of clock rate achieved by reference install
-   Present : % of clock rate achieved by present ATLAS install
-
-                    single precision                  double precision
-            ********************************  *******************************
-                 real            complex           real            complex
-            ---------------  ---------------  ---------------  ---------------
-Benchmark   Refrenc Present  Refrenc Present  Refrenc Present  Refrenc Present
-=========   ======= =======  ======= =======  ======= =======  ======= =======
-  kSelMM      578.5   624.7    564.7   572.9    334.6   347.2    325.1   334.3
-  kGenMM      156.3   156.0    156.5   155.4    159.9   163.2    161.7   163.2
-  kMM_NT      134.3   104.8    133.0    96.9    151.6   140.5    151.2   144.5
-  kMM_TN      154.8   170.8    152.6   163.5    142.4   122.0    149.7   127.9
-  BIG_MM      554.0   527.8    554.6   558.3    318.9   331.3    312.3   331.0
-  kMV_N        63.6    72.1    106.8   118.8     29.7    44.8     56.5    79.1
-  kMV_T        64.7    78.8    108.0   134.4     32.5    45.5     60.5    88.3
-  kGER         45.9    40.2     88.6    74.6     22.1    21.7     45.5    44.8
diff --git a/libraries/atlas/atlas.SlackBuild b/libraries/atlas/atlas.SlackBuild
deleted file mode 100644
index c5204336f1c3..000000000000
--- a/libraries/atlas/atlas.SlackBuild
+++ /dev/null
@@ -1,438 +0,0 @@
-#!/bin/bash
-
-# Slackware build script for ATLAS
-
-# Copyright 2010-2016 Serban Udrea <s.udrea@gsi.de>
-# All rights reserved.
-#
-# Redistribution and use of this script, with or without modification,
-# is permitted provided that the following conditions are met:
-#
-# 1. Redistributions of this script must retain the above copyright
-#    notice, this list of conditions and the following disclaimer.
-# -# THIS SOFTWARE IS PROVIDED BY THE AUTHOR ''AS IS'' AND ANY EXPRESS OR -# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED -# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE -# DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, -# INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES -# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR -# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) -# HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, -# STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING -# IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -# POSSIBILITY OF SUCH DAMAGE. - -cd $(dirname $0) ; CWD=$(pwd) - -PRGNAM=atlas -VERSION=${VERSION:-3.10.3} -BUILD=${BUILD:-2} -TAG=${TAG:-_SBo} -PKGTYPE=${PKGTYPE:-tgz} - -if [ -z "$ARCH" ]; then - case "$( uname -m )" in - i?86) ARCH=i586 ;; - arm*) ARCH=arm ;; - *) ARCH="$( uname -m )" ;; - esac -fi - -# If the variable PRINT_PACKAGE_NAME is set, then this script will report what -# the name of the created package would be, and then exit. This information -# could be useful to other scripts. -if [ ! -z "${PRINT_PACKAGE_NAME}" ]; then - echo "$PRGNAM-$VERSION-$ARCH-$BUILD$TAG.$PKGTYPE" - exit 0 -fi - -TMP=${TMP:-/tmp/SBo} -PKG=$TMP/package-$PRGNAM -OUTPUT=${OUTPUT:-/tmp} - -if [ "$ARCH" = "i586" ]; then - SLKCFLAGS="-O2 -march=i586 -mtune=i686" - LIBDIRSUFFIX="" - BITSize="32" # Specifically for ATLAS -elif [ "$ARCH" = "i686" ]; then - SLKCFLAGS="-O2 -march=i686 -mtune=i686" - LIBDIRSUFFIX="" - BITSize="32" # Specifically for ATLAS -elif [ "$ARCH" = "x86_64" ]; then - SLKCFLAGS="-O2 -fPIC" - LIBDIRSUFFIX="64" - BITSize="64" # Specifically for ATLAS -fi - -# If you don't want to use architectural defaults set the following to -# something like "no". 
-# -USE_ARCH_DEFAULTS=${USE_ARCH_DEFAULTS:-yes} - -# If you decide to use arch defaults and have some custom ones you may -# set the following variable to point to the directory containing these. -# -# If you decide to not use arch defauts and wish to create some after a build -# with full search, set the following variable to point to the directory where -# the file containing them should be placed. -# IMPORTANT: In this case, the file copied to ARCH_DEF_DIR will not be part of -# the ATLAS package, to avoid problems in case of an upgrade on the -# same machine. The value of ARCH_DEF_DIR will be written for your -# reference to the file named ARCH_DEF_DIR within the doc directory -# of ATLAS. -# -ARCH_DEF_DIR=${ARCH_DEF_DIR:-""} - -# If you are on another architecture than x86 and/or don't use gcc you need to -# set the following variable to "yes". -# -USE_DWALL=${USE_DWALL:-no} - -# You may wish to set the level 2 cache size to the proper value. The default -# is to deduce it from /proc/cpuinfo -# -L2_CACHE_SIZE=${L2_CACHE_SIZE:-"auto"} - -if [ "$L2_CACHE_SIZE" = "auto" ]; then - L2_CACHE_SIZE="$(cat /proc/cpuinfo |grep "cache size"| head -n 1| cut -d ":" -s -f2| cut -d " " -s -f2)" - L2_SIZE_UNIT="$(cat /proc/cpuinfo |grep "cache size"| head -n 1| cut -d " " -s -f4)" - case "$L2_SIZE_UNIT" in - "KB") L2_CACHE_SIZE=$(($L2_CACHE_SIZE * 1024)) - ;; - "MB") L2_CACHE_SIZE=$(($L2_CACHE_SIZE * 1024 * 1024)) - ;; - "GB") L2_CACHE_SIZE=$(($L2_CACHE_SIZE * 1024 * 1024 * 1024)) - ;; - esac -fi - -# Check the value of L2_CACHE_SIZE -# -case "$L2_CACHE_SIZE" in - ''|'0'|*[!0-9]*) echo "ERROR: The value of L2_CACHE_SIZE is not a strictly positive integer!" - exit 1 - ;; -esac - -# Set the (maximum) number of threads. If this is 0 just the serial libs get -# built, even on an SMP machine. By default it's set to -1 for autodetection. 
-# -NUM_THREADS=${NUM_THREADS:-"-1"} -case "$NUM_THREADS" in - '-1'|'0') echo -n # Do nothing - ;; - '1') NUM_THREADS="0" # One processor => no threading - ;; - ''|*[!0-9]*) echo "ERROR: NUM_THREADS has an improper value!" - exit 1 - ;; -esac - -if [ $NUM_THREADS -gt 1 ]; then - # On SMP machines one may want to set the processors to be used (see - # atlas_install.pdf, p. 13). By default the list of processor ID's is empty - # which means that ATLAS may use whatever is available. - # NOTE: This is incompatible with the autodetection of the number of threads. - # Therefore NUM_THREADS must be greater than 1. - # - USE_PROCESSORS=${USE_PROCESSORS:-""} - if [ -z "$USE_PROCESSORS" ]; then - MT_SWITCH="-t $NUM_THREADS" - else - MT_SWITCH="--force-tids=\"$NUM_THREADS $USE_PROCESSORS\"" - fi -else - MT_SWITCH="-t $NUM_THREADS" -fi - -# Decide upon building full LAPACK or not. Set LAPACK_SOURCE to the empty -# string, if you don't want a full LAPACK build. -# -LAPACK_SOURCE=${LAPACK_SOURCE:-"/usr/share/lapack-atlas/lapack.tgz"} -if [ -z "$LAPACK_SOURCE" ]; then - echo - echo "WARNING" - echo "WARNING: No LAPACK source specified. Just the highly restricted LAPACK" - echo " offered by ATLAS will get compiled!" - echo "WARNING" - echo - sleep 3 -else - tar -tf "$LAPACK_SOURCE" > /dev/null 2>&1 || \ - { echo "ERROR: Improper LAPACK source archive!" \ - && echo " Please check $LAPACK_SOURCE" \ - && echo " and set it properly! " \ - && exit 1; } # NOTE: Here we just test that we deal with a tar archive. - LAPACK_SOURCE="--with-netlib-lapack-tarfile=$LAPACK_SOURCE" - - # Change the following to yes if you would like to run the tests for LAPACK. - # - TEST_LAPACK="${TEST_LAPACK:-no}" - # Make Y or N out of yes, Yes, No, no, etc. - # - TEST_LAPACK=$(echo "$TEST_LAPACK"|cut -b 1|tr a-z A-Z) -fi - -# Decide upon building shared libraries or not. By default we ask for shared -# libs too. If one doesn't want this, she has to just set SHARED_SWITCH to the -# empty string. 
-# -SHARED_SWITCH=${SHARED_SWITCH:-"--shared"} - -# This is the timer to be used for LAPACK. If you stay with gfortran, -# presently the default compiler on Slackware, you can leave the value as is. -# Otherwise, please read LAPACK's make.inc.example for more informations. -# -LAPACK_TIMER="${LAPACK_TIMER:-INT_ETIME}" - -# This is the system destination directory. When installing the -# package produced by this script, ATLAS's files will be written to -# $SYS_DESTDIR/include, $SYS_DESTDIR/include/atlas, $SYS_DESTDIR/lib -# or $SYS_DESTDIR/lib64 ond appropriate platforms, etc. -# Nevertheless, by default the documentation files go to -# /usr/doc/$PRGNAM-$VERSION. You may change this through the variable -# DEFAULT_DOCS, see below. -# -SYS_DESTDIR=${SYS_DESTDIR:-/usr} - -# Check if SYS_DESTDIR is an absolute path. If not, exit with error. -# NOTE: The $ is used because echo adds a \n at the end of the string. -# -echo $SYS_DESTDIR | grep -vE '/\.\./|/\.\.$' | grep -qE '^/' || \ -{ echo "ERROR: The system destination directory has no absolute path!" \ -&& echo " The value of SYS_DESTDIR is $SYS_DESTDIR" \ -&& echo " Please set it properly! " \ -&& exit 1; } - -# You may want to have the documentation files installed under -# $SYS_DESTDIR/doc/$PRGNAM-$VERSION not /usr/doc/$PRGNAM-$VERSION. -# To achieve this just set the following variable to something like -# "no". -# -DEFAULT_DOCS=${DEFAULT_DOCS:-yes} - -# The build directory to be created within the source directory of -# ATLAS. -# -BLDdir="BuildDir" - -set -e - -rm -rf $PKG -mkdir -p $TMP $PKG $OUTPUT - -cd $TMP -rm -rf $PRGNAM-$VERSION -tar xvf $CWD/${PRGNAM}${VERSION}.tar.bz2 -mv ATLAS $PRGNAM-$VERSION -cd $PRGNAM-$VERSION -chown -R root:root . -find -L . 
\ - \( -perm 777 -o -perm 775 -o -perm 750 -o -perm 711 -o -perm 555 \ - -o -perm 511 \) -exec chmod 755 {} \; -o \ - \( -perm 666 -o -perm 664 -o -perm 640 -o -perm 600 -o -perm 444 \ - -o -perm 440 -o -perm 400 \) -exec chmod 644 {} \; - -# Set the proper value to USE_ARCH_DEFAULTS, and the proper value to the -# configure switch needed in case you want to use custom arch defaults. -# -ARCH_DIR_SWITCH="" -case "$USE_ARCH_DEFAULTS" in - [yY]|[yY][eE]|[yY][eE][sS]) USE_ARCH_DEFAULTS="1" - [ -z "$ARCH_DEF_DIR" ] || \ - ARCH_DIR_SWITCH="-Ss ADdir $ARCH_DEF_DIR" - ;; - *) USE_ARCH_DEFAULTS="0" ;; -esac - -mkdir -p $BLDdir -cd $BLDdir - -# Configure atlas. -# -case "$USE_DWALL" in - [yY]|[yY][eE]|[yY][eE][sS]) - # Here we assume that we aren't on a x86 machine - # and/or gcc isn't the compiler to be used. - # - ../configure $SHARED_SWITCH \ - --prefix="$SYS_DESTDIR" \ - $LAPACK_SOURCE \ - $MT_SWITCH \ - -Si archdef "$USE_ARCH_DEFAULTS" \ - $ARCH_DIR_SWITCH \ - -b "$BITSize" -D c -DWALL - ;; - *) - # Here we assume that we are on a x86 machine - # (be it 32 or 64 bits) and gcc is the compiler - # to be used. - # - # Get the CPU frequency for good timing. - # - CPU_FREQ="$(cat /proc/cpuinfo |grep "cpu MHz"| head -n 1| cut -d ":" -s -f2| tr -d [:blank:])" - # - ../configure $SHARED_SWITCH \ - --prefix="$SYS_DESTDIR" \ - $LAPACK_SOURCE \ - $MT_SWITCH \ - -Si archdef "$USE_ARCH_DEFAULTS" \ - $ARCH_DIR_SWITCH \ - -b "$BITSize" \ - -D c -DPentiumCPS="$CPU_FREQ" - ;; -esac - -# NOTES ON SOME FLAGS FOR CONFIGURE -# -# SHARED_SWITCH = "--shared" asks for building the shared libraries too -# -Si archdef "$USE_ARCH_DEFAULTS" means that we ignore or not architectural defaults depending -# upon the value of "$USE_ARCH_DEFAULTS". -# -b "$BITSize" tells ATLAS about the platform's bitsize, 32 or 64. -# -D c -DPentiumCPS="$CPU_FREQ" is for achieving good timing on x86 platforms with gcc. 
-# -D c -DWALL is for achieving good timing on non x86 platforms and/or non gcc compilers - -# Write the value of L2_CACHE_SIZE to Make.inc -# -sed -i -r Make.inc -e \ - "s%L2SIZE = -DL2SIZE=[0-9]+%L2SIZE = -DL2SIZE=$L2_CACHE_SIZE%" - -# Allow for deprecated LAPACK routines to get build in case of a full LAPACK -# installation. Also set the LAPACK timer to the desired value and add -# -frecursive to the compile flags, since this should help avoid problems -# with some functions which seem otherwise to not be thread safe. -# -if [ "$LAPACK_SOURCE" ]; then - sed -i ./src/lapack/reference/make.inc.example -e \ - "s%^#MAKEDEPRECATED *=.*Yes%MAKEDEPRECATED = Yes%" - sed -i ./interfaces/lapack/F77/src/Makefile -e \ - "s%NONE%$LAPACK_TIMER%" -e \ - "s%F77FLAGS)@%F77FLAGS) -frecursive@%" -e \ - "s%F77NOOPT)@%F77NOOPT) -frecursive@%" -fi - -make build -make check - -# If parallel libraries have been compiled check them too. -# -if [ -f lib/libptcblas.a ]; then - make ptcheck -fi - -# If the full LAPACK got build one may wish to test it too. -# -if [ "$LAPACK_SOURCE" ]; then - if [ "$TEST_LAPACK" = "Y" ]; then - ( cd src/lapack/reference - [ -e ./libtmglib.a ] || make tmglib - # Some testers segfault when build with -frecursive if one doesn't - # increase the stack size limit, thus it's better to remove this flag - # from make.inc - # - sed -i make.inc -e "s%-frecursive%%" - - # Now we have to set the proper library paths. Here for the serial libs. - # - ATLAS_LIBS="../../../../../lib/libf77blas.a ../../../../../lib/libcblas.a" - ATLAS_LIBS="$ATLAS_LIBS ../../../../../lib/libatlas.a" - LAPACK_LIB="../../../lib/liblapack.a" - - sed -i make.inc -e \ - "s%^BLASLIB *=.*%BLASLIB = $ATLAS_LIBS%" -e \ - "s%^CBLASLIB *=.*%CBLASLIB =%" -e \ - "s%^LAPACKLIB *=.*%LAPACKLIB = $LAPACK_LIB%" - - # Perform the tests. 
- # - make lapack_testing - - # Put the test results together - # - tar czf TEST_SERIAL_RESULTS.tgz TESTING/*.out - - # If threaded libs got build, we repeat the tests with them. - # - if [ -e ../../../lib/libptlapack.a ]; then - make cleantesting - ATLAS_LIBS="../../../../../lib/libptf77blas.a" - ATLAS_LIBS="$ATLAS_LIBS ../../../../../lib/libptcblas.a" - ATLAS_LIBS="$ATLAS_LIBS ../../../../../lib/libatlas.a -lpthread" - LAPACK_LIB="../../../lib/libptlapack.a" - sed -i make.inc -e \ - "s%^BLASLIB *=.*%BLASLIB = $ATLAS_LIBS%" -e \ - "s%^LAPACKLIB *=.*%LAPACKLIB = $LAPACK_LIB%" - make lapack_testing - tar czf TEST_PT_RESULTS.tgz TESTING/*.out - fi - ) - fi -fi - -make install DESTDIR=${PKG}${SYS_DESTDIR} - -# The install script (sometimes) "forgets" about libptlapack.a -# -cp -ua lib/libptlapack.a ${PKG}${SYS_DESTDIR}/lib/ || true - -find $PKG | xargs file | grep -e "executable" -e "shared object" | grep ELF \ - | cut -f 1 -d : | xargs strip --strip-unneeded 2> /dev/null || true - -# This is probably the easiest way to make sure that we install in the -# proper place. -# -if [ "$LIBDIRSUFFIX" ]; then - mv ${PKG}${SYS_DESTDIR}/lib ${PKG}${SYS_DESTDIR}/lib${LIBDIRSUFFIX} -fi - -# Create the doc directory for atlas and populate it. -# -case "$DEFAULT_DOCS" in - [nN]|[nN][oO]) DOC_DIR="$PKG$SYS_DESTDIR/doc/$PRGNAM-$VERSION" ;; - *) DOC_DIR="$PKG/usr/doc/$PRGNAM-$VERSION" ;; -esac -mkdir -p ${DOC_DIR} -cp -a ../INSTALL.txt ../README ../doc ${DOC_DIR} - -# Add the Slackbuild script and README.SLACKWARE to the docs. -# -cat $CWD/$PRGNAM.SlackBuild > $DOC_DIR/$PRGNAM.SlackBuild -cat $CWD/README.SLACKWARE > $DOC_DIR/README.SLACKWARE - -# Create custom arch defaults if appropriate. 
-# -if [ "$USE_ARCH_DEFAULTS" = "0" ]; then - if [ "$ARCH_DEF_DIR" ]; then - ( cd ARCHS - make ArchNew - make tarfile - cp -ua *.tar.* "$ARCH_DEF_DIR" - ) - echo "$ARCH_DEF_DIR" > $DOC_DIR/ARCH_DEF_DIR - fi -fi - -# If the full LAPACK got installed add also some relevant files from its source -# tree. -# -if [ "$LAPACK_SOURCE" ]; then - ( cd src/lapack/reference - LAPACK_VER=$(./INSTALL/testversion | sed -e "s% *LAPACK *%%" -e "s% *%%g") - LAPACK_DOC_DIR="${DOC_DIR}/lapack-$LAPACK_VER" - mkdir "$LAPACK_DOC_DIR" - cp -a LICENSE README "$LAPACK_DOC_DIR" - - # Copy the test results if present (getting around "set -e" with "echo -n"). - # - cp -a TEST_* "$LAPACK_DOC_DIR" 2>/dev/null || echo -n - ) -fi - -rm -f $PKG/usr/lib*/*.la - -mkdir -p $PKG/install -cat $CWD/slack-desc > $PKG/install/slack-desc - -cd "$PKG" -/sbin/makepkg -l y -c n $OUTPUT/$PRGNAM-$VERSION-$ARCH-$BUILD$TAG.$PKGTYPE diff --git a/libraries/atlas/atlas.info b/libraries/atlas/atlas.info deleted file mode 100644 index 72483a664443..000000000000 --- a/libraries/atlas/atlas.info +++ /dev/null @@ -1,10 +0,0 @@ -PRGNAM="atlas" -VERSION="3.10.3" -HOMEPAGE="http://math-atlas.sourceforge.net/" -DOWNLOAD="http://downloads.sourceforge.net/math-atlas/atlas3.10.3.tar.bz2" -MD5SUM="d6ce4f16c2ad301837cfb3dade2f7cef" -DOWNLOAD_x86_64="" -MD5SUM_x86_64="" -REQUIRES="lapack-atlas" -MAINTAINER="Serban Udrea" -EMAIL="S.Udrea@gsi.de" diff --git a/libraries/atlas/atlas.patch b/libraries/atlas/atlas.patch deleted file mode 100644 index dea4dcc0b2ee..000000000000 --- a/libraries/atlas/atlas.patch +++ /dev/null @@ -1,5072 +0,0 @@ -diff -rupN ATLAS/CONFIG/src/backend/archinfo_x86.c atlas-3.8.3/CONFIG/src/backend/archinfo_x86.c ---- ATLAS/CONFIG/src/backend/archinfo_x86.c 2009-02-18 19:47:37.000000000 +0100 -+++ atlas-3.8.3/CONFIG/src/backend/archinfo_x86.c 2009-11-12 13:47:23.777451677 +0100 -@@ -320,7 +320,7 @@ enum MACHTYPE Chip2Mach(enum CHIP chip, - iret = IntP4; - break; - case 3: -- case 4: -+ case 4: ; case 6: - iret = 
IntP4E; - break; - default: -diff -rupN ATLAS/include/atlas_lvl3.h atlas-3.8.3/include/atlas_lvl3.h ---- ATLAS/include/atlas_lvl3.h 2009-02-18 19:47:35.000000000 +0100 -+++ atlas-3.8.3/include/atlas_lvl3.h 2009-11-12 13:52:49.308496090 +0100 -@@ -126,7 +126,7 @@ - #define CPAT Mjoin(C_ATL_, PRE); - - #ifndef ATL_MaxMalloc -- #define ATL_MaxMalloc 67108864 -+ #define ATL_MaxMalloc XXX_MaxMalloc_XXX - #endif - - typedef void (*MAT2BLK)(int, int, const TYPE*, int, TYPE*, const SCALAR); -diff -rupN ATLAS/src/blas/gemm/ATL_cmmJITcp.c atlas-3.8.3/src/blas/gemm/ATL_cmmJITcp.c ---- ATLAS/src/blas/gemm/ATL_cmmJITcp.c 2009-02-18 19:47:44.000000000 +0100 -+++ atlas-3.8.3/src/blas/gemm/ATL_cmmJITcp.c 2009-11-12 12:44:34.816529051 +0100 -@@ -268,7 +268,8 @@ static void Mjoin(PATL,mmK) - { - NBmm0 = NBmm1 = NBmmX = Mjoin(PATLU,pKBmm); - if (SCALAR_IS_ZERO(beta)) -- Mjoin(PATL,gezero)(M, N, C, ldc); -+ /* Mjoin(PATL,gezero)(M, N, C, ldc); */ -+ { Mjoin(PATLU,gezero)(M, N, pC, ldpc); Mjoin(PATLU,gezero)(M, N, pC+ipc, ldpc); } - } - if (nblk) - { -diff -rupN ATLAS/src/blas/gemm/ATL_gereal2cplx.c atlas-3.8.3/src/blas/gemm/ATL_gereal2cplx.c ---- ATLAS/src/blas/gemm/ATL_gereal2cplx.c 2009-02-18 19:47:44.000000000 +0100 -+++ atlas-3.8.3/src/blas/gemm/ATL_gereal2cplx.c 2009-11-12 12:49:49.331651677 +0100 -@@ -43,7 +43,53 @@ void Mjoin(PATL,gereal2cplx) - const int ldc2 = (ldc-M)<<1; - int i, j; - -- if (ialp == ATL_rzero && ibet == ATL_rzero) -+/* -+ * Cannot read C if BETA is 0 -+ */ -+ if (rbet == ATL_rzero && ibet == ATL_rzero) -+ { -+ if (ialp == ATL_rzero) /* alpha is a real number */ -+ { -+ if (ralp == ATL_rone) /* alpha = 1.0 */ -+ { -+ for (j=0; j < N; j++, R += ldr, I += ldi, C += ldc2) -+ { -+ for (i=0; i < M; i++, C += 2) -+ { -+ *C = R[i]; -+ C[1] = I[i]; -+ } -+ } -+ } -+ else -+ { -+ for (j=0; j < N; j++, R += ldr, I += ldi, C += ldc2) -+ { -+ for (i=0; i < M; i++, C += 2) -+ { -+ *C = ralp * R[i]; -+ C[1] = ralp * I[i]; -+ } -+ } -+ } -+ } -+ else /* alpha is a complex 
number */ -+ { -+ for (j=0; j < N; j++, R += ldr, I += ldi, C += ldc2) -+ { -+ for (i=0; i < M; i++, C += 2) -+ { -+ ra = R[i]; ia = I[i]; -+ C[0] = ralp * ra - ialp * ia; -+ C[1] = ralp * ia + ialp * ra; -+ } -+ } -+ } -+ } -+/* -+ * If alpha and beta are both real numbers -+ */ -+ else if (ialp == ATL_rzero && ibet == ATL_rzero) - { - if (ralp == ATL_rone && rbet == ATL_rone) - { -diff -rupN ATLAS/tune/blas/gemm/CASES/ATL_smm14x1x84_sseCU.c atlas-3.8.3/tune/blas/gemm/CASES/ATL_smm14x1x84_sseCU.c ---- ATLAS/tune/blas/gemm/CASES/ATL_smm14x1x84_sseCU.c 2009-02-18 19:48:26.000000000 +0100 -+++ atlas-3.8.3/tune/blas/gemm/CASES/ATL_smm14x1x84_sseCU.c 2009-11-12 12:35:50.453038827 +0100 -@@ -27,6 +27,13 @@ - * POSSIBILITY OF SUCH DAMAGE. - * - */ -+#if KB > 84 -+ #error "KB cannot exceed 84!" -+#endif -+#if (KB/4)*4 != KB -+ #error "KB must be a multiple of 4!" -+#endif -+ - #ifndef ATL_GAS_x8664 - #error "This kernel requires x86-64 assembly!" - #endif -@@ -58,25 +65,25 @@ - * Integer register usage shown be these defines - */ - #define pA %rcx --#define pA10 %rbx --#define ldab %rbp --#define mldab %rdx -+#define pA10 %rbx -+#define ldab %rbp -+#define mldab %rdx - #define mldab5 %rax - #define pB %rdi - #define pC %rsi - #define incCn %r10 - #define stM %r9 - #define stN %r11 --#define pfA %r8 --#define pA5 pA --#define pB0 pB -+#define pfA %r8 -+#define pA5 pA -+#define pB0 pB - #if MB == 0 -- #define stM0 %r12 -- #define incAm %r13 -+ #define stM0 %r12 -+ #define incAm %r13 - #endif - /* rax used in 32/64 conversion */ - --#define NBso (KB*4) -+#define NBso (KB*4) - #define MBKBso (MB*KB*4) - #define NB2so (NBso+NBso) - #define NB3so (NBso+NBso+NBso) -@@ -95,22 +102,22 @@ - /* - * SSE2 register usage shown be these defines - */ --#define rA0 %xmm0 --#define rB0 %xmm1 --#define rC0 %xmm2 --#define rC1 %xmm3 --#define rC2 %xmm4 --#define rC3 %xmm5 --#define rC4 %xmm6 --#define rC5 %xmm7 --#define rC6 %xmm8 --#define rC7 %xmm9 --#define rC8 %xmm10 --#define rC9 %xmm11 
--#define rC10 %xmm12 --#define rC11 %xmm13 --#define rC12 %xmm14 --#define rC13 %xmm15 -+#define rA0 %xmm0 -+#define rB0 %xmm1 -+#define rC0 %xmm2 -+#define rC1 %xmm3 -+#define rC2 %xmm4 -+#define rC3 %xmm5 -+#define rC4 %xmm6 -+#define rC5 %xmm7 -+#define rC6 %xmm8 -+#define rC7 %xmm9 -+#define rC8 %xmm10 -+#define rC9 %xmm11 -+#define rC10 %xmm12 -+#define rC11 %xmm13 -+#define rC12 %xmm14 -+#define rC13 %xmm15 - /* - * Prefetch defines - */ -@@ -127,99 +134,99 @@ - #if MB != 0 - #define incAm $MBKBso-NB14so+176 - #endif -- .text -+ .text - .global ATL_asmdecor(ATL_USERMM) - ATL_asmdecor(ATL_USERMM): - /* - * Save callee-saved iregs - */ -- movq %rbp, -8(%rsp) -- movq %rbx, -16(%rsp) -+ movq %rbp, -8(%rsp) -+ movq %rbx, -16(%rsp) - #if MB == 0 -- movq %r12, -32(%rsp) -- movq %r13, -40(%rsp) -+ movq %r12, -32(%rsp) -+ movq %r13, -40(%rsp) - #endif - #ifdef BETAX - #define BOF -56 -- movss %xmm1, BOF(%rsp) -- movss %xmm1, BOF+4(%rsp) -- movss %xmm1, BOF+8(%rsp) -- movss %xmm1, BOF+12(%rsp) -+ movss %xmm1, BOF(%rsp) -+ movss %xmm1, BOF+4(%rsp) -+ movss %xmm1, BOF+8(%rsp) -+ movss %xmm1, BOF+12(%rsp) - #endif - /* - * pA already comes in right reg - * Initialize pB = B; pC = C; NBso = NB * sizeof; - */ -- movq %rsi, stN -- movq %rdi, %rax -- movq 16(%rsp), pC -- prefC((pC)) -- prefC(64(pC)) -- movq %r9, pB -- prefB((pB)) -- prefB(64(pB)) -- movq %rax, stM -+ movq %rsi, stN -+ movq %rdi, %rax -+ movq 16(%rsp), pC -+ prefC((pC)) -+ prefC(64(pC)) -+ movq %r9, pB -+ prefB((pB)) -+ prefB(64(pB)) -+ movq %rax, stM - /* - * stM = pA + NBNBso; stN = pB + NBNBso; - */ - #if MB == 0 -- movq stM, pfA -- imulq $NBso, pfA -- prefB(128(pB)) -- movq pfA, incAm -- addq pA5, pfA -- addq $176-NB14so, incAm -+ movq stM, pfA -+ imulq $NBso, pfA -+ prefB(128(pB)) -+ movq pfA, incAm -+ addq pA5, pfA -+ addq $176-NB14so, incAm - #else -- movq $MBKBso, pfA -- addq pA5, pfA -- prefB(128(pB)) -+ movq $MBKBso, pfA -+ addq pA5, pfA -+ prefB(128(pB)) - #endif - /* - * convert ldc to 64 bits, 
and then set incCn = (ldc - MB)*sizeof - */ -- movl 24(%rsp), %eax -- cltq -- movq %rax, incCn -- subq stM, incCn -- addq $14, incCn -+ movl 24(%rsp), %eax -+ cltq -+ movq %rax, incCn -+ subq stM, incCn -+ addq $14, incCn - #ifdef SREAL -- shl $2, incCn -+ shl $2, incCn - #else -- shl $3, incCn -- prefC(128(pC)) -- prefC(192(pC)) -+ shl $3, incCn -+ prefC(128(pC)) -+ prefC(192(pC)) - #endif - /* - * Find M/14 if MB is not set - */ - #if MB == 0 -- cmp $84, stM -- jne MB_LT84 --/* movq $84/14, stM */ -- movq $6, stM -+ cmp $84, stM -+ jne MB_LT84 -+/* movq $84/14, stM */ -+ movq $6, stM - MBFOUND: -- subq $1, stM -- movq stM, stM0 -+ subq $1, stM -+ movq stM, stM0 - #endif -- addq $120, pA5 -- addq $120, pB0 -- movq $KB*4, ldab -- movq $-KB*5*4, mldab5 -- movq $-KB*4, mldab -- subq mldab5, pA5 -- lea KB*4(pA5, ldab,4), pA10 --/* movq $NB, stN */ -+ addq $120, pA5 -+ addq $120, pB0 -+ movq $KB*4, ldab -+ movq $-KB*5*4, mldab5 -+ movq $-KB*4, mldab -+ subq mldab5, pA5 -+ lea KB*4(pA5, ldab,4), pA10 -+/* movq $NB, stN */ - - UNLOOP: - #if MB == 0 -- movq stM0, stM -- cmp $0, stM -- je MLAST -+ movq stM0, stM -+ cmp $0, stM -+ je MLAST - #else - #ifdef ATL_DivAns -- movq $ATL_DivAns-1, stM -+ movq $ATL_DivAns-1, stM - #else -- movq $MB/14-1, stM -+ movq $MB/14-1, stM - #endif - #endif - #if MB == 0 || MB > 14 -@@ -227,992 +234,992 @@ UMLOOP: - /* - * rC[0-13] = pC[0-13] * beta - */ -- ALIGN16 -+ ALIGN16 - /*UKLOOP: */ - #ifdef BETA1 -- movaps 0-120(pA10,mldab5,2), rC0 -- movaps 0-120(pB0), rB0 -- mulps rB0, rC0 -- addss (pC), rC0 -- movaps 0-120(pA5, mldab,4), rC1 -- mulps rB0, rC1 -- addss CMUL(4)(pC), rC1 -- movaps 0-120(pA10, mldab,8), rC2 -- mulps rB0, rC2 -- addss CMUL(8)(pC), rC2 -- movaps 0-120(pA5, mldab,2), rC3 -- mulps rB0, rC3 -- addss CMUL(12)(pC), rC3 -- movaps 0-120(pA5, mldab), rC4 -- mulps rB0, rC4 -- addss CMUL(16)(pC), rC4 -- movaps 0-120(pA5), rC5 -- mulps rB0, rC5 -- addss CMUL(20)(pC), rC5 -- movaps 0-120(pA5, ldab), rC6 -- mulps rB0, rC6 -- addss 
CMUL(24)(pC), rC6 -- movaps 0-120(pA5, ldab,2), rC7 -- mulps rB0, rC7 -- addss CMUL(28)(pC), rC7 -- movaps 0-120(pA10, mldab,2), rC8 -- mulps rB0, rC8 -- addss CMUL(32)(pC), rC8 -- movaps 0-120(pA5,ldab,4), rC9 -- mulps rB0, rC9 -- addss CMUL(36)(pC), rC9 -- movaps 0-120(pA10), rC10 -- mulps rB0, rC10 -- addss CMUL(40)(pC), rC10 -- movaps 0-120(pA10,ldab), rC11 -- mulps rB0, rC11 -- addss CMUL(44)(pC), rC11 -- movaps 0-120(pA10,ldab,2), rC12 -- mulps rB0, rC12 -- addss CMUL(48)(pC), rC12 -- movaps 0-120(pA5,ldab,8), rC13 -- mulps rB0, rC13 -- addss CMUL(52)(pC), rC13 -+ movaps 0-120(pA10,mldab5,2), rC0 -+ movaps 0-120(pB0), rB0 -+ mulps rB0, rC0 -+ addss (pC), rC0 -+ movaps 0-120(pA5, mldab,4), rC1 -+ mulps rB0, rC1 -+ addss CMUL(4)(pC), rC1 -+ movaps 0-120(pA10, mldab,8), rC2 -+ mulps rB0, rC2 -+ addss CMUL(8)(pC), rC2 -+ movaps 0-120(pA5, mldab,2), rC3 -+ mulps rB0, rC3 -+ addss CMUL(12)(pC), rC3 -+ movaps 0-120(pA5, mldab), rC4 -+ mulps rB0, rC4 -+ addss CMUL(16)(pC), rC4 -+ movaps 0-120(pA5), rC5 -+ mulps rB0, rC5 -+ addss CMUL(20)(pC), rC5 -+ movaps 0-120(pA5, ldab), rC6 -+ mulps rB0, rC6 -+ addss CMUL(24)(pC), rC6 -+ movaps 0-120(pA5, ldab,2), rC7 -+ mulps rB0, rC7 -+ addss CMUL(28)(pC), rC7 -+ movaps 0-120(pA10, mldab,2), rC8 -+ mulps rB0, rC8 -+ addss CMUL(32)(pC), rC8 -+ movaps 0-120(pA5,ldab,4), rC9 -+ mulps rB0, rC9 -+ addss CMUL(36)(pC), rC9 -+ movaps 0-120(pA10), rC10 -+ mulps rB0, rC10 -+ addss CMUL(40)(pC), rC10 -+ movaps 0-120(pA10,ldab), rC11 -+ mulps rB0, rC11 -+ addss CMUL(44)(pC), rC11 -+ movaps 0-120(pA10,ldab,2), rC12 -+ mulps rB0, rC12 -+ addss CMUL(48)(pC), rC12 -+ movaps 0-120(pA5,ldab,8), rC13 -+ mulps rB0, rC13 -+ addss CMUL(52)(pC), rC13 - #else -- movaps 0-120(pA10,mldab5,2), rC0 -- movaps 0-120(pB0), rC13 -- mulps rC13, rC0 -- movaps 0-120(pA5, mldab,4), rC1 -- mulps rC13, rC1 -- movaps 0-120(pA10, mldab,8), rC2 -- mulps rC13, rC2 -- movaps 0-120(pA5, mldab,2), rC3 -- mulps rC13, rC3 -- movaps 0-120(pA5, mldab), rC4 -- mulps rC13, rC4 
-- movaps 0-120(pA5), rC5 -- mulps rC13, rC5 -- movaps 0-120(pA5, ldab), rC6 -- mulps rC13, rC6 -- movaps 0-120(pA5, ldab,2), rC7 -- mulps rC13, rC7 -- movaps 0-120(pA10, mldab,2), rC8 -- mulps rC13, rC8 -- movaps 0-120(pA5,ldab,4), rC9 -- mulps rC13, rC9 -- movaps 0-120(pA10), rC10 -- mulps rC13, rC10 -- movaps 0-120(pA10,ldab), rC11 -- mulps rC13, rC11 -- movaps 0-120(pA10,ldab,2), rC12 -- mulps rC13, rC12 -- mulps 0-120(pA5,ldab,8), rC13 -+ movaps 0-120(pA10,mldab5,2), rC0 -+ movaps 0-120(pB0), rC13 -+ mulps rC13, rC0 -+ movaps 0-120(pA5, mldab,4), rC1 -+ mulps rC13, rC1 -+ movaps 0-120(pA10, mldab,8), rC2 -+ mulps rC13, rC2 -+ movaps 0-120(pA5, mldab,2), rC3 -+ mulps rC13, rC3 -+ movaps 0-120(pA5, mldab), rC4 -+ mulps rC13, rC4 -+ movaps 0-120(pA5), rC5 -+ mulps rC13, rC5 -+ movaps 0-120(pA5, ldab), rC6 -+ mulps rC13, rC6 -+ movaps 0-120(pA5, ldab,2), rC7 -+ mulps rC13, rC7 -+ movaps 0-120(pA10, mldab,2), rC8 -+ mulps rC13, rC8 -+ movaps 0-120(pA5,ldab,4), rC9 -+ mulps rC13, rC9 -+ movaps 0-120(pA10), rC10 -+ mulps rC13, rC10 -+ movaps 0-120(pA10,ldab), rC11 -+ mulps rC13, rC11 -+ movaps 0-120(pA10,ldab,2), rC12 -+ mulps rC13, rC12 -+ mulps 0-120(pA5,ldab,8), rC13 - #endif - - #if KB > 4 -- movaps 16-120(pA10,mldab5,2), rA0 -- movaps 16-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 16-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 16-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 16-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 16-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 16-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 16-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 16-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 16-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 16-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 16-120(pA10), rA0 -- mulps rB0, rA0 -- addps 
rA0, rC10 -- movaps 16-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 16-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 16-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 16-120(pA10,mldab5,2), rA0 -+ movaps 16-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 16-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 16-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 16-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 16-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 16-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 16-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 16-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 16-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 16-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 16-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 16-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 16-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 16-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 8 -- movaps 32-120(pA10,mldab5,2), rA0 -- movaps 32-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 32-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 32-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 32-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 32-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 32-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 32-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 32-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 32-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 32-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 32-120(pA10), rA0 -- mulps rB0, rA0 
-- addps rA0, rC10 -- movaps 32-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 32-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 32-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 32-120(pA10,mldab5,2), rA0 -+ movaps 32-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 32-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 32-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 32-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 32-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 32-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 32-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 32-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 32-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 32-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 32-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 32-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 32-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 32-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 12 -- movaps 48-120(pA10,mldab5,2), rA0 -- movaps 48-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 48-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 48-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 48-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 48-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 48-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 48-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 48-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 48-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 48-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 48-120(pA10), rA0 -- mulps 
rB0, rA0 -- addps rA0, rC10 -- movaps 48-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 48-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 48-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 48-120(pA10,mldab5,2), rA0 -+ movaps 48-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 48-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 48-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 48-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 48-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 48-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 48-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 48-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 48-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 48-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 48-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 48-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 48-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 48-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 16 -- movaps 64-120(pA10,mldab5,2), rA0 -- movaps 64-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 64-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 64-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 64-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 64-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 64-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 64-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 64-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 64-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 64-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 64-120(pA10), rA0 
-- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 64-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 64-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 64-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 64-120(pA10,mldab5,2), rA0 -+ movaps 64-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 64-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 64-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 64-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 64-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 64-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 64-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 64-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 64-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 64-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 64-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 64-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 64-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 64-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 20 -- movaps 80-120(pA10,mldab5,2), rA0 -- movaps 80-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 80-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 80-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 80-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 80-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 80-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 80-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 80-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 80-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 80-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 
80-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 80-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 80-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 80-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 80-120(pA10,mldab5,2), rA0 -+ movaps 80-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 80-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 80-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 80-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 80-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 80-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 80-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 80-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 80-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 80-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 80-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 80-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 80-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 80-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 24 -- movaps 96-120(pA10,mldab5,2), rA0 -- movaps 96-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 96-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 96-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 96-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 96-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 96-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 96-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 96-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 96-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 96-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 
-- movaps 96-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 96-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 96-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 96-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 96-120(pA10,mldab5,2), rA0 -+ movaps 96-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 96-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 96-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 96-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 96-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 96-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 96-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 96-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 96-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 96-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 96-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 96-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 96-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 96-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 28 -- movaps 112-120(pA10,mldab5,2), rA0 -- movaps 112-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 112-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 112-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 112-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 112-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 112-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 112-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 112-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 112-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 112-120(pA5,ldab,4), rA0 -- mulps rB0, 
rA0 -- addps rA0, rC9 -- movaps 112-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 112-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 112-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 112-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 112-120(pA10,mldab5,2), rA0 -+ movaps 112-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 112-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 112-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 112-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 112-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 112-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 112-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 112-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 112-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 112-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 112-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 112-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 112-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 112-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - #ifndef SREAL -- pref2((pfA)) -- pref2(64(pfA)) -+ pref2((pfA)) -+ pref2(64(pfA)) - #endif - - #if KB > 32 -- movaps 128-120(pA10,mldab5,2), rA0 -- movaps 128-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 128-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 128-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 128-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 128-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 128-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 128-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 128-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps 
rA0, rC7 -- movaps 128-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 128-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 128-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 128-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 128-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 128-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 128-120(pA10,mldab5,2), rA0 -+ movaps 128-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 128-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 128-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 128-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 128-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 128-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 128-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 128-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 128-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 128-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 128-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 128-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 128-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 128-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 36 -- movaps 144-120(pA10,mldab5,2), rA0 -- movaps 144-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 144-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 144-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 144-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 144-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 144-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 144-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 144-120(pA5, 
ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 144-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 144-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 144-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 144-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 144-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 144-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 144-120(pA10,mldab5,2), rA0 -+ movaps 144-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 144-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 144-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 144-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 144-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 144-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 144-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 144-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 144-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 144-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 144-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 144-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 144-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 144-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 40 -- movaps 160-120(pA10,mldab5,2), rA0 -- movaps 160-120(pB0), rB0 -- mulps rB0, rA0 -- addq $176, pB0 -- addps rA0, rC0 -- movaps 160-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 160-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 160-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 160-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 160-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 160-120(pA5, ldab), rA0 
-- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 160-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 160-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 160-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 160-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 160-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 160-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addq $176, pA10 -- addps rA0, rC12 -- mulps 160-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -- addq $176, pA5 -+ movaps 160-120(pA10,mldab5,2), rA0 -+ movaps 160-120(pB0), rB0 -+ mulps rB0, rA0 -+ addq $176, pB0 -+ addps rA0, rC0 -+ movaps 160-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 160-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 160-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 160-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 160-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 160-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 160-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 160-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 160-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 160-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 160-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 160-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addq $176, pA10 -+ addps rA0, rC12 -+ mulps 160-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 -+ addq $176, pA5 - #else -- addq $176, pB0 -- addq $176, pA10 -- addq $176, pA5 -+ addq $176, pB0 -+ addq $176, pA10 -+ addq $176, pA5 - #endif - - #if KB > 44 -- movaps 0-120(pA10,mldab5,2), rA0 -- movaps 0-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 0-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 0-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 
-- movaps 0-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 0-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 0-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 0-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 0-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 0-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 0-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 0-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 0-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 0-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 0-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 0-120(pA10,mldab5,2), rA0 -+ movaps 0-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 0-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 0-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 0-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 0-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 0-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 0-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 0-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 0-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 0-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 0-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 0-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 0-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 0-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 48 -- movaps 16-120(pA10,mldab5,2), rA0 -- movaps 16-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 16-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 16-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 
16-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 16-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 16-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 16-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 16-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 16-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 16-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 16-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 16-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 16-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 16-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 16-120(pA10,mldab5,2), rA0 -+ movaps 16-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 16-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 16-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 16-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 16-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 16-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 16-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 16-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 16-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 16-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 16-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 16-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 16-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 16-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 52 -- movaps 32-120(pA10,mldab5,2), rA0 -- movaps 32-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 32-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 32-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 
-- movaps 32-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 32-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 32-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 32-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 32-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 32-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 32-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 32-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 32-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 32-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 32-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 32-120(pA10,mldab5,2), rA0 -+ movaps 32-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 32-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 32-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 32-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 32-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 32-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 32-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 32-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 32-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 32-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 32-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 32-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 32-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 32-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 56 -- movaps 48-120(pA10,mldab5,2), rA0 -- movaps 48-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 48-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 48-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps 
rA0, rC2 -- movaps 48-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 48-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 48-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 48-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 48-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 48-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 48-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 48-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 48-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 48-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 48-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 48-120(pA10,mldab5,2), rA0 -+ movaps 48-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 48-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 48-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 48-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 48-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 48-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 48-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 48-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 48-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 48-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 48-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 48-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 48-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 48-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 60 -- movaps 64-120(pA10,mldab5,2), rA0 -- movaps 64-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 64-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 64-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 
-- addps rA0, rC2 -- movaps 64-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 64-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 64-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 64-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 64-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 64-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 64-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 64-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 64-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 64-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 64-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 64-120(pA10,mldab5,2), rA0 -+ movaps 64-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 64-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 64-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 64-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 64-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 64-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 64-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 64-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 64-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 64-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 64-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 64-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 64-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 64-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 64 -- movaps 80-120(pA10,mldab5,2), rA0 -- movaps 80-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 80-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 80-120(pA10, mldab,8), rA0 -- mulps 
rB0, rA0 -- addps rA0, rC2 -- movaps 80-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 80-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 80-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 80-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 80-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 80-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 80-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 80-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 80-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 80-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 80-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 80-120(pA10,mldab5,2), rA0 -+ movaps 80-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 80-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 80-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 80-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 80-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 80-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 80-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 80-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 80-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 80-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 80-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 80-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 80-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 80-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 68 -- movaps 96-120(pA10,mldab5,2), rA0 -- movaps 96-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 96-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 96-120(pA10, mldab,8), rA0 
-- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 96-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 96-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 96-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 96-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 96-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 96-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 96-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 96-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 96-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 96-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 96-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 96-120(pA10,mldab5,2), rA0 -+ movaps 96-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 96-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 96-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 96-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 96-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 96-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 96-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 96-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 96-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 96-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 96-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 96-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 96-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 96-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 72 -- movaps 112-120(pA10,mldab5,2), rA0 -- movaps 112-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 112-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 112-120(pA10, 
mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 112-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 112-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 112-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 112-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 112-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 112-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 112-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 112-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 112-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 112-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 112-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 112-120(pA10,mldab5,2), rA0 -+ movaps 112-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 112-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 112-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 112-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 112-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 112-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 112-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 112-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 112-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 112-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 112-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 112-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 112-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 112-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 76 -- movaps 128-120(pA10,mldab5,2), rA0 -- movaps 128-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 128-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- 
addps rA0, rC1 -- movaps 128-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 128-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 128-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 128-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 128-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 128-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 128-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 128-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 128-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 128-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 128-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 128-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 128-120(pA10,mldab5,2), rA0 -+ movaps 128-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 128-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 128-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 128-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 128-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 128-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 128-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 128-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 128-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 128-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 128-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 128-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 128-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 128-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 80 -- movaps 144-120(pA10,mldab5,2), rA0 -- movaps 144-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 
144-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 144-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 144-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 144-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 144-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 144-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 144-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 144-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 144-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 144-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 144-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 144-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 144-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 144-120(pA10,mldab5,2), rA0 -+ movaps 144-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 144-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 144-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 144-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 144-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 144-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 144-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 144-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 144-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 144-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 144-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 144-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 144-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 144-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - /*UKLOOP */ -@@ -1220,234 +1227,234 @@ UMLOOP: - * Get these bastard things summed 
up correctly - */ - -- /* rC0 = c0a c0b c0c c0d */ -- /* rC1 = c1a c1b c1c c1d */ -- /* rC2 = c2a c2b c2c c2d */ -- /* rC3 = c3a c3b c3c c3d */ -+ /* rC0 = c0a c0b c0c c0d */ -+ /* rC1 = c1a c1b c1c c1d */ -+ /* rC2 = c2a c2b c2c c2d */ -+ /* rC3 = c3a c3b c3c c3d */ - /* */ -- movaps rC2, rB0 /* rB0 = c2a c2b c2c c2d */ -- prefC((pC)) -- prefC(64(pC)) -- movaps rC0, rA0 /* rA0 = c0a c0b c0c c0d */ -- unpckhps rC3, rB0 /* rB0 = c2c c3c c2d c3d */ -- unpckhps rC1, rA0 /* rA0 = c0c c1c c0d c1d */ -- unpcklps rC3, rC2 /* rC2 = c2a c3a c2b c3b */ -- movlhps rB0, rC3 /* rC3 = c3a c3b c2c c3c */ -- unpcklps rC1, rC0 /* rC0 = c0a c1a c0b c1b */ -- movhlps rA0, rC3 /* rC3 = c0d c1d c2c c3c */ -- movlhps rC2, rA0 /* rA0 = c0c c1c c2a c3a */ -- movhlps rC0, rB0 /* rB0 = c0b c1b c2d c3d */ -- addps rA0, rC3 /* rC3 = c0cd c1cd c2ac c3ac */ -- movlhps rC0, rC1 /* rC1 = c1a c1b c0a c1a */ -- movhlps rC1, rC2 /* rC2 = c0a c1a c2b c3b */ -- movaps rC4, rA0 /* rA0 = c4a c4b c4c c4d */ -- addps rB0, rC2 /* rC2 = c0ab c1ab c2bd c3bd */ -- movaps rC6, rB0 /* rB0 = c6a c6b c6c c6d */ -- addps rC2, rC3 /* rC3 = c0abcd c1abcd c2bdac c3bdac */ -- -- -- /* rC4 = c4a c4b c4c c4d */ -- /* rC5 = c5a c5b c5c c5d */ -- /* rC6 = c6a c6b c6c c6d */ -- /* rC7 = c7a c7b c7c c7d */ -- /* rC8 = c08a c08b c08c c08d */ -- /* rC9 = c09a c09b c09c c09d */ -- /* rC10 = c10a c10b c10c c10d */ -- /* rC11 = c11a c11b c11c c11d */ -- /* rC12 = c12a c12b c12c c12d */ -- /* rC13 = c13a c13b c13c c13d */ -+ movaps rC2, rB0 /* rB0 = c2a c2b c2c c2d */ -+ prefC((pC)) -+ prefC(64(pC)) -+ movaps rC0, rA0 /* rA0 = c0a c0b c0c c0d */ -+ unpckhps rC3, rB0 /* rB0 = c2c c3c c2d c3d */ -+ unpckhps rC1, rA0 /* rA0 = c0c c1c c0d c1d */ -+ unpcklps rC3, rC2 /* rC2 = c2a c3a c2b c3b */ -+ movlhps rB0, rC3 /* rC3 = c3a c3b c2c c3c */ -+ unpcklps rC1, rC0 /* rC0 = c0a c1a c0b c1b */ -+ movhlps rA0, rC3 /* rC3 = c0d c1d c2c c3c */ -+ movlhps rC2, rA0 /* rA0 = c0c c1c c2a c3a */ -+ movhlps rC0, rB0 /* rB0 = c0b c1b c2d c3d */ -+ 
addps rA0, rC3 /* rC3 = c0cd c1cd c2ac c3ac */ -+ movlhps rC0, rC1 /* rC1 = c1a c1b c0a c1a */ -+ movhlps rC1, rC2 /* rC2 = c0a c1a c2b c3b */ -+ movaps rC4, rA0 /* rA0 = c4a c4b c4c c4d */ -+ addps rB0, rC2 /* rC2 = c0ab c1ab c2bd c3bd */ -+ movaps rC6, rB0 /* rB0 = c6a c6b c6c c6d */ -+ addps rC2, rC3 /* rC3 = c0abcd c1abcd c2bdac c3bdac */ -+ -+ -+ /* rC4 = c4a c4b c4c c4d */ -+ /* rC5 = c5a c5b c5c c5d */ -+ /* rC6 = c6a c6b c6c c6d */ -+ /* rC7 = c7a c7b c7c c7d */ -+ /* rC8 = c08a c08b c08c c08d */ -+ /* rC9 = c09a c09b c09c c09d */ -+ /* rC10 = c10a c10b c10c c10d */ -+ /* rC11 = c11a c11b c11c c11d */ -+ /* rC12 = c12a c12b c12c c12d */ -+ /* rC13 = c13a c13b c13c c13d */ - /* */ -- movaps rC10, rC0 /* rC0 = c10a c10b c10c c10d */ -- prefC(128(pC)) -+ movaps rC10, rC0 /* rC0 = c10a c10b c10c c10d */ -+ prefC(128(pC)) - #ifdef SREAL -- pref2((pfA)) -+ pref2((pfA)) - #else -- prefC(192(pC)) -+ prefC(192(pC)) - #endif -- movaps rC8 , rC1 /* rC1 = c08a c08b c08c c08d */ -- movaps rC12, rC2 /* rC2 = c12a c12b c12c c12d */ -- unpckhps rC7, rB0 /* rB0 = c6c c7c c6d c7d */ -- unpckhps rC5, rA0 /* rA0 = c4c c5c c4d c5d */ -- unpcklps rC7, rC6 /* rC6 = c6a c7a c6b c7b */ -- unpckhps rC11, rC0 /* rC0 = c10c c11c c10d c11d */ -- unpckhps rC9 , rC1 /* rC1 = c08c c09c c08d c09d */ -- movlhps rB0, rC7 /* rC7 = c7a c7b c6c c7c */ -- unpcklps rC5, rC4 /* rC4 = c4a c5a c4b c5b */ -- movhlps rA0, rC7 /* rC7 = c4d c5d c6c c7c */ -- movlhps rC6, rA0 /* rA0 = c4c c5c c6a c7a */ -- unpcklps rC11, rC10 /* rC10 = c10a c11a c10b c11b */ -- movhlps rC4, rB0 /* rB0 = c4b c5b c6d c7d */ -- movlhps rC0, rC11 /* rC11 = c11a c11b c10c c11c */ -- addps rA0, rC7 /* rC7 = c4cd c5cd c6ac c7ac */ -+ movaps rC8 , rC1 /* rC1 = c08a c08b c08c c08d */ -+ movaps rC12, rC2 /* rC2 = c12a c12b c12c c12d */ -+ unpckhps rC7, rB0 /* rB0 = c6c c7c c6d c7d */ -+ unpckhps rC5, rA0 /* rA0 = c4c c5c c4d c5d */ -+ unpcklps rC7, rC6 /* rC6 = c6a c7a c6b c7b */ -+ unpckhps rC11, rC0 /* rC0 = c10c c11c c10d c11d 
*/ -+ unpckhps rC9 , rC1 /* rC1 = c08c c09c c08d c09d */ -+ movlhps rB0, rC7 /* rC7 = c7a c7b c6c c7c */ -+ unpcklps rC5, rC4 /* rC4 = c4a c5a c4b c5b */ -+ movhlps rA0, rC7 /* rC7 = c4d c5d c6c c7c */ -+ movlhps rC6, rA0 /* rA0 = c4c c5c c6a c7a */ -+ unpcklps rC11, rC10 /* rC10 = c10a c11a c10b c11b */ -+ movhlps rC4, rB0 /* rB0 = c4b c5b c6d c7d */ -+ movlhps rC0, rC11 /* rC11 = c11a c11b c10c c11c */ -+ addps rA0, rC7 /* rC7 = c4cd c5cd c6ac c7ac */ - #ifdef BETAX - #ifdef SREAL -- movups (pC), rA0 -- movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */ -- movups 16(pC), rC4 -- unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */ -- movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */ -- movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */ -- movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */ -- movups 32(pC), rC5 -- movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */ -- unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */ -- addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */ -- movlps 48(pC), rC1 -- addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */ -- movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */ -- unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */ -- movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */ -- addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */ -- pref2(64(pfA)) -- mulps BOF(%rsp), rA0 -- addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */ -- mulps BOF(%rsp), rC4 -- addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */ -- mulps BOF(%rsp), rC5 -- addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */ -- mulps BOF(%rsp), rC1 -+ movups (pC), rA0 -+ movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */ -+ movups 16(pC), rC4 -+ unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */ -+ movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */ -+ movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */ -+ movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */ -+ movups 32(pC), rC5 -+ movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */ -+ unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */ -+ addps rC1, 
rC11 /* rC11 = c08cd c09cd c10ac c11ac */ -+ movlps 48(pC), rC1 -+ addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */ -+ movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */ -+ unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */ -+ movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */ -+ addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */ -+ pref2(64(pfA)) -+ mulps BOF(%rsp), rA0 -+ addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */ -+ mulps BOF(%rsp), rC4 -+ addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */ -+ mulps BOF(%rsp), rC5 -+ addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */ -+ mulps BOF(%rsp), rC1 - - /* */ - -- movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */ -- addps rA0, rC3 -- addq $68, pfA -- addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */ -- addps rC4, rC7 -- addps rC5, rC11 -- addps rC1, rC12 -+ movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */ -+ addps rA0, rC3 -+ addq $68, pfA -+ addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */ -+ addps rC4, rC7 -+ addps rC5, rC11 -+ addps rC1, rC12 - #else /* BETA = X, complex type */ -- movups (pC), rA0 -- movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */ -- movups 16(pC), rC4 -- unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */ -- shufps $0x88, rC4, rA0 /* rA0 = c0 c1 c2 c3 */ -- movups 32(pC), rC4 /* rC4 = c4 X c5 X */ -- movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */ -- movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */ -- movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */ -- movups 48(pC), rC5 /* rC5 = c6 X c7 X */ -- movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */ -- unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */ -- addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */ -- shufps $0x88, rC5, rC4 /* rC4 = c4 c5 c6 c7 */ -- movups 64(pC), rC5 /* rC5 = c8 X c9 X */ -- movups 80(pC), rC1 /* rC1 = c10 X c11 X */ -- addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */ -- shufps $0x88, rC1, rC5 /* rC5 = c8 c9 c10 c11 */ -- movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */ -- movss 96(pC), rC1 -- unpckhps rC13, 
rC12 /* rC12 = c12c c13c c12d c13d */ -- movss 104(pC), rB0 -- movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */ -- unpcklps rB0, rC1 -- addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */ -- prefC(256(pC)) -- mulps BOF(%rsp), rA0 -- addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */ -- mulps BOF(%rsp), rC4 -- addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */ -- mulps BOF(%rsp), rC5 -- addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */ -- mulps BOF(%rsp), rC1 -+ movups (pC), rA0 -+ movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */ -+ movups 16(pC), rC4 -+ unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */ -+ shufps $0x88, rC4, rA0 /* rA0 = c0 c1 c2 c3 */ -+ movups 32(pC), rC4 /* rC4 = c4 X c5 X */ -+ movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */ -+ movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */ -+ movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */ -+ movups 48(pC), rC5 /* rC5 = c6 X c7 X */ -+ movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */ -+ unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */ -+ addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */ -+ shufps $0x88, rC5, rC4 /* rC4 = c4 c5 c6 c7 */ -+ movups 64(pC), rC5 /* rC5 = c8 X c9 X */ -+ movups 80(pC), rC1 /* rC1 = c10 X c11 X */ -+ addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */ -+ shufps $0x88, rC1, rC5 /* rC5 = c8 c9 c10 c11 */ -+ movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */ -+ movss 96(pC), rC1 -+ unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */ -+ movss 104(pC), rB0 -+ movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */ -+ unpcklps rB0, rC1 -+ addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */ -+ prefC(256(pC)) -+ mulps BOF(%rsp), rA0 -+ addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */ -+ mulps BOF(%rsp), rC4 -+ addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */ -+ mulps BOF(%rsp), rC5 -+ addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */ -+ mulps BOF(%rsp), rC1 - - /* */ - -- movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */ -- addps rA0, rC3 -- 
prefC(192(pC)) -- addq $68, pfA -- addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */ -- addps rC4, rC7 -- addps rC5, rC11 -- addps rC1, rC12 -+ movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */ -+ addps rA0, rC3 -+ prefC(192(pC)) -+ addq $68, pfA -+ addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */ -+ addps rC4, rC7 -+ addps rC5, rC11 -+ addps rC1, rC12 - #endif - - #else -- movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */ -- unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */ -- movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */ -- movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */ -- movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */ -- movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */ -- unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */ -- addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */ -- addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */ -- movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */ -- unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */ -- movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */ -- addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */ -+ movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */ -+ unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */ -+ movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */ -+ movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */ -+ movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */ -+ movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */ -+ unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */ -+ addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */ -+ addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */ -+ movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */ -+ unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */ -+ movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */ -+ addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */ - #ifdef SREAL -- pref2(64(pfA)) -+ pref2(64(pfA)) - #else -- prefC(256(pC)) -+ prefC(256(pC)) - #endif -- addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */ -- addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */ -- addps rC10, rC11 /* 
rC11 = c08abcd c09abcd c10bdac c11bdac */ -+ addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */ -+ addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */ -+ addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */ - - /* */ - -- movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */ -+ movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */ - #ifndef SREAL -- prefC(192(pC)) -+ prefC(192(pC)) - #endif -- addq $68, pfA -- addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */ -+ addq $68, pfA -+ addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */ - - #endif - /* - * Write results back to C; pC += 14; - */ - #ifdef SREAL -- movups rC3, (pC) -- movups rC7, 16(pC) -- movups rC11, 32(pC) -- movlps rC12, 48(pC) -- addq $56, pC -+ movups rC3, (pC) -+ movups rC7, 16(pC) -+ movups rC11, 32(pC) -+ movlps rC12, 48(pC) -+ addq $56, pC - #else -- movss rC3, (pC) -- movss rC7, 32(pC) -- movhlps rC3, rC0 -- movhlps rC7, rC6 -- movss rC0, 16(pC) -- movss rC6, 48(pC) -- shufps $0x55, rC3, rC3 -- shufps $0x55, rC7, rC7 -- movss rC3, 8(pC) -- movss rC7, 40(pC) -- shufps $0x55, rC0, rC0 -- shufps $0x55, rC6, rC6 -- movss rC0, 24(pC) -- movss rC6, 56(pC) -- -- movss rC11, 64(pC) -- movhlps rC11, rC2 -- movss rC12, 96(pC) -- movss rC2, 80(pC) -- shufps $0x55, rC11, rC11 -- shufps $0x55, rC12, rC12 -- movss rC11, 72(pC) -- shufps $0x55, rC2, rC2 -- movss rC12, 104(pC) -- movss rC2, 88(pC) -+ movss rC3, (pC) -+ movss rC7, 32(pC) -+ movhlps rC3, rC0 -+ movhlps rC7, rC6 -+ movss rC0, 16(pC) -+ movss rC6, 48(pC) -+ shufps $0x55, rC3, rC3 -+ shufps $0x55, rC7, rC7 -+ movss rC3, 8(pC) -+ movss rC7, 40(pC) -+ shufps $0x55, rC0, rC0 -+ shufps $0x55, rC6, rC6 -+ movss rC0, 24(pC) -+ movss rC6, 56(pC) -+ -+ movss rC11, 64(pC) -+ movhlps rC11, rC2 -+ movss rC12, 96(pC) -+ movss rC2, 80(pC) -+ shufps $0x55, rC11, rC11 -+ shufps $0x55, rC12, rC12 -+ movss rC11, 72(pC) -+ shufps $0x55, rC2, rC2 -+ movss rC12, 104(pC) -+ movss rC2, 88(pC) - -- addq $112, pC -+ addq $112, pC - #endif - /* - * Write results back to C - */ 
-- addq $NB14so-176, pA5 -- addq $NB14so-176, pA10 -- subq $176, pB0 -+ addq $NB14so-176, pA5 -+ addq $NB14so-176, pA10 -+ subq $176, pB0 - /* - * pC += 14; pA += 14*NB; pB -= NB; - */ - /* - * while (pA != stM); - */ -- subq $1, stM -- jne UMLOOP -+ subq $1, stM -+ jne UMLOOP - #endif - - /* -@@ -1459,994 +1466,994 @@ MLAST: - #endif - /*UKLOOP: */ - #ifdef BETA1 -- movaps 0-120(pA10,mldab5,2), rC0 -- movaps 0-120(pB0), rB0 -- mulps rB0, rC0 -- addss (pC), rC0 -- movaps 0-120(pA5, mldab,4), rC1 -- mulps rB0, rC1 -- addss CMUL(4)(pC), rC1 -- movaps 0-120(pA10, mldab,8), rC2 -- mulps rB0, rC2 -- addss CMUL(8)(pC), rC2 -- movaps 0-120(pA5, mldab,2), rC3 -- mulps rB0, rC3 -- addss CMUL(12)(pC), rC3 -- movaps 0-120(pA5, mldab), rC4 -- mulps rB0, rC4 -- addss CMUL(16)(pC), rC4 -- movaps 0-120(pA5), rC5 -- mulps rB0, rC5 -- addss CMUL(20)(pC), rC5 -- movaps 0-120(pA5, ldab), rC6 -- mulps rB0, rC6 -- addss CMUL(24)(pC), rC6 -- movaps 0-120(pA5, ldab,2), rC7 -- mulps rB0, rC7 -- addss CMUL(28)(pC), rC7 -- movaps 0-120(pA10, mldab,2), rC8 -- mulps rB0, rC8 -- addss CMUL(32)(pC), rC8 -- movaps 0-120(pA5,ldab,4), rC9 -- mulps rB0, rC9 -- addss CMUL(36)(pC), rC9 -- movaps 0-120(pA10), rC10 -- mulps rB0, rC10 -- addss CMUL(40)(pC), rC10 -- movaps 0-120(pA10,ldab), rC11 -- mulps rB0, rC11 -- addss CMUL(44)(pC), rC11 -- movaps 0-120(pA10,ldab,2), rC12 -- mulps rB0, rC12 -- addss CMUL(48)(pC), rC12 -- movaps 0-120(pA5,ldab,8), rC13 -- mulps rB0, rC13 -- addss CMUL(52)(pC), rC13 -+ movaps 0-120(pA10,mldab5,2), rC0 -+ movaps 0-120(pB0), rB0 -+ mulps rB0, rC0 -+ addss (pC), rC0 -+ movaps 0-120(pA5, mldab,4), rC1 -+ mulps rB0, rC1 -+ addss CMUL(4)(pC), rC1 -+ movaps 0-120(pA10, mldab,8), rC2 -+ mulps rB0, rC2 -+ addss CMUL(8)(pC), rC2 -+ movaps 0-120(pA5, mldab,2), rC3 -+ mulps rB0, rC3 -+ addss CMUL(12)(pC), rC3 -+ movaps 0-120(pA5, mldab), rC4 -+ mulps rB0, rC4 -+ addss CMUL(16)(pC), rC4 -+ movaps 0-120(pA5), rC5 -+ mulps rB0, rC5 -+ addss CMUL(20)(pC), rC5 -+ movaps 0-120(pA5, 
ldab), rC6 -+ mulps rB0, rC6 -+ addss CMUL(24)(pC), rC6 -+ movaps 0-120(pA5, ldab,2), rC7 -+ mulps rB0, rC7 -+ addss CMUL(28)(pC), rC7 -+ movaps 0-120(pA10, mldab,2), rC8 -+ mulps rB0, rC8 -+ addss CMUL(32)(pC), rC8 -+ movaps 0-120(pA5,ldab,4), rC9 -+ mulps rB0, rC9 -+ addss CMUL(36)(pC), rC9 -+ movaps 0-120(pA10), rC10 -+ mulps rB0, rC10 -+ addss CMUL(40)(pC), rC10 -+ movaps 0-120(pA10,ldab), rC11 -+ mulps rB0, rC11 -+ addss CMUL(44)(pC), rC11 -+ movaps 0-120(pA10,ldab,2), rC12 -+ mulps rB0, rC12 -+ addss CMUL(48)(pC), rC12 -+ movaps 0-120(pA5,ldab,8), rC13 -+ mulps rB0, rC13 -+ addss CMUL(52)(pC), rC13 - #else -- movaps 0-120(pA10,mldab5,2), rC0 -- movaps 0-120(pB0), rC13 -- mulps rC13, rC0 -- movaps 0-120(pA5, mldab,4), rC1 -- mulps rC13, rC1 -- movaps 0-120(pA10, mldab,8), rC2 -- mulps rC13, rC2 -- movaps 0-120(pA5, mldab,2), rC3 -- mulps rC13, rC3 -- movaps 0-120(pA5, mldab), rC4 -- mulps rC13, rC4 -- movaps 0-120(pA5), rC5 -- mulps rC13, rC5 -- movaps 0-120(pA5, ldab), rC6 -- mulps rC13, rC6 -- movaps 0-120(pA5, ldab,2), rC7 -- mulps rC13, rC7 -- movaps 0-120(pA10, mldab,2), rC8 -- mulps rC13, rC8 -- movaps 0-120(pA5,ldab,4), rC9 -- mulps rC13, rC9 -- movaps 0-120(pA10), rC10 -- mulps rC13, rC10 -- movaps 0-120(pA10,ldab), rC11 -- mulps rC13, rC11 -- movaps 0-120(pA10,ldab,2), rC12 -- mulps rC13, rC12 -- mulps 0-120(pA5,ldab,8), rC13 -+ movaps 0-120(pA10,mldab5,2), rC0 -+ movaps 0-120(pB0), rC13 -+ mulps rC13, rC0 -+ movaps 0-120(pA5, mldab,4), rC1 -+ mulps rC13, rC1 -+ movaps 0-120(pA10, mldab,8), rC2 -+ mulps rC13, rC2 -+ movaps 0-120(pA5, mldab,2), rC3 -+ mulps rC13, rC3 -+ movaps 0-120(pA5, mldab), rC4 -+ mulps rC13, rC4 -+ movaps 0-120(pA5), rC5 -+ mulps rC13, rC5 -+ movaps 0-120(pA5, ldab), rC6 -+ mulps rC13, rC6 -+ movaps 0-120(pA5, ldab,2), rC7 -+ mulps rC13, rC7 -+ movaps 0-120(pA10, mldab,2), rC8 -+ mulps rC13, rC8 -+ movaps 0-120(pA5,ldab,4), rC9 -+ mulps rC13, rC9 -+ movaps 0-120(pA10), rC10 -+ mulps rC13, rC10 -+ movaps 0-120(pA10,ldab), rC11 -+ 
mulps rC13, rC11 -+ movaps 0-120(pA10,ldab,2), rC12 -+ mulps rC13, rC12 -+ mulps 0-120(pA5,ldab,8), rC13 - #endif - - #if KB > 4 -- movaps 16-120(pA10,mldab5,2), rA0 -- movaps 16-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 16-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 16-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 16-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 16-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 16-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 16-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 16-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 16-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 16-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 16-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 16-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 16-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 16-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 16-120(pA10,mldab5,2), rA0 -+ movaps 16-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 16-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 16-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 16-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 16-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 16-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 16-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 16-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 16-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 16-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 16-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 16-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 
16-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 16-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 8 -- movaps 32-120(pA10,mldab5,2), rA0 -- movaps 32-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 32-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 32-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 32-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 32-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 32-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 32-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 32-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 32-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 32-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 32-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 32-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 32-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 32-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 32-120(pA10,mldab5,2), rA0 -+ movaps 32-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 32-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 32-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 32-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 32-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 32-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 32-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 32-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 32-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 32-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 32-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 32-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 
-+ movaps 32-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 32-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 12 -- movaps 48-120(pA10,mldab5,2), rA0 -- movaps 48-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 48-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 48-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 48-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 48-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 48-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 48-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 48-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 48-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 48-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 48-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 48-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 48-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 48-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 48-120(pA10,mldab5,2), rA0 -+ movaps 48-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 48-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 48-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 48-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 48-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 48-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 48-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 48-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 48-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 48-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 48-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 48-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps 
rA0, rC11 -+ movaps 48-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 48-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 16 -- movaps 64-120(pA10,mldab5,2), rA0 -- movaps 64-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 64-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 64-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 64-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 64-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 64-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 64-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 64-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 64-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 64-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 64-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 64-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 64-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 64-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 64-120(pA10,mldab5,2), rA0 -+ movaps 64-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 64-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 64-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 64-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 64-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 64-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 64-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 64-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 64-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 64-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 64-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 64-120(pA10,ldab), rA0 -+ mulps rB0, rA0 
-+ addps rA0, rC11 -+ movaps 64-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 64-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 20 -- movaps 80-120(pA10,mldab5,2), rA0 -- movaps 80-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 80-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 80-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 80-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 80-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 80-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 80-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 80-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 80-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 80-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 80-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 80-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 80-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 80-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 80-120(pA10,mldab5,2), rA0 -+ movaps 80-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 80-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 80-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 80-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 80-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 80-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 80-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 80-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 80-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 80-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 80-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 80-120(pA10,ldab), rA0 -+ mulps 
rB0, rA0 -+ addps rA0, rC11 -+ movaps 80-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 80-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 24 -- movaps 96-120(pA10,mldab5,2), rA0 -- movaps 96-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 96-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 96-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 96-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 96-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 96-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 96-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 96-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 96-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 96-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 96-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 96-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 96-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 96-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 96-120(pA10,mldab5,2), rA0 -+ movaps 96-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 96-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 96-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 96-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 96-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 96-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 96-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 96-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 96-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 96-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 96-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 96-120(pA10,ldab), rA0 
-+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 96-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 96-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 28 -- movaps 112-120(pA10,mldab5,2), rA0 -- movaps 112-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 112-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 112-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 112-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 112-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 112-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 112-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 112-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 112-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 112-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 112-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 112-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 112-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 112-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 112-120(pA10,mldab5,2), rA0 -+ movaps 112-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 112-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 112-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 112-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 112-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 112-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 112-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 112-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 112-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 112-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 112-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, 
rC10 -+ movaps 112-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 112-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 112-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 32 -- movaps 128-120(pA10,mldab5,2), rA0 -- movaps 128-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 128-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 128-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 128-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 128-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 128-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 128-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 128-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 128-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 128-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 128-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 128-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 128-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 128-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 128-120(pA10,mldab5,2), rA0 -+ movaps 128-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 128-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 128-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 128-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 128-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 128-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 128-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 128-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 128-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 128-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 
128-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 128-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 128-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 128-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 36 -- movaps 144-120(pA10,mldab5,2), rA0 -- movaps 144-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 144-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 144-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 144-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 144-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 144-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 144-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 144-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 144-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 144-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 144-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 144-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 144-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 144-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 144-120(pA10,mldab5,2), rA0 -+ movaps 144-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 144-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 144-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 144-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 144-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 144-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 144-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 144-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 144-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 144-120(pA5,ldab,4), rA0 -+ 
mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 144-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 144-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 144-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 144-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif -- prefB((pB,ldab)) -- prefB(64(pB,ldab)) -+ prefB((pB,ldab)) -+ prefB(64(pB,ldab)) - - #if KB > 40 -- movaps 160-120(pA10,mldab5,2), rA0 -- movaps 160-120(pB0), rB0 -- mulps rB0, rA0 -- addq $176, pB0 -- addps rA0, rC0 -- movaps 160-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 160-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 160-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 160-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 160-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 160-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 160-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 160-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 160-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 160-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 160-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 160-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addq $176, pA10 -- addps rA0, rC12 -- mulps 160-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -- addq $176, pA5 -+ movaps 160-120(pA10,mldab5,2), rA0 -+ movaps 160-120(pB0), rB0 -+ mulps rB0, rA0 -+ addq $176, pB0 -+ addps rA0, rC0 -+ movaps 160-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 160-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 160-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 160-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 160-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 160-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ 
addps rA0, rC6 -+ movaps 160-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 160-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 160-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 160-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 160-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 160-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addq $176, pA10 -+ addps rA0, rC12 -+ mulps 160-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 -+ addq $176, pA5 - #else -- addq $176, pB0 -- addq $176, pA10 -- addq $176, pA5 -+ addq $176, pB0 -+ addq $176, pA10 -+ addq $176, pA5 - #endif - - #if KB > 44 -- movaps 0-120(pA10,mldab5,2), rA0 -- movaps 0-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 0-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 0-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 0-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 0-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 0-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 0-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 0-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 0-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 0-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 0-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 0-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 0-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 0-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 0-120(pA10,mldab5,2), rA0 -+ movaps 0-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 0-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 0-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 0-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 0-120(pA5, mldab), rA0 -+ 
mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 0-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 0-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 0-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 0-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 0-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 0-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 0-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 0-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 0-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 48 -- movaps 16-120(pA10,mldab5,2), rA0 -- movaps 16-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 16-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 16-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 16-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 16-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 16-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 16-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 16-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 16-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 16-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 16-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 16-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 16-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 16-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 16-120(pA10,mldab5,2), rA0 -+ movaps 16-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 16-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 16-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 16-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 16-120(pA5, mldab), rA0 -+ 
mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 16-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 16-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 16-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 16-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 16-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 16-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 16-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 16-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 16-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 52 -- movaps 32-120(pA10,mldab5,2), rA0 -- movaps 32-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 32-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 32-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 32-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 32-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 32-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 32-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 32-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 32-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 32-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 32-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 32-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 32-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 32-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 32-120(pA10,mldab5,2), rA0 -+ movaps 32-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 32-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 32-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 32-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 32-120(pA5, 
mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 32-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 32-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 32-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 32-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 32-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 32-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 32-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 32-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 32-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 56 -- movaps 48-120(pA10,mldab5,2), rA0 -- movaps 48-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 48-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 48-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 48-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 48-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 48-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 48-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 48-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 48-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 48-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 48-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 48-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 48-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 48-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 48-120(pA10,mldab5,2), rA0 -+ movaps 48-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 48-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 48-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 48-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 
48-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 48-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 48-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 48-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 48-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 48-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 48-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 48-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 48-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 48-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 60 -- movaps 64-120(pA10,mldab5,2), rA0 -- movaps 64-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 64-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 64-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 64-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 64-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 64-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 64-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 64-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 64-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 64-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 64-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 64-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 64-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 64-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 64-120(pA10,mldab5,2), rA0 -+ movaps 64-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 64-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 64-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 64-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 
-+ movaps 64-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 64-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 64-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 64-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 64-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 64-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 64-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 64-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 64-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 64-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif -- prefB(128-176(pB,ldab)) -- prefB(192-176(pB,ldab)) -+ prefB(128-176(pB,ldab)) -+ prefB(192-176(pB,ldab)) - - #if KB > 64 -- movaps 80-120(pA10,mldab5,2), rA0 -- movaps 80-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 80-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 80-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 80-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 80-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 80-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 80-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 80-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 80-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 80-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 80-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 80-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 80-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 80-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 80-120(pA10,mldab5,2), rA0 -+ movaps 80-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 80-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 80-120(pA10, 
mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 80-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 80-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 80-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 80-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 80-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 80-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 80-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 80-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 80-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 80-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 80-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 68 -- movaps 96-120(pA10,mldab5,2), rA0 -- movaps 96-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 96-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 96-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 96-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 96-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 96-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 96-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 96-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 96-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 96-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 96-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 96-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 96-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 96-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 96-120(pA10,mldab5,2), rA0 -+ movaps 96-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 96-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 
96-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 96-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 96-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 96-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 96-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 96-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 96-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 96-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 96-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 96-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 96-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 96-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 72 -- movaps 112-120(pA10,mldab5,2), rA0 -- movaps 112-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 112-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 112-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 112-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 112-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 112-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 112-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 112-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 112-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 112-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 112-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 112-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 112-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 112-120(pA5,ldab,8), rB0 -- prefC((pC)) -- prefC((pC,incCn)) -- addps rB0, rC13 -+ movaps 112-120(pA10,mldab5,2), rA0 -+ movaps 112-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 
112-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 112-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 112-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 112-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 112-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 112-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 112-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 112-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 112-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 112-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 112-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 112-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 112-120(pA5,ldab,8), rB0 -+ prefC((pC)) -+ prefC((pC,incCn)) -+ addps rB0, rC13 - #else -- prefC((pC)) -- prefC((pC,incCn)) -+ prefC((pC)) -+ prefC((pC,incCn)) - #endif - - #if KB > 76 -- movaps 128-120(pA10,mldab5,2), rA0 -- movaps 128-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 128-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 128-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 128-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 128-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 128-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 128-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 128-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 128-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 128-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 128-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 128-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 128-120(pA10,ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC12 -- mulps 
128-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 128-120(pA10,mldab5,2), rA0 -+ movaps 128-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 128-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 128-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 128-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 128-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 128-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 128-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 128-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 128-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 128-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 128-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 128-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 128-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 128-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - #if KB > 80 -- movaps 144-120(pA10,mldab5,2), rA0 -- movaps 144-120(pB0), rB0 -- mulps rB0, rA0 -- addps rA0, rC0 -- movaps 144-120(pA5, mldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC1 -- movaps 144-120(pA10, mldab,8), rA0 -- mulps rB0, rA0 -- addps rA0, rC2 -- movaps 144-120(pA5, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC3 -- movaps 144-120(pA5, mldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC4 -- movaps 144-120(pA5), rA0 -- mulps rB0, rA0 -- addps rA0, rC5 -- movaps 144-120(pA5, ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC6 -- movaps 144-120(pA5, ldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC7 -- movaps 144-120(pA10, mldab,2), rA0 -- mulps rB0, rA0 -- addps rA0, rC8 -- movaps 144-120(pA5,ldab,4), rA0 -- mulps rB0, rA0 -- addps rA0, rC9 -- movaps 144-120(pA10), rA0 -- mulps rB0, rA0 -- addps rA0, rC10 -- movaps 144-120(pA10,ldab), rA0 -- mulps rB0, rA0 -- addps rA0, rC11 -- movaps 144-120(pA10,ldab,2), rA0 -- 
mulps rB0, rA0 -- addps rA0, rC12 -- mulps 144-120(pA5,ldab,8), rB0 -- addps rB0, rC13 -+ movaps 144-120(pA10,mldab5,2), rA0 -+ movaps 144-120(pB0), rB0 -+ mulps rB0, rA0 -+ addps rA0, rC0 -+ movaps 144-120(pA5, mldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC1 -+ movaps 144-120(pA10, mldab,8), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC2 -+ movaps 144-120(pA5, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC3 -+ movaps 144-120(pA5, mldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC4 -+ movaps 144-120(pA5), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC5 -+ movaps 144-120(pA5, ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC6 -+ movaps 144-120(pA5, ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC7 -+ movaps 144-120(pA10, mldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC8 -+ movaps 144-120(pA5,ldab,4), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC9 -+ movaps 144-120(pA10), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC10 -+ movaps 144-120(pA10,ldab), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC11 -+ movaps 144-120(pA10,ldab,2), rA0 -+ mulps rB0, rA0 -+ addps rA0, rC12 -+ mulps 144-120(pA5,ldab,8), rB0 -+ addps rB0, rC13 - #endif - - /*UKLOOP */ -@@ -2454,202 +2461,202 @@ MLAST: - * Get these bastard things summed up correctly - */ - -- /* rC0 = c0a c0b c0c c0d */ -- /* rC1 = c1a c1b c1c c1d */ -- /* rC2 = c2a c2b c2c c2d */ -- /* rC3 = c3a c3b c3c c3d */ -+ /* rC0 = c0a c0b c0c c0d */ -+ /* rC1 = c1a c1b c1c c1d */ -+ /* rC2 = c2a c2b c2c c2d */ -+ /* rC3 = c3a c3b c3c c3d */ - /* */ -- movaps rC2, rB0 /* rB0 = c2a c2b c2c c2d */ -- prefC(64(pC,incCn)) -- prefB(256-176(pB,ldab)) -- movaps rC0, rA0 /* rA0 = c0a c0b c0c c0d */ -- unpckhps rC3, rB0 /* rB0 = c2c c3c c2d c3d */ -- unpckhps rC1, rA0 /* rA0 = c0c c1c c0d c1d */ -- unpcklps rC3, rC2 /* rC2 = c2a c3a c2b c3b */ -- movlhps rB0, rC3 /* rC3 = c3a c3b c2c c3c */ -- unpcklps rC1, rC0 /* rC0 = c0a c1a c0b c1b */ -- movhlps rA0, rC3 /* rC3 = c0d c1d c2c c3c */ -- movlhps rC2, rA0 /* rA0 = c0c c1c c2a c3a */ -- movhlps rC0, rB0 /* rB0 = c0b c1b c2d c3d */ -- 
addps rA0, rC3 /* rC3 = c0cd c1cd c2ac c3ac */ -- movlhps rC0, rC1 /* rC1 = c1a c1b c0a c1a */ -- movhlps rC1, rC2 /* rC2 = c0a c1a c2b c3b */ -- movaps rC4, rA0 /* rA0 = c4a c4b c4c c4d */ -- addps rB0, rC2 /* rC2 = c0ab c1ab c2bd c3bd */ -- movaps rC6, rB0 /* rB0 = c6a c6b c6c c6d */ -- addps rC2, rC3 /* rC3 = c0abcd c1abcd c2bdac c3bdac */ -- -- -- /* rC4 = c4a c4b c4c c4d */ -- /* rC5 = c5a c5b c5c c5d */ -- /* rC6 = c6a c6b c6c c6d */ -- /* rC7 = c7a c7b c7c c7d */ -- /* rC8 = c08a c08b c08c c08d */ -- /* rC9 = c09a c09b c09c c09d */ -- /* rC10 = c10a c10b c10c c10d */ -- /* rC11 = c11a c11b c11c c11d */ -- /* rC12 = c12a c12b c12c c12d */ -- /* rC13 = c13a c13b c13c c13d */ -+ movaps rC2, rB0 /* rB0 = c2a c2b c2c c2d */ -+ prefC(64(pC,incCn)) -+ prefB(256-176(pB,ldab)) -+ movaps rC0, rA0 /* rA0 = c0a c0b c0c c0d */ -+ unpckhps rC3, rB0 /* rB0 = c2c c3c c2d c3d */ -+ unpckhps rC1, rA0 /* rA0 = c0c c1c c0d c1d */ -+ unpcklps rC3, rC2 /* rC2 = c2a c3a c2b c3b */ -+ movlhps rB0, rC3 /* rC3 = c3a c3b c2c c3c */ -+ unpcklps rC1, rC0 /* rC0 = c0a c1a c0b c1b */ -+ movhlps rA0, rC3 /* rC3 = c0d c1d c2c c3c */ -+ movlhps rC2, rA0 /* rA0 = c0c c1c c2a c3a */ -+ movhlps rC0, rB0 /* rB0 = c0b c1b c2d c3d */ -+ addps rA0, rC3 /* rC3 = c0cd c1cd c2ac c3ac */ -+ movlhps rC0, rC1 /* rC1 = c1a c1b c0a c1a */ -+ movhlps rC1, rC2 /* rC2 = c0a c1a c2b c3b */ -+ movaps rC4, rA0 /* rA0 = c4a c4b c4c c4d */ -+ addps rB0, rC2 /* rC2 = c0ab c1ab c2bd c3bd */ -+ movaps rC6, rB0 /* rB0 = c6a c6b c6c c6d */ -+ addps rC2, rC3 /* rC3 = c0abcd c1abcd c2bdac c3bdac */ -+ -+ -+ /* rC4 = c4a c4b c4c c4d */ -+ /* rC5 = c5a c5b c5c c5d */ -+ /* rC6 = c6a c6b c6c c6d */ -+ /* rC7 = c7a c7b c7c c7d */ -+ /* rC8 = c08a c08b c08c c08d */ -+ /* rC9 = c09a c09b c09c c09d */ -+ /* rC10 = c10a c10b c10c c10d */ -+ /* rC11 = c11a c11b c11c c11d */ -+ /* rC12 = c12a c12b c12c c12d */ -+ /* rC13 = c13a c13b c13c c13d */ - /* */ -- movaps rC10, rC0 /* rC0 = c10a c10b c10c c10d */ -- movaps rC8 , rC1 /* rC1 
= c08a c08b c08c c08d */ -- movaps rC12, rC2 /* rC2 = c12a c12b c12c c12d */ -- unpckhps rC7, rB0 /* rB0 = c6c c7c c6d c7d */ -- unpckhps rC5, rA0 /* rA0 = c4c c5c c4d c5d */ -- unpcklps rC7, rC6 /* rC6 = c6a c7a c6b c7b */ -- unpckhps rC11, rC0 /* rC0 = c10c c11c c10d c11d */ -- unpckhps rC9 , rC1 /* rC1 = c08c c09c c08d c09d */ -- movlhps rB0, rC7 /* rC7 = c7a c7b c6c c7c */ -- unpcklps rC5, rC4 /* rC4 = c4a c5a c4b c5b */ -- movhlps rA0, rC7 /* rC7 = c4d c5d c6c c7c */ -- movlhps rC6, rA0 /* rA0 = c4c c5c c6a c7a */ -- unpcklps rC11, rC10 /* rC10 = c10a c11a c10b c11b */ -- movhlps rC4, rB0 /* rB0 = c4b c5b c6d c7d */ -- movlhps rC0, rC11 /* rC11 = c11a c11b c10c c11c */ -- addps rA0, rC7 /* rC7 = c4cd c5cd c6ac c7ac */ -+ movaps rC10, rC0 /* rC0 = c10a c10b c10c c10d */ -+ movaps rC8 , rC1 /* rC1 = c08a c08b c08c c08d */ -+ movaps rC12, rC2 /* rC2 = c12a c12b c12c c12d */ -+ unpckhps rC7, rB0 /* rB0 = c6c c7c c6d c7d */ -+ unpckhps rC5, rA0 /* rA0 = c4c c5c c4d c5d */ -+ unpcklps rC7, rC6 /* rC6 = c6a c7a c6b c7b */ -+ unpckhps rC11, rC0 /* rC0 = c10c c11c c10d c11d */ -+ unpckhps rC9 , rC1 /* rC1 = c08c c09c c08d c09d */ -+ movlhps rB0, rC7 /* rC7 = c7a c7b c6c c7c */ -+ unpcklps rC5, rC4 /* rC4 = c4a c5a c4b c5b */ -+ movhlps rA0, rC7 /* rC7 = c4d c5d c6c c7c */ -+ movlhps rC6, rA0 /* rA0 = c4c c5c c6a c7a */ -+ unpcklps rC11, rC10 /* rC10 = c10a c11a c10b c11b */ -+ movhlps rC4, rB0 /* rB0 = c4b c5b c6d c7d */ -+ movlhps rC0, rC11 /* rC11 = c11a c11b c10c c11c */ -+ addps rA0, rC7 /* rC7 = c4cd c5cd c6ac c7ac */ - #ifdef BETAX - #ifdef SREAL -- movups (pC), rA0 -- movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */ -- movups 16(pC), rC4 -- unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */ -- movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */ -- movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */ -- movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */ -- movups 32(pC), rC5 -- movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */ -- unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b 
*/ -- addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */ -- movlps 48(pC), rC1 -- addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */ -- movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */ -- unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */ -- movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */ -- addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */ -- mulps BOF(%rsp), rA0 -- addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */ -- mulps BOF(%rsp), rC4 -- addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */ -- mulps BOF(%rsp), rC5 -- addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */ -- mulps BOF(%rsp), rC1 -+ movups (pC), rA0 -+ movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */ -+ movups 16(pC), rC4 -+ unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */ -+ movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */ -+ movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */ -+ movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */ -+ movups 32(pC), rC5 -+ movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */ -+ unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */ -+ addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */ -+ movlps 48(pC), rC1 -+ addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */ -+ movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */ -+ unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */ -+ movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */ -+ addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */ -+ mulps BOF(%rsp), rA0 -+ addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */ -+ mulps BOF(%rsp), rC4 -+ addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */ -+ mulps BOF(%rsp), rC5 -+ addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */ -+ mulps BOF(%rsp), rC1 - - /* */ - -- movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */ -- addps rA0, rC3 -- addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */ -- addps rC4, rC7 -- addps rC5, rC11 -- prefB(320-176(pB,ldab)) -- addps rC1, rC12 -+ movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */ -+ addps rA0, rC3 -+ addps rC13, rC12 /* rC12 = 
c12abcd c13abcd X X */ -+ addps rC4, rC7 -+ addps rC5, rC11 -+ prefB(320-176(pB,ldab)) -+ addps rC1, rC12 - #else /* BETA = X, complex type */ -- movups (pC), rA0 -- movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */ -- movups 16(pC), rC4 -- unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */ -- shufps $0x88, rC4, rA0 /* rA0 = c0 c1 c2 c3 */ -- movups 32(pC), rC4 /* rC4 = c4 X c5 X */ -- movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */ -- movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */ -- movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */ -- movups 48(pC), rC5 /* rC5 = c6 X c7 X */ -- movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */ -- unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */ -- addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */ -- shufps $0x88, rC5, rC4 /* rC4 = c4 c5 c6 c7 */ -- movups 64(pC), rC5 /* rC5 = c8 X c9 X */ -- movups 80(pC), rC1 /* rC1 = c10 X c11 X */ -- addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */ -- shufps $0x88, rC1, rC5 /* rC5 = c8 c9 c10 c11 */ -- movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */ -- movss 96(pC), rC1 -- unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */ -- movss 104(pC), rB0 -- movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */ -- unpcklps rB0, rC1 -- addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */ -- mulps BOF(%rsp), rA0 -- addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */ -- mulps BOF(%rsp), rC4 -- addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */ -- mulps BOF(%rsp), rC5 -- addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */ -- mulps BOF(%rsp), rC1 -+ movups (pC), rA0 -+ movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */ -+ movups 16(pC), rC4 -+ unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */ -+ shufps $0x88, rC4, rA0 /* rA0 = c0 c1 c2 c3 */ -+ movups 32(pC), rC4 /* rC4 = c4 X c5 X */ -+ movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */ -+ movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */ -+ movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */ -+ movups 48(pC), rC5 /* rC5 = c6 X c7 X */ -+ movhlps rC8 , 
rC0 /* rC0 = c08b c09b c10d c11d */ -+ unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */ -+ addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */ -+ shufps $0x88, rC5, rC4 /* rC4 = c4 c5 c6 c7 */ -+ movups 64(pC), rC5 /* rC5 = c8 X c9 X */ -+ movups 80(pC), rC1 /* rC1 = c10 X c11 X */ -+ addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */ -+ shufps $0x88, rC1, rC5 /* rC5 = c8 c9 c10 c11 */ -+ movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */ -+ movss 96(pC), rC1 -+ unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */ -+ movss 104(pC), rB0 -+ movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */ -+ unpcklps rB0, rC1 -+ addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */ -+ mulps BOF(%rsp), rA0 -+ addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */ -+ mulps BOF(%rsp), rC4 -+ addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */ -+ mulps BOF(%rsp), rC5 -+ addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */ -+ mulps BOF(%rsp), rC1 - - /* */ - -- movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */ -- addps rA0, rC3 -- addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */ -- addps rC4, rC7 -- addps rC5, rC11 -- prefB(320-176(pB,ldab)) -- addps rC1, rC12 -+ movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */ -+ addps rA0, rC3 -+ addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */ -+ addps rC4, rC7 -+ addps rC5, rC11 -+ prefB(320-176(pB,ldab)) -+ addps rC1, rC12 - #endif - - #else -- movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */ -- unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */ -- movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */ -- movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */ -- movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */ -- movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */ -- unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */ -- addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */ -- addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */ -- movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */ -- unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */ -- movhlps rC9 , rC10 /* rC10 = 
c08a c09a c10b c11b */ -- addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */ -- addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */ -- addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */ -- addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */ -+ movlhps rC4, rC5 /* rC5 = c5a c5b c4a c5a */ -+ unpcklps rC9 , rC8 /* rC8 = c08a c09a c08b c09b */ -+ movhlps rC1, rC11 /* rC11 = c08d c09d c10c c11c */ -+ movlhps rC10, rC1 /* rC1 = c08c c09c c10a c11a */ -+ movhlps rC5, rC6 /* rC6 = c4a c5a c6b c7b */ -+ movhlps rC8 , rC0 /* rC0 = c08b c09b c10d c11d */ -+ unpcklps rC13, rC2 /* rC2 = c12a c13a c12b c13b */ -+ addps rC1, rC11 /* rC11 = c08cd c09cd c10ac c11ac */ -+ addps rB0, rC6 /* rC6 = c4ab c5ab c6bd c7bd */ -+ movlhps rC8 , rC9 /* rC9 = c09a c09b c08a c09a */ -+ unpckhps rC13, rC12 /* rC12 = c12c c13c c12d c13d */ -+ movhlps rC9 , rC10 /* rC10 = c08a c09a c10b c11b */ -+ addps rC6, rC7 /* rC7 = c4abcd c5abcd c6bdac c7bdac */ -+ addps rC0, rC10 /* rC10 = c08ab c09ab c10bd c11bd */ -+ addps rC2, rC12 /* rC12 = c12ac c13ac c12bd c13bd */ -+ addps rC10, rC11 /* rC11 = c08abcd c09abcd c10bdac c11bdac */ - - /* */ - -- movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */ -- prefB(320-176(pB,ldab)) -- addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */ -+ movhlps rC12, rC13 /* rC13 = c12bd c13bd X X */ -+ prefB(320-176(pB,ldab)) -+ addps rC13, rC12 /* rC12 = c12abcd c13abcd X X */ - - #endif - /* - * Write results back to C; pC += 14; - */ - #ifdef SREAL -- movups rC3, (pC) -- movups rC7, 16(pC) -- movups rC11, 32(pC) -- movlps rC12, 48(pC) --/* addq $56, pC */ -+ movups rC3, (pC) -+ movups rC7, 16(pC) -+ movups rC11, 32(pC) -+ movlps rC12, 48(pC) -+/* addq $56, pC */ - #else -- movss rC3, (pC) -- movss rC7, 32(pC) -- movhlps rC3, rC0 -- movhlps rC7, rC6 -- movss rC0, 16(pC) -- movss rC6, 48(pC) -- shufps $0x55, rC3, rC3 -- shufps $0x55, rC7, rC7 -- movss rC3, 8(pC) -- movss rC7, 40(pC) -- shufps $0x55, rC0, rC0 -- shufps $0x55, rC6, rC6 -- movss rC0, 24(pC) -- movss 
rC6, 56(pC) -- -- movss rC11, 64(pC) -- movhlps rC11, rC2 -- movss rC12, 96(pC) -- movss rC2, 80(pC) -- shufps $0x55, rC11, rC11 -- shufps $0x55, rC12, rC12 -- movss rC11, 72(pC) -- shufps $0x55, rC2, rC2 -- movss rC12, 104(pC) -- movss rC2, 88(pC) -+ movss rC3, (pC) -+ movss rC7, 32(pC) -+ movhlps rC3, rC0 -+ movhlps rC7, rC6 -+ movss rC0, 16(pC) -+ movss rC6, 48(pC) -+ shufps $0x55, rC3, rC3 -+ shufps $0x55, rC7, rC7 -+ movss rC3, 8(pC) -+ movss rC7, 40(pC) -+ shufps $0x55, rC0, rC0 -+ shufps $0x55, rC6, rC6 -+ movss rC0, 24(pC) -+ movss rC6, 56(pC) -+ -+ movss rC11, 64(pC) -+ movhlps rC11, rC2 -+ movss rC12, 96(pC) -+ movss rC2, 80(pC) -+ shufps $0x55, rC11, rC11 -+ shufps $0x55, rC12, rC12 -+ movss rC11, 72(pC) -+ shufps $0x55, rC2, rC2 -+ movss rC12, 104(pC) -+ movss rC2, 88(pC) - --/* addq $112, pC */ -+/* addq $112, pC */ - #endif - /* - * Write results back to C -@@ -2660,55 +2667,55 @@ MLAST: - /* - * while (pA != stM); - */ --/* subq $1, stM */ --/* jne UMLOOP */ -+/* subq $1, stM */ -+/* jne UMLOOP */ - /* - * pC += 14; pA += 14*NB; pB -= NB; - */ --/* subq $MBKBso-NB14so+176, pA5 */ --/* subq $MBKBso-NB14so+176, pA10 */ -- subq incAm, pA5 -- subq incAm, pA10 -- addq $NBso-176, pB0 -+/* subq $MBKBso-NB14so+176, pA5 */ -+/* subq $MBKBso-NB14so+176, pA10 */ -+ subq incAm, pA5 -+ subq incAm, pA10 -+ addq $NBso-176, pB0 - /* - * while (pA != stM); - */ --/* subq $1, stM */ --/* jne UMLOOP */ -+/* subq $1, stM */ -+/* jne UMLOOP */ - /* - * pC += incCn; pA -= NBNB; pB += NB; - */ -- addq incCn, pC -+ addq incCn, pC - /* - * while (pB != stN); - */ -- sub $1, stN -- jne UNLOOP -+ sub $1, stN -+ jne UNLOOP - - /* - * Restore callee-saved iregs - */ - DONE: -- movq -8(%rsp), %rbp -- movq -16(%rsp), %rbx -+ movq -8(%rsp), %rbp -+ movq -16(%rsp), %rbx - #if MB == 0 -- movq -32(%rsp), %r12 -- movq -40(%rsp), %r13 -+ movq -32(%rsp), %r12 -+ movq -40(%rsp), %r13 - #endif -- ret -+ ret - #if MB == 0 - MB_LT84: -- cmp $70, stM -- jne MB_LT70 --/* movq $70/14, stM */ -- 
movq $5, stM
-- jmp MBFOUND
-+ cmp $70, stM
-+ jne MB_LT70
-+/* movq $70/14, stM */
-+ movq $5, stM
-+ jmp MBFOUND
- MB_LT70:
-- cmp $56, stM
-- jne MB_LT56
--/* movq $56/14, stM */
-- movq $4, stM
-- jmp MBFOUND
-+ cmp $56, stM
-+ jne MB_LT56
-+/* movq $56/14, stM */
-+ movq $4, stM
-+ jmp MBFOUND
- MB_LT56:
- cmp $42, stM
- jne MB_LT42
-diff -rupN ATLAS/tune/blas/level1/scalsrch.c atlas-3.8.3/tune/blas/level1/scalsrch.c
---- ATLAS/tune/blas/level1/scalsrch.c 2009-02-18 19:48:25.000000000 +0100
-+++ atlas-3.8.3/tune/blas/level1/scalsrch.c 2009-11-12 13:45:48.141174024 +0100
-@@ -747,7 +747,7 @@ void GenMainRout(char pre, int n, int *i
- /*
- * Handle all special alpha cases
- */
-- fprintf(fpout, "%sif ( SCALAR_IS_ZERO(alpha) )\n", spc);
-+ /* fprintf(fpout, "%sif ( SCALAR_IS_ZERO(alpha) )\n", spc);
- fprintf(fpout, "%s{\n", spc);
- if (pre == 'c' || pre == 'z')
- {
-@@ -756,7 +756,7 @@ void GenMainRout(char pre, int n, int *i
- }
- else fprintf(fpout, "%s Mjoin(PATL,set)(N, ATL_rzero, X, incx);\n", spc);
- fprintf(fpout, "%s return;\n", spc);
-- fprintf(fpout, "%s}\n", spc);
-+ fprintf(fpout, "%s}\n", spc); */
- GenAlphCase(pre, spc, fpout, 1, n, ix, iy, ia, ib);
- GenAlphCase(pre, spc, fpout, -1, n, ix, iy, ia, ib);
- if (pre == 'c' || pre == 'z')
diff --git a/libraries/atlas/slack-desc b/libraries/atlas/slack-desc
deleted file mode 100644
index 73ea6b801bcf..000000000000
--- a/libraries/atlas/slack-desc
+++ /dev/null
@@ -1,19 +0,0 @@
-# HOW TO EDIT THIS FILE:
-# The "handy ruler" below makes it easier to edit a package description.
-# Line up the first '|' above the ':' following the base package name, and
-# the '|' on the right side marks the last column you can put a character in.
-# You must make exactly 11 lines for the formatting to be correct. It's also
-# customary to leave one space after the ':' except on otherwise blank lines.
-
-     |-----handy-ruler------------------------------------------------------|
-atlas: atlas (Automatically Tuned Linear Algebra Software)
-atlas:
-atlas: This is ATLAS (Automatically Tuned Linear Algebra Software), an
-atlas: ongoing research effort focusing on applying empirical techniques in
-atlas: order to provide portable performance. At present, it provides C and
-atlas: Fortran77 interfaces to a portably efficient BLAS implementation as
-atlas: well as a few routines from LAPACK. Nevertheless, the default setting
-atlas: for Slackware is to allow for a full LAPACK to get build and installed
-atlas: along with ATLAS.
-atlas:
-atlas: Homepage: http://math-atlas.sourceforge.net/