\newpage
\begin{center}
{\Large\bf VI}
\vspace{3mm}
{\large\bf SOME PERFORMANCE DATA}
\end{center}
\hugeskip\noindent
{\large\bf \S6.1\hs{3mm}The C8 Benchmark}
\bigskip
The logic {\bf C} can be defined axiomatically in many ways. To
illustrate the difference between pre-defined logics, logics built up
from pre-defined axioms, and logics given by user-defined axioms, we give
the runtimes on a Sun 386i with 8 Mb of memory for the problem of
generating {\bf C$_+$} matrices of sizes up to $8\times 8$\@.
\smallskip
The first definition is as the negation-free fragment of {\bf FD} plus
user-defined axioms
\begin{tabbing}
mmmm\=axiom\kill
\> $ p\C q\Cdot q\C r\Cdot p\C r $ \\
\> $ q\C r\Cdot p\C q\Cdot p\C r $ \\
\> $ p\Cdot p\C q\CC r $ \end{tabbing}
This is a ludicrously cumbersome axiomatisation for matrix-generating
purposes. Because we did not want to wait for ever, we generated the
matrices only up to $6\times 6$ in this way. Just to show that we could have
done worse, we also note (definition 2) what happens if the second axiom
is re-written
\begin{tabbing}
mmmm\=axiom\kill
\> $ p\C q\Cdot r\C p\Cdot r\C q $ \end{tabbing}
In some of the $6\times 6$ cases the program gets stuck finding footling
failures of the fusion postulates and the like instead of getting around
to the simple affixing fact that if $0\leq 1$ then $5\C 0\leq 5\C 1$\@.
Consequently the execution time blows out.\footnote{This bad behaviour is
somewhat ameliorated in release 1.1\@. See the final paragraph below.} The
moral is to make use of the postulates hard-programmed into MaGIC rather
than of user-defined axioms.
\smallskip
Another definition of {\bf C} is as {\bf TW} plus pre-defined axiom 15
\begin{tabbing}
mmmm\=axiom\kill
\> $ p\Cdot p\C q\CC q $ \end{tabbing}
This is definition 3. An equivalent in context, rather better for the
purposes of matrix generation, is the pair of pre-defined axioms 9 and 10:
\begin{tabbing}
mmmm\=axiom\kill
\> $ p\Cdot {\rm t}\C p $ \\
\> $ {\rm t}\C p\CC p $ \end{tabbing}
Definition 4, then, is {\bf TW} plus these two axioms. Definition 5 is
the pre-defined logic {\bf C} {\em simpliciter}. The following are
elapsed times for these five axiomatisations. They are the best achieved
in several (at least three) repetitions of each run.
\hugeskip\begin{center}
{\bf Table 1: C$_+$ matrices of sizes up to 8 (times in seconds)} \\[5mm]
\begin{tabular}{r|rcrcrcrcr}
Size & 4 \ \ & & 5 \ \ & & 6 \ \ & & 7 \ \ & & 8 \ \ \\\hline\\
Matrices & 16 \ & & 70 \ & & 399 \ & & 2261 & & 14358 \\\\
{\bf Def 1} & 1.52 & & 13.72 & & 491.62 & & & & \\\\
{\bf Def 2} & 1.66 & & 14.96 & & 1319.40 & & & & \\\\
{\bf Def 3} & 0.44 & & 2.72 & & 18.59 & & 531.98 & & \\\\
{\bf Def 4} & 0.41 & & 3.12 & & 18.04 & & 139.18 & & 1066.36 \\\\
{\bf Def 5} & 0.38 & & 2.68 & & 13.72 & & 90.54 & & 898.50
\end{tabular} \end{center}
\hugeskip
These times are subject to errors due to other activity on the computer
used, and especially to traffic on the network and the file server. This
renders the short times (under 5 sec) rather unreliable, though we
did repeat those trials between five and ten times in the hope of
removing the worst discrepancies. The longer times are much firmer.
Note the exponential growth rate in times taken to complete the
progressively larger tasks. Note also, however, that the number of
matrices to be found also grows exponentially with the number of
elements, so the increase in time per matrix found may not be so bad.
To illustrate, here are the times taken per matrix generated for
the last three definitions
\hugeskip\begin{center}
{\bf Table 2: C8, milliseconds per matrix} \\[5mm]
\begin{tabular}{r|cccccccccc}
Size & \hs{2mm} & 4 & \hs{2mm} & 5 & \hs{2mm} & 6 & \hs{2mm} & 7 & \hs{2mm} & 8 \\\hline\\
{\bf Def 3} & & 27 & & 39 & & 47 & & 235 \\\\
{\bf Def 4} & & 26 & & 46 & & 45 & & 61 & & 74 \\\\
{\bf Def 5} & & 24 & & 38 & & 34 & & 40 & & 62
\end{tabular} \end{center}
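Table 2 follows from Table 1 by straightforward arithmetic: multiply the
elapsed seconds by 1000 and divide by the number of matrices found at that
size. Here is a quick check, written as a Python sketch; discrepancies of
a millisecond against Table 2 are just differences of rounding.

```python
# Derive the per-matrix figures of Table 2 from the data in Table 1.
matrices = {4: 16, 5: 70, 6: 399, 7: 2261, 8: 14358}  # good matrices found
times = {  # elapsed seconds, from Table 1
    "Def 3": {4: 0.44, 5: 2.72, 6: 18.59, 7: 531.98},
    "Def 4": {4: 0.41, 5: 3.12, 6: 18.04, 7: 139.18, 8: 1066.36},
    "Def 5": {4: 0.38, 5: 2.68, 6: 13.72, 7: 90.54, 8: 898.50},
}
per_matrix = {
    d: {n: round(1000 * t / matrices[n]) for n, t in row.items()}
    for d, row in times.items()
}
print(per_matrix["Def 3"])  # {4: 28, 5: 39, 6: 47, 7: 235}
```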
\newpage\noindent
{\large\bf \S6.2\hs{3mm}The R10 and R12 Benchmarks}
\bigskip
As well as indicating the performance of MaGIC and the perils of doing it
yourself we should give some indication of the effect of parallelisation.
We present elapsed times in seconds on an 8-processor Sequent Symmetry
with 16Mb of memory for the problems of generating first
all the {\bf R} matrices up to size $10\!\times\!10$ and then up to size
$12\!\times\!12$\@. The latter is a problem of significant difficulty:
to the best of our knowledge the only programs anywhere to have completed
it have been those using the transferred refutations algorithm. {\bf R}
is a fairly strong logic, so models validating it are quite rare. Many
of the grains of work in this problem are therefore rather small. The
{\bf R10} problem involves generating 742 matrices, in the course of
which 7554 bad ones are tested and 108 isomorphic copies rejected. For
the {\bf R12} problem there are 4940 good matrices to be found, 54810 bad
ones to be tested and 1109 isomorphs to be omitted. No file output was
requested, and only summary TTY output, so the figures really are the
time taken to generate the matrices.
\hugeskip\begin{center}
{\bf Table 3: R matrices up to size 10 and size 12}\\[4mm]
\begin{tabular}{r|rrrrrr}
Slaves & 1 \ \ & 2 \ \ & 3 \ \ & 4 \ \ & 5 \ \ & 6 \ \ \\\hline\\
{\bf R10} time & 123.60 & 62.30 & 41.78 & 31.68 & 26.83 & 23.25 \\
speedup & 1.000 & 1.984 & 2.958 & 3.902 & 4.607 & 5.316 \\\\
{\bf R12} time & 1237.6 & 622.8 & 422.0 & 328.0 & 268.4 & 230.0\\
speedup & 1.000 & 1.987 & 2.933 & 3.773 & 4.611 & 5.380
\end{tabular} \end{center}
\hugeskip
These are times on a Sequent Symmetry with 8 processors. The maximum we
devoted to this job for timing purposes was 7 of the 8, since there are
always other things happening on the machine---for example, running the
DYNIX operating system---which occupy part of the time of one processor.
It will be seen that the {\bf R12} problem takes almost exactly 10 times
as long as the {\bf R10} problem and that the degree of parallelism
achieved in the two cases is about the same. There is a little fall-off
in proportion of possible speedup attained as the number of processes
gets to 6 and to 7. It is hoped that by buffering the slaves' transfer
of results to the master this can be improved, though we have yet to
implement such an idea.
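The speedup rows of Table 3 are simply the one-slave elapsed time divided
by the $n$-slave time, rounded to three places; for the {\bf R10} row:

```python
# Recompute the R10 speedup row of Table 3 from the elapsed times.
r10 = [123.60, 62.30, 41.78, 31.68, 26.83, 23.25]  # seconds, 1..6 slaves
speedup = [round(r10[0] / t, 3) for t in r10]
print(speedup)  # [1.0, 1.984, 2.958, 3.902, 4.607, 5.316]
```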
\newpage\noindent
{\large\bf \S6.3\hs{3mm}The PW5 Benchmark}
\bigskip
This is a rather different problem: to generate all the {\bf TW} pure
implication matrices of sizes up to 5\@.\footnote{The pure implication
fragment of {\bf TW} is, for historical reasons, always known in the
literature as {\bf P$\!-\!$W}, hence the title.} There are far more such
matrices than are generated in the {\bf R12} problem, 47333 in fact.
Since only 40720 bad ones are tested along the way, the search ratio is
much better---more than half those tested are good. Consequently,
competition among slave processes for access to the heap is higher. 7439
isomorphic copies are also generated and ignored. Once again this is a
seriously challenging problem. We attempted to benchmark \cite{MGT} with
it but that program was still thrashing after more than 24 cpu hours on a
Sun 3/80 and would have taken close to 50 hours to complete the job. One
feature of the {\bf PW5} benchmark problem is the enormous variation in
grain size. Some pieces of work yield over 2000 good matrices while
others yield few or none, the pattern of occurrence of large sub-problems
being quite unpredictable. A major source of inefficiency introduced by
problems of this kind is that the number of good matrices found can
easily exceed the available heap of matrix blanks used by the slaves to
communicate their findings to the master. When this happens the slaves
hang in monitor DELAY queues until some garbage collection has taken
place. It is possible to mitigate this effect by fine-tuning MaGIC. If
the maximum matrix size is set to 5 instead of the usual 16, the matrix
blanks are much smaller, so more of them can be accommodated on the heap.
The runs were thus made twice, once with MaGIC given its normal
parameters and once with an expanded heap of 8000 blanks in place of the
default 3500.
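The heap-of-blanks bottleneck just described can be sketched as a bounded
pool on which slaves block until the master recycles a blank. The
following Python sketch is purely illustrative: MaGIC is not written in
Python, and the class and parameter names (\verb|BlankPool|,
\verb|n_blanks|, \verb|cells|) are invented for this example.

```python
import queue

class BlankPool:
    """Illustrative bounded pool of matrix blanks shared by the master
    and slaves; a stand-in for the monitor-protected heap described
    above (names are invented, not MaGIC's actual identifiers)."""

    def __init__(self, n_blanks=3500, cells=16 * 16):
        # Default: 3500 blanks of 16x16 cells. Capping the matrix size
        # at 5 shrinks each blank to 25 cells, which is why 8000 blanks
        # can be accommodated in the "tuned" runs.
        self._free = queue.Queue()
        for _ in range(n_blanks):
            self._free.put(bytearray(cells))

    def acquire(self):
        # A slave with a result to report blocks here (cf. hanging in a
        # monitor DELAY queue) until a blank is free.
        return self._free.get()

    def release(self, blank):
        # The master returns the blank once the result is consumed.
        self._free.put(blank)
```

When good matrices arrive faster than the master can consume them, every
blank is in flight and \verb|acquire| blocks; enlarging the pool, or
shrinking the blanks, postpones that stall.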
\hugeskip\begin{center}
{\bf Table 4: PW matrices up to size 5} \\[4mm]
\begin{tabular}{r|rrrrrr}
Slaves & 1 \ \ & 2 \ \ & 3 \ \ & 4 \ \ & 5 \ \ & 6 \ \ \\\hline\\
Normal (sec) & 323.4 & 166.0 & 108.5 & 86.85 & 71.97 & 65.96 \\
speedup & 1.000 & 1.948 & 2.980 & 3.724 & 4.494 & 4.901 \\\\
Tuned (sec) & 349.6 & 176.0 & 117.5 & 88.89 & 71.16 & 59.51 \\
speedup & 1.000 & 1.986 & 2.975 & 3.933 & 4.913 & 5.875
\end{tabular} \end{center}
\hugeskip
These results are indeed curious. The first effect of giving the program
more shared structures for communication is to slow it down! However,
much better parallelism is achieved after fine tuning. The better
parallelism we can explain; the slowdown has us baffled. The effect is
not due to some one-off timing discrepancy, for we have repeated it many
times. Whatever the explanation of that, however, the speedup in the
``tuned'' case is close to linear over 7 processors, which is very
encouraging.
\smallskip
All of the above benchmarking was done with MaGIC 1.0 and 1.0P\@.
Release 1.1 certainly outperforms them on all three problems. In particular,
it works with better definitions of strong logics like {\bf C} and {\bf R}
so that it runs over 15\% faster on those benchmarks, and also incorporates
a much more efficient test for user-defined axioms which greatly reduces the
difference between different selections of such axioms in the {\bf C8}
benchmark. We have still to carry out a thorough comparison of the
performance of the two releases, but here is a brief report on the {\bf R10}
benchmark.
\hugeskip\begin{center}
{\bf Table 5: R10 comparing 1.0P with 1.1P}\\[4mm]
\begin{tabular}{r|rrrrrr}
Slaves & 1 \ \ & 2 \ \ & 3 \ \ & 4 \ \ & 5 \ \ & 6 \ \ \\\hline\\
{\bf 1.0} time & 123.60 & 62.30 & 41.78 & 31.68 & 26.83 & 23.25 \\
speedup & 1.000 & 1.984 & 2.958 & 3.902 & 4.607 & 5.316 \\\\
{\bf 1.1} time & 96.78 & 51.99 & 35.00 & 26.53 & 21.93 & 19.22 \\
speedup & 1.000 & 1.862 & 2.765 & 3.648 & 4.413 & 5.035 \\
\% improvement & 27.7 & 19.8 & 19.4 & 19.4 & 22.3 & 21.0
\end{tabular} \end{center}
\hugeskip
The times for MaGIC 1.1 are the best recorded in five trial runs.
``Improvement'' is defined as the percentage increase in speed as MaGIC 1.1
replaces 1.0\@. The recorded degree of parallelism is lower; this may
be due partly to traffic on the Sequent Symmetry and partly to a slightly
high figure for the time taken by MaGIC 1.0 with just one slave. It may be
interesting to note that for this job, with summary output so that IO time
is negligible, the best we have recorded with a purely sequential MaGIC 1.1
is 96.47 seconds. The best with all 8 processors of the Sequent Symmetry
(that is, with a master and 7 slaves) is 16.65 seconds, giving a speedup of
5.813.
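Since speed is inversely proportional to elapsed time, the improvement row
of Table 5 comes out as $100\,(t_{1.0}/t_{1.1}-1)$; as a check:

```python
# Recompute the "% improvement" row of Table 5 from the elapsed times.
t_10 = [123.60, 62.30, 41.78, 31.68, 26.83, 23.25]  # MaGIC 1.0, 1..6 slaves
t_11 = [96.78, 51.99, 35.00, 26.53, 21.93, 19.22]   # MaGIC 1.1
gain = [round(100 * (a / b - 1), 1) for a, b in zip(t_10, t_11)]
print(gain)  # [27.7, 19.8, 19.4, 19.4, 22.3, 21.0]
```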