OPTIMIZATIONS


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164

#1
(defun mysl (s)
    (declare (simple-string s))
    (declare (optimize (speed 3) (safety 0) (debug 0)))
    (let ((c 0))
      (declare (fixnum c))
      (dotimes (i (length s))
        (when (eql (aref s i) #\1)
          (incf c)))
      c))

* On X86 I is represented as a tagged integer.

* EQL uses "CMP reg,reg" instead of "CMP reg,im". This causes
  allocation of an extra register and an extra move.

* Unnecessary move:
  3: SLOT S!11[EDX] {SB-C::VECTOR-LENGTH 1 7} => t23[EAX]
  4: MOVE t23[EAX] => t24[EBX]

--------------------------------------------------------------------------------
#2
(defun quux (v)
  (declare (optimize (speed 3) (safety 0) (space 2) (debug 0)))
  (declare (type (simple-array double-float 1) v))
  (let ((s 0d0))
    (declare (type double-float s))
    (dotimes (i (length v))
      (setq s (+ s (aref v i))))
    s))

* Python does not combine + with AREF, so generates extra move and
  allocates a register.

* On X86 Python thinks that all FP registers are directly accessible
  and emits costy MOVE ... => FR1.

--------------------------------------------------------------------------------
#3
(defun bar (n)
  (declare (optimize (speed 3) (safety 0) (space 2))
           (type fixnum n))
  (let ((v (make-list n)))
    (setq v (make-array n))
    (length v)))

* IR1 does not optimize away (MAKE-LIST N).
--------------------------------------------------------------------------------
#4
(defun bar (v1 v2)
  (declare (optimize (speed 3) (safety 0) (space 2))
           (type (simple-array base-char 1) v1 v2))
  (dotimes (i (length v1))
    (setf (aref v2 i) (aref v1 i))))

VOP DATA-VECTOR-SET/SIMPLE-STRING V2!14[EDI] t32[EAX] t30[S2]>t33[CL]
                                  => t34[S2]<t35[AL] 
        MOV     #<TN t33[CL]>, #<TN t30[S2]>
        MOV     BYTE PTR [EDI+EAX+1], #<TN t33[CL]>
        MOV     #<TN t35[AL]>, #<TN t33[CL]>
        MOV     #<TN t34[S2]>, #<TN t35[AL]>

* The value of DATA-VECTOR-SET is not used, so there is no need in the
  last two moves.

* And why two moves?
--------------------------------------------------------------------------------
#5
(loop repeat 1.5)

uses generic arithmetic
--------------------------------------------------------------------------------
#6
09:49:05 <jtra> I have found a case in those where suboptimal code is
  generate with nested loops, it might be moderately easy to fix that
09:49:28 <jtra> see
  http://www.bagley.org/~doug/shootout/bench/nestedloop/nestedloop.cmucl
09:50:30 <jtra> if you add declarations to dotimes, generated code is
  almost optimal, but most inner loops run out of registers and use
  memory location for iteration variable

;;; -*- mode: lisp -*-
;;; http://www.bagley.org/~doug/shootout/
;;; from Friedrich Dominicus

(defun main ()
  (let ((n (parse-integer (or (car (last extensions:*command-line-strings*)) "1")))
        (x 0))
    (declare (fixnum n)
             (fixnum x)
             (optimize (speed 3) (debug 0) (safety 0)))
    (dotimes (a n)
      (dotimes (b n)
        (dotimes (c n)
          (dotimes (d n)
            (dotimes (e n)
              (dotimes (f n)
                (incf x)))))))
   (format t "~A~%" x)))
--------------------------------------------------------------------------------
#7
(defun foo (x)
  (declare (optimize speed (debug 0)))
  (if (< x 0) x (foo (1- x))))

SBCL generates a full call of FOO (but CMUCL does not).
--------------------------------------------------------------------------------
#8
(defun foo (d)
  (declare (optimize (speed 3) (safety 0) (debug 0)))
  (declare (type (double-float 0d0 1d0) d))
  (loop for i fixnum from 1 to 5
     for x1 double-float = (sin d) ;;; !!!
     do (loop for j fixnum from 1 to 4
             sum x1 double-float)))

Without the marked declaration Python will use boxed representation for X1.

This is equivalent to

(let ((x nil))
  (setq x 0d0)
  ;; use of X as DOUBLE-FLOAT
)

The initial binding is effectless, and without it X is of type
DOUBLE-FLOAT. Unhopefully, IR1 does not optimize away effectless
SETs/bindings, and IR2 does not perform type inference.
--------------------------------------------------------------------------------
#9 "Multi-path constant folding"
(defun foo (x)
  (if (= (cond ((irgh x) 0)
               ((buh x) 1)
               (t 2))
         0)
      :yes
      :no))

This code could be optimized to

(defun foo (x)
  (cond ((irgh x) :yes)
        ((buh x) :no)
        (t :no)))
--------------------------------------------------------------------------------
#11
(inverted variant of #9)

(lambda (x)
  (let ((y (sap-alien x c-string)))
    (list (alien-sap y)
          (alien-sap y))))

It could be optimized to

(lambda (x) (list x x))

(if Y were used only once, the current compiler would optimize it)
--------------------------------------------------------------------------------
#12
(typep (truly-the (simple-array * (*)) x) 'simple-vector)

tests lowtag.
--------------------------------------------------------------------------------