Shell Sort (for 16-Bit Elements) by
Fredrik Ramsberg
Here's my implementation of Shell Sort, a sort algorithm with rather amazing properties. All mathematicians that have tried to analyze this algorithm have failed, but there's lots of empirical data that suggests it's roughly O(n log n). I wish I could say I invented this algorithm, but I didn't.
This implementation can sort more than 32,000 16-bit values
(The only reason it can't sort 32,767 values is that there still has to be room
for the routine and a few bytes for temporary storage). There's presently no
other sorting routine in the repository that can handle more than 256 values.
; This is source code for what is meant to be an efficient implementation
; of ShellSort in 6502 assembler. The source code is in a format suitable for
; the excellent and free ACME cross-assembler by Marco Baye, but should be
; easy to convert for other assemblers.
;
; While ShellSort is very good for entirely unsorted arrays, it is also
; reasonably good for almost sorted arrays. However, if you happen to know
; that very few values are out of place OR that the values that are out
; of place are not very far from their right position, InsertionSort is a
; better choice. InsertionSort is also provided in this file, since
; ShellSort is really just a clever extension of InsertionSort.
; Here are some examples of sort times @ 1MHz (10000 values):
;
; InsertionSort ShellSort
; Array is entirely sorted from the start: 1,9s 14,7s
; 1 value is at the wrong end of the array: 2,9s 15,7s
; 10 values are at the wrong end of the array: 11,8s 16,9s
; 50 values are at the wrong end of the array: 51,3s 17,6s
; Array is entirely unsorted: 2464,2s 30,5s
!to "shellsrt.o" ; An assembler directive to set out-file
!sl "shelllbl.a" ; Tells the assembler to write all label
; values to a file
*=$1000 ; Start address. Can safely be set to
; anything from $0100 to $fe00
j=$fb ; Uses two bytes. Has to be on zero-page
j_plus_h=$fd ; Uses two bytes. Has to be on zero-page
arr_length = j_plus_h ; Can safely use the same location as
; j_plus_h, but doesn't have to be on ZP
; To increase speed and reduce code size, you can optionally place one or more
; of these 2-byte fields on zero-page (the suggested values work on a C64):
; v_plus_1 = $5
; h = $7
; arr_start = $A
; Some simple tests using an array of 10000 completely unsorted values showed
; a 5.6% shorter execution time if all three fields were placed on ZP, with
; v_plus_1 being a little more important than the others.
;
; To go even further, placing these 2-byte fields on zero-page will provide a
; small improvement:
; v
; i
; arr_end
; This routine mimics this ShellSort implementation (Language=JavaScript):
; function ShellSort(arr,length) {
; var j, i, v, h, k;
; for (h=1; h < length; h=3*h+1);
; while (h=(h-1)/3)
; for (i=h, j=i, v=arr[i]; i<=length; arr[j+h]=v, i++, j=i, v=arr[i])
; while((j-=h) >= 0 && arr[j] > v)
; arr[j+h]=arr[j];
; }
; To call the routine, create a word-array at address nnnn in memory. The
; first word should contain the number of bytes to be sorted (= 2 * the number
; of elements). Then come all those elements. Then sort the elements using
; ShellShort like this:
; lda #<nnnn, ldx #>nnnn, jsr shell_sort
; or to perform an InsertionSort:
; lda #<nnnn, ldx #>nnnn, jsr insertion_sort
; ( < means lowbyte and > means highbyte )
shell_sort ldy #h_high - h_low - 1
bne sort_main ; Always branch
insertion_sort ldy #0
sort_main sty h_start_index
cld
sta j
sta in_address
clc
adc #2
sta arr_start
stx j + 1
stx in_address + 1
txa
adc #0
sta arr_start + 1
ldy #0
lda (j),y
sta arr_length
clc
adc arr_start
sta arr_end
iny
lda (j),y
sta arr_length + 1
adc arr_start + 1
sta arr_end + 1
; for (h=1; h < length; h=3*h+1);
ldx h_start_index ; Start with highest value of h
chk_prev_h lda h_low,x
cmp arr_length
lda h_high,x
sbc arr_length + 1
bcc end_of_init ; If h < array_length, we've found the right h
dex
bpl chk_prev_h
rts ; array length is 0 or 1. No sorting needed.
end_of_init inx
stx h_index
; while (h=(h-1)/3)
h_loop dec h_index
bpl get_h
rts ; All done!
get_h ldy h_index
lda h_low,y
sta h
clc
adc in_address ; ( in_address is arr_start - 2)
sta i
lda h_high,y
sta h + 1
adc in_address + 1
sta i + 1
; for (i=h, j=i, v=arr[i]; i<=length; arr[j+h]=v, i++, j=i, v=arr[i])
i_loop lda i
clc
adc #2
sta i
sta j
lda i + 1
adc #0
sta i + 1
sta j + 1
ldx i
cpx arr_end
lda i + 1
sbc arr_end + 1
bcs h_loop
ldy #0
lda (j),y
sta v
clc
adc #1
sta v_plus_1
iny
lda (j),y
sta v + 1
adc #0
bcs i_loop ; v=$ffff, so no j-loop necessary
sta v_plus_1 + 1
dey ; Set y=0
; while((j-=h) >= 0 && arr[j] > v)
j_loop lda j
sta j_plus_h
sec
sbc h
sta j
tax
lda j + 1
sta j_plus_h + 1
sbc h + 1
sta j + 1
; Check if we've reached the bottom of the array
bcc exit_j_loop
cpx arr_start
sbc arr_start + 1
bcc exit_j_loop
; Do the actual comparison: arr[j] > v
lda (j),y
tax
iny ; Set y=1
lda (j),y
cpx v_plus_1
sbc v_plus_1 + 1
bcc exit_j_loop
; arr[j+h]=arr[j];
lda (j),y
sta (j_plus_h),y
dey ; Set y=0
txa
sta (j_plus_h),y
bcs j_loop ; Always branch
; for (i=h, j=i, v=arr[i]; i<length; arr[j+h]=v, i++, j=i, v=arr[i]) *** arr[j+h]=v part
exit_j_loop lda v
ldy #0
sta (j_plus_h),y
iny
lda v + 1
sta (j_plus_h),y
jmp i_loop
; This describes the sequence h(0)=1; h(n)=k*h(n-1)+1 for k=3 (1,4,13,40...)
; All word-values are muliplied by 2, since we are sorting 2-byte values
h_low !byte <2, <8, <26, <80, <242, <728, <2186, <6560, <19682
h_high !byte >2, >8, >26, >80, >242, >728, >2186, >6560, >19682
h_start_index !byte 0
h_index !byte 0
h !word 0
in_address !word 0
arr_start !word 0
arr_end !word 0
i !word 0
v !word 0
v_plus_1 !word 0
Last page update: Jan 14, 2005.