sorting - (python) Implementing C insort in python -
so i'm using insort bisect insert strings sorted list. it's somehow faster if use built in one. **by faster mean, on average twice fast (1 millisecond vs 2 millisecond on 10,000 word list). run through unix cmd on bigger list 1 have below through:
time script.py < wordlist.txt
i'm thinking has c, don't understand how or why. want make mine fast, without using built in.
here straight bisect source-code:
def insort_left(a, x, lo=0, hi=none): """insert item x in list a, , keep sorted assuming sorted. if x in a, insert left of leftmost x. optional args lo (default 0) , hi (default len(a)) bound slice of searched. """ if lo < 0: raise valueerror('lo must non-negative') if hi none: hi = len(a) while lo < hi: mid = (lo+hi)//2 if a[mid] < x: lo = mid+1 else: hi = mid a.insert(lo, x)
this part think makes different:
# overwrite above definitions fast c implementation try: _bisect import * except importerror: pass
here list of input:
#wordlist = ["hello", "jupiter", "albacore", "shrimp", "axe", "china", "lance", "peter", "sam", "uncle", "jeniffer", "alex", "fer", "joseph"]
some code make work:
sorted_list = [] line in wordlist: insort_left(sorted_list, line) print(sorted_list)
so concern implementing c based insort in python without using modules. how can this?
python executing on vm never fast native code in case this.
the reason the python vm has more work execute python code. c code compiled , executed directly. while c extension code still needs access python vm/runtime has additional benefit can perform operations directly through exposed c api. there rpython project can eliminate of additional runtime execution overhead of python code due typing, has restrictions (no pun).
instead of trying make python code "faster" c (which not happen same algorithm), make python code "smarter" c.
in case, algorithm used has poor complexity of ~o(n^2)
effectilvely nested loops - 1 loop reading line, 1 loop in nested insort
call (for insert, @ least). complexity can , should improved ~o(n lg n)
, make difference in performance n
above (relatively small) value.
assume lines
list containing all lines in file, consider following approaches. do not re-run these sort proposals inside loop!
lines.sort()
- built-in list.sort, hybrid sort , has better bounds above use of repeated callsinsort
. possibly cheating because still uses "native c" implementation provided python. in case, should absolutely crushinsort
implementation more couple [dozen] lines.largesort(lines)
-largesort
"pure python" implementation of merge sort or comb sort. for large enoughn
(across unsorted data) fasterinsort
c code. due additional constant performance overhead of executing python code, tests need conducted find out value ofn
starts dominate.smallsort(lines)
-smallsort
"pure python" insertion sort implementation. may better really small lists , can performed online/streaminginsort
approach. performance tests need done, may faster (a "pure python") insort in particular case.
in case, precise constant factor overheads , datasets being sorted relevant. suggest looking @ (i.e. graphing) performance on several approaches , anticipated data including worst-case scenarios. such pre-splitting data may (relatively) benefit 1 approach on or yield misleading/non-ideal gains. possible consideration timings ("benchmarks") of entire python program may dominated overhead not associated sorting mechanism employed.
Comments
Post a Comment