sorting - (python) Implementing C insort in python -


so i'm using insort bisect insert strings sorted list. it's somehow faster if use built in one. **by faster mean, on average twice fast (1 millisecond vs 2 millisecond on 10,000 word list). run through unix cmd on bigger list 1 have below through:

time script.py < wordlist.txt 

i'm thinking has c, don't understand how or why. want make mine fast, without using built in.

here straight bisect source-code:

def insort_left(a, x, lo=0, hi=none): """insert item x in list a, , keep sorted assuming sorted.  if x in a, insert left of leftmost x.  optional args lo (default 0) , hi (default len(a)) bound slice of searched. """  if lo < 0:     raise valueerror('lo must non-negative') if hi none:     hi = len(a) while lo < hi:     mid = (lo+hi)//2     if a[mid] < x: lo = mid+1     else: hi = mid a.insert(lo, x) 

bisect source code

this part think makes different:

# overwrite above definitions fast c implementation try:     _bisect import * except importerror:     pass 

here list of input:

#wordlist = ["hello", "jupiter", "albacore", "shrimp", "axe", "china", "lance", "peter", "sam", "uncle", "jeniffer", "alex", "fer", "joseph"] 

some code make work:

sorted_list = [] line in wordlist:     insort_left(sorted_list, line) print(sorted_list) 

so concern implementing c based insort in python without using modules. how can this?

python executing on vm never fast native code in case this.

the reason the python vm has more work execute python code. c code compiled , executed directly. while c extension code still needs access python vm/runtime has additional benefit can perform operations directly through exposed c api. there rpython project can eliminate of additional runtime execution overhead of python code due typing, has restrictions (no pun).

instead of trying make python code "faster" c (which not happen same algorithm), make python code "smarter" c.

in case, algorithm used has poor complexity of ~o(n^2) effectilvely nested loops - 1 loop reading line, 1 loop in nested insort call (for insert, @ least). complexity can , should improved ~o(n lg n) , make difference in performance n above (relatively small) value.

assume lines list containing all lines in file, consider following approaches. do not re-run these sort proposals inside loop!

  1. lines.sort() - built-in list.sort, hybrid sort , has better bounds above use of repeated calls insort. possibly cheating because still uses "native c" implementation provided python. in case, should absolutely crush insort implementation more couple [dozen] lines.

  2. largesort(lines) - largesort "pure python" implementation of merge sort or comb sort. for large enough n (across unsorted data) faster insort c code. due additional constant performance overhead of executing python code, tests need conducted find out value of n starts dominate.

  3. smallsort(lines) - smallsort "pure python" insertion sort implementation. may better really small lists , can performed online/streaming insort approach. performance tests need done, may faster (a "pure python") insort in particular case.

in case, precise constant factor overheads , datasets being sorted relevant. suggest looking @ (i.e. graphing) performance on several approaches , anticipated data including worst-case scenarios. such pre-splitting data may (relatively) benefit 1 approach on or yield misleading/non-ideal gains. possible consideration timings ("benchmarks") of entire python program may dominated overhead not associated sorting mechanism employed.


Comments

Popular posts from this blog

html - How to style widget with post count different than without post count -

How to remove text and logo OR add Overflow on Android ActionBar using AppCompat on API 8? -

IIS->Tomcat Redirect: multiple worker with default -