One problem that I needed to solve while making CodeTalker was fast tokenization — so I benchmarked several implementations of matching the WHITE and ID tokens.
Here’s the python implementation, so you understand what’s going on:
def white(at, st, ln): i = at while i<ln and st[i] == ' ': i+=1 return i - at def id(at, st, ln): i = at if i < ln and ('a' <= st[i] <= 'z' or 'A' <= st[i] <= 'Z' or st[i] == '_'): i += 1 while i < ln and ('a' <= st[i] <= 'z' or 'A' <= st[i] <= 'Z' or st[i] == '_' or '0' <= st[i] <= '9'): i += 1 return i - at
The pypy is of course no different, and for cython I only changed the first two lines of each. e.g.
cpdef white(int at, char* st, int ln): cdef int i = at
I also checked Regex to see if that would be any faster.
Anyway, the raw results are below, but here’s a summary:
cython wins by about a factor of 10 each time. pypy is about par w/ regex, which is 10x faster than straight python.
I’m using most recent stable release of pypy: 1.3
pypy 2.82211303711e-05 at 0 4.7474861145e-05 at 2 3.19180488586e-05 at 500 5.06615638733e-06 at 1007 5.14602661133e-06 at 1014 for IDS 0.00018682718277 at 0 7.82380104065e-05 at 10 2.23803520203e-06 at 1025 2.27212905884e-06 at 1028 cpython 0.000272939920425 at 0 0.000285755872726 at 2 0.000156256914139 at 500 6.81185722351e-06 at 1007 2.60829925537e-07 at 1014 for IDS 0.00100137805939 at 0 0.000964962005615 at 10 8.36133956909e-07 at 1025 3.35931777954e-07 at 1028 cython 9.29117202759e-07 at 0 9.21964645386e-07 at 2 5.56945800781e-07 at 500 1.21116638184e-07 at 1007 1.23023986816e-07 at 1014 for IDS 1.86841487885e-05 at 0 8.15391540527e-06 at 10 1.25169754028e-07 at 1025 1.23023986816e-07 at 1028 cpython + regex 2.03487873077e-05 at 0 3.93490791321e-05 at 2 1.27689838409e-05 at 500 1.77097320557e-06 at 1007 1.10411643982e-06 at 1014 for IDS 1.30689144135e-05 at 0 1.26049518585e-05 at 10 1.98006629944e-06 at 1025 9.75847244263e-07 at 1028 pypy + regex 0.000184435129166 at 0 4.27179336548e-05 at 2 5.02660274506e-05 at 500 3.67810726166e-05 at 1007 1.60229206085e-05 at 1014 for IDS 5.36160469055e-05 at 0 7.2783946991e-05 at 10 1.04148387909e-05 at 1025 1.33678913116e-05 at 1028