Fastest Way To Deduplicate Contiguous Characters In String - Python
Solution 1:
itertools is a good library to try:
>>> t = "THHHISSS ISSS BBBBSSSSS"
>>> import itertools
>>> ''.join(char for char, _ in itertools.groupby(t))
'THIS IS BS'
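Note that groupby collapses every run of repeated characters. If you only want to collapse runs of one specific character, matching the deduplicate(s, c) signature discussed in Solution 2, a small variant (my sketch, not from the original answer) works:

```python
import itertools

def dedupe_char(s, c):
    # Keep every run as-is, except runs of c, which collapse to one copy.
    return ''.join(ch if ch == c else ch * len(list(group))
                   for ch, group in itertools.groupby(s))

print(dedupe_char('THHHISSS  ISSS', ' '))  # runs of 'H'/'S' survive, spaces collapse
```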
Solution 2:
First of all, your deduplicate function is actually really fast. But there are some improvements that can make it even faster. I have lambda-ized your function and called it org_deduplicate (below). Now for some time tests (using IPython's %timeit):
s = 'this is   an   irritating string with  random   spacing    .'  # note the runs of spaces
org_deduplicate = lambda s,c: c.join([substring for substring in s.strip().split(c) if substring])
%timeit org_deduplicate(s,' ')
100000 loops, best of 3: 3.59 µs per loop
but the strip really isn't necessary and may even give you unexpected results (if you are not deduplicating whitespace), so we can try:
org_deduplicate2 = lambda s,c: c.join(substring for substring in s.split(c) if substring)
%timeit org_deduplicate2(s,' ')
100000 loops, best of 3: 3.4 µs per loop
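As an aside, here is a made-up example of the surprise the strip() can cause when the character you are deduplicating is not whitespace:

```python
org_deduplicate = lambda s, c: c.join([sub for sub in s.strip().split(c) if sub])
org_deduplicate2 = lambda s, c: c.join(sub for sub in s.split(c) if sub)

# strip() removes leading/trailing *whitespace* regardless of c, so when
# deduplicating '.', the surrounding spaces silently vanish:
print(org_deduplicate('  a..b  ', '.'))   # -> 'a.b'
print(org_deduplicate2('  a..b  ', '.'))  # -> '  a.b  '
```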
which speeds things up by a tiny bit, but it's not all that impressive. Let's try a different approach... regular expressions. These are also nice because they give you the flexibility to choose any regular expression as your "character" to deduplicate (not just a single char):
import re
re_deduplicate = lambda s,c: re.sub(r'(%s)(?:\1)+' % c, r'\g<1>', s)
re_deduplicate2 = lambda s,c: c.join(re.split('%s+'%c,s))
%timeit re_deduplicate(s,' ')
100000 loops, best of 3: 13.8 µs per loop
%timeit re_deduplicate2(s,' ')
100000 loops, best of 3: 6.47 µs per loop
The second one is faster, but neither is even close to your original function. It looks like regular string operations are quicker than re functions. What if we try zipping instead (use itertools.izip if working with Python 2):
# note: zip(s, s[1:]) stops one short, so re-attach the final character
zip_deduplicate = lambda s,c: ''.join(s1 for s1,s2 in zip(s,s[1:]) if s1!=c or s1!=s2) + s[-1]
%timeit zip_deduplicate(s,' ')
100000 loops, best of 3: 12.9 µs per loop
Still no improvement. The zip method makes too many substrings, which makes doing ''.join slow. OK, one more try... what about str.replace called recursively:
def rec_deduplicate(s,c):
    if s.find(c*2) != -1:
        return rec_deduplicate(s.replace(c*2,c), c)
    return s

%timeit rec_deduplicate(s,' ')
100000 loops, best of 3: 2.83 µs per loop
Not bad, that seems to be our winner. But just to be sure, let's try it against our original function with a really long input string:
s2 = s*100000
%timeit rec_deduplicate(s2,' ')
10 loops, best of 3: 64.6 ms per loop
%timeit org_deduplicate(s2,' ')
1 loop, best of 3: 209 ms per loop
Yup, it looks like it scales nicely. But let's try one more test: the recursive deduplicator only collapses pairs of duplicate chars on each pass, so does it still do better with long runs of duplicate chars:
s3 = 'this is    an    irritating      string with     random spacing     .'  # same sentence, long runs of spaces

%timeit rec_deduplicate(s3,' ')
100000 loops, best of 3: 9.93 µs per loop

%timeit org_deduplicate(s3,' ')
100000 loops, best of 3: 8.99 µs per loop
It does lose some of its advantage when there are long strings of repeated characters to remove.
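Relatedly, since each call only halves the runs of c, very long runs mean deeper recursion. An iterative rewrite (my sketch, not part of the original answer) behaves identically with no recursion-depth concerns:

```python
def iter_deduplicate(s, c):
    # Same pair-replacement idea as rec_deduplicate, but as a loop:
    # each pass halves every run of c, so passes ~ log2(longest run).
    while c * 2 in s:
        s = s.replace(c * 2, c)
    return s
```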
In summary, use your original function (with a few tweaks maybe) if your strings will have long substrings of repeating characters. Otherwise, the recursive version is fastest.
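If you want to rerun the comparison yourself, a minimal harness along these lines works (the test string is made up, and timings will differ by machine and Python version):

```python
import timeit

org_deduplicate = lambda s, c: c.join(sub for sub in s.split(c) if sub)

def rec_deduplicate(s, c):
    if s.find(c * 2) != -1:
        return rec_deduplicate(s.replace(c * 2, c), c)
    return s

# A sample string with runs of spaces (made up; your data will differ).
s = 'this is    an   irritating string with  random   spacing    .'

for name, fn in [('org_deduplicate', org_deduplicate),
                 ('rec_deduplicate', rec_deduplicate)]:
    secs = timeit.timeit(lambda: fn(s, ' '), number=100_000)
    print(f'{name}: {secs:.3f}s for 100,000 calls')
```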