The quest for the perfect trim function in Perl
2013-10-17 10:49:39
People have been debating whether there are geek girls, and what causes the lack of them; one of the points raised was a lack of will to pursue boring algorithmic questions. Funny, that.
Oh, what I wanted to talk about was one of the most boring, simple, everyday things in programming: trimming leading and trailing whitespace from strings (removing it from the beginning and the end, that's it) in the Perl language. At this point all non-programmers and people who do not speak Perl or regular expressions are relieved from duty; go and get some food.
For those who remained: as you may well know, there is no "trim" in Perl; you usually either use a module (Text::Trim comes to mind) or actually regexp it yourself. Most people have their favourite.
But they may not be the best ones!
Perlmonks (who are the pros in everything Perl, most of the time) have had a go at it several times, but I wasn't quite satisfied with the depth of the analysis. I need the perfect trim!
So, following the Monk mindset, I gathered a few ideas, my usual one among them, and threw them at two testing modules, namely Benchmark and Test::More: the first to measure speed and the second to check correctness. Actually, it went the other way around: first I threw out those which were buggy, then I benchmarked the rest.
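For the curious, a harness along those lines could look roughly like the sketch below. It is not the exact script behind the numbers in this post: the test strings are made up, only two of the candidates are included, and the string that gets trimmed in each benchmark iteration is an assumption.

```perl
use strict;
use warnings;
use Test::More;
use Benchmark qw(timethese);

# Candidate snippets, each operating on a variable named $str.
# Only two are shown here; the keys are just labels.
my %candidates = (
    hellish2 => '$str =~ s/^\s+//; $str =~ s/\s+$//;',
    split    => '$str =~ s/^\s+|\s+$//g;',
);

# Hypothetical test cases: input => expected result.
my %cases = (
    "  foo bar  " => "foo bar",
    "\t foo\n"    => "foo",
    "foo"         => "foo",
    "   "         => "",
    ""            => "",
    " a\n b "     => "a\n b",    # inner whitespace must survive
);

# Step 1: correctness. Compile each snippet into a sub and run every case.
my %passed;
for my $name (sort keys %candidates) {
    my $trim = eval "sub { my \$str = shift; $candidates{$name} return \$str; }"
        or die "cannot compile $name: $@";
    my $ok = 1;
    for my $in (sort keys %cases) {
        $ok = 0 unless is($trim->($in), $cases{$in}, "$name trims correctly");
    }
    $passed{$name} = $candidates{$name} if $ok;
}
done_testing();

# Step 2: speed. Benchmark only the candidates that passed.
# Each iteration starts from a fresh, untrimmed string (an assumed sample).
my %timed = map { $_ => 'my $str = "  some text \t\n"; ' . $passed{$_} }
            keys %passed;

# -1 means "run each for at least one CPU second"; the runs in this post
# used a fixed count of 10_000_000 instead.
timethese(-1, \%timed);
```

Passing the snippets to Benchmark as strings keeps them identical to what was tested; the price is that the assignment to $str is timed together with the trimming, so absolute numbers from this sketch would not be comparable to the ones below.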
So here are those that did not fail the tests:
grin1    => '$str =~ s/^\s*(.*?)\s*$/$1/;',
mre      => '$str =~ s/^\s*((?:.*\S)?)\s*$/$1/;',
silly    => '$str =~ s/^\s+//; $str=reverse $str; $str =~ s/^\s+//; $str=reverse $str;',
hellish  => '$str =~ s/^\s*//; $str =~ s/\s*$//;',
hellish2 => '$str =~ s/^\s+//; $str =~ s/\s+$//;',
split    => '$str =~ s/^\s+|\s+$//g;',
te_tri   => '$str =~ s/\A\s+//; $str =~ s/\s+\z//;',
The first one is what I usually used, and the rest were suggested in various places. The benchmark said:
mre: 13.8181 wallclock secs (13.67 usr + 0.01 sys = 13.68 CPU) @ 730994.15/s (n=10000000)
grin1: 11.2415 wallclock secs (10.81 usr + 0.04 sys = 10.85 CPU) @ 921658.99/s (n=10000000)
hellish: 6.81456 wallclock secs ( 6.50 usr + 0.01 sys = 6.51 CPU) @ 1536098.31/s (n=10000000)
silly: 4.13783 wallclock secs ( 4.14 usr + 0.00 sys = 4.14 CPU) @ 2415458.94/s (n=10000000)
hellish2: 1.69636 wallclock secs ( 1.70 usr + 0.00 sys = 1.70 CPU) @ 5882352.94/s (n=10000000)
te_tri: 1.70796 wallclock secs ( 1.70 usr + 0.00 sys = 1.70 CPU) @ 5882352.94/s (n=10000000)
split: 1.03471 wallclock secs ( 1.04 usr + 0.00 sys = 1.04 CPU) @ 9615384.62/s (n=10000000)
That is quite an interesting result. My version looked quite good until I removed the ones that were slower AND buggy. Now it is not that fast anymore.
One important point: this is perl v5.18.1, and it seems that optimisations in perl itself matter.
That is probably behind the biggest surprise: the hellish code used to be orders of magnitude slower than anything else, because of the many failed match attempts at the end of the string, and now it ran quite well. The silly method was fast too, which shows how slow matching at the end still is.
But to my greatest surprise the winner was the split method, a single alternation applied repeatedly with /g. It was faster than the method used in the Text::Trim module (te_tri above, the fancy, fixed-up version of the hellish2 regexp pair), and not by a small margin: about 1.6 times as fast (1.04 versus 1.70 CPU seconds). And, oh, ten times faster than my original. :-)
So now I know what to use. :-)
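Wrapped up for everyday use, the winner could look something like this. The sub name and the choice to return a trimmed copy rather than modify the argument in place are my own assumptions, not something the benchmark dictates, and there is no handling of undef.

```perl
use strict;
use warnings;

# Hypothetical helper built on the "split" regexp from the benchmark.
# Returns a trimmed copy and leaves the caller's string untouched.
sub trim {
    my ($str) = @_;
    $str =~ s/^\s+|\s+$//g;    # strip leading and trailing whitespace
    return $str;
}

print '[', trim("  hello world \n"), "]\n";    # prints [hello world]
```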