One difficulty of using regular expressions effectively is that each language has its own little idiosyncrasies that can get you into trouble.
Ruby has a particularly nasty Regexp foible around start-of-string and end-of-string anchors that confuses many of us, my previous self included.
In most languages, the ^ and $ special characters match the start and end of a string, respectively. You can also opt in to multi-line mode, where they instead match the start and end of a line.
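In Ruby, these characters always behave as if multi-line mode were on: ^ and $ match at the start and end of every line, and there is no flag to turn that off. This is not unknown. In fact, the Rails documentation states this loud and clear: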
Note: use \A and \Z to match the start and end of the string, ^ and $ match the start/end of a line.
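To see the difference outside of Rails, here is a minimal sketch in plain Ruby. The sample text and variable names are made up for illustration; only the anchors matter.

```ruby
text = "first line\nsecond line"

# ^ and $ anchor at line boundaries, so each line can match on its own.
line_anchored = text.scan(/^\w+ line$/)

# \A and \Z anchor at the string boundaries, so only a match spanning
# the whole string counts. Here there is none.
string_anchored = text.scan(/\A\w+ line\Z/)

puts line_anchored.inspect   # ["first line", "second line"]
puts string_anchored.inspect # []
```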
Unfortunately, I still see this advice ignored in many codebases. Let's use an ActiveRecord validation example to see why this is so dangerous.
I present to you a recipes web app that lets users create and share dishes. Each dish has a name, which can be a string of alphanumeric characters. The model might look like this:
    class Dish < ActiveRecord::Base
      validates :name, format: /^[\sa-z0-9]+$/i
    end

You may think this Regexp says:
- start of string
- one or more spaces or alphas or numerics
- end of string
- case insensitive
In Perl you would be correct. In Ruby, not so much. Let's give this a test drive and see if we can break it:
Dish.new(name: "Best Pizza Evar").valid? # => true
So far so good, but that $ only matches to the end of a line. This also works:
Dish.new(name: "Best Pizza Evar\n just kidding").valid? # => true
That's not good. Users can bypass our validation by inserting arbitrary \n characters followed by whatever else they want. You know, like this:
Dish.new(name: "Best Pizza Evar\n<script>alert('pwned!');</script>").valid? # => true
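If you want to poke at this without booting a Rails app, the same pattern can be checked against plain strings. This is only a sketch: it assumes the format validation boils down to a Regexp match on the value, and the names list (including the last, single-line entry) is made up for comparison.

```ruby
# Same pattern as the Dish validation above.
line_anchored = /^[\sa-z0-9]+$/i

names = [
  "Best Pizza Evar",
  "Best Pizza Evar\n just kidding",
  "Best Pizza Evar\n<script>alert('pwned!');</script>",
  "<script>alert('pwned!');</script>"
]

# One complete line of allowed characters is enough for a match,
# no matter what the other lines contain.
results = names.map { |name| line_anchored.match?(name) }

puts results.inspect # [true, true, true, false]
```

The last name is rejected only because it has no clean line at all. Put one in front of the payload and the pattern is satisfied.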
Thankfully, modern versions of Rails will auto-escape the dish name on its way out of the database before sending it down the wire, but we still don't want it in our database for obvious reasons.
This gotcha can be easily avoided if we just follow the note in the Rails docs and use \A and \Z:

    class Dish < ActiveRecord::Base
      validates :name, format: /\A[\sa-z0-9]+\Z/i
    end
Now let's see if the same hack will subvert our validations:
Dish.new(name: "Best Pizza Evar\n<script>alert('pwned!');</script>").valid? # => false
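The fixed pattern can be checked the same way in plain Ruby. Two caveats come with this sketch. First, \s itself matches a newline, so a multi-line name made up only of allowed characters still passes. Second, \Z tolerates one trailing newline, whereas lowercase \z does not. The stricter patterns below, which swap \s for a literal space, are my own hypothetical variants to show that difference and are not part of the model above.

```ruby
string_anchored = /\A[\sa-z0-9]+\Z/i

payload = "Best Pizza Evar\n<script>alert('pwned!');</script>"

plain_ok   = string_anchored.match?("Best Pizza Evar")                # true
payload_ok = string_anchored.match?(payload)                          # false
newline_ok = string_anchored.match?("Best Pizza Evar\n just kidding") # true, \s matches \n

# Hypothetical stricter variants: a literal space instead of \s.
upper_z = /\A[ a-z0-9]+\Z/i
lower_z = /\A[ a-z0-9]+\z/i
trailing = "Best Pizza Evar\n"

upper_z_ok = upper_z.match?(trailing) # true, \Z allows a final newline
lower_z_ok = lower_z.match?(trailing) # false, \z is the very end of the string

puts [plain_ok, payload_ok, newline_ok, upper_z_ok, lower_z_ok].inspect
```

Whether a trailing newline matters depends on what you do with the name afterwards. If it does, \z is the safer end anchor.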