Friday, November 2, 2007

Simple Filename Wildcarding

I was writing a command-line application for a client the other day, and I found myself wanting to support syntax like this:

 

foobar *.foo *.bar foo*bar

 

which is to say, "run foobar against files that end with .foo, end with .bar, or that start with foo and end with bar." This is a fairly familiar thing for command-line apps to do.

 

One way to approach this is to interpret things like "*.foo" as regular expressions. I don't particularly like this solution, since a) it makes people learn regular expressions to use my app effectively, and b) the common patterns like *.foo don't really work like you want them to - "*.foo" isn't even a valid regular expression in .NET, because the leading asterisk has nothing to repeat, and the nearest valid equivalent, ".*.foo", means "anything that contains at least one character followed by 'foo'" rather than "anything that ends with '.foo'".

 

Another way to deal with this is to use Directory.GetFiles. If you read the documentation, however, there are two issues. The first is that the caveats called out in the notes make the behavior pretty weird. For example, "*1*.txt" matches "LongFileNameHere.txt", because the pattern is also checked against the file's 8.3 short name (something like LONGFI~1.TXT). Ick. The other issue is that this approach doesn't allow multiple patterns, so I would have to scan the directory multiple times and merge the results to do the "*.foo *.bar foo*bar" example above.

 

I hit on what I think is a fairly elegant solution, and I wanted to share it with you. The basic idea is to use string.Split, string.Join, and Regex.Escape to create a regular expression that supports exactly one "operator": the asterisk, which matches zero or more of any character. Here's the code:

// Requires: using System; using System.Collections.Generic;
//           using System.Text.RegularExpressions;
internal static bool IsMatch(string name, string pattern)
{
    // Short-circuit common cases
    if (string.IsNullOrEmpty(pattern)) {
        return false;
    }
    if (string.IsNullOrEmpty(name)) {
        return false;
    }
    if (pattern.Equals("*")) {
        return true;
    }
    if (pattern.Equals(name, StringComparison.CurrentCultureIgnoreCase)) {
        return true;
    }

    // Bust the pattern apart at the asterisks, escaping each
    // piece so regex metacharacters like '.' are taken literally
    string[] parts = pattern.Split('*');
    List<string> patternParts = new List<string>();
    foreach (string part in parts) {
        patternParts.Add(Regex.Escape(part));
    }
    // Then put it back together with .*, which matches
    // any sequence of characters, and surround it with
    // the start-of-string (^) and end-of-string ($) patterns
    // so we don't match on things that only start or end
    // with what we're after.
    string newPattern = "^" + string.Join(".*", patternParts.ToArray()) + "$";
    Regex expression = new Regex(newPattern, RegexOptions.IgnoreCase);
    return expression.IsMatch(name);
}

Which, I think, is pretty nice, if I do say so myself. Best of all, it has the semantics I expect file patterns to have.
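To make those semantics concrete, here is a small sketch that prints the regular expression the Split/Escape/Join steps produce for the three patterns from the example command line. The class name WildcardDemo and the helper ToRegexPattern are hypothetical names invented for this illustration; they are not part of the method above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

internal static class WildcardDemo
{
    // Hypothetical helper: returns the regex text built by the same
    // Split / Regex.Escape / Join steps used in IsMatch.
    internal static string ToRegexPattern(string pattern)
    {
        string[] escapedParts = pattern
            .Split('*')
            .Select(part => Regex.Escape(part))
            .ToArray();
        return "^" + string.Join(".*", escapedParts) + "$";
    }

    internal static void Main()
    {
        foreach (string pattern in new[] { "*.foo", "*.bar", "foo*bar" })
        {
            Console.WriteLine(pattern + " -> " + ToRegexPattern(pattern));
        }
        // Prints:
        // *.foo -> ^.*\.foo$
        // *.bar -> ^.*\.bar$
        // foo*bar -> ^foo.*bar$
    }
}
```

Note that the dot in ".foo" comes out escaped, so it only matches a literal dot, and the ^ and $ anchors are what turn "contains" into "is exactly".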

 

Using it is a simple matter of calling Directory.GetFiles to get all the files in the directory, and then filtering the results with a looped call to this method. Since GetFiles returns full paths, pass just the file name portion to the method.
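A minimal sketch of that filtering step might look like the following. It assumes the patterns arrive as command-line arguments and that only the current directory is scanned; the class name WildcardFilter and the helper FilterByPatterns are hypothetical, and IsMatch is repeated in a compact form (without the short-circuits for "*" and exact matches) only so the sketch compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

internal static class WildcardFilter
{
    // Compact version of the IsMatch method above, repeated here
    // so that this sketch is self-contained.
    internal static bool IsMatch(string name, string pattern)
    {
        if (string.IsNullOrEmpty(name) || string.IsNullOrEmpty(pattern))
        {
            return false;
        }
        string[] parts = pattern.Split('*').Select(p => Regex.Escape(p)).ToArray();
        string regex = "^" + string.Join(".*", parts) + "$";
        return Regex.IsMatch(name, regex, RegexOptions.IgnoreCase);
    }

    // Hypothetical helper: keeps the paths whose file name matches
    // at least one of the patterns.
    internal static List<string> FilterByPatterns(
        IEnumerable<string> paths, IEnumerable<string> patterns)
    {
        List<string> patternList = patterns.ToList();
        return paths
            .Where(path => patternList.Any(
                pattern => IsMatch(Path.GetFileName(path), pattern)))
            .ToList();
    }

    internal static void Main(string[] args)
    {
        // One directory scan; every pattern is then applied in memory.
        string[] allFiles = Directory.GetFiles(Directory.GetCurrentDirectory());
        foreach (string path in FilterByPatterns(allFiles, args))
        {
            Console.WriteLine(path);
        }
    }
}
```

Because each file is tested against all the patterns in a single pass, a file that matches more than one pattern is still listed only once, so there are no results to merge.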

 

You'll note that I haven't put in support for the ? operator (match exactly one character), because a) I hardly ever use this myself, and b) it would be pretty simple to bodge in if I ever needed to. It would just take a second call to Split/Join over each part, done before the part is escaped, since Regex.Escape puts a backslash in front of any question mark.
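For what it's worth, here is one way the ? support could be sketched. This is an untested-in-production illustration rather than the code I shipped, and the class name WildcardWithQuestionMark is hypothetical. The one subtlety is ordering: each asterisk-delimited part is split at '?' while it is still raw text, and only the resulting pieces are escaped.

```csharp
using System;
using System.Text.RegularExpressions;

internal static class WildcardWithQuestionMark
{
    // Hypothetical variant of IsMatch: '*' matches zero or more
    // characters, '?' matches exactly one character.
    internal static bool IsMatch(string name, string pattern)
    {
        if (string.IsNullOrEmpty(name) || string.IsNullOrEmpty(pattern))
        {
            return false;
        }

        string[] starParts = pattern.Split('*');
        for (int i = 0; i < starParts.Length; i++)
        {
            // Split the raw part at '?' first, then escape each piece,
            // then rejoin with '.', which matches any single character.
            string[] questionParts = starParts[i].Split('?');
            for (int j = 0; j < questionParts.Length; j++)
            {
                questionParts[j] = Regex.Escape(questionParts[j]);
            }
            starParts[i] = string.Join(".", questionParts);
        }

        string regex = "^" + string.Join(".*", starParts) + "$";
        return Regex.IsMatch(name, regex, RegexOptions.IgnoreCase);
    }

    internal static void Main()
    {
        Console.WriteLine(IsMatch("file1.txt", "file?.txt"));  // True
        Console.WriteLine(IsMatch("file12.txt", "file?.txt")); // False
        Console.WriteLine(IsMatch("file.txt", "file?.txt"));   // False
    }
}
```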

 

Hope this helps someone. It was fun to write.  

10 comments:

  1. I applied that technique once for a project; you get the best of both worlds.

  2. I think DOS works well with that.

    e.g. Dir *d*.exe *d*.dll

    This lists DLLs and EXEs that have 'd' in their names.



    Directory.GetFiles doesn't seem to work with that.

  3. You can actually simplify this a little further:



    pattern = "^" + Regex.Escape(pattern).Replace("\\*", ".*") + "$";

    return Regex.IsMatch(filename, pattern, RegexOptions.IgnoreCase);

  4. Did you see Brian Kernighan's "A regular expression matcher" in Beautiful Code?



    A completely different implementation (there's no Regex.Escape method, for instance) but along the same lines.



    Although, on reading it, I had trouble seeing the 'beauty'... seeing as how 'char' is a 'byte'.

  5. Haven't read "Beautiful Code", I'm afraid. Too busy reading this right now:



    http://www.amazon.com/Paradigms-Artificial-Intelligence-Programming-Studies/dp/1558601910

  6. Norvig's a pretty amazing guy. I'm fairly sure he's got a lot of stuff figured out that most people don't understand. I'd like to read that book of his, but I don't think I'm quite up to it yet. :)



    Did you see his Theorizing from Data [1] presentation?



    I don't think Beautiful Code is a very impressive book when it's all said and done. There are better books to read, and I imagine Norvig's AI book is one of them.



    [1] http://www.youtube.com/watch?v=nU8DcBF-qo4

  7. I really doubt the AI book is beyond you: so far it hasn't been particularly hard going. We'll see - I'm only about a third of the way through.

  8. It's on my list. :)

  9. http://en.wikipedia.org/wiki/Glob_(programming)



    They call this globbing in Unix (because of the glob() function). The Windows command line supports the sort of behavior you are looking for with the Tab key, too, so you would think that there is a Win32 version lurking in the Windows DLLs (or maybe in CMD.EXE) somewhere.
