Thursday, July 28, 2005

How Very 1998 of Them

If you've read my blog for a while, you know that I've long since left GotDotNet for SourceForge. Generally, I've been happier there, particularly because I far prefer CVS to the crazy contraption GDN has, but not all is rosy. To start with, I find that during the middle of the day, SF has lately been slow to the point of unusability. This is particularly painful when I'm trying to use the bug and feature tracking pages, which require several clicks to get to what I'm looking for. Each click can take as long as 30 seconds.


At times I've thought that maybe I should just set up my own bug database. But then I'd have to evaluate the options, pick one, and then maintain it. Bleh. Still, it might be better than having to wait half a minute for each page to come back. Then, the other day, I stumbled across this, a feature of SF that lets you get the entire contents of a tracker in “XML” form. (More about the quotes in a second.)


Brilliant! With this, I figured I could suck the bug and feature databases onto my machine, then use Excel or InfoPath (if I ever figure out InfoPath) or whatever to browse the database quickly. I can't do updates, but often all I want to do is read through a bug description or do a quick search, and a local export is perfect for that. So I sat down to automate the download, got done with that in a few seconds (gotta love wget), and then went to load up the FlexWiki bug database in Excel.


XML Parse Error.


WTF? I ran it through one of the XML utilities I wrote, and found out that there was an unescaped “smart quote” in the text. Disturbing that it wound up in the export, but oh well - three seconds with “find and replace” fixed that. Fire it up in Excel again and...


XML Parse Error.


OK, now what? Well, it turns out that the smart quote issue was just the tip of the iceberg. Here's an excerpt from the exported file that shows the problem in all its glaring obviousness:


<detail>Produces this output (first </ol> not needed, second
opening of <ol> not needed).  If that was fixed then the
incorrect double <ul> should also be fixed (single <ul>)
as they would be no longer necessary. BTW:  also seen
here the unnecessary blank lines
</detail>


Note the numerous problems with this. In this excerpt the culprit is HTML tags emitted verbatim, but unescaped & and < characters (not to mention the smart quotes) litter the rest of the export too. This is a classic sign of XML being built via string concatenation, and it renders this export virtually useless for consumption - all the dangling close and open tags completely destroy the ability of any reasonable XML parser to work with this document. I would have thought this sort of thing went out with go-go boots, but here it is, right in my face, in 2005.
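To make the failure concrete, here is a minimal Python sketch (not SourceForge's actual export code, which I haven't seen). The `detail_text` value is a made-up field modeled on the excerpt above, and the function names are purely illustrative. It pastes the text into an element by concatenation, then does the same after escaping it, and asks a standard parser whether each result is well-formed.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical tracker field, modeled on the excerpt above.
detail_text = "first </ol> not needed, second opening of <ol> not needed"

def concat_xml(text):
    """Naive concatenation: the text is pasted into the element verbatim."""
    return "<detail>" + text + "</detail>"

def escaped_xml(text):
    """Same shape, but &, < and > in the text are escaped first."""
    return "<detail>" + escape(text) + "</detail>"

def parses(document):
    """Return True if the string is well-formed XML, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

print(parses(concat_xml(detail_text)))   # prints False: stray </ol> breaks nesting
print(parses(escaped_xml(detail_text)))  # prints True: tags are now plain text
```

Once the text is escaped, a parser hands back exactly the characters that went in, angle brackets and all.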


Sigh. It's a fairly simple fix to process the document with custom code to escape the “XML” well enough to be able to work with it, but I shouldn't have to. The moral of the story here is that if you find yourself doing something like this:


xml = "<foo>" + fooContents + "</foo>";


then you should lose points on your programming license.
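For contrast, here is a rough sketch of the alternative in Python: hand the text to an XML library and let it do the escaping. The `build_foo` helper and the sample string are hypothetical, chosen only to mirror the `foo` snippet above. Any XML API that builds a tree and serializes it would serve the same purpose.

```python
import xml.etree.ElementTree as ET

def build_foo(foo_contents):
    """Build a <foo> element and let the library escape its text content."""
    foo = ET.Element("foo")
    foo.text = foo_contents  # stored as raw text; escaped on serialization
    return ET.tostring(foo, encoding="unicode")

# Hypothetical contents containing every troublemaker from the export:
# an ampersand, an HTML tag, and smart quotes.
contents = "R&D says the extra <ol> is “not needed”"
xml = build_foo(contents)
print(xml)

# Round trip: parsing the serialized form gives back the original text.
assert ET.fromstring(xml).text == contents
```

The caller never has to think about which characters are special, which is the whole point: the escaping rules live in one place, inside the library.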

4 comments:

  1. Programmer's license?!?! You mean you're supposed to have one of those??? Where do I get it, at the DPV?



    I'd like to see some of the "programmers" out there take the "driving test" for that... There'd be a lot less programmers legally programming then... And insurance rates would go up then too due to all the "unlicensed" programming going on.

  2. You can get insurance against bad programmers?!? Sign me up!

  3. What about when the snippet of XML is very short and the cost of creating an XML document or element is deemed too expensive for the task required? (I suppose a counterargument to this might be, 'What are you using XML for if performance is so critical?')



    The fact that the example given above is embedding HTML content within an element, and that the developer (or, more precisely, the development team) has not done enough testing to validate the generated XML, is more of an issue IMHO.



  4. Building up strings without proper encoding is suicide.



    The same technique is exactly what causes HTML script injection and SQL injection errors.



    [)
