Wednesday, September 12, 2007

Linqing to Xhtml - Part 2

Update: Fixed minor bug in implementation of IsOfClass.

 

In a previous post, I talked about how I'm hoping to be able to use LINQ to XML to process XHTML, my current favorite data serialization format. At the end of that post, I wrote code more or less like this:

var alternates = from ul
in document.Element(xhtmlns + "html").Element(xhtmlns + "body").Elements(xhtmlns + "ul")
where ul.Attribute("class").Value == "alternates"
select
from li
in ul.Elements(xhtmlns + "li")
select li.Value;

The idea was to be able to pull the values out of a bunch of XHTML list items. The problem with this code is that it doesn't really give me what I want. If you were to look at the type of the object referred to by alternates, you'd discover that it's a

 

System.Linq.Enumerable.SelectIterator<System.Xml.Linq.XElement,System.Collections.Generic.IEnumerable<string>>

Which - if you can read that expression without going blind - indicates that what I've got is essentially "a sequence of a sequence of strings". No, that's not a typo: it's a sequence of sequences, and iterating over it with a nested loop is sort of annoying.

 

Fortunately, it appears that Don Box reads this blog, or at least read that post. :) He's the co-author of the rather excellent article found here, and had I read it I would have known the solution. But even though I hadn't (I have now - you should, too), he was kind enough to drop by with a comment that made everything work. Here's the code I'm using now:

 

var alternates =
from ul
in document.Element(Xhtml.Tag("html")).Element(Xhtml.Tag("body")).Elements(Xhtml.Tag("ul"))
where ul.IsOfClass("alternates")
from li
in ul.Elements(Xhtml.Tag("li"))
where li.IsOfClass("alternate")
select li.Value;

I've made a few changes beyond just the Linq bits, but I'll explain those in a minute. The key to making the query work was removing the first "select". Having the second "from" clause follow without an intervening "select" results in a SelectMany method call (all the Linq keywords like select, from, where, etc. are just shorthand for method calls). And that's exactly what we want: SelectMany collapses the query to a single dimension. Read the article for a better explanation. With this change, the query now returns something we can iterate directly over with a "foreach (string alternate in alternates)". Nice.
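To make the difference concrete, here is a small self-contained sketch. The sample markup, the FlattenDemo class, and its method names are invented for illustration; only the query shapes come from the code above. It runs the same data through the nested-select form, the two-"from" form, and an explicit SelectMany call, which is roughly what the second "from" turns into.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FlattenDemo
{
    // The XHTML namespace; the string converts implicitly to an XNamespace.
    static readonly XNamespace xhtmlns = "http://www.w3.org/1999/xhtml";

    // Hypothetical sample document, made up for this sketch.
    public const string Sample =
        "<html xmlns='http://www.w3.org/1999/xhtml'><body>" +
        "<ul class='alternates'><li>one</li><li>two</li></ul>" +
        "<ul class='other'><li>ignored</li></ul>" +
        "<ul class='alternates'><li>three</li></ul>" +
        "</body></html>";

    // The <ul class="alternates"> elements directly under <body>.
    static IEnumerable<XElement> Lists(XDocument document)
    {
        return document.Element(xhtmlns + "html")
                       .Element(xhtmlns + "body")
                       .Elements(xhtmlns + "ul")
                       .Where(ul => (string)ul.Attribute("class") == "alternates");
    }

    // Nested select: one inner sequence of strings per <ul>.
    public static IEnumerable<IEnumerable<string>> Nested(XDocument document)
    {
        return from ul in Lists(document)
               select
                   from li in ul.Elements(xhtmlns + "li")
                   select li.Value;
    }

    // Two from clauses: a single flat sequence of strings.
    public static IEnumerable<string> Flat(XDocument document)
    {
        return from ul in Lists(document)
               from li in ul.Elements(xhtmlns + "li")
               select li.Value;
    }

    // The same flattening written as direct method calls.
    public static IEnumerable<string> FlatExplicit(XDocument document)
    {
        return Lists(document)
            .SelectMany(ul => ul.Elements(xhtmlns + "li"))
            .Select(li => li.Value);
    }

    public static void Main()
    {
        XDocument document = XDocument.Parse(Sample);

        // The nested form needs two loops.
        foreach (IEnumerable<string> inner in Nested(document))
            foreach (string alternate in inner)
                Console.WriteLine("nested: " + alternate);

        // The flattened form needs one.
        foreach (string alternate in Flat(document))
            Console.WriteLine("flat: " + alternate);
    }
}
```

Both flattened versions yield the same strings in the same order as the nested loops; the only thing that changes is the shape of the result.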

 

As for the other changes, there were a couple. One was to create a class called Xhtml with a static method called Tag that creates my XNames. This just cleans up the code a little bit from all that "xhtmlns +" stuff I had before. I also created this extension method:

public static bool IsOfClass(this XElement element, string className)
{
    // TODO: this should really account for the fact that the class
    // attribute is multivalued - i.e. it's legal to have
    // class="foo bar quux", and we should return true for any of
    // foo, bar, or quux.
    // The cast yields null (rather than throwing) when there is no class attribute.
    return (string)element.Attribute("class") == className;
}

to let me use IsOfClass on an XElement - I just think the syntax is cleaner, and as I do more and more XHTML processing, stuff like this should help contribute to my goal of a reasonable syntax.
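As a rough sketch of how that TODO could be handled, the version below splits the class attribute on whitespace and treats a missing attribute as "no classes". The Xhtml class here is only a guess at a minimal form of the helper described above, and the sample markup and the XhtmlExtensions and IsOfClassDemo names are made up for the example.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// A minimal guess at the Xhtml helper described above (an assumption, not the original source).
public static class Xhtml
{
    private static readonly XNamespace Ns = "http://www.w3.org/1999/xhtml";

    // Returns the namespace-qualified name for an XHTML tag.
    public static XName Tag(string name)
    {
        return Ns + name;
    }
}

public static class XhtmlExtensions
{
    // True if className is one of the whitespace-separated tokens in the
    // element's class attribute; false if the attribute is missing.
    public static bool IsOfClass(this XElement element, string className)
    {
        // The explicit cast yields null when there is no class attribute.
        string value = (string)element.Attribute("class");
        if (value == null)
            return false;

        // A null separator array means "split on whitespace".
        return value
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Contains(className);
    }
}

public static class IsOfClassDemo
{
    public static void Main()
    {
        // Hypothetical markup for illustration.
        XElement li = XElement.Parse(
            "<li xmlns='http://www.w3.org/1999/xhtml' class='foo bar quux'>x</li>");

        Console.WriteLine(li.IsOfClass("bar"));         // True
        Console.WriteLine(li.IsOfClass("ba"));          // False: whole tokens only
        Console.WriteLine(li.Name == Xhtml.Tag("li"));  // True
    }
}
```

Matching whole tokens also avoids false positives on names that merely contain the one you asked for, such as "ba" against "bar".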

7 comments:

  1. A couple of untested options for your IsOfClass method ... this assumes the full class string is in classAttributeValue.



    LINQ style:



    return classAttributeValue.Split().Any(c => c.Equals(className));



    Regex style:



    return new Regex(@"(?<!\S)" + Regex.Escape(className) + @"(?!\S)").IsMatch(classAttributeValue);

  2. I like it, particularly the first one. :)

  3. Hi, I am currently working on a project similar to yours: I am trying to convert my XHTML to XElements.



    The XHTML comes in from an email, and my system has to pull the values out of the <p> elements in the body and then insert them into the relevant fields in the XElement.



    var pieces =

    from p in body.Descendants("p")

    select (string)p.Value;



    This is the code but it is not returning any values.



    I see you made use of Xhtml.Tag("html"). What do I need to include to access XHTML elements?



    Thanks Oliver

  4. The reason your code isn't working is that the element "p" is in the XHTML namespace, but you're querying for p elements in no namespace.



    I wrote Xhtml.Tag() to return an XName with the specified name and the XHTML namespace. You just need to do something similar.

  5. Thanks, I managed to figure that out "EVENTUALLY".



    xmlns = "{" + body.GetDefaultNamespace().NamespaceName + "}";



    and then appended the namespace to all the queries.



    Thanks again for your help, hope this helps some other people.

  6. Could you please post the code for this function Xhtml.Tag?

  7. This is off the top of my head, so I might have a typo in here, but basically this:

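    public static class Xhtml
    {
        // XNamespace has no public constructor; a string converts implicitly.
        private static readonly XNamespace _xhtmlNs =
            "http://www.w3.org/1999/xhtml";

        // Returns an XName in the XHTML namespace, e.g. Xhtml.Tag("li").
        public static XName Tag(string name)
        {
            return _xhtmlNs + name;
        }
    }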
