This is because of the method used to embed the structured post's XML source in the HTML output.
How the output looks
The current output looks like this, with the XML source for the post shown in bold:
<script type="application/x-subnode; charset=utf-8">
<!-- the following is structured blog data for machine readers. -->
<subnode alternate-for-id="sbentry_5"
    xmlns:data-view="http://www.w3.org/2003/g/data-view#"
    data-view:interpreter="http://structuredblogging.org/subnode-to-rdf-interpreter.xsl"
    xmlns="http://www.structuredblogging.org/xmlns#subnode">
  <xml-structured-blog-entry xmlns="http://www.structuredblogging.org/xmlns">
    <generator id="wpsb-1" type="x-wpsb-post" version="1"/>
    <event type="event/conference">
      <name>Doc's show</name>
      <image>/~phil/sb_latest/images/syndicate_logo.gif</image>
      <person role="organizer" url="http://doc.weblogs.com">Doc Searls</person>
      <description>This is Doc's show. He organized it, decided what panels to have, and he's paying for dinner.</description>
      <tags>doc</tags>
      <begins>2005-12-13T15:57:00</begins>
      <ends>2005-12-13T15:57:00</ends>
    </event>
  </xml-structured-blog-entry>
</subnode>
</script>
This embedding technique, called x-subnode, was invented by the guys at PubSub (I think Bob Wyman and Duncan Werner) when they did the first SB plugin, and it's pretty clever. Because they don't know about the application/x-subnode script type, browsers will completely ignore the contents. This means you don't need to enclose the whole thing in a comment to stop it from being displayed. Then, you can just drop the whole thing into an RSS <description> or Atom <content> element and have the structured data flow out through the feed.
Other bits to note:
- The alternate-for-id attribute points to an ID earlier in the page, on the element that encloses the HTML of this post. This would let a Greasemonkey script reformat the post if it wanted to, or allow a crawler to go back from the structured data to the actual HTML.
- The two data-view attributes on the subnode element (xmlns:data-view and data-view:interpreter) are there to enable GRDDL, which lets RDF people extract meaning from the XML content. This lets us be "RDF compatible" without having to actually generate the RDF.
So, in summary:
- It lets you embed XML inside HTML without commenting it out.
- The XML is still accessible using an XML parser, so XSLT etc works.
- GRDDL tools will be able to turn it into RDF.
- It works inside HTML and also inside RSS/Atom, so a separate embedding method isn't required for feeds.
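To make the points above concrete, here is a rough sketch of how a crawler might pull an x-subnode block out of a page and hand it to an ordinary XML parser. It uses only the Python standard library. The class and function names (SubnodeExtractor, extract_entries) and the cut-down sample page are my own inventions for illustration, not part of any SB tool.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

SB_NS = "http://www.structuredblogging.org/xmlns"


class SubnodeExtractor(HTMLParser):
    """Collect the raw text of each <script type="application/x-subnode"> block."""

    def __init__(self):
        super().__init__()
        self._in_subnode = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # The type may carry parameters ("; charset=utf-8"), so compare the part before ";".
        script_type = (dict(attrs).get("type") or "").split(";")[0].strip()
        if tag == "script" and script_type == "application/x-subnode":
            self._in_subnode = True
            self._buffer = []

    def handle_data(self, data):
        # Script contents arrive as plain text, possibly in several chunks.
        if self._in_subnode:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_subnode:
            self.blocks.append("".join(self._buffer))
            self._in_subnode = False


def extract_entries(html):
    """Return one parsed <subnode> element per x-subnode block found in the page."""
    parser = SubnodeExtractor()
    parser.feed(html)
    parser.close()
    return [ET.fromstring(block.strip()) for block in parser.blocks]


# Hypothetical page, trimmed down from the output shown earlier.
SAMPLE_PAGE = """
<div id="sbentry_5"><h3>Doc's show</h3></div>
<script type="application/x-subnode; charset=utf-8">
<!-- the following is structured blog data for machine readers. -->
<subnode alternate-for-id="sbentry_5"
    xmlns="http://www.structuredblogging.org/xmlns#subnode">
  <xml-structured-blog-entry xmlns="http://www.structuredblogging.org/xmlns">
    <event type="event/conference">
      <name>Doc's show</name>
    </event>
  </xml-structured-blog-entry>
</subnode>
</script>
"""

entries = extract_entries(SAMPLE_PAGE)
for entry in entries:
    print(entry.get("alternate-for-id"), entry.findtext(f".//{{{SB_NS}}}name"))
```

The alternate-for-id value gives the crawler the ID of the HTML element to go back to, which is the round trip described above.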
But using <script> for all this sets off warnings everywhere we go, and pretty much everyone who looks at the embedded data, whether in a web page or in a feed, has a really bad first impression. So, it's time to do something about that.
Here are my thoughts so far.
Tidying the GRDDL stuff
It seems (from reading the GRDDL Team Submission, the GRDDL profile document, and Danny Ayers' explanation of how to make microformats GRDDL-friendly) that the data-view bits needn't appear in the XML when embedded in HTML. If we put a profile for Structured Blogging in the HTML header (in the profile attribute of the <head> element), then, in the profile page, refer to the data-view profile and point to the SB XSLT file using profileTransformation, GRDDL-aware tools will run the XSLT file on pages generated by the SB plugin.
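As a small illustration of the consumer side, this is roughly the first step a GRDDL-aware tool would take: read the profile URIs off the <head> element so it can go and look for a transformation in each profile page. The profile URL below is a made-up placeholder (no SB profile page has been defined yet), and ProfileReader is a hypothetical name.

```python
from html.parser import HTMLParser


class ProfileReader(HTMLParser):
    """Collect the URIs listed in the profile attribute of <head>."""

    def __init__(self):
        super().__init__()
        self.profiles = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            # In HTML 4 the profile attribute is a space-separated list of URIs.
            self.profiles.extend((dict(attrs).get("profile") or "").split())


# Placeholder URL: an assumption for illustration, not a real SB profile.
SAMPLE_HEAD = (
    '<html><head profile="http://example.org/sb-profile">'
    "<title>My blog</title></head><body></body></html>"
)

reader = ProfileReader()
reader.feed(SAMPLE_HEAD)
reader.close()
print(reader.profiles)
```

Fetching each profile page and following its profileTransformation link to the XSLT is left out here.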
Getting the XML out of the page
After setting up the GRDDL profile/transform, we could define a microformat to link to the XML source and move it to another URL. This way an RDF crawler would still pick up on it, while crawlers specifically looking for SB posts could look for the links and work from there.
I'm not quite sure how this should look, but here's one possibility: put a class name (e.g. sb_post) on an element surrounding the post, and inside that element, link to the XML source with rel="sb_source". So the HTML for a post might look like:
<div class="structured_post">
  <h3>This is the post title</h3>
  <p>Here is some text</p>
  <p>(<a rel="alternate" type="application/xml" href="/path/to/xml_source">XML</a>)</p>
</div>
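For a sense of what this would ask of a crawler, here is a minimal sketch that finds the XML source links for posts marked up as in the HTML above. It assumes the class name and rel/type values from that example; SourceLinkFinder, find_sources and the example.org base URL are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SourceLinkFinder(HTMLParser):
    """Find <a rel="alternate" type="application/xml"> links inside structured posts."""

    def __init__(self, post_class="structured_post"):
        super().__init__()
        self.post_class = post_class
        self._depth = 0  # <div> nesting depth inside a structured post; 0 means outside
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            if self._depth or self.post_class in classes:
                self._depth += 1
        elif tag == "a" and self._depth:
            rels = (attrs.get("rel") or "").split()
            if ("alternate" in rels
                    and attrs.get("type") == "application/xml"
                    and attrs.get("href")):
                self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1


def find_sources(html, base_url):
    """Return absolute URLs of the XML sources linked from structured posts."""
    finder = SourceLinkFinder()
    finder.feed(html)
    finder.close()
    return [urljoin(base_url, href) for href in finder.links]


SAMPLE_POST = """
<p><a rel="alternate" type="application/xml" href="/feed.xml">not a post</a></p>
<div class="structured_post">
  <h3>This is the post title</h3>
  <p>Here is some text</p>
  <p>(<a rel="alternate" type="application/xml" href="/path/to/xml_source">XML</a>)</p>
</div>
"""

sources = find_sources(SAMPLE_POST, "http://example.org/blog/")
print(sources)
```

Each URL returned costs the crawler one more HTTP request, which is the trade-off with keeping the XML inline.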
Making the XML more accessible inside feeds
Currently the whole chunk of XML (above) is embedded in the content elements in syndication feeds, as part of the encoded HTML. It would look a lot nicer if it could be moved out, perhaps like this:
<item>
  ...
  <description>HTML goes here</description>
  <source xmlns="http://structuredblogging.org/xmlns" url="http://server/path/to/xml_source">
    core XML -- <event> from the first example -- goes here
  </source>
</item>
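Here is a sketch of what a feed parser would have to do to understand a proposed item like the one above. The sample item is hypothetical: it fills in the placeholder with a shortened copy of the event from the first example, and read_sb_source is just an illustrative name.

```python
import xml.etree.ElementTree as ET

# Namespace as written on the proposed <source> element.
SB_FEED_NS = "http://structuredblogging.org/xmlns"

SAMPLE_ITEM = """<item>
  <title>Doc's show</title>
  <description>HTML goes here</description>
  <source xmlns="http://structuredblogging.org/xmlns" url="http://server/path/to/xml_source">
    <event type="event/conference">
      <name>Doc's show</name>
      <begins>2005-12-13T15:57:00</begins>
    </event>
  </source>
</item>"""


def read_sb_source(item):
    """Return the source URL and basic event details from a feed item, or None."""
    # Look the element up by namespace so an ordinary, un-namespaced <source> is not matched.
    source = item.find(f"{{{SB_FEED_NS}}}source")
    if source is None:
        return None
    event = source.find(f"{{{SB_FEED_NS}}}event")
    return {
        "url": source.get("url"),
        "type": event.get("type") if event is not None else None,
        "name": event.findtext(f"{{{SB_FEED_NS}}}name") if event is not None else None,
    }


info = read_sb_source(ET.fromstring(SAMPLE_ITEM))
print(info)
```

RSS 2.0 already has its own <source> element on items, so matching on the namespace rather than the bare element name matters here.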
We could GRDDL-enable this by putting a
namespaceTransformation reference in the
Pros and cons of the changes
Making these changes would:
- make everything look a lot nicer,
- and make everything validate,
- while maintaining RDF compatibility.
The downsides are:
- the XML would no longer be directly available inside the HTML, so a crawler would have to make more HTTP requests,
- the XML wouldn't be sent over to other blogs when making remote posts via outputthis.org,
- and feed parsers (like the one powering PubSub) would have to be modified to understand the new syntax.
Perhaps the best solution would be to:
- Keep publishing the XML source (using x-subnode) in the HTML (and when sending via outputthis.org),
- but use profileTransformation to get the data-view attributes out of each <subnode> element.
- Use an sb:source element to include the XML source in feeds (rather than x-subnode).
Update (2005-01-19): I've changed my mind. Linking to the XML like this - with <a rel="alternate"> - is actually more likely to be preserved when sending stuff around (with outputthis or inside a feed). The only issue is that the link doesn't look that great. Perhaps we need an "SB XML source" icon, like RSS's white-on-orange XML icon. I've seen the white-on-orange icon used to mean other things than RSS, but I'm not sure how widespread that is.