The RSS Blog

News and commentary from the RSS and OPML community.

Somehow, the debate about how to encode RSS text elements has resurfaced. I personally thought this was resolved, but clearly the debate rages on. The problem with encoding derives from the fact that certain characters, like the ampersand (&), have special meaning in XML, so care must be taken to encode them. As a starting point, let's consider the following XML fragment.

<title>How to encode & in XML</title>

This XML fragment is not well-formed in itself. That's because the special ampersand character must be encoded.
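
To see this in practice, here is a minimal sketch that hands the fragment to Python's standard-library XML parser. The helper name is_well_formed is my own, used purely for illustration; any conforming XML parser should reject the fragment in a similar way.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        # The parser rejects the bare ampersand before building any tree.
        return False


raw = "<title>How to encode & in XML</title>"
print(is_well_formed(raw))
```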

Single Encoding

The simplest way to encode the ampersand is to replace it with the string "&amp;". The well-formed XML fragment would look like the following.

<title>How to encode &amp; in XML</title>

When this is decoded, the XML text becomes "How to encode & in XML". This is called entity encoding. There are at least three other ways of encoding the same information. 

  1. Decimal character encoding (a.k.a. numeric character encoding)
    & becomes &#38;
  2. Hexadecimal character encoding (a.k.a. numeric character encoding)
    & becomes &#x26;
  3. CDATA encoding
    & becomes <![CDATA[&]]>

Immediately, we have four different ways of encoding an ampersand. All four are valid for encoding an ampersand in an RSS <title> element. All four are called single encoding, because we've encoded the data exactly once.
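
As a quick check that the four forms really are interchangeable, the sketch below parses each variant with Python's standard-library parser and compares the decoded text. The dictionary keys are just labels I picked for this example.

```python
import xml.etree.ElementTree as ET

# The same title, written with each of the four single encodings.
fragments = {
    "entity": "<title>How to encode &amp; in XML</title>",
    "decimal": "<title>How to encode &#38; in XML</title>",
    "hexadecimal": "<title>How to encode &#x26; in XML</title>",
    "cdata": "<title>How to encode <![CDATA[&]]> in XML</title>",
}

# Parsing decodes each fragment exactly once.
decoded = {name: ET.fromstring(xml).text for name, xml in fragments.items()}

for name, text in decoded.items():
    print(name, "->", text)
```

Every variant should decode to the same string, which is why a consumer that parses the feed properly cannot tell them apart.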

Double Encoding

There's also a reason to encode the data twice. This happens when we want to encode HTML within an XML element. This matters because the RSS specification specifically allows this for the <description> element.

the description contains the text (entity-encoded HTML is allowed; see examples)

Now, it must be pointed out that the spec says "entity-encoded HTML" but refers to examples that are both CDATA-encoded and entity-encoded. That's OK, I'm capable of making the leap of faith that the author meant the <description> should be double-encoded. Let's make clear that this means the <description> child element of <item> is double-encoded. The <description> child element of <channel> remains single-encoded. All other elements are single-encoded, including the <title>.
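
Here is a rough sketch of what double encoding looks like end to end. The sample HTML and the TextExtractor class are made up for illustration. The ampersand is encoded once as HTML and then a second time when the HTML is embedded in the XML, so a reader has to decode twice to get back to plain text.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from xml.sax.saxutils import escape

# First encoding: the ampersand is already escaped for HTML.
html_body = "<p>Fish &amp; chips</p>"

# Second encoding: escape the HTML markup so it can sit inside XML.
description_xml = "<description>" + escape(html_body) + "</description>"
print(description_xml)

# First decode: the XML parser gives back the HTML markup.
recovered_html = ET.fromstring(description_xml).text


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML snippet."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


# Second decode: the HTML parser resolves &amp; and drops the tags.
extractor = TextExtractor()
extractor.feed(recovered_html)
extractor.close()
plain_text = "".join(extractor.parts)
print(plain_text)
```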

Reference

The best article written on the subject was by James Holderness, titled Encoding RSS Titles. Let's use this as a starting point. James did a great job of enumerating what works and doesn't work and he summarizes as follows.

Clearly if you want to support Firefox or Internet Explorer you’ve got no choice but to use the single encoding option. For certain strings, though, that would mean losing support for at least twenty other aggregators. No matter what you do, you can’t win.

Perhaps the only solution, barring a miraculous change of policy on the part of certain browser authors, is to just use both.

What the first sentence means to me is that publishers should single-encode their <title>s. The closing suggestion, to just use both, was a more complex solution where the publisher could use User-Agent sniffing to serve the <title> as either single- or double-encoded depending on what the user-agent supported.
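
A bare-bones sketch of the sniffing idea might look like the following. The function name encode_title and the marker substrings are assumptions of mine: the markers stand in for the two browsers named in the quote, and falling back to double encoding for every other client is a guess, not something taken from James's results. A real deployment would need a tested list of agents.

```python
from xml.sax.saxutils import escape

# Hypothetical User-Agent substrings for clients that expect single encoding.
SINGLE_ENCODING_AGENTS = ("Firefox", "MSIE")


def encode_title(title: str, user_agent: str) -> str:
    """Return the title text to place inside <title> for this client."""
    single = escape(title)
    if any(marker in user_agent for marker in SINGLE_ENCODING_AGENTS):
        return single
    # Assumption: every other client is treated as expecting double encoding.
    return escape(single)


print(encode_title("AT&T", "Mozilla/5.0 Firefox/2.0"))
print(encode_title("AT&T", "SomeAggregator/1.0"))
```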
