Google on Well-Formed XML - The RSS Blog

Fri, 23 Dec 2005 17:31:49 GMT

Mihai Parparita: Here are the top XML errors that we have encountered when parsing all of the feeds that our users have added to Reader.

% of errors	Error description
15.6%	Input claims to be UTF-8 but contains invalid characters.
14.9%	Opening and ending tags mismatch
13.9%	An undefined entity is used (e.g. in an XML document without importing the HTML set)
7.8%	Document expected to begin with a start tag, but no `<` was found
5.7%	Disallowed control characters present
5.5%	Extra content at the end of the document
4.2%	Unterminated entity reference (missing semi-colon)
4.2%	Unquoted attribute value
3.8%	Premature end of data in tag (truncated feed)
3.3%	Naked ampersand (should be represented as `&amp;`)
2.1%	XML declaration allowed only at the start of the document
1.8%	Namespace prefix is used but not defined
0.75%	Comment not terminated
0.64%	Attribute without value
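
Several of the error classes above are easy to reproduce. The following is a minimal sketch, not Google's method: it runs a few hand-written sample documents through Python's standard `xml.etree.ElementTree` parser and reports the parser's message for each. The helper name `well_formedness_error` and the sample feeds are made up for illustration, and the exact wording of the messages depends on the underlying parser.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(data: bytes):
    """Return None if data is well-formed XML, else the parser's error message.

    Hypothetical helper for illustration only.
    """
    try:
        ET.fromstring(data)
    except ET.ParseError as exc:
        return str(exc)
    return None


# Hand-written samples, one per error class from the table (assumed inputs).
samples = {
    "well-formed": b"<rss><title>News &amp; notes</title></rss>",
    "invalid UTF-8": b'<?xml version="1.0" encoding="UTF-8"?><rss>\xff</rss>',
    "mismatched tags": b"<rss><title>News</titel></rss>",
    "undefined entity": b"<rss><title>caf&eacute;</title></rss>",
    "unquoted attribute": b"<rss version=2.0></rss>",
    "truncated feed": b"<rss><title>News",
    "naked ampersand": b"<rss><title>News & notes</title></rss>",
}

for name, data in samples.items():
    print(f"{name}: {well_formedness_error(data)}")
```

Only the first sample should come back as `None`. Every other sample is rejected outright, which is why a single stray `&` is enough to make a strict aggregator drop an entire feed.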

http://googlereader.blogspot.com/2005/12/xml-errors-in-feeds.html

Randy: Some interesting data would be the percentage chance that a feed has ill-formed XML based on the generator (Blogger, WordPress, TypePad, MT, etc). Anybody got that data?