Google Base encoding woes
Google Base lets you upload descriptions of goods, services, publications—or whatever—to enable the world to find your stuff.
You can do this item-by-item, using a web form, or in bulk by submitting an RSS or Atom feed as an xml file.
So far, so good ...
But ...
If you want to use extended character sets (e.g. characters such as ξ € я þ ø æ œ and accented letters such as å ç ê ñ ü), you will naturally use a Unicode encoding such as utf-8.
So, I set up the feed for utf-8 encoding and uploaded the xml using Direct Upload, Google Base's file upload interface ...
Once my feed was processed, Google said: "Your data feed contains an invalid character for the current encoding setting."
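Before blaming the upload path, it is worth ruling out a genuinely malformed file. The following is a minimal Python sketch of that check, not anything Google provides; the function name first_invalid_utf8_offset is made up for illustration. It reports where the bytes stop being valid utf-8, if anywhere.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return err.start
    return None


if __name__ == "__main__":
    good = "å ç ê ñ ü".encode("utf-8")
    # Latin-1 bytes that are merely labelled as UTF-8 are a common cause
    # of "invalid character" complaints.
    bad = "å ç".encode("latin-1")
    print(first_invalid_utf8_offset(good))
    print(first_invalid_utf8_offset(bad))
```

To check a real feed, read it in binary mode (open(path, "rb").read()) and pass the bytes in. If the result is None, the file itself is fine and the problem lies somewhere between you and the service.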
I tried File Transfer Protocol, uploading via ftp to google.uploads.com. Google still didn't get the right encoding (I think Google was at fault here).
I tried Automatic upload via scheduling. This probably failed because my ISP's server insists on sending a content header saying it is serving everything in ascii—Google did the "right thing" and believed this travesty.
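To confirm a suspicion like this, you can look at the charset the server declares. The sketch below assumes the header in question is the usual HTTP Content-Type header (my guess; I did not record the exact header at the time), and the function name declared_charset is hypothetical. It only parses a header value, so it runs without network access.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any.

    Hypothetical helper; uses the standard library's header parser.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()


if __name__ == "__main__":
    print(declared_charset("text/xml; charset=us-ascii"))
    print(declared_charset("application/xml; charset=UTF-8"))
    print(declared_charset("text/xml"))
```

Against a live server, the headers object returned by urllib.request.urlopen(...) offers the same get_content_charset() method, so the check can be pointed at wherever your feed is hosted.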
So, nothing worked: "Your data feed contains an invalid character for the current encoding setting."
Solution: use xsl:output to encode your feed in ascii
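Here is a simple xsl transform that copies an xml file and changes its encoding. The important part is the attribute encoding="us-ascii" in the xsl:output element. Any character that does not fit in ascii is written out as a numeric character reference (such as &#958; for ξ), which an xml parser reads back as the original character, so nothing is lost.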
toascii.xsl
<?xml version="1.0" encoding="utf-8"?>
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="us-ascii" />
  <xsl:template match="/">
    <xsl:apply-templates />
  </xsl:template>
  <xsl:template match="*|@*|comment()|processing-instruction()|text()">
    <xsl:copy>
      <xsl:apply-templates select="*|@*|comment()|processing-instruction()|text()"/>
    </xsl:copy>
  </xsl:template>
</xsl:transform>
utf8.xml
<?xml version="1.0" encoding="utf-8"?>
<sample>
  characters, ξ € я þ ø æ œ and accents, å ç ê ñ ü
</sample>
Executing the command
xsltproc -o ascii.xml toascii.xsl utf8.xml
produces an ascii-encoded version:
ascii.xml
<?xml version="1.0" encoding="us-ascii"?>
<sample>
  characters, &#958; &#8364; &#1103; &#254; &#248; &#230; &#339; and accents, &#229; &#231; &#234; &#241; &#252;
</sample>
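If you want reassurance that the ascii version carries the same content, a parser can compare the two. This is a small Python sketch with shortened sample documents; the function name same_text is hypothetical. It parses a utf-8 document and an ascii one that uses numeric character references, then compares their text.

```python
import xml.etree.ElementTree as ET

# Shortened stand-ins for utf8.xml and ascii.xml.
utf8_doc = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<sample>ξ € я and å ç</sample>"
).encode("utf-8")
ascii_doc = (
    b'<?xml version="1.0" encoding="us-ascii"?>'
    b"<sample>&#958; &#8364; &#1103; and &#229; &#231;</sample>"
)


def same_text(a: bytes, b: bytes) -> bool:
    """Return True if both documents contain the same text once parsed."""
    text_a = "".join(ET.fromstring(a).itertext())
    text_b = "".join(ET.fromstring(b).itertext())
    return text_a == text_b


if __name__ == "__main__":
    print(same_text(utf8_doc, ascii_doc))
    # Every byte of the ascii version is below 128.
    print(all(byte < 128 for byte in ascii_doc))
```

The character references are resolved by the parser, so whoever reads the feed sees the same characters whichever encoding the file declares.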
If you are already using xsl to produce your feed, just add encoding="us-ascii" to the xsl:output element. If you produce it by some other means, you can run the result through the identity transform given above.
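If xsl is not part of your toolchain at all, the same effect can probably be had from a script. The following is a rough Python sketch using the standard library's ElementTree, offered as an assumption-laden alternative and not something I used for my own feed; the function name to_ascii_xml is made up. Two caveats: ElementTree drops comments and processing instructions when parsing by default, and it may rename namespace prefixes unless you register them with ET.register_namespace.

```python
import io
import xml.etree.ElementTree as ET


def to_ascii_xml(utf8_bytes: bytes) -> bytes:
    """Re-serialize an XML document as us-ascii.

    Hypothetical helper: characters outside ascii are written by
    ElementTree as numeric character references.
    """
    root = ET.fromstring(utf8_bytes)
    buf = io.BytesIO()
    # xml_declaration=True forces the <?xml ... encoding='us-ascii'?> line,
    # which ElementTree would otherwise omit for this encoding.
    ET.ElementTree(root).write(buf, encoding="us-ascii", xml_declaration=True)
    return buf.getvalue()


if __name__ == "__main__":
    source = (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        "<sample>characters, ξ € я þ ø æ œ and accents, å ç ê ñ ü</sample>"
    ).encode("utf-8")
    print(to_ascii_xml(source).decode("ascii"))
```

For a real feed, read the file in binary mode, pass the bytes through to_ascii_xml, and write the result back out in binary mode before uploading.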