2009-03-09

Google Base encoding woes

Google Base lets you upload descriptions of goods, services, publications—or whatever—to enable the world to find your stuff.

You can do this item-by-item, using a web form, or in bulk by submitting an RSS or Atom feed as an xml file.

So far, so good ...

But ...

If you want to use extended character sets (e.g. characters such as ξ € я þ ø æ œ, and accents such as å ç ê ñ ü), you will naturally use UTF-8 or another Unicode encoding.

So, I set up the feed for utf-8 encoding, and uploaded the xml using Direct Upload via Google Base, Google's file upload interface ...

Once my feed was processed, Google said: "Your data feed contains an invalid character for the current encoding setting."

Next I tried FTP, uploading to google.uploads.com. Google still didn't detect the right encoding (I think Google was at fault here).

I tried Automatic upload via scheduling. This probably failed because my ISP's server insists on sending a Content-Type header that says it is serving everything in ascii. Google did the "right thing" and believed this travesty.
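
To see which encoding a server claims for a feed, you can read the charset parameter of its Content-Type header. The sketch below is only an illustration in Python: the helper name declared_charset is made up for this post, and the header values are examples, not what my ISP actually sent.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset named in a Content-Type header value, or None.

    declared_charset is a hypothetical helper name used for illustration.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()


if __name__ == "__main__":
    # Example header values only.
    print(declared_charset("application/xml; charset=US-ASCII"))
    print(declared_charset("application/rss+xml"))
```

For a live URL, the headers object of the response returned by urllib.request.urlopen offers the same get_content_charset() method, so you can check a feed directly without parsing the header yourself.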

So, nothing worked: "Your data feed contains an invalid character for the current encoding setting."

Solution: use xsl:output to encode your feed in ascii

Here is a simple xsl transform to copy an xml file and change its encoding. The important line is the attribute encoding="us-ascii", in the xsl:output element.

toascii.xsl

<?xml version="1.0" encoding="utf-8"?>
<xsl:transform version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output
     method="xml"
     version="1.0"
     encoding="us-ascii"
     />

  <xsl:template match="/">
    <xsl:apply-templates />
  </xsl:template>

  <!-- Identity template: copy every node and attribute unchanged -->
  <xsl:template
     match="*|@*|comment()
            |processing-instruction()|text()">
    <xsl:copy>
      <xsl:apply-templates
         select="*|@*|comment()
                 |processing-instruction()|text()"/>
    </xsl:copy>
  </xsl:template>
</xsl:transform>

utf8.xml

<?xml version="1.0" encoding="utf-8"?>
<sample>
  characters, ξ € я þ ø æ œ 
  and  accents, å ç ê ñ ü
</sample>

Executing the command

xsltproc -o ascii.xml toascii.xsl utf8.xml

produces an ascii-encoded version:

ascii.xml

<?xml version="1.0" encoding="us-ascii"?>
<sample>
  characters, &#958; &#8364; &#1103; &#254; &#248; &#230; &#339; 
  and  accents, &#229; &#231; &#234; &#241; &#252;
</sample>
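
Each &#NNN; in the output is a numeric character reference: NNN is the decimal Unicode code point of the character it stands for. As a rough illustration (in Python rather than xsl, with a helper name, to_ascii_refs, invented for this post), the same substitution can be reproduced with the standard xmlcharrefreplace error handler:

```python
def to_ascii_refs(text):
    """Replace every non-ASCII character with a decimal character reference.

    to_ascii_refs is a hypothetical name used for illustration. It does not
    escape markup characters such as < or &, so it suits text content only.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(to_ascii_refs("characters, ξ € я þ ø æ œ"))
    print(to_ascii_refs("and  accents, å ç ê ñ ü"))
```

Plain ascii characters pass through untouched, so only the extended characters change.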

If you are already using xsl to produce your feed, just add encoding="us-ascii" to the xsl:output element. If you produce it by some other means, you can run it through the identity transform given above.
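
If xsltproc is not to hand, a similar re-encoding can probably be done with a general-purpose XML library. Here is a minimal sketch using Python's standard xml.etree.ElementTree; it is an alternative to the transform above, not what I used, and the function name reencode_ascii and the file names are placeholders. Two caveats: ElementTree's default parser drops comments and processing instructions, and it may rename namespace prefixes that have not been registered.

```python
import xml.etree.ElementTree as ET


def reencode_ascii(src_path, dst_path):
    """Parse src_path and write it to dst_path as us-ascii encoded XML.

    reencode_ascii is a hypothetical helper name used for illustration.
    """
    # The parser honours the encoding named in the source's XML declaration.
    tree = ET.parse(src_path)
    # Characters outside ascii are written as numeric character references.
    tree.write(dst_path, encoding="us-ascii", xml_declaration=True)


if __name__ == "__main__":
    # Placeholder file names matching the example in this post.
    reencode_ascii("utf8.xml", "ascii.xml")
```

The written file declares encoding us-ascii and contains only ascii bytes, so it should survive a server or uploader that mislabels the encoding.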

1 comment:

Ramon Andrews said...

An interesting article on various woes of Google-based encoding. Thanks for sharing.