Microformats to the Rescue

This one is for the CMS designers out there.  If you’re in the business of building platforms that people create content in, you have undoubtedly run into the problem of storing metadata.  It may seem easy at first (“just put it in a database!”), but then you start running into predictable problems: the context information is hard to store, references are hard to keep valid and up to date, and it’s unclear what happens when you export.

Databases + Metadata = Unsolved Problem

Metadata in databases loses its context quickly.  Let’s say Jen uploads an image and titles it “My pet puppy.”  Jen’s friend Steve wants to use the image, selects it in the image library and wants to change the title to “Jen’s pet puppy.”  Where do you store that title now?  What happens when Jen renames her image?  What if you use a couple of copies of the same image, each with a different title?  It’s a bit of a mess, but usually the solution is to store the metadata in context: keep the metadata with each use of the image, in that HTML page.  Problem is, an img tag only gives you the alt and title attributes, which leaves no room for richer metadata like an author.

The other issue is maintaining those goddamn references between the database, the HTML file, and the image file. Odds are you’ll be using a database-backed file system of some kind, so now you have to manage deletions, renames, and metadata edits in three different linked places.  Those links are fragile, so things fall out of sync, especially if users have access to editing their HTML source code, offline editing, import/export, anything like that.  So make sure to keep one authoritative copy of that data.

Lastly, there is the issue of exporting/sharing this content.  The platform that I work on has a strict requirement for being exportable without ruining everything, in order to keep a very important ($$) industry certification.  So when we export that web page, we don’t want to lose all of that image metadata.  We will if it’s in the database, unless we do a lot of non-standard hackery.  And we’d rather have a standard solution than invent non-HTML shit just to pass data around.

Microformats: Metadata, Inline, Bam

Really, a good way to do this is to just store the metadata inline, in the HTML content.  The best solution we found for this setup is using microformats.  The idea is that you wrap your object (an image, an embedded object, a text block, whatever) in span tags that represent each piece of metadata.  There’s a much more verbose explanation on the microformats.org site.  The hCalendar format is a good place to look for examples of this concept being embraced.

So for our image example above, the HTML would look something like this.

<span class="image">
<img src="puppies.png" width="500" height="169" />
<span class="imagetitle">The puppies are st00pid fly.</span>
<span class="author hidden">Jen</span>
</span>

It’s pretty ingenious. You have the semantic relationship between the image and the imagetitle, and you can easily extend it to add other information like the author, and keep that specific item hidden, or whatever.

Best thing ever is that since it’s semantically sound, you can do some magic with the CSS and DOM manipulations.  You can make it look pretty, keep it accessible, hack it with JavaScript in a reliable way.  Anywho, sometimes, it’s better than putting things in a database.
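To show that the inline metadata really can be read back out in a reliable way, here is a minimal sketch that parses the example markup above with Python's standard `html.parser` module. The class names `imagetitle` and `author` come from the example; the `ImageMetadataParser` class and the `extract_image_metadata` helper are hypothetical names made up for illustration, and a real CMS would likely want a proper DOM library instead.

```python
from html.parser import HTMLParser

# The example markup from this post, unchanged.
SNIPPET = """
<span class="image">
<img src="puppies.png" width="500" height="169" />
<span class="imagetitle">The puppies are st00pid fly.</span>
<span class="author hidden">Jen</span>
</span>
"""


class ImageMetadataParser(HTMLParser):
    """Collects the image src plus the text of known metadata spans."""

    # Class names treated as metadata fields (taken from the example).
    FIELDS = ("imagetitle", "author")

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._field = None  # metadata span we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self.metadata["src"] = attrs.get("src")
        elif tag == "span":
            classes = (attrs.get("class") or "").split()
            # Extra classes such as "hidden" are ignored.
            self._field = next((f for f in self.FIELDS if f in classes), None)

    def handle_data(self, data):
        if self._field and data.strip():
            self.metadata[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_image_metadata(html):
    """Hypothetical helper: return a dict of metadata found in the markup."""
    parser = ImageMetadataParser()
    parser.feed(html)
    return parser.metadata


print(extract_image_metadata(SNIPPET))
```

Because the metadata travels with the markup, the same extraction works on an exported page with no database in sight.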

Sorting Country Names in their Native Language

I used to think that sorting things was easy. It isn’t: collation is a really difficult problem, especially once you start considering different scripts (Latin, Chinese, Cyrillic, etc.) and numeral systems (Western Arabic, Devanagari, Japanese, etc.) in the same list, not to mention locale-specific sorting irregularities like the German phonebook sort.

The problem of sorting country names is particularly sensitive.  When you want to display China as 中国 to Chinese speakers, where should it be sorted compared to Canada or Kâmpŭchea (Cambodia)?

Here we have an example list of countries, in the order I looked them up online, heh.  For the sake of not messing with my blog, I avoided Right-to-Left country names for Egypt, Iran, or Israel.

  • United States
  • España
  • 中国
  • Deutschland
  • Polska
  • Россия
  • भारत

If you have a “sort” feature in whatever software you’re using, hopefully it uses the Unicode collation order to sort the names. You would typically get something like this as a result:

  1. Deutschland
  2. España
  3. Polska
  4. United States
  5. Россия
  6. भारत
  7. 中国
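For this particular list, the order above can be reproduced with a plain code point sort, which is what Python's built-in `sorted` does for strings. This is a simpler stand-in and not the full Unicode Collation Algorithm: the two happen to agree here because each name starts in a different script block, but they diverge as soon as accents or mixed case show up (by code point, "Åland" lands after "Zimbabwe"). Proper collation needs a library such as ICU, which is not shown here.

```python
# The example list, in the original lookup order.
countries = [
    "United States",
    "España",
    "中国",
    "Deutschland",
    "Polska",
    "Россия",
    "भारत",
]

# sorted() compares strings code point by code point.
by_code_point = sorted(countries)

for position, name in enumerate(by_code_point, start=1):
    print(f"{position}. {name}")
```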

Business Case Sort

Alpha sorting, Unicode or otherwise, may seem pretty arbitrary, especially if 95% of your customers come from three or four countries.  One can always make the case for sorting country lists with the most popular countries dominating the “top 5” or so of the list*. For many businesses this may mean a sort order of:

  1. United States
  2. 中国
  3. Россия
  4. Deutschland
  5. España
  6. Polska
  7. भारत

That’s all good … if you want to confuse Indian and Polish visitors, don’t care about keyboard users, and want to take a big hit on your Russian branding.
* Instead of mucking with collation, a usable solution is to autodetect what country people are from and either pre-select it in dropdown lists or highlight it as a choice outside of the sorted list.
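A rough sketch of that footnote idea: leave the list order alone and just flag the detected country. The `preselect_country` function and the `(code, name)` tuple shape are assumptions for illustration; how the country gets detected (IP lookup, browser locale, account settings) is out of scope here.

```python
def preselect_country(countries, detected_code):
    """Flag the detected country without changing the list order.

    countries: list of (code, name) tuples, already in display order.
    detected_code: country code guessed for the visitor, or None.
    Returns (highlight, options): highlight is the matching tuple or None,
    and options is a list of (code, name, selected) in the original order.
    """
    highlight = next(
        (entry for entry in countries if entry[0] == detected_code), None
    )
    options = [(code, name, code == detected_code) for code, name in countries]
    return highlight, options


# Display order is whatever the list already uses; it is not re-sorted.
menu = [("de", "Deutschland"), ("es", "España"), ("pl", "Polska")]
highlight, options = preselect_country(menu, "pl")
print(highlight)
print(options)
```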

ISO to the Rescue

In my research, I’ve found that a pretty good general solution, irrespective of the business case, is to sort things according to the ISO 3166-1 alpha-2 code. I know, I know, it’s lame and old and under fire constantly ... but it’s a fairly standard coding that technical people understand, native speakers understand, keyboard access is alright, and it’s considered safe on the culture-war front (other than being based on the Latin alphabet).

Our example above would be:

  1. 中国 (cn)
  2. Deutschland (de)
  3. España (es)
  4. भारत (in)
  5. Polska (pl)
  6. Россия (ru)
  7. United States (us)
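As a minimal sketch, sorting on the code instead of the display name looks like this. The `COUNTRIES` mapping and the `sort_by_iso_code` helper are made-up names for illustration; the codes are the ones shown in the list above.

```python
# Native-language display names keyed by ISO 3166-1 alpha-2 code.
COUNTRIES = {
    "us": "United States",
    "es": "España",
    "cn": "中国",
    "de": "Deutschland",
    "pl": "Polska",
    "ru": "Россия",
    "in": "भारत",
}


def sort_by_iso_code(names_by_code):
    """Return display labels ordered by country code, not by name."""
    # The codes are plain ASCII, so an ordinary sort is unambiguous.
    return [f"{name} ({code})" for code, name in sorted(names_by_code.items())]


for position, label in enumerate(sort_by_iso_code(COUNTRIES), start=1):
    print(f"{position}. {label}")
```

The sort key never touches the native-language name, which is why the result is the same on every platform and in every locale.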

Anywho, that’s just my suggestion for a starting point. Your business case may indeed support other sort orders for countries.  But this one is reproducible and defensible, so that makes it good for programmers and business analysts alike.

Happy IDN Day!

Today is the day that fully internationalized domain names (IDNs) go live on the internet. As someone really interested in globalization, I think this is a huge development: this is the first time a whole domain name, top-level domain included, can be written in non-Latin characters on the public internet. Arabic nations especially are loving this, and I’m sure Hebrew and Chinese language domain names will follow within hours.

I asked an Egyptian co-worker what the URL was for the Egyptian Ministry of Communication ... I didn’t even know how to search for it. Here it is:

http://وزارة-الاتصالات.مصر/. Apparently the fonts I have butcher the script.

The URL looks good when I copy/paste, but it gets turned into Punycode: http://xn--4gbrim.xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c, which magically still works. Anyone else have some insight into how this works?
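A partial answer to my own question: DNS itself still only sees ASCII. Each dot-separated label of the hostname is converted separately into an ASCII-compatible form using Punycode (RFC 3492) and given an `xn--` prefix, and the browser converts back for display. Python ships an `idna` codec that implements the older IDNA 2003 rules, which is enough for a rough sketch. The two helper function names below are made up for illustration.

```python
def to_ascii_hostname(hostname: str) -> str:
    """Encode each label of a Unicode hostname to its xn-- ASCII form."""
    return hostname.encode("idna").decode("ascii")


def to_unicode_hostname(hostname: str) -> str:
    """Decode an ASCII (Punycode) hostname back to Unicode."""
    return hostname.encode("ascii").decode("idna")


# Hostname from the URL above; each label is encoded on its own.
unicode_host = "وزارة-الاتصالات.مصر"
ascii_host = to_ascii_hostname(unicode_host)

print(ascii_host)
print(to_unicode_hostname(ascii_host) == unicode_host)

# Plain ASCII hostnames pass through untouched.
print(to_ascii_hostname("example.com"))
```

The last label of the encoded result is `xn--wgbh1c`, the same final label as in the Punycode URL above, so the "magic" is just a reversible per-label encoding.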
