I have a mild obsession with metadata. I imagine there’s a twelve-step programme to sort me out, hopefully tagged metadata, obsession and self-referential gag so I can locate it.
My current instance of the obsession concerns dates on photographs. Not the most radical of concepts. A problem, you might think, solved long ago when the elders plonked multiple competing metadata chunks inside an unsuspecting jpeg. One minute it’s slowly interlacing its way through a lady’s bosoms, the next it’s been given a sick bag of acronyms like EXIF, IPTC and XMP. It’s drowning in dates, unlike me.
These acronyms mean I can sprinkle an image file with hundreds and thousands of items of metadata — dates, and keywords, and copyright statements, and model release IDs, and Photoshop thumbnails, and GPS coordinates, and camera data, and Wake up! Wake up!
Unfortunately for me, these nuggets of metadata all have one thing in common: certainty. Attempt to poison the well with a droplet of uncertainty, and you’re out of luck.
This photo from my family’s archive was taken during World War Two, not far from the Suez Canal. It shows three friends — my great-uncle, his brother-in-law’s brother-in-law, and his half-brother — who pretty much bumped into each other while Montgomery was shaking his fist at Rommel in the desert.
Here’s the problem: I don’t have a precise date for this photo. Neither do I have a precise location. But I do have some very important and useful imprecise information: 1942 and the Suez Canal.
How do I encode this within the photo’s metadata?
Keywords, you say? Well, yes, but that’s the everything is a nail solution. I’m a fan of keywords, some of my best friends are keywords, but I don’t think they’re the solution here, at least not shovelled in as plain 1942 and Suez Canal, with nothing to identify them as approximate dates and locations.
Are there any other possibilities? On closer inspection it turns out that, at least for some date properties in the metadata ocean, you can reduce precision. Instead of the full YYYY-MM-DD HH:MM:SS.SSS+ZZ:ZZ monty you can say just YYYY-MM-DD, or even just YYYY. Whether or not any given vendor supports this is a different matter entirely – Flickr has a “chop the tail off” feature for dates, but I don’t know whether it automatically spots chopped-off dates embedded in files you upload.
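To make “chopping the tail off” concrete, here’s a minimal Python sketch of what a truncated date implies: the span of days it could cover. The function name implied_range is my own invention for illustration, and it only handles plain YYYY, YYYY-MM and YYYY-MM-DD values in the Gregorian calendar, with no times or time zones.

```python
import calendar
from datetime import date


def implied_range(value: str) -> tuple[date, date]:
    """Return the first and last day covered by a reduced-precision date."""
    parts = [int(p) for p in value.split("-")]
    if len(parts) == 1:
        # Year only: the whole year.
        (year,) = parts
        return date(year, 1, 1), date(year, 12, 31)
    if len(parts) == 2:
        # Year and month: the whole month (monthrange handles leap years).
        year, month = parts
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    if len(parts) == 3:
        # Full date: a single day.
        day = date(*parts)
        return day, day
    raise ValueError(f"unsupported date value: {value!r}")
```

So a bare “1942” quietly stands for every day from January 1st to December 31st of that year.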
But anyway: problem solved! I’ll put the date as 1942 and move on.
Except that’s not a very good general solution to the problem. And in any case, I lied earlier: I don’t know that the photo was taken in 1942. It might be 1943 — and I can’t encode “either 1942 or 1943” in one date field. Some beard-oriented committee’s idea of imprecision simply doesn’t match the real world.
And I can think of other, similar imprecisions, perhaps most importantly entities portrayed (“this part of this image shows either person/thing A or person/thing B”) and location (“this image shows either fuzzy place A or fuzzy place B”).
Is it too much to want to preserve this type of metadata in the file? It’s certainly valid metadata, and more useful in the real world to real humans than Did the flash gun fire? But it seems I can’t do this. Despite having apparently defined “date of creation” several times over in some perverse kind of Groundhog Metaday, nobody appears to have found the time to specify how to encode fuzziness.
This matters because more and more images that weren’t created by modern digital cameras pointing at actual, current events are making their way onto Flickr and Facebook etc. Let me just dangle the word genealogy here to emphasise the point. (And please note, vendors, there is a very significant difference between “date of creation of file” and “date of original creation of image”. A flatbed scanner is not a time traveller.)
Lack of imprecision is a genuine problem for tagging nerds like me, and it’d be nice if someone could fix it. Handily, the XMP standard — originally defined by Adobe, now an ISO standard, and supported by common image formats like jpeg and dng — allows for extensions: in other words, it lets people define their own metadata. You just need to convince every important vendor to support the madcap scheme you come up with.
With that rather glaring caveat, here’s an idea for dates.
First, a bunch of examples of fuzzy dates people might genuinely want to encode: “sometime in 1942”, “before 1900”, “between June and July 1972”, “between noon and 3pm EST on October 30, 1980,” “either 1945 or 1947, but not 1946,” “definitely not 1903, but apart from that I don’t have a clue”, “744 BC according to the Roman Calendar of Romulus”.
I think these examples generalise to: I want to embed a file with one or more preferably non-overlapping date ranges, possibly open-ended, with each date range specifying a calendar scale (defaulting to the proleptic Gregorian).
Excitingly, the ISO 8601 standard commonly used for representing dates and times specifies how you can encode time intervals in the proleptic Gregorian, and it’s almost exactly what I’m looking for. You can say things like “2007-03-01T13:00:00Z/2008-05-11T15:30:00+01:00” (meaning “between March 1st 2007 at 1pm GMT and May 11th 2008 at 3:30pm BST”) if both ends are known with precision. If not, you can leave elements out: “2007-03/04” means “March-April 2007” (elements omitted on the right of the slash, like the year in this example, are considered the same as the value on the left).
I don’t think there’s a standardised way to denote an open-ended range, but I think you could special-case omitted dates either side of the slash: so “/1899” would mean “before 1900”, and “2063-04-05/” would mean “after first contact between Vulcan and Earth”. A single slash “/” could be used to mean “I have no idea” (which is better than saying nothing, as it lets you record a decision — to distinguish “I have examined this image and I have no idea” from “I haven’t got round to tagging this image yet”).
A list of these date ranges then defines a bunch of non-overlapping alternatives: “either this range or this range or…”, thus allowing you to easily state “1945 or 1947”, “either January 17th 1901 or sometime before June 1896”, or whenever.
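As a rough illustration of how a list of ranges behaves as “either this or that”, here’s a Python sketch that checks whether a candidate year fits any of the alternatives. It’s deliberately simplified to whole years; year_bounds and year_allowed are made-up names, and the treatment of missing dates either side of the slash follows the special-casing suggested above rather than anything in ISO 8601.

```python
from typing import Optional


def year_bounds(text: str) -> tuple[Optional[int], Optional[int]]:
    """Turn a year-only range string into (earliest, latest); None means open-ended."""
    if "/" not in text:
        # A single year is both the earliest and the latest possibility.
        year = int(text)
        return year, year
    start, end = text.split("/")  # raises ValueError if there is more than one slash
    return (int(start) if start else None, int(end) if end else None)


def year_allowed(year: int, alternatives: list[str]) -> bool:
    """True if the year falls inside at least one of the alternative ranges."""
    for text in alternatives:
        earliest, latest = year_bounds(text)
        after_start = earliest is None or year >= earliest
        before_end = latest is None or year <= latest
        if after_start and before_end:
            return True
    return False
```

With this, year_allowed(1946, ["1945", "1947"]) is False, while year_allowed(1850, ["/1899"]) is True.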
Calendar scales are probably overkill, but it doesn’t hurt to think ahead. Other groups already are: for example, the vCard spec suggests IANA should maintain a canonical list of official calendar scale identifiers, and says it should allow x-name experimental/unofficial values. That seems a good idea to me. In reality, since alternate calendar scales will be rare, we should treat them as special cases and not hobble the entire idea to support them.
Now to implementation…
A quick skim through the XMP specification reveals it lets you store structured data, including lists of such structures. This means we could decompose the calendar scale and the two dates into three separate fields, but let’s keep it simple: since we have the ISO 8601 time interval, and we’ve identified some simple and very useful augmentation to it, let’s just bolt on some optional calendar stuff as a prefix. Then we’re left with a plain old list of strings. Easy.
All we need now is a little naming whitewashed over the top, some well-defined semantics, and Bob’s your imprecise uncle. Still time for Adobe to implement in Lightroom 5!
XMP extension specification: Fuzzy namespace
This namespace contains properties that provide “fuzzy” or imprecise metadata.
- The namespace URI and field namespace URI shall be “http://ns.avaragado.org/fuzzy/1.0”
- The preferred namespace prefix is fuzzy.
These are the properties so far defined in the fuzzy namespace:
Type: unordered array of fuzzy:dateRange
Description: A set of possible date ranges that apply to the document. The actual date is unknown but believed to be within exactly one of the date ranges.
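In XMP’s RDF/XML serialisation an unordered array is written as an rdf:Bag, so a set of fuzzy date ranges might be serialised along these lines. This Python sketch uses only the standard library; the property name dates is a placeholder I’ve assumed for illustration, and a real XMP packet would carry extra wrapper elements that are left out here.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FUZZY_NS = "http://ns.avaragado.org/fuzzy/1.0"
PROPERTY_NAME = "dates"  # hypothetical property name, assumed for this example

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("fuzzy", FUZZY_NS)


def fuzzy_dates_rdf(ranges: list[str]) -> str:
    """Serialise date range strings as an rdf:Bag under a fuzzy-namespace property."""
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": ""}
    )
    prop = ET.SubElement(description, f"{{{FUZZY_NS}}}{PROPERTY_NAME}")
    bag = ET.SubElement(prop, f"{{{RDF_NS}}}Bag")  # Bag = unordered array
    for value in ranges:
        ET.SubElement(bag, f"{{{RDF_NS}}}li").text = value
    return ET.tostring(rdf, encoding="unicode")


print(fuzzy_dates_rdf(["1942", "1943"]))
```

For my wartime photo, the bag would hold two items: one for 1942 and one for 1943.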
Description of fuzzy:dateRange: A single date range. A Unicode string containing up to five parts:
- Optional. A prefix defining the calendar scale that applies to the date range. Allowed values are as defined by IANA. If omitted, defaults to the IANA designation for the proleptic Gregorian calendar (which I assume for example purposes would be “gregorian”).
- Optional. A single pipe character “|” (U+007C VERTICAL LINE). Must be present if and only if the calendar scale is present.
- Optional. A “not before” date, written in the syntax applicable to the calendar scale (for proleptic Gregorian, this is the Date type as defined in the XMP specification; the syntax for dates in other calendar scales is not defined here, but it is assumed it does not contain a pipe or a slash character). If omitted, means “the beginning of time”.
- Optional. A single slash “/” (U+002F SOLIDUS). Must be present if both “not before” and “not after” dates are specified, or if neither of them is.
- Optional. A “not after” date, following the same date syntax rules as the “not before” date. If omitted, means “the end of time”. Must correspond to a date no earlier than the “not before” date. Must have the same or fewer date components (year, month, etc) than the “not before” date; must not have components omitted from the “not before” date.
If only one date is supplied and there is no slash, the date is used as both the “not before” and “not after” date. This allows for concise imprecision using the XMP Date value’s ability to omit date components.
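The five parts above can be pulled apart with a few lines of Python. This is only a sketch of the structural rules (pipe, slash, at most two dates): it doesn’t check that the dates themselves are well-formed or in the right order, and DateRange, parse_date_range and the “gregorian” default are names assumed for the example.

```python
from typing import NamedTuple, Optional

DEFAULT_CALENDAR = "gregorian"  # assumed IANA designation, as in the spec text


class DateRange(NamedTuple):
    calendar: str
    not_before: Optional[str]  # None means "the beginning of time"
    not_after: Optional[str]   # None means "the end of time"


def parse_date_range(value: str) -> DateRange:
    """Split a fuzzy date range string into calendar scale and two dates."""
    calendar = DEFAULT_CALENDAR
    rest = value
    if "|" in value:
        calendar, rest = value.split("|", 1)
        if not calendar:
            raise ValueError("pipe present but calendar scale missing")
        if "|" in rest:
            raise ValueError("only one pipe is permitted")
    if "/" not in rest:
        if not rest:
            raise ValueError("the slash is mandatory if no dates are given")
        # A lone date is both the "not before" and the "not after" date.
        return DateRange(calendar, rest, rest)
    parts = rest.split("/")
    if len(parts) != 2:
        raise ValueError("no more than two dates are permitted")
    not_before, not_after = parts
    return DateRange(calendar, not_before or None, not_after or None)
```

Keeping the dates as strings is a deliberate choice here: the parser stays ignorant of calendar scales, so values from an x-name scale pass through untouched.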
Here are some valid example values, with semantics in parentheses:
- / (“I don’t know”)
- 2063-04-05/ (“On or after April 5th 2063”)
- 2003 (“Some time in 2003”, same as
- 2003-01 (“Some time in January 2003”, same as
- 2003/2005 (“Between the start of 2003 and the end of 2005”)
- 2013-04-27/28 (“On April 27th or 28th 2013”)
- 2013-04-27T14:01+01:00/ (“At or after 2.01pm BST on April 27th 2013”)
- 2007-11-13T00:00/15T24:00 (“Between November 13th and 15th, 2007”; equivalent to “2007-11-13/15”)
- gregorian|/ (“I don’t know, explicitly in the proleptic Gregorian calendar”)
- gregorian|1800-01-01/1899-11-30 (“Any time in the 1800s except December 1899”)
- x-wombat|hatstand/banister (“Between two values meaningful in the x-wombat calendar scale”)
And here are some invalid example values:
- (the empty string) (invalid because the slash is mandatory if no dates are given)
- gregorian| (invalid because the slash is mandatory if no dates are given)
- 2001/2000 (invalid because the “not after” date is before the “not before” date)
- 2000/2001/2002 (invalid because no more than two dates are permitted)
- 2001/2001-04-05 (invalid because the “not after” date contains date components omitted from the “not before” date)
- 2013-04-27/15:00 (invalid because the “not after” date contains date components omitted from the “not before” date)
- hatstand/banister (invalid because no calendar scale prefix means proleptic Gregorian is implied, and neither date component is a valid Date as defined in the XMP specification)
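The component rules are the fiddliest part, so here’s a hedged Python sketch of one way to apply them to date-only Gregorian values. It assumes zero-padded components (four-digit years, two-digit months and days), bails out on anything containing a time, and expand_not_after is a name made up for this example.

```python
def expand_not_after(not_before: str, not_after: str) -> str:
    """Fill in components that an abbreviated "not after" date inherits."""
    if "T" in not_before or "T" in not_after or ":" in not_after:
        raise NotImplementedError("times are not handled by this sketch")
    left = not_before.split("-")
    right = not_after.split("-")
    if len(right) > len(left):
        raise ValueError(
            "'not after' contains components omitted from 'not before'"
        )
    # The abbreviated value replaces the trailing components of the left date.
    full = left[: len(left) - len(right)] + right
    # Zero-padded strings of equal width sort the same way as the numbers.
    if full < left:
        raise ValueError("'not after' is before 'not before'")
    return "-".join(full)
```

So expand_not_after("2013-04-27", "28") gives "2013-04-28", while the pairs from “2001/2000” and “2001/2001-04-05” both raise ValueError.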
You know what? You can add individual fuzzy date ranges as keywords to your images right now. You don’t have to wait for anyone. Prefix each fuzzy date range with “fuzzydate:” and maybe, at some point, a vendor like Adobe or Flickr might notice. Flickr already spots some specially formatted “machine tags” and treats them differently than normal tags.
- “Sometime during the American Civil War” is one keyword:
- “Definitely not 1901” is two keywords:
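If you’d rather script the keyword approach, a minimal Python sketch might look like this. The function names are mine, and the example uses the “either 1945 or 1947” case from earlier; how the keywords actually get written into a file depends on your tool of choice and isn’t covered here.

```python
PREFIX = "fuzzydate:"


def to_keywords(ranges: list[str]) -> list[str]:
    """Turn fuzzy date range strings into prefixed keywords, one per range."""
    return [PREFIX + value for value in ranges]


def from_keywords(keywords: list[str]) -> list[str]:
    """Pick the fuzzy date ranges back out of a mixed list of keywords."""
    return [k[len(PREFIX):] for k in keywords if k.startswith(PREFIX)]


keywords = ["family", "Suez Canal"] + to_keywords(["1945", "1947"])
print(keywords)
print(from_keywords(keywords))
```

Ordinary keywords pass through untouched, so the fuzzy dates can live alongside whatever tags you already use.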
Away you go.
One response to “The fuzziness business”
Great article avaragado!
Being able to apply the ‘fuzzy’ details to an item’s meta-data may actually assist with finding the actual details – as there’s probably a whole range of experts within genealogists, and photographic and fashion experts, that would want to see this information and offer their advice, information, or services.
Plus, using ‘fuzzy’ means that it can help stop incorrect information becoming ‘correct’ by assumption, simply because systems won’t allow things to have a ‘might be’ aspect to them.