For the past two years I have spent quite a lot of my time establishing and evangelizing the concept of a "Correlation ID". I did not know it was called a correlation ID (cID) until I read the paper by Microsoft’s Markus Vilcinskas, and I thank him for the name, but in essence the concepts were the same. I strongly recommend that anyone looking to establish an IdM infrastructure read the above paper, as it will save a lot of headaches down the road if the concepts are well understood. Here I want to present my personal experiences with establishing such an ID in the enterprise, as a sort of lessons learned. I do not want to re-create the work Markus has done, but to offer some insight into the same concepts as they were applied in a large enterprise.
 
So what is a Correlation ID?
 
Simply put, it is an ID value that uniquely identifies an entity (a person) across multiple systems. If you are familiar with database concepts, it is a key value that is globally unique. This ID can be used to connect an identity across a varied array of systems (databases, directories, identity providers, etc.).
 
How would I use a cID?
 
  • A cID is used to identify an identity across multiple data sources. An example is an Active Directory account, a Lotus Notes account, and an HR profile that all contain the same cID for a person. If the cID is the same, then a "picture" can be established of the entire dataset that exists in the enterprise for the user with that cID.
  • Once a cID is established across multiple data stores, processes can be established for managing the connected identity (provisioning, de-provisioning, SOX compliance reporting, etc.).
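
As a rough illustration of the two uses above, here is a minimal Python sketch. The three record sets, their field names, and the helper functions are all hypothetical, invented for this example; real connectors to Active Directory, Lotus Notes, or an HR system would look quite different. It only shows the idea of grouping records by a shared cID and then driving a process (here, finding accounts to de-provision) from the combined picture.

```python
from collections import defaultdict

# Hypothetical extracts from three systems; field names and values are invented.
active_directory = [
    {"cid": "1000234", "sam_account": "jsmith", "enabled": True},
    {"cid": "1000777", "sam_account": "mjones", "enabled": True},
]
lotus_notes = [
    {"cid": "1000234", "mail_file": "mail/jsmith.nsf"},
]
hr_profiles = [
    {"cid": "1000234", "status": "terminated"},
    {"cid": "1000777", "status": "active"},
]


def build_identity_picture(sources):
    """Group the records from every source under the cID they share."""
    picture = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            picture[record["cid"]][source_name] = record
    return dict(picture)


def accounts_to_deprovision(picture):
    """Return cIDs that HR marks as terminated but that still have an enabled AD account."""
    return sorted(
        cid
        for cid, systems in picture.items()
        if systems.get("hr", {}).get("status") == "terminated"
        and systems.get("ad", {}).get("enabled")
    )


sources = {"ad": active_directory, "notes": lotus_notes, "hr": hr_profiles}
picture = build_identity_picture(sources)
print(sorted(picture["1000234"]))        # the systems holding data for this cID
print(accounts_to_deprovision(picture))  # cIDs that need de-provisioning
```

Nothing here depends on what the cID looks like; the join works only because every system stores the same value for the same person.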
 
What makes a good Correlation ID?
 
  • An ID value that is globally unique
    • "Globally" would most likely mean your enterprise and any connected enterprises (i.e., B2B)
  • An ID should contain no encoded data about the identity it connects.
    • Sometimes a number is simply a number, and when it comes to a cID, the only thing that matters is that it is globally unique and not duplicated.
    • Encoding data about an identity creates a problem when that data changes: the cID then has to be re-issued with the updated data. An example is an ID based on a name value, which is often the case with logon IDs. If a person has a life event that brings a name change, the encoding breaks down.
    • Encoding data about the identity into the cID also brings the problem of assumptions. If it is known that a cID follows a standard formula, people tend to assume they know a cID based on that formula. For example, when a logon ID is used as a cID, a person filling out a form may assume that a logon ID is "smithJL" based on the formula of 5 letters of the last name, 1 letter of the first name, and 1 middle initial. This assumption leads people to skip verifying whether that ID does indeed belong to the person in question or belongs to an entirely different person. Keep the cID format understood, but keep it an ID that by itself means nothing. The power is in the data connected to the ID.
  • The cID should be memorable, but not without context supporting it
    • I believe in a public cID, one that is commonly known by the identity who wishes to present that reference value. Since the value is public, a good way to have people know their cID is to print it on an enterprise ID badge. Combined with a central, public, authoritative repository of cIDs, an ID can be questioned and confirmed by process.
    • Making the cID a common part of users’ everyday lives makes them more likely to remember and use it. After all, if people are asked for their home phone number, or see it connected to their profile, they will recognize whether it is correct.
  • It should be public and not private.
    • To quote IBM’s Bob Blakley, "By definition, attributes which aren’t observable can’t be used to recognize subjects."
    • This basically means that the more the cID is safeguarded from public eyes, the less likely it is that it can be used to identify that person in public systems.
    • An issue that a private ID brings is that it cannot be verified against a centralized common repository.
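
To make the points in this list concrete, here is a small Python sketch of a central registry that issues opaque cIDs and lets anyone verify one. The `CidRegistry` class, the 8-digit format, and the use of a display name as the public attribute are my assumptions for illustration only, not something taken from Markus's paper or from any product.

```python
import secrets


class CidRegistry:
    """Hypothetical central, publicly queryable registry of cIDs."""

    def __init__(self):
        self._people = {}  # cID -> attributes published for verification

    def issue(self, display_name):
        """Issue a random 8-digit cID that encodes nothing about the person."""
        while True:
            cid = f"{secrets.randbelow(10**8):08d}"
            if cid not in self._people:  # retry on the rare collision
                self._people[cid] = {"display_name": display_name}
                return cid

    def rename(self, cid, new_display_name):
        """A name change updates the attached data; the cID itself never changes."""
        self._people[cid]["display_name"] = new_display_name

    def verify(self, cid, claimed_name):
        """Confirm that a presented cID exists and belongs to the claimed person."""
        entry = self._people.get(cid)
        return entry is not None and entry["display_name"] == claimed_name


registry = CidRegistry()
cid = registry.issue("Jane L. Smith")
registry.rename(cid, "Jane L. Jones")               # life event: the cID survives
print(registry.verify(cid, "Jane L. Jones"))        # checked against the registry
print(registry.verify("smithJL", "Jane L. Smith"))  # a guessed ID is not in the registry
```

Because the number carries no meaning, nobody can guess it from a formula, and a name change touches only the data behind it. A short numeric value is also easy to print on a badge.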
What makes a bad cID?
  • A cID utilizing private data
    • If a cID is built on a private value, the ability to verify the data is diminished. By contrast, a public value can be checked: if you gave me your phone number, I could check with the phone company to see whether it did indeed belong to you.
  • A cID with no authoritative verifiable source
    • If the cID can’t be tied back to an authoritative source, all context of the data connected to it can be lost. I have seen many systems use SSN to connect data about a person, but because an SSN needs to be verified with the SSA, there is no way (except for HR doing background checks) to verify that the person indeed has that SSN. So what happens? A 9-digit number is entered to satisfy the requirement of putting in an SSN. How do you prove that it is not correct? In a perfect world of centralized user registration it is possible, but even then there are issues.
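
The difference between a value you can only format-check and one you can verify can be sketched in a few lines of Python. The set of issued cIDs and both function names are hypothetical; the point is only that a format check accepts any nine digits, while a lookup against the source that issued the ID does not.

```python
import re

# Hypothetical authoritative source: the cIDs the enterprise itself has issued.
ISSUED_CIDS = {"10002340", "10007771"}


def looks_like_ssn(value):
    """Format check only: any nine digits pass, whoever they belong to."""
    return re.fullmatch(r"\d{9}", value) is not None


def is_issued_cid(value):
    """Authoritative check: the value must exist in the registry that issued it."""
    return value in ISSUED_CIDS


print(looks_like_ssn("123456789"))  # a made-up number still passes the format check
print(is_issued_cid("99999999"))    # a made-up cID is rejected by the lookup
```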