Monday, September 11, 2006

Lessons from Planet Sabre

Back in the early days of Momentum I did quite a bit of hands-on architectural consulting. One of my clients was Sabre, the travel information company. One day I was reassigned from another Sabre project (I'll tell that story another day) to a project in distress called "Planet Sabre".

Planet Sabre was an application focused on the needs of the travel agent. It allowed an agent to book flights, cars, hotels, etc. As you might imagine, these booking activities looked quite similar to the ones done over the web (Travelocity) or through the internal call center. Hence, they were great candidates for services (they had a good client-to-service ratio).

I was assigned as the chief architect over a team of about 30 designers and developers. (BTW, I was something like the 4th chief architect on the project.) The developers were pissed that they'd received yet another architect to 'help them out'. Regardless, they were good sports and we worked together nicely.

At Sabre, the services were mostly written in TPF (think Assembler) and client developers were given a client-side library (think CLI). The service development group owned the services (they funded and maintained them, and supported their users). They worked off a shared schedule: requests came in, they prioritized them and knocked them out as they could.

The (consuming) application development groups would receive a list of the services that were available, as well as availability estimates for new services and for changes to existing services. All services were available from a 'test' system for us to develop against.

So, what were the issues?

The project was considered 'distressed' because of poor performance. Sounds simple, eh? Surely the services were missing their SLAs, right? Wrong. The services were measured on their ability to process a request and send the result back over the wire to the client, and by that measure the system performed within its SLAs. The issue we hit was that the client machine was very slow, the client-side VM and payload parser were slow, and so was the connection to the box (often a modem).

We saw poor performance because the service designers assumed that neither the network nor the client-side parser would be the bottleneck - both incorrect. The response messages from the service were fatty and the deserialization was too complex, causing the client system to perform poorly. In addition, the client application would perform 'eager acquisition' of data to increase performance. This was a fine strategy, except it would cause 'random' CPU spikes where an all-or-nothing download of data would occur (despite our best attempts to manipulate threads). From our point of view, we needed the equivalent of a 'database cursor' to more accurately control the streaming of data back to the client.

Lesson: Client/consumer capabilities will vary significantly. Understand the potential bottlenecks and design your services accordingly. Common remedies include streaming with throttling, box-carring, multi-granular message formats, cursors and efficient client-side libraries.
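To make the 'cursor' idea concrete, here's a minimal sketch in Java of roughly what we were after. The names (FareSearchService, CursorPage, ThrottledCursor) and the in-memory stub are illustrative only, not Sabre's actual API: the point is that the client pulls one page of results at a time and throttles its own fetches, instead of one all-or-nothing eager download.

import java.util.List;

// Hypothetical service interface: rather than returning the full result set in one
// fatty payload, the service hands back an opaque cursor token plus one page of rows,
// so a slow client pulls data only as fast as it can handle it.
interface FareSearchService {
    CursorPage openCursor(String query, int pageSize);
    CursorPage fetchNext(String cursorToken, int pageSize);
}

// One page of results plus the token needed to request the next page.
record CursorPage(String cursorToken, List<String> rows, boolean hasMore) {}

// Client-side cursor that spaces out its fetches so background paging never
// saturates the CPU or the (possibly dial-up) connection.
class ThrottledCursor {
    private final FareSearchService service;
    private final int pageSize;
    private final long minMillisBetweenFetches;
    private CursorPage current;
    private long lastFetchAt;

    ThrottledCursor(FareSearchService service, String query,
                    int pageSize, long minMillisBetweenFetches) {
        this.service = service;
        this.pageSize = pageSize;
        this.minMillisBetweenFetches = minMillisBetweenFetches;
        this.current = service.openCursor(query, pageSize);
        this.lastFetchAt = System.currentTimeMillis();
    }

    List<String> currentRows() { return current.rows(); }

    boolean hasMore() { return current.hasMore(); }

    // Pull the next page, pausing first if the previous fetch was too recent.
    List<String> nextPage() throws InterruptedException {
        long elapsed = System.currentTimeMillis() - lastFetchAt;
        if (elapsed < minMillisBetweenFetches) {
            Thread.sleep(minMillisBetweenFetches - elapsed);
        }
        current = service.fetchNext(current.cursorToken(), pageSize);
        lastFetchAt = System.currentTimeMillis();
        return current.rows();
    }
}

// Tiny in-memory stub just to show the flow; a real client would go over the wire.
public class CursorDemo {
    public static void main(String[] args) throws InterruptedException {
        FareSearchService stub = new FareSearchService() {
            private final List<String> all =
                List.of("flight 1", "flight 2", "flight 3", "flight 4");
            public CursorPage openCursor(String query, int pageSize) { return page(0, pageSize); }
            public CursorPage fetchNext(String token, int pageSize) { return page(Integer.parseInt(token), pageSize); }
            private CursorPage page(int from, int size) {
                int to = Math.min(from + size, all.size());
                return new CursorPage(String.valueOf(to), all.subList(from, to), to < all.size());
            }
        };
        ThrottledCursor cursor = new ThrottledCursor(stub, "DFW-LAX", 2, 250);
        System.out.println(cursor.currentRows());
        while (cursor.hasMore()) {
            System.out.println(cursor.nextPage());
        }
    }
}

The agent's screen only needs the first page right away; later pages arrive as the agent works through the list, spreading the deserialization cost over time instead of producing one big CPU spike.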

The second lesson was more 'organizational' in nature. The 'shared service group' provided us with about 85% of all of the services we would need. For the remaining 15% we had two options: ask the shared services group to build them, or build them on our own. That last 15% weren't really that reusable - in some cases they were application specific - but they just didn't belong in the client. So, who builds them? In our case, we did. The thing is, we had hired a bunch of UI guys (in this case Java Swing) who weren't trained in designing services. They did their best - but you get what you pay for. The next question was, who maintains the services we built? Could we move them to the shared services group? Well, we didn't know how to program in TPF, so we had built them in Java. The shared services group was not equipped to maintain our Java services, so we did. No big deal - but then it came time to move the services into production. The shared services group had a great process for managing the deployment and operational processes around the services that THEY built. But what about ours? Doh!

Lesson: New services will be found on projects and in some cases they will be 'non-shared'. Understand who will build them, who will maintain them and how they will be supported in a production environment.

Planet Sabre had some SOA issues, but all in all I found the style quite successful. When people ask me who the most advanced SOA shop is, I'll still say Sabre. They hit issues but stuck with it and figured it out. The project I discussed happened almost 10 years ago, yet I see the same issues at clients today.

Lesson: SOA takes time to figure out. Once you do, you'll never, ever, ever go back. If your SOA effort has already been deemed a failure, it only means that your organization didn't have the leadership to 'do something hard'. Replace them.
