A remote worker taps a team member on the shoulder via her touch screen display to instantly continue an always-on video session – as if she were talking with her co-worker at a water cooler – minus the planes, trains and automobiles to get them both to that same water cooler. Continual visual presence of each team member is sent to her Brady Bunch type display so she’s not blindly tapping on shoulders. She doesn’t video call each member – this is real-time, always-on video – she instantly resumes each interaction via a touch on her screen. This is the future of distributed work teams and other real-time video use cases. Getting to this distance-independent interaction paradigm is the key to the long-awaited hockey stick curve actually materializing for the enterprise video and telepresence market.

We need three main developments to get to an always-on video paradigm: (1) a replacement for, or overlay to, protocols such as SIP and H.323 and their precepts; (2) expansion of personal network enabling services; (3) visual or multi-dimensional presence.
Moving beyond SIP, H.323 and calling models
Calls are transactions, not real-time interactions
Phone and video calls are transactions – find number, make call, wait for response, hope other party picks up, connect the call. Transactions involve friction such as time, uncertainty, error and overhead. Friction deters communication; less friction = more communication. Always-on video is designed to eliminate that friction and make video interaction much closer to water cooler interaction. However, SIP, H.323 and related protocols are designed for transactions, not real-time interaction.
SIP and H.323 are designed for the exception rather than the rule
When you make a voice or video call and wait for someone to pick up on the other side, SIP or H.323 is setting up the session, which includes authentication, capabilities exchange, codec negotiation, opening of ports through firewalls, etc. All good, but not all needed for a high percentage of voice and video sessions. By my estimate, at least 80% of voice and video sessions take place between endpoints that have previously talked or can be proactively assumed to desire to talk. That enables us to do some engineering to pre-build the video session, so the session is ready before you tap me on your display.
SIP and H.323, however, are essentially designed for the ~20%, not the ~80%. Even WebRTC – which I love and believe has a bright future – does much of the same and leaves much of it to the endpoint or UA (user agent) to deal with. Always-on video needs to be designed for the 80%. We will increasingly demand the always-on paradigm, rather than the wait-for-someone-to-pick-up-a-phone paradigm. IM (instant messaging) and mobile push-to-talk (referring to the type of service that Nextel pioneered in the US) are good examples of the usefulness of the always-on paradigm. Now to extend it to video.
The new paradigm means not setting up each and every call as if it were the first call attempted
Proactively building the video session can potentially be done in many ways, including via protocols that replace SIP or H.323, or with methods that are an overlay to SIP or H.323. It is not trivial, and I won’t engineer it here, but here are a few representative pieces of the puzzle:
- Endpoint software can proactively broadcast (or make available to query) the capabilities that are shared during today’s SIP or H.323 call setup. One capability that can’t be predicted is future availability, but we can address that one with presence. Another problematic one is dealing with firewalls, NAT traversal and any B2BUAs (back-to-back user agents), and that one requires some work. Those two items need to be accounted for in all of these options.
- When endpoints are added to a personal network (see below for personal network description), the endpoints can do a SIP or H.323 type negotiation to discover the necessary call setup info. Not when the first call occurs, but when the endpoints are added to a personal network. That info is then cached, periodically updated via similar methods, and shared with other endpoints in the personal network (and made available within a security wrapper to external endpoints).
- We could do a SIP or H.323 type negotiation on the first call attempt, and then cache the necessary information so the setup doesn’t need to be repeated for subsequent sessions.
- We could do an actual SIP or H.323 call setup but then keep the session alive “permanently” via other elements in the signaling path and wake it up when necessary.
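To make the caching idea in the list above a bit more concrete, here is a minimal sketch in Python. It is not real SIP or H.323 code: `SessionParams`, `SessionCache` and the `negotiate` callable are hypothetical names I made up for illustration, and the one-hour refresh age is an arbitrary assumption. The point is only that negotiation happens when a peer joins the personal network (or on first contact), and later sessions reuse the cached result unless it has gone stale.

```python
import time
from dataclasses import dataclass


@dataclass
class SessionParams:
    """Hypothetical result of one SIP/H.323-style negotiation with a peer."""
    codec: str
    resolution: str
    relay_address: str  # stand-in for whatever NAT/firewall traversal info was agreed


class SessionCache:
    """Caches negotiated parameters per peer so later sessions can skip call setup."""

    def __init__(self, negotiate, max_age_seconds=3600.0, clock=time.time):
        # negotiate: callable(peer_id) -> SessionParams; stands in for real signaling.
        self._negotiate = negotiate
        self._max_age = max_age_seconds  # assumed refresh period, not from any standard
        self._clock = clock
        self._entries = {}  # peer_id -> (SessionParams, time the entry was cached)

    def prebuild(self, peer_id):
        """Negotiate ahead of time, e.g. when a peer joins the personal network."""
        params = self._negotiate(peer_id)
        self._entries[peer_id] = (params, self._clock())
        return params

    def get(self, peer_id):
        """Return cached parameters, renegotiating only if missing or stale."""
        entry = self._entries.get(peer_id)
        if entry is None or self._clock() - entry[1] > self._max_age:
            return self.prebuild(peer_id)
        return entry[0]
```

In this sketch a "tap" on the display would call `get(peer_id)` and, in the common case, return immediately without any signaling round trip. Presence and firewall traversal still need their own handling, as noted in the list.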
The evolution of online presence
How would the remote worker / Brady Bunch example at the top of this post work? How does presence work in the physical world? I walk by you and judge whether you are available for a conversation. I see and hear what you are doing or not doing, your expressions, any activity going on in the background, etc. Most of the time, I can accurately determine your presence because I have signals from multiple dimensions to process based on what I see and hear. Online presence however is still mainly one dimensional, e.g. a “green light” by an icon that means I’m online on some device. That by itself was great – this flat indication of availability is one of the killer features of IM. But we can do better than that. We can have visual or multi-dimensional presence, even when we’re not face to face.
For the past seven or so years, I’ve managed large, multi-location, international teams, mainly from my home office. My tools have improved dramatically, especially for video. I now have a powerful Cisco Telepresence EX90 – high-definition quality (1080p, 30 FPS) video calling and video conferencing, a touch-screen control for normal use, a web UI and CLI for more robust functionality. The EX90 has decent interoperability with other video units – can run SIP or H.323, uses standard H.264 AVC media, works with many standard VC endpoints and newer software versions interoperate with Cisco CTS telepresence units. But visual presence would make my current EX90 paradigm seem like a fax machine.
Visual presence is the water cooler for the virtually co-located world
In the upcoming visual presence powered world of always-on video, my EX90 will periodically broadcast a “snapshot” within my personal networks. Let’s say for example my EX90 broadcasts a low resolution snapshot every 60 seconds (higher resolution, more frequent snapshots will make sense in some cases too). The snapshot could be a still image or could include a few seconds of video. All members of my personal network, my work group in this example, receive that snapshot, and broadcast their own similar snapshots. So, at any time, I can scroll through and see you with a maximum delay of 60 seconds (in this example), or play a few seconds of video to see and hear you for a slice of time. Perhaps if it is known that you are already in an audio or video session, an indicative icon shows along with your visual presence snapshot.
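As a rough illustration of the receiving side of that snapshot exchange, here is a small Python sketch. Everything in it is hypothetical: the `Snapshot` and `PresenceBoard` names, the field layout, and especially the rule that a member is shown as "stale" after two missed intervals, which is my own assumption rather than anything a real endpoint does today.

```python
import time
from dataclasses import dataclass


@dataclass
class Snapshot:
    member: str
    image: bytes              # still image or a few seconds of encoded video
    taken_at: float           # sender's timestamp, in seconds
    in_session: bool = False  # True if the member is already in a voice/video session


class PresenceBoard:
    """Keeps the latest snapshot per member of one personal network."""

    def __init__(self, interval_seconds=60.0, clock=time.time):
        self.interval = interval_seconds
        self._clock = clock
        self._latest = {}

    def receive(self, snapshot):
        """Store a snapshot unless a newer one from the same member is already held."""
        current = self._latest.get(snapshot.member)
        if current is None or snapshot.taken_at >= current.taken_at:
            self._latest[snapshot.member] = snapshot

    def view(self):
        """Return (member, age_seconds, status) rows, freshest first."""
        now = self._clock()
        rows = []
        for snap in self._latest.values():
            age = now - snap.taken_at
            if age > 2 * self.interval:  # assumed rule: two missed broadcasts
                status = "stale"
            elif snap.in_session:
                status = "in session"
            else:
                status = "visible"
            rows.append((snap.member, age, status))
        return sorted(rows, key=lambda row: row[1])
```

The display would render `view()` as the grid of faces, with the "in session" status driving the indicative icon mentioned above.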
So I tap your snapshot on my screen and our video session, already connected via methods like the ones described above, is instantly on. Note: I don’t wait for a ring and answer and hope for a connection; that would be a huge deterrent even though it may not seem like it. So it is just like walking by your office and making contact; perhaps I can tap different icons to initiate IM (like waving to you), voice (like saying “hi”) or video (like walking into your office). We will have the proverbial water cooler conversations that we currently only have when we happen to be in the same building, which in tomorrow’s world will be even less frequent.
I can barely quantify what this would mean for improving the effectiveness of distributed teams, and for enabling folks who otherwise don’t do very well in a remote environment to work from remote areas. I think it would be an order of magnitude “larger” than the sum of all the technology and process improvements that I’ve seen in my experience managing distributed teams. The third major component we need in order to get here is the personal network enabling services – the service that enables me to build my work group as a personal network.
Personal video network enabling services are the new black
What is a personal video network enabling service?
Wish I had a better name for it. And, no, not another acronym, please. Marketing folks, help, please. Skype, Google, now Facebook Messenger (via Beluga), and most consumer-oriented video products enable users to build their own networks – the key is they have integrated directory, personal network building and presence functionality to enable users to quickly build their own networks with anyone who has a network connection, share status (presence) across those networks and easily communicate across those networks. They enable us to build our own open networks with no barrier to entry, which means I know I can reach you on that network, and that you can easily join it. I know the traditional enterprise video and telepresence products don’t have personal network enabling capability, but they will lose the war and become the minority of video endpoints (even in the enterprise) if they don’t add personal network capability (or partner to add it), so I think they will add it.
Personal network enabling services will increasingly offer more functionality and features within each personal virtual network. For example, I might belong to six personal networks, and my video endpoints may share different information with each of them, according to my preferences and in some cases according to algorithms. In my work group personal network, my endpoints may share the cached call setup type information described above, and may share different dimensions of presence within that work group personal network than within another of my personal networks.
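One way to picture per-network sharing is as a simple filter over what my endpoint knows about itself. The Python sketch below is purely illustrative: the field names, the `shared_view` function and the choice to share nothing with a network that has no policy are all my assumptions, not features of any existing service.

```python
# Fields an endpoint could share, in a fixed display order (hypothetical names).
SHAREABLE = ("online", "in_session", "snapshot", "session_params")


def shared_view(state, policies, network):
    """Return only the fields of `state` that `network` is allowed to see.

    state:    dict mapping field name -> current value on my endpoint
    policies: dict mapping personal network name -> set of allowed field names
    network:  the personal network asking for my information
    """
    # Assumption: a network with no explicit policy sees nothing at all.
    allowed = policies.get(network, set())
    return {name: state[name] for name in SHAREABLE if name in allowed}


# Example preferences: the work group sees everything, family sees only the green light.
example_policies = {
    "work group": {"online", "in_session", "snapshot", "session_params"},
    "family": {"online"},
}
```

The "according to algorithms" part would amount to something adjusting `example_policies` on my behalf, for instance hiding snapshots outside working hours.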
What does this mean for the enterprise video and telepresence market?
According to Infonetics Research, one of the relatively conservative research sources, 2010 enterprise videoconferencing and telepresence system revenue was $2.2 billion, up 18% from 2009, with $5 billion forecast by 2015. I’ve seen other industry forecasts that exceed $10 billion in those time frames, especially ones that try to forecast service revenue. Billions of dollars means plenty of analysis on who will win the market, and on the important variables in the war. Some of the most cited variables:
- Internet-video vs. managed private network video
- High-definition quality video vs. lower resolutions and less bandwidth
- SIP vs. XMPP vs. WebRTC vs. proprietary
- Different encryption and security paradigms
- AVC vs. SVC
- Varying frame rates and methods for application sharing
- Unified communications integrations
- Etc.
All are important variables and there are plenty of experts to tell you which vendors or service providers will win in each area and why. And just as many smart people will present logical arguments with the opposite conclusions. Which experts are correct? Wrong question. The real question is how to get to always-on video with personal networks and visual presence. Only then will video and telepresence really enable distance-independent interaction, which is the only way the market will really explode.
Great, how soon for this always-on, personal network, visual presence world?
I don’t know, but it isn’t far. It can be hacked together today, just not very elegantly, efficiently or cost effectively. Different video services enable parts of this to be done – the main issues being cost and user experience, as you’d need to leave the session running full video continually to a centralized MCU or bridge in order to get the always-on function. At the other extreme, I could also take a few iPads or Android tablets, make a Hollywood Squares type display from them with some positioning work to get their cameras pointed at me, launch Skype video calls with different people connected to each tablet, leave the sessions running on mute, and un-mute yours when I want to video with you. You can come up with more examples – most of the building blocks are there.
I look forward to seeing you at the water cooler, anytime, anywhere, instantly.