October 20, 2010

Big Ideas around Big Problems in Big Data

Last Wednesday, IA Ventures held its first annual Big Data Summit. Attendance was restricted to our Limited Partners, portfolio company CEOs and CTOs, leading thinkers and practitioners in fields related to Big Data and a small group of co-investors. There were only 70 people in the room. Thomson Reuters, one of our LPs, was kind enough to allow us to hold the conference in an awesome space at the top of 3 Times Square. The sky was blue, the sun was shining and the conversations were even more dazzling.

The goal for the conference was to catalyze a series of conversations about how best to deal with problems and capitalize on opportunities in today’s Big Data world. At IA Ventures we structure our thinking about Big Data around three buckets: Visualizing, Learning and Scaling. Every IA Ventures portfolio company deals with these issues day in, day out, but the issues are certainly not unique to them. One of the most interesting threads in the conference covered how hardware innovation has finally allowed us to tackle Big Data problems. The inexorable drop in the costs of storage, compute and bandwidth, together with a related increase in network access, has led us to a point where, in the words of Mike Driscoll, CTO and Co-founder at Metamarkets and author of the blog Dataspora, "The economic value of Big Data finally exceeds its extractions costs." The implications are enormous: we now stand at the edge of a generational wave that will fuel innovation towards solutions for bridging the chasm between Big Data problems and Big Data science.

Here are a handful of the Big Data insights I gleaned from the conference:

  • Widespread adoption of real-time sensors will fuel a Big Data revolution.
    Real-time data collection is expanding exponentially. Passive detection and measurement techniques will provide enormously valuable data to the tools developed by the Big Data community. The skyrocketing richness of data around context, preferences and behaviors will lead to new opportunities for analyzing our “offline” world.
  • The ability to store everything can lead to problems.
    The plummeting cost of storage has enabled many to simply store all their data. The problem is that some of that data is valuable while some is not. Identifying and exploiting the data that contains signal gets harder when critical thought is skipped on the front end.
  • Algorithms are the game-changer.
    So much emphasis has been placed on advances in cloud computing and storage that the value of better algorithms has been lost in the shuffle. Today’s machines with yesterday’s algos or yesterday’s machines with today’s algos? Hilary Mason’s compelling presentation convinced me: I’ll take the latter, hands down.
  • That said, AWS is a juggernaut and the power of its lock-in should not be underestimated.
    While the ease of creating and scaling a virtual data infrastructure has skyrocketed, the difficulty in moving large-scale data has not meaningfully decreased. Many companies who built their businesses on top of AWS will find it both painful and costly to switch.
  • Scaling development using offshore teams often benefits from moving a core team member to manage the operation.
    Coordination between on- and offshore development teams is challenging in the best of circumstances, and as offshore components have grown, the importance of global coordination has increased. Migrating the domestic development DNA to the offshore team is vital to achieving long-term success.
  • Once the wrinkles are ironed out, Mechanical Turk can be a powerful and cost-effective vehicle for scaling data-driven businesses.
    We’ve seen several companies leverage Turk to aggregate and QA large-scale data sets with great success. When you let humans do what they’re best at and let machines do the rest, you can take on tasks that were impossible only a few years ago.
  • NumPy / SciPy have impressed many with their performance.
  • Privacy is dead.
    What do we consider to be the boundaries of privacy, especially with respect to items like medical data? In a data privacy-free world, should we be regulating data usage instead? How do we deal with asymmetric access to our personal data, e.g., how is it that insurance companies claim the right to our personal information?
  • If the past is search, then the future is prediction and suggestion.
  • With open source and Internet protocols commoditizing software, the advantage will be derived through data.
    In this regard, context is king, e.g., contextualizing datasets to surface previously unseen relationships in the feed.

Tim O’Reilly wrapped up his talk with an important message: "Create more value than you capture." This is essentially the ethos of the open source movement. Problems of the magnitude discussed above do not lend themselves to “point solutions,” but to larger collaborative efforts leveraging the global brain. The solutions require not only advances in technology but a wholesale re-evaluation of the way data is used, owned and regulated. Preparing for the future is not simply an issue of throwing money at the problem, but of being thoughtful and deliberate in building open standards to facilitate innovation on a large scale.
