An inspirational graduation speech from Jack Ma. I especially like the way he describes change: “Change is painful, but not changing is bitter.”
IDC estimates that $6B (out of $27B for big data) will be invested in Hadoop infrastructure in 2015. That’s not bad for an open source technology spawned within Yahoo less than 10 years ago. The onslaught of mobile, social, and increasingly prevalent sensor data is driving businesses and governments to extract trends and insights from it.
After all, for a few hundred thousand dollars of investment, a corporation can build and install a compute cluster to perform analytics and predictions on the collected data. Yahoo, Facebook, Salesforce, Twitter, and other internet-based businesses have already invested significantly. Traditional businesses, like insurers, banks, medical providers, and manufacturers, are now investing too. These businesses benefit by gaining insights into their customers and supply chains, using cheap off-the-shelf compute servers and storage, and free open source software: Hadoop.
But many companies are finding that their compute infrastructure may be humming and consuming power in their data centers without producing enough results to justify the investment. Hadoop, as implemented, has a number of issues:
- programming the cluster using the Java / MapReduce framework can be cumbersome in many cases
- the cluster runs as a batch processing system, harkening back to the days of the mainframe, where jobs are submitted and results come back hours later. Once a job is submitted, it’s coffee time and a waiting game.
That’s why the Spark project has zoomed from a mere Berkeley AMPLab project 4 years ago to an important top-level Apache project, with a 1.0 release published in May 2014. Spark replaces many pieces of traditional Hadoop, especially the Hadoop execution engine (MapReduce). Often, customers keep the Hadoop data store HDFS.
Much of the IT world is suddenly seeing an entirely better solution than the traditional MapReduce and batch nature of Hadoop. Even the new version of Hadoop 2.x, with a new execution engine called YARN, is overshadowed by the looming Spark. You see, Spark is designed to address the issues highlighted above, specifically providing the following:
Deliver much higher developer velocity.
To analyze the streams of data coming from mobile, social, and sensors, developers need to write and adapt existing analytics and machine learning algorithms to process the data. Spark allows developers to write in Scala and Python, in addition to Java, which makes coding faster and the more succinct code easier to read. Spark also provides an execution engine (based on DAGs and RDDs) that goes far beyond the relatively simplistic MapReduce model, making programming the cluster much more intuitive. Execution runs with the data residing in memory, as opposed to fetching it from disk, so the turnaround is much faster. Developers can iterate quickly on their programs.
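To show the shape of that succinct, chained style, here is a word count written as flatMap / map / reduceByKey steps. This is a pure-Python sketch, not actual Spark code, and the input lines are made up; the real thing would run the same chain on an RDD across a cluster.

```python
from collections import Counter

# Made-up input; in Spark these lines would come from HDFS.
lines = ["spark makes clusters simple", "spark runs in memory"]

words = [w for line in lines for w in line.split()]   # flatMap: line -> words
pairs = [(w, 1) for w in words]                       # map: word -> (word, 1)
counts = Counter()                                    # reduceByKey: sum counts
for w, n in pairs:
    counts[w] += n

print(counts["spark"])  # 2
```

The equivalent MapReduce program in Java would require a mapper class, a reducer class, and driver boilerplate; the chain above is the whole computation.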
Process data with 10x-100X speedup.
Spark specifically arranges its data processing operations so that as much of the processing as possible is done in memory on each compute node of the cluster. The result is a tremendous speedup in processing. Customers see results more quickly, getting a higher return on their Hadoop investment.
Offer investment protection for existing Hadoop.
Spark is compatible with traditional Hadoop infrastructure. It can run alongside Hadoop 2.0’s YARN execution engine, and it can take data from Hive or HDFS; in fact, it processes the data much faster by keeping it in memory, using an in-memory file system called Tachyon.
This last reason makes it much easier to consider Spark for an existing Hadoop investment. Now Hadoop developers are paying attention and flocking to meetups on Spark. Spark just released version 1.0, so it’s still relatively new and subject to the idiosyncrasies of newly developed open source software. Even so, for many traditional Hadoop users and developers, the benefits above outweigh the risks of switching to Spark.
There are many other benefits to Spark, such as streaming; we’ll reserve these for later discussions.
I’m inspired by this video from Saras Sarasvathy, a very clear description of risk vs. uncertainty. Quoting Frank Knight: risk can be managed with probability and statistics, while a great many situations have so much uncertainty that they cannot be characterized that way. Professor Sarasvathy suggests an approach that makes it possible to thrive in these types of situations:
Amy Cuddy shows us how physical body actions can change how we think and feel, and especially how we can use that to change ourselves.
I just completed the Scrum Product Owner training with Agile Learning Labs. Chris Sims was a very good instructor. He explained the what, the how, and the why of a Product Owner (PO), as well as the PO’s responsibilities to the stakeholders and to the development team. A successful PO needs to regularly reprioritize user stories as the stakeholders’ requirements and valuations of the stories change. At the same time, the PO needs to fit an appropriate number of stories, by story points, to the capacity of each sprint. Aside from various exercises, the training included a half-day simulation that ran through 4 sprints. That helped me understand what is really needed and expected from the product owner in the Scrum process.
Chris also showed a video from Henrik Kniberg that illustrates the Scrum process from the perspective of a product owner. I found this video succinctly explains Scrum as well:
I completed Andrew Ng’s machine learning class on Coursera. While I’ve done other machine learning (ML) classes, I found this class’ Octave exercises enabled me to understand what’s happening in the algorithms. Many of my previous ML classes involved proofs of the mathematics, whereas this class focused on applying the vector and matrix operations to work with datasets, whether it’s a small m or large m (number of data points). After all, we need to work with the data to derive any meaningful insights.
Much of the supervised learning was devoted to figuring out the cost function and applying gradient descent to minimize it, whether for linear regression, logistic regression, neural networks, or SVMs. Unsupervised learning included k-means, PCA, and collaborative filtering. Most important are the approaches to developing better algorithms for the data at hand: diagnosing bias and variance, adding features to fix underfitting, regularizing to reduce overfitting, and using ceiling analysis to determine where to spend effort for improvements. These concepts all came to life for me in the programming exercises.
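As a minimal sketch of that cost-minimization loop (my own pure-Python illustration, not the course’s Octave code), here is batch gradient descent fitting a 1-D linear regression; the data and learning rate are made up:

```python
# Made-up training data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m = len(xs)  # number of data points, the "m" from the course

theta0, theta1, alpha = 0.0, 0.0, 0.1
for _ in range(2000):
    # Partial derivatives of the squared-error cost J(theta0, theta1).
    grad0 = sum(theta0 + theta1 * x - y for x, y in zip(xs, ys)) / m
    grad1 = sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys)) / m
    # Step both parameters downhill simultaneously.
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(round(theta1, 2), round(theta0, 2))  # recovers slope 2.0, intercept 1.0
```

The course’s exercises do the same thing with vectorized matrix operations, which is what makes them scale to large m.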
At Ethernet Summit 2014, Alan Weckel of the Dell’Oro Group showed a very interesting chart of projections for server adoption. Due to copyright issues, I’ll summarize the info as follows:
In 2013, cloud and service providers accounted for ~20% of server unit shipments; by 2018, this group of customers is forecast to account for up to 50% of server unit shipments. If this trend continues, there would be no growth in server shipments to enterprise customers.
Since servers account for part of the data center, the implication is that both networking and storage gear would move this way as well. Cloud and SP are significantly changing the data center equipment market.
Another interesting point: 2 players dominate the cloud, Google and Amazon, while Facebook could be an up-and-comer. These players design their own data center equipment and work directly with ODMs to manufacture it. It would take some hard maneuvering for an IT equipment vendor to get into these accounts. HP is trying such a maneuver: creating low-cost entry servers in partnership with Foxconn. Time will tell whether this will work.
This cartoon was posted on the screen at a recent BigData Guru meetup. “Data is the new oil.” While this concept was first stated in 2006, it’s still very relevant today, especially with the explosion of the internet, mobile, IoT, and the multitude of tools for people to generate content and to enable transactions of all types. IBM indicated in 2013 that 90% of the world’s data had been generated in the previous 2 years.
The premise is that much of data from email, tweets, IM, Facebook posts, Google searches, Amazon purchases, Ebay transactions, mobile and IoT sensor data, retail transactions could be mined for value, analogous to crude oil being processed and refined into gasoline and other ingredients to make plastic and other materials. The value of the data could result in better targeting of customers to buy the right products, in better understanding of the environment from sensor data to increase productivity, such as increasing crop production or solar energy yield. The possibilities seem endless.
All the while, both the companies extracting the value from the data and the vendors producing tools and equipment (e.g. data center servers and storage) to enable this extraction are being rewarded handsomely with $$$. Data is the new oil.
This is the claim by Nvidia CEO Jen-Hsun Huang: that 3 Nvidia Titan Z CUDA-based GPU cards could offer the same performance running deep learning neural nets as the Google Brain project achieved using Intel processors.
- 1/150 the acquisition cost
- 1/150 the power consumption
If this could be done in general for deep learning type problems, we could have many more machines doing machine learning on the explosion of data. At the same time, to use the CUDA cores, software programmers would need to learn to program this hardware and/or use OpenCL. The cost savings could warrant pushing over the learning curve.
This paper is referenced: “Deep Learning with COTS HPC Systems” by A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, and B. Catanzaro, published at ICML 2013.
Google announced Project Tango on February 20, 2014. It’s a cell phone that captures and reconstructs the environment in 3D, wherever the user points the back cameras. There are 2 cameras: a color imaging camera and a depth camera (or Z-camera), much like the first-generation Kinect. But Project Tango is much more than the Kinect: it performs all the computation for the 3D reconstruction in real time, using co-processors from Movidius.
This reminds me of what Dr. Illah Nourbakhsh said in 2007 in the inaugural presentation of the IEEE RAS OEB/SCV/SF Joint Chapter: that some day, we’d be able to wave a camera and capture the entire 3D image of our environment. Project Tango is just that simple, just aim the cameras to the areas to create the 3D reconstruction. To complete a room, you’d have to walk around the whole room to capture all the information.
Using a SLAM algorithm, aGPS, and orientation sensors, Project Tango is also able to localize the 3D reconstruction to its location on earth and relative to the location of the device itself.
Project Tango runs a version of Android Jelly Bean, rather than the latest KitKat release. What’s more, it apparently uses a PrimeSense sensor, which is no longer available after Apple’s acquisition of PrimeSense. (Interesting that Google did not push to outbid Apple for PrimeSense. After all, there are plenty of alternative depth sensor technologies out there.) Furthermore, battery life is very limited. These and other issues will need to be solved for real-world deployment.
Applications for real-time 3D reconstruction and mapping include augmented reality, architectural design, and many others. Most interesting would be the use in mobile robots maneuvering in the real world. Just imagine indoor drones, armed with this capability, able to move autonomously and safely anywhere in a building, monitoring and transporting items from one location to another. The applications are endless.
By demonstrating real-time 3D reconstruction and mapping in Project Tango, Google has advanced computing technology that enables real interaction with the physical world.
Andy Feng of Yahoo presented his work on Apache Storm. This picture shows the 3 types of Hadoop 2 processing scenarios:
- Hadoop batch processing (MapReduce or the newer Tez providing DAG based processing)
- Spark iterative processing (for machine learning, where the algorithms crunch the same data repeatedly to minimize some objective function). Spark supports the Directed Acyclic Graph (DAG) processing model; with these capabilities, it has drawn increasing interest.
- Storm stream processing for real time data
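To make the DAG idea behind Tez and Spark concrete, here is a small Python sketch (my own illustration) that orders hypothetical processing stages so each stage runs only after the stages it depends on; the stage names are made up:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Map each hypothetical stage to the stages it depends on.
deps = {
    "load": set(),
    "filter": {"load"},
    "aggregate": {"filter"},
    "join": {"load", "filter"},
    "report": {"aggregate", "join"},
}

# A DAG engine can run any topological order (and parallelize independent stages);
# MapReduce would instead force everything into rigid map/reduce rounds.
order = list(TopologicalSorter(deps).static_order())
print(order[0], order[-1])  # load ... report
```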
This platform presents an “operating system”-like set of functions to manage a cluster of compute and storage:
- HDFS to manage storage
- YARN to manage compute resources
- MapReduce/Tez/Storm and Spark to schedule and run tasks
All open source and changing quickly.
To draw an owl, just start with 2 ovals that look like an owl, then fill in the details via quick iterations. Twilio describes this as “There’s no instruction book, it’s ours to draw. Figure it out, ship it and iterate.”
Ted said that while Box’s target customers are enterprises, Box can still approach product releases using this approach.
This applies well to new product development, where the goal is to get to a product with features that maximize customer value, and thus revenue, through a series of quick iterations.
I think the key to success consists of
(1) starting with a framework (the 2 ovals) that has a high chance of success
(2) having the discipline to recognize when the framework is failing, and then to abandon the project quickly, without being held back by the sunk cost.
This past weekend at Code Camp, I saw Yosun Chang‘s presentation on hacking Google Glass. I liked how her slides flowed from a 10,000-foot view of all the slides and zoomed into each slide. She was using Prezi.
I’ve seen Prezi presentations before but had not had the opportunity to use it. I had been using PowerPoint and Keynote.
I had to give a speech at Startup Speakers, and I turned it into an opportunity to use Prezi. I like how Prezi enforces structure on the presentation through a simple graphical layout.
Here’s my 6-minute speech on my evening-and-weekend excursions into 3D printing over the last few years:
I finally got to try Glass. I attended Yosun Chang‘s Code Camp session on hacking Glass, and she let the audience test some of the hacks she had done with Glass. It was rather light and felt comfortable on the head.
I saw up close the structure that holds the OMAP4430 processor, 500+MB memory, 16G storage, GPS, camera, accelerometer, speaker, mic, light sensor, and touch sensor. There is a battery designed to hang behind the ear when Glass is worn. The most visible piece is the clear block of plastic with a built-in prism that reflects the light from the LCD screen into the eye’s upper right field of view.
I think it has huge potential for sensor fusion, where the user’s intent could be surmised from the information gathered by the accelerometer, mic, light, touch, and other sensors.
But NOT the use case of using the head as a mouse. One of the hacks had the wearer browse items projected on a virtual cylinder by moving the head. A person could easily get a neck cramp, and maybe develop a new form of carpal tunnel. Using eye gaze to track the cursor position could be interesting, although currently there is no camera pointed at the eye. Opportunities for future versions of Glass.
In the meantime, I took a picture with me wearing Glass.
It’s been over a year since I joined the Startup Speakers Toastmasters club. I completed 10 speeches in the first 7 months. Then I became the VP of Education, responsible for scheduling duties for all the members. In that role, I used Doodle to poll for availability and a Google spreadsheet to set up the schedule, with automatic generation of the agenda for each meeting. For special events, such as Speech Contests and Pitch sessions, I encouraged members to take the lead in running the event, which enabled them to work toward their Competent Leadership goals. My term as VP of Education has since ended, and I decided not to pursue further officer roles in order to focus my time on MOOC classes and other endeavors.
Why did I join Startup Speakers?
Recently, I found my speaking skills degrading. I recalled how Toastmasters had helped me become a competent speaker who can manage an audience of several hundred people. Before Toastmasters, I would sweat a lot when I had to speak in front of an audience, and I was too nervous to come up with meaningful words to engage it. Over 10 years, Toastmasters helped me become a much better speaker.
In the mid 2000’s, I presented to customers every week, so I did not need Toastmasters to maintain my speaking skills. In 2012, I felt it was time to practice public speaking again. I visited a number of clubs and found this dynamic and enthusiastic group that comes together at 7am to practice speaking and leadership skills. Many of the folks are much younger than I am, but I fit in well.
Now I continue to work on speeches at this Toastmasters.
Startup Speakers Toastmasters meets every Wednesday morning, 7-8:30am, at Plug and Play Tech Center in Sunnyvale. Info is available at the meetup.
I had the opportunity to attend an event where the creator of Ruby was speaking. Matz was extremely friendly and he was gracious enough to take a picture with me.
I just received my certificate for the “Introduction to Artificial Intelligence” online course offered by Sebastian Thrun and Peter Norvig. I’m one of 23,000 who received the certificate, for doing well in the homework, midterm, and final exams. The class was one of three open online courses offered as an experiment by Stanford. Enrollment reached 160,000 students from 190 countries. The course started in September and ended in December.
I finally got to understand how probability and statistics can be applied to make sense of data. Dr. Thrun explained very well the applications of Bayesian statistics, and especially the particle filter. I had had years and years of statistical theory in math, physics, finance, and computer science classes.
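As a rough illustration of the particle filter idea (my own sketch, not the course’s code), here is a minimal 1-D version in Python that estimates a fixed position from noisy measurements; all the numbers are made up:

```python
import math
import random

random.seed(0)

true_pos, noise = 5.0, 1.0
# Start with particles spread uniformly over the space of possible positions.
particles = [random.uniform(0.0, 10.0) for _ in range(1000)]

def likelihood(p, z):
    # Un-normalized Gaussian likelihood of measurement z given particle p.
    return math.exp(-0.5 * ((p - z) / noise) ** 2)

for _ in range(20):
    z = random.gauss(true_pos, noise)               # one noisy measurement
    weights = [likelihood(p, z) for p in particles]
    # Resample: particles that explain the measurement well survive.
    particles = random.choices(particles, weights=weights, k=len(particles))

estimate = sum(particles) / len(particles)
print(round(estimate, 1))  # close to 5.0
```

Real filters add a motion/noise step between updates so the particle set doesn’t collapse; this sketch only shows the weight-and-resample core.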
These 3 classes launched the Massive Open Online Course (MOOC) revolution in education.
On June 24, we had a thought-provoking set of presentations at the SDForum SAMsig, arranged by our departing Co-Chair Paul O’Rourke. The presentations ended with how Twitter is scaling today by rewriting its backend infrastructure from Ruby on Rails to a new language called Scala. Notably, kestrel, a queuing system that mediates between Twitter’s web user interface and the processing of “tweet following” and sending tweets, was written in Scala and implements the Actor model. This implementation is much simpler than the alternatives, is more reliable, and scales to handle billions of tweets.
3 themes came out of the meeting that pointed to Twitter’s switch to Scala:
- Actor model enables simple programming of applications involving concurrency
- Scala language features make programming fun and interesting
- JVM has solid reliability and thread scaling
Created by Carl Hewitt some 35 years ago at MIT, the Actor model is resurfacing as a good way to think about and implement systems that involve many simultaneous communications, such as chat and Twitter. Actors are objects that do not operate on shared mutable state and communicate with each other only via message passing. As a result, the Actor model eliminates many of the headaches programmers face when solving concurrency problems involving millions of senders and receivers of messages.
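To make the share-nothing idea concrete, here is a minimal actor sketched in Python (my own illustration, not Hewitt’s formalism or Scala’s implementation): the actor owns its state and is reachable only through its mailbox.

```python
import queue
import threading

class CounterActor:
    """A toy actor: private state, a mailbox, one thread processing messages."""

    def __init__(self):
        self.mailbox = queue.Queue()
        self.count = 0                      # touched only by the actor's own thread
        self.done = threading.Event()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            msg = self.mailbox.get()        # messages handled one at a time
            if msg == "stop":
                self.done.set()
                return
            self.count += 1                 # no locks needed: no shared mutable state

    def send(self, msg):
        self.mailbox.put(msg)               # the only way to interact with the actor

actor = CounterActor()
for _ in range(100):
    actor.send("tweet")
actor.send("stop")
actor.done.wait()
print(actor.count)  # 100
```

Because senders never touch the actor’s state directly, there are no locks to manage, which is the simplification kestrel exploits at Twitter’s scale.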
At SAMsig, Hewitt reviewed some of these issues and the history of the creation of the Actor model.
Scala is a recently created language that runs on top of the Java Virtual Machine (JVM) and thus uses all the facilities of the Java environment. It takes advantage of the proven reliability, performance, and other capabilities of the JVM. However, many programmers find coding in Java tedious, with its formal requirements. Scala makes programming fun with its simplicity of passing functions (functional programming, as in LISP) and pattern matching (more general and powerful than C’s switch statement). Scala also implements the Actor model using the multi-threading capabilities of the JVM, while removing the complexity of thread communication for the programmers.
At SAMsig, Frank Sommers presented many of the features of Scala in a short 40 minutes. There was much to take in in a short time.
JVM reliability and multi-threading
The JVM has proven to be reliable and can scale easily to take advantage of the latest multi-core processors and large clusters of servers running together in the data center. Robey Pointer indicated that by using Scala to write the Twitter queuing system, kestrel, he was able to take advantage of all the goodness in the JVM without having to write in Java. And yet, Java could be a backup language if Scala fails. Furthermore, by using Actors within Scala, the coding was much simpler than similar code written in Java, as Actors enforce a share-nothing form of communication designed for concurrent environments. The code size for kestrel is estimated at half that of similar code written in Java. Without the complexity of managing threads and locks explicitly, and with a smaller code base, kestrel is much easier to support and maintain. In fact, kestrel runs on multiple servers and processes billions of tweets without failing. System responsiveness is fast. Pointer’s slides are here.
The combination of the Actor model, Scala, and the JVM makes the kestrel queueing system, and Twitter, reliable, scalable, and fast. By writing the code in Scala and using Actors, the code is easier to develop and simpler to maintain. This combination points to the continuing innovation happening in software. It serves as a good example for us, as we endeavor to develop new products, to consider new advances in software (such as Scala), rather than being stuck doing things the same old way.
Copyright (c) 2009 by Waiming Mok
With fewer than 1K servers, redundant data centers included, Salesforce.com supports over 55K customers, including Google and Dell. This feat is achieved by an ingenious group of Oracle database experts who have taken an enterprise-class relational database and turned it into a multi-tenant system that runs customer relationship management for all these customers. BTW, this system handles close to 200M transactions each weekday, with less than a quarter-second response time.
On March 25, Craig Weissman, CTO of Salesforce.com, gave an illuminating presentation on the internal architecture at his company to a room full of attendees at the SDForum SAMsig. Some highlights:
- There are 15 or so “pods”, each consisting of 50 or so servers, running a 4-way Oracle RAC and application (Java) and support servers. Each pod supports thousands of customers.
- Each Oracle RAC database consists of static tables that store the data from thousands of customers all mixed together. One row of data belongs to one customer and the next row belongs to another, and the columns may have completely different types of data between the rows. Control and access to the data are managed by metadata. In essence, the Oracle database is transformed to a multi-tenant database.
Customer data in the columns are stored as human-readable strings (some customers have requested that certain data be encrypted). Appropriate transformation functions convert the strings to the correct data types when the data are accessed.
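As a hypothetical sketch of that metadata-driven conversion (the field names and types below are my own invention, not Salesforce’s actual schema), per-field metadata can say how to interpret each stored string:

```python
from datetime import date

# Made-up metadata: for each field, a function that parses the stored string.
metadata = {
    "amount": lambda s: float(s),
    "closed": lambda s: s == "true",
    "close_date": lambda s: date.fromisoformat(s),
    "account": lambda s: s,
}

# One row as it might sit in the shared table: everything is a string.
raw_row = {"amount": "1250.50", "closed": "true",
           "close_date": "2009-03-25", "account": "Acme"}

# At access time, the metadata drives the conversion to real types.
typed_row = {field: metadata[field](value) for field, value in raw_row.items()}
print(typed_row["amount"] + 100)  # 1350.5
```

The same mechanism lets two adjacent rows, belonging to different customers, carry completely different field layouts in the same physical columns.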
- Using Lucene, the data are all indexed.
- Apex is a new language that enables customers to write data processing applications, like a new generation of 4GL. It resembles Java. Governors are deployed to prevent abuse of system resources.
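Here is a hypothetical sketch of the governor idea (the limit, names, and API are made up, not Apex’s actual governors): meter each tenant’s resource usage and cut it off past a cap, so one customer’s script cannot starve the shared pod.

```python
class GovernorLimitError(Exception):
    """Raised when a tenant exceeds its resource budget."""

class Governor:
    def __init__(self, max_queries=100):
        self.max_queries = max_queries
        self.queries = 0

    def charge_query(self):
        # Every metered operation increments the tenant's usage counter.
        self.queries += 1
        if self.queries > self.max_queries:
            raise GovernorLimitError("too many queries for this request")

gov = Governor(max_queries=3)
ok = 0
try:
    for _ in range(5):
        gov.charge_query()   # each tenant query is metered
        ok += 1
except GovernorLimitError:
    pass
print(ok)  # 3 queries succeeded before the governor cut the script off
```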
The Salesforce.com architecture is an engineering feat in leveraging the strength of an existing product (the reliability and scalability of Oracle RAC) to build a new system that supports thousands of customers on the web, with millions of transactions per day and fast response times. Interestingly, over the course of 10 years, Salesforce was able to move the underlying hardware from Sun SPARC systems to x86 and update the Oracle version, while retaining the higher-level software architecture and achieving substantial customer growth.
I just finished reading the book Managing Humans by Michael “Rands” Lopp. It’s a great book for software managers, managers in general, and individual contributors. I find Rands provides realistic and practical advice on dealing with the people who run your company, the person you work for, and the people you manage. Rands also provides interesting insights on people, classifying them in his own terms: NADD, incrementalists, completionists, organics, mechanics, inwards, outwards, holistics, free electrons, joe.
As usual, I borrowed this book from the library so I’d actually read it, as I tend not to read books I buy. With this book, I purchased a copy via AllBookstores.com after I’d read it. It’s worth the purchase, if only for future reference.
At the Monte Jade annual conference on March 7, Dr. Arun Majumdar of Berkeley gave a very interesting talk on energy use and the opportunities for improvement. He presented lots of data. I found 3 of his points especially interesting:
1. When looking at overall energy use, there are 9 sources of supply and 6 areas of demand. Even a high growth rate in solar would take years to have significant impact; that should not stop solar, though.
2. Abatement of energy use is the most efficient and effective way in the short term to solve the energy crisis. There are many ways, including changing the way we build buildings and how a building is put together. These savings are extremely significant.
3. Batteries, i.e. energy storage, have improved little: storage capacity has only doubled in 60 years. New developments in nanotechnology could help.
By focusing on these areas with new technologies and software, significant headway could be made on the energy problem in the next 3-5 years.
- concurrent web-based tests
- main database server utilization
- additional commodity servers
- management of session transient data, moved from the main database server to Terracotta networked memory
By moving the processing of various transient data used in the web service from the main Oracle database to Terracotta’s network-attached memory model, these results were achieved with minimal changes to the Java code. Some additional hardware was added to support the Terracotta servers, but these were low-cost commodity machines.
This example shows that web/app/database architectures often rely on the backend database servers to manage not only the transactions but also the transient data for sessions. Transient session information includes data like which page the user is on, the data he has entered, etc. Terracotta can manage the transient data in place of the main database. The resulting system is much more efficient than having the entire web service rely on the main database to maintain all the state in the system.
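A tiny Python sketch of that split (my own illustration; two plain dicts stand in for the Oracle database and the Terracotta memory store):

```python
# Durable, transactional data: belongs in the main database.
main_db = {}
# Transient, per-session data: belongs in a fast in-memory store
# (the role Terracotta plays), and is acceptable to lose on restart.
session_store = {}

def place_order(user, item):
    main_db[(user, item)] = "ordered"      # must survive crashes

def track_page(session_id, page):
    session_store[session_id] = page       # purely session state

place_order("alice", "book")
track_page("sess-1", "/checkout")

print(len(main_db), session_store["sess-1"])  # 1 /checkout
```

Every `track_page` call that goes to the in-memory store is one less write hitting the main database, which is where the efficiency gain comes from.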
Terracotta is open source and works with Java-based systems. Orion mentioned that other languages can be supported, such as Ruby via JRuby.
This motor shows the marvel of nature:
- a voltage potential drives a current through the wire (battery and wire)
- the current generates a magnetic field
- 2 magnetic fields in the same direction cause repulsion (magnet and wire)
and the marvel of man’s ingenuity:
converting this knowledge into a device (the electric motor) that’s used in so many applications, from disk drives to garage door openers to cars.
I certainly have learned from this simple experience, reminding me why I was a physics major a long time ago.
To make the motor work, be sure of the following:
- the wire must be insulated, so the coil does not short
- one end of the wire has the insulation fully removed by sanding; the other end has only one side of the insulation removed (this ensures that the wire’s magnetic field is turned on and off as the wire spins, enabling the repulsion of the 2 magnetic fields to keep the wire spinning)
- regularly sand the ends of the wires again, as occasional electric arcs leave a deposit (carbon?) on the wire that prevents electrical conduction
In CACM, 01/09, Werner Vogels, in his article “Eventually Consistent”, describes how the construction of large distributed systems, such as Amazon EC2 and S3, requires a tradeoff between consistency and availability. This follows Eric Brewer‘s CAP theorem (conjecture?). The idea is that in any shared-data system, only 2 of the following 3 properties can be satisfied: consistency, availability, and tolerance to network partitions. In large distributed systems, e.g. cloud computing, tolerance to network partitions is a given, so there is a trade-off between consistency and availability.
Vogels’ proposition is that availability of the system (and the resulting services) has higher priority than consistency in most cases. He identifies different relaxed forms of consistency, as opposed to strong consistency, in which all data stored anywhere in the system, when retrieved and compared, are the same. The relaxed consistency models allow Amazon’s implementation of the cloud to be highly available, while transaction updates are eventually made consistent.
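As a toy illustration of eventual consistency (my own last-write-wins sketch; Amazon’s actual mechanisms are more involved), two replicas accept writes independently and later converge by exchanging state:

```python
class Replica:
    """A toy replica: key -> (timestamp, value), last write wins."""

    def __init__(self):
        self.store = {}

    def write(self, key, value, ts):
        self.store[key] = (ts, value)

    def read(self, key):
        return self.store[key][1]

    def sync(self, other):
        # Anti-entropy: merge both ways, keeping the newest timestamp per key.
        for key in set(self.store) | set(other.store):
            newest = max(self.store.get(key, (-1, None)),
                         other.store.get(key, (-1, None)))
            self.store[key] = other.store[key] = newest

a, b = Replica(), Replica()
a.write("cart", ["book"], ts=1)          # write lands on replica a
b.write("cart", ["book", "pen"], ts=2)   # later write lands on replica b
# Before the sync, a read may return stale data; that is the availability trade-off.
a.sync(b)                                # eventually the replicas exchange state
print(a.read("cart") == b.read("cart"))  # True
```

Between the writes and the sync, the two replicas disagree, yet both stay available for reads and writes; consistency arrives only eventually.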
This article points to the fact that to build cloud-computing architectures, previous assumptions such as ACID may need to be relaxed.