Feeds:
Posts
Comments

Archive for the ‘cloud-computing’ Category

IDC estimates that $6B (out of $27B for big data) will be invested in Hadoop infrastructure in 2015. That’s not bad for an open source technology spawn within Yahoo just less than 10 years ago. The onslaught of mobile, social, and emerging prevalence of sensor data is driving businesses and governments to extract trends, sense, and information from.

After all, for a few hundred thousand dollars in investment, a corporation can build and install a compute cluster to perform analytics and predictions from the collected data. Yahoo, Facebook, Salesforce, Twitter, and internet based businesses have already invested significantly. Traditional businesses, like insurance, banks, medical providers, manufacturers, are now investing. These businesses benefit from gaining insights into their customers and supply chains, using cheap off the shelf compute servers and storage, and free open source software that’s Hadoop.

But many companies are finding that their investments in the compute infrastructure might be humming and consuming power in their data centers, but not producing enough results to justify the ROI. Hadoop, as implemented, has a number of issues:

  • programming the cluster using the Java / MapReduce framework could be cumbersome in many cases
  • the cluster runs as a batch processing system, harkening back to the dates of the mainframe, where jobs are submitted and the computed results would come hours later. Once a job is submitted, it’s coffee time and a waiting game.

That’s why the Spark project has zoomed from a mere Berkeley AmpLab project 4 years ago to an important top level Apache project with a 1.0 release published just in May 2014. Spark replaces many pieces of traditional Hadoop, especially the Hadoop execution engine (MapReduce). Often, customers keep the Hadoop data store HDFS.

Much of the IT world is suddenly seeing an entirely better solution than the traditional MapReduce and batch nature of Hadoop.  Even the new version of Hadoop 2.x, with a new execution engine called YARN is overshadowed by the looming Spark. You see, Spark is designed to address the highlighted issues, specifically provides the following:

Deliver much higher developer velocity.

To analyze the streams of data comsonicboom-velocitying from mobile, social and sensors, developers need to write and adapt existing analytics and machine learning algorithms to process the data. Spark allows the developers to write in Scala and Python, in addition to Java, which makes the coding much faster and easier to read the more succinct code. Also, Spark provides an execution engine (DAG, RDD) that goes far beyond the relatively simplistic model of MapReduce, making programing the cluster much more intuitive to write. Execution runs with all the data residing in memory, as opposed to getting data from disks, so the turn-around is much faster. Developers can iterate quickly on their programs with fast turn-around of running Spark jobs.

moving_sandbags

Process data with 10x-100X speedup.

Because Spark specifically arrange the data processing operations to ensure all data procession is done in memory in each compute node of the cluster. The result is tremendous speedup in processing. Customers can see more results quickly, getting higher return on their Hadoop investment.

green-prestine

Offer investment protection for existing Hadoop.

Spark is compatible with traditional Hadoop infrastructure. Spark can run along side Hadoop 2.0’s YARN execution engine. Spark can take data from Hive or HDFS, and in fact runs the data much faster, by putting the data in memory using an in-memory file system called Tachyon.

This last reason makes it so much easier to consider using Spark for an existing Hadoop investment. Now Hadoop developers are paying attention and flocking to meetups on Spark. While Spark just released a version 1.0, so it’s still relatively new, and subject to the iidiosyncracies of newly developed open source software. Often, the above benefits are outweighing the risks for traditional Hadoop users and developers to switch to Spark.

There are many other benefits to Spark, such as streaming – these we’ll reserved for discussions later.

Advertisements

Read Full Post »

In Ethernet Summit 2014, Alan Weckel of the Dell’Oro Group showed a very interesting chart on projections for server adoption. Due to copyright issues, I’d summarize the info as follows:

In 2013, cloud and server providers account for ~20% server unit shipments, by 2018, this group of customers is forecasted to account for up to 50% of server unit shipments. If this trend continues, the there would no growth to server shipments to enterprise customers.

Since servers account for part of the data center, the implication is that both networking and storage gear would move this way as well. Cloud and SP are significantly changing the data center equipment market.

Another interest point, 2 players dominate in the cloud, Google and Amazon, while Facebook could be an up-and-comer. These players design their own data center equipment and directly work with ODMs to manufacture their own equipment. It would take some hard maneuvers for an IT equipment vendor to get into these accounts. HP is trying such as maneuver: creating low-cost entry servers in partnership in Foxconn. Time will tell whether this would work.

Read Full Post »

carlh-robeypOn June 24, we had a thought provoking set of presentations at the SDForum SAMsig, arranged by our departing Co-Chair Paul O’Rourkesamsig-attendees-june24The presentations ended with how Twitter is scaling today with the rewriting of the backend infrastructure from Ruby on Rails to a new language called Scala.  Notably, a queuing system, kestrel, that mediates between Twitter’s web user interface and the processing of “Tweet following” and sending tweets was written in Scala and implements Actor.  This implementation is much simplified from other implementations, is more reliable, and scales to handle billions of tweets.

The switch from Ruby to Scala for Twitter’s backend is detailed at Artima developer, run by Bill Venners, who also mediated the presentations by the 3 speakers (Hewitt, Sommers, Pointer) at SAMsig.

3 themes came out in the meeting, that pointed to Twitter’s switch to Scala:

  • Actor model enables simple programming of applications involving concurrency
  • Scala language features make programming fun and interesting
  • JVM has solid reliability and thread scaling

Actor model

Created by Carl Hewitt some 35 years ago at MIT, the Actor model is surfacing as a good way to think about and to implement systems that involve multiple simultaneous communications, such as chat and Twitter.  Actors are objects that do not operate on the same content that changes (mutable states) and communicate with each other via messages (message passing).  As the result, the actor model eliminates many of the headaches that programmers face when solving the concurrency issues involving millions of senders and receivers of messages. 

At SAMsig, Hewitt reviewed some of these issues and the history of the creation of the Actor model.

Scala

Scala is a recently created language that runs on top of the Java Virtual Machine (JVM) and thus uses all the facilities of the Java environment.  It takes advantage of the proven reliability, performance, and other capabilities offered by the JVM.  However, many programmers find coding in Java could be tedious with its formal requirements.  Scala makes it fun for programmers with its simplicity of passing functions (functional programming like LISP) and pattern matching (more general and powerful than C case statement).  Scala also implements Actor using the multi-threading capabilities of JVM, while removes the complexity of thread communication for the programmers. 

At SAMsig, Frank Sommers presented many of the features of Scala in a short 40 minutes.  There was much to take in in a short time.

JVM reliability and multi-threading

JVM has proven to be reliable and can scale easily to take advantage of the latest multi-core processors and large cluster of servers running together in the data center.  Robey Pointer indicated that  by using Scala to write the Twitter queuing system, kestrel, he was able to take advantage of all the goodness in the JVM, without having to write in Java.  And yet, Java could be a backup language if Scala fails.   Furthermore, by using Actor within Scala, the coding was much simplified from similar code written in Java, as the Actor enforces a share-nothing form of communication, designed for concurrent environment.  The code size for kestrel is estimated at half the size of similar code written in Java.  Without the complexity of managing threads and locks explicitly in kestrel, along with smaller code size, the support and maintenance of the code is much easier.  In fact, kestrel runs on multiple servers and processes billions of tweets without failing.  System responsiveness is fast.  Pointer’s slides are here.

The combination of Actor, Scala, and JVM makes the kestrel queueing system and Twitter  reliable, scalable, and fast.  By writing the code in Scala and using Actors, the code is easier to develop and simpler to maintain.  Such combination of these elements points to the continuing innovation happening with software.  This serves as a good example for us, as we endeavor to develop new products, that we consider new advances in software (such as Scala), rather than just being stuck with doing things the same old way.

Copyright (c) 2009 by Waiming Mok

Read Full Post »

salesforceperfWith less than 1K servers, redundant data centers included, Salesforce.com supports oversalesforcedb 55K customers, including Google and Dell.  This feat is achieved by an ingenious group of Oracle database experts who have taken an enterprise class relational database and turned it into a multi-tenant system that runs customer relationship management for these many weissmancustomers.   BTW, this system supports close to 200M transactionsweissmanaudience1 each weekday, at less than 1/4 of a second response time.

On March 25, Craig Weissman, CTO of Salesforce.com, gave an illuminating presentation on the internal architecture at his company to a room full of attendees at the SDForum SAMsig.  Some highlights:forcesearcheng

  • There are 15 or so “pods”, each consisting of 50 or so servers, running a 4-way Oracle RAC and application (Java) and support servers.  Each pod supports thousands of customers.
  • Each Oracle RAC database consists of static tables that store the data from thousands of customers all mixed together.  One row of data belongs to one customer and the next row belongs to another, and the columns may have completely different types of data between the rows.  Control and apexarchaccess to the data are managed by metadata.  In essence, the Oracle database is transformed to a multi-tenant database.
  • Customer data in the columns are stored as human-readable strings.  Some customers have requested certain data to be encrypted.  Appropriate transformation functions are used to convert to the correct data types when the data are accessed.
  • Using lucene, the data are all indexed.
  • Apex is a new language to enable customers to write data processing applications, like a new generation of 4GL.  It resembles Java.  Governors are deployed to prevent abuse of the system resources.

The Salesforce.com architecture is an engineering feat in leveraging the strength of an existing product (reliability and scalability of Oracle RAC) to build a new system that supports thousands of customers on the web with million of transactions per day and fast response time.  Interestingly, Salesforce was able to move the underlying hardware from Sun Sparc systems to x86 and updated the Oracle version, over the course of 10 years, while retaining the higher level software architecture and getting substantiate customer growth.

Copyright (c) 2009 by Waiming Mok

Read Full Post »

orionOrion Letizi gave taudienan interesting talk on Terracotta at the SDForum SAMsig tonight.  His example on the deployment of Terracotta to solve a database scaling problem was especially compelling:

 Examinator example

Before

After

Concurrent web-based tests

5K

10K

Main database server utilization

75%

30%

Additional hardware

Additional commodity servers

Software changes

Management of session transient data from main database server to Terracotta networked memory

By moving the processing of various transient data used in the web service from the main Oracle database to Terracotta’s network attached memory model, with minimal changes to the Java code.  There were some additional hardware added to support the Terracotta servers, but these were low cost commodity hardware. 

This examples shows that often web/app/database architectures tend to rely on the backend database servers to manage the transactions, but also the transient data for sessions.  Transient session information includes data like tracking the page the user is on, the data that he has entered, etc.  Terracotta can manage the transient data in place of the main database.  The resulting system is much more efficient than having the entire web service architecture relying on the main database to maintain all the states in the system.

Terracotta is open source and works with Java based systems.  Orion mentioned that other languages can be supported, such as Ruby via jruby.

Copyright (c) 2009 by Waiming Mok

Read Full Post »

In CACM, 01/09, Werner Vogels, in his article “Eventually Consistent“, describes how the construction of large distributed systems, such as Amazon EC2 and S3, would require the tradeoff between consistency and availability.  This follows Eric Brewer‘s CAP theorem (conjecture?).  The idea is that in any shared-data system, only 2 of following 3 properties could be satisfied:  consistency, availability, and tolerance to network partitions.  In large distributed systems, e.g. cloud-computing, tolerance to network partitions is a given.  So there is a trade-off between consistency and availability.

Vogels’ proposition is that availability of the system (and the resulting services) has higher priority to consistency in most cases.  He identifies different types of relaxed form of consistency, other than strong consistency — that all data stored in the entire system could be retrieved, and when compared, are the same.  The relaxed consistencies allow the Amazon implemention of the cloud to be highly available, and eventually, transaction updates are made consistent. 

This article points to the fact that in order to build cloud-computing architectures, previous assumptions such as ACID may need to be relaxed.

Copyright (c) 2009 by Waiming Mok

Read Full Post »

Tencent QQ

tencentJeff Xiong, Co-CTO of Tencent, presented to an overfilled conference room at Fenwick & West on Friday evening.

Jeff presented the history of Tencent and some insights into the company’s success.  Tencent is among the largest IM service in the world, currently it has over 856M user accounts and 355M (bigger than US population) active users thru its QQ service.   tencentimThe company has implemented a very scalable system that was able to get, without the hour, words of the Sichuan earthquake throughout China and Olympics results.  In additional to IM, Tencent offers email, gaming, and other web services and it’s one of the fastest growing web service in the world.

With only 20% of its > 1B population on the internet, China offers great growth potential for Tencent.  In addition, Tencent is looking to expand into other markets, such as India and Vietnam.

BTW, Tencent is the transliteration of its Chinese name, which means speedy communication.

Copyright (c) 2009 by Waiming Mok

Read Full Post »

Older Posts »