
IDC estimates that $6B (out of $27B for big data) will be invested in Hadoop infrastructure in 2015. That's not bad for an open source technology spawned at Yahoo less than 10 years ago. The onslaught of mobile and social data, and the growing prevalence of sensor data, is driving businesses and governments to extract trends, insight, and information from it.

After all, for a few hundred thousand dollars of investment, a corporation can build and install a compute cluster to perform analytics and predictions on the collected data. Yahoo, Facebook, Salesforce, Twitter, and other internet-based businesses have already invested significantly. Traditional businesses, such as insurers, banks, medical providers, and manufacturers, are now investing as well. These businesses gain insights into their customers and supply chains using cheap off-the-shelf compute servers and storage, and free open source software: Hadoop.

But many companies are finding that the compute infrastructure they invested in may be humming along and consuming power in their data centers without producing enough results to justify the investment. Hadoop, as implemented, has a number of issues:

  • programming the cluster using the Java / MapReduce framework can be cumbersome in many cases
  • the cluster runs as a batch processing system, harking back to the days of the mainframe, when jobs were submitted and the computed results came back hours later. Once a job is submitted, it's coffee time and a waiting game.

That's why the Spark project has zoomed from a mere Berkeley AMPLab project 4 years ago to an important top-level Apache project, with a 1.0 release published in May 2014. Spark replaces many pieces of traditional Hadoop, especially the execution engine (MapReduce), while customers often keep the Hadoop data store, HDFS.

Much of the IT world is suddenly seeing a far better alternative to the traditional MapReduce and batch nature of Hadoop. Even the new Hadoop 2.x, with its new resource manager called YARN, is overshadowed by the looming Spark. Spark is designed to address the issues highlighted above; specifically, it provides the following:

Deliver much higher developer velocity.

To analyze the streams of data coming from mobile, social, and sensors, developers need to write new analytics and machine learning algorithms, or adapt existing ones, to process the data. Spark allows developers to write in Scala and Python, in addition to Java, which makes coding much faster and the more succinct code easier to read. Spark also provides an execution engine, built around a DAG scheduler and resilient distributed datasets (RDDs), that goes far beyond the relatively simplistic model of MapReduce, making cluster programs much more intuitive to write. Execution runs with the data residing in memory rather than being fetched from disk, so the turn-around is much faster, and developers can iterate quickly on their programs.
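
For a sense of that succinctness, here's a minimal word-count sketch against the Spark 1.0-era RDD API in Scala; the application name and input path are hypothetical placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Standard Spark 1.x setup: a SparkConf plus a SparkContext
    val conf = new SparkConf().setAppName("WordCount")
    val sc   = new SparkContext(conf)

    // Load lines from a (hypothetical) HDFS path into an RDD
    val lines = sc.textFile("hdfs:///data/sample/input.txt")

    // A short chain of RDD transformations replaces hand-written mapper and reducer classes
    val counts = lines
      .flatMap(_.split("\\s+"))   // split each line into words
      .map(word => (word, 1))     // pair each word with a count of 1
      .reduceByKey(_ + _)         // sum the counts per word across the cluster

    counts.take(20).foreach(println)
    sc.stop()
  }
}
```

The same logic in classic Java MapReduce usually needs separate mapper and reducer classes plus driver boilerplate; here the intent of the computation stays visible in a few lines.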


Process data with 10x-100x speedup.

Spark specifically arranges the data processing operations so that as much of the processing as possible is done in memory on each compute node of the cluster. The result is a tremendous speedup in processing. Customers see more results more quickly, getting a higher return on their Hadoop investment.
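
The in-memory advantage is easiest to see in iterative jobs. Here is a hedged sketch (the path, field layout, and loop body are made up for illustration, and sc is the SparkContext from the earlier sketch): cache() asks Spark to keep an RDD in memory after the first pass, so later passes avoid re-reading from disk.

```scala
// Parse a (hypothetical) numeric dataset once, then keep it in cluster memory
val points = sc.textFile("hdfs:///data/sample/points.csv")
  .map(_.split(",").map(_.toDouble))
  .cache()                          // hint: retain the parsed rows in memory

// Each pass reuses the in-memory copy instead of re-reading HDFS
for (i <- 1 to 10) {
  val total = points.map(_.sum).reduce(_ + _)
  println(s"pass $i, running total = $total")
}
```

With MapReduce, each of those passes would be a separate job that reads its input from disk all over again.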


Offer investment protection for existing Hadoop.

Spark is compatible with traditional Hadoop infrastructure. Spark can run on Hadoop 2.0's YARN resource manager, and it can take data from Hive or HDFS. It can process that data even faster by keeping it in memory using an in-memory file system called Tachyon.
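
As a hedged illustration of that compatibility, a Spark job can read and write ordinary HDFS paths without any changes to the Hadoop cluster; the namenode address, paths, and record layout below are hypothetical, and sc is again a SparkContext.

```scala
// Read data that already lives in HDFS
val logs = sc.textFile("hdfs://namenode:8020/data/clickstream")

// Count clicks per user from tab-separated records (layout assumed for the example)
val clicksPerUser = logs
  .map(_.split("\t"))
  .map(fields => (fields(0), 1L))
  .reduceByKey(_ + _)

// Write the results back to HDFS, where existing Hadoop tools can pick them up
clicksPerUser.saveAsTextFile("hdfs://namenode:8020/output/clicks_per_user")
```

The packaged job can then be launched on the existing cluster with spark-submit and a YARN master (for example --master yarn-cluster in Spark 1.0), so the same nodes that run MapReduce today can run Spark.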

This last reason makes it much easier to consider Spark for an existing Hadoop investment, and Hadoop developers are paying attention and flocking to Spark meetups. Spark only just released version 1.0, so it is still relatively new and subject to the idiosyncrasies of newly developed open source software. Still, for many traditional Hadoop users and developers, the benefits above outweigh the risks of switching to Spark.

There are many other benefits to Spark, such as streaming; we'll reserve those for later discussions.



Data is the New Oil


This cartoon was posted on the screen at a recent BigData Guru meetup: "Data is the new oil." While this concept was first stated in 2006, it's still very relevant today, especially with the explosion of the internet, mobile, IoT, and the multitude of tools that let people generate content and enable transactions of all types. IBM indicated in 2013 that 90% of the world's data had been generated in the previous 2 years.

The premise is that much of the data from email, tweets, IM, Facebook posts, Google searches, Amazon purchases, eBay transactions, mobile and IoT sensors, and retail transactions can be mined for value, analogous to crude oil being processed and refined into gasoline and into the ingredients for plastics and other materials. The value of the data could show up as better targeting of customers with the right products, or as a better understanding of the environment from sensor data that raises productivity, such as increasing crop production or solar energy yield. The possibilities seem endless.

All the while, both the companies extracting the value from the data and the vendors producing tools and equipment (e.g. data center servers and storage) to enable this extraction are being rewarded handsomely with $$$. Data is the new oil.


Deep Learning with COTS Hardware

This is the claim by Nvidia CEO Jen-Hsun Huang: that 3 Nvidia Titan Z CUDA-based GPU cards could offer the same performance running deep learning neural nets as the Google Brain project achieved using Intel processors, at:

  • 1/150th the acquisition cost
  • 1/150th the power consumption

If this could be done in general for deep-learning-type problems, we could have many more machines doing machine learning on the explosion of data. At the same time, to use the CUDA cores, software programmers would need to learn to program this hardware with CUDA and/or OpenCL. The cost savings could warrant pushing through the learning curve.

The referenced paper: "Deep Learning with COTS HPC Systems" by A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, and B. Catanzaro, published at ICML 2013.
