Wednesday, March 19, 2014

Understanding Big Data


Recently, I was invited to give a talk to engineering college students about big data. I decided to write it up, and here it is.

Big data is now a hot buzzword in the IT sector. Business leaders are eager to harness its power and are spending serious amounts of money on it.

With big data, companies can predict next month's sales, forecast future pricing, and understand customer preferences in ways that were unimaginable before.

While big data promises the sky with a beautiful rainbow, successfully harnessing its power is a real challenge. To begin with, let's understand what big data is.

Big data is just that: huge volumes of data that current IT systems cannot handle. Organizations leave this data unprocessed because it does not fit into their existing databases or data warehousing technologies.

To understand this better, let me explain with an example.

Consider a retail operation where you would like to know: how did sales this month compare with the previous month, and with the same month last year?
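The comparison in that question comes down to two percentage changes. Here is a minimal Python sketch of it. The sales figures and the names `monthly_sales`, `percent_change` and `compare_month` are made up for illustration and do not come from any real reporting system.

```python
def percent_change(current: float, baseline: float) -> float:
    """Return the percentage change from baseline to current."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline * 100.0


# Hypothetical monthly sales totals keyed by (year, month); not real data.
monthly_sales = {
    (2013, 3): 100_000.0,
    (2014, 2): 110_000.0,
    (2014, 3): 121_000.0,
}


def compare_month(sales, year, month):
    """Compare one month with the previous month and the same month last year."""
    current = sales[(year, month)]
    # January's previous month is December of the year before.
    prev_key = (year, month - 1) if month > 1 else (year - 1, 12)
    return {
        "vs_previous_month": percent_change(current, sales[prev_key]),
        "vs_same_month_last_year": percent_change(current, sales[(year - 1, month)]),
    }


report = compare_month(monthly_sales, 2014, 3)
for label, change in report.items():
    print(f"{label}: {change:+.1f}%")
```

Notice that the only input here is sales totals, which is exactly the limitation discussed next.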

The traditional way of reporting this is to look at sales transactions and sales revenues, and to check the ERP systems for the merchandise sold. But this does not give the full picture of customers. It does not capture the number of footfalls. It does not tell you how many items customers looked at before buying one product, or how much time each customer spent in the shop. It does not tell you which items customers wanted to buy but could not, because that shop had no inventory. In other words, traditional IT reporting systems present only a tiny part of the full picture.

Big data can fill in the rest of the picture by capturing data from various other systems.

What is Big Data?

Big data is characterized by four parameters: Variety, Volume, Velocity and Veracity. These are often referred to as the 4 Vs.

Variety

In simple terms, big data means looking at the same parameter (monthly sales, in our example) in different ways, drawing on what is often called a variety of data sources.

Continuing with our retail example, there is a wide variety of data sources that can be tracked:

1. RFID-enabled shopping carts that tell how much time each customer spent in the shop.
2. RFID readers in the cart that identify which items a customer put in the cart and then did not buy.
3. Cameras and sensors that show which areas of the shop were overcrowded, preventing customers from shopping in that aisle.
4. Sensors and RFID readers that show which items customers looked at before selecting an item.
5. Sensor data that tells when each customer came to the shop.
6. Sensor data that tells the time taken at the billing section.
7. Internet search information from each customer at the online store.

This wide variety of data is valuable, yet today's business IT systems cannot handle it, let alone use it for analyzing monthly sales.

A wide variety of data sources also implies that IT systems must be agile enough to accommodate a wide variety of data types (transaction data, video feeds, sensor data, etc.) and integrate seamlessly with various data processing technologies.

All of this data is stored in a dizzying array of formats. Some is stored in structured formats in databases or enterprise resource planning (ERP) applications, but much of it (video feeds, sensor data, photos) is unstructured data that must still be managed, stored and processed.

Data doesn't rest once it is in storage. It must be moved from application to application and from system to system so managers and executives can interpret the data and come to meaningful conclusions.

Volume

In traditional IT systems, each customer transaction translated to a few KB of data. If we consider all the different data sources, the volume of data generated per transaction runs to several megabytes or even gigabytes!

Increasing data volume is at the heart of the big data challenge. Large data volumes cause many obvious technical problems, such as excessive batch processing times and bottlenecks.

The sheer volume and complexity of big data mean that the traditional methods of managing data do not apply to these new sources of data. A completely new system has to be built to collect, govern and process these large volumes of data, and new automation is needed to integrate, govern and use it.

Researchers at IDC estimate that by the end of 2013, the amount of stored data will exceed 4 zettabytes, or 4 billion terabytes. That's 50 percent more data than the digital universe held at the end of 2012, and four times as much as in 2010.
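These figures are easy to check with a few lines of arithmetic. The sketch below assumes decimal (SI) units, where one zettabyte is 10^21 bytes and one terabyte is 10^12 bytes. The 2012 and 2010 values are only what the quoted percentages imply, not numbers published by IDC.

```python
# Decimal (SI) unit sizes in bytes; an assumption, since the estimate does not say
# whether decimal or binary units are meant.
ZETTABYTE = 10**21
TERABYTE = 10**12

stored_2013_zb = 4  # the estimate quoted above

# 4 ZB expressed in terabytes.
stored_2013_tb = stored_2013_zb * ZETTABYTE // TERABYTE

# "50 percent more than 2012" means 2013 = 1.5 x 2012.
implied_2012_zb = stored_2013_zb / 1.5

# "Four times as much as in 2010" means 2013 = 4 x 2010.
implied_2010_zb = stored_2013_zb / 4

print(f"2013: {stored_2013_tb:,} TB")
print(f"implied 2012: {implied_2012_zb:.2f} ZB")
print(f"implied 2010: {implied_2010_zb:.2f} ZB")
```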

Velocity

Every second of every day, businesses generate more data. What used to be several MB of data per hour with the older systems is now several GB!

In our retail example, data arrives at several megabytes every second, faster than ever. Data from video cameras, RFID readers and sensors is being collected at intervals of seconds or even microseconds.

IT systems must be able to understand data as it streams in, store it, process it quickly, and move it quickly from one application or repository to another, where it can be processed and analyzed.
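To understand data as it streams in, a system usually updates a small running summary for each event instead of re-reading a stored batch. Here is a minimal Python sketch of one such summary, a sliding-window event counter. The class name `SlidingWindowCounter` and the sample timestamps are hypothetical, and the sketch assumes events arrive in time order.

```python
from collections import deque


class SlidingWindowCounter:
    """Count the events seen in the last `window_seconds` seconds.

    Assumes timestamps are passed in non-decreasing order.
    """

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def add(self, timestamp: float) -> int:
        """Record one event and return the count inside the current window."""
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have fallen out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


# Hypothetical arrival times (in seconds) of sensor readings from one aisle.
readings = [0.0, 1.0, 2.5, 9.0, 11.5, 12.0, 30.0]

counter = SlidingWindowCounter(window_seconds=10.0)
counts = [counter.add(t) for t in readings]
print(counts)
```

Only the timestamps inside the window are kept in memory, so the cost per event stays small no matter how long the stream runs.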

Unfortunately, many of the older data integration solutions lack the high performance that big data projects require. They are too slow to collect the data and process it in real time or near real time.

Validity or Veracity

The first three Vs: Variety, Volume and Velocity define big data.

But the fourth V, veracity, is the most important for business analytics.

Veracity is the validity of data. Once the data is collected and stored, understanding which of the data sets are valid for a particular business analysis is of paramount importance. Running analysis on invalid data produces invalid or useless results.

For Big data to be valuable, the data must be valid.

Often, big data is collected into a vast pool of data, commonly referred to as a "data lake". Data coming into this data lake has to be indexed, categorized and verified to ensure that it is accurate, current and complete before analysis is run and business decisions are made.
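The three checks named above (complete, current, accurate) can be sketched as a small validation step that runs before analysis. This is only an illustration in Python. The field names, the 30-day freshness limit and the plausibility range are assumptions made for the example, not part of any particular data lake product.

```python
from datetime import datetime, timedelta

# Hypothetical schema for one sensor reading; not from any real system.
REQUIRED_FIELDS = ("store_id", "sensor_id", "recorded_at", "dwell_seconds")


def validate_record(record, now, max_age=timedelta(days=30)):
    """Return a list of problems found; an empty list means the record passed."""
    # Complete: every required field is present and not None.
    missing = [f for f in REQUIRED_FIELDS if record.get(f) is None]
    if missing:
        return ["incomplete: missing " + ", ".join(missing)]

    problems = []
    # Current: the reading is no older than max_age.
    if now - record["recorded_at"] > max_age:
        problems.append("stale: older than max_age")
    # Accurate (a crude plausibility check): dwell time fits within one day.
    if not 0 <= record["dwell_seconds"] <= 24 * 3600:
        problems.append("implausible: dwell_seconds out of range")
    return problems


now = datetime(2014, 3, 19)
# Made-up sample records covering each failure case.
records = [
    {"store_id": "S1", "sensor_id": "cart-7", "recorded_at": datetime(2014, 3, 18), "dwell_seconds": 1260},
    {"store_id": "S1", "sensor_id": "cart-9", "recorded_at": datetime(2013, 11, 2), "dwell_seconds": 900},
    {"store_id": "S1", "sensor_id": None, "recorded_at": datetime(2014, 3, 18), "dwell_seconds": 300},
    {"store_id": "S2", "sensor_id": "cart-3", "recorded_at": datetime(2014, 3, 17), "dwell_seconds": -5},
]

checked = [(r, validate_record(r, now)) for r in records]
valid = [r for r, problems in checked if not problems]
print(f"{len(valid)} of {len(records)} records are fit for analysis")
```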

For any organization to take advantage of the opportunities available with big data, it must have IT systems, processes and solutions that can handle all four Vs.

It starts with discovering the different sources of data, setting up systems for collecting that data, and building IT systems to govern, process and store large volumes of data. The IT system must be agile enough to accommodate a wide variety of data and integrate seamlessly with various technologies, and it must automatically discover, protect and monitor sensitive information as part of big data applications.

Big data, big opportunities

Big data presents big opportunities for increased growth and profitability.

Organizations recognize that big data contains valuable information. They are eager to analyze it to obtain actionable insights that help them make better decisions, improve sales and profits, and identify new revenue opportunities.

Forward-thinking organizations, such as Flipkart and Myntra in India and Amazon, Wal-Mart and WholeFoods in the USA, are already realizing some of these benefits.


  • Myntra plans to reach revenues of Rs 1,500 crores by 2015.
  • Flipkart is expected to have revenues of Rs 6,000 crores by 2014.
  • WholeFoods, a specialty food retailer, uses social media data to analyze customer sentiment and improve loyalty, helping to drive revenue growth.
  • FedEx uses machine data to optimize logistics, thereby reducing shipping costs.


Big data promises big advantages, but capturing these benefits requires knowing what the business needs and being able to find the key items within the larger mass of data.
One has to start by articulating the business goals, which help determine:

1. What data to collect
2. When & how to collect
3. How to store & process

Only then can the business generate the analytics and reports necessary to support those objectives, and business leaders can make meaningful decisions.
