Sunday, March 23, 2014

Understanding Data in Big Data


Data forms the foundation of all Big Data analytics. The first step in a Big Data project is to define the business objective. The second step is to understand the data needed for that project.

To make the most of big data, you have to start with the data you understand and trust.

To understand this better, let's continue with our retail example.

In many retail operations, there are multiple IDs for a customer: a customer ID, an account number, a loyalty card number, a customer email ID, and a customer web login ID. All these IDs are in different data types and formats.

So when the company started a big data project, defining the term "customer ID" was a big challenge. Adding to this confusion, each department had different ways and systems to track customer purchases. This meant that the project team had to spend extra time and effort to identify and document the various data sources and determine which data to use in which reports. Meanwhile, business leaders were hampered by multiple, inconsistent reports built from duplicated or missing data.

To harness the power of big data, companies must ensure that the source of all information is trustworthy and protected. We call this Master Data: it represents the absolute truth and forms the basis of all analytics. Once the master data is created and protected, organizations can trust the analytics and make decisions.
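To make this concrete, here is a minimal sketch, in Python, of how the various customer IDs from our retail example could be linked back to one master customer record. The field names and the simple email-based matching rule are illustrative assumptions; real master data management tools use far more robust matching.

from dataclasses import dataclass, field

@dataclass
class MasterCustomer:
    master_id: str                                 # the single "absolute truth" ID
    known_ids: dict = field(default_factory=dict)  # id_type -> value

# Records as they arrive from different departmental systems (hypothetical).
raw_records = [
    {"id_type": "loyalty_card", "value": "LC-4471", "email": "amy@example.com"},
    {"id_type": "web_login",    "value": "amy_w",   "email": "amy@example.com"},
    {"id_type": "account_no",   "value": "0098812", "email": "amy@example.com"},
]

# Here email is used as the matching key; a real system would also use
# address, phone, and fuzzy name matching.
masters = {}
for rec in raw_records:
    key = rec["email"].lower()
    master = masters.setdefault(key, MasterCustomer(master_id=f"CUST-{len(masters)+1:06d}"))
    master.known_ids[rec["id_type"]] = rec["value"]

for m in masters.values():
    print(m.master_id, m.known_ids)   # one master record, all IDs linked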

It is equally important to have a common business glossary for key terms and data sets. Once a common terminology is established, the next step is to define where each piece of data will come from and where it will go, i.e., define all sources of data and the destination of that data. This will help everyone involved in the big data project know exactly what each data term means, what each key metric means, where the data should originate, and where the data will be stored, thus establishing a "universal truth" for all the business data.
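A business glossary can start as simply as a shared dictionary of terms, each with its definition, source, and destination. The sketch below assumes illustrative term names and systems:

glossary = {
    "customer_id": {
        "definition": "Master identifier linking all customer-facing IDs",
        "source": "CRM master data hub",
        "destination": "data lake / analytics warehouse",
    },
    "daily_sales": {
        "definition": "Sum of completed POS transactions per store per day",
        "source": "point-of-sale system",
        "destination": "sales reporting mart",
    },
}

def describe(term):
    entry = glossary[term]
    return (f"{term}: {entry['definition']} "
            f"(from {entry['source']} to {entry['destination']})")

print(describe("customer_id"))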

Trustworthy Data

Business leaders are eager to harness the power of big data. However, as the opportunity increases, ensuring that source information is trustworthy and protected becomes exponentially more difficult. If this trustworthiness issue is not addressed directly, end users may lose confidence in the insights generated from their data, which can result in a failure to act on opportunities or against threats.

To make the most of big data, you have to start with data you trust.

Defining the universal truth is just the first step. The next step is to ensure that the data being generated from the source is indeed "trustworthy". Unfortunately, it is not enough just to establish policies and definitions and hope that people will follow them. To have truly "trustworthy" data, organizations must be able to trace its path through their systems and have adequate security controls to ensure that the data follows only the defined paths and does not get tampered with.

Data collected by the system must be controlled via some form of access and version control, so that the original data is not lost and any changes made can be tracked. There are many tools in the industry to secure and track data.
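To illustrate the idea, here is a minimal sketch of an append-only version history with checksums, so every change is traceable and tampering is detectable. This only demonstrates the concept; production systems would use dedicated data governance tools.

import hashlib, json, time

audit_log = []   # append-only version history: the original is never lost

def save_version(record_id, data, user):
    payload = json.dumps(data, sort_keys=True)
    audit_log.append({
        "record_id": record_id,
        "user": user,
        "timestamp": time.time(),
        "checksum": hashlib.sha256(payload.encode()).hexdigest(),
        "data": data,
    })

save_version("cust-001", {"name": "Amy", "tier": "gold"}, user="etl_job")
save_version("cust-001", {"name": "Amy", "tier": "platinum"}, user="crm_sync")

# Every change is traceable: who changed it, when, and what it looked like.
for entry in audit_log:
    print(entry["record_id"], entry["user"], entry["checksum"][:12])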

Defining "good data" 

In the case of big data, remember that the volume of data can be really large and can easily overwhelm any data collection system. So the data collection system must be tuned to collect only "good data", i.e., data that is useful in analysis, reporting and decision making.

For example, a sensor in a retail shop that counts the number of people in a particular aisle will keep collecting data even when the store is closed. It can also collect lots of other data, such as ambient temperature, which may be irrelevant to the analysis.

So organizations must be able to extract only those bits of data necessary to support a particular objective. This way, unnecessary data is kept out of the system, avoiding hardware and software costs.
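A minimal sketch of such filtering at the collection point, assuming hypothetical store hours and sensor field names:

from datetime import datetime

STORE_OPEN, STORE_CLOSE = 9, 21   # assumed opening hours: 9 AM to 9 PM
RELEVANT_TYPES = {"aisle_count"}  # keep people counts; ignore temperature etc.

def is_good_data(reading):
    ts = datetime.fromisoformat(reading["timestamp"])
    return (reading["type"] in RELEVANT_TYPES
            and STORE_OPEN <= ts.hour < STORE_CLOSE)

readings = [
    {"type": "aisle_count", "timestamp": "2014-03-23T14:05:00", "value": 7},
    {"type": "aisle_count", "timestamp": "2014-03-23T02:10:00", "value": 0},     # store closed
    {"type": "temperature", "timestamp": "2014-03-23T14:05:00", "value": 21.5},  # irrelevant
]

good = [r for r in readings if is_good_data(r)]
print(good)   # only the 2 PM aisle count survives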

Data Lake

Once all the good data is identified and collected, it has to be pooled together for analysis tools to work on it. Technically, one need not pool all the data into one system; it is enough to know where the good data is stored in safe, trustworthy locations. Together, these locations form the data lake: a collection of all valid data.

Please note that Big Data is also characterized by "Variety", "Volume", and "Velocity".

Continuing with our retail operations example, there is a whole "Variety" of data sources that can be tracked:

1. RFID data
2. Sensor/camera data
3. Data from social networking sites
4. Data from websites
5. Data from Twitter or SMS
6. Data from point-of-sale devices

A variety of data sources also implies different data types: structured transactional data, unstructured video/camera data, and metadata. So you now have multiple sources of data feeding in huge volumes of data at a rapid rate. This implies that your analytics must be able to process this large volume of data in a reasonable time, so that it can deliver analytical results in a meaningful way.

Oftentimes, companies do not want to process all the data in real time, both because of business objectives and because of customer behavior. For example, customers may react to a marketing campaign at different times, and this reaction can be seen on social networking sites, in tweets, SMS, email, or web comments at different times. So one must collect the data over a period of time for any useful analysis. This calls for a system that can handle this huge volume of data in the data lake.
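For instance, here is a minimal sketch of accumulating campaign responses from several channels over the campaign window before analyzing them; the channel names and dates are illustrative.

from collections import defaultdict

responses = [
    {"channel": "twitter", "date": "2014-03-10"},
    {"channel": "email",   "date": "2014-03-12"},
    {"channel": "sms",     "date": "2014-03-12"},
    {"channel": "web",     "date": "2014-03-18"},
]

# Customers react at different times, so aggregate per day and channel
# across the whole campaign window rather than analyzing instantly.
daily = defaultdict(int)
for r in responses:
    daily[(r["date"], r["channel"])] += 1

for (date, channel), count in sorted(daily.items()):
    print(date, channel, count)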

Data Lake Illustrated



All the data collected could reside in multiple locations, which are logically pooled together to form a "Data Lake".

Data collected in this data lake extends beyond on-premise applications and data warehouses. It may include data from social networking sites, customer tweets, websites, etc. This type of external data can be harder to collect and analyze than traditional transactional data. Potential insights can be buried in unstructured documents such as user-generated documents, spreadsheets, reports, email and multimedia files.

All this data is collected and secured in this data lake. The secure trusted data in the data lake then forms the basis for the big data analytics project.

Data from multiple sources is collected, sometimes selectively to limit the volume of data, and even processed to make unstructured data more useful. Even the metadata, i.e., the data which describes the main data, is also collected. For example, in the case of a photograph, the time/date, location, type of camera, and ambient conditions at that time are all metadata. Metadata is very important in understanding unstructured data.
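As a small illustration, the metadata embedded in a photograph can be read with a few lines of Python using the Pillow library. The file name below is a placeholder, and which tags are present varies by camera.

from PIL import Image
from PIL.ExifTags import TAGS

img = Image.open("store_photo.jpg")   # hypothetical image file
exif = img.getexif()

# Map numeric EXIF tag IDs to readable names: date/time, camera model, etc.
for tag_id, value in exif.items():
    print(TAGS.get(tag_id, tag_id), value)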

Data lakes act as repositories of all valid information: log files, user databases, transactional information, and behavioral information. Analytics is often run on the data stored in the data lake.

Analyzing big data at rest 

By analyzing large volumes of stored data, organizations can discover patterns and insights that allow them to optimize processes and profitability. In environments where you quickly need to gain insight into hundreds of millions or even billions of records, the ability to match master data at big-data scale is a must-have. Examples include reconciling large lead lists or matching citizens or customers against watch lists.
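Here is a minimal sketch of the matching logic behind such a watch-list check, using normalized names plus a fuzzy fallback. At real big-data scale this would run on a distributed platform; the names and threshold here are illustrative.

from difflib import SequenceMatcher

watch_list = {"john a smith", "maria garcia"}   # hypothetical entries

def normalize(name):
    return " ".join(name.lower().split())

def matches_watch_list(name, threshold=0.9):
    n = normalize(name)
    if n in watch_list:            # fast exact match first
        return True
    # fall back to fuzzy similarity for spelling variants
    return any(SequenceMatcher(None, n, w).ratio() >= threshold
               for w in watch_list)

print(matches_watch_list("John A. Smith"))   # True (near match)
print(matches_watch_list("Jane Doe"))        # False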

Analyzing big data in motion

The data lake must be capable of handling the high volumes of data that is being generated.

With certain kinds of data, often coming from sensors, there is no time to store it before acting on it because it is constantly changing. In cases such as fraud detection, health care monitoring, or traffic management, gaining real-time insights with high-speed analytics is vital. High-velocity, high-volume data calls for in-motion analytics.

Handling such volumes of data can be daunting. Analyzing hundreds of millions of information bits in real time calls for systems that can analyze data in motion. These data-in-motion analytics require dedicated systems that process data as it is generated and send alerts in real time. In addition, the processed data is also captured in the data lake for future analysis.

Data-in-motion analytics systems analyze large data volumes with micro-latency. Rather than accumulating and storing data first, the software analyzes data as it flows in and identifies conditions that trigger alerts, such as credit card fraud. Along with the alert, the processed data is also stored in the data lake and can be analyzed with other data for better business outcomes.
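A minimal sketch of this analyze-then-store pattern, with a deliberately simplistic fraud rule and made-up events:

data_lake = []   # stand-in for durable data-lake storage

def process_event(event):
    # Analyze in flight, before storing: flag unusually large charges.
    if event["amount"] > 5000:
        print(f"ALERT: possible fraud on card {event['card']} "
              f"(${event['amount']})")
    data_lake.append(event)   # capture for future at-rest analysis as well

stream = [
    {"card": "4471", "amount": 42.50},
    {"card": "4471", "amount": 8200.00},   # triggers a real-time alert
    {"card": "9002", "amount": 15.00},
]

for event in stream:
    process_event(event)

print(f"{len(data_lake)} events captured in the data lake")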

Securing Data Lake

The main advantage of creating a data lake is that it acts as a single repository of all data, which users can trust and use in their analytics. Access to this data lake can be controlled and protected with a unified data protection system. Data lakes will require role-based access, policies, and policy enforcement.

Oftentimes, all data entering the data lake is tagged with metadata. Metadata on data security, access controls, and end-user rights is attached to the data: each data element carries a set of metadata tags and attributes that describe the data, who can access it, and how it should be accessed and handled. This tagging is rule-based and can be enforced with data management tools.
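Here is a minimal sketch of such rule-based tagging and a role-based access check; the tag names, roles, and rules are illustrative assumptions.

def tag_element(element):
    # Rule-based tagging: classify and attach access metadata on ingest.
    tags = {"classification": "internal", "allowed_roles": {"analyst", "admin"}}
    if "email" in element or "card" in element:
        tags["classification"] = "pii"
        tags["allowed_roles"] = {"admin"}   # tighter access for personal data
    element["_meta"] = tags
    return element

def can_access(element, role):
    return role in element["_meta"]["allowed_roles"]

record = tag_element({"customer": "Amy", "email": "amy@example.com"})
print(record["_meta"]["classification"])   # pii
print(can_access(record, "analyst"))       # False
print(can_access(record, "admin"))         # True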

Protecting the entire data lake is often cheaper than securing each individual component of data. This way, organizations can check, monitor and report on data security, privacy and regulatory compliance. Data lakes thus act as a single secure and trusted repository of information.

This is the third part in the Understanding Big Data series. The first two parts are:

Part-1. Understanding big data
Part-2. Defining the Business Objectives for Big Data Projects
