Sunday, 28 February 2016

Big Unstructured Data vs Structured Relational Data

With emerging technologies, the volume of data is growing at an astounding rate. Big Data refers to huge datasets characterized by large volume, complexity and variety. The term is closely associated with unstructured data, which is often estimated to make up about 90% of big data.




So what is unstructured data? How different is it from structured data?

UNSTRUCTURED DATA
Unstructured data is any information that does not have a predefined data model and cannot be organized in a predefined manner. This type of data does not fit neatly into relational databases and is difficult to search with conventional query methods. Unstructured data mainly includes text and multimedia content. A few examples are e-mails, Word documents, videos, photos, audio files, presentations, web pages and other kinds of business documents.

Organizations are looking towards unstructured data to extract important information which can be useful in strategic decision-making. They are adopting various technological solutions such as Hadoop, Business Intelligence software, data mining tools and data integration software to gain more accurate insights and achieve competitive advantage.

STRUCTURED DATA
In contrast, structured data is data that can be easily organized. It resides in a fixed field within a record or file. Data that can be stored in relational databases or in spreadsheets is generally structured data. Structured data storage depends on the creation of a data model, which defines the business data along with its data types (numeric, alphanumeric, Boolean, etc.), its data constraints (primary key, referential integrity, check, not null, etc.) and its metadata. Some examples of structured data are call detail records, such as time of call and caller and receiver information, and Point-of-Sale data, such as credit card details, product information and location of sale.

Although structured data makes up only about 10% of the total data available, it plays a critical role in data analytics and serves as the backbone of critical business insights. The advantage of structured data is that it can be easily entered, stored, queried and analyzed. We can use Structured Query Language (SQL) to manage structured data. SQL lets us insert, update and delete records, and query the data to fetch the desired results.
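To make the idea of a data model with types and constraints concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns and sample rows are hypothetical and only illustrate the call detail record example mentioned above; a production system would use its own schema and database.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical data model: data types plus primary key, referential
# integrity, not null and check constraints.
conn.executescript("""
CREATE TABLE subscriber (
    subscriber_id INTEGER PRIMARY KEY,
    phone_number  TEXT NOT NULL UNIQUE
);
CREATE TABLE call_record (
    call_id     INTEGER PRIMARY KEY,
    caller_id   INTEGER NOT NULL REFERENCES subscriber(subscriber_id),
    receiver_id INTEGER NOT NULL REFERENCES subscriber(subscriber_id),
    started_at  TEXT NOT NULL,
    duration_s  INTEGER NOT NULL CHECK (duration_s >= 0)
);
""")

# Made-up sample rows.
conn.executemany("INSERT INTO subscriber VALUES (?, ?)",
                 [(1, "555-0100"), (2, "555-0101")])
conn.executemany("INSERT INTO call_record VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 2, "2016-02-28T09:00:00", 120),
    (2, 2, 1, "2016-02-28T10:30:00", 45),
    (3, 1, 2, "2016-02-28T11:15:00", 300),
])

# Query: total call duration per caller.
total_by_caller = conn.execute(
    "SELECT caller_id, SUM(duration_s) FROM call_record "
    "GROUP BY caller_id ORDER BY caller_id"
).fetchall()

# The check constraint rejects a row with a negative duration.
try:
    conn.execute(
        "INSERT INTO call_record VALUES (4, 1, 2, '2016-02-28T12:00:00', -5)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Because the model is declared up front, the database itself can refuse invalid rows, which is part of what makes structured data easy to query and trust.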



DIFFERENT DATA TYPES AVAILABLE TO ORGANIZATIONS

1. Integrated Operational Data
It is a volatile collection of data that supports an organization’s daily business activities. This data store contains dynamic data which is constantly updated by business operations. It connects all scattered operational data stores with a software tool so that they work as one large efficient system.

2. Spatial Data
This is information that has several dimensions. It includes both geospatial and structo-spatial data and emphasizes location. It is commonly used in geographical databases and comes in different formats such as map coordinates, images taken from space and remote sensing data.

3. Redundant Data
This form of data refers to data duplication, wherein the same data is stored at multiple data sites. Redundant data can enter organizational systems unnoticed and lead to undesirable results such as reduced performance, inaccurate results and compromised data integrity. It can also put data quality at risk if the copies are not updated together.

4. Legacy Data
This refers to information which is stored in an old or obsolete format, making it difficult to access or process. Sources of legacy data include XML documents, flat files and relational databases. This form of data adds to the disparity in data warehouse systems.

5. Historical Data
Historical data is digital information that outlines a company’s past activities and trends. This data is often archived, and may be held in non-volatile, secondary storage. Historical data can be useful to perform predictive analysis.


ROLE OF DATA WAREHOUSING
A data warehouse is a central repository of aggregated data from various data sources. It is used to store large amounts of data, such as analytics, historical or customer data, and to build reports that help managers make informed decisions. We can also analyze data over different time periods, further improving business decision-making.



LIMITATIONS OF DATA WAREHOUSING IN TERMS OF ANALYZING DATA

High maintenance
Data warehouse systems require high maintenance. Any reorganization of the business processes or the source systems may affect the data warehouse and, consequently, result in high maintenance costs.


Required data not captured

The required data may not be completely captured by the source systems in some cases. This data may be important for data warehousing purposes. For example, the registration data of property may not be used in source systems but it may be useful for data analysis.


Data Flexibility

Data warehouses hold static data, which limits how flexibly it can be analyzed. Moreover, since the data is imported and filtered through a schema, it may already be old by the time it is actually used. Data warehouses are also subject to ad hoc queries, which makes it difficult to tune processing and query speed.


Underestimation of resources of data loading

In certain scenarios, we may underestimate the time required to extract, clean and load the data into the warehouse. This can take a significant proportion of the total development time. However, we can reduce the time and effort spent on this process with the use of suitable tools.
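The extract, clean and load steps mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV content, the column names and the orders table are hypothetical, and a real warehouse load would read from actual source systems and handle far more edge cases.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; a real source would be a file or an operational database.
RAW_CSV = """order_id,customer,amount
1, Alice ,19.99
2,Bob,
3,alice,5.00
"""

def extract(text):
    """Read raw rows from CSV text as dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Trim whitespace, normalize customer names and drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # incomplete record: skip it rather than guess a value
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().title(), float(amount)))
    return cleaned

def load(conn, rows):
    """Create the target table if needed and insert the cleaned rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, clean(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT customer, amount FROM orders ORDER BY order_id").fetchall()
```

Even in this tiny example most of the code sits in the cleaning step, which hints at why the effort for data loading is easy to underestimate.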

FUTURE OF DATA WAREHOUSING
The data warehouse market has begun to evolve with the advent of big data. In the past, data warehouses were designed to handle only structured data stored in ERP systems. The new torrent of data from social media, sensors, satellites, cameras etc. has made it essential for data warehouses to develop advanced statistical capabilities for performance analytics and forecasting. There have already been profound changes in data warehousing but what is the future of data warehousing and how will it affect businesses? These questions have been answered below:

Data warehouses will continuously adjust their standing due to technologies such as Hadoop. It will become critical to reduce errors and speed migration of database schema as the data warehouses continue to evolve with development, test and production environments.

The data warehouse of the future is a fluid system which will bring resources online as organizational needs evolve. A modern front-end will allow you to choose the data sets you want to query and will bring them together while hiding all the complexities. 

The evolution in data and in cloud services will make it a necessity to have a cloud-based solution for data warehousing and analytics. Cloud-based solutions will be critical to helping organizations expand access to data and analytics as well as increase their agility with data. These solutions will offer performance on demand and support a wide range of analytics without overhead costs by taking advantage of the flexibility and cost model of the cloud.




REFERENCES

http://www.kimballgroup.com/2015/12/design-tip-180-the-future-is-bright/

https://www.betterbuys.com/bi/future-of-data-warehousing/

http://www.smartdatacollective.com/michelenemschoff/206391/quick-guide-structured-and-unstructured-data

http://www.webopedia.com/TERM/U/unstructured_data.html

http://www.webopedia.com/TERM/S/structured_data.html



Thursday, 18 February 2016

Dimensional Model for Uber

COMPANY BACKGROUND


Uber Technologies Inc. is an online transportation network company with its headquarters in San Francisco, California. It provides services in 58 countries and 300 cities worldwide. The company develops and operates the Uber mobile app, which allows customers to request a trip through their smartphones. Once a ride request is confirmed, the driver and rider are connected through GPS, allowing both parties to track each other's location. The total fare for each ride is calculated as a base fare plus additional per-minute and per-mile charges. All payments, from charging the customer to depositing money in the driver's account after deducting a fee, are processed by the company in the background, keeping the entire transaction completely cashless.
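The fare structure described above (base fare plus per-minute and per-mile charges, with a fee deducted before the driver is paid) can be sketched as follows. The rate values and the 20% fee are hypothetical numbers chosen for illustration, not Uber's actual rates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RateCard:
    """Hypothetical per-service rates; real rates vary by city and service."""
    base_fare: float
    per_minute: float
    per_mile: float

def fare(rates, minutes, miles):
    """Total fare = base fare + per-minute charge + per-mile charge."""
    return round(
        rates.base_fare + rates.per_minute * minutes + rates.per_mile * miles, 2)

def driver_payout(total_fare, fee_rate):
    """Amount deposited to the driver after the company's fee is deducted."""
    return round(total_fare * (1 - fee_rate), 2)

# Illustrative values only.
example_rates = RateCard(base_fare=2.00, per_minute=0.25, per_mile=1.50)
trip_fare = fare(example_rates, minutes=20, miles=6)
payout = driver_payout(trip_fare, fee_rate=0.20)
```

With these assumed rates, a 20-minute, 6-mile trip costs 2.00 + 5.00 + 9.00 = 16.00, of which 12.80 goes to the driver.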

Uber offers the following services based on the different types of customers:
· UberX: It is the least expensive service with a maximum capacity of 4 passengers.
· UberXL: This service provides the customers with an SUV or a minivan. This can seat 6 passengers and has higher prices as compared to UberX.
· UberSelect: This is a premium service offered by Uber where the customer gets to travel in a luxury sedan.

As of 2014, Uber had 100,000 trips per week in each of the largest cities and hundreds of thousands of active users in those cities.

PERFORMANCE METRICS

The important metrics that would need to be tracked for a performance analysis of Uber are:
Number of Active Users
The number of users who are actively using Uber for transportation.
Number of Active Drivers
The number of drivers who are actively offering rides. Uber gives its drivers flexibility and has low barriers to entry, so many drivers try the service, but the retention rate declines over time.
Total Trips per Week
This is the total number of trips that are made per city in a week.
Revenue
The revenue earned by Uber varies with factors such as the rates that customers are charged (regular or surge prices) and the commission fee of 20-25% that Uber deducts from each fare.
Average Driver Ratings
This is the average of all the ratings a driver receives from users. It can be helpful for tracking driver performance.

These are the metrics that Uber would be interested in analyzing in order to determine the effectiveness of its business. However, for the dimensional model below I will consider only the revenue and the total trips per week.

DIMENSIONAL MODEL


The dimensional model that can be used to measure these two metrics is a Periodic Snapshot fact table, which captures ride details on a weekly basis. The Periodic Snapshot has been chosen because we need to model the business process (in this case, the details of rides) at a regular interval, i.e. weekly.
The grain of the fact table is the total number of trips made by each driver in each week. The dimensions involved are Customers, Drivers, Pickup and Drop Location, and Week. Instead of the Date dimension, I have used the Week dimension to represent the time period at a coarser level of detail (weekly rather than daily).
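One possible way to lay out this model as a star schema is sketched below with Python's sqlite3 module. All table and column names, the sample rows, and the choice of one fact row per week, driver, customer, pickup location and drop location are assumptions made for illustration; the location dimension is reused for both pickup and drop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_week     (week_key INTEGER PRIMARY KEY, year INTEGER, week_of_year INTEGER);
CREATE TABLE dim_driver   (driver_key INTEGER PRIMARY KEY, driver_name TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT);

-- Periodic snapshot: one row per week for each combination of dimension keys.
CREATE TABLE fact_weekly_trips (
    week_key            INTEGER REFERENCES dim_week(week_key),
    driver_key          INTEGER REFERENCES dim_driver(driver_key),
    customer_key        INTEGER REFERENCES dim_customer(customer_key),
    pickup_location_key INTEGER REFERENCES dim_location(location_key),
    drop_location_key   INTEGER REFERENCES dim_location(location_key),
    trip_count          INTEGER NOT NULL,
    revenue             REAL NOT NULL,
    PRIMARY KEY (week_key, driver_key, customer_key,
                 pickup_location_key, drop_location_key)
);
""")

# Made-up sample data.
conn.executemany("INSERT INTO dim_week VALUES (?, ?, ?)",
                 [(201607, 2016, 7), (201608, 2016, 8)])
conn.executemany("INSERT INTO dim_driver VALUES (?, ?)",
                 [(1, "Driver A"), (2, "Driver B")])
conn.executemany("INSERT INTO dim_customer VALUES (?, ?)",
                 [(1, "Customer A"), (2, "Customer B")])
conn.executemany("INSERT INTO dim_location VALUES (?, ?)",
                 [(1, "San Francisco"), (2, "Oakland")])
conn.executemany("INSERT INTO fact_weekly_trips VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (201607, 1, 1, 1, 1, 3, 30.0),
    (201607, 2, 2, 1, 2, 1, 12.5),
    (201608, 1, 2, 2, 1, 2, 20.0),
])

# Roll the snapshot up to the two chosen metrics: total trips and revenue per week.
weekly_totals = conn.execute("""
    SELECT w.year, w.week_of_year, SUM(f.trip_count), SUM(f.revenue)
    FROM fact_weekly_trips AS f
    JOIN dim_week AS w ON w.week_key = f.week_key
    GROUP BY w.year, w.week_of_year
    ORDER BY w.year, w.week_of_year
""").fetchall()
```

Grouping by other dimension attributes, such as the pickup city or the driver, would answer questions like trips per city per week from the same fact table.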


REFERENCES
1. https://en.wikipedia.org/wiki/Uber_(company)
2. http://time.com/3556741/uber/

Wednesday, 3 February 2016

Evaluation of Business Intelligence & Analysis Products

The exponential growth of data has rapidly increased the demand for Business Intelligence and Analysis products. Companies are looking at these products to leverage Big Data by converting it into useful information to improve decision-making and increase business efficiency.

For comparison of various Business Intelligence and Analysis products that are available in the market, I have categorized the evaluation criteria into two different groups - Technology Assessment and Execution. This provides a complete picture of the features and helps in analyzing them better.

1. Technology Assessment
This is the assessment of the features and services provided by the various products.
a. Mobility
It should allow usage of BI applications on mobile devices such as smartphones or tablets.

b. Data Sourcing
It should provide access to a number of databases and file types such as CSV, Excel, text and XML. The specific details can be customized based on the needs of the organization.

c. Visualization
The product should allow querying of data, analyzing that data and then creating visualizations in the form of line, bar, pie and area charts.

d. Training
This assesses whether training for specific features or processes is provided. The training can be in the form of in-person classes, online classes or Web recordings.

2. Execution
This assessment criterion reviews the capability of the products.
a. Ease of Use
This indicates how simple the product is to use for BI and administrative users.

b. Scalability
Scalability is assessed in terms of the number of concurrent users as well as data metrics such as volume, variety and veracity.

c. Pricing
This is often the biggest factor when an organization selects a product.

The five products that I have selected for evaluation are:

1. Tableau
Tableau is a self-service Business Intelligence tool which provides a wide variety of visualizations and supports an extensive range of data types for easy exploration and analysis of business data.



Features:  
The software has a drag-and-drop feature, which makes it convenient for the user to add data for analysis.
Tableau has an extensive online library and learning community. The live and on-demand webinars, online support documents, discussion forums are great resources to understand the interfaces and controls of the software.  
The pricing varies according to the desired version. Tableau Desktop comes in two different versions, Professional and Personal. The Professional edition costs $1,999 per user and covers more than 40 different connections. The Personal edition costs $999 per user and can handle six connections.

2. TIBCO Spotfire
TIBCO Spotfire helps companies integrate and synthesize big data into an easily understood format. The aim is to create actionable data reporting through tools anyone in the organization can utilize to manipulate and understand essential information about the company.
Features:
It allows the users to leverage existing database skill sets to create applications quickly.
It allows for instant visualization and interaction with organizational data.
The feature of location-based analytics takes e-commerce to a higher level.
The desktop version of Spotfire, which runs on Windows, is available by subscription for $650 per year.

3. SAP Business Objects
SAP Business Objects is a platform for organizations to manage and optimize Business Intelligence. It has a centralized portal, from which users can perform data cleaning activities, create dashboards and different reports such as OLAP, ad-hoc and Crystal Reports.



Features:
SAP Business Objects gives its users mobile access to reports and other data.
The platform possesses a semantic layer which ensures data accuracy and consistency.
The solution can be easily integrated with other enterprise solutions such as Salesforce and Microsoft Office.

4. Domo
Domo is a cloud-based business management suite which allows integration with multiple data sources such as spreadsheets, social media and databases.



Features:
It includes interactive visualization tools and instant access to data using customized dashboards.
Domo is also available for all mobile devices such as tablets and smartphones.
Pricing is based on annual subscription and varies according to the number of users. 

5. Chartio 
Chartio is a self-service business intelligence tool which simplifies performing database joins. It is heavily dependent on knowledge of SQL.


Features:
It supports more than a dozen different data sources.
The online documentation is not very well-organized and does not permit easy understanding of the software.
The pricing is around $2,000 per user making it one of the most expensive BI tools.


WEIGHTED ANALYSIS OF THE BI AND ANALYSIS TOOLS

These five BI and Analysis products have been analyzed against a common set of criteria to evaluate their features and identify the best product among them.
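A weighted analysis of this kind can be expressed as a weighted average of per-criterion ratings. The sketch below uses the seven criteria defined earlier, but the weights, the 1-5 rating scale and the two placeholder products are hypothetical; they are not the ratings used in this evaluation.

```python
# Hypothetical weights per criterion (a higher weight means more important).
WEIGHTS = {
    "mobility": 1,
    "data_sourcing": 2,
    "visualization": 3,
    "training": 1,
    "ease_of_use": 2,
    "scalability": 2,
    "pricing": 3,
}

def weighted_score(ratings, weights):
    """Weighted average of the ratings, using the given criterion weights."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(ratings[name] * w for name, w in weights.items()) / total_weight

# Placeholder ratings on an assumed 1-5 scale, not real product data.
ratings_by_product = {
    "Product A": {**{name: 3 for name in WEIGHTS}, "pricing": 5},
    "Product B": {**{name: 4 for name in WEIGHTS}, "pricing": 2},
}

# Rank the products from highest to lowest weighted score.
ranking = sorted(
    ((name, weighted_score(r, WEIGHTS)) for name, r in ratings_by_product.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
```

Changing the weights changes the ranking, so the weights should reflect what matters most to the organization doing the evaluation.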


I have rated TIBCO Spotfire the highest due to its easy-to-use and stylish dashboards and its ability to connect to various databases and to applications such as Excel, ERP, CRM and MS Access. It also allows collaboration on data from mobile devices and has a low cost of ownership. Thus, my recommendation for any organization looking for a Business Intelligence and Analysis product would be TIBCO Spotfire.