
Follow a 3-paragraph format: define, explain in detail, then present an actual example via research. Your paper must provide an in-depth analysis of all the topics presented:

> Read cases and white papers that talk about Big Data analytics. Present the common theme in those case studies.

> Review the following Big Data Tutorial (attached).

> Choose one of the three applications for big data presented (Recommendation, Social Network Analytics, or Media Monitoring).

> Provide a case study of how a company has implemented the big data application and from your research suggest areas of improvement or expansion.

Need 8-10 pages in APA format with an introduction and conclusion. Must include a minimum of 9 peer-reviewed citations.

Marko Grobelnik, Blaz Fortuna, Dunja Mladenic

Jozef Stefan Institute, Slovenia

Sydney, Oct 22nd 2013

 Big-Data in numbers

 Big-Data Definitions

 Motivation

 State of Market

 Techniques

 Tools

 Data Science

 Applications ◦ Recommendation, Social networks, Media Monitoring

 Concluding remarks

 'Big-data' is similar to 'Small-data', but bigger

 …but having bigger data requires different approaches: ◦ techniques, tools, architectures

 …with an aim to solve new problems ◦ …or old problems in a better way.

 Volume – challenging to load and process (how to index, retrieve)

 Variety – different data types and degrees of structure (how to query semi-structured data)

 Velocity – real-time processing influenced by rate of data arrival

From “Understanding Big Data” by IBM

 1. Volume (lots of data = "Tonnabytes")

 2. Variety (complexity, curse of dimensionality)

 3. Velocity (rate of data and information flow)

 4. Veracity (verifying inference-based models from comprehensive data collections)

 5. Variability

 6. Venue (location)

 7. Vocabulary (semantics)

Comparing volume of “big data” and “data mining” queries

…adding “web 2.0” to “big data” and “data mining” queries volume


 Key enablers for the appearance and growth of “Big Data” are:

◦ Increase of storage capacities

◦ Increase of processing power

◦ Availability of data

Source: WikiBon report on “Big Data Vendor Revenue and Market Forecast 2012-2017”, 2013

 …when the operations on data are complex: ◦ e.g. simple counting is not a complex problem ◦ modeling and reasoning with data of different kinds can get extremely complex

 Good news about big-data: ◦ often, because of the vast amount of data, modeling techniques can get simpler (e.g. smart counting can replace complex model-based analytics)… ◦ …as long as we deal with the scale

 Research areas (such as IR, KDD, ML, NLP, SemWeb, …) are sub-cubes within the data cube

(Data-cube dimensions: Scalability, Streaming, Context, Quality, Usage)

 A risk with "Big-Data mining" is that an analyst can "discover" patterns that are meaningless

 Statisticians call it Bonferroni's principle: ◦ roughly, if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap

Example:

 We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day ◦ 10^9 people being tracked ◦ 1,000 days ◦ Each person stays in a hotel 1% of the time (1 day out of 100) ◦ Hotels hold 100 people (so 10^5 hotels) ◦ If everyone behaves randomly (i.e., no terrorists), will the data mining detect anything suspicious?

 Expected number of "suspicious" pairs of people: ◦ 250,000 ◦ …too many combinations to check – we need additional evidence to find "suspicious" pairs of people in some more efficient way

Example taken from: Rajaraman, Ullman: Mining of Massive Datasets
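The 250,000 figure can be checked with a few lines of arithmetic; a minimal sketch following the numbers in the example above (Python is used here purely for illustration):

```python
from math import comb

people = 10**9          # people being tracked
days = 1000             # days of observation
p_hotel = 0.01          # probability a given person is in a hotel on a given day
hotels = 10**5          # number of hotels

# Probability two specific people are in the same hotel on a specific day:
# both are in some hotel (0.01 * 0.01), and it is the same hotel (1/hotels).
p_same_day = p_hotel * p_hotel / hotels          # 1e-9

# Probability they collide on two specific days.
p_two_days = p_same_day ** 2                     # 1e-18

# Number of pairs of people and pairs of days.
pairs_people = comb(people, 2)                   # ~5e17
pairs_days = comb(days, 2)                       # ~5e5

expected_suspicious = pairs_people * pairs_days * p_two_days
print(round(expected_suspicious))                # ~250,000 with no real conspirators at all
```

Every one of those quarter-million pairs is a false alarm, which is exactly Bonferroni's warning.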

 Smart sampling of data ◦ …reducing the original data while not losing the statistical properties of the data

 Finding similar items ◦ …efficient multidimensional indexing

 Incremental updating of the models ◦ (vs. building models from scratch) ◦ …crucial for streaming data

 Distributed linear algebra ◦ …dealing with large sparse matrices
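For the sampling bullet, a standard technique that preserves a stream's statistical properties is reservoir sampling; a minimal sketch (a textbook algorithm, not something specific to these slides):

```python
import random

def reservoir_sample(stream, k, rng=random.Random(42)):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)           # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), 10)
print(len(sample))  # 10; every stream element is equally likely to be included
```

The stream is scanned once with constant memory, which is what makes it suitable at big-data scale.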

 On top of the previous ops we perform the usual data mining / machine learning / statistics operators: ◦ Supervised learning (classification, regression, …) ◦ Non-supervised learning (clustering, different types of decompositions, …) ◦ …

 …we are just more careful which algorithms we choose ◦ typically linear or sub-linear versions of the algorithms

 An excellent overview of the algorithms covering the above issues is the book “Rajaraman, Leskovec, Ullman: Mining of Massive Datasets”

 Downloadable from: http://infolab.stanford.edu/~ullman/mmds.html

 Where is processing hosted? ◦ Distributed servers / cloud (e.g. Amazon EC2)

 Where is data stored? ◦ Distributed storage (e.g. Amazon S3)

 What is the programming model? ◦ Distributed processing (e.g. MapReduce)

 How is data stored & indexed? ◦ High-performance schema-free databases (e.g. MongoDB)

 What operations are performed on data? ◦ Analytic / semantic processing

 Computing and storage are typically hosted transparently on cloud infrastructures ◦ …providing scale, flexibility and high fail-safety

 Distributed servers ◦ Amazon EC2, Google App Engine, Elastic Beanstalk, Heroku

 Distributed storage ◦ Amazon S3, Hadoop Distributed File System

 Distributed processing of Big-Data requires non-standard programming models ◦ …beyond single machines or traditional parallel programming models (like MPI) ◦ …the aim is to simplify complex programming tasks

 The most popular programming model is the MapReduce approach ◦ …suitable for commodity hardware, to reduce costs

 The key idea of the MapReduce approach: ◦ A target problem needs to be parallelizable ◦ First, the problem gets split into a set of smaller problems (Map step) ◦ Next, the smaller problems are solved in parallel ◦ Finally, the solutions to the smaller problems are synthesized into a solution of the original problem (Reduce step)

Worked example – word counting with MapReduce

 Input documents: ◦ "Google Maps charts new territory into businesses" ◦ "Google selling new tools for businesses to build their own maps" ◦ "Google promises consumer experience for businesses with Maps Engine Pro" ◦ "Google is trying to get its Maps service used by more businesses"

 Map step: each map task reads its share of the documents and emits word counts ◦ Task 1 (first two documents): Businesses 2, Charts 1, Maps 2, Territory 1 ◦ Task 2 (last two documents): Businesses 2, Engine 1, Maps 2, Service 1

 Shuffle step: intermediate counts are split according to the hash of a key ◦ in our case: key = word, hash = first character ◦ Reduce 1 receives Maps 2, Territory 1, Maps 2, Service 1; Reduce 2 receives Businesses 2, Charts 1, Businesses 2, Engine 1

 Reduce step: each reduce task sums the counts per word ◦ Reduce 1: Maps 4, Territory 1, Service 1 ◦ Reduce 2: Businesses 4, Charts 1, Engine 1

 We concatenate the outputs into the final result: Businesses 4, Charts 1, Engine 1, Maps 4, Territory 1, Service 1
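The word-count walkthrough above can be reproduced on a single machine in a few lines; a minimal sketch of the map, shuffle, and reduce steps (pure Python, no Hadoop; lower-casing the words is an assumption added so that "Maps" and "maps" count together, as in the slides):

```python
from collections import defaultdict

documents = [
    "Google Maps charts new territory into businesses",
    "Google selling new tools for businesses to build their own maps",
    "Google promises consumer experience for businesses with Maps Engine Pro",
    "Google is trying to get its Maps service used by more businesses",
]

# Map step: each document is turned into (word, 1) pairs independently,
# so the work can be spread over many machines.
def map_step(doc):
    return [(word.lower(), 1) for word in doc.split()]

mapped = [pair for doc in documents for pair in map_step(doc)]

# Shuffle step: group pairs by a hash of the key; here the partition is
# the first character of the word, as in the walkthrough above.
partitions = defaultdict(list)
for word, count in mapped:
    partitions[word[0]].append((word, count))

# Reduce step: each partition sums counts per word; the partial outputs
# are then simply concatenated into the final result.
result = {}
for partition in partitions.values():
    sums = defaultdict(int)
    for word, count in partition:
        sums[word] += count
    result.update(sums)

print(result["google"], result["maps"], result["businesses"])  # 4 4 4
```

Because each map call and each reduce partition is independent, the same logic distributes across a cluster with no change to the algorithm itself.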

 Apache Hadoop [http://hadoop.apache.org/] ◦ Open-source MapReduce implementation

 Tools using Hadoop: ◦ Hive: data warehouse infrastructure that provides data summarization and ad hoc querying (HiveQL) ◦ Pig: high-level data-flow language and execution framework for parallel computation (Pig Latin) ◦ Mahout: scalable machine learning and data mining library ◦ Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data ◦ Many more: Cascading, Cascalog, mrjob, MapR, Azkaban, Oozie, …

 “[…] need to solve a problem that relational databases are a bad fit for”, Eric Evans

 Motives: ◦ Avoidance of Unneeded Complexity – many use-case

require only subset of functionality from RDBMSs (e.g ACID properties)

◦ High Throughput – some NoSQL databases offer significantly higher throughput then RDBMSs

◦ Horizontal Scalability, Running on commodity hardware ◦ Avoidance of Expensive Object-Relational Mapping –

most NoSQL store simple data structures ◦ Compromising Reliability for Better Performance

Based on “NoSQL Databases”, Christof Strauch http://www.christof-strauch.de/nosqldbs.pdf

 BASE approach ◦ Availability, graceful degradation, performance ◦ Stands for "Basically Available, Soft state, Eventual consistency"

 Continuum of consistency tradeoffs: ◦ Strict – all reads must return data from the latest completed writes ◦ Eventual – the system eventually returns the last written value ◦ Read Your Own Writes – see your own updates immediately ◦ Session – RYOW only within the same session ◦ Monotonic – only more recent data in future requests

 Consistent hashing ◦ Use the same function for hashing objects and nodes ◦ Assign objects to the nearest nodes on the circle ◦ Reassign objects when nodes are added or removed ◦ Replicate objects to the r nearest nodes

White, Tom: Consistent Hashing. Blog post of 2007-11-27. http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html
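The bullets above can be made concrete with a small hash ring; a minimal sketch (node names are hypothetical, MD5 stands in for "the same function for hashing objects and nodes", and the replication step is omitted):

```python
import bisect
import hashlib

class HashRing:
    """Map keys to nodes on a circle; adding or removing a node only
    remaps the keys that fall between it and its neighbor."""

    def __init__(self, nodes=()):
        self._ring = []                      # sorted (position, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _pos(value):
        # The same hash function is used for objects and for nodes.
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add(self, node):
        bisect.insort(self._ring, (self._pos(node), node))

    def remove(self, node):
        self._ring.remove((self._pos(node), node))

    def node_for(self, key):
        # Walk clockwise to the nearest node at or after the key's position.
        i = bisect.bisect(self._ring, (self._pos(key), ""))
        return self._ring[i % len(self._ring)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:1234")
ring.remove(owner)
print(owner != ring.node_for("user:1234"))  # True: the key moved to a neighbor
```

The point of the construction is locality: removing one node reassigns only that node's arc of the circle, while every other key stays where it was.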

 Storage layout ◦ Row-based ◦ Columnar ◦ Columnar with locality groups

 Query models ◦ Lookup in key-value stores

 Distributed data processing via MapReduce

Lipcon, Todd: Design Patterns for Distributed Non-Relational Databases. Presentation of 2009-06-11. http://www.slideshare.net/guestdfd1ec/design-patterns-for-distributed-nonrelationaldatabases

 A map or dictionary that lets values be stored and retrieved by key

 Favor scalability over consistency ◦ Run on clusters of commodity hardware ◦ Component failure is the "standard mode of operation"

 Examples: ◦ Amazon Dynamo ◦ Project Voldemort (developed by LinkedIn) ◦ Redis ◦ Memcached (not persistent)

 Combine several key-value pairs into documents

 Documents represented as JSON

 Examples: ◦ Apache CouchDB ◦ MongoDB

{
  "Title": "CouchDB",
  "Last editor": "172.5.123.91",
  "Last modified": "9/23/2010",
  "Categories": ["Database", "NoSQL", "Document Database"],
  "Body": "CouchDB is a …",
  "Reviewed": false
}

 Using a columnar storage layout with locality groups (column families)

 Examples: ◦ Google Bigtable ◦ Hypertable, HBase – open-source implementations of Google Bigtable ◦ Cassandra – a combination of Google Bigtable and Amazon Dynamo, designed for high write throughput

Infrastructure:  Kafka [http://kafka.apache.org/]

◦ A high-throughput distributed messaging system

 Hadoop [http://hadoop.apache.org/] ◦ Open-source map-reduce implementation

 Storm [http://storm-project.net/] ◦ Real-time distributed computation system

 Cassandra [http://cassandra.apache.org/] ◦ Hybrid between Key-Value and Row-Oriented DB ◦ Distributed, decentralized, no single point of failure ◦ Optimized for fast writes

 Mahout

◦ Machine learning library working on top of Hadoop

◦ http://mahout.apache.org/

 MOA

◦ Mining data streams with concept drift

◦ Integrated with Weka

◦ http://moa.cms.waikato.ac.nz/

Mahout currently has:

• Collaborative filtering
• User- and item-based recommenders
• K-Means and Fuzzy K-Means clustering
• Mean Shift clustering
• Dirichlet process clustering
• Latent Dirichlet Allocation
• Singular value decomposition
• Parallel frequent pattern mining
• Complementary Naive Bayes classifier
• Random forest decision tree based classifier

 Interdisciplinary field using techniques and theories from many fields, including math, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products.

 Data science is a novel term that is often used interchangeably with competitive intelligence or business analytics, although it is becoming more common.

 Data science seeks to use all available and relevant data to effectively tell a story that can be easily understood by non-practitioners.

http://en.wikipedia.org/wiki/Data_science

Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work, by Harlan Harris, Sean Murphy, and Marck Vaisman. O'Reilly Media, June 2013.

An Introduction to Data Science, by Jeffrey Stanton, Syracuse University School of Information Studies. Downloadable from http://jsresearch.net/wiki/projects/teachdatascience. February 2013.

Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking, by Foster Provost and Tom Fawcett. August 2013.

Recommendation

Social Network Analytics

Media Monitoring

 User visit logs ◦ Track each visit using embedded JavaScript

 Content ◦ The content and metadata of visited pages

 Demographics ◦ Metadata about (registered) users

User ID cookie: 1234567890

IP: 95.87.154.251 (Ljubljana, Slovenia)

Requested URL: http://www.bloomberg.com/news/2012-07-19/americans-hold-dimmest-view-on-economic-outlook-since-january.html

Referring URL: http://www.bloomberg.com/

Date and time: 2009-08-25 08:12:34

Device: Chrome, Windows, PC

 News source: ◦ www.bloomberg.com

 Article URL: ◦ http://www.bloomberg.com/news/2011-01-17/video-gamers-prolonged-play-raises-risk-of-depression-anxiety-phobias.html

 Author: ◦ Elizabeth Lopatto

 Produced at: ◦ New York

 Editor: ◦ Reg Gale

 Publish date: ◦ Jan 17, 2011 6:00 AM

 Topics: ◦ U.S., Health Care, Media, Technology, Science

Topics (e.g. DMoz): ◦ Health/Mental Health/…/Depression ◦ Health/Mental Health/Disorders/Mood ◦ Games/Game Studies

Keywords (e.g. DMoz): ◦ Health, Mental Health, Disorders, Mood, Games, Video Games, Depression, Recreation, Browser Based, Game Studies, Anxiety, Women, Society, Recreation and Sports

Locations: ◦ Singapore (sws.geonames.org/1880252/) ◦ Ames (sws.geonames.org/3037869/)

People: ◦ Douglas A. Gentile

Organizations: ◦ Iowa State University (dbpedia.org/resource/Iowa_State_University) ◦ Pediatrics (journal)

 Provided only for registered users ◦ Typically only some percentage of unique users register

 Each registered user is described with: ◦ Gender ◦ Year of birth ◦ Household income

 Noisy

 A list of articles recommended based on ◦ the current article ◦ the user's history ◦ other visits

 In general, a combination of a text stream (news articles) with a click stream (website access logs)

 The key is a rich context model used to describe the user

 "Increase in engagement" ◦ Good recommendations can make a difference in keeping a user on a web site ◦ Measured in number of articles read in a session

 "User experience" ◦ Users return to the site ◦ Harder to measure and attribute to the recommendation module

 The predominant success metric is the attention span of a user, expressed in terms of time spent on site and number of page views.

 Cold start ◦ Recent news articles have little usage history ◦ More severe for articles that did not hit the homepage or a section front, but are still relevant for a particular user segment

 The recommendation model must be able to generalize well to new articles.

 Access log analysis shows that half of the articles read are less than ~8 hours old

 Weekends are the exception

(Chart: distribution of article age [minutes] at the time of reading)

 History

◦ Time

◦ Article

 Current request:

◦ Location

◦ Requested page

◦ Referring page

◦ Local Time

 Each article from the time window is described with the following features: ◦ Popularity (user independent) ◦ Content ◦ Meta-data ◦ Co-visits ◦ Users

 Features are computed by comparing the article's and the user's feature vectors

 Features are computed on-the-fly when preparing recommendations

(Diagram: the model is trained on a past time window and then used to serve recommendations going forward)

Top-4 hit rate by number of past visits: 1 → 21%; 2-10 → 24%; 11-50 → 32%; 51+ → 37%

 Measure how many times one of the top 4 recommended articles was actually read
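That metric can be sketched as a top-k hit rate; a minimal illustration with hypothetical session data (the function name and the data are invented for the example):

```python
def hit_rate_at_k(sessions, k=4):
    """Fraction of events where the article the user actually read
    was among the top-k recommended articles."""
    hits = sum(1 for recommended, read in sessions if read in recommended[:k])
    return hits / len(sessions)

# Hypothetical sessions: (ranked recommendation list, article actually read).
sessions = [
    (["a1", "a2", "a3", "a4", "a5"], "a2"),   # hit  (rank 2)
    (["a9", "a8", "a7", "a6", "a1"], "a1"),   # miss (rank 5, outside top 4)
    (["a3", "a1", "a4", "a2", "a5"], "a4"),   # hit  (rank 3)
    (["a5", "a6", "a7", "a8", "a9"], "a2"),   # miss (not recommended at all)
]
print(hit_rate_at_k(sessions))  # 0.5
```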

 Feature space ◦ Extracted from a subset of fields ◦ Using the vector space model ◦ Vector elements for each field are normalized

 Training set ◦ One visit = one vector ◦ One user = a centroid of all his/her visits ◦ Users from the segment form the positive class ◦ A sample of other users forms the negative class

 Classification algorithm ◦ Support Vector Machine ◦ Good for dealing with high-dimensional data ◦ Linear kernel ◦ Stochastic gradient descent – good for sampling
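The training setup above can be sketched in miniature; a hinge-loss linear SVM trained with stochastic gradient descent on tiny hypothetical centroid vectors (a toy stand-in for the real pipeline, not the authors' implementation):

```python
import random

def train_linear_svm(X, y, epochs=50, lr=0.1, lam=0.01, seed=0):
    """Linear SVM trained with stochastic gradient descent on the hinge loss.
    X: list of feature vectors; y: labels in {-1, +1}."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for i in rng.sample(range(len(X)), len(X)):   # shuffled passes
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            if margin < 1:                            # hinge loss is active
                w = [wj + lr * (y[i] * xj - lam * wj) for wj, xj in zip(w, X[i])]
                b += lr * y[i]
            else:                                     # only regularize
                w = [wj - lr * lam * wj for wj in w]
    return w, b

# Hypothetical centroids: tiny vectors standing in for users' averaged visits.
X = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
pred = [1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1 for x in X]
print(pred)  # the toy data is separable, so predictions match the labels
```

SGD processes one example at a time, which is why it pairs well with sampling and with training sets too large to fit in memory.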

 Real-world dataset from a major news publishing website ◦ 5 million daily users, 1 million registered

 Tested prediction of three demographic dimensions: ◦ Gender, Age, Income

 Three user groups based on the number of visits: ◦ ≥2, ≥10, ≥50

 Evaluation: ◦ Break Even Point (BEP) ◦ 10-fold cross validation

Segment sizes:

Gender: Male 250,000; Female 250,000
Age: 21-30 100,000; 31-40 100,000; 41-50 100,000; 51-60 100,000; 61-80 100,000
Income: 0-24k 50,000; 25k-49k 50,000; 50k-74k 50,000; 75k-99k 50,000; 100k-149k 50,000; 150k-254k 50,000

(Charts: break-even point by user group (≥2, ≥10, ≥50 visits) and feature set (text features, named entities, all meta data) – roughly 50-80% for gender (Male/Female), 20-45% for the age brackets (21-30 through 61-80), and 14-22% for the income brackets)

 Observe social and communication phenomena at a planetary scale

 Largest social network analyzed till 2010

Research questions:

 How does communication change with user demographics (age, sex, language, country)?

 How does geography affect communication?

 What is the structure of the communication network?

"Planetary-Scale Views on a Large Instant-Messaging Network", Leskovec & Horvitz, WWW 2008

 We collected the data for June 2006

 Log size:

150Gb/day (compressed)

 Total: 1 month of communication data:

4.5Tb of compressed data

 Activity over June 2006 (30 days) ◦ 245 million users logged in

◦ 180 million users engaged in conversations

◦ 17,5 million new accounts activated

◦ More than 30 billion conversations

◦ More than 255 billion exchanged messages


 Count the number of users logging in from each particular location on Earth


 Logins from Europe


 6 degrees of separation [Milgram, '60s]

 Average distance between two random users is 6.6

 90% of nodes can be reached in < 8 hops

Hops Nodes
1 10
2 78
3 396
4 8648
5 3299252
6 28395849
7 79059497
8 52995778
9 10321008
10 1955007
11 518410
12 149945
13 44616
14 13740
15 4476
16 1542
17 536
18 167
19 71
20 29
21 16
22 10
23 3
24 2
25 3
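The hop table can be summarized programmatically; a sketch that reads the counts off the table above (note that the mean over this particular table comes out near 7.3 hops, since the table appears to show reachability from a single seed node rather than the all-pairs average of 6.6 quoted above):

```python
# Node counts per hop, copied from the table above.
nodes_at_hop = {
    1: 10, 2: 78, 3: 396, 4: 8648, 5: 3299252, 6: 28395849, 7: 79059497,
    8: 52995778, 9: 10321008, 10: 1955007, 11: 518410, 12: 149945,
    13: 44616, 14: 13740, 15: 4476, 16: 1542, 17: 536, 18: 167, 19: 71,
    20: 29, 21: 16, 22: 10, 23: 3, 24: 2, 25: 3,
}

total = sum(nodes_at_hop.values())
mean_hops = sum(h * n for h, n in nodes_at_hop.items()) / total
within_8 = sum(n for h, n in nodes_at_hop.items() if h <= 8) / total

print(f"{mean_hops:.1f} hops on average; {within_8:.0%} of nodes within 8 hops")
```

This also confirms the "90% of nodes reachable in about 8 hops" claim directly from the data.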


 The aim of the project is to collect and analyze global mainstream and social media ◦ …documents are crawled from hundreds of thousands of sources ◦ …each crawled document gets cleaned, linguistically and semantically enriched ◦ …we connect documents across languages (cross-lingual technology) ◦ …we identify and connect events

http://render-project.eu/ http://www.xlike.org/

 The NewsFeed.ijs.si system collects ◦ 40,000 mainstream news sources ◦ 250,000 blog sources ◦ the Twitter stream

 …resulting in ~500,000 documents plus a large number of tweets per day

 Each document gets cleaned, linguistically and semantically annotated

Plain text → text enrichment → extracted graph of triples from the text

"Enrycher" is available as a web service generating a semantic graph, LOD links, entities, keywords, categories, and text summarization

 Reporting has bias – same information is being reported in different ways

 DiversiNews system allows exploring news diversity along:

◦ Topicality

◦ Geography

◦ Sentiment

 Having a stream of news & social media, the task is to structure the documents into events

 “Event Registry” system allows for: ◦ Identification of events from documents ◦ Connecting documents across many languages ◦ Tracking events and constructing story-lines ◦ Describing events in a (semi)structured way ◦ UI for exploration through Search & Visualization ◦ Export into RDF (Storyline ontology)

 Prototype operating at ◦ http://mustang.ijs.si:8060/searchEvents

 Big-Data is everywhere; we are just not used to dealing with it

 The "Big-Data" hype is very recent ◦ …growth seems to be going up ◦ …evident lack of experts to build Big-Data apps

 Can we do "Big-Data" without big investment? ◦ …yes – many open-source tools, and computing machinery is cheap (to buy or to rent) ◦ …the key is knowledge of how to deal with data ◦ …data is either free (e.g. Wikipedia) or for sale (e.g. Twitter)

CHAPTER 9

Big Data, Cloud Computing, and Location Analytics: Concepts and Tools

LEARNING OBJECTIVES

■■ Learn what Big Data is and how it is changing the world of analytics

■■ Understand the motivation for and business drivers of Big Data analytics

■■ Become familiar with the wide range of enabling technologies for Big Data analytics

■■ Learn about Hadoop, MapReduce, and NoSQL as they relate to Big Data analytics

■■ Compare and contrast the complementary uses of data warehousing and Big Data technologies

■■ Become familiar with in-memory analytics and Spark applications

■■ Become familiar with select Big Data platforms and services

■■ Understand the need for and appreciate the capabilities of stream analytics

■■ Learn about the applications of stream analytics

■■ Describe the current and future use of cloud computing in business analytics

■■ Describe how geospatial and location-based analytics are assisting organizations

Big Data, which means many things to many people, is not a new technological fad. It has become a business priority that has the potential to profoundly change the competitive landscape in today's globally integrated economy. In addition to providing innovative solutions to enduring business challenges, Big Data and analytics instigate new ways to transform processes, organizations, entire industries, and even society altogether. Yet extensive media coverage makes it hard to distinguish hype from reality. This chapter aims to provide comprehensive coverage of Big Data, its enabling technologies, and related analytics concepts to help understand the capabilities and limitations of this emerging technology. The chapter starts with a definition and related concepts of Big Data, followed by the technical details of the enabling technologies, including Hadoop, MapReduce, and NoSQL. We provide a comparative analysis between data warehousing and Big Data analytics. The last part of the chapter is dedicated to stream analytics, which is one of the most promising value propositions of Big Data analytics. This chapter contains the following sections:

9.1 Opening Vignette: Analyzing Customer Churn in a Telecom Company Using Big Data Methods
9.2 Definition of Big Data
9.3 Fundamentals of Big Data Analytics
9.4 Big Data Technologies
9.5 Big Data and Data Warehousing
9.6 In-Memory Analytics and Apache Spark
9.7 Big Data and Stream Analytics
9.8 Big Data Vendors and Platforms
9.9 Cloud Computing and Business Analytics
9.10 Location-Based Analytics for Organizations

9.1 OPENING VIGNETTE: Analyzing Customer Churn in a Telecom Company Using Big Data Methods

BACKGROUND

A telecom company (named Access Telecom [AT] for privacy reasons) wanted to stem the tide of customers churning from its telecom services. Customer churn in the telecommunications industry is common. However, Access Telecom was losing customers at an alarming rate. Several reasons and potential solutions were attributed to this phenomenon. The management of the company realized that many cancellations involved communications between the customer service department and the customers. To this end, a task force comprising members from the customer relations office and the information technology (IT) department was assembled to explore the problem further. Their task was to explore how the problem of customer churn could be reduced based on an analysis of the customers' communication patterns (Asamoah, Sharda, Zadeh, & Kalgotra, 2016).

BIG DATA HURDLES

Whenever a customer had a problem about issues such as their bill, plan, and call quality, they would contact the company in multiple ways. These included a call center, company Web site (contact us links), and physical service center walk-ins. Customers could cancel an account through one of these listed interactions. The company wanted to see if analyzing these customer interactions could yield any insights about the questions the customers asked or the contact channel(s) they used before canceling their account. The data generated because of these interactions were in both text and audio. So, AT would have to combine all the data into one location. The company explored the use of traditional platforms for data management but soon found they were not versatile enough to handle advanced data analysis in the scenario where there were multiple formats of data from multiple sources (Thusoo, Shao, & Anthony, 2010).

There were two major challenges in analyzing this data: multiple data sources leading to a variety of data and also a large volume of data.

1. Data from multiple sources: Customers could connect with the company by accessing their accounts on the company's Web site, allowing AT to generate Web log information on customer activity. The Web log track allowed the company to identify if and when a customer reviewed his/her current plan, submitted a complaint, or checked the bill online. At the customer service center, customers could also lodge a service complaint, request a plan change, or cancel the service. These activities were logged into the company's transaction system and then the enterprise data warehouse. Last, a customer could call the customer service center on the phone and transact business just like he/she would do in person at a customer service center. Such transactions could involve a balance inquiry or an initiation of plan cancellation. Call logs were available in one system with a record of the reasons a customer was calling. For meaningful analysis to be performed, the individual data sets had to be converted into similar structured formats.

2. Data volume: The second challenge was the sheer quantity of data from the three sources that had to be extracted, cleaned, restructured, and analyzed. Although previous data analytics projects mostly utilized a small sample set of data for analysis, AT decided to leverage the multiple variety and sources of data as well as the large volume of data recorded to generate as many insights as possible.

An analytical approach that could make use of all the channels and sources of data, although huge, would have the potential of generating rich and in-depth insights from the data to help curb the churn.

SOLUTION

Teradata Vantage's unified Big Data architecture (previously offered as Teradata Aster) was utilized to manage and analyze the large multistructured data. We will introduce Teradata Vantage in Section 9.8. A schematic of which data was combined is shown in Figure 9.1. Based on each data source, three tables were created, with each table containing the following variables: customer ID, channel of communication, date/time stamp, and action taken. Prior to final cancellation of a service, the action-taken variable could be one or more of these 11 options (simplified for this case): present a bill dispute, request for plan upgrade, request for plan downgrade, perform profile update, view account summary, access customer support, view bill, review contract, access store locator function on the Web site, access frequently asked questions section on the Web site, or browse devices. The target of the analysis focused on finding the most common path resulting in a final service cancellation. The data was sessionized to group a string of events involving a particular customer into a defined time period (5 days over all the channels of communication) as one session. Finally, Vantage's nPath time sequence function (operationalized in an SQL-MapReduce framework) was used to analyze common trends that led to a cancellation.

FIGURE 9.1 Multiple Data Sources Integrated into Teradata Vantage. Source: Teradata Corp. (Diagram: online, store, and call-center data loaded into Teradata Aster via an SQL-H connector, load_from_teradata, and HCatalog metadata with data on HDFS.)
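The sessionizing and path analysis can be sketched in miniature; hypothetical events, the 5-day window from the case, and a simple stand-in for Vantage's nPath function (this is not the Teradata API, just an illustration of the idea):

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical (customer_id, channel, timestamp, action) events.
events = [
    ("c1", "online",     datetime(2020, 1, 1),  "bill dispute"),
    ("c1", "callcenter", datetime(2020, 1, 3),  "service complaint"),
    ("c1", "store",      datetime(2020, 1, 4),  "cancel service"),
    ("c2", "online",     datetime(2020, 1, 2),  "view bill"),
    ("c2", "online",     datetime(2020, 1, 20), "bill dispute"),   # new session
    ("c2", "callcenter", datetime(2020, 1, 22), "cancel service"),
]

def sessionize(events, window=timedelta(days=5)):
    """Group each customer's events into sessions: a gap longer than the
    window (5 days in the case study) starts a new session."""
    sessions = []
    current = {}
    for cust, channel, ts, action in sorted(events, key=lambda e: (e[0], e[2])):
        last = current.get(cust)
        if last is None or ts - last[-1][0] > window:
            current[cust] = [(ts, f"{channel}:{action}")]
            sessions.append((cust, current[cust]))
        else:
            last.append((ts, f"{channel}:{action}"))
    return sessions

# Count the ordered action paths that end in a cancellation ("golden paths").
paths = Counter(
    " -> ".join(step for _, step in sess)
    for _, sess in sessionize(events)
    if sess[-1][1].endswith("cancel service")
)
for path, n in paths.most_common():
    print(n, path)
```

At production scale the same grouping and path matching would run inside the database engine; the logic, though, is exactly this: order events per customer, cut at the session window, and count the resulting sequences.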

RESULTS

The initial results identified several routes that could lead to a request for service cancellation. The company determined thousands of routes that a customer may take to cancel service. A follow-up analysis was performed to identify the most frequent routes to cancellation requests. This was termed the Golden Path. The top 20 most occurring paths that led to a cancellation were identified in both short and long terms. A sample is shown in Figure 9.2.

This analysis helped the company identify a customer before they would cancel their service and offer incentives or at least escalate the problem resolution to a level where the customer’s path to cancellation did not materialize.

QUESTIONS FOR THE OPENING VIGNETTE

1. What problem did customer service cancellation pose to AT’s business survival?

2. Identify and explain the technical hurdles presented by the nature and characteristics of AT’s data.

3. What is sessionizing? Why was it necessary for AT to sessionize its data?

FIGURE 9.2 Top 20 Paths Visualization. Source: Teradata Corp. (Diagram: paths through events such as Callcenter:Bill Dispute, Store:Bill Dispute, Store:New Account, Store:Service Complaint, and Callcenter:Service Complaint, ending in Online:Cancel Service, Callcenter:Cancel Service, or Store:Cancel Service.)

4. Research other studies where customer churn models have been employed. What types of variables were used in those studies? How is this vignette different?

5. Besides Teradata Vantage, identify other popular Big Data analytics platforms that could handle the analysis described in the preceding case. (Hint: see Section 9.8.)

WHAT CAN WE LEARN FROM THIS VIGNETTE?

Not all business problems merit the use of a Big Data analytics platform. This situation presents a business case that warranted the use of a Big Data platform. The main challenge revolved around the characteristics of the data under consideration. The three different types of customer interaction data sets presented a challenge in analysis. The formats and fields of the data generated in each of these systems varied widely, and the volume was large as well. This made it imperative to use a platform that uses technologies to permit analysis of a large volume of data that comes in a variety of formats.

Recently, Teradata stopped marketing Aster as a separate product and has merged all of the Aster capabilities into its new offering called Teradata Vantage. Although that change somewhat impacts how the application would be developed today, it is still a terrific example of how a variety of data can be brought together to make business decisions.

It is also worthwhile to note that AT aligned the questions asked of the data with the organization's business strategy. The questions also informed the type of analysis that was performed. It is important to understand that for any application of a Big Data architecture, the organization's business strategy and the generation of relevant questions are key to identifying the type of analysis to perform.

Sources: D. Asamoah, R. Sharda, A. Zadeh, & P. Kalgotra. (2016). “Preparing a Big Data Analytics Professional: A Pedagogic Experience.” In DSI 2016 Conference, Austin, TX. A. Thusoo, Z. Shao, & S. Anthony. (2010). “Data Warehousing and Analytics Infrastructure at Facebook.” In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (p. 1013). doi: 10.1145/1807167.1807278.

9.2 DEFINITION OF BIG DATA

Using data to understand customers/clients and business operations to sustain (and foster) growth and profitability is an increasingly challenging task for today's enterprises. As more and more data becomes available in various forms and fashions, timely processing of the data with traditional means becomes impractical. Nowadays, this phenomenon is called Big Data, which is receiving substantial press coverage and drawing increasing interest from both business users and IT professionals. The result is that Big Data is becoming an overhyped and overused marketing buzzword, leading some industry experts to argue for dropping the phrase altogether.

Big Data means different things to people with different backgrounds and interests. Traditionally, the term Big Data has been used to describe the massive volumes of data analyzed by huge organizations like Google or research science projects at NASA. But for most businesses, it's a relative term: "Big" depends on an organization's size. The point is more about finding new value within and outside conventional data sources. Pushing the boundaries of data analytics uncovers new insights and opportunities, and "big" depends on where you start and how you proceed. Consider the popular description of Big Data: Big Data exceeds the reach of commonly used hardware environments and/or capabilities of software tools to capture, manage, and process it within a tolerable time span for its user population. Big Data has become a popular term to describe the exponential growth, availability, and use of information, both structured and unstructured. Much has been written on the Big Data trend and how it can serve as the basis for innovation, differentiation, and growth. Because of the technology challenges in managing the large volume of data coming from multiple sources, sometimes at a rapid speed, additional new technologies have been developed to overcome them. Use of the term Big Data is usually associated with such technologies. Because a prime use of storing such data is generating insights through analytics, sometimes the term Big Data is expanded as Big Data analytics. But the term is becoming content-free in that it can mean different things to different people. Because our goal is to introduce you to the large data sets and their potential in generating insights, we will use the original term in this chapter.

Where does Big Data come from? A simple answer is "everywhere." Sources that were once ignored because of technical limitations are now treated as gold mines. Big Data may come from Web logs, radio-frequency identification (RFID), global positioning systems (GPS), sensor networks, social networks, Internet-based text documents, Internet search indexes, call detail records, astronomy, atmospheric science, biology, genomics, nuclear physics, biochemical experiments, medical records, scientific research, military surveillance, photography archives, video archives, and large-scale e-commerce practices.

Big Data is not new. What is new is that the definition and the structure of Big Data constantly change. Companies have been storing and analyzing large volumes of data since the advent of the data warehouse in the early 1990s. Whereas terabytes used to be synonymous with Big Data warehouses, now it's exabytes, and the rate of growth in data volume continues to escalate as organizations seek to store and analyze greater levels of transaction details, as well as Web- and machine-generated data, to gain a better understanding of customer behavior and business drivers.

Many (academics and industry analysts/leaders alike) think that “Big Data” is a misnomer. What it says and what it means are not exactly the same. That is, Big Data is not just “big.” The sheer volume of the data is only one of many characteristics that are often associated with Big Data, including variety, velocity, veracity, variability, and value proposition, among others.

The “V”s That Define Big Data

Big Data is typically defined by three “V”s: volume, variety, velocity. In addition to these three, we see some of the leading Big Data solution providers adding other “V”s, such as veracity (IBM), variability (SAS), and value proposition.

VOLUME Volume is obviously the most common trait of Big Data. Many factors contributed to the exponential increase in data volume, such as transaction-based data stored through the years, text data constantly streaming in from social media, increasing amounts of sensor data being collected, automatically generated RFID and GPS data, and so on. In the past, excessive data volume created storage issues, both technical and financial. But with today’s advanced technologies coupled with decreasing storage costs, these issues are no longer significant; instead, other issues have emerged, including how to determine relevance amid the large volumes of data and how to create value from data that is deemed to be relevant.

As mentioned before, big is a relative term. It changes over time and is perceived differently by different organizations. With the staggering increase in data volume, even the naming of the next Big Data echelon has been a challenge. The top of the scale, once measured in petabytes (PB), is now measured in zettabytes (ZB); a zettabyte is a trillion gigabytes (GB), or a billion terabytes (TB). Technology Insights 9.1 provides an overview of the size and naming of Big Data volumes.

From a short historical perspective, in 2009 the world had about 0.8 ZB of data; in 2010, it exceeded the 1 ZB mark; at the end of 2011, the number was 1.8 ZB. It is expected to be 44 ZB in 2020 (Adshead, 2014). With the growth of sensors and the Internet of Things (IoT—to be introduced in the next chapter), these forecasts could all be wrong.
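As a quick sanity check on those figures, growing from 0.8 ZB in 2009 to a forecast 44 ZB in 2020 implies a compound annual growth rate of roughly 44 percent, which a few lines of arithmetic confirm:

```python
# Implied compound annual growth rate: 0.8 ZB in 2009 to 44 ZB in 2020.
start, end, years = 0.8, 44.0, 2020 - 2009
cagr = (end / start) ** (1 / years) - 1   # (end/start)^(1/n) - 1
print(f"{cagr:.1%}")  # about 44% per year
```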


Though these numbers are astonishing in size, so are the challenges and opportunities that come with them.

VARIETY Data today come in all types of formats—ranging from traditional databases to hierarchical data stores created by the end users and OLAP systems to text documents, e-mail, XML, meter-collected and sensor-captured data, to video, audio, and stock ticker data. By some estimates, 80 to 85% of all organizations' data are in some sort of unstructured or semi-structured format (a format that is not suitable for traditional database schemas). But there is no denying its value, and hence, it must be included in the analyses to support decision making.
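A small sketch of why variety strains fixed schemas: semi-structured records can carry different fields from one record to the next, so they are often parsed with no predeclared schema and each record is inspected for whatever fields it happens to have. The records below are invented for illustration:

```python
import json

# Semi-structured records: fields vary from one record to the next.
raw = [
    '{"id": 1, "type": "email", "subject": "Outage?"}',
    '{"id": 2, "type": "sensor", "temp_c": 21.5, "unit": "C"}',
    '{"id": 3, "type": "tweet", "text": "great service", "likes": 12}',
]

# Schema-on-read: parse first, then discover each record's extra fields.
for rec in map(json.loads, raw):
    extras = sorted(k for k in rec if k not in ("id", "type"))
    print(rec["id"], rec["type"], extras)
```

A relational table would force every record into the same columns; here each record keeps only the fields it actually has.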

VELOCITY According to Gartner, velocity means both how fast data is being produced and how fast the data must be processed (i.e., captured, stored, and analyzed) to meet the need or demand. RFID tags, automated sensors, GPS devices, and smart meters are driving an increasing need to deal with torrents of data in near real time. Velocity is perhaps the most overlooked characteristic of Big Data. Reacting quickly enough to deal with velocity is a challenge to most organizations. For time-sensitive environments, the opportunity cost clock of the data starts ticking the moment the data is created. As time passes, the value proposition of the data degrades and eventually becomes worthless. Whether the subject matter is the health of a patient, the well-being of a traffic system, or the health of an investment portfolio, accessing the data and reacting faster to the circumstances will always create more advantageous outcomes.

TECHNOLOGY INSIGHTS 9.1 The Data Size Is Getting Big, Bigger, and Bigger

The measure of data size is having a hard time keeping up with new names. We all know kilobyte (KB, which is 1,000 bytes), megabyte (MB, which is 1,000,000 bytes), gigabyte (GB, which is 1,000,000,000 bytes), and terabyte (TB, which is 1,000,000,000,000 bytes). Beyond that, the names given to data sizes are relatively new to most of us. The following table shows what comes after terabyte and beyond.

Name         Symbol   Value

Kilobyte     kB       10^3
Megabyte     MB       10^6
Gigabyte     GB       10^9
Terabyte     TB       10^12
Petabyte     PB       10^15
Exabyte      EB       10^18
Zettabyte    ZB       10^21
Yottabyte    YB       10^24
Brontobyte*  BB       10^27
Gegobyte*    GeB      10^30

*Not an official SI (International System of Units) name/symbol, yet.
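For reference, a small helper can map a raw byte count onto the decimal unit names in the table. The function name is ours, and it follows the table in using decimal powers of 1,000 rather than binary powers of 1,024:

```python
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]

def size_name(n_bytes):
    """Return the decimal (powers-of-1,000) unit name for a byte count."""
    exp = (len(str(int(n_bytes))) - 1) // 3   # groups of three digits
    return UNITS[min(exp, len(UNITS) - 1)]

print(size_name(10**12))   # terabyte
print(size_name(10**21))   # zettabyte
```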

Consider that an exabyte of data is created on the Internet each day, which equates to 250 million DVDs' worth of information. And the idea of even larger amounts of data—a zettabyte—isn't too far off when it comes to the amount of information traversing the Web in any one year. In fact, industry experts are already estimating that we will see 1.3 zettabytes of traffic annually over the Internet by 2016—and it could jump to 2.3 zettabytes by 2020. By 2020, Internet traffic is expected to reach 300 GB per capita per year. When referring to yottabytes, some of the Big Data scientists often wonder about how much data the NSA or FBI have on people altogether. Put in terms of DVDs, a yottabyte would require 250 trillion of them. A brontobyte, which is not an official SI prefix but is apparently recognized by some people in the measurement community, is a 1 followed by 27 zeros. The size of such a magnitude can be used to describe the amount of sensor data that we will get from the Internet in the next decade, if not sooner.

A gegobyte is 10 to the power of 30. With respect to where the Big Data comes from, consider the following:

• The CERN Large Hadron Collider generates 1 petabyte per second.
• Sensors from a Boeing jet engine create 20 terabytes of data every hour.
• Every day, 600 terabytes of new data are ingested in Facebook databases.
• On YouTube, 300 hours of video are uploaded per minute, translating to 1 terabyte every minute.
• The proposed Square Kilometer Array telescope (the world's proposed biggest telescope) will generate an exabyte of data per day.

Sources: S. Higginbotham. (2012). "As Data Gets Bigger, What Comes after a Yottabyte?" gigaom.com/2012/10/30/as-data-gets-bigger-what-comes-after-a-yottabyte (accessed October 2018). Cisco. (2016). "The Zettabyte Era: Trends and Analysis." cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/vni-hyperconnectivity-wp.pdf (accessed October 2018).

In the Big Data storm that we are currently witnessing, almost everyone is fixated on at-rest analytics, using optimized software and hardware systems to mine large quantities of variant data sources. Although this is critically important and highly valuable, there is another class of analytics, driven from the velocity of Big Data, called "data stream analytics" or "in-motion analytics," which is evolving fast. If done correctly, data stream analytics can be as valuable as, and in some business environments more valuable than, at-rest analytics. Later in this chapter we will cover this topic in more detail.
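A minimal sketch of the in-motion idea: statistics are updated as each reading arrives, so only a running summary is kept and the raw stream is never stored at rest. The readings below are invented for illustration; production stream-analytics engines apply the same principle to far richer computations:

```python
def running_mean(stream):
    """Consume a stream one reading at a time, keeping only a count and a
    running total: the raw data is processed in motion, never stored."""
    count, total = 0, 0.0
    for x in stream:
        count += 1
        total += x
        yield total / count   # up-to-the-moment average

readings = iter([10.0, 20.0, 30.0])
print(list(running_mean(readings)))  # [10.0, 15.0, 20.0]
```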

VERACITY Veracity is a term coined by IBM that is being used as the fourth "V" to describe Big Data. It refers to conformity to facts: accuracy, quality, truthfulness, or trustworthiness of the data. Tools and techniques are often used to handle Big Data's veracity by transforming the data into quality and trustworthy insights.

VARIABILITY In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent with periodic peaks. Is something big trending on social media? Perhaps there is a high-profile IPO looming. Maybe swimming with pigs in the Bahamas is suddenly the must-do vacation activity. Daily, seasonal, and event-triggered peak data loads can be highly variable and thus challenging to manage—especially with social media involved.

VALUE PROPOSITION The excitement around Big Data is its value proposition. A preconceived notion about "Big" data is that it contains (or has a greater potential to contain) more patterns and interesting anomalies than "small" data. Thus, by analyzing large and feature-rich data, organizations can gain business value that they could not gain otherwise. Although users can detect the patterns in small data sets using simple statistical and machine-learning methods or ad hoc query and reporting tools, Big Data means "big" analytics. Big analytics means greater insight and better decisions, something that every organization needs.

Because the exact definition of Big Data (or its successor terms) is still a matter of ongoing discussion in academic and industrial circles, more characteristics (perhaps more "V"s) are likely to be added to this list. Regardless of what happens, the importance and value proposition of Big Data is here to stay. Figure 9.3 shows a conceptual architecture where Big Data (at the left side of the figure) is converted to business insight through the use of a combination of advanced analytics and delivered to a variety of different users/roles for faster/better decision making.

Another term that is being added to Big Data buzzwords is alternative data. Application Case 9.1 shows examples of multiple types of data in a number of different scenarios.

[Figure content not reproduced: data sources (ERP, SCM, CRM, images, audio and video, machine logs, text, Web and social) are moved into a data platform providing fast data loading and availability, filtering and processing, deep-history online archival, and data management (data lake); managed in an integrated data warehouse (business intelligence, predictive analytics, operational intelligence) and an integrated discovery platform (data discovery, fast-fail hypothesis testing, path/graph/time-series analysis, pattern detection); and accessed through analytic tools and apps (applications, business intelligence, data mining, math and stats, languages) by users including marketing, executives, operational systems, customers, partners, frontline workers, business analysts, data scientists, and engineers.]

FIGURE 9.3 A High-Level Conceptual Architecture for Big Data Solutions. Source: Teradata Company.

Application Case 9.1 Alternative Data for Market Analysis or Forecasts

Getting a good forecast and understanding of the situation is crucial for any scenario, but it is especially important to players in the investment industry. Being able to get an early indication of how a particular retailer's sales are doing can give an investor a leg up on whether to buy or sell that retailer's stock even before the earnings reports are released. The problem of forecasting economic activity or microclimates based on a variety of data beyond the usual retail data is a very recent phenomenon and has led to another buzzword—"alternative data." A major mix in this alternative data category is satellite imagery, but it also includes other data such as social media, government filings, job postings, traffic patterns, changes in parking lots or open spaces detected by satellite images, mobile phone usage patterns in any given location at any given time, search patterns on search engines, and so on. Facebook and other companies have invested in satellites to try to image the whole globe every day so that daily changes can be tracked at any location and the information can be used for forecasting. Many interesting examples of more reliable and advanced forecasts have been reported. Indeed, this activity is being led by start-up companies. Tartar (2018) cited several examples. We mentioned some in Chapter 1. Here are some of the examples identified by them and many other proponents of alternative data:

• RS Metrics monitored parking lots across the United States for various hedge funds. In 2015, based on an analysis of the parking lots, RS Metrics predicted a strong second quarter in 2015 for JC Penney. Its clients (mostly hedge funds) profited from this advanced insight. A similar story has been reported for Wal-Mart using car counts in its parking lots to forecast sales.

• Spaceknow keeps track of changes in factory surroundings for over 6,000 Chinese factory sites. Using this data, the company has been able to provide a better idea of China's industrial economic activity than what the Chinese government has been reporting.

• Telluslabs, Inc. compiles data from NASA and European satellites to build prediction models for various crops such as corn, rice, soybean, wheat, and so on. Besides the images from the satellites, they incorporate measurements of thermal infrared bands, which help measure radiating heat to predict health of the crops.

• DigitalGlobe is able to analyze the size of a forest with more accuracy because its software can count every single tree in a forest. This results in a more accurate estimate because there is no need to use a representative sample.

These examples illustrate just a sample of ways data can be combined to generate new insights. Of course, there are privacy concerns in some cases. For example, Yodlee, a division of Envestnet, provides personal finance tools to many large banks as well as to individuals. Thus, it has access to massive information about individuals. It has faced concerns about the privacy and security of this information, especially in light of the major breaches reported by Facebook, Cambridge Analytica, and Equifax. Although such concerns will eventually be resolved by policy makers or the market, what is clear is that new and interesting ways of combining satellite data and many other data sources are spawning a new crop of analytics companies. All of these organizations are working with data that meets the three V's—variety, volume, and velocity characterizations. Some of these companies also work with another category of data—sensors. But this group of companies certainly also falls under a group of innovative and emerging applications.

Sources: C. Dillow. (2016). "What Happens When You Combine Artificial Intelligence and Satellite Imagery." fortune.com/2016/03/30/facebook-ai-satellite-imagery/ (accessed October 2018). G. Ekster. (2015). "Driving Investment Performance with Alternative Data." integrity-research.com/wp-content/uploads/2015/11/Driving-Investment-Performance-With-Alternative-Data.pdf (accessed October 2018). B. Hope. (2015). "Provider of Personal Finance Tools Tracks Bank Cards, Sells Data to Investors." wsj.com/articles/provider-of-personal-finance-tools-tracks-bank-cards-sells-data-to-investors-1438914620 (accessed October 2018). C. Shaw. (2016). "Satellite Companies Moving Markets." quandl.com/blog/alternative-data-satellite-companies (accessed October 2018). C. Steiner. (2009). "Sky High Tips for Crop Traders." www.forbes.com/forbes/2009/0907/technology-software-satellites-sky-high-tips-for-crop-traders.html (accessed October 2018). M. Turner. (2015). "This Is the Future of Investing, and You Probably Can't Afford It." businessinsider.com/hedge-funds-are-analysing-data-to-get-an-edge-2015-8 (accessed October 2018).

Questions for Discussion

1. What is a common thread in the examples discussed in this application case?

2. Can you think of other data streams that might help give an early indication of sales at a retailer?

3. Can you think of other applications along the lines presented in this application case?

u SECTION 9.2 REVIEW QUESTIONS

1. Why is Big Data important? What has changed to put it in the center of the analytics world?

2. How do you define Big Data? Why is it difficult to define?

3. Out of the “V”s that are used to define Big Data, in your opinion, which one is the most important? Why?

4. What do you think the future of Big Data will be like? Will it leave its popularity to something else? If so, what will it be?


9.3 FUNDAMENTALS OF BIG DATA ANALYTICS

Big Data by itself, regardless of the size, type, or speed, is worthless unless business users do something with it that delivers value to their organizations. That's where "big" analytics comes into the picture. Although organizations have always run reports and dashboards against data warehouses, most have not opened these repositories to in-depth on-demand exploration. This is partly because analysis tools are too complex for the average user but also because the repositories often do not contain all the data needed by the power user. But this is about to change (and has been changing, for some) in a dramatic fashion, thanks to the new Big Data analytics paradigm.

With the value proposition, Big Data also brought about big challenges for organizations. The traditional means for capturing, storing, and analyzing data are not capable of dealing with Big Data effectively and efficiently. Therefore, new breeds of technologies need to be developed (or purchased/hired/outsourced) to take on the Big Data challenge. Before making such an investment, organizations should justify the means. The following statements may help shed light on the situation: if any of them are true, then you need to seriously consider embarking on a Big Data journey.

• You can’t process the amount of data that you want to because of the limitations posed by your current platform or environment.

• You want to bring new/contemporary data sources (e.g., social media, RFID, sensory, Web, GPS, textual data) into your analytics platform, but you can't because the data does not fit the schema-defined rows and columns of your data storage without sacrificing fidelity or the richness of the new data.

• You need to (or want to) integrate data as quickly as possible to be current on your analysis.

• You want to work with a schema-on-demand (as opposed to predetermined schema used in relational database management systems [RDBMSs]) data storage paradigm because the nature of the new data may not be known, or there may not be enough time to determine it and develop a schema for it.

• The data is arriving so fast at your organization's doorstep that your traditional analytics platform cannot handle it.

As is the case with any other large IT investment, the success in Big Data analytics depends on a number of factors. Figure 9.4 shows a graphical depiction of the most critical success factors (Watson, 2012).

The following are the most critical success factors for Big Data analytics (Watson, Sharda, & Schrader, 2012):

1. A clear business need (alignment with the vision and the strategy). Business investments ought to be made for the good of the business, not for the sake of mere technology advancements. Therefore, the main driver for Big Data analytics should be the needs of the business at any level—strategic, tactical, or operational.

2. Strong, committed sponsorship (executive champion). It is a well-known fact that if you don't have strong, committed executive sponsorship, it is difficult (if not impossible) to succeed. If the scope is a single or a few analytical applications, the sponsorship can be at the departmental level. However, if the target is enterprise-wide organizational transformation, which is often the case for Big Data initiatives, sponsorship needs to be at the highest levels and organization wide.

3. Alignment between the business and IT strategy. It is essential to make sure that the analytics work is always supporting the business strategy, and not the other way around. Analytics should play the enabling role in successfully executing the business strategy.


4. A fact-based decision-making culture. In a fact-based decision-making culture, the numbers rather than intuition, gut feeling, or supposition drive decision making. There is also a culture of experimentation to see what works and what doesn’t. To create a fact-based decision-making culture, senior management needs to:

• Recognize that some people can't or won't adjust
• Be a vocal supporter
• Stress that outdated methods must be discontinued
• Ask to see what analytics went into decisions
• Link incentives and compensation to desired behaviors

5. A strong data infrastructure. Data warehouses have provided the data infrastructure for analytics. This infrastructure is changing and being enhanced in the Big Data era with new technologies. Success requires marrying the old with the new for a holistic infrastructure that works synergistically.

As the size and complexity increase, the need for more efficient analytical systems is also increasing. To keep up with the computational needs of Big Data, a number of new and innovative computational techniques and platforms have been developed. These techniques are collectively called high-performance computing, which includes the following:

• In-memory analytics: Solves complex problems in near real time with highly accurate insights by allowing analytical computations and Big Data to be processed in-memory and distributed across a dedicated set of nodes.

• In-database analytics: Speeds time to insights and enables better data governance by performing data integration and analytic functions inside the database so you won't have to move or convert data repeatedly.

[Figure content not reproduced: the figure lists seven keys to success with Big Data analytics: a clear business need; strong, committed sponsorship; alignment between the business and IT strategy; the right analytics tools; personnel with advanced analytical skills; a strong data infrastructure; and a fact-based decision-making culture.]

FIGURE 9.4 Critical Success Factors for Big Data Analytics. Source: Watson, H. (2012). The requirements for being an analytics-based organization. Business Intelligence Journal, 17(2), 42–44.


• Grid computing: Promotes efficiency, lower cost, and better performance by processing jobs in a shared, centrally managed pool of IT resources.

• Appliances: Brings together hardware and software in a physical unit that is not only fast but also scalable on an as-needed basis.
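The in-database idea above can be illustrated with any SQL engine: the aggregation runs inside the database, so only summary rows cross into the application rather than every raw record. The sketch below uses SQLite with an invented table; production platforms such as Teradata apply the same principle at far larger scale:

```python
import sqlite3

# In-database analytics: push the aggregation into the database engine
# instead of moving every row into the application to summarize it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (region TEXT, minutes REAL)")
conn.executemany("INSERT INTO calls VALUES (?, ?)",
                 [("east", 12.0), ("east", 8.0), ("west", 5.0)])

# The GROUP BY executes inside the engine; only two summary rows return.
rows = list(conn.execute(
    "SELECT region, SUM(minutes) FROM calls GROUP BY region ORDER BY region"))
conn.close()
print(rows)  # [('east', 20.0), ('west', 5.0)]
```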

Computational requirements are just a small part of the list of challenges that Big Data imposes on today's enterprises. The following is a list of challenges that business executives have found to have a significant impact on the successful implementation of Big Data analytics. When considering Big Data projects and architecture, being mindful of these challenges will make the journey to analytics competency a less stressful one.

Data volume: The ability to capture, store, and process a huge volume of data at an acceptable speed so that the latest information is available to decision makers when they need it.

Data integration: The ability to combine data that is not similar in structure or source and to do so quickly and at a reasonable cost.

Processing capabilities: The ability to process data quickly, as it is captured. The traditional way of collecting and processing data may not work. In many situations, data needs to be analyzed as soon as it is captured to leverage the most value. (This is called stream analytics, which will be covered later in this chapter.)

Data governance: The ability to keep up with the security, privacy, ownership, and quality issues of Big Data. As the volume, variety (format and source), and velocity of data change, so should the capabilities of governance practices.

Skills availability: Big Data is being harnessed with new tools and is being looked at in different ways. There is a shortage of people (often called data scientists) with skills to do the job.

Solution cost: Because Big Data has opened up a world of possible business improvements, a great deal of experimentation and discovery is taking place to determine the patterns that matter and the insights that turn to value. To ensure a positive return on investment on a Big Data project, therefore, it is crucial to reduce the cost of the solutions used to find that value.

Though the challenges are real, so is the value proposition of Big Data analytics. Anything that you can do as a business analytics leader to help prove the value of new data sources to the business will move your organization beyond experimenting and exploring Big Data into adapting and embracing it as a differentiator. There is nothing wrong with exploration, but ultimately the value comes from putting those insights into action.

Business Problems Addressed by Big Data Analytics

The top business problems addressed by Big Data overall are process efficiency and cost reduction, as well as enhancing customer experience, but different priorities emerge when it is looked at by industry. Process efficiency and cost reduction are perhaps among the top-ranked problems that can be addressed with Big Data analytics for the manufacturing, government, energy and utilities, communications and media, transport, and healthcare sectors. Enhanced customer experience may be at the top of the list of problems addressed by insurance companies and retailers. Risk management usually is at the top of the list for companies in banking and education. Here is a partial list of problems that can be addressed using Big Data analytics:

Process efficiency and cost reduction
Brand management
Revenue maximization, cross-selling, and up-selling
Enhanced customer experience
Churn identification, customer recruiting


Improved customer service
Identifying new products and market opportunities
Risk management
Regulatory compliance
Enhanced security capabilities

The retail industry has increasingly adopted Big Data infrastructure to better analyze customer preferences and trends in buying behavior. Application Case 9.2 shows how Japanese retail companies make use of it.

Data circulation is growing exponentially. In 2013, it was 4.4 zettabytes; by 2020, it is expected to reach 44 zettabytes. Parallel with this is the number of compa- nies investing in ways to put these data to good use.

Retail is one of the sectors that have witnessed the most rapid changes in this area. That big data is transforming the way the sector functions around the world should come as no surprise considering the amount of research that goes into in-store analytics to anticipate customers’ future purchasing behavior.

The Japanese AI venture ABEJA is one of the many companies that specialize in this kind of big data analysis solution for the retail industry. Since its establishment in 2012, the company has developed deep learning technology in-house to offer AI solu- tions to a series of corporate clients, many of them in retail.

ABEJA harvests high volumes of data from a range of sources, including drones, robots, and all sorts of logistics devices, which can later be used to conduct high-level analysis through deep learning. This allows the company to optimize store management and inventory management for its fashion and retail customers based on accumulated sensor data such as radio-frequency identification (RFID) and IoT readings. The company's proprietary technology can be found in about 530 stores in Japan, including brands such as Guess and Beams. In Tokyo's Parco-ya Ueno shopping center, ABEJA maintains 200 cameras and runs all the data it gathers through the company's AI system to track volumes of customers as well as their ages and genders.

In 2018, ABEJA made headlines around the world when it received funding from Google. After having established its first foreign subsidiary in Singapore in March 2017, the company continued to expand in South-East Asian markets.

Questions for Discussion

1. How do retail companies use Big Data infrastructure?

2. How is ABEJA’s AI used in day-to-day retail business?

3. What are ABEJA’s target markets in the medium term?

Sources: ABEJA. (2019). "Company Profile." https://abejainc.com/en/company/ (accessed October 2019); Nikkei Asian Review. (2018). "Japan's Google-backed AI Pioneer Plots a Quantum Leap." https://asia.nikkei.com/Spotlight/Startups-in-Asia/Japan-s-Google-backed-AI-pioneer-plots-a-quantum-leap (accessed October 2019).

Application Case 9.2 Big Data and Retail Business: The Rise of ABEJA

This section has introduced the basics of Big Data and some potential applications. In the next section we will learn about a few terms and technologies that have emerged in the Big Data space.

SECTION 9.3 REVIEW QUESTIONS

1. What is Big Data analytics? How does it differ from regular analytics?

2. What are the critical success factors for Big Data analytics?

3. What are the big challenges that one should be mindful of when considering implementation of Big Data analytics?

4. What are the common business problems addressed by Big Data analytics?


Chapter 9 • Big Data, Cloud Computing, and Location Analytics: Concepts and Tools 559

9.4 BIG DATA TECHNOLOGIES

There are a number of technologies for processing and analyzing Big Data, but most have some common characteristics (Kelly, 2012). Namely, they take advantage of commodity hardware to enable scale-out and parallel-processing techniques; employ nonrelational data storage capabilities to process unstructured and semistructured data; and apply advanced analytics and data visualization technology to Big Data to convey insights to end users. The three Big Data technologies that stand out, and that most believe will transform the business analytics and data management markets, are MapReduce, Hadoop, and NoSQL.

MapReduce

MapReduce is a technique popularized by Google that distributes the processing of very large multistructured data files across a large cluster of machines. High performance is achieved by breaking the processing into small units of work that can be run in parallel across the hundreds, potentially thousands, of nodes in the cluster. To quote the seminal paper on MapReduce:

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. (Dean & Ghemawat, 2004)

The key point to note from this quote is that MapReduce is a programming model, not a programming language; that is, it is designed to be used by programmers rather than business users. The easiest way to describe how MapReduce works is through an example (see the Colored Square Counter in Figure 9.5).

The input to the MapReduce process in Figure 9.5 is a set of colored squares. The objective is to count the number of squares of each color. The programmer in this example is responsible for coding the map and reduce programs; the remainder of the processing is handled by the software system implementing the MapReduce programming model.

The MapReduce system first reads the input file and splits it into multiple pieces. In this example, there are two splits, but in a real-life scenario, the number of splits would typically be much higher. These splits are then processed by multiple map programs running in parallel on the nodes of the cluster. The role of each map program in this case is to group the data in a split by color. The MapReduce system then takes the output from each map program and merges (shuffle/sort) the results for input to the reduce program, which calculates the sum of the number of squares of each color. In this example, only one copy of the reduce program is used, but there may be more in practice. To optimize performance, programmers can provide their own shuffle/sort program and can also deploy a combiner that combines local map output files to reduce the number of output files that have to be remotely accessed across the cluster by the shuffle/sort step.
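The flow just described can be simulated in a few lines of Python. This is only a single-process sketch of the map, shuffle/sort, and reduce phases from the example, not how a production MapReduce system is implemented; the two input splits and the color values are invented for illustration.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical input: the "colored squares" of Figure 9.5, already
# divided into two splits as in the example.
splits = [
    ["red", "blue", "green", "red", "yellow"],
    ["blue", "red", "green", "green", "blue"],
]

def map_colors(split):
    """Map phase: emit a (color, 1) pair for each square in the split."""
    return [(color, 1) for color in split]

def shuffle_sort(mapped_outputs):
    """Shuffle/sort phase: group the emitted counts by color across
    all map outputs."""
    groups = defaultdict(list)
    for color, count in chain.from_iterable(mapped_outputs):
        groups[color].append(count)
    return groups

def reduce_colors(groups):
    """Reduce phase: sum the counts for each color."""
    return {color: sum(counts) for color, counts in groups.items()}

# In a real cluster each map program runs on a different node;
# here we simply loop over the splits.
mapped = [map_colors(s) for s in splits]
result = reduce_colors(shuffle_sort(mapped))
print(result)  # {'red': 3, 'blue': 3, 'green': 3, 'yellow': 1}
```

In a real cluster, each map call would run on a different node and the shuffle would move intermediate data across the network; the structure of the three phases, however, is the same.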

Why Use MapReduce?

MapReduce aids organizations in processing and analyzing large volumes of multistructured data. Application examples include indexing and search, graph analysis, text analysis, machine learning, data transformation, and so forth. These types of applications are often difficult to implement using the standard SQL employed by relational DBMSs.

The procedural nature of MapReduce makes it easily understood by skilled programmers. It also has the advantage that developers do not have to be concerned with implementing parallel computing—this is handled transparently by the system. Although MapReduce is designed for programmers, nonprogrammers can exploit the value of prebuilt MapReduce applications and function libraries. Both commercial and open source MapReduce libraries are available that provide a wide range of analytic capabilities. Apache Mahout, for example, is an open source machine-learning library of "algorithms for clustering, classification and batch-based collaborative filtering" that are implemented using MapReduce.

Hadoop

Hadoop is an open source framework for processing, storing, and analyzing massive amounts of distributed, unstructured data. Originally created by Doug Cutting at Yahoo!, Hadoop was inspired by MapReduce, the programming model developed by Google in the early 2000s for indexing the Web. It was designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel. Hadoop clusters run on inexpensive commodity hardware so projects can scale out without breaking the bank. Hadoop is

[Figure 9.5: A Graphical Depiction of the MapReduce Process. Raw data is split, map functions group the shapes, and a reduce function produces the frequency count of each shape. Courtesy of The Apache Software Foundation.]


now a project of the Apache Software Foundation, where hundreds of contributors continuously improve the core technology. Fundamental concept: Rather than banging away at one huge block of data with a single machine, Hadoop breaks up Big Data into multiple parts so each part can be processed and analyzed at the same time.

How Does Hadoop Work?

A client accesses unstructured and semistructured data from sources including log files, social media feeds, and internal data stores. It breaks the data up into “parts,” which are then loaded into a file system made up of multiple nodes running on commodity hardware. The default file store in Hadoop is the Hadoop Distributed File System, or HDFS. File systems such as HDFS are adept at storing large volumes of unstructured and semistructured data as they do not require data to be organized into relational rows and columns. Each “part” is replicated multiple times and loaded into the file system so that if a node fails, another node has a copy of the data contained on the failed node. A Name Node acts as facilitator, communicating back to the client information such as which nodes are available, where in the cluster certain data resides, and which nodes have failed.

Once the data is loaded into the cluster, it is ready to be analyzed via the MapReduce framework. The client submits a "Map" job—usually a query written in Java—to one of the nodes in the cluster known as the Job Tracker. The Job Tracker refers to the Name Node to determine which data it needs to access to complete the job and where in the cluster that data is located. Once determined, the Job Tracker submits the query to the relevant nodes. Rather than bringing all the data back into a central location for processing, the processing occurs at each node simultaneously, or in parallel. This is an essential characteristic of Hadoop.

When each node has finished processing its given job, it stores the results. The client initiates a "Reduce" job through the Job Tracker in which results of the map phase stored locally on individual nodes are aggregated to determine the "answer" to the original query, and then are loaded onto another node in the cluster. The client accesses these results, which can then be loaded into one of a number of analytic environments for analysis. The MapReduce job has now been completed.

Once the MapReduce phase is complete, the processed data is ready for further analysis by data scientists and others with advanced data analytics skills. Data scientists can manipulate and analyze the data using any of a number of tools for any number of uses, including searching for hidden insights and patterns, or use as the foundation for building user-facing analytic applications. The data can also be modeled and transferred from Hadoop clusters into existing relational databases, data warehouses, and other traditional IT systems for further analysis and/or to support transactional processing.
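The map and reduce programs described above are often short scripts. The sketch below follows the Hadoop Streaming convention, in which a mapper emits tab-separated key/value lines and a reducer receives those lines already sorted by key; the driver code at the bottom simulates locally what the Job Tracker would distribute across the cluster nodes. The sample input lines are invented for illustration.

```python
# Sketch of the Hadoop Streaming convention: mappers emit "key<TAB>value"
# lines, the framework sorts them by key (the shuffle), and reducers
# consume the sorted stream.

def mapper(lines):
    """Map step: emit a (word, 1) pair per word, one per output line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    """Reduce step: pairs arrive sorted by key, so counts for the same
    word are consecutive and can be summed in a single pass."""
    current, total = None, 0
    for pair in sorted_pairs:
        word, count = pair.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# Local dry run of the pipeline Hadoop would distribute across nodes:
# map -> sort (shuffle) -> reduce. On a cluster, each script would
# instead read its portion of the data from standard input.
lines = ["big data big insight", "data at scale"]
for out in reducer(sorted(mapper(lines))):
    print(out)
```

Run locally, this prints the tab-separated counts at 1, big 2, data 2, insight 1, scale 1; on a cluster, many mapper and reducer instances would run in parallel over HDFS splits.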

Hadoop Technical Components

A Hadoop "stack" is made up of a number of components, which include:

Hadoop Distributed File System (HDFS): The default storage layer in any given Hadoop cluster.

Name Node: The node in a Hadoop cluster that provides the client information on where in the cluster particular data is stored and if any nodes fail.

Secondary Node: A backup to the Name Node, it periodically replicates and stores data from the Name Node should it fail.

Job Tracker: The node in a Hadoop cluster that initiates and coordinates MapReduce jobs or the processing of the data.

Slave Nodes: The grunts of any Hadoop cluster, slave nodes store data and take direction to process it from the Job Tracker.


In addition to these components, the Hadoop ecosystem is made up of a number of complementary subprojects. NoSQL data stores like Cassandra and HBase are also used to store the results of MapReduce jobs in Hadoop. In addition to Java, some MapReduce jobs and other Hadoop functions are written in Pig, an open source language designed specifically for Hadoop. Hive is an open source data warehouse originally developed by Facebook that allows for analytic modeling within Hadoop. Here are the most commonly referenced subprojects for Hadoop.

HIVE Hive is a Hadoop-based data warehousing-like framework originally developed by Facebook. It allows users to write queries in an SQL-like language called HiveQL, which are then converted into MapReduce jobs. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence (BI) and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, and so forth.
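To see why Hive lowers the barrier for SQL programmers, consider the kind of aggregation a HiveQL user would write. The sketch below runs an ordinary GROUP BY against Python's built-in SQLite purely to illustrate the query style; on Hive, an equivalent HiveQL statement would be compiled into MapReduce jobs over files in HDFS, and HiveQL is close to, but not identical to, the SQL dialect used here. The table and rows are invented.

```python
import sqlite3

# A GROUP BY aggregation in the style a HiveQL user would write,
# executed here against an in-memory SQLite database for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE squares (color TEXT)")
conn.executemany("INSERT INTO squares VALUES (?)",
                 [("red",), ("blue",), ("red",), ("green",)])

rows = conn.execute(
    "SELECT color, COUNT(*) FROM squares GROUP BY color ORDER BY color"
).fetchall()
print(rows)  # [('blue', 1), ('green', 1), ('red', 2)]
```

The point is the programming model: the analyst writes a declarative query and never sees the map and reduce steps that Hive generates behind the scenes.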

PIG Pig is a Hadoop-based query language developed by Yahoo! It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).

HBASE HBase is a nonrelational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts, and deletes. eBay and Facebook use HBase heavily.

FLUME Flume is a framework for populating Hadoop with data. Agents are populated throughout one’s IT infrastructure—inside Web servers, application servers, and mobile devices, for example—to collect data and integrate it into Hadoop.

OOZIE Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages—such as MapReduce, Pig, and Hive—and then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed.

AMBARI Ambari is a Web-based set of tools for deploying, administering, and monitoring Apache Hadoop clusters. Its development is being led by engineers from Hortonworks, which includes Ambari in its Hortonworks Data Platform.

AVRO Avro is a data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls.

MAHOUT Mahout is a data mining library. It takes the most popular data mining algorithms for performing clustering, regression testing, and statistical modeling and implements them using the MapReduce model.

SQOOP Sqoop is a connectivity tool for moving data from non-Hadoop data stores—such as relational databases and data warehouses—into Hadoop. It allows users to specify the target location inside of Hadoop and instructs Sqoop to move data from Oracle, Teradata, or other relational databases to the target.

HCATALOG HCatalog is a centralized metadata management and sharing service for Apache Hadoop. It allows for a unified view of all data in Hadoop clusters and allows diverse tools, including Pig and Hive, to process any data elements without needing to know physically where in the cluster the data is stored.


Hadoop: The Pros and Cons

The main benefit of Hadoop is that it allows enterprises to process and analyze large volumes of unstructured and semistructured data, heretofore inaccessible to them, in a cost- and time-effective manner. Because Hadoop clusters can scale to petabytes and even exabytes of data, enterprises no longer must rely on sample data sets but can process and analyze all relevant data. Data scientists can apply an iterative approach to analysis, continually refining and testing queries to uncover previously unknown insights. It is also inexpensive to get started with Hadoop. Developers can download the Apache Hadoop distribution for free and begin experimenting with Hadoop in less than a day.

The downside to Hadoop and its myriad components is that they are immature and still developing. As with any young, raw technology, implementing and managing Hadoop clusters and performing advanced analytics on large volumes of unstructured data require significant expertise, skill, and training. Unfortunately, there is currently a dearth of Hadoop developers and data scientists available, making it impractical for many enterprises to maintain and take advantage of complex Hadoop clusters. Further, as Hadoop's myriad components are improved on by the community and new components are created, there is, as with any immature open source technology/approach, a risk of forking. Finally, Hadoop is a batch-oriented framework, meaning it does not support real-time data processing and analysis.

The good news is that some of the brightest minds in IT are contributing to the Apache Hadoop project, and a new generation of Hadoop developers and data scientists is coming of age. As a result, the technology is advancing rapidly, becoming both more powerful and easier to implement and manage. An ecosystem of vendors, both Hadoop-focused start-ups like Cloudera and Hortonworks and well-worn IT stalwarts like IBM, Microsoft, Teradata, and Oracle, are working to offer commercial, enterprise-ready Hadoop distributions, tools, and services to make deploying and managing the technology a practical reality for the traditional enterprise. Other bleeding-edge start-ups are working to perfect NoSQL (Not Only SQL) data stores capable of delivering near-real-time insights in conjunction with Hadoop. Technology Insights 9.2 provides a few facts to clarify some misconceptions about Hadoop.

TECHNOLOGY INSIGHTS 9.2 A Few Demystifying Facts about Hadoop

Although Hadoop and related technologies have been around for more than five years now, most people still have several misconceptions about Hadoop and related technologies such as MapReduce and Hive. The following list of 10 facts intends to clarify what Hadoop is and does relative to BI, as well as in which business and technology situations Hadoop-based BI, data warehousing, and analytics can be useful (Russom, 2013).

Fact #1. Hadoop consists of multiple products. We talk about Hadoop as if it's one monolithic piece of software, whereas it is actually a family of open source products and technologies overseen by the Apache Software Foundation (ASF). (Some Hadoop products are also available via vendor distributions; more on that later.)

The Apache Hadoop library includes (in BI priority order) HDFS, MapReduce, Hive, HBase, Pig, Zookeeper, Flume, Sqoop, Oozie, Hue, and so on. You can combine these in various ways, but HDFS and MapReduce (perhaps with HBase and Hive) constitute a useful technology stack for applications in BI, data warehousing, and analytics.

Fact #2. Hadoop is open source but available from vendors, too. Apache Hadoop’s open source software library is available from ASF at apache.org. For users desiring a more enterprise-ready package, a few vendors now offer Hadoop distributions that include additional administrative tools and technical support.


Fact #3. Hadoop is an ecosystem, not a single product. In addition to products from Apache, the extended Hadoop ecosystem includes a growing list of vendor products that integrate with or expand Hadoop technologies. One minute on your favorite search engine will reveal these.

Fact #4. HDFS is a file system, not a database management system (DBMS). Hadoop is primarily a distributed file system and lacks capabilities we would associate with a DBMS, such as indexing, random access to data, and support for SQL. That’s okay, because HDFS does things DBMSs cannot do.

Fact #5. Hive resembles SQL but is not standard SQL. Many of us are handcuffed to SQL because we know it well and our tools demand it. People who know SQL can quickly learn to hand-code Hive, but that doesn’t solve compatibility issues with SQL-based tools. TDWI feels that over time, Hadoop products will support standard SQL, so this issue will soon be moot.

Fact #6. Hadoop and MapReduce are related but don’t require each other. Developers at Google developed MapReduce before HDFS existed, and some variations of MapReduce work with a variety of storage technologies, including HDFS, other file systems, and some DBMSs.

Fact #7. MapReduce provides control for analytics, not analytics per se. MapReduce is a general-purpose execution engine that handles the complexities of network communication, parallel programming, and fault tolerance for any kind of application that you can hand code—not just analytics.

Fact #8. Hadoop is about data diversity, not just data volume. Theoretically, HDFS can manage the storage and access of any data type as long as you can put the data in a file and copy that file into HDFS. As outrageously simplistic as that sounds, it’s largely true, and it’s exactly what brings many users to Apache HDFS.

Fact #9. Hadoop complements a DW; it’s rarely a replacement. Most organizations have designed their DW for structured, relational data, which makes it difficult to wring BI value from unstructured and semistructured data. Hadoop promises to complement DWs by handling the multistructured data types most DWs can’t.

Fact #10. Hadoop enables many types of analytics, not just Web analytics. Hadoop gets a lot of press about how Internet companies use it for analyzing Web logs and other Web data, but other use cases exist. For example, consider the Big Data coming from sensory devices, such as robotics in manufacturing, RFID in retail, or grid monitoring in utilities. Older analytic applications that need large data samples—such as customer-base segmentation, fraud detection, and risk analysis—can benefit from the additional Big Data managed by Hadoop. Likewise, Hadoop’s additional data can expand 360-degree views to create a more complete and granular view.

NoSQL

A related new style of database called NoSQL (Not Only SQL) has emerged to, like Hadoop, process large volumes of multistructured data. However, whereas Hadoop is adept at supporting large-scale, batch-style historical analysis, NoSQL databases are aimed, for the most part (though there are some important exceptions), at serving up discrete data stored among large volumes of multistructured data to end-user and automated Big Data applications. This capability is sorely lacking from relational database technology, which simply can't maintain needed application performance levels at a Big Data scale.

In some cases, NoSQL and Hadoop work in conjunction. The aforementioned HBase, for example, is a popular NoSQL database modeled after Google BigTable that is often deployed on top of HDFS, the Hadoop Distributed File System, to provide low-latency, quick lookups in Hadoop. The downside of most NoSQL databases today is that they trade ACID (atomicity, consistency, isolation, durability) compliance for performance and scalability. Many also lack mature management and monitoring tools. Both of these shortcomings are in the process of being overcome by the open source NoSQL communities and a handful of vendors that are attempting to commercialize the various NoSQL databases. NoSQL databases currently available include HBase, Cassandra, MongoDB, Accumulo, Riak, CouchDB, and DynamoDB, among others. Application Case 9.3 shows the use of NoSQL databases at eBay. Although the case is a few years old, we include it to give you a flavor of how multiple datasets come together. Application Case 9.4 illustrates a social media application where the Hadoop infrastructure was used to compile a corpus of messages on Twitter to understand which types of users engage in which type of support for healthcare patients seeking information about chronic mental diseases.
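The access pattern that NoSQL databases optimize for, serving discrete records by key at application speed, can be illustrated with a toy document store. The class below is a hypothetical in-memory sketch, not the API of any real product: it shows schema-less documents retrieved in a single key lookup, while real systems such as HBase, Cassandra, or MongoDB add distribution, replication, persistence, and the performance/consistency trade-offs discussed above.

```python
# Minimal in-memory sketch of a document-style NoSQL store.
# Hypothetical API, invented for illustration only.
class DocumentStore:
    def __init__(self):
        self._docs = {}

    def put(self, key, document):
        # Schema-less: each document may carry different fields,
        # unlike the fixed rows and columns of a relational table.
        self._docs[key] = document

    def get(self, key):
        # Single-key lookup: the low-latency access pattern that
        # NoSQL stores are built to serve at Big Data scale.
        return self._docs.get(key)

store = DocumentStore()
store.put("user:42", {"name": "Ada", "likes": ["camera", "laptop"]})
store.put("item:7", {"title": "Camera", "price": 199.0})  # different fields

print(store.get("user:42")["name"])  # Ada
```

Note what is missing compared with an RDBMS: there is no schema, no join, and no multi-record transaction; those are exactly the features many NoSQL systems relax in exchange for horizontal scalability.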

eBay is one of the world's largest online marketplaces, enabling the buying and selling of practically anything. One of the keys to eBay's extraordinary success is its ability to turn the enormous volumes of data it generates into useful insights that its customers can glean directly from the pages they frequent. To accommodate eBay's explosive data growth—its data centers perform billions of reads and writes each day—and due to the increasing demand to process data at blistering speeds, eBay needed a solution that did not have the typical bottlenecks, scalability issues, and transactional constraints associated with common relational database approaches. The company also needed to perform rapid analysis on a broad assortment of the structured and unstructured data it captured.

The Solution: Integrated Real-Time Data and Analytics

Its Big Data requirements brought eBay to NoSQL technologies, specifically Apache Cassandra and DataStax Enterprise. Along with Cassandra and its high-velocity data capabilities, eBay was also drawn to the integrated Apache Hadoop analytics that come with DataStax Enterprise. The solution incorporates a scale-out architecture that enables eBay to deploy multiple DataStax Enterprise clusters across several different data centers using commodity hardware. The end result is that eBay is now able to process massive amounts of data at very high speeds far more cost-effectively than it could with the higher-cost proprietary system it had been using. Currently, eBay is managing a sizable portion of its data center needs—250TBs+ of storage—in Apache Cassandra and DataStax Enterprise clusters.

Additional technical factors that played a role in eBay's decision to deploy DataStax Enterprise so widely include the solution's linear scalability, high availability with no single point of failure, and outstanding write performance.

Handling Diverse Use Cases

eBay employs DataStax Enterprise for many different use cases. The following examples illustrate some of the ways the company is able to meet its Big Data needs with the extremely fast data handling and analytics capabilities the solution provides. Naturally, eBay experiences huge amounts of write traffic, which the Cassandra implementation in DataStax Enterprise handles more efficiently than any other RDBMS or NoSQL solution. eBay currently sees 6 billion+ writes per day across multiple Cassandra clusters and 5 billion+ reads (mostly offline) per day as well.

Application Case 9.3 eBay's Big Data Solution

One use case supported by DataStax Enterprise involves quantifying the social data eBay displays on its product pages. The Cassandra distribution in DataStax Enterprise stores all the information needed to provide counts for "like," "own," and "want" data on eBay product pages. It also provides the same data for the eBay "Your Favorites" page that contains all the items a user likes, owns, or wants, with Cassandra serving up the entire "Your Favorites" page. eBay provides this data through Cassandra's scalable counters feature.

Load balancing and application availability are important aspects of this particular use case. The DataStax Enterprise solution gave eBay architects the flexibility they needed to design a system that enables any user request to go to any data center, with a single DataStax Enterprise cluster spanning those centers. This design feature helps balance the incoming user load and eliminates any possible threat of application downtime. In addition to the line-of-business data powering the Web pages its customers visit, eBay is also able to perform high-speed analysis with the ability to maintain a separate data center running Hadoop nodes of the same DataStax Enterprise ring (see Figure 9.6).

Another use case involves the Hunch (an eBay sister company) "taste graph" for eBay users and items, which provides customer recommendations based on user interests. eBay's Web site is essentially a graph between all users and the items for sale. All events (bid, buy, sell, and list) are captured by eBay's systems and stored as a graph in Cassandra. The application sees more than 200 million writes daily and holds more than 40 billion pieces of data.

eBay also uses DataStax Enterprise for many time-series use cases in which processing high-volume, real-time data is a foremost priority. These include mobile notification logging and tracking (every time eBay sends a notification to a mobile phone or device it is logged in Cassandra), fraud detection, SOA request/response payload logging, and RedLaser (another eBay sister company) server logs and analytics.

Across all of these use cases is the common requirement of uptime. eBay is acutely aware of the need to keep its business up and open, and DataStax Enterprise plays a key part in that through its support of high-availability clusters. "We have to be ready for disaster recovery all the time. It's really great that Cassandra allows for active-active multiple data centers where we can read and write data anywhere, anytime," says eBay architect Jay Patel.

Questions for Discussion

1. Why did eBay need a Big Data solution?

2. What were the challenges, the proposed solution, and the obtained results?

Source: DataStax. Customer case studies. datastax.com/resources/casestudies/eBay (accessed October 2018).

[Figure 9.6: eBay's Multi-Data Center Deployment. Three data centers form a single Cassandra ring (topology NTS, RF 2:2:2); DNS-based load balancers front Data Centers 1 and 2, while Data Center 3 runs analytics nodes on DSE Hadoop for near-real-time analytics. Source: DataStax.]


On the Internet today, all users have the power to contribute as well as consume information. This power is used in many ways. On social network platforms such as Twitter, users are able to post information about their health condition as well as receive help on how best to manage those health conditions. Many users have wondered about the quality of information disseminated on social network platforms. Whereas the ability to author and disseminate health information on Twitter seems valuable to many users who use it to seek support for their disease, the authenticity of such information, especially when it originates from lay individuals, has been in doubt. Many users have asked, "How do I verify and trust information from nonexperts about how to manage a vital issue like my health condition?"

What types of users share and discuss what type of information? Do users with a large following discuss and share the same type of information as users with a smaller following? The number of followers of a user relates to the influence of a user. Characteristics of the information are measured in terms of the quality and objectivity of the tweets posted. A team of data scientists set out to explore the relationship between the number of followers a user had and the characteristics of information the user disseminated (Asamoah & Sharda, 2015).

Solution

Data was extracted from the Twitter platform using Twitter's API. The data scientists adapted the knowledge-discovery and data management model to manage and analyze this large set of data. The model was optimized for managing and analyzing Big Data derived from a social network platform and included phases for gaining domain knowledge, developing an appropriate Big Data platform, data acquisition and storage, data cleaning, data validation, data analysis, and results and deployment.

Technology Used

The tweets were extracted, managed, and analyzed using Cloudera's distribution of Apache Hadoop. The Apache Hadoop framework has several subprojects that support different kinds of data management activities. For instance, the Apache Hive subproject supported the reading, writing, and managing of the large tweet data set. Data analytics tools such as Gephi were used for social network analysis and R for predictive modeling. The team conducted two parallel analyses: social network analysis to understand the influence network on the platform and text mining to understand the content of tweets posted by users.
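At its core, the comparison the data scientists ran can be sketched as grouping tweets by author influence and contrasting a content measure across the groups. The mini-corpus, follower threshold, and 0-10 objectivity scores below are invented for illustration; they are not values from the study, which computed its measures over the full Hadoop-managed corpus using text mining.

```python
# Hypothetical mini-corpus standing in for the extracted tweets.
# Follower counts, the threshold, and objectivity scores (0-10 scale)
# are all invented for illustration.
tweets = [
    {"user": "a", "followers": 52000, "objectivity": 9},
    {"user": "b", "followers": 48000, "objectivity": 8},
    {"user": "c", "followers": 120, "objectivity": 4},
    {"user": "d", "followers": 300, "objectivity": 5},
]

THRESHOLD = 10_000  # assumed cutoff separating "influential" users

def mean_objectivity(group):
    """Average the objectivity score over a group of tweets."""
    return sum(t["objectivity"] for t in group) / len(group)

influential = [t for t in tweets if t["followers"] >= THRESHOLD]
noninfluential = [t for t in tweets if t["followers"] < THRESHOLD]

print("influential:", mean_objectivity(influential))        # 8.5
print("noninfluential:", mean_objectivity(noninfluential))  # 4.5
```

The real analysis differs in scale and sophistication (scores came from text mining, and influence was studied via network structure as well), but the group-and-compare logic is the same.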

What Was Found?

As noted earlier, tweets from both influential and noninfluential users were collected and analyzed. The results showed that the quality and objectivity of information disseminated by influential users was higher than that disseminated by noninfluential users. They also found that influential users controlled the flow of information in a network and that other users were more likely to follow their opinion on a subject. There was a clear difference between the type of information support provided by influential users versus the others. Influential users discussed more objective information regarding disease management—things such as diagnoses, medications, and formal therapies. Noninfluential users provided more information about emotional support and alternative ways of coping with such diseases. Thus, a clear difference between influential users and the others was evident.

From the nonexperts’ perspective, the data scientists portray how healthcare provision can be augmented by helping patients identify and use valuable resources on the Web for managing their disease condition. This work also helps identify how nonexperts can locate and filter healthcare information that may not necessarily be beneficial to the management of their health condition.

Application Case 9.4 Understanding Quality and Reliability of Healthcare Support Information on Twitter


M09_SHAR1552_11_GE_C09.indd 567 07/01/20 4:42 PM

568 Part III • Prescriptive Analytics and Big Data

SECTION 9.4 REVIEW QUESTIONS

1. What are the common characteristics of emerging Big Data technologies?

2. What is MapReduce? What does it do? How does it do it?

3. What is Hadoop? How does it work?

4. What are the main Hadoop components? What functions do they perform?

5. What is NoSQL? How does it fit into the Big Data analytics picture?

9.5 BIG DATA AND DATA WAREHOUSING

There is no doubt that the emergence of Big Data has changed and will continue to change data warehousing in a significant way. Until recently, enterprise data warehouses (Chapter 3 and online supplements) were the centerpiece of all decision support technologies. Now they have to share the spotlight with the newcomer, Big Data. The question popping up everywhere is whether Big Data and its enabling technologies, such as Hadoop, will replace data warehousing and its core technology, the RDBMS. Are we witnessing a data warehouse versus Big Data challenge (or, from the technology standpoint, Hadoop versus RDBMS)? In this section we explain why these questions are misguided and argue that such an either-or choice does not reflect the reality at this point in time.

In the last decade or so, we have seen significant improvement in the area of computer-based decision support systems, which can largely be credited to data warehousing and to technological advancements in both software and hardware to capture, store, and analyze data. As the size of the data increased, so did the capabilities of data warehouses. Some of these data warehousing advances included massively parallel processing (moving from one or a few processors to many parallel processors), storage area networks (easily scalable storage solutions), solid-state storage, in-database processing, in-memory processing, and columnar (column-oriented) databases, just to name a few. These advancements helped keep the increasing size of data under control while effectively serving the analytics needs of decision makers. What has changed the landscape in recent years is the variety and complexity of data, which made data warehouses incapable of keeping up. It is not the volume of the data but the variety and velocity that forced the world of IT to develop a new paradigm, which we now call "Big Data." Now that we have these two paradigms, data warehousing and Big Data, seemingly competing for the same job of turning data into actionable information, which one will prevail? Is this a fair question to ask? Or are we missing the big picture? In this section, we try to shed some light on this intriguing question.
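The advantage of the columnar (column-oriented) databases mentioned above can be illustrated with a plain-Python sketch. This is only an analogy of the storage layout, not a real columnar engine (which would add compression, vectorized execution, and disk-page management):

```python
# Row-oriented layout: all fields of each record are stored together.
rows = [
    {"customer": "A", "amount": 10.0, "region": "east"},
    {"customer": "B", "amount": 20.0, "region": "west"},
    {"customer": "C", "amount": 30.0, "region": "east"},
]

# Column-oriented layout: all values of each attribute are stored together.
columns = {
    "customer": ["A", "B", "C"],
    "amount": [10.0, 20.0, 30.0],
    "region": ["east", "west", "east"],
}

# An aggregate over one attribute touches only that column's contiguous
# values, instead of scanning every field of every row.
total = sum(columns["amount"])
print(total)  # 60.0
```

On disk, this locality is what lets analytic queries read a small fraction of the table's bytes.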

Questions for Discussion

1. What was the data scientists’ main concern regarding health information that is disseminated on the Twitter platform?

2. How did the data scientists ensure that nonexpert information disseminated on social media could indeed contain valuable health information?

3. Does it make sense that influential users would share more objective information whereas less influential users could focus more on subjective information? Why?

Sources: D. Asamoah & R. Sharda. (2015). "Adapting CRISP-DM Process for Social Network Analytics: Application to Healthcare." In AMCIS 2015 Proceedings. aisel.aisnet.org/amcis2015/BizAnalytics/GeneralPresentations/33/ (accessed October 2018). J. Sarasohn-Kahn. (2008). The Wisdom of Patients: Health Care Meets Online Social Media. Oakland, CA: California HealthCare Foundation.



As has been the case for many previous technology innovations, hype about Big Data and its enabling technologies, such as Hadoop and MapReduce, is rampant. Nonpractitioners as well as practitioners are overwhelmed by diverse opinions. Yet others have begun to recognize that people are missing the point in claiming that Hadoop replaces relational databases and is becoming the new data warehouse. It is easy to see where these claims originate because both Hadoop and data warehouse systems can run in parallel, scale up to enormous data volumes, and have shared-nothing architectures. At a conceptual level, one would think they are interchangeable. The reality is that they are not, and the differences between the two overwhelm the similarities. If they are not interchangeable, then how do we decide when to deploy Hadoop and when to use a data warehouse?

Use Cases for Hadoop

As we have covered earlier in this chapter, Hadoop is the result of new developments in computer and storage grid technologies. Using commodity hardware as a foundation, Hadoop provides a layer of software that spans the entire grid, turning it into a single system. Consequently, some major differentiators are obvious in this architecture:

• Hadoop is the repository and refinery for raw data.
• Hadoop is a powerful, economical, and active archive.

Thus, Hadoop sits at both ends of the large-scale data life cycle—first when raw data is born, and finally when data is retiring, but is still occasionally needed.

1. Hadoop as the repository and refinery. As volumes of Big Data arrive from sources such as sensors, machines, social media, and clickstream interactions, the first step is to capture all the data reliably and cost effectively. When data volumes are huge, the traditional single-server strategy does not work for long. Pouring the data into HDFS gives architects much needed flexibility. Not only can they capture hundreds of terabytes in a day, but they can also adjust the Hadoop configuration up or down to meet surges and lulls in data ingestion. This is accomplished at the lowest possible cost per gigabyte due to open source economics and leveraging commodity hardware.

Because the data is stored on local storage instead of storage area networks, Hadoop data access is often much faster, and it does not clog the network with terabytes of data movement. Once the raw data is captured, Hadoop is used to refine it. Hadoop can act as a parallel "ETL engine on steroids," leveraging handwritten or commercial data transformation technologies. Many of these raw data transformations require the unraveling of complex freeform data into structured formats. This is particularly true with clickstreams (or Web logs) and complex sensor data formats. Consequently, a programmer needs to separate the wheat from the chaff, identifying the valuable signal in the noise.
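The "ETL engine on steroids" idea, unraveling freeform clickstream lines into structured records while discarding noise, can be sketched as a simple map-style parser. The log format here is a made-up example, not any specific product's format:

```python
# Hypothetical raw Web-log lines: timestamp, user id, method, and path.
raw_clickstream = [
    "2019-03-01T10:15:02 user=42 GET /products/widget",
    "2019-03-01T10:15:07 user=42 GET /cart",
    "not a log line",  # noise: the "chaff" to discard
]

def parse_line(line):
    """Map step: turn one raw line into a structured record, or None for noise."""
    parts = line.split()
    if len(parts) != 4 or not parts[1].startswith("user="):
        return None
    return {"ts": parts[0], "user": parts[1][5:],
            "method": parts[2], "path": parts[3]}

# Keep the wheat, drop the chaff.
records = [r for r in (parse_line(l) for l in raw_clickstream) if r is not None]
print(records[0]["user"])  # prints: 42
```

In Hadoop, the same transformation would run as a map task over each input split, so it parallelizes across the cluster with no change to the parsing logic.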

2. Hadoop as the active archive. In a 2003 interview with ACM, Jim Gray claimed that hard disks could be treated as tape. Although it may take many more years for magnetic tape archives to be retired, today some portions of tape workloads are already being redirected to Hadoop clusters. This shift is occurring for two fundamental reasons. First, although it may appear inexpensive to store data on tape, the true cost comes with the difficulty of retrieval. Not only is the data stored offline, requiring hours if not days to restore, but tape cartridges themselves are also prone to degradation over time, making data loss a reality and forcing companies to factor in those costs. To make matters worse, tape formats change every couple of years, requiring organizations to either perform massive data migrations to the newest tape format or risk the inability to restore data from obsolete tapes.


Second, it has been shown that there is value in keeping historical data online and accessible. As in the clickstream example, keeping raw data on a spinning disk for a longer duration makes it easy for companies to revisit data when the context changes and new constraints need to be applied. Searching thousands of disks with Hadoop is dramatically faster and easier than spinning through hundreds of magnetic tapes. In addition, as disk densities continue to double every 18 months, it becomes economically feasible for organizations to hold many years' worth of raw or refined data in HDFS. Thus, the Hadoop storage grid is useful both in the preprocessing of raw data and the long-term storage of data. It's a true "active archive" because it not only stores and protects the data, but also enables users to quickly, easily, and perpetually derive value from it.

Use Cases for Data Warehousing

After nearly 30 years of investment, refinement, and growth, the list of features available in a data warehouse is quite staggering. Built on relational database technology using schemas and integrating BI tools, the major differences in this architecture are

• Data warehouse performance
• Integrated data that provides business value
• Interactive BI tools for end users

1. Data warehouse performance. Basic indexing, found in open source databases such as MySQL or Postgres, is a standard feature used to improve query response times or enforce constraints on data. More advanced forms, such as materialized views, aggregate join indexes, cube indexes, and sparse join indexes, enable numerous performance gains in data warehouses. However, the most important performance enhancement to date is the cost-based optimizer. The optimizer examines incoming SQL and considers multiple plans for executing each query as fast as possible. It achieves this by comparing the SQL request to the database design and to extensive data statistics that help identify the best combination of execution steps. In essence, the optimizer is like having a genius programmer examine every query and tune it for the best performance. Lacking an optimizer or data demographic statistics, a query that could run in minutes may take hours, even with many indexes. For this reason, database vendors are constantly adding new index types, partitioning, statistics, and optimizer features. For the past 30 years, every software release has been a performance release. As we will note at the end of this section, Hadoop is now gaining on traditional data warehouses in terms of query performance.
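The interplay of indexes and the optimizer's plan choice can be observed even in a small open source database. The sketch below uses SQLite (purely as an accessible stand-in for the enterprise optimizers described above) to show the plan switching from a full table scan to an index search once an index exists:

```python
import sqlite3

# In-memory database standing in for a (tiny) warehouse fact table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [(i % 100, float(i)) for i in range(1000)])

query = "SELECT SUM(amount) FROM sales WHERE customer_id = 42"

# Without an index, the optimizer's only option is a full table scan.
plan_before = cur.execute("EXPLAIN QUERY PLAN " + query).fetchall()

cur.execute("CREATE INDEX idx_customer ON sales (customer_id)")

# With the index in place, the optimizer chooses an index search instead.
plan_after = cur.execute("EXPLAIN QUERY PLAN " + query).fetchall()

print(plan_before[-1][-1])  # a SCAN of the whole table
print(plan_after[-1][-1])   # a SEARCH using idx_customer
```

A warehouse-grade cost-based optimizer goes much further, weighing statistics, join orders, and partitioning, but the principle is the same: the plan, not the SQL text, determines the runtime.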

2. Integrated data that provides business value. At the heart of any data warehouse is the promise to answer essential business questions. Integrated data is the unique foundation required to achieve this goal. Pulling data from multiple subject areas and numerous applications into one repository is the raison d'être for data warehouses. Data model designers and Extract, Transform, and Load (ETL) architects, armed with metadata, data-cleansing tools, and patience, must rationalize data formats, source systems, and the semantic meaning of the data to make it understandable and trustworthy. This creates a common vocabulary within the corporation so that critical concepts such as "customer," "end of month," and "price elasticity" are uniformly measured and understood. Nowhere else in the entire IT data center is data collected, cleaned, and integrated as it is in the data warehouse.
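The rationalization work described above, mapping different source formats onto one vocabulary, can be sketched as a small conforming transform. The two source records and field names here are hypothetical examples:

```python
# Two source systems describe the same customer with different fields/formats.
crm_record = {"cust_name": "ACME Corp", "cust_id": "0042"}
billing_record = {"customer": "acme corp", "id": 42, "balance": "1,250.00"}

def conform_crm(rec):
    """Map a CRM record onto the warehouse's common 'customer' schema."""
    return {"customer_id": int(rec["cust_id"]),
            "customer_name": rec["cust_name"].upper()}

def conform_billing(rec):
    """Map a billing record onto the same schema, normalizing the amount."""
    return {"customer_id": int(rec["id"]),
            "customer_name": rec["customer"].upper(),
            "balance": float(rec["balance"].replace(",", ""))}

a, b = conform_crm(crm_record), conform_billing(billing_record)
# Both sides now agree on what "customer" means and can be joined on it.
assert a["customer_id"] == b["customer_id"] == 42
```

Real ETL tooling adds metadata management, error handling, and lineage, but the core job is exactly this kind of schema and semantics reconciliation.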

3. Interactive BI tools. BI tools such as MicroStrategy, Tableau, IBM Cognos, and others provide business users with direct access to data warehouse insights. First, the business user can create reports and complex analyses quickly and easily using these tools. As a result, there is a trend in many data warehouse sites toward end-user self-service. Business users can easily demand more reports than IT has staffing to provide. More important than self-service, however, is that the users become intimately familiar with the data. They can run a report, discover they missed a metric or filter, make an adjustment, and run their report again, all within minutes. This process results in significant changes in business users' understanding of the business and their decision-making process. First, users stop asking trivial questions and start asking more complex strategic questions. Generally, the more complex and strategic the report, the more revenue and cost savings the user captures. This leads to some users becoming "power users" in a company. These individuals become wizards at teasing business value from the data and supplying valuable strategic information to the executive staff. Every data warehouse has anywhere from 2 to 20 power users. As noted in Section 9.8, all of these BI tools have begun to embrace Hadoop to be able to scale their offerings to larger data stores.

The Gray Areas (Any One of the Two Would Do the Job)

Even though there are several areas that differentiate one from the other, there are also gray areas where the data warehouse and Hadoop cannot be clearly discerned. In these areas either tool could be the right solution, doing either an equally good or a not-so-good job on the task at hand. Choosing one over the other depends on the requirements and the preferences of the organization. In many cases, Hadoop and the data warehouse work together in an information supply chain, and just as often, one tool is better for a specific workload (Awadallah & Graham, 2012). Table 9.1 illustrates the preferred platform (one versus the other, or equally likely) under a number of commonly observed requirements.

TABLE 9.1 When to Use Which Platform—Hadoop versus DW

Requirement                                                Data Warehouse   Hadoop
Low latency, interactive reports, and OLAP                       ✓
ANSI 2003 SQL compliance is required                             ✓             ✓
Preprocessing or exploration of raw unstructured data                          ✓
Online archives alternative to tape                                            ✓
High-quality cleansed and consistent data                        ✓             ✓
100s to 1,000s of concurrent users                               ✓             ✓
Discover unknown relationships in the data                                     ✓
Parallel complex process logic                                   ✓             ✓
CPU intense analysis                                                           ✓
System, users, and data governance                               ✓
Many flexible programming languages running in parallel                        ✓
Unrestricted, ungoverned sandbox explorations                                  ✓
Analysis of provisional data                                                   ✓
Extensive security and regulatory compliance                     ✓             ✓


Coexistence of Hadoop and Data Warehouse

There are several possible scenarios under which using a combination of Hadoop and relational DBMS-based data warehousing technologies makes more sense. Here are some of those scenarios (White, 2012):

1. Use Hadoop for storing and archiving multistructured data. A connector to a relational DBMS can then be used to extract required data from Hadoop for analysis by the relational DBMS. If the relational DBMS supports MapReduce functions, these functions can be used to do the extraction. The Vantage-Hadoop adaptor, for example, uses SQL-MapReduce functions to provide fast, two-way data loading between HDFS and the Vantage Database. Data loaded into the Vantage Database can then be analyzed using both SQL and MapReduce.

2. Use Hadoop for filtering, transforming, and/or consolidating multistructured data. A connector such as the Vantage-Hadoop adaptor can be used to extract the results from Hadoop processing to the relational DBMS for analysis.

3. Use Hadoop to analyze large volumes of multistructured data and publish the analytical results. In this application, Hadoop serves as the analytics platform, but the results can be posted back to the traditional data warehousing environment, a shared workgroup data store, or a common user interface.

4. Use a relational DBMS that provides MapReduce capabilities as an investigative computing platform. Data scientists can employ the relational DBMS (the Vantage Database system, for example) to analyze a combination of structured data and multistructured data (loaded from Hadoop) using a mixture of SQL processing and MapReduce analytic functions.

5. Use a front-end query tool to access and analyze data. Here, the data are stored in both Hadoop and the relational DBMS.

These scenarios support an environment where Hadoop and the relational DBMS are separate from each other and connectivity software is used to exchange data between the two systems (see Figure 9.7). The direction of the industry over the next few years will likely be toward more tightly coupled Hadoop and relational DBMS-based data warehouse technologies, both software and hardware. Such integration provides many benefits, including eliminating the need to install and maintain multiple systems, reducing data movement, providing a single metadata store for application development, and providing a single interface for both business users and analytical tools. The opening vignette (Section 9.1) provided an example of how data from a traditional data warehouse and two different unstructured data sets stored on Hadoop were integrated to create an analytics application to gain insight into a customer's interactions with a company before canceling an account. As a manager, you care about the insights you can derive from the data, not whether the data is stored in a structured data warehouse or a Hadoop cluster.

FIGURE 9.7 Coexistence of Hadoop and Data Warehouses. (The figure shows raw data streams such as sensor data, blogs, e-mail, Web data, documents, PDFs, images, and videos flowing into Hadoop for extract-and-transform processing, while operational systems such as CRM, SCM, ERP, legacy, and third-party sources feed the integrated data warehouse through extract, transform, and load; developer environments and business intelligence tools sit on top of both.) Source: "Hadoop and the Data Warehouse: When to Use Which," Teradata, 2012. Used with permission from Teradata Corporation.

SECTION 9.5 REVIEW QUESTIONS

1. What are the challenges facing data warehousing and Big Data? Are we witnessing the end of the data warehousing era? Why or why not?

2. What are the use cases for Big Data and Hadoop?

3. What are the use cases for data warehousing and RDBMS?

4. In what scenarios can Hadoop and RDBMS coexist?

9.6 IN-MEMORY ANALYTICS AND APACHE SPARK™

Hadoop utilizes a batch processing framework and lacks real-time processing capabilities. In the evolution of Big Data computing, in-memory analytics is an emerging processing technique to analyze data stored in in-memory databases. Because accessing data in memory is much faster than accessing data on a hard disk, in-memory processing is more efficient than batch processing. It also allows for the analysis of streaming data in real time.

In-memory analytics has several applications where low-latency execution is required. It can help build real-time dashboards for better insights and faster decision making. Real-time applications include understanding customer behavior and engagement, forecasting stock prices, optimizing airfares, predicting fraud, and several others.

The most popular tool supporting in-memory processing is Apache Spark™. It is a unified analytics engine that can process both batch and streaming data. Originally developed at the University of California, Berkeley, in 2009, Apache Spark uses in-memory computation to achieve high performance on large-scale data processing. By adopting an in-memory processing approach, Apache Spark runs faster than traditional Apache Hadoop. Moreover, it can be used interactively from the Java, Scala, Python, R, and SQL shells for writing data management and machine learning applications. Apache Spark can run on Apache Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. In addition, it can connect to different external data sources such as HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and others.

Apache Spark can be used to create machine learning, fog computing, graph, streaming, and real-time analytics applications. Several big market players in the analytics sector have adopted Apache Spark. Examples include Uber, Pinterest, Netflix, Yahoo, and eBay. Uber uses Apache Spark to detect fraudulent trips at scale. Pinterest measures user engagement in real time using Apache Spark. The recommendation engine of Netflix also utilizes the capabilities of Apache Spark. Yahoo, one of the early adopters of Apache Spark, has used it for creating business intelligence applications. Finally, eBay has used Apache Spark for data management and stream processing.


Architecture of Apache Spark™

Apache Spark works on a master-slave framework. A driver program talks to the master node, also known as the cluster manager, which manages the worker nodes. The execution of tasks takes place in the worker nodes, where executors run. The entry point of the engine is called a Spark Context. It acts as a bridge of communication between the application and the Spark execution environment, as represented in Figure 9.8. As discussed earlier, Spark can run in different modes. In standalone mode, it runs an application on different nodes in a cluster managed by Spark itself. In Hadoop mode, however, Spark uses the Hadoop cluster to run jobs and leverages HDFS and the MapReduce framework.
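The division of labor among driver, cluster manager, and executors can be mimicked in plain Python with a thread pool. This is only an analogy of the roles, not Spark itself; the pool stands in for the cluster manager, and each pooled task plays the part of an executor working on one partition:

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(partition):
    # The work an "executor" performs on its slice of the data.
    return [x * x for x in partition]

data = list(range(10))

# The "driver program" splits the dataset into partitions...
partitions = [data[i::3] for i in range(3)]

# ...and the pool (standing in for the cluster manager) hands each
# partition to a worker for execution.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_task, partitions))

# The driver collects and combines the workers' results.
flat = sorted(x for part in results for x in part)
print(flat)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

In a real cluster, the partitions live on different machines and the cluster manager also handles scheduling, data locality, and recovery from failed workers.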

The use of Big Data in business is growing rapidly in Asia Pacific, and Apache Spark is one of the most popular tools among companies in the region. It currently boasts the largest open-source community in Big Data, with over a thousand contributors from over 250 organizations.

Driving this impressive growth is Databricks, a company launched by the creators of Apache Spark that is now focusing specifically on Asia Pacific; of the thirteen regions that Databricks currently supports, five are located in the region. Databricks' turnover has grown continuously, generating over $100 million in annual recurring revenue. Its subscription revenue tripled during the last quarter of 2018, and much of it comes from Asia Pacific. Moreover, the company has recently raised $498.5 million in funding (partly contributed by Microsoft) to expand into the health, fintech, media, and entertainment sectors in the region.

One of Databricks' great successes in the region is Japan, where it has been adopted by a number of big corporations, including NTT Data Corporation, a partially owned subsidiary of Nippon Telegraph and Telephone, which has been developing systems using Hadoop/Spark since 2008, the early days of Big Data.

Questions for Discussion

1. Why is Databricks expanding in Asia Pacific?

2. Which sectors is Databricks likely to invest in during its expansion in Asia Pacific?

Sources: Databricks. (2019). "Company Profile." https://databricks.com/spark/about (accessed October 2019). Rosalie Chan. (2019). "Artificial Intelligence Startup Databricks Is Now Worth $2.75 Billion after Raising $250 Million from Andreessen Horowitz and Microsoft." Business Insider. https://www.businessinsider.com/databricks-2-billion-plus-valuation-funding-andreessen-horowitz-2019-2?r=US&IR=T (accessed October 2019). NTT. (2019). "Global Expansion of Apache Spark at NTT." https://www.ntt-review.jp/archive/ntttechnical.php?contents=ntr201802fa7.html (accessed October 2019).

Application Case 9.5 Databricks' Apache Spark™: Asia-Pacific Big Data Processing in Action

FIGURE 9.8 Apache Spark™ Architecture. (The figure shows the application's driver program creating a Spark Context, which communicates with the cluster manager; the cluster manager assigns tasks to worker nodes, each of which runs executors.)


A very important component of Apache Spark is the Resilient Distributed Dataset, commonly known as an RDD. It handles lineage, memory management, fault tolerance, and data partitioning across all nodes in a cluster. RDDs provide several transformation functions, such as map, filter, and join, that are performed on existing RDDs to create a new RDD. All transformations in Spark are lazy in nature; that is, Spark does not execute these operations until an action function is performed on the data. The action functions (e.g., count, reduce) print or return a value after an execution. This approach is called lazy evaluation. In Spark Streaming, a series of RDDs, known as a DStream, is used to process streaming data.
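Lazy evaluation can be demonstrated in plain Python with generators: chained "transformations" only describe a pipeline, and no work happens until an "action" consumes it. This is an analogy only; Python generators are not RDDs and provide none of Spark's fault tolerance or partitioning:

```python
log = []

def numbers():
    """Data source that records when each element is actually produced."""
    for x in range(5):
        log.append(f"produce {x}")
        yield x

# "Transformations": nothing executes yet; the pipeline is only described.
mapped = (x * 10 for x in numbers())
filtered = (x for x in mapped if x >= 20)
assert log == []  # lazy: no element has been produced so far

# "Action": consuming the pipeline finally triggers the computation.
result = sum(filtered)
print(result)    # 90  (20 + 30 + 40)
print(len(log))  # 5   (all elements were produced during the action)
```

Spark exploits the same idea at cluster scale: because nothing runs until an action, it can fuse transformations together and skip work whose output is never used.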

Getting Started with Apache Spark™

In this section, we explain how to get started with Apache Spark on a Quick Start (QS) version of Cloudera Hadoop. It begins with downloading the latest version of Cloudera QS Virtual Machine (VM) and concludes with running your Spark query.

Hardware and Software Requirements Check

• A computer with a 64-bit host operating system (Windows or Linux) and at least 12 GB RAM for good performance
• VMware Workstation Player: Download and install the latest (free) version of VMware Player from www.vmware.com/products/workstation-player/workstation-player-evaluation.html
• 8 GB memory and 20 GB free disk space for the VM
• 7-Zip: Extract (or unzip) the Cloudera Quick Start package using 7-Zip, available from www.7-zip.org/

Steps to follow to get started with Spark on the Cloudera QS VM:

1. Download the Cloudera QS VM from www.cloudera.com/downloads/quickstart_vms/5-13.html

2. Unzip it with 7-Zip. The downloaded file contains a VM image.

3. Install VMware Workstation Player and turn it on. Then open the Cloudera VM image through VMware Player (Player > File > Open > full_path_of_vmx file).

4. Before turning on the VM, you must configure the memory and processor settings. The default memory on the VM will be 4 GB RAM. Click "Edit virtual machine settings" to change the settings. Make sure the RAM is more than 8 GB and the number of processor cores is 2.

5. Turn on the machine. Cloudera has installed Hadoop and its components on CentOS Linux.

6. A default user named "cloudera" with password "cloudera" is already available.

7. On the desktop of the VM, open "Launch Cloudera Express." The engine will take a few minutes to get started.


8. Once started, open the web browser inside the VM. You will find an icon at the top of the Cloudera Desktop.

9. Log into Cloudera Manager using username "cloudera" and password "cloudera."

10. To use HDFS and MapReduce, start two services, HDFS and YARN, using the drop-down menu in front of each.

11. To turn on Spark, start the Spark service.

12. To run queries on Spark, we can use Python or Scala programming. Open a Terminal by right-clicking on the Desktop of the VM.

13. Type pyspark to enter the Python shell. To exit the Python shell, type exit().



14. Type spark-shell to enter the Scala Spark shell. To exit the Scala Spark shell, type exit.

15. From here onward, we describe the steps to run a Spark streaming word-count application in which the count of words is calculated interactively. We run this application in the Scala Spark shell. To use Spark Streaming interactively, we need to run the Scala Spark shell with at least two threads. To do so, type spark-shell --master local[2].

a. Next, to run a streaming application, we need to import three related classes, one by one.


b. After importing the required classes, create a Spark Streaming Context sss with a batch duration of 10 seconds.

c. Create a discretized stream (DStream), the basic abstraction in Spark Streaming, to read text from port 1111.

d. To count the occurrences of words on the stream, run the MapReduce code (the full listing appears in step i). Then the count.print() command is used to print the word count for each 10-second batch.

e. At this point, open a new terminal and run the command nc -lkv 1111.

f. To start the streaming context, run the sss.start() command in the Spark shell. This connects the DStream sss with the socket opened in the second terminal.


g. In the final step, run sss.awaitTermination() in the Spark shell and start typing some words in the second terminal. Every 10 seconds, the word-count pairs will be calculated and printed in the Spark shell.

h. To stop the process, close the second terminal and press CTRL + C in the Spark shell terminal.

i. Because you may want to run the application again, all commands are listed here:

spark-shell --master local[2]
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.Seconds
val sss = new StreamingContext(sc, Seconds(10))
val firststream = sss.socketTextStream("localhost", 1111)
val words = firststream.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val count = pairs.reduceByKey(_ + _)
count.print()
sss.start()
sss.awaitTermination()
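For readers without a Spark installation, the same batched word-count logic can be sketched in plain Python. The timestamps below stand in for arrival times on the socket; this mimics the tumbling 10-second batches, not Spark Streaming itself:

```python
from collections import Counter, defaultdict

# (arrival_second, line) pairs standing in for text arriving on port 1111.
stream = [(1, "big data"), (4, "data"), (12, "spark spark"), (15, "big")]

BATCH = 10  # seconds, like Seconds(10) in the Spark listing above

# Group lines into tumbling 10-second batches by integer-dividing the time.
batches = defaultdict(list)
for t, line in stream:
    batches[t // BATCH].append(line)

# flatMap + map + reduceByKey collapses to a word count per batch.
counts = {b: Counter(w for line in lines for w in line.split())
          for b, lines in sorted(batches.items())}
print(counts[0])  # words seen in seconds 0-9
print(counts[1])  # words seen in seconds 10-19
```

Spark performs the same grouping and counting, but distributes each batch's work across the executors and emits results as the batches close.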

SECTION 9.6 REVIEW QUESTIONS

1. What are some of the unique features of Spark as compared to Hadoop?

2. Give examples of companies that have adopted Apache Spark. Find new examples online.

3. Run the exercise as described in this section. What do you learn from this exercise?

9.7 BIG DATA AND STREAM ANALYTICS

Along with volume and variety, as we have seen earlier in this chapter, one of the key characteristics that defines Big Data is velocity, which refers to the speed at which the data is created and streamed into the analytics environment. Organizations are looking for new means to process streaming data as it comes in so they can react quickly and accurately to problems and opportunities, please their customers, and gain a competitive advantage. In situations where data streams in rapidly and continuously, traditional analytics approaches that work with previously accumulated data (i.e., data at rest) often either arrive at the wrong decisions because of using too much out-of-context data, or arrive at the correct decisions but too late to be of any use to the organization. Therefore, for a number of business situations it is critical to analyze the data soon after it is created and/or as soon as it is streamed into the analytics system.

The presumption that the vast majority of modern-day businesses currently live by is that it is important and critical to record every piece of data because it might contain valuable information now or sometime in the near future. However, as the number of data sources increases, the "store-everything" approach becomes harder and harder and, in some cases, not even feasible. In fact, despite technological advances, current total storage capacity lags far behind the digital information being generated in the world. Moreover, in the constantly changing business environment, real-time detection of meaningful changes in data, as well as of complex pattern variations within a given short time window, is essential for coming up with actions that better fit the new environment. These facts became the main triggers for a paradigm that we call stream analytics. The stream analytics paradigm was born as an answer to these challenges: unbounded flows of data that cannot be permanently stored to be subsequently analyzed in a timely and efficient manner, and complex pattern variations that need to be detected and acted on as soon as they happen.

Stream analytics (also called data-in-motion analytics and real-time data analytics, among other names) is a term commonly used for the analytic process of extracting actionable information from continuously flowing/streaming data. A stream is defined as a continuous sequence of data elements (Zikopoulos et al., 2013). The data elements in a stream are often called tuples. In a relational database sense, a tuple is similar to a row of data (a record, an object, an instance). However, in the context of semistructured or unstructured data, a tuple is an abstraction that represents a package of data, which can be characterized as a set of attributes for a given object. If a tuple by itself is not sufficiently informative for analysis, or a correlation or other collective relationship among tuples is needed, then a window of data that includes a set of tuples is used. A window of data is a finite number/sequence of tuples, where the window is continuously updated as new data become available. The size of the window is determined based on the system being analyzed. Stream analytics is becoming increasingly popular for two reasons: first, time-to-action has become an ever-decreasing value, and second, we now have the technological means to capture and process the data while it is created.
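The tuple-and-window idea above can be sketched in a few lines of Python. This is a minimal illustration of a count-based window; the attribute name `temp` and the window size are assumptions for the example, not from the text.

```python
from collections import deque

class SlidingWindow:
    """Keep only the most recent `size` tuples of a stream (a count-based window)."""
    def __init__(self, size):
        self.tuples = deque(maxlen=size)  # oldest tuples are evicted automatically

    def push(self, tup):
        self.tuples.append(tup)

    def aggregate(self, key):
        """A collective statistic over the window, e.g., the mean of one attribute."""
        values = [t[key] for t in self.tuples]
        return sum(values) / len(values)

# Each stream element is a "tuple": a set of attributes for one observation.
window = SlidingWindow(size=3)
for reading in [{"temp": 20}, {"temp": 22}, {"temp": 30}, {"temp": 28}]:
    window.push(reading)

print(window.aggregate("temp"))  # mean over the last 3 tuples only
```

Because the window is continuously updated as new tuples arrive, the aggregate always reflects the most recent data rather than the full history.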

Some of the most impactful applications of stream analytics were developed in the energy industry, specifically for smart grid (electric power supply chain) systems. The new smart grids are capable not only of real-time creation and processing of multiple streams of data to determine optimal power distribution to fulfill real customer needs, but also of generating accurate short-term predictions aimed at covering unexpected demand and renewable energy generation peaks. Figure 9.9 shows a depiction of a generic use case for streaming analytics in the energy industry (a typical smart grid application). The goal is to accurately predict electricity demand and production in real time by using streaming data coming from smart meters, production system sensors, and meteorological models. The ability to predict near-future consumption/production trends and detect anomalies in real time can be used to optimize supply decisions (how much to produce, what sources of production to use, and how to optimally adjust production capacities) as well as to adjust smart meters to regulate consumption and enable favorable energy pricing.

Stream Analytics versus Perpetual Analytics

The terms streaming and perpetual probably sound like the same thing to most people, and in many cases they are used synonymously. However, in the context of intelligent systems, there is a difference (Jonas, 2007). Streaming analytics involves applying transaction-level logic to real-time observations. The rules applied to these observations take into account previous observations as long as they occurred in the prescribed window; these windows have some arbitrary size (e.g., last 5 seconds, last 10,000 observations). Perpetual analytics, on the other hand, evaluates every incoming observation against all prior observations, with no window size. Recognizing how the new observation relates to all prior observations enables the discovery of real-time insight.

Both streaming and perpetual analytics have their pros and cons and their respective places in the business analytics world. For example, sometimes transactional volumes are high and the time-to-decision is too short, favoring nonpersistence and small window sizes, which translates into using streaming analytics. However, when the mission is critical and transaction volumes can be managed in real time, perpetual analytics is the better answer. That way, one can answer questions such as “How does what I just learned relate to what I have known?” “Does this matter?” and “Who needs to know?”

Chapter 9 • Big Data, Cloud Computing, and Location Analytics: Concepts and Tools 581
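The streaming-versus-perpetual distinction can be illustrated with a toy anomaly rule. The "twice the baseline" check and the window size of 5 are invented for the sketch: the streaming version judges each observation only against a fixed-size window, while the perpetual version judges it against every observation ever seen.

```python
from collections import deque

history = []              # perpetual analytics: every prior observation is kept
window = deque(maxlen=5)  # streaming analytics: only the last 5 observations

def is_anomalous_streaming(x):
    """Judge x only against the current window (arbitrary, fixed size)."""
    baseline = sum(window) / len(window) if window else x
    window.append(x)
    return x > 2 * baseline

def is_anomalous_perpetual(x):
    """Judge x against everything seen so far: 'How does what I just
    learned relate to what I have known?'"""
    baseline = sum(history) / len(history) if history else x
    history.append(x)
    return x > 2 * baseline

for x in [10, 11, 9, 10, 50]:
    s, p = is_anomalous_streaming(x), is_anomalous_perpetual(x)
print(s, p)  # the spike of 50 is flagged by both here; they diverge as history grows
```

On a short stream the two agree; as history grows beyond the window, the windowed rule forgets context that the perpetual rule retains.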

Critical Event Processing

Critical event processing is a method of capturing, tracking, and analyzing streams of data to detect events (departures from normal activity) of certain types that are worth the effort. Complex event processing is an application of stream analytics that combines data from multiple sources to infer events or patterns of interest either before they actually occur or as soon as they happen. The goal is to take rapid action to prevent these events (e.g., fraud or network intrusion) from occurring, or to mitigate their negative effects, or, in the case of a short window of opportunity, to take full advantage of the situation within the allowed time (e.g., based on user behavior on an e-commerce site, create promotional offers that users are more likely to respond to).

These critical events may be happening across the various layers of an organization, such as sales leads, orders, or customer service calls. More broadly, they may be news items, text messages, social media posts, stock market feeds, traffic reports, weather conditions, or other kinds of anomalies that may have a significant impact on the well-being of the organization. An event may also be defined generically as a “change of state,” which may be detected as a measurement exceeding a predefined threshold of time, temperature, or some other value. Even though there is no denying the value proposition of critical event processing, one has to be selective in what to measure, when to measure, and how often to measure. Because of the vast amount of information available about events, sometimes referred to as the event cloud, there is a risk of overdoing it, in which case event processing may hurt rather than help operational effectiveness.
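The "change of state" idea above can be sketched as a minimal threshold-based event detector. The metric names and limits are invented for illustration, not taken from any particular system.

```python
# An event is a "change of state": here, a measurement crossing a
# predefined threshold. Metric names and limits are illustrative only.
THRESHOLDS = {"temperature_c": 85.0, "response_ms": 500.0}

def detect_events(stream):
    """Yield (metric, value) for each observation exceeding its threshold."""
    for obs in stream:
        for metric, limit in THRESHOLDS.items():
            if obs.get(metric, 0) > limit:
                yield (metric, obs[metric])

readings = [
    {"temperature_c": 70.1, "response_ms": 120.0},
    {"temperature_c": 90.5, "response_ms": 130.0},  # over-temperature event
    {"temperature_c": 72.0, "response_ms": 640.0},  # slow-response event
]
events = list(detect_events(readings))
print(events)  # [('temperature_c', 90.5), ('response_ms', 640.0)]
```

A real complex event processing engine would correlate events across multiple sources and time; this sketch shows only the single-stream threshold case.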

FIGURE 9.9 A Use Case of Streaming Analytics in the Energy Industry. [Figure: usage data from smart meters and smart grid devices, sensor data on energy production system status, and meteorological data (wind, light, temperature, etc.) feed a streaming analytics engine that predicts usage, production, and anomalies; via data integration and temporary staging and a permanent storage area, its outputs drive capacity and pricing decisions linking the energy production system (traditional and renewable) with the energy consumption system (residential and commercial).]


Data Stream Mining

Data stream mining, as an enabling technology for stream analytics, is the process of extracting novel patterns and knowledge structures from continuous, rapid data records. As we saw in the data mining chapter (Chapter 4), traditional data mining methods require the data to be collected and organized in a proper file format, and then processed in a recursive manner to learn the underlying patterns. In contrast, a data stream is a continuous flow of an ordered sequence of instances that in many applications of data stream mining can be read/processed only once or a small number of times using limited computing and storage capabilities. Examples of data streams include sensor data, computer network traffic, phone conversations, ATM transactions, Web searches, and financial data. Data stream mining is considered a subfield of data mining, machine learning, and knowledge discovery.

In many data stream mining applications, the goal is to predict the class or value of new instances in the data stream given some knowledge about the class membership or values of previous instances in the data stream. Specialized machine-learning techniques (mostly derivatives of traditional machine-learning techniques) can be used to learn this prediction task from labeled examples in an automated fashion. An example of such a prediction method was developed by Delen, Kletke, and Kim (2005), who gradually built and refined a decision tree model by using a subset of the data at a time.
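The single-pass constraint of data stream mining can be illustrated with Welford's online algorithm, which maintains a stream's mean and variance while seeing each element exactly once. This is a generic sketch of stream-friendly computation, not the decision-tree method of Delen, Kletke, and Kim.

```python
class RunningStats:
    """Single-pass summary of a stream (Welford's online algorithm).
    Each element is read once and then discarded, as data stream
    mining's limited storage requires."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Population variance; no raw data is retained to recompute it.
        return self.m2 / self.n if self.n > 0 else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
print(stats.mean, stats.variance())  # mean ≈ 5.0, population variance ≈ 4.0
```

The same update-as-you-go pattern underlies incremental learners for streams, which refine a model with each arriving instance instead of retraining on a stored file.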

Applications of Stream Analytics

Because of its power to create insight instantly, helping decision makers stay on top of events as they unfold and allowing organizations to address issues before they become problems, the use of streaming analytics is on an exponentially increasing trend. The following are some of the application areas that have already benefited from stream analytics.

e-Commerce

Companies like Amazon and eBay (among many others) are trying to make the most of the data that they collect while a customer is on their Web site. Every page visit, every product looked at, every search conducted, and every click made is recorded and analyzed to maximize the value gained from a user’s visit. If done quickly, analysis of such a stream of data can turn browsers into buyers and buyers into shopaholics. When we visit an e-commerce Web site, even one where we are not a member, after a few clicks here and there we start to get very interesting product and bundle price offers. Behind the scenes, advanced analytics are crunching the real-time data coming from our clicks, and the clicks of thousands of others, to “understand” what it is that we are interested in (in some cases, before even we know it) and make the most of that information by making creative offerings.
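A toy sketch of the clickstream idea, under invented assumptions (hypothetical click events with `action` and `category` fields, and a made-up three-view threshold): scan a session's clicks once and surface a product category worth a promotional offer.

```python
from collections import Counter

def offer_for_session(clicks, min_views=3):
    """Scan one session's clickstream; if any product category draws
    min_views or more views, return it as a candidate for a bundle offer.
    The threshold and event fields are illustrative assumptions."""
    views = Counter(c["category"] for c in clicks if c["action"] == "view")
    top, count = views.most_common(1)[0] if views else (None, 0)
    return top if count >= min_views else None

session = [
    {"action": "view", "category": "headphones"},
    {"action": "search", "category": "laptops"},
    {"action": "view", "category": "headphones"},
    {"action": "view", "category": "laptops"},
    {"action": "view", "category": "headphones"},
]
print(offer_for_session(session))  # headphones (3 views) triggers an offer
```

Production systems would blend many more signals (the clicks of thousands of other users, purchase history, prices), but the session-window scan is the core pattern.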

Telecommunications

The volume of data that comes from call detail records (CDRs) for telecommunications companies is astounding. Although this information has been used for billing purposes for quite some time, there is a wealth of knowledge buried deep inside this Big Data that telecommunications companies are just now learning to tap. For instance, CDR data can be analyzed to prevent churn by identifying networks of callers, influencers, leaders, and followers within those networks and proactively acting on this information. As we all know, influencers and leaders can change the perception of the followers within their network toward the service provider, either positively or negatively. Using social network analysis techniques, telecommunications companies are identifying the leaders and influencers and their network participants to better manage their customer base. In addition to churn analysis, such information can also be used to recruit new members and maximize the value of the existing members.
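The influencer idea can be sketched with simple degree centrality over a call graph built from CDR pairs. Real systems use far richer social network measures, and the caller/callee records here are made up.

```python
from collections import defaultdict

def influencers(cdrs, top_n=2):
    """Rank subscribers by how many distinct people they talk to (degree
    centrality), a crude proxy for 'influencers' in the call network."""
    contacts = defaultdict(set)
    for caller, callee in cdrs:
        contacts[caller].add(callee)   # treat calls as undirected ties
        contacts[callee].add(caller)
    return sorted(contacts, key=lambda s: len(contacts[s]), reverse=True)[:top_n]

calls = [("ann", "bob"), ("ann", "cal"), ("ann", "dee"),
         ("bob", "cal"), ("eve", "ann")]
print(influencers(calls))  # ann talks to 4 distinct people; bob and cal to 2 each
```

Measures such as betweenness or PageRank-style centrality would better capture who actually shapes opinion, but degree over the CDR graph is the usual starting point.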


Application Case 9.6 Salesforce Is Using Streaming Data to Enhance Customer Value

Salesforce has expanded its Marketing Cloud services to include Predictive Scores and Predictive Audience features called the Marketing Cloud Predictive Journey. This addition uses real-time streaming data to enhance customer engagement online. First, each customer is given a unique Predictive Score. This score is calculated from several factors, including how long their browsing history is, whether they clicked an e-mail link, whether they made a purchase, how much they spent, how long ago they made a purchase, and whether they have ever responded to an e-mail or ad campaign. Once customers have a score, they are segmented into different groups. These groups are given different marketing objectives and plans based on the predictive behaviors assigned to them. The scores and segments are updated daily and give companies a better road map to target and achieve a desired response. These marketing solutions are more accurate and create more personalized ways for companies to improve their customer retention methods.

Questions for Discussion

1. Are there areas in any industry where streaming data is irrelevant?

2. Besides customer retention, what are other benefits of using predictive analytics?

What Can We Learn from This Case?

Through the analysis of data acquired in the here and now, companies are able to make predictions and decisions about their consumers more rapidly. This ensures that businesses target, attract, and retain the right customers and maximize their value. Data acquired last week is not as beneficial as the data companies have today. Using relevant data makes predictive analysis more accurate and efficient.

Sources: M. Amodio. (2015). “Salesforce Adds Predictive Analytics to Marketing Cloud.” Cloud Contact Center. www.cloudcontactcenterzone.com/topics/cloud-contact-center/articles/413611-salesforce-adds-predictive-analytics-marketing-cloud.htm (accessed October 2018). J. Davis. (2015). “Salesforce Adds New Predictive Analytics to Marketing Cloud.” Information Week. www.informationweek.com/big-data/big-data-analytics/salesforce-adds-new-predictive-analytics-to-marketing-cloud/d/d-id/1323201 (accessed October 2018). D. Henschen. (2016). “Salesforce Reboots Wave Analytics, Preps IoT Cloud.” ZDNet. www.zdnet.com/article/salesforce-reboots-wave-analytics-preps-iot-cloud/ (accessed October 2018).
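A hypothetical sketch of the scoring-and-segmentation idea described in the Salesforce case: compute an engagement score from factors like those mentioned (clicks, purchases, recency) and bucket customers into segments. The weights, field names, and cutoffs are invented; Salesforce's actual model is proprietary.

```python
def predictive_score(customer):
    """Toy engagement score; all weights are invented for illustration."""
    score = 0.0
    score += 2.0 * customer.get("email_clicks", 0)        # campaign response
    score += 5.0 * customer.get("purchases", 0)           # buying behavior
    score -= 0.1 * customer.get("days_since_purchase", 0) # recency penalty
    return score

def segment(score):
    """Map a score to a marketing segment (cutoffs are assumptions)."""
    if score >= 10:
        return "high-intent"
    if score >= 3:
        return "nurture"
    return "re-engage"

c = {"email_clicks": 2, "purchases": 2, "days_since_purchase": 30}
s = predictive_score(c)
print(segment(s))  # 2*2 + 5*2 - 0.1*30 ≈ 11 -> high-intent
```

Re-running the scoring daily over streaming click and purchase data is what keeps the segments current, as the case describes.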

Continuous streams of data that come from CDRs can be combined with social media data (sentiment analysis) to assess the effectiveness of marketing campaigns. Insight gained from these data streams can be used to rapidly react to adverse effects (which may lead to loss of customers) or to boost the impact of positive effects (which may lead to maximizing purchases of existing customers and recruitment of new customers) observed in these campaigns. Furthermore, the process of gaining insight from CDRs can be replicated for data networks using Internet protocol detail records. Because most telecommunications companies provide both of these service types, a holistic optimization of all offerings and marketing campaigns could lead to extraordinary market gains. Application Case 9.6 is an example of how Salesforce.com gets a better sense of its customers based on an analysis of clickstreams.

Law Enforcement and Cybersecurity

Streams of Big Data provide excellent opportunities for improved crime prevention, law enforcement, and enhanced security. They offer unmatched potential when it comes to security applications that can be built in this space, such as real-time situational awareness, multimodal surveillance, cybersecurity detection, legal wiretapping, video surveillance, and face recognition (Zikopoulos et al., 2013). As an application of information assurance, enterprises can use streaming analytics to detect and prevent network intrusions, cyberattacks, and malicious activities by streaming and analyzing network logs and other Internet activity monitoring resources.

Power Industry

Because of the increasing use of smart meters, the amount of real-time data collected by power utilities is increasing exponentially. Moving from one reading a month to one every 15 minutes (or more frequently), meter reading accumulates large quantities of invaluable data for power utilities. These smart meters and other sensors placed all around the power grid send information back to the control centers to be analyzed in real time. Such analyses help utility companies optimize their supply chain decisions (e.g., capacity adjustments, distribution network options, real-time buying or selling) based on up-to-the-minute consumer usage and demand patterns. In addition, utility companies can integrate weather and other natural-conditions data into their analytics to optimize power generation from alternative sources (e.g., wind, solar) and to better forecast energy demand at different geographic granularities. Similar benefits also apply to other utilities such as water and natural gas.
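One way such real-time meter analysis might look, as a hedged sketch: compare each 15-minute reading to a running baseline for that time-of-day slot and flag unusual consumption. The slot scheme and the 50% tolerance are assumptions for the example.

```python
from collections import defaultdict

class SlotBaseline:
    """Compare each 15-minute smart-meter reading to the running average
    for that time-of-day slot; flag readings far from the baseline.
    Slot keys and the tolerance are illustrative assumptions."""
    def __init__(self, tolerance=0.5):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)
        self.tolerance = tolerance

    def check(self, slot, kwh):
        anomalous = False
        if self.counts[slot] > 0:
            avg = self.totals[slot] / self.counts[slot]
            anomalous = abs(kwh - avg) > self.tolerance * avg
        self.totals[slot] += kwh   # update the baseline with every reading
        self.counts[slot] += 1
        return anomalous

meter = SlotBaseline()
flags = [meter.check("08:15", kwh) for kwh in [1.0, 1.1, 0.9, 2.4]]
print(flags)  # the jump to 2.4 kWh stands out against the 08:15 baseline
```

Keeping one baseline per time slot is a crude way to respect daily usage cycles; a utility would also fold in weather, season, and neighborhood-level signals.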

Financial Services

Financial service companies are among the prime examples where analysis of Big Data streams can provide faster and better decisions, competitive advantage, and regulatory oversight. The ability to analyze high volumes of fast-paced trading data at very low latency across markets and countries offers a tremendous advantage in making the split-second buy/sell decisions that potentially translate into big financial gains. In addition to optimal buy/sell decisions, stream analytics can also help financial service companies with real-time trade monitoring to detect fraud and other illegal activities.

Health Sciences

Modern-era medical devices (e.g., electrocardiograms and equipment that measures blood pressure, blood oxygen level, blood sugar level, and body temperature) are capable of producing invaluable streaming diagnostic/sensory data at a very fast rate. Harnessing this data and analyzing it in real time offers benefits, the kind we often call “life and death,” unlike any other field. In addition to helping healthcare companies become more effective and efficient (and hence more competitive and profitable), stream analytics is also improving patient conditions and saving lives.

Many hospital systems around the world are developing futuristic care infrastructures and health systems. These systems aim to take full advantage of what the technology has to offer, and more. Using hardware devices that generate high-resolution data at a very rapid rate, coupled with super-fast computers that can synergistically analyze multiple streams of data, increases the chances of keeping patients safe by quickly detecting anomalies. These systems are meant to help human decision makers make faster and better decisions by being exposed to a multitude of information as soon as it becomes available.

Government

Governments around the world are trying to find ways to be more efficient (via optimal use of limited resources) and effective (providing the services that people need and want). As the practices for e-government become mainstream, coupled with widespread use of and access to social media, very large quantities of data (both structured and unstructured) are at the disposal of government agencies. Proper and timely use of these Big Data streams differentiates proactive and highly efficient agencies from those still using traditional methods to react to situations as they unfold. Another way in which government agencies can leverage real-time analytics capabilities is to manage natural disasters such as snowstorms, hurricanes, tornadoes, and wildfires through surveillance of streaming data coming from radar, sensors, and other smart detection devices. They can also use similar approaches to monitor water quality, air quality, and consumption patterns and detect anomalies before they become significant problems. Another area where government agencies use stream analytics is traffic management in congested cities. By using the data coming from traffic flow cameras, GPS data coming from commercial vehicles, and traffic sensors embedded in roadways, agencies are able to change traffic light sequences and traffic flow lanes to ease the pain caused by traffic congestion problems.

u SECTION 9.7 REVIEW QUESTIONS

1. What is a stream (in the Big Data world)?

2. What are the motivations for stream analytics?

3. What is stream analytics? How does it differ from regular analytics?

4. What is critical event processing? How does it relate to stream analytics?

5. Define data stream mining. What additional challenges are posed by data stream mining?

6. What are the most fruitful industries for stream analytics?

7. How can stream analytics be used in e-commerce?

8. In addition to what is listed in this section, can you think of other industries and/or application areas where stream analytics can be used?

9. Compared to regular analytics, do you think stream analytics will have more (or less) use cases in the era of Big Data analytics? Why?

9.8 BIG DATA VENDORS AND PLATFORMS

The Big Data vendor landscape is developing very rapidly. As is the case with many emerging technologies, even the terms change. Many Big Data technologies or solutions providers have rechristened themselves to be AI providers. In this section, we will do a quick overview of several categories of Big Data providers. Then we will briefly describe one provider’s platform.

One way to study Big Data vendors and platforms is to go back to Chapter 1’s analytics ecosystem (depicted in Figure 1.17). If we focus on some of the outermost petals of that analytics flower, we can see some categories of Big Data platform offerings. A more detailed classification of Big Data/AI providers is also included in Matt Turck’s Big Data Ecosystem blog and the associated figure available at http://mattturck.com/wp-content/uploads/2018/07/Matt_Turck_FirstMark_Big_Data_Landscape_2018_Final_reduced-768×539.png (accessed October 2018). The reader is urged to check this site frequently to get updated versions of his view of the Big Data ecosystem.

In terms of technology providers, one thing is certain: everyone wants a bigger share of the technology spending pie and is thus willing to offer every single piece of technology, or to partner with another provider so that the customer does not consider a competitor’s offering. Thus, many players compete with each other by adding capabilities that their partners offer or by collaborating with their partners. In addition, there is always significant merger/acquisition activity. Finally, most vendors keep changing their products’ names as the platforms evolve. This makes this specific section likely to become obsolete sooner than one might think. Recognizing all these caveats, one highly aggregated way to group the Big Data providers is to use the following broad categories:

• Infrastructure Services Providers
• Analytics Solution Providers
• Legacy BI Providers Moving to Big Data

Infrastructure Services Providers

Big Data infrastructure was initially developed by two companies that grew out of early collaboration between Yahoo! and Facebook. A number of vendors have developed their own Hadoop distributions, most based on the Apache open source distribution but with various levels of proprietary customization. Two market leaders were Cloudera (cloudera.com) and Hortonworks (hortonworks.com). Cloudera was started by Big Data experts including Hadoop creator Doug Cutting and former Facebook data scientist Jeff Hammerbacher. Hortonworks was spun out of Yahoo! These two companies have just (October 2018) announced a plan to merge into one company to provide a full suite of services in Big Data. The combined company will be able to offer Big Data services and to compete and partner with all other major providers. This makes it perhaps the largest independent provider of Hadoop distributions, offering on-premise Hadoop infrastructure, training, and support. MapR (mapr.com) offers its own Hadoop distribution that supplements HDFS with its proprietary network file system (NFS) for improved performance. Similarly, EMC was acquired by Dell to provide its own Big Data on-premise distribution. There are many other vendors that offer similar platforms with their own minor variations.

Another category of Hadoop distributors that add their own value-added services for customers are companies such as Datastax, Nutanix, VMWare, and so on. These companies deliver commercially supported versions of the various flavors of NoSQL. DataStax, for example, offers a commercial version of Cassandra that includes enterprise support and services, as well as integration with Hadoop and open source enterprise search. Many other companies provide Hadoop connectors and complementary tools aimed at making it easier for developers to move data around and within Hadoop clusters.

The next category of major infrastructure providers is the large cloud providers such as Amazon Web Services, Microsoft Azure, Google Cloud, and IBM Cloud. All of these companies offer storage and computing services but have also invested heavily in Big Data and AI technology offerings. For example, Amazon AWS includes Hadoop and many other Big Data/AI capabilities (e.g., Amazon Neptune). Azure is a popular cloud provider for many analytics vendors, but Azure also offers its own machine learning and other capabilities. IBM and Google similarly offer their cloud services, but have major data science/AI offerings available, such as IBM Watson Analytics and Google TensorFlow, AutoML, and so on.

Analytics Solution Providers

The analytics layer of the Big Data stack is also experiencing significant development. Not surprisingly, all major traditional analytics and data service providers have incorporated Big Data analytics capabilities into their offerings. For example, Dell EMC, IBM BigInsights (now part of Watson), Microsoft Analytics, SAP HANA, Oracle Big Data, and Teradata have all integrated Hadoop, streaming, IoT, and Spark capabilities into their platforms. IBM’s BigInsights platform is based on Apache Hadoop but includes numerous proprietary modules, including the Netezza database, InfoSphere Warehouse, Cognos business intelligence tools, and SPSS data mining capabilities. It also offers IBM InfoSphere Streams, a platform designed for streaming Big Data analysis. With the success of the Watson analytics brand, IBM has folded many of its analytics offerings in general, and Big Data offerings in particular, under the Watson label. Teradata Vantage similarly implements many of the commonly used analytics functions in the Big Data environment.


Further, as noted earlier, most of these platforms are also accessible through the vendors’ own clouds as well as through public cloud providers. Rather than showing software details for all the platforms (which are quite similar anyway), we illustrate them by describing Teradata’s newest offering, Teradata Vantage, in Technology Insights 9.3.

Business Intelligence Providers Incorporating Big Data

In this group we note several of the major BI software providers who, again not surprisingly, have incorporated Big Data technologies into their offerings. The major names in this space include SAS, MicroStrategy, and their peers. For example, SAS Viya claims to perform in-memory analytics on massive data. Data visualization specialist Tableau Software has added Hadoop and next-generation data warehouse connectivity to its product suite. Relatively newer players such as Qlik and Spotfire are also adapting their offerings to include Big Data capabilities.

Application Case 9.7 illustrates an example of a Big Data project in which both IBM and Teradata analytics software capabilities were used, in addition to pulling data from Google and Twitter.

Infectious diseases impose a significant burden to the U.S. public health system. The rise of HIV/AIDS in the late 1970s, pandemic H1N1 flu in 2009, the H3N2 epidemic during the 2012–2013 winter season, the Ebola virus disease outbreak in 2015, and the Zika virus scare in 2016 have demonstrated the suscepti- bility of people to such contagious diseases. Virtually each year influenza outbreaks happen in various forms and result in consequences of varying impacts. The annual impact of seasonal influenza outbreaks in the United States is reported to be an average of 610,660 undiscounted life-years lost, 3.1 million hos- pitalized days, 31.4 million outpatient visits, and a total of $87.1 billion in economic burden. As a result of this growing trend, new data analytics techniques and technologies capable of detecting, tracking, map- ping, and managing such diseases have come on the scene in recent years. In particular, digital surveil- lance systems have shown promise in their capacity to discover public health-seeking patterns and trans- form these discoveries into actionable strategies.

This project demonstrated that social media can be utilized as an effective method for early detection of influenza outbreaks. We used a Big Data platform to employ Twitter data to monitor influenza activity in the United States. Our Big Data analytics meth- ods comprised temporal, spatial, and text mining. In the temporal analysis, we examined whether Twitter

data could indeed be adapted for the nowcasting of influenza outbreaks. In spatial analysis, we mapped flu outbreaks to the geospatial property of Twitter data to identify influenza hotspots. Text analytics was performed to identify popular symptoms and treatments of flu that were mentioned in tweets.

The IBM InfoSphere BigInsights platform was employed to analyze two sets of flu activity data: Twitter data were used to monitor flu outbreaks in the United States, and Cerner HealthFacts data warehouse was used to track real-world clinical encounters. A huge volume of flu-related tweets was crawled from Twitter using Twitter Streaming API and was then ingested into a Hadoop cluster. Once the data were successfully imported, the JSON Query Language (JAQL) tool was used to manip- ulate and parse semistructured JavaScript Object Notation (JSON) data. Next, Hive was used to tabu- larize the text data and segregate the information for the spatial-temporal location analysis and visualiza- tion in R. The entire data mining process was imple- mented using MapReduce functions. We used the package BigR to submit the R scripts over the data stored in HDFS. The package BigR enabled us to benefit from the parallel computation of HDFS and to perform MapReduce operations. Google’s Maps API libraries were used as a basic mapping tool to visualize the tweet locations.

Application Case 9.7 Using Social Media for Nowcasting Flu Activity

(Continued )

M09_SHAR1552_11_GE_C09.indd 587 07/01/20 4:42 PM

588 Part III • Prescriptive Analytics and Big Data

TECHNOLOGY INSIGHTS 9.3 An Illustrative Big Data Technology Platform: Teradata Vantage™

Introduction This description is adapted from content provided by Teradata, especially Sri Raghavan. Teradata Vantage is an advanced analytics platform embedded with analytic engines and functions, which can be implemented with preferred data science languages (e.g., SQL, Python, R) and tools (e.g., Teradata Studio, Teradata AppCenter, R Studio, Jupyter Notebook) on any data volume of any type by diverse analytics personas (e.g., Data Scientist, Citizen Data Scientist, Business Analyst) across multiple environments (On-Premises, Private Cloud, Public Cloud Marketplaces). There are five important conceptual pieces central to understanding Vantage: Analytics Engines and Functions, Data Storage and Access, Analytic Languages and Tools, Deployment, and Usage. Figure 9.10 illustrates the general architecture of Vantage and its interrelationships with other tools.

Analytic Engines and Functions

An analytic engine is a comprehensive framework that includes all the software components that are well integrated into a container (e.g., Docker) to deliver advanced analytics functionality that can be implemented by a well-defined set of user personas. An analytic engine’s components include:

• Advanced Analytics functions • Access points to data storage that can ingest multiple data types • Integration into visualization and analytic workflow tools • Built in management and monitoring tools • Highly scalable and performant environment with established thresholds

It is advantageous to have an analytic engine as it delivers a containerized compute envi- ronment that can be separated from data storage. Furthermore, analytic engines can be tailored for access and use by specific personas (e.g., DS, Business Analyst).

There are three analytic engines in the first release of Vantage. These are NewSQL Engine, Machine Learning Engine, and Graph Engine.

Application Case 9.7 (Continued)

Our findings demonstrated that the integration of social media and medical records can be a valuable supplement to the existing surveillance systems. Our results confirmed that flu-related traffic on social media is closely related with the actual flu outbreak. This has been shown by other researchers as well (St Louis & Zorlu, 2012; Broniatowski, Paul, & Dredze, 2013). We performed a time-series analysis to obtain the spatial-temporal cross-correlation between the two trends (91%) and observed that clinical flu encounters lag behind online posts. In addition, our location analysis revealed several public locations from which a majority of tweets originated. These findings can help health officials and governments to develop more accurate and timely forecasting models during outbreaks and to inform individuals about the locations that they should avoid during that time period.

Questions for Discussion

1. Why would social media be able to serve as an early predictor of flu outbreaks?

2. What other variables might help in predicting such outbreaks?

3. Why would this problem be a good problem to solve using the Big Data technologies mentioned in this chapter?

Sources: A. H. Zadeh, H. M. Zolbanin, R. Sharda, & D. Delen. (2015). "Social Media for Nowcasting the Flu Activity: Spatial-Temporal and Text Analysis." Business Analytics Congress, Pre-ICIS Conference, Fort Worth, TX. D. A. Broniatowski, M. J. Paul, & M. Dredze. (2013). "National and Local Influenza Surveillance through Twitter: An Analysis of the 2012–2013 Influenza Epidemic." PloS One, 8(12), e83672. P. A. Moran. (1950). "Notes on Continuous Stochastic Phenomena." Biometrika, 17–23.
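The lag the researchers observed — clinical encounters trailing online posts — can be illustrated with a lagged Pearson correlation. A minimal sketch in pure Python; the weekly series below are invented toy data, not the study's:

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def best_lag(social, clinical, max_lag=4):
    """Return the lag (in periods) at which social-media volume best
    aligns with clinical encounters; a positive lag means the clinical
    series trails the social series."""
    return max(
        range(max_lag + 1),
        key=lambda k: pearson(social[: len(social) - k], clinical[k:]),
    )

# Toy weekly series: clinical flu encounters trail online posts by ~2 weeks.
social   = [1, 3, 8, 15, 22, 18, 10, 5, 2, 1]
clinical = [0, 0, 1, 3, 8, 15, 22, 18, 10, 5]
print(best_lag(social, clinical))  # 2
```

A real analysis would also test spatial correlation (e.g., Moran's I, per the cited Moran 1950 reference) rather than time alone.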

M09_SHAR1552_11_GE_C09.indd 588 07/01/20 4:42 PM

Chapter 9 • Big Data, Cloud Computing, and Location Analytics: Concepts and Tools 589

The NewSQL engine includes embedded analytic functions. Teradata will continue to add more functions for the high-speed analytics processing required to operationalize analytics. New functions within the NewSQL engine include:

• nPath
• Sessionization
• Attribution
• Time series
• 4D analytics
• Scoring functions (e.g., Naïve Bayes, GLM, Decision Forests)

The Machine Learning engine delivers more than 120 prebuilt analytic functions for path, pattern, statistical, and text analytics to solve a range of business problems. Functions range from understanding sentiment to predictive part-failure analysis.
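To give a flavor of the simplest text-analytic functions, here is a toy lexicon-based sentiment scorer (an illustration only, not the engine's actual algorithm; the lexicon is invented):

```python
# Tiny invented sentiment lexicon; real functions use far richer models.
LEXICON = {"great": 1, "love": 1, "fast": 1,
           "slow": -1, "broken": -1, "hate": -1}

def sentiment(text):
    """Sum word polarities; a positive score suggests positive sentiment."""
    return sum(LEXICON.get(w.strip(".,!?").lower(), 0) for w in text.split())

print(sentiment("Love the new part, shipping was fast!"))  # 2
```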

The Graph engine provides a set of functions that discover relationships between people, products, and processes within a network. Graph analytics solve complex problems such as social network connections, influencer relationships, fraud detection, and threat identification.
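At toy scale, the influencer-style questions a graph engine answers can be illustrated with degree centrality over an edge list (pure Python; Vantage's Graph engine functions are far richer than this sketch):

```python
from collections import defaultdict

def degree_centrality(edges):
    """Rank nodes by number of direct connections — a simple proxy
    for 'influencer' status in a social network."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(degree.items(), key=lambda kv: -kv[1])

# Toy connection graph: Ann is linked to everyone else.
edges = [("ann", "bo"), ("ann", "cy"), ("ann", "di"), ("bo", "cy")]
print(degree_centrality(edges)[0])  # ('ann', 3)
```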

Vantage embeds analytic engines close to the data, which eliminates the need to move data, allowing users to run their analytics against larger data sets without sampling and execute models with greater speed and frequency. This is made possible through the use of containers managed by Kubernetes, which allow businesses to easily manage and deploy new cutting-edge analytic engines, such as Spark and TensorFlow, both of which will be available in the near future. Another benefit of containers is the ability to scale out the engines.

From a user's perspective, Vantage is a unified analytic and data framework. Under the covers, it contains a cross-engine orchestration layer that pipelines the right data and analytic request to the right analytic engine across a high-speed data fabric. This enables a business analyst or data scientist, for example, to invoke analytic functions from different engines in a single application, such as Jupyter Notebook, without enduring the trouble of hopping from one analytic server or application to another. The result is a tightly integrated analytic implementation that's not restrained by functional or data silos.
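Conceptually, such an orchestration layer is a router from a requested function to the engine that hosts it. The sketch below is hypothetical — the routing table and function names are illustrative, not Vantage's actual API:

```python
# Hypothetical routing table: function name -> engine that hosts it.
ROUTES = {
    "npath": "NewSQL",
    "sessionize": "NewSQL",
    "sentiment": "MachineLearning",
    "pagerank": "Graph",
}

def dispatch(function, payload):
    """Route an analytic request to whichever engine owns the function."""
    engine = ROUTES.get(function)
    if engine is None:
        raise ValueError(f"no engine registered for {function!r}")
    return {"engine": engine, "function": function, "payload": payload}

print(dispatch("pagerank", {"table": "follows"})["engine"])  # Graph
```

The point of the sketch is only that the caller names a function, not an engine; the orchestration layer resolves the rest.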

Data Storage and Access: Teradata Vantage comes with a natively embedded Teradata MPP Database. Furthermore, a high-speed data fabric (Teradata QueryGrid™ and Presto™) connects the platform to external data sources that include third-party enterprise data warehouses (e.g., Oracle), open source data platforms (e.g., Hadoop), NoSQL databases

FIGURE 9.10 Teradata Vantage Architecture. Source: Teradata Corp. [The diagram shows the SQL, Machine Learning, and Graph analytic engines connected over a high-speed fabric to the Teradata data store, with analytic languages, analytic tools, and AppCenter layered on top.]


590 Part III • Prescriptive Analytics and Big Data

(e.g., Cassandra), and others. Data support ranges from relational, spatial, and temporal to XML, JSON, Avro, and time-series formats.

Application Case 9.8 illustrates another application of Teradata Vantage where its advanced network analytics capabilities were deployed to analyze data from a large electronic medical records data warehouse.

Analytic Languages and Tools: Teradata Vantage was built out of the recognition that analytics professionals such as Data Scientists and Business Analysts require a diverse set of languages and tools to process large data volumes to deliver analytic insights. Vantage includes languages such as SQL, R, and Python on which analytics functions can be executed through Teradata Studio, R Studio, and Jupyter Notebooks.

Deployment: The Vantage platform provides the same analytic processing across deployment options, including the Teradata Cloud and public cloud, as well as on-premises installations on Teradata hardware or commodity hardware. It is also available as a service.

Usage: Teradata Vantage is intended to be used by multiple analytic personas. The ease of SQL ensures that citizen data scientists and business analysts can implement prebuilt analytic functions integrated into the analytic engines. The ability to invoke Teradata-supported packages such as dplyr and teradataml ensures that Data Scientists familiar with R and Python can execute analytic packages through R Studio and Jupyter notebooks, respectively, on the platform. Users who are not proficient at executing programs can invoke analytic functions codified into Apps built into Teradata AppCenter, an app building framework available in Vantage, to deliver compelling visualizations such as Sankey, Tree, and Sigma diagrams, or word clouds.

Example Usage: A global retailer had a Web site that suboptimally delivered search results to potential buyers. With online purchases accounting for 25% of total sales, inaccurate search results negatively impacted the customer experience and the bottom line.

The retailer implemented Teradata machine learning algorithms, available in Teradata Vantage, to accumulate, parse, and classify search terms and phrases. The algorithms delivered the answers needed to identify search results that closely matched online customer needs. This led to more than $1.3 million in incremental revenue from high-value customers, as measured by purchase volumes, over a two-month holiday period.

Application Case 9.8 Analyzing Disease Patterns from an Electronic Medical Records Data Warehouse

The Center for Health Systems Innovation at Oklahoma State University has been given a massive data warehouse by Cerner Corporation, a major electronic medical records (EMRs) provider, to help develop analytic applications. The data warehouse contains EMRs on the visits of more than 50 million unique patients across U.S. hospitals (2000–2015). It is the largest and the industry's only relational database that includes comprehensive records with pharmacy, laboratory, clinical events, admissions, and billing data. The database also includes more than 2.4 billion laboratory results and more than 295 million orders for nearly 4,500 drugs by name and brand. It is one of the largest compilations of de-identified, real-world, HIPAA-compliant data of its type.

The EMRs can be used to develop multiple analytics applications. One application is to understand the relationships between diseases based on information about the simultaneous diseases developed in the patients. When multiple diseases are present in a patient, the condition is called comorbidity. The comorbidities can be different across population groups. In one application (Kalgotra, Sharda, & Croff, 2017), the authors studied health disparities in terms of comorbidities by gender.

To compare the comorbidities, a network analysis approach was applied. A network is comprised of a defined set of items called nodes, which are linked to each other through edges. An edge represents a defined relationship between the


ICD-9 Range  Description

001–139: Infectious and parasitic diseases
140–239: Neoplasms
240–279: Endocrine, nutritional and metabolic diseases, and immunity disorders
280–289: Diseases of the blood and blood-forming organs
290–319: Mental disorders
320–359: Diseases of the nervous system
360–389: Diseases of the sense organs
390–459: Diseases of the circulatory system
460–519: Diseases of the respiratory system
520–579: Diseases of the digestive system
580–629: Diseases of the genitourinary system
630–679: Complications of pregnancy, childbirth, and the puerperium
680–709: Diseases of the skin and subcutaneous tissue
710–739: Diseases of the musculoskeletal system and connective tissue
740–759: Congenital anomalies
760–779: Certain conditions originating in the perinatal period
800–999: Injury and poisoning

FIGURE 9.11 Female and Male Comorbidity Networks (two panels: Male Comorbidity Network and Female Comorbidity Network; nodes are the ICD-9 disease categories in the legend above).



As noted earlier, our goal in this section is to highlight some of the players in the Big Data technology space. In addition to the vendors listed above, there are hundreds of others in the categories identified earlier as well as very specific industry applications. Rather than listing these names here, we urge you to check the latest version of the Big Data analytics ecosystem at http://mattturck.com/bigdata2018/ (accessed October 2018). Matt Turck's updated ecosystem diagram identifies companies in each cluster.

u SECTION 9.8 REVIEW QUESTIONS

1. Identify some of the key Big Data technology vendors whose key focus is on-premise Hadoop platforms.

2. What is special about the Big Data vendor landscape? Who are the big players?

3. Search and identify the key similarities and differences between cloud providers' analytics offerings and analytics providers' presence on specific cloud platforms.

4. What are some of the features of a platform such as Teradata Vantage?

nodes. A very common example of a network is a friendship network in which individuals are connected to each other if they are friends. Other common networks are computer networks, Web page networks, road networks, and airport networks. To compare the comorbidities, networks of the diagnoses developed by men and women were created. The information about the diseases developed by each patient over the lifetime history was used to create a comorbidity network. For the analysis, 12 million female patients and 9.9 million male patients were used. To manage such a huge data set, the Teradata Aster Big Data platform was used. To extract and prepare the network data, the SQL, SQL-MR, and SQL-GR frameworks supported by Aster were utilized. To visualize the networks, Aster AppCenter and Gephi were used.

Figure 9.11 presents the female and male comorbidity networks. In these networks, nodes represent different diseases classified according to the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM), aggregated at the three-digit level. Two diseases are linked based on the similarity calculated using the Salton Cosine Index. The larger the size of a node, the greater the comorbidity of that disease. The female comorbidity network is denser than the male network. The number of nodes and edges in the female network are 899 and 14,810, respectively, whereas the number of nodes and edges in the male network are 839 and 12,498, respectively. The visualizations present a difference between the pattern of diseases developed in male and female patients. Specifically, females have more comorbidities of mental disorders than males. On the other hand, the strength of some disease associations between lipid metabolism and chronic heart disorders is stronger in males than in females. Such health disparities present questions for biological, behavioural, clinical, and policy research.

Traditional database systems would be taxed in efficiently processing such a huge data set. Teradata Aster made the analysis of data containing information on millions of records fairly fast and easy. Network analysis is often suggested as one method to analyze big data sets. It helps understand the data in one picture. In this application, the comorbidity network explains the relationships between diseases in one place.
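The Salton Cosine Index used to link diseases normalizes co-occurrence by each disease's prevalence. A minimal sketch in pure Python over invented patient records (not the Cerner data; the ICD-9 codes are just examples):

```python
def salton_cosine(patients, d1, d2):
    """Co-occurrence of two diagnoses normalized by their prevalences:
    |patients with both| / sqrt(|with d1| * |with d2|)."""
    n1 = sum(1 for dx in patients if d1 in dx)
    n2 = sum(1 for dx in patients if d2 in dx)
    both = sum(1 for dx in patients if d1 in dx and d2 in dx)
    return both / (n1 * n2) ** 0.5 if n1 and n2 else 0.0

# Invented lifetime diagnosis sets (ICD-9 three-digit codes).
patients = [
    {"250", "401"},          # diabetes + hypertension
    {"250", "401", "272"},   # ... + disorder of lipid metabolism
    {"401"},
    {"272", "401"},
]
print(round(salton_cosine(patients, "250", "401"), 3))  # 0.707
```

In the comorbidity network, an edge would connect "250" and "401" weighted by this similarity.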

Questions for Discussion

1. What could be the reasons behind the health dis- parities across gender?

2. What are the main components of a network?

3. What type of analytics was applied in this application?

Source: Kalgotra, P., Sharda, R., & Croff, J. M. (2017). Examining health disparities by gender: A multimorbidity network analysis of electronic medical record. International Journal of Medical Informatics, 108, 22–28.

Application Case 9.8 (Continued)


9.9 CLOUD COMPUTING AND BUSINESS ANALYTICS

Another emerging technology trend that business analytics users should be aware of is cloud computing. The National Institute of Standards and Technology (NIST) defines cloud computing as "a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, and services) that can be rapidly provisioned and released with minimal management effort or service-provider interaction." Wikipedia (n.d., Cloud Computing) defines cloud computing as "a style of computing in which dynamically scalable and often virtualized resources are provided over the Internet. Users need not have knowledge of, experience in, or control over the technology infrastructures in the cloud that supports them." This definition is broad and comprehensive. In some ways, cloud computing is a new name for many previous, related trends: utility computing, application service providers, grid computing, on-demand computing, software as a service (SaaS), and even older, centralized computing with dumb terminals. But the term cloud computing originates from a reference to the Internet as a "cloud" and represents an evolution of all of the previously shared/centralized computing trends. The Wikipedia entry also recognizes that cloud computing is a combination of several IT components offered as services. For example, infrastructure as a service (IaaS) refers to providing computing infrastructure as a service, and platform as a service (PaaS) adds the basic platform provisioning, such as management administration, security, and so on. Cloud computing also includes SaaS, which delivers applications through a Web browser, whereas the data and the application programs are on some other server.

Although we do not typically look at Web-based e-mail as an example of cloud computing, it can be considered a basic cloud application. Typically, the e-mail provider stores both the data (e-mail messages) and the software (e-mail programs that let us process and manage e-mails). The e-mail provider also supplies the hardware/software and all of the basic infrastructure. As long as the Internet is available, one can access the e-mail application from anywhere in the cloud. When the application is updated by the e-mail provider (e.g., when Gmail updates its e-mail application), it becomes available to all customers. Social networking Web sites like Facebook, Twitter, and LinkedIn are also examples of cloud computing. Thus, any Web-based general application is in a way an example of a cloud application. Another example of a general cloud application is Google Docs and Spreadsheets. This application allows a user to create text documents or spreadsheets that are stored on Google's servers and are available to the users anywhere they have access to the Internet. Again, no programs need to be installed as "the application is in the cloud." The storage space is also "in the cloud." Even Microsoft's popular office applications are all available in the cloud, with the user not needing to download any software.

A good general business example of cloud computing is Amazon.com's Web services. Amazon.com has developed an impressive technology infrastructure for e-commerce as well as for BI, customer relationship management, and supply-chain management. It has built major data centers to manage its own operations. However, through Amazon.com's cloud services, many other companies can employ these very same facilities to gain advantages of these technologies without having to make a similar investment. Like other cloud-computing services, a user can subscribe to any of the facilities on a pay-as-you-go basis. This model of letting someone else own the hardware and software but making use of the facilities on a pay-per-use basis is the cornerstone of cloud computing. A number of companies offer cloud-computing services, including Salesforce.com, IBM Cloud, Microsoft Azure, Google, Adobe, and many others.

Cloud computing, like many other IT trends, has resulted in new offerings in analytics. These options permit an organization to scale up its data warehouse and pay only for what it uses. The end user of a cloud-based analytics service may use one organization for analysis applications that, in turn, uses another firm for the platform or


infrastructure. The next several paragraphs summarize the latest trends in the interface of cloud computing and BI/business analytics. A few of these statements are adapted from an early paper written by Haluk Demirkan and one of the coauthors of this book (Demirkan & Delen, 2013).

Figure 9.12 illustrates a conceptual architecture of a service-oriented decision support environment, that is, a cloud-based analytics system. This figure superimposes the cloud-based services on the general analytics architecture presented in previous chapters.

In service-oriented decision support solutions, (1) operational systems, (2) data warehouses, (3) online analytic processing, and (4) end-user components can be obtained individually or bundled and provided to the users as a service. Any or all of these services can be obtained through the cloud. Because the field of cloud computing is fast evolving and growing at a rapid pace, there is much confusion about the terminology being used by various vendors and users. The labels vary from Infrastructure, Platform, Software, Data, Information, and Analytics as a Service. In the following, we define these services. Then we summarize the current technology platforms and highlight applications of each through application cases.

Data as a Service (DaaS)

The concept of data as a service basically advocates the view that “where data lives”— the actual platform on which the data resides—doesn’t matter. Data can reside in a local computer or in a server at a server farm inside a cloud-computing environment.

FIGURE 9.12 Conceptual Architecture of a Cloud-Oriented Support System. Source: Based on Demirkan, H., & Delen, D. (2013, April). Leveraging the capabilities of service-oriented decision support systems: Putting analytics and Big Data in cloud. Decision Support Systems, 55(1), 412–421. [The diagram layers a Data Service (information sources such as ERP, legacy, POS, other OLTP/Web, and external data feeding an enterprise data warehouse and data marts for marketing, engineering, and finance via ETL, replication, and metadata), an Information Service (OLAP, dashboards, routine business reporting, intranet search for content), and an Analytics Service (data mining, text mining, simulation, optimization, automated decision systems), spanning data, information, and operations management across servers and software.]


With DaaS, any business process can access data wherever it resides. Data as a service began with the notion that data quality could happen in a centralized place, cleansing and enriching data and offering it to different systems, applications, or users, irrespective of where they were in the organization, computers, or on the network. This has now been replaced with master data management and customer data integration solutions, where the record of the customer (or product, or asset, etc.) may reside anywhere and is available as a service to any application that has the services allowing access to it. By applying a standard set of transformations to the various sources of data (for example, ensuring that gender fields containing different notation styles [e.g., M/F, Mr./Ms.] are all translated into male/female) and then enabling applications to access the data via open standards such as SQL, XQuery, and XML, service requestors can access the data regardless of vendor or system.
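The kind of standard transformation described here — reconciling gender notation styles into one canonical vocabulary before serving the data — can be sketched as follows (the mapping table is illustrative):

```python
# Map the notation styles found in different source systems
# onto one canonical vocabulary before serving the data.
GENDER_MAP = {
    "m": "male", "mr.": "male", "male": "male",
    "f": "female", "ms.": "female", "mrs.": "female", "female": "female",
}

def normalize_gender(raw):
    """Translate a source-specific gender field to male/female/unknown."""
    return GENDER_MAP.get(raw.strip().lower(), "unknown")

print([normalize_gender(v) for v in ["M", "Ms.", "female", "X"]])
# ['male', 'female', 'female', 'unknown']
```

Applying the transformation once, at the service layer, is what gives DaaS its single point for quality updates.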

With DaaS, customers can move quickly thanks to the simplicity of the data access and the fact that they don't need extensive knowledge of the underlying data. If customers require a slightly different data structure or have location-specific requirements, the implementation is easy because the changes are minimal (agility). Second, providers can build the base with the data experts and outsource the analysis or presentation layers (which allows for very cost-effective user interfaces and makes change requests at the presentation layer much more feasible), and access to the data is controlled through the data services. It tends to improve data quality because there is a single point for updates.

Software as a Service (SaaS)

This model allows consumers to use applications and software that run on distant computers in the cloud infrastructure. Consumers need not worry about managing the underlying cloud infrastructure and pay only for their use of the software. All that is needed is a Web browser or an app on a mobile device to connect to the cloud. Gmail is an example of SaaS.

Platform as a Service (PaaS)

Using this model, companies can deploy their software and applications in the cloud so that their customers can use them. Companies don't have to manage the resources needed to run their applications in the cloud, such as networks, servers, storage, or operating systems. This reduces the cost of maintaining the underlying infrastructure for running their software and also saves the time needed to set up this infrastructure. Users can thus focus on their business rather than on managing infrastructure for running their software. Examples of PaaS are Microsoft Azure, AWS Elastic Beanstalk, and Google App Engine.

Infrastructure as a Service (IaaS)

In this model, infrastructure resources like networks, storage, servers, and other computing resources are provided to client companies. Clients can run their applications and have administrative rights to use these resources but do not manage the underlying infrastructure. Clients pay for their usage of the infrastructure. A good example of this is Amazon.com's Web services. Amazon.com has developed an impressive technology infrastructure that includes data centers. Other companies can use Amazon.com's cloud services on a pay-per-use basis without having to make similar investments. Similar services are offered by all major cloud providers such as IBM, Microsoft, Google, and so on.

We should note that there is considerable confusion and overlap in the use of cloud terminology. For example, some vendors also add information as a service (IaaS), which is an extension of DaaS. Clearly, this IaaS is different from infrastructure as a service described earlier. Our goal here is to just recognize that there are varying


degrees of services that an organization can subscribe to in order to manage the analytics applications. Figure 9.13 highlights the level of service subscriptions a client uses in each of the three major types of cloud offerings. SaaS is clearly the highest level of cloud service that a client may get. For example, in using Office 365, an organization is using the software as a service. The client is only responsible for bringing in the data. Many of the analytics as a service applications fall in this category as well. Further, several analytics as a service providers may in turn use clouds such as Amazon's AWS or Microsoft Azure to provide their services to the end users. We will see examples of such services shortly.
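The division of responsibility across the three service models can also be expressed as a small lookup table. The nine-layer stack and the 4/7/9 split below follow the common textbook convention and are assumptions, not any vendor's guarantee:

```python
LAYERS = ["Networking", "Storage", "Servers", "Virtualization",
          "Operating System", "Middleware", "Runtime", "Data", "Application"]

# Number of layers (from the bottom of the stack) that the cloud
# vendor manages under each service model; the client manages the rest.
VENDOR_MANAGED = {"IaaS": 4, "PaaS": 7, "SaaS": 9}

def split(model):
    """Return (vendor-managed layers, client-managed layers)."""
    k = VENDOR_MANAGED[model]
    return LAYERS[:k], LAYERS[k:]

vendor, client = split("PaaS")
print(client)  # ['Data', 'Application']
```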

Essential Technologies for Cloud Computing

VIRTUALIZATION

Virtualization is the creation of a virtual version of something, such as an operating system or server. A simple example of virtualization is the logical division of a hard drive to create two separate hard drives in a computer. Virtualization can occur in all three areas of computing:

Network virtualization: It is the splitting of available bandwidth into channels, which disguises complexity of the network by dividing it into manageable parts. Then each bandwidth can be allocated to a particular server or device in real time.

Storage virtualization: It is the pooling of physical storage from multiple network storage devices into a single storage device that can be managed from a central console.

Server virtualization: It is the masking of physical servers from server users. Users don’t have to manage the actual servers or understand complicated details of server resources.

This difference in the level of virtualization directly relates to which cloud service one employs.
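As an illustration of the idea behind storage virtualization, the following sketch pools several physical devices behind a single allocator. This is conceptual only; real storage virtualization operates at the block level with very different machinery:

```python
class StoragePool:
    """Present several physical devices as one logical store."""

    def __init__(self, devices):
        # devices: {device name: free capacity in GB}
        self.free = dict(devices)

    @property
    def total_free(self):
        return sum(self.free.values())

    def allocate(self, gb):
        """Carve a logical volume out of whichever devices have room."""
        if gb > self.total_free:
            raise ValueError("pool exhausted")
        taken = {}
        for dev, cap in self.free.items():
            take = min(cap, gb)
            if take:
                taken[dev] = take
                self.free[dev] -= take
                gb -= take
            if gb == 0:
                break
        return taken

pool = StoragePool({"disk_a": 100, "disk_b": 50})
print(pool.allocate(120))  # {'disk_a': 100, 'disk_b': 20}
print(pool.total_free)     # 30
```

The user of the pool asks for 120 GB without knowing, or caring, that no single device could supply it — which is exactly the abstraction virtualization provides.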

Application Case 9.9 illustrates an application of cloud technologies that enable a mobile application and allow for significant reduction in information miscommunication.

Networking

Infrastructure as a Service

IaaS

Storage

Servers

Virtualization

Operating System

Middleware

Runtime

Data

Application

Platform as a Service

PaaS

Networking

Storage

Servers

Virtualization

Operating System

Middleware

Runtime

Data

Application

Software as a Service

SaaS

Networking

Storage

Servers

Virtualization

Operating System

Middleware

Runtime

Data

Application

Managed by Client

Managed by Cloud Vendor

FIGURE 9.13 Technology Stack as a Service for Different Types of Cloud Offerings.


Historical communication between utilities and first responders has been by phone calls or two-way radios. Some of these are with first responders on the scene, and some with dispatch or other units of the first responder organization. When a member of the public sees an incident in the field, they usually just call 911, which is routed to first responders. Dispatch centers route the closest first responder to the field, who then calls back to the center either by radio or cell phone to report the actual status. The dispatch centers then call the incident in to the appropriate utility, which then sends its own team to the field for further resolution. This also requires that the exact location be conveyed to the dispatch center from the field, and from the former to the utility—particularly challenging if the incident location is not at a specific address (e.g., along a freeway, across open land, etc.). The utility also needs to let the dispatch center know the status of its own crew. This information must also be relayed to the first responders in the field. Much of this process relies on information being communicated orally and then forwarded to one or more recipients, with information also flowing back and forth along the same chain. All of this can result in garbled communication and/or incomplete messages, which can eat away precious minutes or even hours in emergencies.

A major West Coast Utility, a leader in using technology to address traditional problems, determined that many of these challenges can be addressed through better information sharing in a timelier manner using cloud-mobile technology. Its territory encompassed everything from densely populated cities to far-flung rural communities, with intervening miles of desert, national parks, and more.

Recognizing that most first responders have a smartphone or tablet, the utility selected Connixt's iMarq™ mobile suite to provide a simple-to-use mobile app that allows first responders to advise the utility of any incident in the field. The technology also keeps the first responders apprised of the utility's response status with respect to the incident.

With a targeted base of over 20,000 first responders spread across the entire territory, lowering barriers to adoption was an especially important factor. "Improving communication with groups that are outside your organization is historically difficult," says G. Satish,

cofounder and CEO, Connixt. “For this deployment, the focus on simplicity is the key to its success.”

First responders are invited to download and self-register the app, and once the utility grants access rights, they can report incidents using their own tablets or smartphones. The first responder simply uses a drop-down menu to pick from a list of preconfigured incidents, taps an option to indicate whether they will wait at the scene, and attaches photographs with annotations—all with a few touches on their device. The utility receives notification of the incident, reviews the time- and geostamped information (no more mixed-up addresses), and updates its response. This response (which may be a truck roll) is sent to the first responders and maintained in the app.
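The report the app assembles — a preconfigured incident type plus geo- and time stamps — can be pictured as a simple record type. Field names here are illustrative, not iMarq's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class IncidentReport:
    """One field report: what, where, when, and whether the
    responder is waiting at the scene."""
    incident_type: str               # chosen from a preconfigured drop-down
    lat: float                       # geostamp
    lon: float
    waiting_on_scene: bool = False
    photos: list = field(default_factory=list)
    reported_at: str = field(        # timestamp, set automatically
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

r = IncidentReport("downed power line", 34.05, -118.24, waiting_on_scene=True)
print(r.incident_type, r.waiting_on_scene)  # downed power line True
```

Capturing the location and time in the record itself is what removes the "mixed-up addresses" failure mode of oral relay.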

The simplicity of the solution makes it easy for the first responders. They use their own phone or tablet, communicate in a way they are used to, and provide needed information simply and effectively. They can see the utility updates (such as the current status of the truck that was sent). Missed or garbled phone messages are minimized. Options such as recording voice memos, using speech-to-text and more, are also available.

Cloud technology has been particularly useful in this case—deployment is faster without issues related to hardware procurement, installation, and appropriate backups. Connixt's cloud-based Mobile Extension Framework (MXF™) is architected for rapid configuration and deployment—configuration is completed in the cloud, and, once configured, the apps are ready for download and deployment. More importantly, MXF enables easy modifications to forms and processes—for example, if the utility needs to add additional options to the incident drop-down, it simply adds them once in MXF. Within minutes the option is available in the field for all users. Figure 9.14 illustrates this architecture.

There are further benefits from a system that leverages ubiquitous cloud and mobile technologies. Because all of the business logic and configurations are stored in the cloud, the solution itself can act as a stand-alone system for customers who have no back-end systems—very important in the context of small and medium businesses (SMBs). And for those with back-end systems, the connectivity is seamless through Web services, and the back-end system serves as the

Application Case 9.9 Major West Coast Utility Uses Cloud-Mobile Technology to Provide Real-Time Incident Reporting



system of record. This additionally helps businesses adopt technology in a phased manner—starting with a noninvasive, stand-alone system with minimal internal IT impact while automating field operations, and then moving toward back-end system integration.

On the other hand, the mobile apps are themselves system agnostic—they communicate using standard Web services, and the end device can be Android or iOS, smartphone or tablet. Thus, irrespective of the device used, all communication, business logic, and algorithms are standardized across platforms/devices. As native apps across all devices, iMarq leverages standard technology that is provided by the device manufacturers and the OS vendors. For example, using native maps applications allows the apps to benefit from improvements made by the platform vendors; thus, as maps become more accurate, the end users of the mobile apps also benefit from these advances.

Finally, for successful deployments, enterprise cloud-mobile technology has to be heavily user-centric. The look and feel must be geared to user-comfort, much as users expect from any mobile app they use. Treating the business user as an app consumer meets their standard expectations of an intuitive app that immediately saves them time and effort. This approach is essential to ensuring successful adoption.

The utility now has better information from first responders, as information is directly shared from the field (not through a dispatcher or other third party), pictures are available, and there is geo- and time stamping. Garbled phone messages are avoided. The two-way communication between the utility and the first responder in the field is improved. Historical records of the incidents are kept.

The utility and the first responders are now more unified in their quick and complete responses to incidents, improving service to the public. By tightening ties with first responders (police and fire department personnel), the public is served with a better coordinated and superior response for incidents that are discovered by first responders.

Questions for Discussion

1. How does cloud technology impact enterprise software for small and mid-size businesses?

2. What are some of the areas where businesses can use mobile technology?

3. What types of businesses are likely to be the forerunners in adopting cloud-mobile technology?

4. What are the advantages of cloud-based enterprise software instead of the traditional on-premise model?

5. What are the likely risks of cloud versus traditional on-premise applications?

Source: Used with permission from G Satish, Connixt, Inc.

[Figure 9.14 shows the architecture: (1) field users download the ConnixtApp to their mobile devices; (2) the customer configures Connixt MXF in the cloud for integration, business rules, and so on. The cloud-hosted Mobile Extension Framework (connixtMXF™) contains business rules, Web services, workflows, user objects, adaptors, and authentication, and it connects field workers and managers with the customer's back-end system(s).]

FIGURE 9.14 Interconnections between workers and technology in a cloud analytics application.



Chapter 9 • Big Data, Cloud Computing, and Location Analytics: Concepts and Tools 599

Cloud Deployment Models

Cloud services can be acquired in several ways, from building an entirely private infra- structure to sharing with others. The following three models are the most common.

Private cloud: This can also be called an internal cloud or corporate cloud. It is a more secure form of cloud service than public clouds like Microsoft Azure and Amazon Web Services. It is operated solely for a single organization that has mission-critical workloads and security concerns. It provides the same benefits as a public cloud, such as scalability and changing computing resources on demand. Companies that have a private cloud have direct control over their data and applications. The disadvantage of having a private cloud is the cost of maintaining and managing the cloud, because on-premise IT staff are responsible for managing it.

Public cloud: In this model, the subscriber uses the resources offered by service providers over the Internet. The cloud infrastructure is managed by the service provider. The main advantage of the public cloud model is saving the time and money required to set up the hardware and software needed to run the business. Examples of public clouds are Microsoft Azure, Google Cloud Platform, and Amazon AWS.

Hybrid cloud: The hybrid cloud gives businesses great flexibility by moving workloads between private and public clouds. For example, a company can use hybrid cloud storage to store its sales and marketing data and then use a public cloud platform like Amazon Redshift to run analytical queries against the data. The main requirement is network connectivity and API (application program interface) compatibility between the private and public clouds.

Major Cloud Platform Providers in Analytics

This section first identifies some key cloud players that provide the infrastructure for analytics as a service, as well as selected analytics functionalities. Then we also mention representative analytics-as-a-service offerings that may even run on these cloud platforms.

Amazon Elastic Beanstalk: Amazon Elastic Beanstalk is a service offered by Amazon Web Services. It can deploy, manage, and scale Web applications. It supports the following programming languages: Java, Ruby, Python, PHP, and .NET, on servers such as Apache HTTP Server, Apache Tomcat, and IIS. A user has to upload the code for the application, and Elastic Beanstalk handles the deployment of the application, load balancing, and autoscaling, and monitors the health of the application. So the user can focus on building Web sites, mobile applications, API back ends, content management systems, SaaS, and so on, while the applications and the infrastructure to manage them are taken care of by Elastic Beanstalk. A user can use the AWS Management Console or an integrated development environment like Eclipse or Visual Studio to upload the application. A user pays only for the AWS resources needed to store and run the applications.
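To make the "upload code, let the platform run it" idea concrete: Elastic Beanstalk's Python platform expects the uploaded code to expose a WSGI callable, by default one named `application` in a file named `application.py`. The minimal sketch below uses only the Python standard library, so no dependency file is needed; the file name and greeting text are illustrative, not taken from any real deployment.

```python
# application.py -- minimal sketch of a Web app deployable to Elastic
# Beanstalk's Python platform, which by default looks for a WSGI callable
# named "application". Standard library only, so no requirements.txt needed.

def application(environ, start_response):
    """WSGI entry point: answer every request with a plain-text greeting."""
    body = b"Hello from Elastic Beanstalk!"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Zipping this single file and uploading it through the console (or running `eb init` and `eb create` with the EB CLI) is enough to get a running endpoint; Elastic Beanstalk provisions the servers, load balancer, and monitoring around the callable.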

IBM Cloud: IBM Cloud is a cloud platform that allows a user to build apps using many open source computing technologies. Users can also deploy and manage hybrid applications using the software. With IBM Watson, whose services are available on IBM Cloud, users can create next-generation cognitive applications that can discover, innovate, and make decisions. IBM Watson services can be used for analyzing emotions and synthesizing natural-sounding speech from text. Watson uses the concept of cognitive computing to analyze text, video, and images. It supports programming languages like Java, Go, PHP, Ruby, and Python.

Microsoft Azure: Azure is a cloud platform created by Microsoft to build, deploy, and manage applications and services through a network of Microsoft data centers.


600 Part III • Prescriptive Analytics and Big Data

It serves as both PaaS and IaaS and offers many solutions such as analytics, data warehousing, remote monitoring, and predictive maintenance.

Google App Engine: Google App Engine is Google's cloud computing platform used for developing and hosting applications. Managed by Google's data centers, it supports developing apps in the Python, Java, Ruby, and PHP programming languages. The related BigQuery service offers data warehouse services through the cloud.

OpenShift: OpenShift is Red Hat's cloud application platform based on a PaaS model. Through this model, application developers can deploy their applications on the cloud. There are two different models available for OpenShift: one serves as a public PaaS, and the other serves as a private PaaS. OpenShift Online is Red Hat's public PaaS that offers development, build, hosting, and deployment of applications in the cloud. The private PaaS, OpenShift Enterprise, allows development, build, and deployment of applications on an internal server or a private cloud platform.

Analytics as a Service (AaaS)

Analytics and data-based managerial solutions, the applications that query data for use in business planning, problem solving, and decision support, are evolving rapidly and being used by almost every organization. Enterprises are being flooded with information, and getting insights from this data is a big challenge for them. Along with that, there are challenges related to data security, data quality, and compliance. AaaS is an extensible analytical platform using a cloud-based delivery model in which various BI and data analytics tools help companies make better decisions and gain insights from their huge amounts of data. The platform covers all functional aspects, from collecting data from physical devices to data visualization. AaaS provides an agile model for reporting and analytics so that businesses can focus on what they do best. Customers can either run their own analytical applications in the cloud, or they can put their data on the cloud and receive useful insights.

AaaS combines aspects of cloud computing with Big Data analytics and empowers data scientists and analysts by allowing them to access centrally managed data sets. They can explore these data sets more interactively and discover richer insights more rapidly, erasing many of the delays that they may face while discovering data trends. For example, a provider might offer access to a remote analytics platform for a fee, which allows the client to use analytics software for as long as it is needed. AaaS builds on SaaS, PaaS, and IaaS, helping IT significantly reduce costs and compliance risk while increasing the productivity of users.

AaaS in the cloud offers economies of scale and scope, providing many virtual analytical applications with better scalability and higher cost savings. With growing data volumes and dozens of virtual analytical applications, chances are that those applications spread their processing across different times, usage patterns, and frequencies, improving overall utilization.

Data and text mining are other very promising applications of AaaS. The capabilities that a service orientation (along with cloud computing, pooled resources, and parallel processing) brings to the analytics world can also be used for large-scale optimization, highly complex multicriteria decision problems, and distributed simulation models. Next, we identify selected cloud-based analytics offerings.

Representative Analytics as a Service Offerings

IBM CLOUD IBM is making all of its analytics offerings available through its cloud. IBM Cloud offers several categories of analytics and AI. For example, IBM Watson Analytics integrates most of the analytics features and capabilities that can be built and deployed through their cloud. In addition, IBM Watson Cognitive has been a major cloud-based offering that employs text mining and deep learning at a very high level. It was introduced earlier in the context of text mining.

MINEMYTEXT.COM One of the areas of major growth in analytics is text mining. Text mining identifies high-level topics of documents, infers sentiments from reviews, and visualizes the document or term/concept relationships, as covered in the text mining chapter. A start-up called MineMyText.com offers these capabilities in the cloud through their Web site.

SAS VIYA SAS Institute is making its analytics software offering available on demand through the cloud. Currently, SAS Visual Statistics is only available as a cloud service and is a competitor of Tableau.

TABLEAU Tableau, a major visualization software that was introduced in the context of descriptive analytics, is also available through the cloud.

SNOWFLAKE Snowflake is a cloud-based data warehouse solution. Users can bring together their data from multiple sources as one source and analyze it using Snowflake.

Illustrative Analytics Applications Employing the Cloud Infrastructure

In this section we highlight several cloud analytics applications. We present them as one section as opposed to individual Application Cases.

Using Azure IoT, Stream Analytics, and Machine Learning to Improve Mobile Health Care Services

People are increasingly using mobile applications to keep track of the amount of exercise they do every day and to maintain their health history as well. Zion China, a provider of mobile healthcare services, has come up with an innovative health monitoring tool that gathers data about health indicators such as glucose levels, blood pressure, diet, medication, and exercise of its users and helps them improve their quality of life by giving them suggestions on how to manage their health and prevent or cure illness on a daily basis.

The huge volume of real-time data presented scalability and data management problems, so the company collaborated with Microsoft to take advantage of Stream Analytics, Machine Learning, an IoT solution, and Power BI, which also improved data security and analysis. Zion China had been completely dependent on traditional BI, with data being collected from various devices or the cloud. Using a cloud-based analytics architecture, Zion was able to add several features, speed, and security. They added an IoT hub to the front end for better transmission of data from device to cloud. The data is first transferred from the device to a mobile application via Bluetooth and then to an IoT hub via HTTPS and AMQP. Stream Analytics helps in processing the real-time data gathered in the IoT hub and generates insights and useful information, which is further streamed to an SQL database. They use Azure Machine Learning to generate predictive models on diabetes patient data and improve the analysis and prediction levels. Power BI provides simple and easy visualization of the data insights achieved from the analysis to the users.

Sources: "Zion China Uses Azure IoT, Stream Analytics, and Machine Learning to Evolve Its Intelligent Diabetes Management Solution" at www.codeproject.com/Articles/1194824/Zion-China-uses-Azure-IoT-Stream-Analytics-and-M (accessed October 2018) and https://microsoft.github.io/techcasestudies/iot/2016/12/02/IoT-ZionChina.html (accessed October 2018).
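The core step in the pipeline above, which Stream Analytics performs on the IoT-hub stream, is windowed aggregation. The sketch below imitates it in plain Python: average each patient's glucose readings over fixed (tumbling) time windows and flag averages outside a target range. The field names, window size, and thresholds are invented for illustration and are not taken from Zion China's system.

```python
from collections import defaultdict

# Hypothetical reading: (patient_id, timestamp_seconds, glucose_mg_dl).
WINDOW_SECONDS = 60
TARGET_RANGE = (70, 180)  # illustrative thresholds, not medical guidance

def tumbling_window_averages(readings):
    """Group readings into tumbling windows; average glucose per patient."""
    buckets = defaultdict(list)
    for patient_id, ts, glucose in readings:
        # Each reading falls into exactly one non-overlapping window.
        window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
        buckets[(patient_id, window_start)].append(glucose)
    out = []
    for (patient_id, window_start), values in sorted(buckets.items()):
        avg = sum(values) / len(values)
        alert = not (TARGET_RANGE[0] <= avg <= TARGET_RANGE[1])
        out.append({"patient": patient_id, "window": window_start,
                    "avg_glucose": round(avg, 1), "alert": alert})
    return out

readings = [("p1", 5, 110), ("p1", 30, 120), ("p1", 65, 200), ("p2", 10, 95)]
for row in tumbling_window_averages(readings):
    print(row)
```

In the real architecture this aggregation runs continuously in the cloud, with results streamed onward to the SQL database and surfaced in Power BI rather than printed.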



Gulf Air Uses Big Data to Get Deeper Customer Insight

Gulf Air is the national carrier of Bahrain. It is a major international carrier with 3,000 employees, serving 45 cities in 24 countries across three continents. Gulf Air is an industry leader in providing traditional Arabian hospitality to customers. To learn how customers felt about its hospitality services, the airline wanted to know what they were saying about the airline on social media. The challenge was analyzing all the comments and posts from their customers, as there were hundreds of thousands of posts every day. Monitoring these posts manually would be a time-consuming and daunting task and would also be prone to human error.

Gulf Air wanted to automate this task and analyze the data to learn of the emerging market trends. Along with that, the company wanted a robust infrastructure to host such a social media monitoring solution that would be available around the clock and agile across geographical boundaries.

Gulf Air developed a sentiment analysis solution, “Arabic Sentiment Analysis,” that analyzes English and Arabic social media posts. The Arabic Sentiment Analysis tool is based on Cloudera’s distribution of Hadoop Big Data framework. It runs on Gulf Air’s private cloud environment and also uses the Red Hat JBoss Enterprise Application platform. The private cloud holds about 50 terabytes of data, and the Arabic Sentiment Analysis tool can analyze thousands of posts on social media, providing sentiment results in minutes.

Gulf Air achieved substantial cost savings by putting the Arabic Sentiment Analysis application on the company's existing private cloud environment, as it did not need to invest in setting up infrastructure to deploy the application. Arabic Sentiment Analysis helps Gulf Air decide promotions and offers for its passengers on a timely basis and helps it stay ahead of its competitors. To guard against failure of the master server, the airline created "ghost images" of the server that can be deployed quickly to take its place. The Big Data solution quickly and efficiently captures posts periodically and transforms them into reports, giving Gulf Air up-to-date views of any change in sentiment or shifts in demand and enabling it to respond quickly. Insights from the Big Data solution have had a positive impact on the work performed by the employees of Gulf Air.

Sources: RedHat.com. (2016). "Gulf Air Builds Private Cloud for Big Data Innovation with Red Hat Technologies." www.redhat.com/en/about/press-releases/gulf-air-builds-private-cloud-big-data-innovation-red-hat-technologies (accessed October 2018); RedHat.com. (2016). "Gulf Air's Big Data Innovation Delivers Deeper Customer Insight." www.redhat.com/en/success-stories (accessed October 2018); ComputerWeekly.com. (2016). "Big-Data and Open Source Cloud Technology Help Gulf Air Pin Down Customer Sentiment." www.computerweekly.com/news/450297404/Big-data-and-open-source-cloud-technology-help-Gulf-Air-pin-down-customer-sentiment (accessed October 2018).
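The simplest form of the sentiment-scoring idea behind a tool like Gulf Air's can be sketched as a lexicon-based classifier: count positive and negative vocabulary hits per post and roll the labels up into a summary. Gulf Air's actual system is a Hadoop-based product handling Arabic and English at scale; the tiny lexicon and posts below are invented purely for illustration.

```python
# Toy lexicon-based sentiment scorer; the word lists are invented.
POSITIVE = {"great", "friendly", "comfortable", "excellent", "helpful"}
NEGATIVE = {"delayed", "rude", "lost", "terrible", "cramped"}

def score_post(text):
    """Return (label, score): score = positive hits minus negative hits."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return label, score

posts = [
    "Excellent crew, very friendly and helpful!",
    "Flight delayed again and my bag was lost.",
    "Landed on time.",
]
summary = {}
for p in posts:
    label, _ = score_post(p)
    summary[label] = summary.get(label, 0) + 1
print(summary)  # {'positive': 1, 'negative': 1, 'neutral': 1}
```

Production systems replace the hand-built lexicon with trained models and distribute the scoring across a cluster, but the per-post classify-then-aggregate shape is the same one that turns thousands of posts into minute-level sentiment reports.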

Chime Enhances Customer Experience Using Snowflake

Chime, a mobile banking service, offers a Visa debit card, an FDIC-insured spending and savings account, and a mobile app that makes banking easier for people. Chime wanted to learn about its customer engagement. It wanted to analyze data across its mobile, Web, and back-end platforms to help enhance the user experience. However, pulling and aggregating data from multiple sources, such as ad services from Facebook and Google and event data from other third-party analytics tools delivered as JSON (JavaScript Object Notation) docs, was a laborious task. Chime wanted a solution that could aggregate data from these multiple sources and analyze the combined data set, processing JSON data sources and querying them using standard SQL database tables.

Chime started using the Snowflake Elastic Data Warehouse solution. Snowflake pulled data from all 14 of Chime's data sources, including data like JSON docs from applications.



Snowflake helped Chime analyze JSON data quickly to enhance member services and provide a more personalized banking experience to customers.

Source: Based on Snowflake.net. (n.d.). Chime delivers personalized customer experience using Snowflake. http://www.snowflake.net/product (accessed Oct 2018).
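The data problem in the Chime case, heterogeneous JSON event documents that must be flattened into one uniform schema before SQL-style analysis, can be sketched in a few lines. Snowflake does this natively with its VARIANT type and SQL over semi-structured data; the stand-in below uses plain Python, and every field name and document is invented for illustration.

```python
import json

# Event documents from different (hypothetical) sources with different shapes:
# an ad platform nests the user id, an in-app tracker keeps it at the top level.
raw_docs = [
    '{"source": "ads", "user": {"id": "u1"}, "event": "click", "cost": 0.12}',
    '{"source": "app", "user_id": "u1", "event": "signup"}',
    '{"source": "app", "user_id": "u2", "event": "open"}',
]

def normalize(doc):
    """Map heterogeneous JSON docs onto one flat row schema."""
    d = json.loads(doc)
    user = d.get("user_id") or d.get("user", {}).get("id")
    return {"source": d["source"], "user": user, "event": d["event"]}

rows = [normalize(doc) for doc in raw_docs]

# Once flattened, aggregation becomes an ordinary group-by, here events per user.
events_per_user = {}
for r in rows:
    events_per_user[r["user"]] = events_per_user.get(r["user"], 0) + 1
print(events_per_user)  # {'u1': 2, 'u2': 1}
```

In a warehouse like Snowflake the `normalize` step disappears: the raw documents are loaded as-is, and the path expressions (`doc:user.id` and the like) live inside the SQL query instead of application code.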

We are entering the "petabyte age," and traditional data and analytics approaches are beginning to show their limits. Cloud analytics is an emerging alternative solution for large-scale data analysis. Data-oriented cloud systems include storage and computing in a distributed and virtualized environment. A major advantage of these offerings is the rapid diffusion of advanced analysis tools among users without significant investment in technology acquisition. These solutions also come with many challenges, such as security, service level, and data governance. A number of concerns have been raised about cloud computing, including loss of control and privacy, legal liabilities, cross-border political issues, and so on. According to the Cloud Security Alliance, the top three security threats in the cloud are data loss and leakage, hardware failure, and insecure interfaces. All the data in the cloud is accessible by the service provider, so the service provider can unknowingly or deliberately alter the data or pass the data to a third party for legal purposes without asking the company. Research is still limited in this area. As a result, there is ample opportunity to bring analytical, computational, and conceptual modeling into the context of service science, service orientation, and cloud intelligence. Nonetheless, cloud computing is an important initiative for an analytics professional to watch, as it is a fast-growing area.

SECTION 9.9 REVIEW QUESTIONS

1. Define cloud computing. How does it relate to PaaS, SaaS, and IaaS?

2. Give examples of companies offering cloud services.

3. How does cloud computing affect BI?

4. How does DaaS change the way data is handled?

5. What are the different types of cloud platforms?

6. Why is AaaS cost-effective?

7. Name at least three major cloud service providers.

8. Give at least three examples of analytics-as-a-service providers.

9.10 LOCATION-BASED ANALYTICS FOR ORGANIZATIONS

Thus far, we have seen many examples of organizations employing analytical techniques to gain insights into their existing processes through informative reporting, predictive analytics, forecasting, and optimization techniques. In this section, we learn about a critical emerging trend: the incorporation of location data in analytics. Figure 9.15 gives our classification of location-based analytic applications. We first review applications that make use of static location data, usually called geospatial data. We then examine the explosive growth of applications that take advantage of all the location data being generated by today's devices. This section first focuses on analytics applications that are being developed by organizations to make better decisions in managing operations, targeting customers, promotions, and so forth. Then we will also explore analytics applications that are being developed to be used directly by a consumer, some of which also take advantage of the location data.

Geospatial Analytics

A consolidated view of the overall performance of an organization is usually represented through visualization tools that provide actionable information. The information may include current and forecasted values of various business factors and key performance indicators (KPIs). Looking at the KPIs as overall numbers via various graphs and charts can be overwhelming, and there is a high risk of missing potential growth opportunities or of not identifying the problematic areas. As an alternative to simply viewing reports, organizations employ visual maps that are geographically mapped and based on traditional location data, usually grouped by postal codes. These map-based visualizations have been used by organizations to view aggregated data and get more meaningful location-based insights. However, traditional location-based analytic techniques, which geocode organizational locations and consumers at this coarse level, hamper organizations in understanding "true location-based" impacts. Locations based on postal codes offer an aggregate view of a large geographic area. This poor granularity may not help pinpoint growth opportunities within a region, as the location of target customers can change rapidly, and an organization's promotional campaigns may not target the right customers if they are based on postal codes. To address these concerns, organizations are embracing location and spatial extensions to analytics. The addition of location components based on latitudinal and longitudinal attributes to traditional analytical techniques enables organizations to add a new dimension of "where" to their traditional business analyses, which currently answer the questions of "who," "what," "when," and "how much."

Location-based data are now readily available from geographic information systems (GIS). These are used to capture, store, analyze, and manage data linked to a location using integrated sensor technologies, global positioning systems installed in smartphones, or RFID deployments in the retail and healthcare industries.

By integrating information about the location with other critical business data, organizations are now creating location intelligence. Location intelligence enables organizations to gain critical insights and make better decisions by optimizing important processes and applications. Organizations now create interactive maps that drill down to details about any location, offering analysts the ability to investigate new trends and correlate location-specific factors across multiple KPIs. Analysts can now pinpoint trends and patterns in revenue, sales, and profitability across geographical areas.

[Figure 9.15 classifies location-based analytics applications along two dimensions: organization oriented versus consumer oriented, and a geospatial static approach versus a location-based dynamic approach. The examples shown include examining geographic site locations; live location feeds and real-time marketing promotions; GPS navigation and data analysis; and historic and current location demand analysis, predictive parking, and health-social networks.]

FIGURE 9.15 Classification of Location-Based Analytics Applications.



By incorporating demographic details into locations, retailers can determine how sales vary by population level and proximity to other competitors; they can assess the demand and efficiency of supply-chain operations. Consumer product companies can identify the specific needs of customers and customer complaint locations and easily trace them back to the products. Sales reps can better target their prospects by analyzing their geography.

A company that is the market leader in providing GIS data is ESRI (esri.com). ESRI licenses its ArcGIS software to thousands of customers, including commercial, government, and military users. It would take a book or more to highlight the applications of ESRI's GIS database and software! Another company, grindgis.com, identifies over 60 categories of GIS applications (http://grindgis.com/blog/gis-applications-uses (accessed October 2018)). A few examples that have not been mentioned yet include the following:

Agricultural applications: By combining location, weather, soil, and crop-related data, very precise irrigation and fertilizer applications can be planned. Examples include companies such as sstsoftware.com and sensefly.com (they combine GIS and the latest information collected through drones, another emerging technology).

Crime analysis: Superimposition of crime data including date, time, and type of crime onto the GIS data can provide significant insights into crime patterns and police staffing.

Disease spread prediction: One of the first known examples of descriptive analytics is the analysis of the cholera outbreak in London in 1854. Dr. John Snow plotted the cases of cholera on a map and was able to refute the theory that the cholera outbreak was being caused by bad air. The map helped him pinpoint the outbreak to a bad water well (TheGuardian.com, 2013). We have come a long way from needing to plot maps manually, but the idea of being able to track and then predict outbreaks of diseases, such as the flu, using GIS and other data has become a major field in itself. Application Case 9.7 gave an example of using social media data along with GIS data to pinpoint flu trends.

In addition, with location intelligence, organizations can quickly overlay weather and environmental effects and forecast the level of impact on critical business operations. With technology advancements, geospatial data is now being directly incorporated in enterprise data warehouses. Location-based in-database analytics enable organizations to perform complex calculations with increased efficiency and get a single view of all the spatially oriented data, revealing hidden trends and new opportunities. For example, Teradata's data warehouse supports the geospatial data feature based on the SQL/MM standard. The geospatial feature is captured as a new geometric data type called ST_GEOMETRY. It supports a large spectrum of shapes, from simple points, lines, and curves to complex polygons, in representing geographic areas. Organizations are converting the nonspatial data of their operating business locations by incorporating latitude and longitude coordinates. This process of geocoding is readily supported by service companies like NAVTEQ and Tele Atlas, which maintain worldwide databases of addresses with geospatial features, and makes use of address-cleansing tools like Informatica and Trillium, which support mapping of spatial coordinates to addresses as part of extract, transform, and load functions.
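Once locations are geocoded to latitude and longitude, the proximity questions raised earlier (which customers are near a store, which competitor is closest) reduce to great-circle distance calculations, which spatial SQL types like ST_GEOMETRY perform in-database. The haversine formula below is the standard computation; the store and customer coordinates are illustrative.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Illustrative coordinates: one store, two geocoded customers.
store = (40.7580, -73.9855)
customers = {"c1": (40.7484, -73.9857), "c2": (40.6413, -73.7781)}

# "Which customers are within 5 km of the store?"
nearby = {cid: round(haversine_km(*store, *pos), 2)
          for cid, pos in customers.items()
          if haversine_km(*store, *pos) <= 5.0}
print(nearby)  # c1 (about 1 km away) qualifies; c2 (roughly 18 km) is filtered out
```

In a warehouse with geospatial support the same filter is a single SQL predicate over ST_GEOMETRY columns, evaluated next to the rest of the business data rather than in application code.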

Organizations across a variety of business sectors are employing geospatial analytics. We will review some examples next. Application Case 9.10 provides an example of how location-based information was used in the Indian retail industry. Application Case 9.11 illustrates another application that goes beyond just the location decision.



Geographic information systems (GIS) are a relatively new technology that is increasingly being used for decision making in business. Software and databases developed by organizations such as the Environmental Systems Research Institute (ESRI) are enabling business operators in the retail sector to use GIS to determine the ideal spatial attributes for business intelligence and analysis in addition to traditional measures such as sales area and turnover.

This is particularly important in the case of countries such as India, which presents a variety of distinctive challenges in the retail business. The country has only recently moved from an unorganized to an organized, territorial structure of retailing with the participation of the government and private business. Moreover, the level of the goods and the shopping locations is often non-competitive and of variable quality.

The number of companies adopting GIS is still limited due to its technical complexity, but it offers numerous features, such as the following:

• the ability to develop intelligent marketing strategies by combining census, street, and suburb information

• identification of new target cities for roll out of existing formats

• mapping of certain kinds of customers to certain geographical locations

• analysis of customers’ spatial location in a static and dynamic approach

The future for GIS is promising. It has been adopted by Reliance Jio Infocomm Limited, the largest mobile operator in India, and other brands have been taking notice as well. By adopting GIS, India’s retail industry promises to make rapid progress toward modernization and automation of the entire sector.

Questions for Discussion

1. How is GIS used in business applications?

2. Why would GIS be useful for India's retail industry?

3. How can companies conduct customer analysis through GIS?

4. What is the main hurdle for the adoption of GIS?

Sources: Geospatial Technology Consulting. (2018). "Information Technology and GIS for Retail Industry in India." https://sites.google.com/site/geospatialconsulting/home/gis-for-re (accessed October 2019). Sita Mishra. (2009). "GIS in Indian Retail Industry: A Strategic Tool." International Journal of Marketing Studies, 1, 1. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1007.9579&rep=rep1&type=pdf (accessed October 2019). Reliance Jio Infocomm Limited. (2019). https://www.jio.com/welcome (accessed October 2019).

Application Case 9.10 GIS and the Indian Retail Industry

One of the key challenges for any organization that is trying to grow its presence is deciding the location of its next store. Starbucks faces the same question. To identify new store locations, more than 700 Starbucks employees (referred to as partners) in 15 countries use an ArcGIS-based market planning and BI solution called Atlas. Atlas provides partners with workflows, analysis, and store performance information so that local partners in the field can make decisions when identifying new business opportunities.

As reported in multiple sources, Atlas is employed by local decision makers to understand the population trends and demand. For example,

in China, there are over 1,200 Starbucks stores, and the company is opening a new store almost every day. Information such as trade areas, retail clusters and generators, traffic, and demographics is important in deciding the next store's location. After analyzing a new market and neighborhood, a manager can look at specific locations by zooming into an area in the city and identifying where, for example, three new office towers may be completed over the next 2 months. After viewing this area on the map, a workflow window can be created that will help the manager move the new site through approval, permitting, construction, and eventually opening.

Application Case 9.11 Starbucks Exploits GIS and Analytics to Grow Worldwide

M09_SHAR1552_11_GE_C09.indd 606 07/01/20 4:42 PM

Chapter 9 • Big Data, Cloud Computing, and Location Analytics: Concepts and Tools 607

By integrating weather and other local data, one can also better manage demand and supply-chain operations. Starbucks is integrating its enterprise business systems with its GIS solutions in Web services to see the world and its business in new ways. For example, Starbucks integrates AccuWeather’s forecasted real-feel temperature data. This forecasted temperature data can help localize marketing efforts. If a really hot week in Memphis is forthcoming, Starbucks analysts can select a group of coffee houses and get detailed information on past and future weather patterns, as well as store characteristics. This knowledge can be used to design a localized promotion for Frappuccinos, for example, helping Starbucks anticipate what its customers will be wanting a week in advance.

Major events also have an impact on coffee houses. When 150,000 people descended on San Diego for the Pride Parade, local baristas served a lot of customers. To ensure the best possible customer experience, Starbucks used this local event knowledge to plan staffing and inventory at locations near the parade.

Questions for Discussion

1. What type of demographics and GIS information would be relevant for deciding on a store location?
2. It has been mentioned that Starbucks encourages its customers to use its mobile app. What type of information might the company gather from the app to help it better plan operations?
3. Will the availability of free Wi-Fi at Starbucks stores provide any information to Starbucks for better analytics?

Sources: Digit.HBS.org. (2015). “Starbucks: Brewing up a Data Storm!” https://digit.hbs.org/submission/starbucks-brewing-up-a-data-storm/ (accessed October 2018); Wheeler, C. (2014). “Going Big with GIS.” www.esri.com/esri-news/arcwatch/0814/going-big-with-gis (accessed October 2018); Blogs.ESRI.com. (2014). “From Customers to CxOs, Starbucks Delivers World-Class Service.” https://blogs.esri.com/esri/ucinsider/2014/07/29/starbucks/ (accessed October 2018).

In addition to the retail transaction analysis applications highlighted here, there are many other applications of combining geographic information with other data being generated by an organization. For example, network operations and communication companies often generate massive amounts of data every day. The ability to analyze the data quickly with a high level of location-specific granularity can better identify customer churn and help in formulating location-specific strategies for increasing operational efficiency, quality of service, and revenue.

Geospatial analysis can enable communication companies to capture daily transactions from a network to identify the geographic areas experiencing a large number of failed connection attempts of voice, data, text, or Internet. Analytics can help determine the exact causes based on location and drill down to an individual customer to provide better customer service. You can see this in action by completing the following multimedia exercise.

A Multimedia Exercise in Analytics Employing Geospatial Analytics

Teradata University Network includes a BSI video on the case of dropped mobile calls. Please watch the video that appears on YouTube at the following link: www.teradatauniversitynetwork.com/Library/Samples/BSI-The-Case-of-the-Dropped-Mobile-Calls (accessed October 2018).

A telecommunication company launches a new line of smartphones and faces problems with dropped calls. The new rollout is in trouble, and the northeast region is the worst hit as the company compares the effects of dropped calls on profits across geographic regions. The company hires BSI to analyze the problems arising from defects in smartphone handsets, tower coverage, and software glitches. The entire northeast region data set is divided into geographic clusters, and the company solves the problem by identifying the individual customer data. The BSI team employs geospatial analytics to identify the locations where network coverage was leading to dropped calls and suggests installing a few additional towers where unhappy customers are located.

After the video is complete, you can see how the analysis was prepared at: slideshare.net/teradata/bsi-teradata-the-case-of-the-dropped-mobile-calls (accessed October 2018).

This multimedia excursion provides an example of a combination of geospatial analytics along with Big Data analytics that assists in better decision making.
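The aggregation-and-drill-down step described in this exercise, grouping failure records by geography to see where extra towers are needed, can be caricatured in a few lines. The call records, tower IDs, and counts below are entirely hypothetical and only stand in for the kind of geographic aggregation BSI performs:

```python
from collections import Counter

# Hypothetical call-detail records: (tower_id, region, call_was_dropped).
# Tower IDs, regions, and counts are invented for illustration.
records = [
    ("T-101", "northeast", True),
    ("T-101", "northeast", True),
    ("T-205", "northeast", False),
    ("T-101", "northeast", True),
    ("T-330", "southeast", False),
    ("T-205", "northeast", True),
]

# Count dropped calls per tower, then rank the towers: the worst
# offenders suggest where additional coverage might be needed.
drops = Counter(tower for tower, region, dropped in records if dropped)
worst = drops.most_common(2)
```

In a real deployment the same grouping would run over billions of records in a distributed platform, with towers joined to their coordinates so that the worst clusters can be drawn on a map.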

Real-Time Location Intelligence

Many devices in use by consumers and professionals are constantly sending out their location information. Cars, buses, taxis, mobile phones, cameras, and personal navigation devices all transmit their locations thanks to network-connected positioning technologies such as GPS, Wi-Fi, and cell tower triangulation. Millions of consumers and businesses use location-enabled devices for finding nearby services, locating friends and family, navigating, tracking assets and pets, dispatching, and engaging in sports, games, and hobbies. This surge in location-enabled services has resulted in a massive database of historical and real-time streaming location information. It is, of course, scattered and not very useful by itself. The automated data collection enabled through capture of cell phones and Wi-Fi hotspot access points presents an interesting new dimension in nonintrusive market research, data collection, and, of course, microanalysis of such massive data sets.

By analyzing and learning from these large-scale patterns of movement, it is possible to identify distinct classes of behaviors in specific contexts. This approach allows a business to better understand its customer patterns and make more informed decisions about promotions, pricing, and so on. By applying algorithms that reduce the dimensionality of location data, one can characterize places according to the activity and movement between them. From massive amounts of high-dimensional location data, these algorithms uncover trends, meaning, and relationships to eventually produce human-understandable representations. It then becomes possible to use such data to automatically make intelligent predictions and find important matches and similarities between places and people.
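One very simple way to reduce the dimensionality of a raw location trail is to snap GPS pings to coarse grid cells and count visits per cell; production systems use far richer models, but this toy sketch (with invented coordinates) shows the idea of turning thousands of points into a handful of "places":

```python
from collections import Counter

def cell(lat, lon, size=0.01):
    """Snap a coordinate to a coarse grid cell (roughly 1 km square)."""
    return (round(lat / size) * size, round(lon / size) * size)

# Hypothetical GPS pings from one user's location trail.
pings = [
    (40.4406, -79.9959), (40.4409, -79.9951), (40.4407, -79.9955),  # downtown
    (40.4440, -79.9532), (40.4438, -79.9530),                       # another area
]

# Counting visits per cell collapses a high-dimensional trail into a
# few "places"; the most-visited cell characterizes the dominant one.
visits = Counter(cell(lat, lon) for lat, lon in pings)
top_cell, count = visits.most_common(1)[0]
```

The cell visit counts then become a compact feature vector describing that user's movement, which can be clustered or matched against other users and places.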

Location-based analytics finds many applications in consumer-oriented marketing. Many companies now offer platforms to analyze the location trails of mobile users based on geospatial data obtained from GPS and to target tech-savvy customers with coupons on their smartphones as they pass by a retailer. This illustrates an emerging trend in the retail space where companies are looking to improve the efficiency of marketing campaigns: not just by targeting every customer based on real-time location, but by employing more sophisticated predictive analytics in real time on consumer behavioral profiles to find the right set of consumers for advertising campaigns.
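At its core, the real-time geofenced targeting described above is a distance test between a user's position and a store. A minimal sketch using the standard haversine formula (the store coordinates and 200-meter radius here are invented):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))  # mean Earth radius in meters

def should_send_coupon(user, store, radius_m=200):
    """Fire a geofenced offer when the user passes within radius_m."""
    return haversine_m(*user, *store) <= radius_m

store = (47.6097, -122.3331)                             # hypothetical retailer
near = should_send_coupon((47.6099, -122.3330), store)   # a few meters away
far = should_send_coupon((47.6200, -122.3331), store)    # about a kilometer away
```

A real platform would layer the predictive step on top of this test, scoring the passing user's behavioral profile before deciding whether the coupon is worth sending.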

Yet another extension of location-based analytics is to use augmented reality. In 2016, Pokémon GO became a market sensation. It is a location-sensing, augmented reality-based game that encourages users to claim virtual items from select geographic locations. The user can start anywhere in a city and follow markers on the app to reach a specific item. Virtual items are visible through the app when the user points a phone’s camera toward the virtual item. The user can then claim this item. Business applications of such technologies are also emerging. For example, an app called Candybar allows businesses to place these virtual items on a map using Google Maps. The placement of this item can be fine-tuned using Google’s Street View. Once all virtual items have been configured with the information and location, the business can submit items, which are then visible to the user in real time. Candybar also provides usage analytics to the business to enable better targeting of virtual items. The virtual reality aspect of this app improves the experience of users, providing them with a “gaming” environment in real life. At the same time, it provides a powerful marketing platform for businesses to reach their customers.

As is evident from this section, location-based analytics and ensuing applications are perhaps the most important front in the near future for organizations. A common theme in this section was the use of operational or marketing data by organizations. We will next explore analytics applications that are directly targeted at users and sometimes take advantage of location information.

Analytics Applications for Consumers

The explosive growth of the apps industry for smartphone platforms (iOS, Android, Windows, and so forth) and the use of analytics are creating tremendous opportunities for developing apps where the consumers use analytics without ever realizing it. These apps differ from the previous category in that they are meant for direct use by a consumer, as opposed to an organization that is trying to mine a consumer’s usage/purchase data to create a profile for marketing specific products or services. Predictably, these apps are meant for enabling consumers to make better decisions by employing specific analytics. We highlight a few of these in the following examples.

• Waze, a social Web app that assists users in identifying a navigation path and alerts users about potential issues such as accidents, police checkpoints, speed traps, and construction, based on other users’ inputs, has become a very popular navigation app. Google acquired this app a few years ago and has enhanced it further. This app is an example of aggregating user-generated information and making it available for customers.

• Many apps allow users to submit reviews and ratings for businesses, products, and so on, and then present those to the users in an aggregated form to help them make choices. These apps can also be identified as apps based on social data that are targeted at consumers where the data are generated by the consumers. One of the more popular apps in this category is Yelp. Similar apps are available all over the world.

• Another transportation-related app that uses predictive analytics, ParkPGH, has been deployed since about 2010 in Pittsburgh, Pennsylvania. Developed in collaboration with Carnegie Mellon University, this app includes predictive capabilities to estimate parking availability. ParkPGH directs drivers to parking lots in areas where parking is available. It calculates the number of parking spaces available in several garages in the cultural arts district of Pittsburgh. Available spaces are updated every 30 seconds, keeping the driver as close to the current availability as possible. Depending on historical demand and current events, the app is able to predict parking availability and provide information on which lots will have free space by the time the driver reaches the destination. The app’s underlying algorithm uses data on current events around the area—for example, a basketball game—to predict an increase in demand for parking spaces later that day, thus saving the commuters valuable time searching for parking spaces in the busy city. Success of this app has led to a proliferation of parking apps that work in many major cities and allow a user to book a parking space in advance, recharge the meter, even bid for a parking space, etc. Both iPhone app store and Google Play store include many such apps.
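ParkPGH's actual model is more sophisticated than what is described here, but the idea of combining historical demand with current events can be sketched roughly as follows (all garage numbers and the event adjustment are invented for illustration):

```python
# Average free spaces in a hypothetical garage, by hour of day.
historical_free = {17: 120, 18: 60, 19: 35}

# Invented adjustment: a nearby basketball game halves availability.
event_penalty = {"basketball_game": 0.5}

def predict_free_spaces(hour, events=()):
    """Predict free spaces from the historical average for that hour,
    discounted for any demand-increasing events happening that day."""
    expected = historical_free.get(hour, 0)
    for event in events:
        expected *= event_penalty.get(event, 1.0)
    return int(expected)

baseline = predict_free_spaces(18)                       # historical average
game_day = predict_free_spaces(18, ["basketball_game"])  # discounted for the game
```

A production system would also blend in the live 30-second sensor counts, letting recent observations correct the historical forecast as the driver approaches.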

610 Part III • Prescriptive Analytics and Big Data

Analytics-based applications are emerging not just for fun and health, but also to enhance one’s productivity. For example, Google’s e-mail app called Gmail analyzes billions of e-mail transactions and develops automated responses for e-mails. When a user receives an e-mail and reads it in her Gmail app, the app also recommends short responses for the e-mail at hand that a user can select and send to the original sender.

As is evident from these examples of consumer-centric apps, predictive analytics is beginning to enable development of software that is directly used by a consumer. We believe that the growth of consumer-oriented analytic applications will continue and create many entrepreneurial opportunities for the readers of this book.

One key concern in employing these technologies is the loss of privacy. If someone can track the movement of a cell phone, the privacy of that customer is a big issue. Some of the app developers claim that they only need to gather aggregate flow information, not individually identifiable information. But many stories appear in the media that highlight violations of this general principle. Both users and developers of such apps have to be very aware of the deleterious effect of giving out private information as well as collecting such information. We discuss this issue a bit further in Chapter 14.

SECTION 9.10 REVIEW QUESTIONS

1. How does traditional analytics make use of location-based data?

2. How can geocoded locations assist in better decision making?

3. What is the value provided by geospatial analytics?

4. Explore the use of geospatial analytics further by investigating its use across various sectors like government census tracking, consumer marketing, and so forth.

5. Search online for other applications of consumer-oriented analytical applications.

6. How can location-based analytics help individual consumers?

7. Explore more transportation applications that may employ location-based analytics.

8. What other applications can you imagine if you were able to access cell phone location data?

Chapter Highlights

• Big Data means different things to people with different backgrounds and interests.

• Big Data exceeds the reach of commonly used hardware environments and/or capabilities of software tools to capture, manage, and process it within a tolerable time span.

• Big Data is typically defined by three “V”s: volume, variety, velocity.

• MapReduce is a technique to distribute the processing of very large multistructured data files across a large cluster of machines.

• Hadoop is an open source framework for processing, storing, and analyzing massive amounts of distributed, unstructured data.

• Hive is a Hadoop-based data-warehousing-like framework originally developed by Facebook.

• Pig is a Hadoop-based query language developed by Yahoo!

• NoSQL, which stands for Not Only SQL, is a new paradigm to store and process large volumes of unstructured, semistructured, and multistructured data.

• Big Data and data warehouses are complementary (not competing) analytics technologies.

• As a relatively new area, the Big Data vendor landscape is developing very rapidly.

• Stream analytics is a term commonly used for extracting actionable information from continuously flowing/streaming data sources.

• Perpetual analytics evaluates every incoming observation against all prior observations.

• Critical event processing is a method of capturing, tracking, and analyzing streams of data to detect certain events (out of normal happenings) that are worthy of the effort.

• Data stream mining, as an enabling technology for stream analytics, is the process of extracting novel patterns and knowledge structures from continuous, rapid data records.

• Cloud computing offers the possibility of using software, hardware, platforms, and infrastructure, all on a service-subscription basis. Cloud computing enables a more scalable investment on the part of a user.


• Cloud-computing-based analytic services offer organizations the latest technologies without significant up-front investment.

• Geospatial data can enhance analytics applications by incorporating location information.

• Real-time location information of users can be mined to develop promotion campaigns that are targeted at a specific user in real time.

• Location information from mobile phones can be used to create profiles of user behavior and movement. Such location information can enable users to find other people with similar interests and advertisers to customize their promotions.

• Location-based analytics can also benefit consumers directly rather than just businesses. Mobile apps are being developed to enable such innovative analytics applications.
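The MapReduce highlight above can be made concrete with a single-process word-count sketch that simulates the map, shuffle, and reduce phases (a real Hadoop job distributes each phase across a cluster; this only mimics the data flow):

```python
from collections import defaultdict

def map_phase(doc):
    # Map: emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group intermediate pairs by key, as the framework would.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: combine all values emitted for one key.
    return key, sum(values)

docs = ["big data big clusters", "data pipelines"]
pairs = [p for doc in docs for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts == {'big': 2, 'data': 2, 'clusters': 1, 'pipelines': 1}
```

Because map and reduce are pure functions of their inputs, the framework is free to run each phase on many machines in parallel, which is what makes the pattern scale to very large files.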

Key Terms

Big Data, Big Data analytics, cloud computing, critical event processing, data scientists, data stream mining, geographic information systems (GIS), Hadoop, Hadoop Distributed File System (HDFS), Hive, MapReduce, NoSQL, perpetual analytics, Pig, Spark, stream analytics

Questions for Discussion

1. What is Big Data? Why is it important? Where does Big Data come from?

2. What do you think the future of Big Data will be? Will it lose its popularity to something else? If so, what will it be?

3. What is Big Data analytics? How does it differ from regular analytics?

4. What are the critical success factors for Big Data analytics?

5. What are the big challenges that one should be mindful of when considering implementation of Big Data analytics?

6. What are the common business problems addressed by Big Data analytics?

7. In the era of Big Data, are we about to witness the end of data warehousing? Why?

8. What are the use cases for Big Data/Hadoop and data warehousing/RDBMS?

9. Is cloud computing “just an old wine in a new bottle?” How is it similar to other initiatives? How is it different?

10. What is stream analytics? How does it differ from regular analytics?

11. What are the most fruitful industries for stream analytics? What is common to those industries?

12. Compared to regular analytics, do you think stream analytics will have more (or less) use cases in the era of Big Data analytics? Why?

13. What are the potential benefits of using geospatial data in analytics? Give examples.

14. What types of new applications can emerge from knowing locations of users in real time? What if you also knew what they have in their shopping cart, for example?

15. How can consumers benefit from using analytics, especially based on location information?

16. “Location-tracking–based profiling is powerful but also poses privacy threats.” Comment.

17. Discuss the relationship between mobile devices and social networking.

Exercises

Teradata University Network (TUN) and Other Hands-on Exercises

1. Go to teradatauniversitynetwork.com, and search for case studies. Read cases and white papers that talk about Big Data analytics. What is the common theme in those case studies?

2. At teradatauniversitynetwork.com, find the SAS Visual Analytics white papers, case studies, and hands-on exercises. Carry out the visual analytics exercises on large data sets and prepare a report to discuss your findings.

3. At teradatauniversitynetwork.com, go to the Sports Analytics page. Find applications of Big Data in sports. Summarize your findings.


4. Go to teradatauniversitynetwork.com, and search for BSI Videos that talk about Big Data. Review these BSI videos, and answer the case questions related to them.

5. Go to the teradata.com and/or asterdata.com Web sites. Find at least three customer case studies on Big Data, and write a report where you discuss the commonalities and differences of these cases.

6. Access the Crunchbase portal and explore the section on Asia’s Big Data companies (https://www.crunchbase.com/hub/asia-big-data-companies). List the most representative ones based on their sector.

7. Visit the Web site of the Big Data Value Association (http://www.bdva.eu/node/884). Go to the Resources page and explore the available material to find out how Big Data utilization has grown over the years.

8. A Spark™ data frame may be described as a distributed data collection organized into named columns together with other functions, such as filtering. Visit https://spark.apache.org/ and discuss its applications.

9. Go to hortonworks.com. Find at least three customer case studies on Hadoop implementation, and write a report in which you discuss the commonalities and differences of these cases.

10. Go to marklogic.com. Find at least three customer case studies on Hadoop implementation, and write a report where you discuss the commonalities and differences of these cases.

11. Apache Hive is a data warehouse software project built on top of Apache Hadoop. Visit https://hive.apache.org/ and discuss its features.

12. Go to https://www.educba.com/hadoop-vs-rdbms/ and discuss five or more of the differences between Hadoop and RDBMS listed on this page.

13. Enter google.com/scholar, and search for articles on data stream mining. Find at least three related articles. Read and summarize your findings.

14. Enter google.com/scholar, and search for articles that talk about Big Data versus data warehousing. Find at least five articles. Read and summarize your findings.

15. Go to https://www.educba.com/big-data-vs-data-warehouse/ and discuss the differences between big data and data warehousing listed on this page.

16. Go to https://www.educba.com/mapreduce-vs-spark/ and discuss the differences between MapReduce and Spark™ listed on this page.

17. Go to https://www.educba.com/hadoop-vs-hive/ and discuss the differences between Hadoop and Hive on this page.

18. Go to the PC Magazine Web site and read the article “The Best Video Streaming Services for 2019” (https://www.pcmag.com/roundup/336650/the-best-video-streaming-services). Choose two streaming services for comparison and explain your choice.

19. The objective of the exercise is to familiarize you with the capabilities of smartphones to identify human activity. The data set is available at archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones. It contains accelerometer and gyroscope readings on 30 subjects who had the smartphone on their waist. The data is available in a raw format and involves some data preparation efforts. Your objective is to identify and classify these readings into activities like walking, running, climbing, and such. More information on the data set is available on the download page. You may use clustering for initial exploration and to gain an understanding of the data. You may use tools like R to prepare and analyze this data.

References

Adapted from Alteryx.com. Great Clips. alteryx.com/sites/default/files/resources/files/case-study-great-chips.pdf (accessed September 2018).

Adapted from Snowflake.net. (n.d.). “Chime Delivers Personalized Customer Experience Using Chime.” www.snowflake.net/product (accessed September 2018).

Adshead, A. (2014). “Data Set to Grow 10-fold by 2020 as Internet of Things Takes Off.” www.computerweekly.com/news/2240217788/Data-set-to-grow-10-fold-by-2020-as-internet-of-things-takes-off (accessed September 2018).

Altaweel, Mark. (2018, August 1). “Accessing Real-Time Satellite Imagery and Data.” GIS Lounge. www.gislounge.com/accessing-real-time-satellite-imagery/.

Amodio, M. (2015). “Salesforce Adds Predictive Analytics to Marketing Cloud.” Cloud Contact Center. cloudcontactcenterzone.com/topics/cloud-contact-center/articles/413611-salesforce-adds-predictive-analytics-marketing-cloud.htm (accessed September 2018).

Asamoah, D., & R. Sharda. (2015). “Adapting CRISP-DM Process for Social Network Analytics: Application to Healthcare.” In AMCIS 2015 Proceedings. aisel.aisnet.org/amcis2015/BizAnalytics/GeneralPresentations/33/ (accessed September 2018).

Asamoah, D., R. Sharda, A. Zadeh, & P. Kalgotra. (2016). “Preparing a Big Data Analytics Professional: A Pedagogic Experience.” In DSI 2016 Conference, Austin, TX.

Awadallah, A., & D. Graham. (2012). “Hadoop and the Data Warehouse: When to Use Which.” teradata.com/white-papers/Hadoop-and-the-Data-Warehouse-When-to-Use-Which (accessed September 2018).

Blogs.ESRI.com. (2014). “From Customers to CxOs, Starbucks Delivers World-Class Service.” https://blogs.esri.com/esri/ucinsider/2014/07/29/starbucks/ (accessed September 2018).

Broniatowski, D. A., M. J. Paul, & M. Dredze. (2013). “National and Local Influenza Surveillance through Twitter: An Analysis of the 2012–2013 Influenza Epidemic.” PLoS One, 8(12), e83672.

Cisco. (2016). “The Zettabyte Era: Trends and Analysis.” cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/vni-hyperconnectivity-wp.pdf (accessed October 2018).


ComputerWeekly.com. (2016). “Big-Data and Open Source Cloud Technology Help Gulf Air Pin Down Customer Sentiment.” www.computerweekly.com/news/450297404/Big-data-and-open-source-cloud-technology-help-Gulf-Air-pin-down-customer-sentiment (accessed September 2018).

CxOtoday.com. (2014). “Cloud Platform to Help Pharma Co Accelerate Growth.” www.cxotoday.com/story/mankind-pharma-to-drive-growth-with-softlayers-cloud-platform/ (accessed September 2018).

Dalininaa, R. “Using Natural Language Processing to Analyze Customer Feedback in Hotel Reviews.” www.datascience.com/resources/notebooks/data-science-summarize-hotel-reviews (accessed October 2018).

DataStax. “Customer Case Studies.” datastax.com/resources/casestudies/eBay (accessed September 2018).

Davis, J. (2015). “Salesforce Adds New Predictive Analytics to Marketing Cloud.” Information Week. informationweek.com/big-data/big-data-analytics/salesforce-adds-new-predictive-analytics-to-marketing-cloud/d/d-id/1323201 (accessed September 2018).

Dean, J., & S. Ghemawat. (2004). “MapReduce: Simplified Data Processing on Large Clusters.” research.google.com/archive/mapreduce.html (accessed September 2018).

Delen, D., M. Kletke, & J. Kim. (2005). “A Scalable Classification Algorithm for Very Large Datasets.” Journal of Information and Knowledge Management, 4(2), 83–94.

Demirkan, H., & D. Delen. (2013, April). “Leveraging the Capabilities of Service-Oriented Decision Support Systems: Putting Analytics and Big Data in Cloud.” Decision Support Systems, 55(1), 412–421.

Digit.HBS.org. (2015). “Starbucks: Brewing up a Data Storm!” https://digit.hbs.org/submission/starbucks-brewing-up-a-data-storm/ (accessed September 2018).

Dillow, C. (2016). “What Happens When You Combine Artificial Intelligence and Satellite Imagery.” fortune.com/2016/03/30/facebook-ai-satellite-imagery/ (accessed September 2018).

Ekster, G. (2015). “Driving Investment Performance with Alternative Data.” integrity-research.com/wp-content/uploads/2015/11/Driving-Investment-Performance-With-Alternative-Data.pdf (accessed September 2018).

Henschen, D. (2016). “Salesforce Reboots Wave Analytics, Preps IoT Cloud.” ZD Net. zdnet.com/article/salesforce-reboots-wave-analytics-preps-iot-cloud/ (accessed September 2018).

Higginbotham, S. (2012). “As Data Gets Bigger, What Comes after a Yottabyte?” gigaom.com/2012/10/30/as-data-gets-bigger-what-comes-after-a-yottabyte (accessed September 2018).

Hope, B. (2015). “Provider of Personal Finance Tools Tracks Bank Cards, Sells Data to Investors.” Wall Street Journal. wsj.com/articles/provider-of-personal-finance-tools-tracks-bank-cards-sells-data-to-investors-1438914620 (accessed September 2018).

Jonas, J. (2007). “Streaming Analytics vs. Perpetual Analytics (Advantages of Windowless Thinking).” jeffjonas.typepad.com/jeff_jonas/2007/04/streaming_analy.html (accessed September 2018).

Kalgotra, P., & R. Sharda. (2016). “Rural Versus Urban Comorbidity Networks.” Working Paper, Center for Health Systems and Innovation, Oklahoma State University.

Kalgotra, P., R. Sharda, & J. M. Croff. (2017). “Examining Health Disparities by Gender: A Multimorbidity Network Analysis of Electronic Medical Record.” International Journal of Medical Informatics, 108, 22–28.

Kelly, L. (2012). “Big Data: Hadoop, Business Analytics, and Beyond.” wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond (accessed September 2018).

Moran, P. A. (1950). “Notes on Continuous Stochastic Phenomena.” Biometrika, 37(1/2), 17–23.

“Overstock.com: Revolutionizing Data and Analytics to Connect Soulfully with Their Customers.” https://www.teradata.com/Resources/Videos/Overstock-com-Revolutionizing-data-and-analy (accessed October 2018).

“Overstock.com Uses Teradata Path Analysis to Boost Its Customer Journey Analytics.” (2018, March 27). https://www.retailitinsights.com/doc/overstock-com-uses-teradata-path-analysis-boost-customer-journey-analytics-0001 (accessed October 2018).

Palmucci, J. “Using Apache Spark for Massively Parallel NLP.” http://engineering.tripadvisor.com/using-apache-spark-for-massively-parallel-nlp/ (accessed October 2018).

RedHat.com. (2016). “Gulf Air’s Big Data Innovation Delivers Deeper Customer Insight.” https://www.redhat.com/en/success-stories (accessed September 2018).

RedHat.com. (2016). “Gulf Air Builds Private Cloud for Big Data Innovation with Red Hat Technologies.” https://www.redhat.com/en/about/press-releases/gulf-air-builds-private-cloud-big-data-innovation-red-hat-technologies (accessed September 2018).

Russom, P. (2013). “Busting 10 Myths about Hadoop: The Big Data Explosion.” TDWI’s Best of Business Intelligence, 10, 45–46.

Sarasohn-Kahn, J. (2008). The Wisdom of Patients: Health Care Meets Online Social Media. Oakland, CA: California HealthCare Foundation.

Shaw, C. (2016). “Satellite Companies Moving Markets.” quandl.com/blog/alternative-data-satellite-companies (accessed September 2018).

Steiner, C. (2009). “Sky High Tips for Crop Traders” (accessed September 2018).

St Louis, C., & G. Zorlu. (2012). “Can Twitter Predict Disease Outbreaks?” BMJ, 344.

Tableau white paper. (2012). “7 Tips to Succeed with Big Data in 2013.” cdnlarge.tableausoftware.com/sites/default/files/whitepapers/7-tips-to-succeed-with-big-data-in-2013.pdf (accessed September 2018).


Tartar, Andre, et al. (2018, July 26). “All the Things Satellites Can Now See From Space.” Bloomberg.com. www.bloomberg.com/news/features/2018-07-26/all-the-things-satellites-can-now-see-from-space (accessed October 2018).

Thusoo, A., Z. Shao, & S. Anthony. (2010). “Data Warehousing and Analytics Infrastructure at Facebook.” In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (p. 1013).

Turner, M. (2015). “This Is the Future of Investing, and You Probably Can’t Afford It.” businessinsider.com/hedge-funds-are-analysing-data-to-get-an-edge-2015-8 (accessed September 2018).

Watson, H. (2012). “The Requirements for Being an Analytics-Based Organization.” Business Intelligence Journal, 17(2), 42–44.

Watson, H., R. Sharda, & D. Schrader. (2012). “Big Data and How to Teach It.” Workshop at AMCIS, Seattle, WA.

Wheeler, C. (2014). “Going Big with GIS.” www.esri.com/esri-news/arcwatch/0814/going-big-with-gis (accessed October 2018).

White, C. (2012). “MapReduce and the Data Scientist.” Teradata Vantage White Paper. teradata.com/white-paper/MapReduce-and-the-Data-Scientist (accessed September 2018).

Wikipedia.com. “Petabyte.” en.wikipedia.org/wiki/Petabyte (accessed September 2018).

Zadeh, A. H., H. M. Zolbanin, R. Sharda, & D. Delen. (2015). “Social Media for Nowcasting the Flu Activity: Spatial-Temporal and Text Analysis.” Business Analytics Congress, Pre-ICIS Conference, Fort Worth, TX.

Zikopoulos, P., D. DeRoos, K. Parasuraman, T. Deutsch, D. Corrigan, & J. Giles. (2013). Harness the Power of Big Data. New York: McGraw-Hill.

“Zion China Uses Azure IoT, Stream Analytics, and Machine Learning to Evolve Its Intelligent Diabetes Management Solution.” www.codeproject.com/Articles/1194824/Zion-China-uses-Azure-IoT-Stream-Analytics-and-M (accessed October 2018) and https://microsoft.github.io/techcasestudies/iot/2016/12/02/IoT-ZionChina.html (accessed October 2018).

