Big Data? Ricardo Campos Mestrado EI-IC Análise e Processamento de Grandes Volumes de Dados Tomar, Portugal, 2017

Big Data? - IPTricardo/ficheiros/BD - Big Data.pdf · Big data must follow the same principles of data management: Data collection (sensors etc) Data storage (Oracle, SAP, IBM, EMC,


Big Data?

Ricardo Campos

Instituto Politécnico de Tomar

Mestrado EI-IC – Análise e Processamento de Grandes Volumes de Dados (Analysis and Processing of Large Volumes of Data), Tomar, Portugal, 2017


What is Information Retrieval?

AGENDA: What is this talk about?

1. Overview
2. Who
3. Where
4. Why
5. V’s
6. BD vs Traditional
7. Different Types of Data
8. Q&A


Big Data is used in the singular and refers to a collection of data sets so large and complex that it is impossible to process them with the usual databases and tools.

Because of its size and associated numbers, Big Data is hard to capture, store, search, share, analyze and visualize.

Consider reading:

https://www.simplilearn.com/whats-the-big-deal-about-big-data-article

https://storage.googleapis.com/supplemental_media/udacityu/306818608/Lesson%201%20Notes.pdf


The phenomenon came about in recent years due to the sheer amount of machine data being generated today, thanks to:

• mobile devices
• tracking systems / RFID
• sensor networks
• social networks
• internet searches
• automated record keeping
• video archives
• e-commerce

coupled with the additional information derived by analyzing all this information, which on its own creates another enormous data set.


Big Data analysis requires collecting massive amounts of messy data.

The data is not in a uniform format, as one would see in a traditional database, and it is not annotated (semantically tagged).

Think of every tweet ever tweeted.


Big Data analytics can reveal important patterns that would otherwise go unnoticed.

Taking the antidepressant Paxil together with the anti-cholesterol drug Pravachol could result in diabetic blood sugar levels. This was discovered by:

(1) using a symptomatic footprint characteristic of very high blood sugar levels, obtained by analyzing thirty years of reports in an FDA database, and

(2) then finding that footprint in the Bing searches using an algorithm that detected statistically significant correlations. People taking both drugs also tended to enter search terms (“fatigue” and “headache,” for example) that constitute the symptomatic footprint.
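To make "statistically significant correlation" concrete, here is a minimal sketch of a Pearson chi-square test on a 2x2 table. It is not the algorithm used in the study described above; the counts are invented purely for illustration, and the function name `chi_square_2x2` is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Hypothetical counts, invented for illustration only.
# Rows: users who searched for both drugs / users who searched for only one.
# Columns: also searched for footprint symptoms / did not.
both_with, both_without = 40, 60
one_with, one_without = 20, 80

stat = chi_square_2x2(both_with, both_without, one_with, one_without)

# Critical value of the chi-square distribution, 1 degree of freedom, 5% level.
CRITICAL_5_PERCENT = 3.841
significant = stat > CRITICAL_5_PERCENT
print(f"chi-square = {stat:.2f}, significant at 5%: {significant}")
```

With these made-up counts the statistic exceeds the 5% critical value, so the association would be flagged. A real analysis would also have to correct for the very large number of term combinations being tested.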


Common use cases for Big Data:

• Fraud Detection;

• Risk Modeling;

• Social Sentiment Analysis;

• Image Classification;

• Graph Analysis

Please consider reading pages 29–39 of the book Hadoop for Dummies.

Please consider reading pages 3–9 of the book Harness the Power of Big Data: The IBM Big Data Platform.


Big data must follow the same principles of data management:

• Data collection (sensors, etc.)
• Data storage (Oracle, SAP, IBM, EMC, Spark, Hadoop, Storm, BigQuery, Amazon EC2 and EMR)
• Data format conversion (voice2txt, txt2voice, natural language processing from unstructured to structured)
• Data integration (data linkage, metadata)
• Data privacy (privacy-preserving data mining, computer security)


Technical Challenges:

Storage: How can we capture relevant data in time and then use the insight derived from that data for business results?

Analysis: How can we understand and utilize it when it comes in such a multitude of unstructured formats?

Price: How can we analyze and manage the need for and the size of the computational capacity required to handle it safely?


Technical Challenges:

Storage: NoSQL (non-relational) DBs: Hadoop, DynamoDB, Berkeley DB, MongoDB, CouchDB, …

Analysis: Parallel computing (Hadoop’s MapReduce).

Price: Parallel computing (Hadoop’s MapReduce).
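The MapReduce idea mentioned above can be sketched on a single machine in plain Python. This is only an illustration of the map, shuffle and reduce phases using the classic word-count example. It does not use Hadoop's API, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a real cluster each document (or block) would be mapped on a
    # different node; here the map calls simply run one after another.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs).items())


counts = word_count(["big data is big", "data is messy"])
print(counts)
```

Because each map call only sees its own document and each reduce call only sees one key, both phases can be spread over many machines, which is what makes the approach attractive for both the analysis and the price challenges.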


Big Data is so promising that IBM has created the Big Data University.


Companies pursue Big Data because it can be revelatory in spotting business trends, improving research quality, and gaining insights in a variety of fields, from IT to medicine to law enforcement and everything in between and beyond.

Health

A health care consultancy has made the data coming out of medical practices the focus of its thriving business. The company collects billing and diagnostic code data from 10,000 doctors on a daily, weekly and monthly basis to create a virtual clinical integration model.

Cloud services such as Ginger.io already allow care providers to monitor their patients through sensor-based applications on their smartphones.


Consumer Products Companies

GPS (Global Positioning System) technology now allows trucking firms to track their trucks and the merchandise inside them. Practically anything you can attach an RFID tag to can be tracked. How a company uses that information (to re-route trucks to create efficient routes, alert customers to deliveries, and forecast and price services) depends on the ability to manage and analyze data effectively.

Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress.

https://pplware.sapo.pt/informacao/amazon-go-fim-das-filas-caixas-supermercado/
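To get a feel for the Walmart figures quoted above, the short calculation below converts them into more familiar units. It only rearranges the numbers from the slide. The decimal convention (1 PB = 1000 TB) is an assumption.

```python
# Figures quoted on the slide.
TRANSACTIONS_PER_HOUR = 1_000_000
DATABASE_PETABYTES = 2.5
LIBRARY_MULTIPLE = 167  # "167 times the Library of Congress"

# Assumption: decimal units, 1 PB = 1000 TB.
TERABYTES_PER_PETABYTE = 1000

transactions_per_second = TRANSACTIONS_PER_HOUR / 3600
implied_library_terabytes = (
    DATABASE_PETABYTES * TERABYTES_PER_PETABYTE / LIBRARY_MULTIPLE
)

print(f"{transactions_per_second:.0f} transactions per second")
print(f"implied size of the book collection: {implied_library_terabytes:.1f} TB")
```

So the slide's figures correspond to roughly 278 transactions every second, and they imply a book collection of about 15 TB.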


Last month, I talked to Amazon customer service about my malfunctioning Kindle, and it was great. Thirty seconds after putting in a service request on Amazon’s website, my phone rang, and the woman on the other end--let’s call her Barbara--greeted me by name and said, "I understand that you have a problem with your Kindle." We resolved my problem in under two minutes, we got to skip the part where I carefully spell out my last name and address, and she didn’t try to upsell me on anything. After nearly a decade of ordering stuff from Amazon, I never loved the company as much as I did at that moment.

Article by Sean Madden, an expert in service design and innovation strategy, May 2012.


The fact is, Amazon has been collecting my information for years--not just addresses and payment information but the identity of everything I’ve ever bought or even looked at. And while dozens of other companies do that, too, Amazon’s doing something remarkable with theirs. They’re using that data to build our relationship.


Sports Clubs

In one of the greatest sports stories of all time, Leicester City won the 2015/16 Premier League title.

Throughout the history of the Premier League, every champion until then had finished in the top 3 in the season before winning the title. Leicester City, however, was an exception, finishing the 2014/15 season in 14th place, 46 points behind winners Chelsea. How did they do it?

Please consider reading this article: https://www.simplilearn.com/data-analytics-behind-leicester-city-16-epl-win-article

http://www.maisfutebol.iol.pt/benfica/formacao/video-imagens-nunca-vistas-sobre-a-maquina-do-seixal?_ga=2.67815961.940326162.1493990127-1281919722.1484127213


Big Pharmaceutical Companies


Government Agencies


Credit Card Companies


Telecoms


Facebook

Facebook uses Hadoop, Hive, and HBase for data warehousing and real-time application serving. Their data warehousing clusters are petabytes in size with thousands of nodes.

Please consider reading these articles:

https://www.simplilearn.com/how-facebook-is-using-big-data-article

https://www.facebook.com/note.php?note_id=468211193919


Twitter

Twitter uses Hadoop, Pig, and HBase for data analysis, visualization, social graph analysis, and machine learning.


Yahoo

Yahoo! uses Hadoop for data analytics, machine learning, search ranking, email antispam, ad optimization, ETL, and more. Combined, it has over 40,000 servers running Hadoop with 170 PB of storage.


Google

Google, in its MapReduce paper, indicated that it used its version of MapReduce to create its web index from crawl data.

In 2010 Google moved to a real-time indexing system called Caffeine:

Please consider reading this article: https://googleblog.blogspot.pt/2010/06/our-new-search-index-caffeine.html


eBay, Samsung, ….

eBay, Samsung, Rackspace, J.P. Morgan, Groupon, LinkedIn, AOL, Last.fm, and StumbleUpon are some of the other organizations that are also heavily invested in Hadoop and Spark.


IBM Watson

Sloan Kettering Cancer Center doctors are training IBM Watson to be an expert in cancer diagnosis and treatment based on learning:

• Over 600,000 diagnostic reports
• Two million pages of medical journal articles
• One and a half million patient records


Watson is an IBM supercomputer that combines artificial intelligence (AI) and sophisticated analytical software for optimal performance as a “question answering” machine (https://web.stanford.edu/~jurafsky/slp3/28.pdf and http://start.csail.mit.edu/index.php);

The supercomputer is named for IBM’s founder, Thomas J. Watson.

To replicate (or surpass) a high-functioning human’s ability to answer questions, Watson accesses 90 servers with a combined data store of over 200 million pages of information, which it processes against six million logic rules.


Apache's Hadoop, a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.

SUSE operating system;

2,880 processor cores.

15 terabytes of RAM.

IBM's DeepQA software, which is designed for information retrieval and incorporates natural language processing and machine learning.


It performs text mining and complex analytics on huge volumes of unstructured data;

Not available through a Web interface;

Vertical applications such as healthcare and decision support;

Watson triumphs in Jeopardy's man vs. machine challenge

http://www.computerworld.com/article/2513199/high-performance-computing/watson-triumphs-in-jeopardy-s-man-vs--machine-challenge.html


IBM's Watson Supercomputer May Soon Be The Best Doctor In The World

http://www.businessinsider.com/ibms-watson-may-soon-be-the-best-doctor-in-the-world-2014-4

Watson is already capable of storing far more medical information than doctors;

Its decisions are all evidence-based and free of cognitive biases;

It's also capable of understanding natural language, generating hypotheses, evaluating the strength of those hypotheses, and learning — not just storing data, but finding meaning in it.


It’s based on all available medical knowledge. Human doctors can’t possibly hold this much information in their heads, or keep up with it as it changes over time. Dr. Watson knows it all and never overlooks or forgets anything.

It’s accurate. If Dr. Watson is as good at medical questions as the current Watson is at game show questions, it will be an excellent diagnostician indeed.

It has very low marginal cost. It’ll be very expensive to build and train Dr. Watson, but once it’s up and running the cost of doing one more diagnosis with it is essentially zero.


It’s consistent. Given the same inputs, Dr. Watson will always output the same diagnosis. Inconsistency is a surprisingly large and common flaw among human medical professionals, even experienced ones. And Dr. Watson is always available and never annoyed, sick, nervous, hungover, upset, in the middle of a divorce, sleep-deprived, and so on.

It can be offered anywhere in the world. If a person has access to a computer or mobile phone, Dr. Watson is on call for them.

http://andrewmcafee.org/2011/03/mcafee-watson-ibm-healthcare-verghese/

https://www.youtube.com/watch?v=_Xcmh1LQB9I

https://www.youtube.com/watch?v=P18EdAKuC1U


http://tek.sapo.pt/noticias/computadores/artigo/robots_na_saude_ainda_estao_longe_de_substituir_os_m-49530btt.html


The Internet of Things (IoT) is a scenario in which objects, animals or people are provided with unique identifiers and the ability to automatically transfer data over a network without requiring human-to-human or human-to-computer interaction.


Sports

It’s now possible to get a basketball with over 200 built-in sensors that provide players and coaches with detailed feedback on performance.

In tennis, a system called SlamTracker can record a player’s performance, providing real-time statistics and comprehensive match analytics.


If you’ve ever watched rugby you may have wondered what the bump is between the players’ shoulder blades: it’s a GPS tracking system that allows the coaching staff to assess performance in real time.

The device will measure the players’ average speed, whether the player is performing above or below their normal levels, and heart rate, to identify potential problems before they occur.
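One simple way to decide whether a player is "above or below their normal levels" is to compare the current reading with that player's own history. The sketch below uses a z-score for this. It is an assumed approach for illustration, not the method used by any actual tracking device, and the sample speeds and the threshold of 2.0 are made up.

```python
from statistics import mean, stdev


def deviation_from_baseline(history, current):
    """How many standard deviations `current` lies from the player's own history."""
    if len(history) < 2:
        raise ValueError("need at least two past readings to estimate spread")
    spread = stdev(history)
    if spread == 0:
        raise ValueError("history has no variation, cannot scale the deviation")
    return (current - mean(history)) / spread


def flag_player(history, current, threshold=2.0):
    """Classify a reading as below normal, above normal, or normal."""
    z = deviation_from_baseline(history, current)
    if z <= -threshold:
        return "below normal"
    if z >= threshold:
        return "above normal"
    return "normal"


# Hypothetical average speeds (m/s) from earlier matches.
past_speeds = [6.0, 6.2, 5.8, 6.1, 5.9]
status = flag_player(past_speeds, 5.0)
print(status)
```

A reading far below the player's usual range could prompt the coaching staff to check for fatigue or injury before a problem develops.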


Self-driving cars

Computers in cars know where you go, when you go, how fast you go, how many times you stop along the way, whether you stay in your lane, what your average MPG is, how you like your temperature, how close you get before stepping on the brake, and tens of thousands of other facts, instantly.

The ethical dilemma of self-driving cars: https://www.youtube.com/watch?v=ixIoDYVfKA0


Analyzing all of this data rapidly allows a self-driving car to:

• Anticipate where you are going by looking at driving history
• Check road signs using sensors to know what the speed limit is or if a stop sign is approaching
• Alert and activate your braking and steering systems if pedestrians are in the street, you’re too close to the curb, you drift into another lane, or you doze off

By 2040, it is anticipated that people will not need to get driver’s licenses. Cars will be able to drop someone off and then go find a parking space.


Homes

There are smart thermostats that monitor the home and only heat the areas that are being used. The temperature of your home can be changed while you are still at work so that when you arrive on a winter’s evening the house is cosy.

Smart TVs use face recognition to make sure your children don’t ever watch anything unsuitable for their age.


Gadgets

Considering all the toys, gadgets and smart appliances, there are now more machines connected to the Internet than people. And all those smart things are gathering data and communicating with each other.

http://exameinformatica.sapo.pt/noticias/insolitos/2016-10-21-Esta-coleira-diz-lhe-se-o-cao-esta-feliz


Social Networks

Online dating site eHarmony matches people based on twenty-nine different variables such as personality traits, behaviours, beliefs, values and social skills.


Search Engines


Web Browsers


Electronic Devices


Movie Rental Sites


Apps

Restaurant reservations (Open Table)

Weather in L.A. in 3 days (Weather+)

Side effects of medications (MedWatcher)

3-star hotels in New Orleans (Priceline)

Which PC should I buy and where (PriceCheck)


From traffic patterns and music downloads to web history and medical records, data is recorded, stored, and analyzed to enable the technology and services that the world relies on every day. But what exactly is big data used for?


To send you catalogs for exactly the merchandise you typically purchase.

To suggest medications that precisely match your medical history.

To “push” television channels to your set instead of your “pulling” them in.

To send advertisements on those channels just for you!


To know what you need before you even know you need it, based on past purchasing habits!

To notify you of your expiring driver’s license or credit cards or last refill on a Rx, etc.

To give you turn-by-turn directions to a shelter in case of emergency.

To predict weather patterns to plan optimal wind turbine usage, and optimize capital expenditure on asset placement.


Make risk decisions based on real-time transactional data

Identify criminals and threats from disparate video, audio, and data feeds (recordedfuture.com)

Detect life-threatening conditions at hospitals in time to intervene

Multi-channel customer sentiment and experience analysis


According to IBM scientists, big data can be broken down into four dimensions: Volume, Velocity, Variety and Veracity.

Volume: 12+ terabytes of Tweets created daily.

Velocity: 5+ million trade events per second. Analysis of data to take decisions within seconds.

Variety: 100’s of different types of data (structured, unstructured, text, multimedia).

Veracity: Only 1 in 3 decision makers trust their information. Fact checking (https://poligrafo.sapo.pt/).

Please consider reading pages 9–15 of the book Harness the Power of Big Data: The IBM Big Data Platform.


Cost efficiently processing the growing Volume: 50x growth from 2010 to 2020, to 35 ZB

Responding to the increasing Velocity: 30 billion RFID sensors and counting

Collectively analyzing the broadening Variety: 80% of the world’s data is unstructured

Establishing the Veracity of big data sources: 1 in 3 business leaders don’t trust the information they use to make decisions


Volume


Many factors contribute to the increase in data volume:

• Transaction-based data stored through the years.
• Unstructured data streaming in from social media.
• Increasing amounts of sensor and machine-to-machine data being collected.

In the past, excessive data volume was a storage issue. But with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data.


Variety


Data today comes in all types of formats:

• Structured, numeric data in traditional databases.
• Information created from line-of-business applications.
• Unstructured text documents, email, video, audio, stock ticker data and financial transactions.

Managing, merging and governing different varieties of data is something many organizations still grapple with.
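As a small illustration of what "merging different varieties of data" can involve, the sketch below brings a CSV export, a JSON feed and a free-text note into one common record shape. The field names, the sample data and the "<name> paid <amount>" text pattern are all hypothetical.

```python
import csv
import io
import json
import re


def from_csv(text):
    """Structured source: rows with `customer` and `amount` columns."""
    return [
        {"source": "csv", "customer": row["customer"], "amount": float(row["amount"])}
        for row in csv.DictReader(io.StringIO(text))
    ]


def from_json(text):
    """Semi-structured source: objects with `user` and `total` keys."""
    return [
        {"source": "json", "customer": item["user"], "amount": float(item["total"])}
        for item in json.loads(text)
    ]


def from_free_text(text):
    """Unstructured source: look for the pattern '<name> paid <amount>'."""
    return [
        {"source": "text", "customer": name, "amount": float(amount)}
        for name, amount in re.findall(r"(\w+) paid (\d+(?:\.\d+)?)", text)
    ]


# Hypothetical sample inputs.
csv_text = "customer,amount\nana,10.5\n"
json_text = '[{"user": "rui", "total": 20}]'
note = "Yesterday eva paid 7.25 in cash."

records = from_csv(csv_text) + from_json(json_text) + from_free_text(note)
print(records)
```

Once every source is mapped to the same shape, the records can be stored and analyzed together. In practice the free-text step is by far the hardest and usually needs natural language processing rather than a single pattern.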


Velocity


Data is streaming in at unprecedented speed and must be dealt with in a timely manner.

RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time.

Reacting quickly enough to deal with data velocity is a challenge for most organizations.
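Dealing with data "in near-real time" usually means updating a result as each reading arrives instead of storing everything and analyzing it later. Below is a minimal sketch of a time-based sliding-window average. The class name is hypothetical, and it assumes readings arrive in timestamp order.

```python
from collections import deque


class SlidingWindowAverage:
    """Running average over the readings of the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        if window_seconds <= 0:
            raise ValueError("window_seconds must be positive")
        self.window_seconds = window_seconds
        self._readings = deque()  # (timestamp, value) pairs, oldest first
        self._total = 0.0

    def add(self, timestamp, value):
        """Record one reading and return the current window average."""
        self._readings.append((timestamp, value))
        self._total += value
        # Evict readings that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._readings[0][0] <= cutoff:
            _, old_value = self._readings.popleft()
            self._total -= old_value
        return self._total / len(self._readings)


# Hypothetical smart-meter readings: (seconds since start, value).
window = SlidingWindowAverage(window_seconds=10)
averages = [window.add(t, v) for t, v in [(0, 10), (5, 20), (12, 30)]]
print(averages)
```

Each reading is processed in constant amortized time and only the current window is kept in memory, which is what lets this style of computation keep up with a fast stream.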


Veracity

Page 65: Big Data? - IPTricardo/ficheiros/BD - Big Data.pdf · Big data must follow the same principles of data management: Data collection (sensors etc) Data storage (Oracle, SAP, IBM, EMC,

What is Information Retrieval?

Veracity

Big Data veracity refers to the biases, noise and abnormality in data.

Is the data that is being stored and mined meaningful to the problem being analyzed?

In scoping out your big data strategy, you need your team and partners to help keep your data clean, and you need processes that keep ‘dirty data’ from accumulating in your systems.
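One simple way to keep ‘dirty data’ out of a system is to validate records before they are stored. The sketch below is only an illustration: the field names (`sensor`, `temp_c`), the valid temperature range and the sample records are all assumptions made for the example.

```python
def clean(records, low=-50.0, high=60.0):
    """Keep records with a sensor id and a plausible numeric temperature."""
    kept = []
    for rec in records:
        temp = rec.get("temp_c")
        if not rec.get("sensor"):
            continue  # missing identifier
        if not isinstance(temp, (int, float)):
            continue  # missing or non-numeric reading
        if not low <= temp <= high:
            continue  # abnormal value outside the assumed valid range
        kept.append(rec)
    return kept


# Hypothetical raw readings: one valid, three dirty.
raw = [
    {"sensor": "a1", "temp_c": 21.5},
    {"sensor": "a2", "temp_c": None},
    {"sensor": "", "temp_c": 19.0},
    {"sensor": "a3", "temp_c": 999.0},
]
cleaned = clean(raw)
print(cleaned)  # [{'sensor': 'a1', 'temp_c': 21.5}]
```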

Value

Value is defined as the usefulness of data for an enterprise. The value characteristic is intuitively related to the veracity characteristic in that the higher the data fidelity, the more value it holds for the business.

Value is also dependent on how long data processing takes, because analytics results have a shelf-life.

The real value is not in the large volumes of data but in what we can now do with them.

It is not the amount of data that is making the difference but our ability to analyze vast and complex data sets beyond anything we could ever do before.

Innovations such as cloud computing, combined with improved network speed as well as creative techniques to analyze data, have resulted in a new ability to turn vast amounts of complex data into value.

What’s more, the analysis can now be performed without the need to purchase or build large supercomputers.

The longer it takes for data to be turned into meaningful information, the less value it has for a business.

Structured vs. Exploratory

Traditional Approach: Structured & Repeatable Analysis

• Business Users: determine what question to ask

• IT: structures the data to answer that question

• Examples: monthly sales reports, profitability analysis, customer surveys

Big Data Approach: Iterative & Exploratory Analysis

• IT: delivers a platform to enable creative discovery

• Business Users: explore what questions could be asked

• Examples: brand sentiment, product strategy, maximum asset utilization

The data processed by Big Data solutions can be human-generated or machine-generated.

Human-generated

Machine-generated

Human-generated and machine-generated data can come from a variety of sources and be represented in various formats or types. The primary types of data are:

• Structured Data

• Unstructured Data

• Semi-Structured Data

Structured Data

Structured data conforms to a data model or schema and is often stored in tabular form. It is used to capture relationships between different entities and is therefore most often stored in a relational database.

Examples of this type of data include banking transactions, invoices, and customer records.
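To make the idea of a schema and of relationships between entities concrete, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The table names, columns and sample rows are assumptions chosen to mirror the customer and invoice examples above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# The schema fixes the columns and types every row must follow.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE invoices ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL REFERENCES customers(id), "
    "amount REAL NOT NULL)"
)

# Hypothetical sample data.
conn.execute("INSERT INTO customers VALUES (1, 'Ana')")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [(1, 1, 120.0), (2, 1, 80.0)],
)

# The customer/invoice relationship is followed with a join.
rows = conn.execute(
    "SELECT c.name, SUM(i.amount) FROM customers c "
    "JOIN invoices i ON i.customer_id = c.id GROUP BY c.name"
).fetchall()
print(rows)  # [('Ana', 200.0)]
conn.close()
```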

Unstructured Data

Data that does not conform to a data model or data schema is known as unstructured data. It is estimated that unstructured data makes up 80% of the data within any given enterprise.

This form of data is either textual or binary and is often conveyed via files that are self-contained and non-relational. A text file may contain the contents of various tweets or blog postings. Binary files are often media files that contain image, audio or video data.

Basically, unstructured data is the data we can’t easily store and index in traditional formats or databases. It includes email conversations, social media posts, video content, photos, voice recordings, sounds, etc.

In most businesses there are already huge amounts of text or word-based data in the form of documents, reports, internal and external communication, customer communication, emails, websites, social media updates and blogs.

And while all those words are structured to make sense to a human being, they are unstructured from an analytics perspective, as they don’t fit neatly into a relational database or the rows and columns of a spreadsheet.

But they still present a huge opportunity if we can just figure out how to use them.
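A first step towards using free text is often to turn it into something countable. The sketch below is a deliberately naive illustration, not a full text-analytics pipeline: it splits a hypothetical customer email into lowercase words and counts them. The helper name `term_counts` and the sample text are assumptions.

```python
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text, split it into words, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical customer email.
email = "The delivery was late. Late again! Support was helpful."
counts = term_counts(email)
print(counts["late"], counts["helpful"])  # 2 1
```

Once text has been reduced to counts like these, it can be stored in rows and columns alongside structured data.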

What sets unstructured data apart from structured data is that its structure is unpredictable.

Some people believe that the term unstructured data is misleading because each text source may contain its own specific structure or formatting based on the software that created it. In fact, it is the content of the document that is really unstructured.

Semi-Structured Data

Semi-structured data has a defined level of structure and consistency, but is not relational in nature. This kind of data is commonly stored in files that contain text. Due to the textual nature of this data and its conformance to some level of structure, it is more easily processed than unstructured data.
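JSON is one familiar example of such a text format; the slides do not name a specific one, so its use here is an assumption. In the sketch below, the records share a recognizable shape but are not forced into fixed columns: the second record simply lacks the `tags` field. The field names and sample posts are hypothetical.

```python
import json

# Two hypothetical posts; only the first one carries a "tags" field.
doc = """
[
  {"user": "ana", "text": "Loving the new phone", "tags": ["phone"]},
  {"user": "rui", "text": "Battery could be better"}
]
"""

posts = json.loads(doc)

# Fields can be read by name, with a default where one is missing.
tag_lists = [post.get("tags", []) for post in posts]
print(tag_lists)  # [['phone'], []]
```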

These data types refer to the internal organization of data and are sometimes called data formats. Apart from these three fundamental data types, another important type of data in Big Data environments is metadata.

Metadata

Metadata provides information about a dataset’s characteristics and structure. This type of data is mostly machine-generated and can be appended to data.

The tracking of metadata is crucial to Big Data processing, storage and analysis because it provides information about the pedigree of the data and its provenance during processing. Examples of metadata include:

• XML tags providing the author and creation date of a document

• Attributes providing the file size and resolution of a digital photograph
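The first example above can be sketched with Python's standard `xml.etree.ElementTree` module. The XML layout, the tag names (`meta`, `author`, `created`) and the values are invented for illustration; real documents will use their own tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the <meta> block describes the content in <body>.
xml_doc = """
<document>
  <meta>
    <author>Jane Doe</author>
    <created>2017-03-01</created>
  </meta>
  <body>Quarterly report text goes here.</body>
</document>
""".strip()

root = ET.fromstring(xml_doc)

# Read the metadata without touching the document body.
metadata = {
    "author": root.findtext("meta/author"),
    "created": root.findtext("meta/created"),
}
print(metadata)  # {'author': 'Jane Doe', 'created': '2017-03-01'}
```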

https://www.youtube.com/watch?v=l-SVN3txo_4
