Everything About Data

Data mining is the set of techniques and methodologies whose object is the extraction of useful information from large quantities of data through automatic or semi-automatic methods, and the scientific, corporate/industrial or operational use of that information.

Statistics, in turn, can be defined as “the extraction of useful information from datasets”.

The concept of data mining is similar, but with a substantial difference: statistics is used to derive general information about a population (e.g. unemployment or birth rates), while data mining is used to look for correlations among multiple variables concerning individual subjects; for example, knowing the average behavior of a telephone company’s customers, one can try to predict how much the average customer will spend in the near future.

In essence, data mining is “the analysis, from a mathematical point of view, performed on large databases”, typically preceded by other stages of data preparation, transformation and filtering, such as data cleaning. The term data mining became popular in the late 1990s as a shortened version of the definition above; today data mining has a twofold meaning:

  • extraction, with state-of-the-art analytical techniques, of implicit information hidden within already-structured data, so as to make it available and directly usable;
  • exploration and analysis, performed automatically or semi-automatically, of large quantities of data in order to discover significant patterns (regularities).

This type of activity is crucial in many areas of scientific research, but also in other sectors (for example, market research). In the professional world it is used to solve a variety of problems, ranging from customer relationship management (CRM) to fraud detection and website optimization.

Among the techniques most used in this field are the following (a minimal clustering sketch is shown after the list):

  • Clustering;
  • Neural networks;
  • Decision trees;
  • Association analysis (identification of products frequently purchased together).
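
As an illustration of the first technique, here is a minimal clustering sketch using scikit-learn’s KMeans on a handful of invented customer records (the features, values and choice of library are assumptions made for illustration only):

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented example data: each row is a customer described by
# [monthly spend, number of calls] (values are made up for illustration).
customers = np.array([
    [20.0, 50], [22.5, 55], [21.0, 48],    # low spenders
    [80.0, 200], [85.5, 210], [78.0, 190]  # high spenders
])

# Group the customers into two clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)

print(kmeans.labels_)           # cluster assigned to each customer
print(kmeans.cluster_centers_)  # the "average" customer of each cluster
```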

Another popular data mining technique is learning by classification. This learning scheme starts from a well-defined set of classification examples for known cases, from which it is expected to deduce a way to classify unknown examples. This approach is also called “supervised”, in the sense that the learning scheme operates under the supervision implicitly provided by the classification of the known cases; for this reason, these examples are also called training examples. Knowledge acquired through learning by classification can be represented as a decision tree.
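
A hedged sketch of learning by classification, using a scikit-learn decision tree on a few invented training examples (the features, labels and library choice are assumptions, not part of the original text):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented training examples: [age, monthly spend] -> class label.
X_train = [[25, 30.0], [40, 80.0], [35, 75.0], [22, 20.0]]
y_train = ["low-value", "high-value", "high-value", "low-value"]

# Supervised learning: the known, labeled examples "train" the model.
tree = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

# Classify an unknown case and inspect the learned decision tree.
print(tree.predict([[30, 70.0]]))
print(export_text(tree, feature_names=["age", "monthly_spend"]))
```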

The actual data extraction therefore comes at the end of a process involving several phases (a rough code sketch of the pipeline follows the list):

  • the sources of data are identified; 
  • a single set of aggregated data is created; 
  • pre-processing is carried out (data cleaning, exploratory analysis, selection, etc.);
  • the data is extracted with the chosen algorithm; 
  • patterns are interpreted and evaluated; 
  • finally, the patterns are turned into new, usable knowledge.
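
A rough sketch of this process in Python with pandas; the file names, columns and the very simple “pattern” (a correlation) are hypothetical and only meant to make the phases concrete:

```python
import pandas as pd

# 1. Identify the data sources and aggregate them into a single set
#    ("calls.csv" and "billing.csv" are hypothetical files).
calls = pd.read_csv("calls.csv")      # e.g. customer_id, n_calls
billing = pd.read_csv("billing.csv")  # e.g. customer_id, monthly_spend
data = calls.merge(billing, on="customer_id")

# 2. Pre-processing: data cleaning and selection.
data = data.dropna().drop_duplicates()

# 3. Extraction with a (deliberately simple) chosen method: look for a
#    pattern between call volume and spend.
pattern = data["n_calls"].corr(data["monthly_spend"])

# 4. Interpret and evaluate the pattern before turning it into knowledge.
print(f"correlation between calls and spend: {pattern:.2f}")
```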

Data management:

But what is a data management strategy, and how can it be implemented? What are the key elements that make data management truly effective? There is a steady stream of technological innovations that support companies in this delicate task, although, as we will see, technology alone is not enough to implement an effective data management system: processes, skills and governance capabilities are also needed. This is a fundamental commitment if a company wants to take full advantage of the growing amount of information it already holds and of all the data it collects over time, often in real time, which must be analyzed to understand market trends and the needs of company stakeholders, and thus to provide business decision-makers with the most accurate and, above all, useful information for increasing performance.


What is Big Data:

The term big data belongs both to the world of statistics and to that of information technology: it indicates a collection of data so large (characterized by high volume, but also by wide variety) that specific analytical methods and technologies are required to process it and to extract value and knowledge from it. In computer science, the meaning of big data extends to the ability to relate heterogeneous, structured and unstructured data, with the aim of discovering links and correlations between different phenomena and then making predictions.
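
As a hedged illustration of relating structured and unstructured data, the sketch below joins an invented sales table with a crude signal extracted from invented free-text reviews (the data, the word list and the pandas-based approach are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical structured data: units sold per product.
sales = pd.DataFrame({"product": ["A", "B"], "units_sold": [120, 40]})

# Hypothetical unstructured data: free-text customer reviews.
reviews = [
    {"product": "A", "text": "great phone, excellent battery"},
    {"product": "B", "text": "poor battery, disappointing screen"},
]

# Derive a crude structured signal from the unstructured text
# (counting negative words is only an illustration, not a real method).
negative_words = {"poor", "disappointing", "bad"}
signal = pd.DataFrame([
    {"product": r["product"],
     "negative_hits": sum(w in negative_words for w in r["text"].split())}
    for r in reviews
])

# Relate the two heterogeneous sources to look for a possible link.
print(sales.merge(signal, on="product"))
```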

Big data management cannot be approached as in the past, when priorities were ‘reduced’ to governing data at the IT level and to its use by a few ‘restricted’ users.

Data sources continue to evolve and grow: ‘waves’ of new data are generated not only by internal business applications but also by public resources (such as the web and social media), mobile platforms, data services and, increasingly, by things and sensors (IoT, the Internet of Things; according to the Internet of Things Observatory of the School of Management of the Milan Polytechnic, the adoption of IoT in sectors such as the Smart Home and Industrial IoT grew by 52% and 40% respectively in 2018, which means the data generated by devices in these areas will increase exponentially). “The Big Data Management strategy cannot fail to take these aspects into account, which are often linked to the volume, velocity and variety of big data, all in continuous growth and evolution. For companies it becomes essential, following a logic of continuous improvement, to identify new sources and incorporate them into data management platforms.”

In the era of big data, therefore, it is essential to be able to ‘capture’ and archive all potentially useful data, and since their usefulness often cannot be assessed a priori, having them all available becomes a challenge (data that may be irrelevant in the current business context, such as mobile GPS data, could turn out to be relevant to future business objectives). “Until a few years ago the effort and cost of capturing and maintaining all this data were excessive,” reads the Forrester report, “but today innovative and low-cost technologies such as Hadoop have made this approach possible”;

The goal of big data analysis is not to report on what has happened but to understand how the data can help make better decisions. This means changing the analysis model by adopting so-called ‘descriptive’, ‘predictive’ and ‘prescriptive’ approaches, that is, using big data analytics to generate ‘insights’, knowledge useful for decision-making processes (for example, anticipating customers’ needs by knowing their preferences and habits in real time; a minimal predictive sketch is given after these points). Succeeding in this goal requires new skills, starting with data scientists; it also means using artificial intelligence techniques, big data analytics technologies, machine learning algorithms, advanced visualization tools, data mining, pattern recognition, natural language processing and signal processing, and deploying the most advanced hardware technologies to create platforms that attempt to imitate the human brain: all this generates useful and ‘non-obvious’ information in support of the company’s competitiveness and profitability;

Finally, data must be released quickly and freely to all those who need it: it may seem obvious, but the history of IT shows how much the ‘silos’ approach also applies to data, which often resides in databases that are not shared and are difficult to integrate.
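
As a minimal, hypothetical sketch of the shift from descriptive to predictive analysis mentioned above, the following code fits a simple trend to an invented spending history and anticipates the next value (the data and the scikit-learn model are assumptions, not a prescribed method):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented history: a customer's past monthly spend.
months = np.array([[1], [2], [3], [4], [5], [6]])
spend = np.array([30.0, 32.0, 35.0, 37.0, 40.0, 43.0])

# Descriptive: report on what has happened so far.
print("average spend so far:", spend.mean())

# Predictive: fit a simple trend and anticipate next month's spend,
# the kind of insight a decision-maker could act on (prescriptive step).
model = LinearRegression().fit(months, spend)
print("expected spend next month:", model.predict([[7]])[0])
```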

Big Data Technologies:

Hadoop Ecosystem:  It is an open source framework for the distributed processing of large data sets. It has grown large enough to contain an entire ecosystem of related software, and many commercial big data solutions are based on Hadoop.
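
A hedged sketch of a classic job in the Hadoop ecosystem, written here with PySpark (Apache Spark is a common member of that ecosystem); the HDFS paths are placeholders and the word-count task is only illustrative:

```python
from pyspark.sql import SparkSession

# Start a Spark session (Spark is one popular member of the Hadoop ecosystem).
spark = SparkSession.builder.appName("log-word-count").getOrCreate()

# "hdfs:///logs/*.log" is a hypothetical HDFS path.
lines = spark.read.text("hdfs:///logs/*.log").rdd.map(lambda row: row[0])

# Classic distributed word count: map each word to 1, then reduce by key.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("hdfs:///logs/word_counts")  # hypothetical output path
spark.stop()
```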

NoSQL Databases: NoSQL databases store unstructured data and provide fast performance, offering flexibility in handling a wide variety of high-volume data types. Examples of NoSQL databases include MongoDB, Redis and Cassandra.
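
A minimal sketch of the schema flexibility of a document-oriented NoSQL database, using MongoDB through the pymongo driver; the connection string, database name and fields are invented for illustration:

```python
from pymongo import MongoClient

# Connect to a (hypothetical) local MongoDB instance.
client = MongoClient("mongodb://localhost:27017")
collection = client["shop"]["customers"]

# NoSQL flexibility: documents in one collection need not share a schema.
collection.insert_one({"name": "Ada", "monthly_spend": 42.0})
collection.insert_one({"name": "Bob", "tags": ["mobile", "prepaid"], "visits": 7})

# Query by any field, even those present in only some documents.
print(collection.find_one({"tags": "mobile"}))
```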

Blockchain: Blockchain is the distributed-ledger technology that underpins the Bitcoin currency. It is mainly used in payment-related functions and can speed up transactions, reduce fraud and increase financial security. Because it is highly secure, it is an excellent choice for Big Data applications in sensitive sectors.
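
A toy sketch, in plain Python, of the idea behind a blockchain: each block stores the hash of the previous one, so altering any block invalidates the rest of the chain (the block contents are invented and this is not how Bitcoin is actually implemented):

```python
import hashlib
import json

def block_hash(block):
    """Hash the block's contents (order made deterministic with sort_keys)."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

# A toy chain: each block stores the hash of the previous one.
genesis = {"index": 0, "data": "genesis", "prev_hash": None}
block1 = {"index": 1, "data": "payment: A -> B, 10", "prev_hash": block_hash(genesis)}
block2 = {"index": 2, "data": "payment: B -> C, 4", "prev_hash": block_hash(block1)}

# Verify the chain: recompute each hash and compare with the stored link.
chain = [genesis, block1, block2]
for prev, current in zip(chain, chain[1:]):
    assert current["prev_hash"] == block_hash(prev), "chain has been tampered with"
print("chain is consistent")
```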

Business cases:

An Open Source Approach to Log Analytics with Big Data (from the “In the Trenches with Big Data & Search” blog and video series). Searchtechnologies.com writes: “Companies had used logs for insight long before big data became the next big thing. But with the exponential growth of log files, managing and analyzing logs has become so daunting as to be almost impossible. How did we leverage open source big data to process over 600 GB per day for faster, more accurate and cheaper log analysis?”

Top Five High-Impact Use Cases for Big Data Analytics: “This eBook outlines these use cases and includes examples from real customers of how other organizations have used Datameer’s big data analytics solution to unlock the value of their data and deliver true commercial value.” (From datameer.com)


Cloud:

In computing, the term cloud computing indicates a paradigm for the provision of services, offered on demand by a provider to an end customer over the Internet, starting from a set of pre-existing resources that are configurable and available remotely in the form of a distributed architecture.

Using various types of processing units (CPUs) and fixed or removable memory and storage devices (RAM, internal or external hard disks, CDs/DVDs, USB keys, etc.), a computer is able to process, store and retrieve programs and data.

In the case of computers connected in a local (LAN) or geographical (WAN) network, processing, storage and retrieval can be extended to other remote computers and devices located on the network.

By taking advantage of cloud computing technology, users connected to a cloud provider can perform all these tasks, even through a simple internet browser.

The cloud computing model involves three distinct roles:

  • Service provider (cloud provider) – offers services (virtual servers, storage, complete applications such as cloud databases), generally according to a “pay-per-use” model;
  • Administrator customer – chooses and configures the services offered by the provider, generally adding value such as software applications;
  • End customer – uses the services configured by the administrator customer.

Although the term is rather vague and is used in different contexts with different meanings, three basic types of cloud computing services can be distinguished:

  • SaaS (Software as a Service) – the use of programs installed on a remote server, i.e. outside the physical computer or the local LAN, often through a web server. This acronym shares in part the philosophy of a term now in disuse, ASP (Application Service Provider).

Market Solutions: Microsoft Office 365, G Suite apps, Salesforce

  • DaaS (Data as a Service) – With this service, only the data is made available via the web; users can access it through any application as if it were resident on a local disk.

Market Solutions: Xignite, D&B Hoovers

  • HaaS (Hardware as a Service) – With this service, the user sends data that is processed by computers made available remotely and returned to the initial user.

Alongside these three main services, others may be offered:

  • PaaS (Platform as a Service) – Instead of one or more individual programs, a software platform, which may consist of various services, programs, libraries, etc., is executed remotely.

Market Solutions: Microsoft Azure, AWS Elastic Beanstalk

  • IaaS (Infrastructure as a Service) – In addition to remote virtual resources, hardware resources are also made available, such as servers, network capacity, storage systems, archiving and backup. The defining characteristic of IaaS is that resources are instantiated on demand, when a platform needs them (a minimal launch sketch follows the list).

Market Solutions: AWS, Microsoft Azure, Cisco Metacloud
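
As a hedged illustration of IaaS resources being instantiated on demand, the sketch below launches a virtual server on AWS with boto3; the image id is a placeholder, and credentials and region are assumed to be configured elsewhere:

```python
import boto3

# Credentials/region are assumed to be configured elsewhere
# (e.g. via environment variables or ~/.aws/credentials).
ec2 = boto3.resource("ec2", region_name="eu-west-1")

# IaaS in practice: a virtual server is instantiated on demand.
instances = ec2.create_instances(
    ImageId="ami-00000000000000000",  # placeholder image id, not a real AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)
print("launched:", instances[0].id)
```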

Cloud computing differs, however, from grid computing, which is a paradigm oriented towards distributed computation and, in general, requires that applications be designed in a specific way.

Business cases: 

  • Cloud-Based Analytics: A Business Case For CFOs – according to Digitalistmag.com: “The emerging technological advances resulting from today’s digital reality are penetrating all corporate fields with impressive speed, including financial operations. Cloud-based analytics is one of the contemporary innovative digital resources for financial operations that must be assimilated into the strategy of any competitive market operator.”
  • Creating the Cloud Business Case – discovering the fundamental commercial levers that AWS offers to its customers; working through a framework that helps identify the possible benefits of moving to the cloud; and outlining the steps necessary to create a cloud business case.
