
BIG DATA ANALYTICS METHODOLOGY


Wullianallur Raghupathi and Viju Raghupathi


While several different methodologies are being developed in this rapidly emerging discipline, a practical, hands-on methodology is outlined here.

Table 3.1 shows the main stages of such a methodology.

TABLE 3.1
Outline of Big Data Analytics Methodology

Stage 1: Concept design

• Establish need for big data analytics project

• Define problem statement

• Why is project important and significant?

Stage 2: Proposal

• Abstract—summarize proposal

• Introduction

• What is problem being addressed?

• Why is it important and interesting?

• Why big data analytics approach?

• Background material

• Problem domain discussion

• Prior projects and research

Stage 3: Methodology

• Hypothesis development

• Data sources and collection

• Variable selection (independent and dependent variables)

• ETL and data transformation

• Platform/tool selection

• Analytic techniques

• Expected results and conclusions

• Policy implications

• Scope and limitations

• Future research

• Implementation

• Develop conceptual architecture

− Show and describe components (e.g., Figure 3.1)

− Show and describe big data analytic platform/tools

• Execute steps in methodology

• Import data

• Perform big data analytics using various techniques and algorithms (e.g., word count, association, classification, clustering, etc.; a clustering sketch follows this table)

• Gain insight from outputs

• Draw conclusions

• Derive policy implications

• Make informed decisions

Stage 4

• Presentation and walkthrough

• Evaluation
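
Before the stages are walked through in detail, the sketch below illustrates the kind of analytic step listed under stage 3, here clustering with Spark MLlib. It is a minimal, hypothetical example: the input file (patients.csv), the feature columns, and the choice of k are placeholders rather than material from the chapter.

```python
# Illustrative sketch only: clustering, one of the techniques listed under stage 3,
# using Spark MLlib's KMeans. The input file (patients.csv) and the numeric
# columns used as features are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("clustering_sketch").getOrCreate()

# Import data (table stage 3: "Import data").
df = spark.read.csv("patients.csv", header=True, inferSchema=True)

# Assemble the chosen independent variables into a single feature vector.
assembler = VectorAssembler(inputCols=["age", "num_visits", "total_cost"],
                            outputCol="features")
features = assembler.transform(df.dropna())

# Fit a k-means model and attach a cluster label to every record.
model = KMeans(k=4, seed=1, featuresCol="features").fit(features)
clustered = model.transform(features)

# Inspect cluster sizes as a first, coarse insight.
clustered.groupBy("prediction").count().show()
```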

In stage 1, the interdisciplinary big data analytics team develops a concept design. This is a first cut at briefly establishing the need for such a project, since there are trade-offs in terms of cheaper options, risk, problem-solution alignment, etc. Additionally, a problem statement is followed by a description of the project's importance and significance. Once the concept design is approved in principle, one proceeds to stage 2, the proposal development stage. Here, more details are filled in. Taking the concept design as input, an abstract highlighting the overall methodology and implementation process is outlined. This is followed by an introduction to the big data analytics domain: What is the problem being addressed? Why is it important and interesting to the organization? It is also necessary to make the case for a big data analytics approach; since the complexity and cost are much higher than those of traditional analytics approaches, it is important to justify its use. The project team should also provide background information on the problem domain and on prior projects and research done in this domain.

Both the concept design and the proposal are evaluated in terms of the 4Cs:

Completeness: Is the concept design complete?

Correctness: Is the design technically sound? Is correct terminology used?

Consistency: Is the proposal cohesive, or does it appear choppy? Is there flow and continuity?

Communicability: Is the proposal formatted professionally? Does the report communicate the design in easily understood language?

Next, in stage 3, the steps in the methodology are fleshed out and implemented. The problem statement is broken down into a series of hypotheses. Please note these are not rigorous, as in the case of statistical approaches; rather, they are developed to help guide the big data analytics process. Simultaneously, the independent and dependent variables are identified. For the analytics itself, classifying the variables does not make a major difference, but it helps in identifying causal relationships or correlations. The data sources outlined in Figure 3.1 are identified; the data is collected (longitudinal data, if necessary), described, and transformed to make it ready for analytics. A very important step at this point is platform/tool evaluation and selection.
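
To make the ETL, transformation, and variable-selection steps concrete, the sketch below shows one way they might look on a Spark-based platform. It is illustrative only: the file name (claims.csv), the column names (age, num_visits, readmitted), and the derived variable are hypothetical and are not taken from the chapter.

```python
# Illustrative sketch only: minimal ETL and variable selection with PySpark.
# File name and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl_sketch").getOrCreate()

# Import the raw data (e.g., from HDFS, S3, or the local file system).
raw = spark.read.csv("claims.csv", header=True, inferSchema=True)

# Basic cleansing and transformation: drop rows with missing values and
# derive a new variable from an existing one.
clean = (raw.dropna()
            .withColumn("age_group", (F.col("age") / 10).cast("int")))

# Variable selection: independent variables (features) and the dependent variable.
independent_vars = ["age_group", "num_visits"]
dependent_var = "readmitted"
analytic_set = clean.select(independent_vars + [dependent_var])

# Describe the data before analysis begins.
analytic_set.describe().show()
```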

For example, several options, as indicated previously, such as AWS Hadoop, Cloudera, IBM BigInsights, etc., are available. A major criterion is whether the platform is available on a desktop or in the cloud. The next step is to apply the various big data analytics techniques to the data. These are not different from routine analytics techniques; they are only scaled up to large data sets.
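
As an example of such a technique, the sketch below shows a word count, one of the algorithms named in Table 3.1, expressed in PySpark rather than raw Hadoop MapReduce. It is a minimal illustration, assuming a Spark environment and a hypothetical input directory (documents/); it is not code from the chapter.

```python
# Illustrative sketch only: the classic word count over a (potentially large) corpus.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word_count_sketch").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile("documents/")                  # read the corpus
            .flatMap(lambda line: line.split())      # map: emit individual words
            .map(lambda word: (word.lower(), 1))     # map: (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))        # reduce: sum counts per word

# Inspect the 20 most frequent words.
for word, count in counts.takeOrdered(20, key=lambda kv: -kv[1]):
    print(word, count)
```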

Through a series of iterations and what-if analyses, insight is gained from the big data analytics. From the insight, informed decisions can be made and policy shaped. In the final steps, conclusions are offered, the scope and limitations are identified, and the policy implications are discussed. In stage 4, the project and its findings are presented to the stakeholders for action. Additionally, the big data analytics project is validated using the following criteria:

• Robustness of analyses, queries, reports, and visualization

• Variety of insight

• Substantiveness of research question

• Demonstration of big data analytics application

• Some degree of integration among components

• Sophistication and complexity of analysis

The implementation is a staged approach, with feedback loops built in at each stage to minimize the risk of failure. The users should be involved in the implementation. It is also an iterative process, especially in the analytics step, wherein the analyst performs what-if analyses. The next section briefly discusses some of the key challenges in big data analytics.

CHALLENGES

For one, a big data analytics platform must support, at a minimum, the key functions necessary for processing the data. The criteria for platform evaluation may include availability, continuity, ease of use, scalability, ability to manipulate at different levels of granularity, privacy and security enablement, and quality assurance (Bollier, 2010; Ohlhorst, 2012).

Additionally, while most currently available platforms are open source, the typical advantages and limitations of open-source platforms apply.

They have to be shrink-wrapped, made user-friendly, and made transparent for big data analytics to take off. Real-time big data analytics is a key requirement in many industries, such as retail, banking, healthcare, and others (SAP AG, 2012). The lag between when data is collected and when it is processed has to be addressed.
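
One way to shrink that lag, sketched below under the assumption of a Spark-based platform, is to process events as a continuous stream rather than in periodic batches. The socket source, host, and port are placeholders for illustration; a production pipeline would more likely read from a message log such as Kafka.

```python
# Illustrative sketch only: reducing the collection-to-processing lag with
# Spark Structured Streaming. The socket source and host/port are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("realtime_sketch").getOrCreate()

# Continuously read incoming events instead of waiting for a batch export.
events = (spark.readStream
               .format("socket")
               .option("host", "localhost")
               .option("port", 9999)
               .load())

# A running word count over the stream, updated as data arrives.
counts = (events.select(F.explode(F.split(F.col("value"), " ")).alias("word"))
                .groupBy("word")
                .count())

# Write results to the console as the stream progresses.
query = (counts.writeStream
               .outputMode("complete")
               .format("console")
               .start())
query.awaitTermination()
```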

The dynamic availability of the numerous analytics algorithms, models, and methods in a pull-down type of menu is also necessary for large-scale adoption. In-memory processing, such as in SAP's HANA, can be extended to the Hadoop/MapReduce framework. The various options of local processing (e.g., a network, desktop/laptop), cloud computing, software as a service (SaaS), and service-oriented architecture (SOA) web services delivery mechanisms have to be explored further. The key managerial issues of ownership, governance, and standards have to be addressed as well. Interleaved into these are the issues of continuous data acquisition and data cleansing. In the future, ontology and other design issues have to be discussed. Furthermore, an appliance-driven approach (e.g., access via mobile computing and wireless devices) has to be investigated. We next discuss big data analytics in a particular industry, namely, healthcare and the practice of medicine.
