BIG DATA ANALYTICS OF CANCER BLOGS

Wullianallur Raghupathi and Viju Raghupathi

In this section we describe our ongoing prototype research project in the use of the Hadoop/MapReduce framework on the AWS for the analysis of unstructured cancer blog data (Raghupathi, 2010). Health organizations and individuals such as patients are using blog content for several pur-poses. Health and medical blogs are rich in unstructured data for insight and informed decision making. While current applications such as web crawlers and blog analysis are good at generating statistics about the num-ber of blogs, top 10 sites, etc., they are not advanced/useful or scalable computationally to help with analysis and extraction of insight. First, the blog data is growing exponentially (volume); second, they’re posted in real time and the analysis could become outdated very quickly (veloc-ity); and third, there is a variety of content in the blogs. Fourth, the blogs themselves are distributed and scattered all over the Internet. Therefore,

blogs in particular and social media in general are great candidates for the application of big data analytics. To reiterate, there has been an expo-nential increase in the number of blogs in the healthcare area, as patients find them useful in disease management and developing support groups.

Alternatively, healthcare providers such as physicians have started to use blogs to communicate and discuss medical information. Examples of use-ful information include alternative medicine and treatment, health condi-tion management, diagnosis–treatment informacondi-tion, and support group resources. This rapid proliferation in health- and medical-related blogs has resulted in huge amounts of unstructured yet potentially valuable information being available for analysis and use. Statistics indicate health-related bloggers are very consistent at posting to blogs. The analysis and interpretation of health-related blogs are not trivial tasks. Unlike many of the blogs in various corporate domains, health blogs are far more complex and unstructured. The postings reflect two important facets of the blog-gers and visitors: the individual patient care and disease management (fine granularity) to generalized medicine (e.g., public health).

Hadoop/MapReduce defines a framework for implementing systems for the analysis of unstructured data. In contrast to structured information, whose meaning is expressed by the structure or the format of the data, the meaning of unstructured information cannot be so inferred. Examples of data that carry unstructured information include natural language text and data from audio or video sources. More specifically, an audio stream has a well-defined syntax and semantics for rendering the stream on an audio device, but its music score is not directly represented. Hadoop/

MapReduce is sufficiently advanced and sophisticated computationally to aid in the analysis and understanding of the content of health-related blogs. At the individual level (document-level analysis) one can perform analysis and gain insight into the patient in longitudinal studies. At the group level (collection-level analysis) one can gain insight into the pat-terns of the groups (network behavior, e.g., assessing the influence within the social group), for example, in a particular disease group, the commu-nity of participants in an HMO or hospital setting, or even in the global community of patients (ethnic stratification). The results of these analyses can be generalized (Raghupathi, 2010). While the blogs enable the forma-tion of social networks of patients and providers, the uniqueness of the health/medical terminology comingled with the subjective vocabulary of the patient compounds the challenge of interpretation.

Discussing at a more general level, while blogs have emerged as contem-porary modes of communication within a social network context, hardly any research or insight exists in the content analysis of blogs. The blog world is characterized by a lack of particular rules on format, how to post, and the structure of the content itself. Questions arise: How do we make sense of the aggregate content? How does one interpret and generalize?

In health blogs in particular, what patterns of diagnosis, treatment, man-agement, and support might emerge from a meta-analysis of a large pool of blog postings? How can the content be classified? What natural clus-ters can be formed about the topics? What associations and correlations exist between key topics? The overall goal, then, is to enhance the qual-ity of health by reducing errors and assisting in clinical decision making.

Additionally, one can reduce the cost of healthcare delivery by the use of these types of advanced health information technology. Therefore, the objectives of our project include the following:

1. To use Hadoop/MapReduce to perform analytics on a set of cancer blog postings from http://www.thecancerblog.com

2. To develop a parsing algorithm and association, classification, and clustering technique for the analysis of cancer blogs

3. To develop a vocabulary and taxonomy of keywords (based on exist-ing medical nomenclature)

4. To build a prototype interface

5. To contribute to social media analysis in the semantic web by gener-alizing the models from cancer blogs

The following levels of development are envisaged: first level, patterns of symptoms, management (diagnosis/treatment); second level, glean insight into disease management at individual/group levels; and third level, clini-cal decision support (e.g., generalization of patterns, syntactic to seman-tic)—informed decision making. Typically, the unstructured information in blogs comprises the following:

Blog topic (posting): What issue or question does the blogger (and com-ments) discuss?

Disease and treatment (not limited to): What cancer type and treatment (and other issues) are identified and discussed?

Other information: What other related topics are discussed? What links are provided?

What Can We Learn from Blog Postings?

Unstructured information related to blog postings (bloggers), including responses/comments, can provide insight into diseases (cancer), treatment (e.g., alternative medicine, therapy), support links, etc.

1. What are the most common issues patients have (bloggers/responses)?

2. What are the cancer types (conditions) most discussed? Why?

3. What therapies and treatments are being discussed? What medical and nonmedical information is provided?

4. Which blogs and bloggers are doing a good job of providing relevant and correct information?

5. What are the major motivations for the postings (comments)? Is it classified by role, such as provider (physician) or patient?

6. What are the emerging trends in disease (symptoms), treatment, and therapy (e.g., alternative medicine), support systems, and informa-tion sources (links, clinical trials)?

What Are the Phases and Milestones?

This project envisions the use of Hadoop/MapReduce on the AWS to facilitate distributed processing and partitioning of the problem-solving process across the nodes for manageability. Additionally, supporting plug-ins are used to develop an application tool to analyze health-related blogs. The project is scoped to content analysis of the domain of cancer blogs at http://www.thecancerblog.com. Phase 1 involved the collection of blog postings from http://www.thecancerblog.com into a Derby applica-tion. Phase 2 consisted of the development and configuration of the archi-tecture—keywords, associations, correlations, clusters, and taxonomy.

Phase 3 entailed the analysis and integration of extracted information in the cancer blogs—preliminary results of initial analysis (e.g., patterns that are identified). Phase 4 involved the development of taxonomy. Phase 5 proposes to test the mining model and develop the user interface for deployment. We propose to develop a comprehensive text mining system that integrates several mining techniques, including association and clus-tering, to effectively organize the blog information and provide decision support in terms of search by keywords (Raghupathi, 2010).

CONCLUSIONS

Big data analytics is transforming the way companies are using sophisti-cated information technologies to gain insight from their data repositories to make informed decisions. This data-driven approach is unprecedented, as the data collected via the web and social media is escalating by the sec-ond. In the future we’ll see the rapid, widespread implementation and use of big data analytics across the organization and the industry. In the pro-cess, the several challenges highlighted above need to be addressed. As it becomes more mainstream, issues such as guaranteeing privacy, safe-guarding security, establishing standards and governance, and continually improving the tools and technologies would garner attention. Big data ana-lytics and applications are at a nascent stage of development, but the rapid advances in platforms and tools can accelerate their maturing process.

REFERENCES

Bollier, D. The Promise and Peril of Big Data. Washington, DC: The Aspen Institute, Communications and Society Program, 2010.

Borkar, V.R., Carey, M.J., and C. Li. Big Data Platforms: What’s Next? XRDS, 19(1), 44–49, 2012.

Burghard, C. Big Data and Analytics Key to Accountable Care Success. Framingham, MA: IDC Health Insights, 2012.

Connolly, S., and S. Wooledge. Harnessing the Value of Big Data Analytics. Teradata, January 2013.

Courtney, M. Puzzling Out Big Data. Engineering and Technology, January 2013, pp. 56–60.

Davenport, T.H., ed. Enterprise Analytics. Upper Saddle River, NJ: FT Press, Pearson Education, 2013.

Dembosky, A. Data Prescription for Better Healthcare. Financial Times, December 12, 2012, p. 19.

EIU. Big Data—Lessons from the Leaders. Economic Intelligence Unit, 2012.

Ferguson, M. Architecting a Big Data Platform for Analytics. Intelligent Business Strategies, October 2012.

Fernandes, L., O’Connor, M., and V. Weaver. Big Data, Bigger Outcomes. Journal of AHIMA, October 12, 2012, pp. 38–42.

Franks, B. Taming the Big Data Tidal Wave. New York: John Wiley & Sons, 2012.

Gobble, M. Big Data: The Next Big Thing in Innovation. Research-Technology Management, January–February, 2013, pp. 64–65.

HP. Information Optimization—Harness the Power of Big Data. December 2012.

IBM. IBM Big Data Platform for Healthcare. Solutions brief, June 2012.

jStart. How Big Data Analytics Reduced Medicaid Re-admissions. A jStart case study, 2012.

Kumar, A., Niu, F., and C. Re. Hazy: Making It Easier to Build and Maintain Big-Data Analytics. Communications of the ACM, 56(3), 40–49, 2013.

Manyika, J., Chui, M., Brown, B., Buhin, J., Dobbs, R., Roxburgh, C., and A.H. Byers. Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute, 2011.

McAfee, A., and E. Brynjolfsson. Big Data: The Management Revolution. Harvard Business Review, October 2012, pp. 61–68.

Mehta, A. Big Data: Powering the Next Industrial Revolution. Tableau, April 2011.

Mone, G. Beyond Hadoop. Communications of the ACM, 56(1), 22–24, 2013.

Nair, R., and A. Narayanan. Benefitting from Big Data Leveraging Unstructured Data Capabilities for Competitive Advantage. Boozandco, 2012.

Ohlhorst, F. Big Data Analytics: Turning Big Data into Big Money. New York: John Wiley &

Sons, 2012.

Oracle. Oracle Information Architecture: An Architect’s Guide to Big Data. 2012.

Raden, N. Big Data Analytics Architecture—Putting All Your Eggs in Three Baskets. Hired Brains, 2012.

Raghupathi, W. Data Mining in Health Care. In Healthcare Informatics: Improving Efficiency and Productivity, ed. S. Kudyba. Boca Raton, FL: Taylor and Francis, 2010, pp.

211–223.

Russom, P. Big Data Analytics. TDWI Best Practices Report, 2011.

SAP AG. Harnessing the Power of Big Data in Real-Time through In-Memory Technology and Analytics. In The Global Information Technology Report. World Economic Forum, 2012, pp. 89–96.

Sathi, A. Big Data Analytics. MC Press Online LLC, 2012.

Savage, N. Digging for Drug Facts. Communications of the ACM, 55(10), 11–13, 2012.

Srinivasan, N., and R. Nayar. Harnessing the Power of Big Data—Big Opportunity for Retailers to Win Customers. Infosys, 2012.

White, D., and N. Rowe. Go Big or Go Home? Maximizing the Value of Analytics and Big Data. Aberdeen Group, 2012.

Zikopoulos, P.C., deRoos, D., Parasuraman, K., Deutsch, T., Corrigan, D., and J. Giles.

Harness the Power of Big Data—The IBM Big Data Platform. New York: McGraw-Hill, 2013.

Zikopoulos, P.C., Eaton, C., deRoos, D., Deutsch, T., and G. Lapis. Understanding Big Data—

Analytics for Enterprise Class Hadoop and Streaming Data. New York: McGraw-Hill, 2012.

4 Data Mining Methods

Dans le document Big Data, Mining, and Analytics (Page 82-88)