Through the last few posts, we wrote a Python program that calculates the frequency of ages. Today we are going to adapt that code to a Big Data architecture. If our "Age" data is quite big, say over 10 or even 100 terabytes, a single machine would probably take a very long time to finish. In that case, we can consider a Big Data platform like the Hadoop filesystem. The Hadoop filesystem automatically distributes the huge data across multiple nodes, so it is much faster than a traditional single-node filesystem.
In the Hadoop filesystem, there is a unique programming model, called "MapReduce", for manipulating data.
There are many articles about Hadoop and MapReduce programming, so I won't explain them here; I will just focus on the Python programming on a Hadoop platform.
The demo scenario is as follows.
1). The first step
Get the individual "Age" data from the HTML web page using the urllib module and the regular expression module.
2). The second step
Once you have an "Age" data file, copy the file to the Hadoop filesystem so that the data is distributed across multiple nodes (see the command sketch after this list).
3). The third step
Write the mapper program.
4). The fourth step
Write the reducer program.
5). The fifth step
Run the job on the Hadoop filesystem using the streaming service (also shown in the sketch below).
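For reference, here is a rough sketch of the shell commands for steps 2 and 5. The file names (age.txt, mapper.py, reducer.py), the HDFS paths, and the location of the streaming jar are assumptions; they vary with your Hadoop version and installation.
====================
# Step 2 (sketch): copy the local "Age" data file into HDFS
# age.txt and the HDFS paths below are placeholder names
hdfs dfs -mkdir -p /user/demo/age
hdfs dfs -put age.txt /user/demo/age/

# Step 5 (sketch): run mapper.py and reducer.py as a streaming job;
# the streaming jar path depends on your Hadoop installation
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/demo/age/age.txt \
    -output /user/demo/age_out
====================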
We've already finished the first step through the last few posts, so we just need to write Python code for steps 3 and 4.
Let's take a look in more detail.
If you finished the first step, you have a data file which looks like this.
22
23
20
23
22
......
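In case you did not follow the earlier posts, here is a minimal sketch of what the first step might look like. The URL and the regular expression below are hypothetical placeholders; the earlier posts have the real versions.
====================
#!/usr/bin/env python
# Sketch of step 1: fetch a web page and pull out the age numbers.
# The URL and the pattern are hypothetical placeholders.
import re
import urllib.request

url = 'http://example.com/players.html'   # placeholder URL
html = urllib.request.urlopen(url).read().decode('utf-8')

# assume each age appears in an "age" table cell
ages = re.findall(r'<td class="age">(\d+)</td>', html)

with open('age.txt', 'w') as f:
    for age in ages:
        f.write(age + '\n')
====================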
The mapping step is a simple step which generates the data in a key/value format. The Python program just reads the "Age" data from standard input and prints each age as a key/value pair; in this case, the value is always 1.
====================
#!/usr/bin/env python
# mapper.py: emit each age read from stdin as an "age<TAB>1" pair
import sys

for line in sys.stdin:
    line = line.strip()
    for word in line.split():
        print('%s\t%s' % (word, 1))
====================
Key(Age) Value(Count)
22 1
23 1
20 1
23 1
22 1
After finishing the mapping step, all of the results are handed over to the reducer step. The reducer program calculates the frequency of each "Age" using a dictionary structure; the code and its desired result are shown below.
=============================
#!/usr/bin/env python
# reducer.py: sum up the counts for each age key emitted by the mapper
import sys

word_dic = {}
for line in sys.stdin:
    line = line.strip()
    current_word, current_cnt = line.split('\t', 1)
    # accumulate the count for this age (summing the mapper's values)
    word_dic[current_word] = word_dic.get(current_word, 0) + int(current_cnt)

for word, cnt in word_dic.items():
    print('%s\t%s' % (word, cnt))
=============================
Key(Age) Value(Count)
22 2
23 2
20 1
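Before running on the cluster, you can simulate the whole job locally with a shell pipe; the sort step stands in for Hadoop's shuffle phase. (age.txt is a placeholder name for the data file from the first step.)
====================
# local simulation of the streaming job
cat age.txt | python mapper.py | sort | python reducer.py
====================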
The real beauty of Big Data is performance. It is hard to feel it with small data, but when we have huge data that must be processed within a limited time, the Hadoop filesystem shows its worth. That is why the Hadoop filesystem is so popular these days.
This time I reviewed the programming concepts before moving on to the running step.