Sunday, 28 September 2014

A simple data classification-part 4/4 (Using Python in a Hadoop filesystem)

Today I am going to run Python code on the Hadoop filesystem.
(The Hadoop version on my computer is 1.2.1.)

We already built the code structure in the last post, so this time we need to know how to execute Python code in the Hadoop filesystem.

By "Hadoop filesystem" I mean precisely the Hadoop Distributed File System (HDFS), which means the data is distributed across multiple nodes. HDFS takes care of distribution, security and fault tolerance.

In order to execute a Python program in the Hadoop environment, you have to use the Hadoop streaming API.
Check out the following web page:
"http://hadoop.apache.org/docs/r1.1.2/streaming.html#Hadoop+Streaming"
As the document explains, any executable that reads its input from stdin and writes its output to stdout can serve as the mapper or reducer. In our case, we use a Python mapper and reducer.
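
To make the stdin/stdout contract concrete, here is a minimal sketch of what such a mapper and reducer pair might look like. It is not the code from part 3/4: it assumes each input line holds a single age value, and the function names map_lines and reduce_lines are made up for illustration. Hadoop streaming sorts the mapper output by key before the reducer sees it, which the sketch imitates locally with sorted().

```python
from itertools import groupby


def map_lines(lines):
    """Mapper side: emit one 'key<TAB>1' line per non-empty input line."""
    for line in lines:
        age = line.strip()
        if age:  # skip blank lines
            yield "%s\t1" % age


def reduce_lines(sorted_lines):
    """Reducer side: sum the counts of consecutive lines sharing a key.

    The input must already be sorted by key, which is what the reducer
    receives from Hadoop streaming.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for age, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield "%s\t%d" % (age, total)


if __name__ == "__main__":
    # Local dry run without Hadoop: map, sort (stands in for the shuffle), reduce.
    # The sample values are invented, not taken from NBA_Age_file.txt.
    sample = ["25\n", "31\n", "25\n", "\n", "28\n", "25\n"]
    for out in reduce_lines(sorted(map_lines(sample))):
        print(out)
```

In a real streaming job the two functions would live in separate scripts, each reading from sys.stdin and printing to stdout. Keeping them as plain functions here just makes the logic easy to try out before involving Hadoop at all.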

Now we are ready to execute the Python code.

First, start up the Hadoop file system.
In my case I have a single Hadoop node, because I installed Hadoop in a virtual machine on my PC.
   [ Hadoop Administration console ]


Second, copy the Age data into the Hadoop filesystem. In my case, I make a new directory "python" and copy the local file into it. As soon as the text file is loaded into the Hadoop filesystem, it is distributed automatically.

[hadoop06:58:20@NBAAge]$ls -lrt NBA_Age_file.txt
-rw-rw-r-- 1 hadoop hadoop 1458 Sep 29 06:42 NBA_Age_file.txt
[hadoop06:57:44@NBAAge]$hadoop fs -mkdir /python
[hadoop06:58:16@NBAAge]$hadoop fs -copyFromLocal NBA_Age_file.txt /python/

   [ Location of Age file in Hadoop file system ]



Third, execute the Python map/reduce program using the streaming API. In my case, I specify two Python programs as the mapper and reducer, and I add the -file option for each because the Python programs are located in the local file system and have to be shipped with the job. At the end of the command I specify the input file and the location for the result.

* If you want to check the mapper and reducer source, see part 3/4.

[hadoop07:25:26@NBAAge]$hadoop jar /home/hadoop/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar \
    -file /home/hadoop/python2.7/MAPREDUCE/NBAAge/mapper.py \
    -mapper /home/hadoop/python2.7/MAPREDUCE/NBAAge/mapper.py \
    -file /home/hadoop/python2.7/MAPREDUCE/NBAAge/reducer.py \
    -reducer /home/hadoop/python2.7/MAPREDUCE/NBAAge/reducer.py \
    -input /python/NBA_Age_file.txt \
    -output /python/NBAAge
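
If you would rather launch the job from a script, the same command can be assembled as an argument list. This is only a sketch: build_streaming_command is a hypothetical helper name, and it uses nothing beyond the options that appear in the command above.

```python
def build_streaming_command(jar, mapper, reducer, hdfs_input, hdfs_output):
    """Return the argument list for a Hadoop streaming job.

    Only the options used in this post are included:
    -file, -mapper, -reducer, -input and -output.
    """
    return [
        "hadoop", "jar", jar,
        "-file", mapper, "-mapper", mapper,
        "-file", reducer, "-reducer", reducer,
        "-input", hdfs_input,
        "-output", hdfs_output,
    ]


if __name__ == "__main__":
    base = "/home/hadoop/python2.7/MAPREDUCE/NBAAge"
    cmd = build_streaming_command(
        jar="/home/hadoop/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar",
        mapper=base + "/mapper.py",
        reducer=base + "/reducer.py",
        hdfs_input="/python/NBA_Age_file.txt",
        hdfs_output="/python/NBAAge",
    )
    print(" ".join(cmd))
    # To actually submit the job (requires a running Hadoop installation):
    # import subprocess; subprocess.check_call(cmd)
```

Passing the arguments as a list rather than one long string avoids shell quoting problems if a path ever contains spaces.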

[ Map-Reduce execution log ]


Finally, once the map-reduce job succeeds, you can check the result using either the administration console or a Hadoop command.
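
For the command-line route, the job output can be printed with "hadoop fs -cat". The sketch below is an assumption-laden example rather than part of the original setup: it assumes the reducer wrote tab-separated "age<TAB>count" lines and that the result sits in part files under the output directory (the usual part-* naming, so check your own directory listing). parse_counts and read_job_output are hypothetical helper names.

```python
import subprocess


def parse_counts(text):
    """Turn 'key<TAB>count' lines into a dict of {key: int(count)}."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        key, value = line.split("\t", 1)
        counts[key] = int(value)
    return counts


def read_job_output(hdfs_dir):
    """Fetch the job output via 'hadoop fs -cat' (needs a running Hadoop).

    Assumes the output files are named part-* inside hdfs_dir.
    """
    raw = subprocess.check_output(["hadoop", "fs", "-cat", hdfs_dir + "/part-*"])
    return parse_counts(raw.decode("utf-8"))


if __name__ == "__main__":
    # Parsing demo on made-up data; no Hadoop required.
    print(parse_counts("25\t3\n28\t1\n"))
    # With Hadoop running: read_job_output("/python/NBAAge")
```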

Let's open the console, browse to the target directory we specified at execution, and check the result.
   [ Directory of result file ]

   [ Final result of map/reduce]


As you can see above, the frequency of each "Age" has been calculated by the Python map-reduce program. It's fun.


