Category Archives: Python


This is just an idea that popped up when I had trouble sleeping last night. Basically, we can use Thrift to define the data and services going in and out of a Lambda service instead of a plain REST API.

Below is the code for the server. To make it easier to deploy, I am using the Serverless Framework.

This still lacks any authentication process, so I don’t think it should be used in production until I figure that out.
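
As a rough illustration of the idea (this is not the code from the repository), a Lambda handler behind an API Gateway proxy integration receives a binary request body as base64 text, so the Thrift bytes have to be decoded before processing and the reply encoded again. The sketch below assumes that event shape; `process_thrift_bytes` is a hypothetical placeholder for the step where a real Thrift processor would run, and the `application/x-thrift` content type is likewise an assumption.

```python
import base64


def process_thrift_bytes(request_bytes):
    # Hypothetical placeholder: a real server would hand these bytes to a
    # Thrift processor and return the serialized reply. Here we just echo.
    return request_bytes


def handler(event, context):
    # API Gateway proxy events carry binary bodies as base64 text.
    body = event.get("body") or ""
    if event.get("isBase64Encoded"):
        request_bytes = base64.b64decode(body)
    else:
        request_bytes = body.encode("utf-8")

    response_bytes = process_thrift_bytes(request_bytes)

    # Return the binary reply as base64 so it survives the JSON response.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/x-thrift"},
        "isBase64Encoded": True,
        "body": base64.b64encode(response_bytes).decode("ascii"),
    }
```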

The full source code is hosted here: https://github.com/petrabarus/aws-lambda-thrift-example

I imported a CSV file into pandas like this:

>>> import pandas
>>> df = pandas.read_csv('file.csv',names=['count', 'province', 'city', 'district', 'region', 'area'])
>>> df.head()
   count province         city           district region area
0   7923     Aceh   Aceh Barat   Arongan Lambalek            
1    628     Aceh   Aceh Barat     Johan Pahlawan            
2    235     Aceh   Aceh Barat        Woyla Timur            
4   3900     Aceh   Banda Aceh                         

Using SQL, I can do something like this:

    SELECT SUM(count) AS sum, district 
        FROM table WHERE city = 'Aceh Barat' 
        GROUP BY district 
        ORDER BY sum DESC

With the pandas library in Python, I can achieve the same result like this:

>>> import pandas, numpy
>>> df = pandas.read_csv('file.csv',names=['count', 'province', 'city', 'district', 'region', 'area'])
>>> df[df['city'] == 'Aceh Barat'].groupby('district')[['count']].sum().sort_values('count', ascending=False)
                  count
district               
Arongan Lambalek   7923
Johan Pahlawan      628
Woyla Timur         235

>>> df[df['city'] == 'Medan']
        count        province   city            district region area
10340  108769  Sumatera Utara  Medan                 NaN    NaN  NaN
10341     759  Sumatera Utara  Medan        Medan Amplas    NaN  NaN
10342     579  Sumatera Utara  Medan        Medan Amplas    NaN  NaN
10343    1272  Sumatera Utara  Medan        Medan Amplas    NaN  NaN
10344     769  Sumatera Utara  Medan        Medan Amplas    NaN  NaN
10345     379  Sumatera Utara  Medan        Medan Amplas    NaN  NaN
10346     988  Sumatera Utara  Medan        Medan Amplas    NaN  NaN
10347    4395  Sumatera Utara  Medan          Medan Area    NaN  NaN
10348    5598  Sumatera Utara  Medan         Medan Barat    NaN  NaN

>>> df[df['city'] == 'Medan'].groupby('district')[['count']].sum().sort_values('count', ascending=False)
                    count
district                 
Medan Tuntungan      7425
Medan Tembung        6349
Medan Barat          5598
Medan Timur          5378
Medan Amplas         4746
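
The mapping from the SQL query to pandas can also be tried without the CSV file. The sketch below uses a small hand-made DataFrame (the rows are hypothetical, only the column names follow the data above) and marks which pandas call plays the role of each SQL clause.

```python
import pandas as pd

# Hypothetical sample rows; the last one mimics a city-level total
# that has no district, like the first Medan row above.
df = pd.DataFrame({
    "count": [7923, 628, 235, 100, 8886],
    "city": ["Aceh Barat"] * 5,
    "district": ["Arongan Lambalek", "Johan Pahlawan", "Woyla Timur",
                 "Johan Pahlawan", None],
})

result = (
    df[df["city"] == "Aceh Barat"]      # WHERE city = 'Aceh Barat'
    .groupby("district")["count"]       # GROUP BY district
    .sum()                              # SUM(count)
    .sort_values(ascending=False)       # ORDER BY sum DESC
)
print(result)
```

By default, groupby leaves out rows whose grouping key is missing, so the row without a district does not show up in the result.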

One thing I like about Python is its REPL and the large number of data processing features. Together they make data processing more interactive and efficient.

Of all the data I process, one of the most frequent kinds is Apache logs. Before Apache log data can be processed, it first has to be parsed, and that can be done with the apachelog library.

$ sudo pip install apachelog

Then in the REPL:

$ python
>>> import apachelog
>>> p = apachelog.parser(apachelog.formats['extended'])
>>> for line in open('file.log'):
...    data = p.parse(line)
...    print(data)

After the code above, the data variable can be used directly for further processing.
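
As a small sketch of what that processing might look like, the example below counts requests per status code and sums the response sizes. It assumes each parsed record is a dict keyed by the log format directives (such as '%>s' for the status and '%b' for the size) with string values; the sample records are hypothetical and written by hand so the snippet runs without a log file.

```python
from collections import Counter

# Hypothetical records, shaped like the dicts assumed to come from p.parse(line).
records = [
    {'%h': '10.0.0.1', '%>s': '200', '%b': '512'},
    {'%h': '10.0.0.2', '%>s': '404', '%b': '-'},
    {'%h': '10.0.0.1', '%>s': '200', '%b': '2048'},
]


def count_by_status(parsed_records):
    """Count requests per HTTP status code."""
    return Counter(record['%>s'] for record in parsed_records)


def total_bytes(parsed_records):
    """Sum response sizes, skipping '-' entries (no body sent)."""
    return sum(int(r['%b']) for r in parsed_records if r['%b'] != '-')


status_counts = count_by_status(records)
print(status_counts.most_common())
print(total_bytes(records))
```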

UPDATE 2014-02-20:

There was a typo in this part:

apachelog.format['extended']

It should be:

apachelog.formats['extended']