阿正的工作筆記本 (A-Zheng's Work Notebook): [pycon] PyCon 2016 day 3, a short summary from the shared notes

About Django & Twisted

Scaling Django Application

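A Django server handles only one request at a time per worker thread
concurrent requests = threads x pools
Higher scale means higher complexity
Is there a better way to accommodate more requests?
CPU-bound: mathematical computation, data processing, etc.
IO-bound: database requests, web requests, other network IO
e.g. talking to the database to fetch some information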

Asynchronous IO Programming
All events come from IO, so we simply wait for them to arrive
Twisted

selector function
There is a list holding a bunch of file descriptors (sockets, ...)
A selector loop can handle thousands of file descriptors (?)
Nothing locks; it gives control to the next event
No blocking means no threads
The best case for event-driven code is an I/O-bound program with low CPU usage
If the program is CPU-bound, pair it with a task queue
Putting tasks on the queue and removing them is cheap
so a task queue is relatively easy to scale out
To get more work done, just add more workers
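
A minimal sketch of the selector loop described above, using Python's standard `selectors` module rather than Twisted itself. The function name `run_selector_loop` and the use of `socket.socketpair()` to fake incoming traffic are assumptions made only for illustration.

```python
import selectors
import socket


def run_selector_loop(messages):
    """Send each message over its own socket pair and collect all of them in one loop."""
    selector = selectors.DefaultSelector()
    writers = []
    received = {}

    for index, message in enumerate(messages):
        reader, writer = socket.socketpair()
        reader.setblocking(False)
        # Register the read end; `data` carries our own label for this descriptor.
        selector.register(reader, selectors.EVENT_READ, data=index)
        writer.sendall(message)
        writers.append(writer)

    # One thread, one loop: wait until a descriptor is readable, handle it, move on.
    while len(received) < len(messages):
        for key, _events in selector.select(timeout=1.0):
            received[key.data] = key.fileobj.recv(1024)
            selector.unregister(key.fileobj)
            key.fileobj.close()

    for writer in writers:
        writer.close()
    selector.close()
    return received


results = run_selector_loop([b"request-a", b"request-b", b"request-c"])
```

No threads are started here; the single loop only touches descriptors that the selector reports as ready.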

Django Channels
A project for asynchronous Django

interface server > channel queue > many workers
interface server: only receives requests
workers: take requests off the queue and process them, then put the results back on the channel
Once a request has been processed by a worker,
the interface server picks up the response
and uses it to answer the right request
daphne: Daphne is an HTTP, HTTP2 and WebSocket protocol server for ASGI, developed to power Django Channels.
Daphne is written in Twisted.

The channel layer can be sharded (?)
Several channel queues can handle these requests together
Workers do not have to live inside the web server either
At small scale, the channel queue can also be kept in shared memory

how channels work
request
incoming HTTP request
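workers listen on these channel names (when there is work, it is sent to a worker)
```
http.response!c134x7y
```
This code is what lets the response be routed back to the right place
Workers themselves do not use asynchronous IO by default; you will not have such a thing as a blocked worker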
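
The request and reply flow can be sketched with plain in-process queues. This is not the Django Channels API: the `channels` dict, `interface_server`, `worker`, and the randomly generated reply-channel suffix are all hypothetical names used only to show how a per-request reply channel routes a response back.

```python
import queue
import threading
import uuid

# Hypothetical in-memory channel layer: channel name -> queue of messages.
channels = {"http.request": queue.Queue()}


def worker():
    """Take requests off the shared channel and answer on each request's reply channel."""
    while True:
        message = channels["http.request"].get()
        if message is None:  # sentinel: shut this worker down
            break
        body = "hello " + message["path"]
        channels[message["reply_channel"]].put(body)


def interface_server(path):
    """Accept a request, enqueue it, then wait for the response on a private channel."""
    reply_channel = "http.response!" + uuid.uuid4().hex[:7]
    channels[reply_channel] = queue.Queue()
    channels["http.request"].put({"path": path, "reply_channel": reply_channel})
    response = channels[reply_channel].get(timeout=5)
    del channels[reply_channel]
    return response


workers = [threading.Thread(target=worker) for _ in range(2)]
for thread in workers:
    thread.start()

responses = [interface_server(path) for path in ("/a", "/b", "/c")]

for _ in workers:
    channels["http.request"].put(None)
for thread in workers:
    thread.join()
```

Whichever worker picks up a request, the response lands on the reply channel named in that request, so the interface server never has to know which worker did the work.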

group
Workers can listen on specific channels; they don't need to listen on all of them
fan-out message

Channels is a bridge to asynchronous features
Django Channels documentation
May be released with Django 1.11/2.0

QA Time

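Q: Why JSON and not BSON...?
A: JSON is enough in the general case; where needed, you can plug in your own serialization (unofficial)
Q: The queue is FIFO. Can that be customized?
A: At the channel level it is FIFO only. If different tasks need different priorities, put the different kinds of tasks on different channels and handle them with separate workers
Q: in-memory queue?
A: It needs IPC, so it does not work across many machines (?)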

About Python's characteristics

Problems that are easy to run into but hard to understand at first:

Indentation
Argument passing
closure
Global Variable
Dead or Alive
Interface
List Related
Package
Quality
Inheritance

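Indentation
- Tabs and spaces are treated as different characters
- The interpreter keeps a stack recording the current indentation width; when indentation decreases, it pops the stack until the indentation matches the top of the stack
- Default values of arguments are evaluated when the def statement is executed
- They are evaluated only that one time, never again on later calls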
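
A short illustration of the default-value point. The function names `append_shared` and `append_fresh` are made up for this sketch.

```python
def append_shared(item, bucket=[]):
    """The default list is created once, when `def` runs, and reused by every call."""
    bucket.append(item)
    return bucket


def append_fresh(item, bucket=None):
    """Common workaround: default to None and build a new list inside the call."""
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first_shared = list(append_shared(1))
second_shared = list(append_shared(2))  # the same default list still holds 1
first_fresh = append_fresh(1)
second_fresh = append_fresh(2)
```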

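Closure
- Store functions in a list first, then print their results one by one -> the results come out wrong
- When a closure is created, Python only remembers the names of the enclosed variables and does not run the code inside the function (the name is just an entry in the symbol table)
- Variables captured by a closure are not garbage-collected right away
- Fixes:
  1. Add a parameter=value default argument to the enclosed function
  2. Use a class with __call__ instead
  3. Use functools.partial
- Python in its first step of running a function checks that the syntax is correct, and only then executes the body
  If a global and a local variable share a name, Python treats the name as local throughout the function, so reading the global before the local assignment fails
  Fix: remember to use global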
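
A sketch of the closure problem, the three fixes, and the global/local name clash, meant to be run as a script at module level. All names here (`late`, `Holder`, `identity`, `counter`, and so on) are invented for the example.

```python
from functools import partial

# Late binding: each lambda looks up `i` when called, after the loop has finished.
late = [lambda: i for i in range(3)]
late_results = [f() for f in late]

# Fix 1: bind the current value as a default argument.
by_default = [lambda i=i: i for i in range(3)]


# Fix 2: store the value on an object and make the object callable.
class Holder:
    def __init__(self, value):
        self.value = value

    def __call__(self):
        return self.value


by_class = [Holder(i) for i in range(3)]


# Fix 3: freeze the argument with functools.partial.
def identity(value):
    return value


by_partial = [partial(identity, i) for i in range(3)]

counter = 0


def broken_increment():
    # Assigning to `counter` makes it local for the whole body, so this read fails.
    counter = counter + 1
    return counter


def fixed_increment():
    global counter  # explicitly refer to the module-level name
    counter = counter + 1
    return counter


try:
    broken_increment()
    broken_error = None
except UnboundLocalError as error:
    broken_error = type(error).__name__

fixed_value = fixed_increment()
```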

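Dead or Alive
- For objects in a circular reference, __del__ is not run, because their reference counts all stay above 0
- If both objects implement their own __del__, Python does not dare to act (before Python 3.4 such cycles were left uncollected; since PEP 442 the collector can finalize them)
- Fixes
  1. If a circular reference is really necessary, use a weak reference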
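
A small sketch of the weak-reference fix with the standard `weakref` module. The `Node` class and its attributes are hypothetical; the point is only that the child-to-parent link does not count as a strong reference, so no cycle of strong references is formed.

```python
import gc
import weakref


class Node:
    def __init__(self, name):
        self.name = name
        self.children = []   # strong references: a parent keeps its children alive
        self._parent = None  # weak reference: a child does not keep its parent alive

    @property
    def parent(self):
        return self._parent() if self._parent is not None else None

    def add_child(self, child):
        self.children.append(child)
        child._parent = weakref.ref(self)


root = Node("root")
leaf = Node("leaf")
root.add_child(leaf)
parent_while_alive = leaf.parent.name

del root      # no strong reference to the parent remains
gc.collect()  # not needed with CPython's reference counting, harmless elsewhere
parent_after_delete = leaf.parent
```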
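Package
- Use virtualenv to isolate packages
- pip freeze > requirements.txt
- If you are not familiar with the compile process, or you do scientific computing, conda is recommended
- Prefer editing requirements.txt by hand; otherwise packages installed just to try them out get exported by pip freeze and may pollute the environments of co-workers on the same project
Quality
- flake8 can check that code follows PEP 8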

About Chinese lyrics analysis

slide: https://speakerdeck.com/daikeren/analyzing-chinese-lyrics-with-python

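Getting Chinese lyrics
Scraping the lyrics
Tools: Scrapy, MongoDB
Cleaning the lyrics
There are duplicate songs, odd characters, and so on
Extract the needed fields (e.g. composer, lyricist)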

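Starting the analysis
Use pandas + pymongo
Think of pandas as a programmable Excel: almost anything Excel can do, pandas can do too
matplotlib: a tool for plotting data as charts.
If the data set is small, the whole database can be loaded straight into pandas
df[df.lyricist=='林夕'] directly finds all songs whose lyricist is 林夕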

Word segmentation
Use jieba
- Supports Traditional Chinese, custom dictionaries, …

Counting word frequency
Use collections.Counter (Counter.most_common)
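
A minimal example of the counting step. The `tokens` list is made-up stand-in data for the output of a word segmenter (月亮 "moon", 愛 "love", 夜 "night"), not real lyrics from the talk.

```python
from collections import Counter

# Hypothetical segmenter output for a few lyric lines.
tokens = ["月亮", "愛", "月亮", "夜", "愛", "月亮"]

word_counts = Counter(tokens)
top_two = word_counts.most_common(2)  # list of (word, count), most frequent first
```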

Visualization (word cloud)
Use the wordcloud package to make a word cloud
- pip install wordcloud
- You need to supply a Chinese font file yourself

Other things there was no time to cover
jupyter
elasticsearch; it works even better together with Elasticsearch DSL

About time series in Python

What is Time series data?

Methods:
OLS, GLS, …
ARIMA + SVR, SdA
ARIMA, Autoregressive Integrated Moving Average Model
- http://wiki.mbalib.com/zh-tw/ARIMA模型
SVR, Support Vector Regression
- Related Paper
  - A Distributed PSO-ARIMA-SVR Hybrid System for Time Series Forecasting
  - http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6974534
SdA, Stacked Denoising Autoencoders
- Related Paper
  - Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion
  - http://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf

Scenario:
IoT devices report data, and a model more complicated than a linear model is needed.

ARIMA: linear part (big picture)

SVR: residual (details?)

ACF/PACF

AIC/BIC choose best model
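
As a sketch of how AIC and BIC rank candidate models, using the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L (lower is better). The candidate labels, log-likelihoods and parameter counts below are invented numbers, not results from the talk.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: penalizes each parameter by 2."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: penalty per parameter grows with ln(n)."""
    return n_params * math.log(n_samples) - 2 * log_likelihood


# Hypothetical candidates: (label, maximized log-likelihood, number of parameters).
candidates = [("ARIMA(1,1,0)", -120.0, 2), ("ARIMA(2,1,2)", -118.5, 5)]
n_samples = 100

best_by_aic = min(candidates, key=lambda c: aic(c[1], c[2]))[0]
best_by_bic = min(candidates, key=lambda c: bic(c[1], c[2], n_samples))[0]
```

Here the larger model fits slightly better, but not by enough to pay for its three extra parameters under either criterion.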

rpy2 package
decides the parameters automatically
- import pandas2ri, importr

use sklearn, SVR

use grid_search to tune the parameters
problem
- pattern modeling (normal part)
- exception modeling (abnormal part)
- cross validation: beware of the order, because it’s important for time series
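
To make the cross-validation warning concrete, here is a stdlib-only sketch of forward-chaining splits, where every training index comes before every test index. The helper `forward_chaining_splits` is an illustrative stand-in; scikit-learn's `TimeSeriesSplit` serves a similar purpose.

```python
def forward_chaining_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) with training data always before test data."""
    fold_size = n_samples // (n_splits + 1)
    for split in range(1, n_splits + 1):
        train_end = fold_size * split
        test_end = min(train_end + fold_size, n_samples)
        yield list(range(train_end)), list(range(train_end, test_end))


splits = list(forward_chaining_splits(n_samples=8, n_splits=3))
```

A shuffled k-fold split would let the model train on observations that come after the ones it is tested on, which leaks future information into the score.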

Part two

Predicting whether people will stay in the office (e.g. how it correlates with temperature)

Advantages of SdA
non-linear

Packages used
Keras - deep learning
TensorFlow - backend of Keras
IPython - visualization

pretraining: autoencoders
Steps:
- add noise
- autoencode
- train
- stack encoders

supervised learning
Dataset: UCI SML2010
Target: indoor temperature

About how to build a keyword wizard

Slides: http://www.slideshare.net/ssuser05afc89/how-to-build-an-keyword-wizard
Speaker: 施晨揚

What is a keyword?

A word or phrase that works as an indicator or identifier, and that also carries some specific meaning

Why do we need them?

Advertisement
Tags
Relations
Article summary

Word Relation Model

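Related-term search
- Model 1 (related words)
  沖繩 (Okinawa) -> 飯店 (hotel), 自由行 (independent travel), 推薦 (recommendation)
- Model 2 (synonyms)
  沖繩 (Okinawa) -> 琉球 (Ryukyu), 壺屋通 (Tsuboya-dori), …
Map words and articles into a multi-dimensional vector space
to see which articles, or which words, are related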

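One-Hot vs. Continuous Values
When the dimensionality is very high (a multi-dimensional space), it is hard to tell which words are similar

Word Representation - One-Hot Representation
The simplest method is the one-hot representation
First assign every word its own one-hot index
But this encoding cannot capture relations between words, so relations are hard to find
Word Representation - Context Vector
In the example, words are placed on both the X and Y axes to form a table,
and the probability that two words appear together is used to measure how related they are
e.g. 沖繩 (Okinawa) vs. 浮潛 (snorkeling) = 0.7, 沖繩 (Okinawa) vs. 餐廳 (restaurant) = 0.1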
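
A toy comparison of the two representations. The four "sentences" are made-up, already segmented data, and raw co-occurrence counts with cosine similarity stand in for the probabilities mentioned in the notes; none of this is the speaker's actual pipeline.

```python
import math
from collections import Counter
from itertools import combinations

# Made-up, already segmented sentences.
sentences = [
    ["okinawa", "snorkeling", "sea"],
    ["okinawa", "snorkeling", "hotel"],
    ["ryukyu", "snorkeling", "sea"],
    ["tokyo", "restaurant", "ramen"],
]
vocabulary = sorted({word for sentence in sentences for word in sentence})


def one_hot(word):
    """A vector with a single 1 at the word's own index."""
    return [1 if word == entry else 0 for entry in vocabulary]


# Count how often two words share a sentence.
cooccurrence = {word: Counter() for word in vocabulary}
for sentence in sentences:
    for left, right in combinations(set(sentence), 2):
        cooccurrence[left][right] += 1
        cooccurrence[right][left] += 1


def context_vector(word):
    """A vector of co-occurrence counts with every vocabulary word."""
    return [cooccurrence[word][entry] for entry in vocabulary]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


one_hot_similarity = cosine(one_hot("okinawa"), one_hot("ryukyu"))
context_similarity = cosine(context_vector("okinawa"), context_vector("ryukyu"))
unrelated_similarity = cosine(context_vector("okinawa"), context_vector("ramen"))
```

Any two different one-hot vectors are orthogonal, so their similarity is always 0, while the context vectors of words used in similar surroundings point in similar directions.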

Word Context Vector
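Talk about ramen -> 美味しい (oishii, "delicious")
Talk about 一蘭 (Ichiran) -> 喔依西捏 (a phonetic spelling of "oishii ne")
so the two words can be linked to each other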

Co-occurrence Matrix
If the vocabulary is large, n ~= 500k,
then space = n * n and time = n * n

Word2Vec
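word2vec = a two-layer neural network
Given "I want to go to Okinawa … diving", the model must be able to predict "diving" once it has seen the preceding words.
Candidate words include playing ball, diving, sleeping, …, washing one's face (there may be many labels),
and a neural network can be used to approximate this model.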
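
A sketch of how training examples for such a predictor might be prepared: each word becomes the target and its neighbours become the input, in the style of word2vec's CBOW variant (skip-gram reverses the direction). The helper `cbow_pairs` and the window size are assumptions for illustration, and no network is trained here.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for every position in the sentence."""
    pairs = []
    for position, target in enumerate(tokens):
        start = max(0, position - window)
        context = tokens[start:position] + tokens[position + 1:position + 1 + window]
        pairs.append((context, target))
    return pairs


tokens = ["I", "want", "to", "go", "to", "Okinawa", "to", "dive"]
pairs = cbow_pairs(tokens, window=2)
last_context, last_target = pairs[-1]
```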
Reference

Major Process Flow
1. Article Selection
2. Content Extraction
3. Word Cutting

Article Raw Data Preparation

Each article is a single line; it needs word segmentation, with the words in the article separated by spaces.
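
To show what the prepared line looks like, here is a deliberately simple segmenter based on greedy forward maximum matching. This is a swapped-in stand-in, not how jieba works; the function `forward_max_match` and the tiny dictionary (沖繩 Okinawa, 自由行 independent travel, 推薦 recommendation, 飯店 hotel) are assumptions for the example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """At each position take the longest dictionary word, else a single character."""
    tokens = []
    position = 0
    while position < len(text):
        for length in range(min(max_len, len(text) - position), 0, -1):
            candidate = text[position:position + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                position += length
                break
    return tokens


dictionary = {"沖繩", "自由行", "推薦", "飯店"}
line = " ".join(forward_max_match("沖繩自由行推薦飯店", dictionary))
```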

Term Database
- Collecting a term dictionary
- Search Log
- http://baseterm.com/
- Input method dictionaries
  - Cracking the dictionary format
- Major e-commerce sites (e.g. Alibaba)

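Term Database - Search Log
Google Search Console → search history → Filter & Counting → Term Collection

Search Log

A term dictionary can be built from the search log simply by counting;
once a query has accumulated enough occurrences, we know that 太陽的後裔 (Descendants of the Sun) is a new term.
Odd terms can slip in, though, so the length has to be limited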
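
The counting idea can be sketched with `collections.Counter`. The thresholds `min_count` and `max_length`, the helper name `collect_terms`, and the sample log are all invented for illustration.

```python
from collections import Counter


def collect_terms(queries, min_count=3, max_length=6):
    """Keep queries that are frequent enough and short enough to be a single term."""
    counts = Counter(query.strip() for query in queries if query.strip())
    return sorted(
        term
        for term, count in counts.items()
        if count >= min_count and len(term) <= max_length
    )


# Made-up search log: a new drama title, a place name, a long question, a rare query.
search_log = (
    ["太陽的後裔"] * 4
    + ["沖繩"] * 3
    + ["怎麼去沖繩最便宜的方法"] * 5
    + ["拉麵"]
)
terms = collect_terms(search_log)
```

The long question is frequent but exceeds the length limit, and the rare query falls below the count threshold, so only the two short, frequent queries survive.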

Term Database - Word Cutting
- Word Cut Tool
  - Jieba
- Get Bot Token

About First try for CAS, SymPy with codegen

Slides: https://speakerdeck.com/wdv4758h/first-try-for-cas-sympy-with-codegen
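SymPy helps with mathematical computation
Create symbols and expressions
simplify (?): pass in an expression directly and get the result
expand: expands an expression
solve: solves equations
lambdify: generates callable code for numeric evaluation
It can target backends in various languages, such as Fortran, NumPy…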

About handling geo data with Python

what drives me
Visualizations
UX
Engineering: wanting to make everything simple and good-looking

Projects working on
CO2 visualization
data.worldbank.org
Flood Risk

Geo Data Format

shapefiles (GIS format)
- dbf: attribute table for the shapes
- prj: coordinate system
- shp: main entry point (the geometry)
- shx: index file

Shape formats for WEB
- geojson (simple, standard json) - https://github.com/geojson/draft-geojson
- topojson (more compact, border sharing) - https://github.com/mbostock/topojson

Tools
- QGIS (desktop app)
- Geojson.io (web app)
- mapshaper.org (feature simplification)
- mapbox.com (basemap creation)
- js:
  - leaflet.js
  - mapbox.js (proprietary, speaker has good user experience)
  - d3.js (customizable, low-level APIs) - see http://www.taiwanstat.com/

simple approach
1. shapefiles and api
2. data processor
3. geojson / json
4. webapp
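
Steps 2 and 3 of this approach can be sketched with the standard `json` module: a data processor turns plain records into a GeoJSON FeatureCollection for the web app to load. The record fields, the `to_feature_collection` helper and the scores are invented for the example.

```python
import json


def to_feature_collection(records):
    """Convert plain records into a GeoJSON FeatureCollection of Point features."""
    features = []
    for record in records:
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON positions are [longitude, latitude], in that order.
                "coordinates": [record["lon"], record["lat"]],
            },
            "properties": {"name": record["name"], "score": record["score"]},
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical processed data; the scores are made up.
records = [
    {"name": "Taipei", "lon": 121.5654, "lat": 25.0330, "score": 72},
    {"name": "Naha", "lon": 127.6792, "lat": 26.2124, "score": 35},
]
geojson_text = json.dumps(to_feature_collection(records), ensure_ascii=False)
```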

Frontend
- load data
- basemap
- style the features
- create the ranges

Common pitfalls
- Data encoding
- Coordinate systems
- Check the mappings

Optimizing for web
- file size is critical
- use topojson to save space
- simplify the features with mapshaper

Optimizing choropleth
- play with border styling
- make it interactive
- try different color schemes

How to get the data?
County open data or a JSON API

Resources
- Formats: Shapefiles, GeoJSON, TopoJSON
- Python packages: pyshp, geojson, topojson
- Sites: the ones introduced above
- Tools: QGIS
- Wiredcraft blog

Content on UX
- interactivity
- colors + styling
- usability with mobile devices
- talk to the users

example
- Continues the simple approach above
- Adapted from a mapbox tutorial (?)
- mapbox reads the geojson generated by Python
- getColor picks a color based on the score
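
The getColor idea, written as a Python sketch rather than the front-end JavaScript used in the talk. The thresholds and hex colors are assumptions chosen for illustration, not the speaker's values.

```python
# Hypothetical (lower bound, color) pairs, from the highest range to the lowest.
DEFAULT_THRESHOLDS = (
    (80, "#800026"),
    (60, "#E31A1C"),
    (40, "#FD8D3C"),
    (20, "#FED976"),
)


def get_color(score, thresholds=DEFAULT_THRESHOLDS, fallback="#FFEDA0"):
    """Return the color of the first range whose lower bound the score exceeds."""
    for lower_bound, color in thresholds:
        if score > lower_bound:
            return color
    return fallback


colors = [get_color(score) for score in (90, 50, 5)]
```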

Monday, June 6, 2016
