1st ChinaSys Workshop
- Date: November 23, 2011
- Location: Huangxuan Hotel (皇轩酒店), Shenzhen, China
- Organizing Chair: Wenguang Chen (Department of Computer Science, Tsinghua University)
Workshop Program
8:20 – 8:30 Opening by Wenguang Chen
8:30 – 9:50 Who’s who session
- Chair: ZZ
- Fudan/SJTU: Haibo Chen
- HUST: Xiaofei Liao
- ICT: Yinhe Han
- MSRA: Lidong Zhou
- PKU: Xiaolin Wang
- Tsinghua: Wenguang Chen
- USTC: Yu Zhang
Each participating institute gives a 10-15 minute introduction to its group's research interests and members, and in particular introduces the people attending the workshop.
- 09:50 – 10:10 Tea Break 1
- 10:10 – 12:10 Work in Progress Session (Chair: Wenguang Chen)
- 10:10 – 10:40 Scalable Deterministic Replay in a Parallel Full-system Emulator, Yufei Chen (陈宇飞), Fudan
- 10:40 – 11:10 Improve GPU Virtualization with QoS Feedback for Cloud Gaming, Miao Yu (于淼), Fudan
- 11:10 – 11:40 Adaptive Runtime systems for CPU/GPU, Xiaofei Liao, HUST
- 11:40 – 12:10 Low Power Architecture, Yinhe Han, ICT
- 12:10 – 14:00 Lunch
- 14:00 – 16:00 Work in Progress Session (Chair: Yinhe Han)
- 14:00 – 14:30 Static Analysis and Optimization for Data-Parallel Programs, Hucheng Zhou, MSRA
- 14:30 – 15:00 Large-scale Fault-tolerant Stream Processing in the Cloud, Zhengping Qian, MSRA
- 15:00 – 15:30 A cloud database interface, Wentao Han, Tsinghua
- 15:30 – 16:00 Centralized Run Queue based Fair Scheduling on Composable Multicore Architectures, Tao Sun, USTC
- 16:00 – 16:30 Tea Break
- 16:30 – 17:30 Free Discussion (Chair: Haibo Chen)
Attendee List
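- Fudan/SJTU: Faculty: Binyu Zang (臧斌宇), Haibo Chen (陈海波); Students: Miao Yu (于淼; advisor: Zhengwei Qi, 戚正伟), Yufei Chen (陈宇飞)
- HUST: Faculty: Jianjun Han (韩建军); Students: Xuepeng Fan (范学鹏), Liang Zhu (朱亮), Chencheng Ye (叶晨成)
- ICT: Faculty: Yinhe Han (韩银和), Wei Huo (霍玮), Guihai Yan (鄢贵海); Students: Jun Ma (马君), Faqiang Sun (孙发强), Xueliang Li (李雪亮)
- MSRA: Zhengping Qian (钱正平), Hucheng Zhou (周虎成), Chuntao Hong (洪春涛), Zheng Zhang (张峥), Lidong Zhou (周礼栋), Lintao Zhang (张霖涛)
- PKU: Faculty: Xiaolin Wang (汪小林)
- Tsinghua: Faculty: Wenguang Chen (陈文光), Yu Chen (陈渝), Kang Chen (陈康), Yuan Dong (董渊); Student: Wentao Han (韩文弢)
- USTC: Faculty: Junmin Wu (吴俊敏), Yu Zhang (张昱); PhD students: Tao Sun (孙涛; advisor: Hong An, 安虹), Junchang Wang (王俊昌; advisor: Bei Hua, 华蓓), Kai Zhang (张凯; advisor: Bei Hua, 华蓓)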
Panel Discussion Summary
Credit: Yufei Chen (scribe)
Lidong Zhou
Dependability, Redefined
Traditional approach
- replication and fault tolerance
- checkpoint and recovery
- testing and bug finding
- monitoring, diagnosis and repair
Emergent misbehavior
- sufficiently many machines
- predictability and stability
Replication at finer granularity
Assumptions and issues for replication
- black box and coarse granularity
- detection time
- different failure types
- replicated state machine
- unnecessary serialization
Language-based security
- widely used in bug finding
- compiler assisted checkpointing and replication
- ds data-parallel computing
Yinhe Han
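Basic reliability assurance
For chips: testing, verification, and design for debuggability
Memory system reliability assurance
- bus data transfer protocols, chip design
- memory checkpointing and fault recovery protocols
Fault management
- data center fault models and propagation analysis, fault injection
- node status monitoring
System-level collection and management of reliability information
- processor, memory
- form of deliverables: standalone software, or provided interfaces
Single-machine fault information collection and management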
Xiaolin Wang
Conflict between security and perf?
constrained and unconstrained
not requiring security all the time
- transitional locking?
- lock with attr.
- let user control
Wenguang Chen
From lang and compiler
locks
- checkpointing on various platforms
- not sure whether it is correct
MPI
- No fault tolerance
- One error brings down all processes
programming model
- provide sth.?
- checkpoint
arch. provide
- Garth Gibson: HPC needs checkpointing, but the amount of data is huge.
- Needs arch. support.
Yu Zhang
Verification
- Existing parallel applications
- Certified kernel
Compiler
- Certifying compiler
- Proof checker
- Subset C certified compiler (ongoing work)
Deterministic execution
- differences from existing deterministic execution
- specific input, same result
- select one possible scheduling
- may generate a scheduling that is different from the existing model
- programmer should be able to infer the order
- problem on shared memory (Bryan Ford)
- consistency memory model
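The "specific input, same result" and "select one possible scheduling" points can be illustrated with a toy sketch. This is not the system discussed at the panel. It assumes two cooperative tasks modeled as Python generators and a hypothetical `run` function whose seed fixes which of the many legal interleavings is chosen.

```python
import random


def worker(name, steps, log):
    """A cooperative task: record one event, then yield to the scheduler."""
    for i in range(steps):
        log.append(f"{name}{i}")
        yield


def run(seed, steps=3):
    """Run tasks A and B under a schedule fully determined by `seed`.

    Hypothetical helper for illustration only: the seeded generator picks
    which runnable task advances next, so the same seed always selects
    the same interleaving.
    """
    rng = random.Random(seed)
    log = []
    tasks = [worker("A", steps, log), worker("B", steps, log)]
    while tasks:
        task = rng.choice(tasks)
        try:
            next(task)
        except StopIteration:
            tasks.remove(task)
    return log


first = run(seed=7)
second = run(seed=7)
print(first)
print(first == second)
```

Each task's own events always appear in program order, and rerunning with the same seed reproduces the whole interleaving, which is what lets a programmer reason about the order.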
Jianjun Han
real-time needs reliability
high temperature impacts reliability
- task migration to reduce load on hot components
- guarantee real-time requirements
ZZ
Algorithm dependable flexibility
- algorithm itself is not sensitive to errors
- e.g. Monte Carlo
- some can be relaxed, others can't
- machine learning algorithms
- OS is over-designed for these algorithms
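As a rough illustration of the Monte Carlo point, the sketch below estimates pi from random samples and injects faults by silently losing a fraction of them. The fault model (independent sample loss), the function name `estimate_pi`, and the parameter values are assumptions made for this example, not something presented at the workshop.

```python
import math
import random


def estimate_pi(samples, fault_rate=0.0, seed=0):
    """Estimate pi by sampling points in the unit square.

    `fault_rate` is an assumed fault model: each sample is independently
    lost with this probability, as if a worker crashed before reporting.
    """
    rng = random.Random(seed)
    hits = 0
    counted = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if rng.random() < fault_rate:
            continue  # injected fault: this sample never gets counted
        counted += 1
        if x * x + y * y <= 1.0:
            hits += 1
    return 4.0 * hits / counted


clean = estimate_pi(100_000)
faulty = estimate_pi(100_000, fault_rate=0.05)
print(f"no faults:  {clean:.4f} (error {abs(clean - math.pi):.4f})")
print(f"5% dropped: {faulty:.4f} (error {abs(faulty - math.pi):.4f})")
```

Losing samples only shrinks the effective sample size, so the estimate degrades gracefully without any recovery mechanism. An algorithm that needs every intermediate result to be exact cannot be relaxed in this way.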
Part of the program is verifiable
- verify a subspace of a program
- execution in this space is safe
- what happens when the execution goes outside this space?
MPI global checkpoint
- but we can change the viewpoint
Energy
- analyze power according to program execution (Haibo)
- power model, emulator (not accurate, 1 time error; Yinhe)
Free discussion
Does 100 W of power generate 100 W of heat?
- computation consumes energy
- radio
- but this is rather small
What's the standard for "verifiable"? (Yinhe Han)
- How to verify the correctness of Google's search results?
- Integrate economic factors (Wenguang)
Benchmarks now care about what the user can perceive
We can do reliability in different layers
- Is redundancy necessary?
- worse: conflict
Acknowledgement
Microsoft Research Asia sponsored the dinners and conference room.