8:20 – 8:30 Opening by Wenguang Chen
8:30 – 9:50 Who’s who session
Each participating institute gives a 10-15 minute introduction to the group's research interests and members, in particular the people attending the seminar.
Lidong Zhou
Dependability, Redefined
Traditional approach
- replication and fault tolerance - checkpoint and recovery - testing and bug finding - monitoring, diagnosis and repair
Emergent misbehavior
- sufficiently many machines - predictability and stability
Replication at finer granularity
Assumptions and issues for replication
- black box and coarse granularity - detection time - different failure types - replicated state machine - unnecessary serialization
Language-based security
- widely used in bug finding - compiler assisted checkpointing and replication - ds data-parallel computing
Yinhe Han
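Foundational reliability assurance
For chips: testing, verification, and design for debuggability
Memory system reliability assurance
- bus data transfer protocol, chip design - memory checkpointing and fault recovery protocols
Fault management
- data center fault models and propagation analysis, fault injection - node status monitoring
System-level collection and management of reliability information
- processor, memory - form of deliverable: standalone software or provided interfaces
Collection and management of fault information on a single machine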
Xiaolin Wang
Conflict between security and performance?
constrained and unconstrained
not requiring security all the time
- transitional locking? - lock with attar. - let user control
Wenguang Chen
From the language and compiler side
locks
- checkpointing on various platforms - not sure whether it's correct
MPI
- No fault tolerance - One error, all error
programming model
- provide something? - checkpoint
Architecture support
- Garth Gibson: HPC needs checkpointing, but the amount of data is huge. - Needs architecture support.
Yu Zhang
Verification
- Existing parallel applications - Certified kernel
Compiler
- Certifying compiler - Proof checker - Certified compiler for a subset of C (ongoing work)
Deterministic execution
- differences from existing deterministic execution work - specific input, same result - select one possible schedule - may generate a schedule that differs from the existing model - programmer should be able to infer the order
- problem with shared memory (Bryan Ford) - memory consistency model
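The "specific input, same result" and "select one possible schedule" points can be illustrated with a minimal sketch. It is not the speaker's system: the names `worker` and `run_deterministic` are hypothetical, and cooperative generators stand in for threads. The assumption is that the interleaving is chosen by a seeded random generator, so the schedule is a function of the seed alone.

```python
import random

def worker(name, log, steps):
    """A task that appends one entry per step, yielding between steps."""
    for i in range(steps):
        log.append(f"{name}{i}")
        yield  # point at which the scheduler may switch tasks

def run_deterministic(seed, steps=3):
    """Interleave two tasks using a schedule derived only from `seed`."""
    rng = random.Random(seed)  # same seed -> same sequence of choices
    log = []
    tasks = [worker("A", log, steps), worker("B", log, steps)]
    while tasks:
        task = rng.choice(tasks)  # pick one of the possible schedules
        try:
            next(task)
        except StopIteration:
            tasks.remove(task)  # task finished
    return log

if __name__ == "__main__":
    print(run_deterministic(7))
    print(run_deterministic(7))  # identical to the line above
```

Running with the same seed twice gives the same interleaving, while each task's own steps always appear in program order, which is what lets a programmer reason about the result.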
Jianjun Han
real-time needs reliability
High temperature affects reliability
- task migration to reduce load on hot components - guarantee real-time requirements
ZZ —
Algorithm-dependent flexibility
- algorithm itself is not sensitive to errors - e.g. Monte Carlo - some can be relaxed, others can't - machine learning algorithms - OS is over-designed for these algorithms
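A small sketch of the Monte Carlo point, under stated assumptions: the function `estimate_pi` and its `fault_rate` parameter are hypothetical, and a "fault" is modeled as a randomly flipped hit/miss result for a single sample. It only illustrates why occasional errors barely move a sampling-based estimate.

```python
import random

def estimate_pi(samples, fault_rate=0.0, seed=0):
    """Estimate pi by sampling points in the unit square.

    With probability `fault_rate`, a sample's result is flipped,
    modeling an undetected error in that one computation.
    """
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        hit = x * x + y * y <= 1.0
        if rng.random() < fault_rate:
            hit = not hit  # injected fault (assumed fault model)
        if hit:
            inside += 1
    return 4.0 * inside / samples

if __name__ == "__main__":
    print(estimate_pi(200_000))                    # fault-free run
    print(estimate_pi(200_000, fault_rate=0.001))  # 0.1% of samples corrupted
```

Both runs land close to pi, because each corrupted sample shifts the average only slightly. An algorithm that depends on every intermediate value being exact would not have this slack.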
Part of the program is verifiable
- verify subspace of a program - exec in this space is safe - what happens when the exec goes outside this space
MPI global checkpoint
- but we can change the viewpoint
Energy
- analyze power according to program exec (Haibo) - power model, emulator (not accurate, 1 time error – Yinhe)
Free discussion
Does 100 W of power generate 100 W of heat?
- computation consumes energy - radio - but this is rather small
What is the standard for "verifiable"? (Yinhe Han)
- How to verify the correctness of Google's search results - Integrate economic factors (Wenguang)
Benchmarks now care about what the user can perceive
We can do reliability in different layers
- Is redundancy necessary? - worse: conflict
Microsoft Research Asia sponsored the dinners and conference room.