1st ChinaSys Workshop

Workshop Program

8:20 – 8:30 Opening by Wenguang Chen

8:30 – 9:50 Who’s who session

Each participating institute gives a 10-15 minute introduction to the group's research interests and members, in particular the people attending the seminar.

Attendee List
Panel Discussion Summary
Credit: Yufei Chen (scribe)

Lidong Zhou

Dependability, Redefined

Traditional approach

- replication and fault tolerance
- checkpoint and recovery
- testing and bug finding
- monitoring, diagnosis and repair

Emergent misbehavior

- sufficiently many machines
- predictability and stability

Replication at finer granularity

Assumptions and issues for replication

- black box and coarse granularity
- detection time
- different failure types
- replicated state machine
- unnecessary serialization

Language-based security

- widely used in bug finding
- compiler assisted checkpointing and replication
- ds data-parallel computing

Yinhe Han


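Basic reliability assurance

For chips: testing, verification, and design for debuggability

Memory system reliability assurance

- bus data-transfer protocol, chip design
- memory checkpointing and fault-recovery protocol

Fault management

- data-center fault models and propagation analysis, fault injection
- node status monitoring

System-level collection and management of reliability information

- processor, memory
- form of deliverable: standalone software or a provided interface

Collection and management of single-machine fault information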

Xiaolin Wang


Conflict between security and perf?

constrained and unconstrained

not requiring security all the time

- transitional locking?
- lock with attar.
- let user control

Wenguang Chen


From lang and compiler

locks

- checkpoint on various platforms
- not sure whether it's correct

MPI

- no fault tolerance
- one error, all processes fail

programming model

- provide something?
- checkpoint

arch. provide

- Garth Gibson: HPC needs checkpointing, but the amount of data is huge.
- Needs arch. support.
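The checkpoint idea raised in this part of the discussion can be illustrated with a minimal sketch. It is not any speaker's system: the function name `run_with_checkpoints`, the JSON state file, the `every` interval and the `fail_at` crash switch are all hypothetical, chosen only to show periodic state saving and resumption after a failure.

```python
import json
import os
import tempfile

def run_with_checkpoints(n_steps, path, every=10, fail_at=None):
    """Sum 0..n_steps-1, saving state every `every` steps and resuming if a checkpoint exists."""
    state = {"step": 0, "total": 0}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)  # recover the last saved state
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated crash")  # hypothetical fault injection
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            tmp = path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, path)  # atomic rename: never leaves a half-written checkpoint
    return state["total"]

with tempfile.TemporaryDirectory() as d:
    ckpt = os.path.join(d, "state.json")
    try:
        run_with_checkpoints(100, ckpt, fail_at=25)
    except RuntimeError:
        pass  # the run died after the step-20 checkpoint was written
    print(run_with_checkpoints(100, ckpt))  # resumes from step 20, not from 0
```

The state here is two integers. The concern noted in the discussion is that real HPC state is huge, so writing it out is the expensive part.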

Yu Zhang


Verification

- existing parallel applications
- certified kernel

Compiler

- Certifying compiler
- Proof checker
- Subset C certified compiler (ongoing work)

Deterministic execution

- difference from existing deterministic execution
- specific input, same result
- select one possible scheduling
- may generate a scheduling which is different from the existing model
- programmer should be able to infer the order

- problem on shared memory (Bryan Ford)
- memory consistency model
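A toy illustration of "specific input, same result" and "select one possible scheduling": the sketch below assumes cooperative tasks written as Python generators and a fixed round-robin order. It is not the scheme discussed by the speaker, and the names `worker` and `run_deterministic` are hypothetical.

```python
from collections import deque

def worker(name, steps, log):
    """A task that appends to a shared log and yields control after every step."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # hand control back to the scheduler

def run_deterministic(task_specs):
    """Run the tasks under one fixed round-robin schedule and return the shared log."""
    log = []
    ready = deque(worker(name, steps, log) for name, steps in task_specs)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # run the task up to its next yield
            ready.append(task)  # requeue it at the back
        except StopIteration:
            pass                # task finished; drop it
    return log

first = run_deterministic([("A", 2), ("B", 3)])
second = run_deterministic([("A", 2), ("B", 3)])
print(first)
print(first == second)
```

Because the scheduler always picks the task at the head of the queue, a programmer can infer the interleaving from the input alone. That is the property the last bullet asks for.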

Jianjun Han


real-time needs reliability

high temperature impacts reliability

- task migration to reduce load on hot components
- guarantee real-time requirements

ZZ —

Algorithm-dependent flexibility

- the algorithm itself is not sensitive to errors
- e.g. Monte Carlo
- some can be relaxed, others can't
- machine learning algorithms
- the OS is over-designed for these algorithms
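The Monte Carlo point can be made concrete with a small sketch. The fault model is an assumption, not something from the discussion: a fixed fraction of samples is lost at random, independently of their values. The function name `estimate_pi` and the `loss_rate` parameter are hypothetical.

```python
import math
import random

def estimate_pi(n_samples, loss_rate=0.0, seed=0):
    """Monte Carlo estimate of pi in which a fraction of samples is silently lost."""
    rng = random.Random(seed)
    inside = used = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if rng.random() < loss_rate:
            continue  # assumed fault: this sample's result never arrives
        used += 1
        if x * x + y * y <= 1.0:
            inside += 1  # point fell inside the quarter circle
    return 4.0 * inside / used

clean = estimate_pi(200_000)
lossy = estimate_pi(200_000, loss_rate=0.1)
print(abs(clean - math.pi), abs(lossy - math.pi))
```

Losing a tenth of the samples only widens the statistical error slightly, so such a computation does not need every individual step to be protected. An algorithm whose output depends on each step being exact could not be relaxed this way.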

Part of the program is verifiable

- verify a subspace of a program
- exec in this space is safe
- what happens when the exec goes outside this space

MPI global checkpoint

- but we can change the viewpoint

Energy

- analyze power according to program exec (Haibo)
- power model, emulator (not accurate, 1 time error – Yinhe)

Free discussion


Does 100 W of power generate 100 W of heat?

- computation consumes energy
- radio
- but this is rather small

What's the standard for being verifiable? (Yinhe Han)

- How to verify the correctness of Google's search results
- Integrate economic factors (Wenguang)

Benchmarks now care about what the user can perceive

We can do reliability in different layers

- Is redundancy necessary?
- worse: conflict

Acknowledgement

Microsoft Research Asia sponsored the dinners and conference room.