
Study of the Robustness of Self-Converging Algorithms against SEUs


Dr. Raoul Velazco

TIMA Laboratory, «ARIS» Group
Grenoble - France
http://tima.imag.fr

PRiSME Laboratory, «SYSCOM» Group
University of Versailles Saint-Quentin-en-Yvelines - France
http://www.prism.uvsq.fr/

Universidad Complutense de Madrid - 16th March 2015

Outline

1. Radiation effects in ICs
2. The Self-Stabilizing Algorithm
3. SEUs in processor-based applications
4. The LEON3 processor
5. The ASTERICS test platform
6. Simulation of SEUs on the LEON3
7. Conclusions


1. Radiation effects in ICs: context

• Aerospace electronic systems operate in a radiation environment
• Charged particles come from three main sources: the Van Allen belts, cosmic rays and solar flares

[Figure labels: cosmic rays; protons from solar flares]

1. Radiation effects in ICs: context

• Microelectronic technology is constantly changing:
– higher density,
– faster devices,
– lower power.
• These trends increase the devices' vulnerability to the effects of radiation (and not only in nuclear or space environments)
• In some applications, no failure is allowed
• Advanced technologies are potentially sensitive to the effects of atmospheric neutrons
• Space agencies favor the use of COTS technologies

1. Radiation effects in ICs: types of faults

Radiation and electronic devices:
• Accumulated effects: displacement damage and T.I.D. (total ionizing dose)
• Single-particle effects: S.E.E. (single event effects)

1. Radiation effects in ICs: description of SEEs

What you always wanted to know about Single Event Effects (SEEs)

• What are they?
One of the results of the interaction between radiation and electronic devices
• How do they act?
By creating free charge in the silicon bulk that, in practice, behaves as a short-lived but intense current pulse
• What are the ultimate consequences?
Anything from simple bit-flips or noise-like signals to the physical destruction of the device

1. Radiation effects in ICs: description of SEEs

The physical mechanism

The incident particle generates a dense track of electron-hole pairs, and this ionization causes a transient current pulse if the strike occurs near a sensitive volume.

[Figure: charge collection volume]

1. Radiation effects in ICs: classification of SEEs

• Single Event Upset (SEU): change of the data stored in memory cells
• Multiple Bit Upset (MBU): several simultaneous SEUs
• Single Event Transient (SET): transient peaks in combinational ICs
• Single Event Latch-up (SEL): triggering of a parasitic thyristor
• Single Event Functional Interrupt (SEFI): phenomena in critical parts
• And others…

Hard errors and soft errors

1. Radiation effects in ICs: description of SEEs

Some useful definitions

CROSS SECTION (σ):
$$\sigma = \frac{N_{\mathrm{events}}}{N_{\mathrm{dev}} \cdot \text{particle fluence}}$$

LINEAR ENERGY TRANSFER (LET)

SOFT ERROR RATE (SER): probability of an error under usual operating conditions
FIT: typical unit of SER → 1 error every 10^9 h
E.g., 180-nm SRAM: 1000-3000 FIT/Mb
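As a rough worked example of what a FIT figure means in practice, consider a hypothetical 16-Mb SRAM at the lower bound quoted above (1000 FIT/Mb). The memory size and the fleet size are illustrative assumptions, not values from the talk:

```latex
\begin{align*}
\text{SER} &= 1000\ \tfrac{\text{FIT}}{\text{Mb}} \times 16\ \text{Mb}
            = 1.6\times10^{4}\ \text{FIT}
            = 1.6\times10^{4}\ \text{errors per } 10^{9}\ \text{device-hours},\\
\text{mean time between errors} &= \frac{10^{9}\ \text{h}}{1.6\times10^{4}}
            = 62\,500\ \text{h} \approx 7.1\ \text{years}.
\end{align*}
```

A single device would therefore see about one soft error every seven years, whereas a fleet of 1000 such devices would see one roughly every 62.5 hours.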

1. Radiation effects in ICs: sources of SEEs

Traditionally, SEEs have been associated with space missions because of the absence of the atmospheric shield…

[Figure labels: cosmic rays; protons from solar flares]

Unfortunately, our quiet oasis seems to be vanishing, since the enemy is knocking on the door…
• Alpha particles from vestigial U or Th traces
• Atmospheric neutrons and other cosmic rays

1. Radiation effects in ICs: sources of SEEs

Alpha particles

Sometimes they appear without warning and, after some months and a lot of money spent, the source is detected*.
• In 1978, Intel had to stop a factory because its water was drawn from a nearby river that, upstream, ran too close to an old uranium mine.
• In 1986, IBM detected a high rate of defective devices and traced it to the phosphoric acid, whose bottles were cleaned with a 210Po deionizer gadget… hundreds of km away.
• In 1992, the problem came from phosphorus obtained from the droppings of bats living in caves with traces of Th and U.

* J. F. Ziegler and H. Puchner, “SER – History, Trends and Challenges. A Guide for Designing with Memory ICs”, Cypress Semiconductor, USA, 2004.

1. Radiation effects in ICs: sources of SEEs

Alpha particles

But sometimes we are a little naive…
• Solder balls are usually made from Sn and Pb, which come from minerals that may contain traces of uranium and thorium.

Nevertheless, the designer forgets this detail and places the solder balls too close to critical nodes!

1. Radiation effects in ICs: sources of SEEs

Alpha particles

• Fortunately, they are easily controlled by following some simple rules during the manufacturing process.

But, sometimes, the enemy strikes back!

In 2005, a rate of 2·10^6 FIT/Mbit was observed in SRAMs attached to pacemakers in which the package had been removed for cosmetic reasons and the solder balls had not been purified beforehand*.

Fortunately, nobody died (we keep our fingers crossed).

* J. Wilkinson, IEEE Trans. Dev. Mat. Reliab., 5 (3), pp. 428-433, 2005

1. Radiation effects in ICs: sources of SEEs

Cosmic rays

They have long been a headache for the designers of on-board electronics for space missions…

Here are some of their practical jokes*…
• Cassini Mission (1997).- Some information was lost because of MBUs.
• Deep Space 1.- An SEU caused a solar panel to stop opening out.
• Mars Odyssey (2001).- Two weeks after the launch, alarms went off because of errors later attributed to an SEU.
• GPS satellite network.- One of the satellites is out of service, probably because of a latch-up.

* B. E. Pritchard, IEEE NSREC 2002 Data Workshop Proceedings, pp. 7-17, 2002

1. Radiation effects in ICs: sources of SEEs

Cosmic rays

A nice example… The birth of a star, a picture taken by the Hubble Telescope

[Figure: Hubble Telescope picture]

Do you notice something odd in the picture?

1. Radiation effects in ICs: sources of SEEs

Cosmic rays at ground level

• The highest flux is reached at an altitude of 15-20 km.
• Less than 1% of this particle rain reaches sea level.
• The composition has also changed…
• Basically, neutrons, muons and some pions

The neutron flux is usually referenced to that of New York City, which is (seemingly) only 15 n/cm²/h
• This value depends on the altitude (approximately ×10 every 3 km, until saturation at 15-20 km).
• And also on latitude: the nearer the Poles, the higher the rate.
• South Atlantic Anomaly (SAA), close to Argentina.
• 1.5 m of concrete reduces the flux by half.

Such a weak foe: should we really be afraid of it?
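The altitude rule of thumb can be written as a simple scaling law. This is only a sketch, assuming the factor of 10 per 3 km applies uniformly from sea level up to the saturation altitude; the symbols Φ and h are introduced here for illustration:

```latex
\begin{align*}
\Phi(h) &\approx \Phi_{\mathrm{NYC}} \cdot 10^{\,h/(3\ \mathrm{km})},
  \qquad \Phi_{\mathrm{NYC}} = 15\ \mathrm{n/cm^{2}/h},\\
\Phi(9\ \mathrm{km}) &\approx 15 \cdot 10^{3} = 1.5\times10^{4}\ \mathrm{n/cm^{2}/h}.
\end{align*}
```

On this estimate, the flux at 9 km is about 1000 times the sea-level reference, before any latitude or shielding correction.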

1. Radiation effects in ICs: sources of SEEs

Cosmic rays at ground level

Perhaps we believe that we are in a safe shelter, but…
– 1992.- The PERFORM system, used by aircraft to manage the take-off manoeuvre, had to be urgently replaced because of SEUs in its SRAMs*.
– 1998.- A study reported that, every day, 1 out of 10000 SRAMs attached to pacemakers underwent bit-flips**. This rate was 300 times higher if the patient had taken a transoceanic flight.

* J. Olsen, IEEE Trans. Nucl. Sci., 1993, 40, 74-77
** P. D. Bradley, IEEE Trans. Nucl. Sci., 45 (6), 2829-2940

1. Radiation effects in ICs: sources of SEEs

Cosmic rays at ground level

– The call of the Thousand (2000).- Sun Unix server systems crashed in dozens of places all over the USA because of SEUs in their cache memory, costing several million dollars*.
– 2003.- Elections in Belgium were held simultaneously in the traditional way and electronically. A difference of 4096 votes was found. Experts explained this difference as the consequence of an SEU**.
– 2005.- After 102 days, the ASC Q Cluster supercomputer showed 7170 errors in its 81-Gb cache memory, 243 of which led to a crash of the programs or the operating system***.

* Forbes, 2000
** Chantal Enguehard, Jean-Didier Graton, “Electronic Voting: the Devil is in the Details”, 2008, hal-00274635
*** K. W. Harris, IEEE Trans. Dev. Mat. Reliab., 2005, 5, 336-342
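The figure of 4096 is suggestive because 4096 = 2^12, which is exactly the change produced by flipping bit 12 of a binary counter from 0 to 1. The following minimal sketch (the function name and the vote counts in the usage note are hypothetical, chosen only for illustration) checks whether two counter values differ in exactly one bit:

```c
#include <stdint.h>

/* Returns the index (0..31) of the single bit in which a and b differ,
 * or -1 if they differ in no bit or in more than one bit. */
int single_bit_difference(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;                  /* set bits mark the positions that differ */

    if (diff == 0 || (diff & (diff - 1)) != 0)
        return -1;                          /* zero bits or several bits differ */

    int bit = 0;
    while ((diff >> bit) != 1u)             /* locate the only set bit */
        bit++;
    return bit;
}
```

For instance, with an assumed true count of 514 and a reported count of 4610, `single_bit_difference(514u, 4610u)` identifies bit 12, the signature expected from a single upset in the counter.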


ALWAYS BLAMING THE PROGRAM DEVELOPER?

PERHAPS IT WAS AN SEU!!!

1. Radiation effects in ICs: sources of SEEs

Cosmic rays at ground level

Why are these exotic phenomena appearing at lower and lower altitudes?

The present trend is to shrink the typical layout dimensions.

This decreases the sensitive volume, but it also decreases the critical charge.

The most pessimistic simulations show the error rate bottoming out at 130-180 nm, and a sudden increase is expected for more advanced technologies.

T. Granlund, IEEE Trans. Nucl. Sci., 2003, 50, 2065-2068

1. Radiation effects in ICs: sources of SEEs

Cosmic rays at ground level

In any case, everybody agrees that the error rate of the whole system is increasing…

And so is the sensitivity of combinational logic devices*.

* R. Baumann, IEEE Trans. Dev. Mat. Reliab., 2005, 5, 305-316

1. Radiation effects in ICs: sources of SEEs

Cosmic rays at ground level

Can this background be worse? Yes, it can. Some details may increase the neutron sensitivity.
– Power supply values.- The lower the supply voltage, the more likely the SEUs
– Operating frequency.- SEUs are more dangerous while the system is reading or writing
– Presence of boron.- There is an isotope of boron, 10B, able to trap low-energy thermal neutrons and release an energetic alpha particle:
$${}^{10}_{5}\mathrm{B} + {}^{1}_{0}\mathrm{n} \rightarrow {}^{4}_{2}\alpha + {}^{7}_{3}\mathrm{Li}$$
– Altitude


2. Self-stabilizing algorithms

• Self-stabilizing algorithms are used for communications within computer or sensor networks
  They are supposed to have fault-tolerance capabilities
• Are they robust with respect to soft errors?
  The ASTERICS test platform was used to simulate SEUs by HW/SW means
  SEU fault-injection experiments were performed on the LEON3 while it executed a self-converging application
• Final goal: identify sensitive resources and explore SW fault-tolerance solutions for the self-stabilizing algorithm

2. Self-stabilizing algorithms

• Defined by Edsger Dijkstra in 1974
• A property of distributed systems: when the system is wrongly initialized or perturbed, it automatically returns to correct operation in a finite number of computation steps
• Applications:
– in theoretical computer science, and in domains where human intervention to restart a system after a failure is impossible
– in computer networks and sensor networks, as well as in critical systems such as satellites

« Testing shows the presence, not the absence, of bugs! » Edsger Dijkstra

2. Self-stabilizing distributed algorithms

• Idea: a fault can put the system in any arbitrary state
• From any state, the system resumes normal behavior and remains in it
• Defined by:
– Convergence: the system eventually reaches normal behavior
– Closure: when no fault occurs, the system behaves in the intended manner

2. Self-stabilizing algorithms: behaviour

2. Self-stabilizing algorithms: self-convergence

• A fault leads to an arbitrary state
• The algorithm still gives a correct answer:
– if the error does not occur too close to the end (e.g. just before return)
– if the error does not modify the data

2. Self-stabilizing algorithms: distributed shortest paths in a graph

• Given:
– a weighted graph G defined by its matrix (an array) and its size (an integer)
• Computes:
– the shortest paths from any node i to node 0
• Mimics the behavior of a distributed self-stabilizing algorithm

2. Self-stabilizing algorithms: distributed shortest paths in a graph (cont'd)

• Any node i knows
– its distance l_ij to any neighbor j
• Node 0 knows it is the sink
– so its distance to itself is 0, and the shortest path is to remain on 0

    if (i = 0)
        d_i := 0
        next_i := 0
    else
        d_i := min{l_ij + d_j}
        next_i := argmin{l_ij + d_j}    // with j neighbor of i
    endif

– Once no computation can modify d, d_i is the distance from i to 0 and next_i is the next step on the shortest path from i to 0.

« The shortest path in a graph is never the one we think, it can come from nowhere and, most of the time, it does not exist » Edsger Dijkstra

2. Self-stabilizing algorithms: self-convergent shortest paths

T = N x N matrix. Matrix T represents a graph: nodes i and j are connected by an edge of length T(i,j).
D = N x 1 matrix. The distance between node i and node 0 is D_i = min_j (T_ij + D_j).

    int i, j, m;
    int b = 1, c = 1;
    while (b || c) {              /* stop after two consecutive passes without change */
        c = b;
        b = 0;
        D[0] = 0;                 /* node 0 is the sink */
        for (i = 1; i < N; i++) {
            m = VERY_LARGE;
            for (j = 0; j < N; j++) {
                if (m >= D[j] + T[N*i + j])
                    m = D[j] + T[N*i + j];
            }
            if (D[i] != m)
                b = 1;            /* D changed: another pass is needed */
            D[i] = m;
        }
    }
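To make the self-convergence property concrete, here is a minimal, self-contained sketch of the same loop. It rests on several assumptions that are not in the slides: the name `stabilize`, the `NO_EDGE` sentinel standing in for "VERY LARGE", a connected graph with strictly positive edge lengths, and the clamping of out-of-range distance estimates so that a bit-flip in the sign or high-order bits of `D` cannot cause overflow or a very long count-up.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical sentinel for "no edge" / "VERY LARGE"; small enough that
 * adding two such values cannot overflow an int. */
#define NO_EDGE (INT_MAX / 4)

/* Recomputes D so that D[i] is the shortest distance from node i to node 0.
 * T is an n x n row-major matrix: T[n*i + j] is the length of edge i-j,
 * or NO_EDGE when i and j are not connected (including i == j).
 * D may contain arbitrary (corrupted) values on entry.
 * Returns the number of passes over the graph. */
int stabilize(const int *T, int *D, size_t n)
{
    int b = 1, c = 1, passes = 0;

    while (b || c) {              /* stop after two consecutive quiet passes */
        c = b;
        b = 0;
        D[0] = 0;                 /* the sink always repairs its own distance */
        for (size_t i = 1; i < n; i++) {
            int m = NO_EDGE;
            for (size_t j = 0; j < n; j++) {
                if (T[n * i + j] >= NO_EDGE)
                    continue;     /* j is not a neighbor of i */
                int dj = D[j];
                if (dj < 0 || dj > NO_EDGE)
                    dj = NO_EDGE; /* treat an out-of-range estimate as unknown */
                int cand = dj + T[n * i + j];
                if (cand > NO_EDGE)
                    cand = NO_EDGE;
                if (cand < m)
                    m = cand;
            }
            if (D[i] != m)
                b = 1;            /* something changed: keep iterating */
            D[i] = m;
        }
        passes++;
    }
    return passes;
}
```

On a four-node graph with edges 0-1 (length 1), 1-2 (length 2), 0-2 (length 5) and 2-3 (length 1), calling `stabilize` with `D` initialised to arbitrary values such as `{7, 0, 4096, 2}` ends with `D = {0, 1, 3, 4}`. Corrupting `D` afterwards and calling it again restores the same result, which is the behavior the fault-injection campaign is designed to probe. Corruption of `T` itself is not tolerated by this sketch.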


3. SEUs in processor-based applications

• The first studies on SEUs were carried out at the end of the 1960s
• They considered space applications only
• ICs issued from advanced manufacturing processes are sensitive to thermal neutrons present in the Earth's atmosphere, even at ground level
• Processors and memories embed a significant number of SEU targets
• Applications for which soft errors may have critical consequences must be evaluated with respect to SEUs

3. SEUs in processor-based applications: the CEU method

• Presented for the first time in 2000*
• Estimates the number of particles required to obtain an observable event in an application by combining fault-injection and accelerated-test results
• Provides data on the system's sensitivity at an early stage of development
• How to do that?

1. Calculate the probability for a fault to provoke an error in the application:
$$\tau_{INJ} = \frac{\#\,\text{application errors}}{\#\,\text{injected faults}}$$

2. Obtain the static cross-section (from the literature or from measurements):
$$\sigma_{SEU} = \frac{\#\,\text{errors in the configuration memory}}{\text{fluence}}$$

3. Obtain the system error rate:
$$\tau_{PRED} = \sigma_{SEU} \cdot \tau_{INJ}$$

* R. Velazco, S. Rezgui, R. Ecoffet, “Predicting Error Rate for Microprocessor-Based Digital Architectures through C.E.U. (Code Emulating Upsets) Injection”, IEEE Transactions on Nuclear Science, Vol. 47, No. 6, Dec. 2000, pp. 2405-2411.
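The three steps reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the numbers in the usage note are invented to show the orders of magnitude rather than taken from any experiment.

```c
/* Step 1: fraction of injected faults that caused an application error. */
double tau_inj(unsigned long application_errors, unsigned long injected_faults)
{
    if (injected_faults == 0)
        return 0.0;                 /* no injection campaign, no estimate */
    return (double)application_errors / (double)injected_faults;
}

/* Step 2: static cross-section, i.e. upsets per unit fluence
 * (in cm^2 when the fluence is given in particles/cm^2). */
double sigma_seu(unsigned long upsets, double fluence)
{
    if (fluence <= 0.0)
        return 0.0;
    return (double)upsets / fluence;
}

/* Step 3: predicted application error rate per unit fluence. */
double tau_pred(double sigma, double tau)
{
    return sigma * tau;
}

/* Average fluence needed to observe one application error. */
double fluence_per_error(double tau_predicted)
{
    if (tau_predicted <= 0.0)
        return 0.0;
    return 1.0 / tau_predicted;
}
```

With, say, 50 application errors out of 1000 injected faults and 200 upsets measured under a fluence of 1e8 particles/cm², the sketch gives τ_INJ = 0.05, σ_SEU = 2e-6 cm² and τ_PRED = 1e-7 cm², that is, about one application error per 1e7 particles/cm².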

3. SEUs in processor-based applications: the CEU method

Fault injection mechanism:
• Faults are injected using an external interrupt of the processor
• The bit-flip target is selected using the instruction set

=> The accuracy of the method depends on the number of accessible memory elements compared to the total number of memory cells embedded in the DUT
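In software terms, the routine triggered by the interrupt only has to XOR a mask into the selected target. The following sketch assumes a target that can be reached through a pointer (a memory word or a memory-mapped register); the type and function names are hypothetical, and processor registers would instead require processor-specific assembly code.

```c
#include <stdint.h>

/* Hypothetical descriptor of one injection. */
typedef struct {
    volatile uint32_t *target;   /* word in which the upset is emulated */
    uint32_t bit_mask;           /* bits to flip (one bit set for a single SEU) */
} ceu_vector_t;

/* Body of the routine executed when the external interrupt fires:
 * XOR flips exactly the bits selected by the mask and leaves the others intact. */
void ceu_inject(const ceu_vector_t *v)
{
    *v->target ^= v->bit_mask;
}
```

Applying the same vector twice restores the original value, which makes the routine easy to check in isolation before running a campaign.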

3. SEUs in processor-based applications: the CEU method

• Can be applied to any processor:
– in its HW version
– implemented in an FPGA
• SEU targets are the memory cells accessible through the instruction set:
– registers
– special function registers (SP, PC, …)
– internal SRAM
– cache memory
– …
• CEU codes strongly depend on the studied processor's architecture and instruction set


4. LEON3 processor

Generalities:
LEON3 is a synthesizable VHDL model of a 32-bit processor compliant with the SPARC V8 architecture

Main features:
• 7-stage pipeline
• High-performance, fully pipelined IEEE-754 FPU
• Separate instruction and data caches (Harvard architecture)
• AMBA-2.0 AHB bus interface
• Symmetric multi-processor support (SMP)
• Up to 125 MHz in FPGA and 400 MHz on 0.13 µm ASIC technologies
• Fault-tolerant and SEU-proof version available for space applications
• High performance: 1.4 DMIPS/MHz, 1.8 CoreMark/MHz (gcc 4.1.2)
• Free: http://www.gaisler.com/

4. LEON3 processor: interfaces and peripherals

4. LEON3 processor: specificities

• Unlike typical processors, the LEON3 does not have a single Stack Pointer (SP) register
• The LEON3 is organized around a system of 8 'windows'. Each window provides a separate register environment
• A function call or an interrupt provokes a window switch
• The out registers of window Wn become the in registers of window Wn+1, and Wn+1 receives a new set of local and out registers
• Each window has its own stack pointer, stored in o6 (an out register)

4. LEON3 processor: register file

• 136 general-purpose registers: 8 global registers + 128 window registers
• Only 32 are accessible at any time by an instruction:
– 8 global registers (g0 to g7)
– 24 window registers: 8 in registers (i0 to i7), 8 local registers (l0 to l7), 8 out registers (o0 to o7)
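The count of 136 follows from the overlap between adjacent windows: each of the 8 windows contributes only 16 new registers, because the out registers of a caller are the in registers of its callee. The sketch below maps a register number as seen by an instruction (SPARC numbering: r0-r7 globals, r8-r15 outs, r16-r23 locals, r24-r31 ins) to an index in a flat 136-entry file. The linear layout and the call-order window numbering are assumptions made for illustration; they are not the LEON3's actual physical organisation, and real SPARC V8 hardware decrements the current window pointer on a call.

```c
enum {
    N_GLOBALS       = 8,
    N_WINDOWS       = 8,
    REGS_PER_WINDOW = 16,   /* 8 ins + 8 locals owned by each window */
    N_PHYS          = N_GLOBALS + N_WINDOWS * REGS_PER_WINDOW   /* 136 */
};

/* Maps register r (0..31), as seen from the given window, to an index
 * in a hypothetical flat register file. Returns -1 on invalid arguments. */
int phys_index(int window, int r)
{
    if (window < 0 || window >= N_WINDOWS || r < 0 || r > 31)
        return -1;

    if (r < 8)
        return r;                                  /* globals are shared by all windows */

    int k = r & 7;                                 /* position within its group of 8 */

    if (r < 16) {                                  /* outs: stored as the ins of the next window */
        int next = (window + 1) % N_WINDOWS;
        return N_GLOBALS + next * REGS_PER_WINDOW + k;
    }
    if (r < 24)                                    /* locals: private to the window */
        return N_GLOBALS + window * REGS_PER_WINDOW + 8 + k;

    return N_GLOBALS + window * REGS_PER_WINDOW + k;   /* ins */
}
```

In this layout, `phys_index(0, 14)` (o6 of window 0) and `phys_index(1, 30)` (i6 of window 1) return the same index, which is why a bit-flip injected into one window's out registers is also visible from the neighbouring window.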

4. LEON3 processor: accessible SEU targets

Processor control registers:
* Processor State Register (PSR)
* Current Window Pointer (CWP)
* Window Invalid Mask (WIM)
* Program Counters (PC & nPC)

User application registers and memories:
* Register file: 136 general-purpose registers (8 global registers + 128 window registers). The Program Counter (PC) and next Program Counter (nPC) are special registers in the interrupt window
* Data and instruction caches: both are configurable (associativity, size, …)
  Our data cache is 1 Kb, direct mapped
  Our instruction cache is 1 Kb, direct mapped

4. LEON3 processor: accessible and non-accessible registers

[Figure: LEON3 integer unit, with registers marked as accessible or non-accessible using the instruction set]


5. ASTERICS (Advanced System for the TEst under Radiation of IC and Systems)

• Built around two Virtex-4 FPGAs:
– Control FPGA: XC4VFX60
– Chipset FPGA: XC4VLX40
• The PowerPC embedded in the FPGA is used to control the tester
• Up to 1 GB of DDR-SDRAM for the Control FPGA
• Compact Flash memory is used to store the FPGA configuration and the PowerPC instruction code
• Up to 180 IOs are available for connecting the Device Under Test (DUT) to the tester via a high-speed connector
• The DUT can access 32 Mb of SRAM and 512 Mb of DDR-SDRAM
• The configuration of the Chipset FPGA is managed by the Control FPGA
• The tester is remotely controlled via a 10/100/1000 Ethernet link

5. ASTERICS characteristics

Operating conditions:
* The PowerPC embedded in the Control FPGA runs at 300 MHz
* DUT frequency up to 200 MHz
* Available IO voltages: 3.3 V, 2.5 V, 1.8 V, 1.5 V, 1.2 V

Typical target DUTs (Devices Under Test):
* Advanced digital processors up to 64 bits
* Memories (SRAM, DRAM, etc.)
* Mixed analog/digital circuits (ADC, DAC, SoC, …)
* MEMS (potential upgrade depending on the specs)

5. ASTERICS characteristics

[Figure: ASTERICS board. Labels: Control FPGA; DDR-SDRAM for the PowerPC; Ethernet link; DUT connector; Chipset FPGA; DUT DDR-SDRAM; DUT SRAM]


6. Simulation of SEUs on the LEON3: CEU fault-injection environment

Fault injection mechanism
• Faults are injected using an external interrupt of the processor
• The bit-flip target is selected using the instruction set

Experimental results can be used to predict the application error rate
• The accuracy of the error-rate prediction method depends on the number of accessible memory elements compared to the total number of memory cells embedded in the DUT

Page 55: Estudio de la robustez frente a SEUs de algoritmos auto-convergentes

6. Simulation of SEUs on the LEON3: CEU fault-injection environment

• Hardware setup: PC + ASTERICS + power supply
• No DUT board: the Chipset FPGA is used as the DUT
• ASTERICS memory: LEON3 code & data
• Functions embedded in the Chipset FPGA:
  - Shared-memory controller (allows access by the CP and by the LEON3)
  - Supervisor (controls the experiment, the LEON3 and its peripherals)
• LEON3 application: a benchmark self-stabilizing algorithm

[Figure: block diagram of the setup. Labels: Comm. FPGA, LEON3 + Peripherals, Shared-memory controller, Supervisor, Memory, Ethernet link, ASTERICS, Chipset FPGA, Power supply]


6. Simulation of SEUs on the LEON3: CEU fault-injection environment

• Store the injection vectors: instant, target, register, bit mask
• Start the execution of the LEON3 application
• Generate the interrupt according to the instant vector
• Detect the normal end of the application
• Compare the obtained results with the expected results and count the errors
• Deal with timeouts. There are 3 types of timeout:
  - Boot timeout: the boot sequence does not finish
  - ASTERICS timeout: the running application does not finish
  - Computer timeout: the supervisor does not work properly or the ASTERICS stops responding

[Figure: experiment timeline. Labels: Expected end, Fault injection, ASTERICS timeout, Computer timeout, Boot timeout]
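
The bookkeeping listed above (injection vectors, result comparison, three kinds of timeout) can be sketched in C as follows. Only the four vector fields and the three timeout names come from the slide; field widths, the `run_status` flags and the function `classify_run` are assumptions made for this illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One injection vector with the four fields named on the slide.
   Field widths and units are assumptions. */
typedef struct {
    uint32_t instant;   /* when the interrupt is raised */
    uint8_t  target;    /* resource class (register file, cache, ...) */
    uint16_t reg;       /* register index inside the target */
    uint32_t bit_mask;  /* bit(s) to flip */
} injection_vector;

typedef enum {
    RUN_CORRECT,            /* normal end, results match the reference */
    RUN_RESULT_ERROR,       /* normal end, results differ */
    RUN_BOOT_TIMEOUT,       /* boot sequence did not finish */
    RUN_ASTERICS_TIMEOUT,   /* application did not finish */
    RUN_COMPUTER_TIMEOUT    /* supervisor or ASTERICS stopped responding */
} run_outcome;

/* Hypothetical status flags collected for one run. */
typedef struct {
    int platform_responding;
    int boot_done;
    int app_done;
} run_status;

/* Classifies one run; timeouts are checked before results are compared,
   since results are only meaningful after a normal end. */
run_outcome classify_run(const run_status *st, const uint32_t *obtained,
                         const uint32_t *expected, size_t n_words)
{
    assert(st && obtained && expected);
    if (!st->platform_responding) return RUN_COMPUTER_TIMEOUT;
    if (!st->boot_done)           return RUN_BOOT_TIMEOUT;
    if (!st->app_done)            return RUN_ASTERICS_TIMEOUT;
    return memcmp(obtained, expected, n_words * sizeof *obtained) == 0
               ? RUN_CORRECT : RUN_RESULT_ERROR;
}
```

Counting the outcomes returned by such a function over a whole campaign gives the error and timeout columns of a results table.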


6. Simulation of SEUs on the LEON3: Experiment flowchart

[Flowchart with three columns: Computer, Supervisor, LEON3. Steps: Initialize shared memory; Generate injection vectors; Store injection vectors; Send "init. memory" command; Application run; Generate interrupt; Fault injection routine; Detect end of execution or generate timeout; Send "Read Memory" command; Send results; Compare results with reference]

Fault injection rate: 1 SEU every 2 s


6. Simulation of SEUs on the LEON3: Preliminary results: target = register file

Self-converging algorithm:

    b = c = 1
    N = 16
    T = NxN matrix
    D = Nx1 matrix
    /* loop until two consecutive passes leave D unchanged */
    while (b || c) {
        c = b;
        b = 0;
        D[0] = 0;
        for (i = 1; i < N; i++) {
            m = BIGNUMBER;
            for (j = 0; j < N; j++) {
                if (m >= D[j] + T[N*i + j])
                    m = D[j] + T[N*i + j];
            }
            if (D[i] != m)
                b = 1;          /* D changed during this pass */
            D[i] = m;
        }
    }

Test #   Inj. faults   Result errors    Timeouts          Silent   Run limit (x nominal time)
1        130577        204 (0.15 %)     32143 (24.6 %)    219      1.5
2        199550        324 (0.16 %)     49478 (24.8 %)    384      1.5
3        15068         1709 (11.3 %)    992 (6.6 %)       28       5
4        14264         1614 (11.3 %)    900 (6.3 %)       0        8
5        8007          887 (11.07 %)    508 (6.3 %)       17       16

Preliminary results of fault injection experiments

Variable   Observed errors        Recoverable
i          timeouts               yes
j          timeouts               yes
m          errors and timeouts    yes
D          errors and timeouts    no
T          errors and timeouts    no
b          timeouts               yes
c          timeouts               yes

Sensitivity of the program variables

• During Tests 1 and 2, very few errors but a high number of timeouts were detected
• After a fault, the self-converging algorithm requires more than 1.5 x 336 ms (the nominal time) to converge
• Tests 3, 4 and 5 showed that timeouts masked result errors
  => a suitable timeout limit is higher than 5 times the nominal execution time
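
To make the self-converging behaviour concrete, here is a runnable C sketch of the loop shown on this slide. It reads the loop as a shortest-path style relaxation, which is an interpretation. The cost matrix built by `fill_example_costs`, the use of unsigned arithmetic, the `max_passes` guard (standing in for a run limit) and all function names are assumptions for illustration, not the benchmark's actual data or code:

```c
#include <assert.h>
#include <limits.h>

#define N 16
#define BIGNUMBER (UINT_MAX / 4)   /* assumed value; keeps D[j] + T[...] from wrapping here */

/* Runs the slide's loop until two consecutive passes leave D unchanged.
   Returns the number of passes, or -1 if max_passes is exceeded. */
int self_converge(unsigned D[N], const unsigned T[N * N], int max_passes)
{
    int b = 1, c = 1, passes = 0;
    assert(D && T && max_passes > 0);
    while (b || c) {
        if (passes == max_passes)
            return -1;
        c = b;
        b = 0;
        D[0] = 0;
        for (int i = 1; i < N; i++) {
            unsigned m = BIGNUMBER;
            for (int j = 0; j < N; j++) {
                unsigned cand = D[j] + T[N * i + j];
                if (m >= cand)
                    m = cand;
            }
            if (D[i] != m)
                b = 1;              /* D changed during this pass */
            D[i] = m;
        }
        passes++;
    }
    return passes;
}

/* Assumed example costs: 1 from entry i-1 to entry i, 10 between any
   other pair, BIGNUMBER on the diagonal. Fixed point: D[i] = min(i, 10). */
void fill_example_costs(unsigned T[N * N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            T[N * i + j] = (j == i) ? BIGNUMBER : (j == i - 1) ? 1u : 10u;
}
```

With these example costs, flipping a bit of an entry of `D` after convergence and calling `self_converge` again brings `D` back to the same fixed point. A flip in `T` would change the problem itself and cannot be repaired this way, which matches the "recoverable" column above.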


6. Simulation of SEUs on the LEON3: SW modifications

• Use the modulo operator « % » when indexing an array, i.e.
    m=D[j%16]+T[((N*(i%16))+(j%16))%256];
• Assign every variable to a register of the register file by using the following « C » declaration (goal: reduce the number of registers used):
    register unsigned int variable asm ("register name");
• Initialize the variables b and c with an 8-bit number instead of « 1 », so that a single bit flip cannot make them equal to « 0 »
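
The effect of the first and third modifications can be checked with a small C sketch. The index expression is the one from the slide; the helper names and the flag value `0xFF` are assumptions (the slide only says "an 8-bit number"). The explicit register declaration is a compiler extension whose register names depend on the target, so it is not exercised here:

```c
#include <assert.h>

#define N 16

/* Index expression from the slide: a corrupted i or j is folded back
   into the 256 entries of T instead of addressing outside the array. */
unsigned safe_index_T(unsigned i, unsigned j)
{
    return ((N * (i % 16)) + (j % 16)) % 256;
}

/* One relaxation candidate using the masked indices. */
unsigned relax_candidate(const unsigned D[N], const unsigned T[N * N],
                         unsigned i, unsigned j)
{
    assert(D && T);
    return D[j % 16] + T[safe_index_T(i, j)];
}

/* Assumed 8-bit "true" value for b and c. */
#define FLAG_SET 0xFFu

/* Returns non-zero if the flag still reads as "true" after flipping
   one of its 8 bits. */
int flag_survives_single_flip(unsigned bit)
{
    assert(bit < 8);
    return (FLAG_SET ^ (1u << bit)) != 0;
}
```

A flip in a high-order bit of `i` or `j` then leaves the access unchanged, while a flip in one of the four low-order bits still selects a wrong but in-bounds element, which the self-converging loop can later correct.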


6. Simulation of SEUs on the LEON3: SEU injections on the modified version. Target = register file

• The running limit is set to 5 times the time required for the application to end execution without fault injection
• The erroneous results decrease from 11.3% to 4.45%
• The timeouts decrease from 6.6% to 2.6%

# Runs   # Errors       # Timeouts    # Converges
8000     356 (4.45%)    208 (2.6%)    2972 (37.15%)

Results of fault injection on the modified source code


6. Simulation of SEUs on the LEON3: SEU injections on the modified version. Target: other resources

Zone              # of runs   # of errors     # of timeouts
Inst. cache       12174       107 (0.88%)     385 (3.16%)
Data cache        12348       547 (4.42%)     0 (0%)
Multi-resources   88410       2196 (2.48%)    1415 (1.6%)

Results of fault injection in new resources

• Data and instruction caches are also very sensitive to SEUs. Both can be accessed by the CEU through load and store instructions
• A fault injection campaign was performed on each of the caches while the LEON3 executed the modified algorithm
• The last campaign was performed on all the resources at the same time (2075 registers of 32 bits each):
  - Register file
  - PC and nPC
  - Instruction cache
  - Data cache
• The running limit was set to 5
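
For the multi-resource campaign, each injection has to pick one bit among 2075 x 32 = 66400 candidates. The slide does not say how targets were drawn; the sketch below assumes a uniform random choice, and `seu_target` and `random_target` are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_REGISTERS 2075u   /* from the slide */
#define BITS_PER_REG  32u

typedef struct {
    uint32_t reg;        /* index in [0, NUM_REGISTERS) */
    uint32_t bit_mask;   /* one-hot mask of the bit to flip */
} seu_target;

/* Draws a register and a bit independently. rand() % n carries a small
   modulo bias, acceptable for a sketch; drawing the two fields separately
   avoids relying on RAND_MAX being larger than 66400. */
seu_target random_target(void)
{
    seu_target t;
    t.reg = (uint32_t)rand() % NUM_REGISTERS;
    t.bit_mask = UINT32_C(1) << ((uint32_t)rand() % BITS_PER_REG);
    return t;
}
```

Seeding the generator with `srand` before a campaign makes the sequence of targets reproducible, which helps when a run has to be replayed.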


6. Simulation of SEUs on the LEON3: Triple Modular Redundancy (TMR)

[Figure: block diagram. Labels: Core 1, Core 2, Core 3, TMR, Error, Timeout, Converge]

• A TMR was emulated: 3 LEON3 cores simultaneously executing the same self-convergent algorithm
• The comparison was done in the external PC
• SEUs can hit one, two or three cores in one simulation
• The executable is the modified self-convergence algorithm
• The TMR results will be:
  - Error: if there are two errors, or one error and a timeout
  - Timeout: if two timeouts occur
  - Converge: if the self-converging algorithm converges in at least one of the cores, with a correct result
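
The three TMR rules above can be written as a small decision function. This is a sketch under stated assumptions: the per-core outcome `CORE_CORRECT`, the verdict `TMR_CORRECT` for runs that need no recovery, and the order in which the rules are tested are not given on the slide, and `tmr_vote` is a hypothetical name:

```c
#include <assert.h>

typedef enum { CORE_CORRECT, CORE_CONVERGED, CORE_ERROR, CORE_TIMEOUT } core_outcome;
typedef enum { TMR_CORRECT, TMR_CONVERGE, TMR_ERROR, TMR_TIMEOUT } tmr_verdict;

/* Applies the slide's rules to the outcomes of the three cores.
   Assumed priority: Error, then Timeout, then Converge. */
tmr_verdict tmr_vote(const core_outcome cores[3])
{
    int errors = 0, timeouts = 0, converged = 0;
    assert(cores != 0);
    for (int k = 0; k < 3; k++) {
        if (cores[k] == CORE_ERROR)          errors++;
        else if (cores[k] == CORE_TIMEOUT)   timeouts++;
        else if (cores[k] == CORE_CONVERGED) converged++;
    }
    if (errors >= 2 || (errors >= 1 && timeouts >= 1))
        return TMR_ERROR;      /* two errors, or one error and a timeout */
    if (timeouts >= 2)
        return TMR_TIMEOUT;    /* two timeouts */
    if (converged >= 1)
        return TMR_CONVERGE;   /* at least one core recovered a correct result */
    return TMR_CORRECT;        /* assumed: a single faulty core is outvoted */
}
```

Under these assumptions a single error or a single timeout is masked by the two remaining cores, so only faults affecting at least two cores can produce a TMR error or timeout.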


6. Simulation of SEUs on the LEON3: Three-core fault injection results. Target: register file

• The running limit is set to 5 times the time required for the application to end execution without fault injection
• In 17.73% of the simulations the self-converging algorithm converges to correct results
• The error rate decreases from 4.45% to 0.64%
• The timeouts decrease from 2.6% to 0.18%

# Runs   # Errors       # Timeouts    # Converges
42543    276 (0.64%)    77 (0.18%)    7543 (17.73%)

Results of fault injection on the three-core processor


6. Simulation of SEUs on the LEON3: Three-core fault injection results. Target: all resources

# Runs   # of errors    # of timeouts   # of converges
100000   85 (0.085%)    15 (0.015%)     1825 (1.825%)

Results of fault injection on three cores for all resources

• The running limit is set to 5
• In 1.825% of the simulations the self-converging algorithm converges to correct results
• The erroneous results decrease from 2.48% to 0.085%
• The timeouts decrease from 1.6% to 0.015%


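6. Simulation of SEUs on the LEON3: Three-core fault injection results. Target: all resources

Distribution of errors on all resources (DC: data cache, IC: instruction cache, RF: register file)

[Pie chart: 48 double SEUs. Categories: 1DC/1IC, 2DC, 1DC/1RF, 1DC/1nPC, 1DC/1PC. Slice values (order as extracted): 26, 18, 2, 1, 1]

[Pie chart: 37 triple SEUs. Categories: 2IC/1DC, 2DC/1PC, 2DC/1IC, 3IC, 3DC, 1DC/1IC/1PC, 2DC/1RF. Slice values (order as extracted): 9, 1, 15, 1, 9, 1, 1]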


7. Conclusions and future work

• The sensitivity to SEUs of a self-converging algorithm was studied
• Fault injection experiments were performed on a benchmark self-converging program executed by a LEON3 processor implemented on an FPGA
• The CEU (Code Emulated Upsets) approach was adopted to perform SEU fault injection experiments using the ASTERICS test platform
• The results obtained show the fault tolerance and the "Achilles' heels" of the studied program
• Different versions were explored. The one implementing a TMR was immune to SEUs and quite robust with respect to MBUs. SEUs in the voter were not injected
• In future work, new versions of self-converging algorithms will be implemented in a Network on Chip to perform radiation ground testing


Acknowledgements  

•  Dr.  Francisco  Javier  Franco  Peláez  (UCM)  

•  Dr.  Juan  Antonio  Clemente  (UCM)  

•  Dr.  Devan  Sohier  (Prisme,  Univ.  de  Versailles)  

•  Dr.  Alain  Bui  (Prisme,  Univ.  de  Versailles)  

•  Dr.  Greicy  Costa  (TIMA  Lab.)          


THANK YOU FOR YOUR ATTENTION!

TIME FOR QUESTIONS
