Alibaba Cloud boosts failure prediction with logfile timestamps

Machine learning helps, but more data catches more faults - so Chinese champ has shared its data


Alibaba Cloud has revealed homebrew tech it used to improve server fault prediction and detection, and claims the tool beats comparable tech at detecting problems by ten percent.

The Chinese cloud champ's claims emerged last week in a paper [PDF] presented at the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

The document points out that reliability is a major selling point for public clouds, making predicting failures an important ability. Log files, the authors observe, contain plenty of info on "exceptions" to normal performance that indicate potential problems. The authors opine that existing tools that mine logs lean on machine learning and deep learning to predict failures, while a more obvious indicator – timestamps – isn't paid the attention it is due.

Here's the thinking, in a nutshell:

The time interval lengths between successive exceptions often reflect the urgency and severity of the anomalies. For instance, a server with 1,000 "machine check exceptions" in three days may not fail, but a server with 1,000 such exceptions in five minutes tends to fail. Therefore, effective failure prediction must adequately make use of the exception timestamp information.
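To make that arithmetic concrete, here is a minimal Python sketch that compares the same exception count spread over two different windows. It is not taken from the paper and is not how TAAT works internally: the function names `interval_stats` and `evenly_spaced`, the evenly spaced synthetic timestamps, and the start date are all assumptions made purely for illustration.

```python
from datetime import datetime, timedelta
from statistics import median


def interval_stats(timestamps):
    """Summarize how densely a list of exception timestamps is packed.

    Hypothetical helper for illustration only; not part of TAAT.
    """
    ts = sorted(timestamps)
    if len(ts) < 2 or ts[0] == ts[-1]:
        raise ValueError("need at least two distinct timestamps")
    # Gaps between successive exceptions, in seconds.
    gaps = [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]
    span = (ts[-1] - ts[0]).total_seconds()
    return {
        "count": len(ts),
        "span_seconds": span,
        "median_gap_seconds": median(gaps),
        "rate_per_minute": len(ts) / (span / 60),
    }


def evenly_spaced(start, total_span, n):
    """Generate n synthetic timestamps spread evenly across total_span."""
    step = total_span / (n - 1)
    return [start + i * step for i in range(n)]


# Arbitrary start date; only the spacing between events matters here.
start = datetime(2024, 1, 1)

# Same count of exceptions, very different urgency.
slow = interval_stats(evenly_spaced(start, timedelta(days=3), 1000))
fast = interval_stats(evenly_spaced(start, timedelta(minutes=5), 1000))

print("three days:  ", slow)
print("five minutes:", fast)
```

A plain count sees both servers as identical at 1,000 exceptions. The interval view does not: three days works out to roughly 0.23 exceptions per minute, while five minutes works out to about 200 per minute, which is the kind of signal the authors argue timestamps carry.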

Alibaba Cloud therefore created its own tool called Time-Aware Attention-Based Transformer (TAAT) to analyze timestamp info.

TAAT doesn't entirely ignore ML tools. Instead, it uses Bidirectional Encoder Representations from Transformers (BERT) – a language model developed by Google that represents text as vectors and has been used to predict server failures. The paper asserts, however, that BERT hasn't been tuned to make full use of log timestamps.

Alibaba's tool therefore relies on BERT for some failure analysis and compares that with TAAT's analysis of logfile timestamps. The paper contains a lot of math describing exactly how Alibaba analyzes log info, but the bottom line was apparently a ten percent improvement in fault predictions – and presumably slightly more reliable cloudy IaaS.

Alibaba's boffins think TAAT's output is also useful because it doesn't need expert analysis – meaning folks familiar with cloudy crashes aren't needed to help as often. It's already in production at Alibaba Cloud.

TAAT appears not to be available for download. But Alibaba Cloud has posted a colossal dataset comprising "∼2.7 billion syslogs from ∼300,000 servers in a four-month period of the real productional system of Alibaba Cloud" to help researchers consider how to develop log sampling strategies of their own to inform future failure prediction efforts.

The authors have also posted a video outlining TAAT's operation. ®
