A few technical details we got wrong (read: pits we dug for ourselves) at Xiaomi over the past six years

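In 2010, the first wave of us building Xiaomi's servers had very little internet experience, and we cobbled Miliao (米聊) together and got it online. In the years since, many of those technical frameworks have settled into teams throughout the company.

Looking back today, there are many details we really didn't think through at the time (and they are still pitfalls now).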

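Detail 1: Putting Java behind nginx proxy_pass makes deployments non-graceful

A typical setup was nginx + resin/tomcat, with settings like weight=1 max_fails=3 on the nginx side.

At deploy time we had no way to transition smoothly (for example, editing every nginx config to take one server out of rotation first), so every deployment dropped some requests.

Luckily, with native mobile apps, users don't notice a one- or two-second outage.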
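The missing step is draining: stop sending new requests to one server, wait for its in-flight requests to finish, deploy, then put it back. With stock nginx the nearest equivalent is marking that server `down` in the `upstream` block and running `nginx -s reload`, which lets the old worker processes finish their current requests; you would then wait for those old workers to exit before restarting the Java process. Below is a minimal Python sketch of the same idea as an in-process pool. `Pool`, `drain`, and `rolling_deploy` are hypothetical names for illustration, not part of nginx, resin, or tomcat.

```python
import threading
import time
from dataclasses import dataclass


@dataclass
class Backend:
    """One upstream Java server (e.g. a resin/tomcat instance)."""
    name: str
    enabled: bool = True
    in_flight: int = 0


class Pool:
    """Hypothetical round-robin pool that can drain a backend before it is redeployed."""

    def __init__(self, names):
        self.backends = [Backend(n) for n in names]
        self._lock = threading.Lock()
        self._next = 0

    def _find(self, name):
        return next(b for b in self.backends if b.name == name)

    def acquire(self):
        """Route a request to the next enabled backend and count it as in flight."""
        with self._lock:
            for _ in range(len(self.backends)):
                b = self.backends[self._next % len(self.backends)]
                self._next += 1
                if b.enabled:
                    b.in_flight += 1
                    return b
        raise RuntimeError("no enabled backend")

    def release(self, backend):
        """Mark one request on `backend` as finished."""
        with self._lock:
            backend.in_flight -= 1

    def drain(self, name, timeout=30.0, poll=0.01):
        """Stop routing new requests to `name`; return True once none are in flight."""
        with self._lock:
            backend = self._find(name)
            backend.enabled = False
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            with self._lock:
                if backend.in_flight == 0:
                    return True
            time.sleep(poll)
        return False

    def enable(self, name):
        with self._lock:
            self._find(name).enabled = True


def rolling_deploy(pool, deploy_one, timeout=30.0):
    """Redeploy backends one at a time so the rest keep serving traffic."""
    for backend in pool.backends:
        if not pool.drain(backend.name, timeout):
            # Left out of rotation so an operator can look at the stuck requests.
            raise TimeoutError(f"{backend.name} still has requests in flight")
        deploy_one(backend.name)  # e.g. restart resin/tomcat with the new build
        pool.enable(backend.name)
```

A request handler would call `acquire()` before proxying and `release()` in a `finally` block. Only one server is out of rotation at a time, so capacity drops by one server during the deploy instead of requests failing outright.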

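Detail 2: Lots of monitoring data, little of it useful

When production breaks, it is almost always something going wrong between one module and another.

For years we never thought about abstracting out that kind of module-to-module data.

All our monitoring data was each module's own access data (QPS, response time, and so on). The typical problem: someone tells you calls to your service are slow or keep failing, you look at your own numbers, and they look fine.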
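One way to get the missing data is to measure every call on the caller's side and key the numbers by the (caller, callee) edge, so that "calls from web to user-service are slow" shows up directly instead of being hidden inside the callee's own healthy-looking averages. A minimal sketch follows; `CallerMetrics` and the service names are hypothetical, and a real system would export these counters to its monitoring backend rather than keep them in a dict.

```python
import time
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EdgeStats:
    """Counters for one caller -> callee edge, as seen by the caller."""
    calls: int = 0
    errors: int = 0
    total_ms: float = 0.0

    @property
    def error_rate(self):
        return self.errors / self.calls if self.calls else 0.0

    @property
    def avg_ms(self):
        return self.total_ms / self.calls if self.calls else 0.0


class CallerMetrics:
    """Records latency and errors per (caller, callee), measured where the call is made."""

    def __init__(self, caller, clock=time.perf_counter):
        self.caller = caller
        self.clock = clock
        self.edges = defaultdict(EdgeStats)

    def call(self, callee, fn, *args, **kwargs):
        """Invoke fn(*args, **kwargs) as a call to `callee` and record the outcome."""
        stats = self.edges[(self.caller, callee)]
        start = self.clock()
        try:
            return fn(*args, **kwargs)
        except Exception:
            stats.errors += 1  # timeouts, connection errors, etc. count against the edge
            raise
        finally:
            stats.calls += 1
            stats.total_ms += (self.clock() - start) * 1000.0
```

Each outbound call is wrapped, e.g. `metrics.call("user-service", client.get_user, uid)` where `client.get_user` stands for whatever RPC stub the caller already uses.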

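Detail 3: Giving Android and iOS an HTTP framework

If we hadn't had paoding-rose back then, I think we would have built a standard TCP server and sent data to the phones as protobuf (pb).

The advantage is that on poor mobile networks HTTP ties your hands, while TCP is far more flexible.

On this point I also have to call out, and laugh at, a certain Weibo client that fell into the same pit.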
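A plain TCP protocol needs its own message framing, because TCP delivers a byte stream with no message boundaries, and on a flaky mobile link one message often arrives in several pieces. A common choice, assumed here rather than taken from the original design, is a 4-byte big-endian length prefix in front of each payload; the payload itself would be protobuf bytes (for example from a message's `SerializeToString()`). A minimal Python sketch, where `MAX_FRAME` is a hypothetical safety limit:

```python
import struct

HEADER = struct.Struct(">I")  # 4-byte big-endian payload length
MAX_FRAME = 1 << 20           # hypothetical 1 MiB cap to reject corrupt lengths


def encode_frame(payload: bytes) -> bytes:
    """Prefix a serialized message (e.g. protobuf bytes) with its length."""
    if len(payload) > MAX_FRAME:
        raise ValueError("payload too large")
    return HEADER.pack(len(payload)) + payload


class FrameDecoder:
    """Reassembles frames from arbitrarily split chunks of the TCP stream."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, data: bytes) -> list:
        """Append newly received bytes; return every frame that is now complete."""
        self._buf.extend(data)
        frames = []
        while len(self._buf) >= HEADER.size:
            (length,) = HEADER.unpack_from(self._buf, 0)
            if length > MAX_FRAME:
                raise ValueError("frame length exceeds limit")
            end = HEADER.size + length
            if len(self._buf) < end:
                break  # wait for the rest of this frame
            frames.append(bytes(self._buf[HEADER.size:end]))
            del self._buf[:end]
        return frames
```

On a real socket, each `recv()` result is passed to `feed()`, and whatever complete frames come back are handed to the protobuf parser; partial frames simply stay buffered until the next read.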

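Detail 4: Choosing Java without a language-level expert

Had we picked PHP back then, I think far fewer of our production services would need restarting years later (through an NIO leak or some other leak, a service eventually hangs while still appearing alive, and a restart keeps it going for a long while).

Of course, looking at it now, I lean more toward C/C++: if you write it carefully and plainly there are no big pitfalls, and it runs stably.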

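Detail 5: Relying too heavily on the relational database MySQL

There is nothing wrong with using MySQL: it is convenient and lets you ship business features fast.

But midway, when we wanted multi-IDC disaster recovery, we were stuck: the relations were simply too complex to make it work.

If we had used key-value storage everywhere and moved much of what MySQL did into business code, going multi-IDC would have been a piece of cake.

And for an internet business, once traffic ramps up, multi-IDC matters enormously.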
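To illustrate what moving MySQL's work into business code means: a JOIN between a user table and a friendship table becomes key lookups that the application stitches together, and the "index" (a user's friend list) is maintained by the application itself. The sketch below uses a Python dict as a stand-in for the key-value store; the key layout `user:{id}` / `friends:{id}` and all function names are hypothetical. Because every read and write addresses a single key, data laid out this way is generally easier to partition or replicate between IDCs than rows tied together by relations.

```python
import json


class KVStore:
    """Stand-in for a key-value store; a real deployment would use a networked store."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value

    def mget(self, keys):
        return [self._data.get(k) for k in keys]


def save_user(kv, user_id, name):
    kv.put(f"user:{user_id}", json.dumps({"id": user_id, "name": name}))


def add_friend(kv, user_id, friend_id):
    """Maintain the user's friend list in application code (no foreign keys, no JOIN).

    This read-modify-write is not atomic; concurrent writers would need a
    compare-and-set or a native list/set type in the real store.
    """
    key = f"friends:{user_id}"
    ids = json.loads(kv.get(key) or "[]")
    if friend_id not in ids:
        ids.append(friend_id)
        kv.put(key, json.dumps(ids))


def friend_names(kv, user_id):
    """Application-side equivalent of joining the friendship table to the user table."""
    ids = json.loads(kv.get(f"friends:{user_id}") or "[]")
    records = kv.mget([f"user:{i}" for i in ids])
    return [json.loads(r)["name"] for r in records if r is not None]
```

The trade-off is the one the paragraph above describes: consistency rules that MySQL used to enforce (uniqueness, dangling references) now live in business code, as the skipped missing users in `friend_names` show.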

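Detail 6: Overusing RabbitMQ

Using an MQ for peak shaving in a project that needs it is perfectly reasonable, but when a project uses MQs everywhere it is a disaster (picture a large system where no module knows whom it calls or who calls it, only which objects it consumes).

Later on, finding out which modules were calling a given module became basically impossible, because these open-source MQs give no thought to real operational needs; their design goal is entirely about getting started quickly.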
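Part of the fix is to make every message carry who sent it and to record each producer-to-consumer edge as messages are consumed, which rebuilds the call graph that the queue hides. With RabbitMQ the producer's identity could travel in the AMQP message headers; the sketch below is a hypothetical in-memory broker instead, and the `x-producer` header name is an assumption, not a RabbitMQ convention.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class Message:
    topic: str
    body: bytes
    headers: dict = field(default_factory=dict)


class TracedBroker:
    """In-memory stand-in for an MQ broker that records who talks to whom."""

    def __init__(self):
        self.queues = defaultdict(deque)
        self.edges = defaultdict(int)  # (producer, topic, consumer) -> message count

    def publish(self, producer, topic, body):
        # Stamp the sender's identity on every message (hypothetical header name).
        self.queues[topic].append(Message(topic, body, {"x-producer": producer}))

    def consume(self, consumer, topic):
        """Pop one message, or return None if the queue is empty; record the edge."""
        if not self.queues[topic]:
            return None
        msg = self.queues[topic].popleft()
        producer = msg.headers.get("x-producer", "unknown")
        self.edges[(producer, topic, consumer)] += 1
        return msg

    def callers_of(self, consumer):
        """Answer 'who is calling this module?' as sorted (producer, topic) pairs."""
        return sorted({(p, t) for (p, t, c) in self.edges if c == consumer})
```

The point of the design is that the edge table is built from traffic that actually flowed, so it stays accurate even when nobody has documented which module publishes to which topic.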

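Afterword

There are quite a few details, and the pits are all still there (a few we have already climbed out of, so they are not listed here). Follow-on teams and projects are still stuck in them. If a project aims to grow big and strong, it should stay out of these pits from the start.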


English version

What we got wrong on the XIAOMI servers in the past six years

We deployed the MILIAO server, a sketchy patched-together version, in 2010. The first team members who developed XIAOMI's server side had little internet experience.

Over the years, those technology frameworks have spread to every team at XIAOMI.

Looking at the technical details now, we gave them too little thought at the time.

Detail 1. nginx's proxy_pass in front of Java servers is not graceful when deploying

A typical setup is nginx in front of resin or tomcat, with a configuration like 'weight=1 max_fails=3' in nginx.

Deployments are not graceful. We had no way to take a server out of the load balancer before deploying it, so every deployment lost some requests.

Fortunately, users of a native app do not notice a service gap of a second or two.

Detail 2. Too much monitoring data, too little of it useful

Unsurprisingly, when something breaks in production, the problem lies between one module and another module it calls.

Over the past years we never thought about collecting module-to-module call data.

We had plenty of data like QPS and latency percentiles, but all of it was recorded by the service being called rather than by the caller.

A frequent complaint was that everything looked fine in our data, yet not to the caller.

Detail 3. An HTTP framework for Android and iOS

Without the paoding-rose framework, I think we would have developed a TCP server that transferred data with protobuf.

That would have been better for the Chinese mobile network environment, because TCP gives far more freedom than HTTP.

The Weibo app is also built on HTTP and is stuck in the same pit.

Detail 4. Java without a specialist

I think our online services would seldom need restarting now if we had chosen PHP at that time.

Now I prefer C/C++. It causes fewer problems.

Detail 5. Depending too much on an RDBMS

There is nothing wrong with designing a project around MySQL. It is easy and quick.

But taking the service to multiple IDCs then becomes difficult, in practice impossible.

Multi-IDC would have been easy if we had designed the service around a key-value store, with the logic done in the application rather than in MySQL.

This matters a great deal once your project grows.

Detail 6. Depending too much on RabbitMQ

Using an MQ in a project that needs peak shaving is fine. Using an MQ everywhere in an ordinary project is a disaster.

It becomes difficult to find out who is calling whom.

PS

There are many details. If your service is going to grow, avoid these mistakes from the start.


Original article. If you repost it, please credit: reposted from 五四陈科学院 [http://www.54chen.com]
