Introduction: This article shares Datacake's experience applying AI algorithms to big data governance. The sharing is divided into five parts. The first part clarifies the relationship between big data and AI: big data does not only serve AI — AI can also be used to optimize big data services, and the two support and depend on each other. The second part introduces the practice of using an AI model to comprehensively evaluate the health of big data tasks, which provides a quantitative basis for subsequent data governance. The third part introduces the practice of using an AI model to intelligently recommend runtime parameter configurations for Spark tasks, improving the utilization of cloud resources. The fourth part introduces the practice of intelligently recommending the task execution engine in SQL query scenarios. The fifth part looks ahead to AI applications across the whole big data life cycle.
Full-text catalog:
1. Big data and AI
2. Health assessment of big data tasks
3. Spark task intelligent parameter adjustment
4. Intelligent selection of SQL task execution engine
5. The application prospect of AI algorithm in big data governance.
Speaker | Li Weimin, Algorithm Engineer, Eggplant Technology
Editor | Charles
Produced by | DataFun community
01
Big data and AI
It is generally believed that cloud computing collects and stores massive data, forming big data, and that mining and learning from big data then produces AI models. This view implicitly assumes that big data serves AI, but ignores the fact that AI algorithms can also feed back into big data: the relationship between them is two-way, mutually supportive and interdependent.
The life cycle of big data can be divided into six stages, each facing its own problems, and proper use of AI algorithms can help solve them.
Data acquisition: This stage focuses on the quality, frequency and security of data collection — for example, whether the collected data is complete, whether collection is too fast or too slow, and whether the data has been desensitized or encrypted. AI can help here, for instance by evaluating the reasonableness of log collection based on similar applications, or using anomaly detection algorithms to spot sudden surges or drops in data volume.
Data transmission: This stage focuses on the availability, integrity and security of data; AI algorithms can be used for fault diagnosis and intrusion detection.
Data storage: This stage focuses on whether the storage structure of data is reasonable, whether resource usage is low enough, and whether the data is secure; AI algorithms can assist in evaluation and optimization here as well.
Data processing: This is the stage where optimization most visibly pays off. The core problem is improving processing efficiency and reducing resource consumption, and AI can optimize from multiple angles.
Data exchange: Cooperation between enterprises is increasing, which raises data security concerns. Algorithms apply here too — for example, the popular federated learning can help share data better and more safely.
Data destruction: Data cannot simply be kept forever, so we must decide when it can be deleted and whether deletion carries risk. On top of business rules, AI algorithms can help judge the timing of deletion and its associated impact.
Overall, data life cycle management has three goals: high efficiency, low cost, and security. In the past we relied on expert experience to formulate rules and strategies, which had obvious disadvantages — high cost and low efficiency. Proper use of AI algorithms avoids these drawbacks and feeds back into the construction of basic big data services.
—
02
Health Assessment of Big Data Tasks
At Eggplant Technology, the first AI application scenario to land in production was the health assessment of big data tasks.
On the big data platform, thousands of tasks run every day. Many of them, however, stop at producing correct output: no attention is paid to their running time or resource consumption, which leads to inefficiency and wasted resources across many tasks.
Even when data developers do start to care about task health, it is difficult to evaluate accurately. Tasks have many related indicators, such as failure rate, elapsed time and resource consumption, and tasks naturally differ in complexity and data volume, so simply taking the absolute value of one indicator as the evaluation standard is clearly unreasonable.
Without a quantified notion of task health, it is hard to determine which tasks are unhealthy and need governance, let alone where the problem lies and where to start. Even after governance, we cannot tell how effective it was; some indicators may improve while others deteriorate.
Demand: Facing the problems above, we urgently need a quantitative indicator that accurately reflects a task's overall health. Hand-crafted rules are inefficient and incomplete, so we turned to machine learning. The goal is a model that gives each task a quantitative score and its position in the global distribution, along with the task's main problems and suggested remedies.
To meet this demand, our functional module displays the key information of all tasks under an owner's name in the management interface, such as score, task cost, CPU utilization and memory utilization. The health of each task is then clear at a glance, making it easy for the task owner to manage tasks going forward.
Second, the scoring function is modeled as a classification problem. Intuitively, task scoring looks like a regression problem whose output is an arbitrary real number between 0 and 100, but that would require enough scored samples, and manual labeling is costly and unreliable.
We therefore transform it into a classification problem, since the classification probability given by the model can be further mapped to a real-valued score. Tasks are divided into two classes, good tasks (1) and bad tasks (0), labeled by big data engineers. A good task typically means one that runs quickly and consumes few resources for a given workload and complexity.
The model training process is as follows:
The first step is sample preparation. Our samples come from historical task data; the features include running time, resources used, whether execution failed, and so on. The labels are marked good or bad by big data engineers according to rules or experience. We then train the model. We tried LR, GBDT, XGBoost and other models, and both theory and practice showed that XGBoost classified best. The model ultimately outputs the probability that a task is a "good task"; the larger the probability, the higher the final mapped task score.
After training, 19 features were selected from nearly 50 original features, which basically suffice to determine whether a task is good. For example, tasks with many failures or with low resource utilization mostly do not score high, which is consistent with human subjective judgment.
With model-based scores, tasks scoring between 0 and 30 are unhealthy and urgently need governance; tasks between 30 and 60 are in acceptable health; and tasks scoring 60 or above are healthy and only need to maintain the status quo. With a quantitative indicator in hand, task owners can be guided to actively govern their tasks, achieving the goal of cutting costs and improving efficiency.
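As a small illustration, the probability-to-score mapping and the score bands above can be sketched as follows. The linear mapping is an assumption on our part — the source only says that a higher "good task" probability yields a higher score:

```python
def health_score(p_good: float) -> int:
    """Map the classifier's P(good task) to a 0-100 score.

    A simple linear mapping is assumed; the text only states that a
    higher probability maps to a higher score.
    """
    return round(p_good * 100)

def health_band(score: int) -> str:
    """Bucket a score into the governance bands described in the text."""
    if score < 30:
        return "urgent"      # unhealthy, govern as soon as possible
    if score < 60:
        return "acceptable"  # passable health
    return "good"            # healthy, maintain the status quo
```

With this banding, an owner's task list can be sorted by score and filtered to the "urgent" bucket for governance.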
Applying the model brought us the following benefits:
(1) Task owners can see the health of their tasks and, via scores and rankings, know which tasks need governance;
(2) The quantitative indicator provides a basis for subsequent task governance;
(3) How much benefit and improvement task governance achieved can also be demonstrated quantitatively through scores.
—
03
Spark task intelligent parameter adjustment

The second application scenario is intelligent parameter tuning for Spark tasks. A Gartner survey found that 70% of the cloud resources consumed by cloud users are wasted unnecessarily. When applying for cloud resources, many people over-provision to make sure their tasks succeed, which causes unnecessary waste. Many others simply keep the default configuration when creating tasks, which is rarely optimal. Careful configuration can achieve very good results: it guarantees both running efficiency and successful execution while saving considerable resources. However, parameter configuration demands a lot from users: beyond understanding what each configuration item means, they must also consider the interactions between items. Even expert experience struggles to reach the optimum, and rule-based strategies are hard to adjust dynamically.
This raises a demand: we want a model that intelligently recommends the optimal runtime parameter configuration for a task, improving the task's cloud resource utilization while keeping its original running time unchanged.
The parameter-tuning module is designed for two situations: first, for existing tasks, the model should recommend the most suitable configuration based on the task's historical runs; second, for new tasks that have not yet gone online, the model should derive a reasonable configuration by analyzing the task itself.
The next step is model training. First we must determine the model's output targets. There are more than 300 configurable items, and the model cannot set them all, so after testing and investigation we chose the three parameters with the greatest influence on task performance: the number of executor cores, the total memory, and the number of instances. Each configuration item has a default value and an adjustable range; in effect, given this parameter space, the model only has to find the optimal point within it.
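For illustration, such a parameter space might be declared like this. The property names follow Spark's standard executor settings, but the `_gb` suffix, the defaults, and the bounds are all hypothetical — real values depend on the cluster:

```python
# Hypothetical tuning space for the three chosen parameters. Property names
# mirror Spark's executor settings; memory is held as an integer number of
# GB for simplicity. All defaults and bounds are illustrative only.
PARAM_SPACE = {
    "spark.executor.cores":     {"default": 2, "min": 1, "max": 8},
    "spark.executor.memory_gb": {"default": 8, "min": 4, "max": 64},
    "spark.executor.instances": {"default": 4, "min": 1, "max": 20},
}

def in_bounds(name: str, value: int) -> bool:
    """Check that a candidate value stays inside the adjustable range."""
    spec = PARAM_SPACE[name]
    return spec["min"] <= value <= spec["max"]
```

Any configuration the model proposes can be validated against this space before being applied.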

In the training stage there were two schemes. Scheme one is learning the experience rules. In the early stage, parameters were recommended by rules, and the effect after going online was good, so we first let the model learn this rule set in order to go online quickly. The training samples are more than 70,000 task configurations previously computed by the rules. The features are the tasks' historical run data (such as the data volume processed, the resources used, and the time consumed) plus some statistics (such as the average and maximum consumption over the past seven days).
As the base model we chose a multiple regression model with multiple dependent variables. A common regression model has a single output: many independent variables but only one dependent variable. Here we want to output three parameters, so we adopt a multiple regression model with multiple dependent variables; in essence it is still an LR model.
The picture above shows the theoretical basis of this model. On the left are the multiple labels, i.e. the three configuration items; β is the coefficient of each feature and σ is the error. Training works just like univariate regression: least squares is used, minimizing the sum of squares of all the elements in σ.
The advantage of scheme one is that the rules can be learned quickly at relatively low cost. The drawback is that its optimization ceiling is at best the same as the rules'; exceeding them is difficult.
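A minimal sketch of the multi-output regression idea on synthetic data — the feature count, coefficients, and noise level are random stand-ins, not the production model:

```python
import numpy as np

# Multi-output least squares: one coefficient matrix B maps task features
# to all three outputs (cores, memory, instances) at once. Data is synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                       # historical task features
B_true = rng.normal(size=(5, 3))                    # hidden coefficients
Y = X @ B_true + 0.01 * rng.normal(size=(200, 3))   # three target parameters

# Closed-form least-squares fit, solved jointly for all dependent variables
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

Each column of `B_hat` corresponds to one output, so fitting the three targets jointly is equivalent to three independent linear regressions sharing the same features — which is why the text says the essence is still an LR model.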

The second scheme is Bayesian optimization, which, somewhat like reinforcement learning, searches the parameter space for the optimal configuration. A Bayesian framework is adopted because it exploits the results of previous attempts as prior knowledge for the next one, so it quickly moves toward better regions. The whole training process runs inside the parameter space: a configuration is sampled and the task is run with it; after the run, indicators such as utilization and cost are examined to judge whether it is optimal; these steps repeat until tuning is complete. After training there is also a useful trick at serving time: if a new task is sufficiently similar to a historical one, there is no need to compute a configuration again — the previous optimal configuration can be reused directly.
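The skeleton of that trial loop can be sketched as below. For simplicity the next candidate is drawn at random; in the real scheme a Bayesian surrogate proposes it from earlier trials. The search space and cost function are stand-ins, not real platform values:

```python
import random

# Illustrative search space over the three tuned parameters (stand-in values).
SPACE = {"cores": range(1, 9), "memory_gb": range(4, 65, 4), "instances": range(1, 21)}

def run_and_measure(cfg):
    """Stand-in for actually running the task and scoring utilization/cost.
    Pretend the sweet spot is 4 cores, 16 GB, 8 instances."""
    return (abs(cfg["cores"] - 4)
            + abs(cfg["memory_gb"] - 16) / 4
            + abs(cfg["instances"] - 8))

def search(n_trials=200, seed=0):
    """Sample a configuration, run it, keep the best; repeat until done.
    A Bayesian optimizer would replace the random choice with an informed one."""
    rng = random.Random(seed)
    best_cfg, best_cost = None, float("inf")
    for _ in range(n_trials):
        cfg = {k: rng.choice(list(v)) for k, v in SPACE.items()}
        cost = run_and_measure(cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost
```

The expensive part in practice is `run_and_measure`, which is why the text notes that reusing a similar historical task's optimal configuration is such a useful shortcut.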

After trialing both schemes, clear effects can be seen. For existing tasks, after modification according to the configuration parameters recommended by the model, more than 80% of tasks improved their resource utilization by about 15%, and some tasks even doubled it. But both schemes have defects: the rule-learning regression model has a low optimization ceiling, while the globally optimizing Bayesian model is expensive because of all the attempts it must make.
The future exploration directions are as follows:
Semantic analysis: Spark semantics are rich, with different code structures and operator functions that are closely related to parameter configuration and resource consumption. At present we only use a task's historical runs and ignore the Spark semantics themselves, which wastes information. The next step is to go down to the code level, analyze the operator functions a Spark task contains, and tune at a finer granularity accordingly.
Classification-based tuning: Spark has many application scenarios, such as pure analysis, development, and processing. The tuning space and objectives differ per scenario, so tuning should be done per category.
Engineering optimization: One practical difficulty is that samples are few and testing is expensive, which requires the cooperation of the relevant teams to optimize the project or process.
—
04
Intelligent selection of SQL task execution engine
The third application scenario is the intelligent choice of SQL query task execution engine.
Background:
(1) The SQL query platform is the big data product most users touch most often and experience most directly. Data analysts, developers and product managers all write plenty of SQL every day to get the data they want.
(2) Many people pay no attention to the underlying execution engine when running SQL tasks. For example, Presto computes purely in memory: in simple query scenarios its advantage is faster execution, but its disadvantage is that if memory is insufficient the query fails outright. Spark, by contrast, suits complex scenarios with large data volumes: even if OOM occurs it spills to disk, avoiding task failure. Different engines therefore suit different task scenarios.
(3) SQL query effectiveness must weigh both execution time and resource consumption; one should neither chase query speed regardless of resources nor sacrifice query efficiency to save resources.
(4) The industry has three traditional engine-selection approaches: RBO, CBO and HBO. RBO, the rule-based optimizer, has rules that are hard to write and rarely updated. CBO is cost-based; chasing cost optimization too hard may cause task execution to fail. HBO optimizes based on historical task runs and is limited to historical data.
In the function module's design, after a user writes a SQL statement and submits it for execution, the model automatically judges which engine to use and shows a prompt window; the user makes the final decision on whether to execute with the recommended engine.

The overall scheme is to recommend the execution engine based on the SQL statement itself. The SQL text reveals which tables and functions are used, and this information directly determines the complexity of the SQL, which in turn affects engine choice. Training samples come from historically executed SQL statements, and labels are assigned according to their historical execution: for example, long-running tasks over huge data volumes are labeled as suited to Spark, and the rest as SQL suited to Presto. Sample features are extracted with NLP techniques — N-grams plus TF-IDF: roughly, phrases are extracted and weighted by their frequency in statements, so that key phrase groups stand out. The resulting feature vectors are very large, so we first select 3,000 features with a linear model and then train an XGBoost model as the final predictor.
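A toy version of the n-gram plus TF-IDF featurization, standard-library only — the real pipeline presumably uses a proper NLP toolkit, followed by the linear-model feature selection described above:

```python
import math
import re
from collections import Counter

def ngrams(sql: str, n: int = 2):
    """Word n-grams of a SQL statement (a crude stand-in for real tokenization)."""
    toks = re.findall(r"\w+", sql.lower())
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def tfidf(docs, n: int = 2):
    """Toy TF-IDF over n-gram features: one sparse weight dict per statement."""
    counts = [Counter(ngrams(d, n)) for d in docs]
    df = Counter(g for c in counts for g in set(c))  # document frequency
    N = len(docs)
    out = []
    for c in counts:
        total = sum(c.values())
        out.append({g: (tf / total) * math.log((1 + N) / (1 + df[g]))
                    for g, tf in c.items()})
    return out
```

N-grams shared by every statement (e.g. `select a`) get zero weight under this idf form, while engine-discriminating phrases such as join clauses keep positive weight — exactly the signal the classifier needs.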

After training, we can see that the model's prediction accuracy is fairly high, around 90% or more.
The final model's online workflow is: after the user submits SQL, the model recommends an execution engine. If it differs from the engine the user originally selected, the language conversion module is called to convert the SQL statement. If execution fails after the engine switch, a failover mechanism switches back to the user's original engine to ensure the task succeeds.
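The flow in this paragraph can be sketched as below; `recommend`, `translate`, and `run` are hypothetical stand-ins for the platform's engine-recommendation model, SQL dialect converter, and execution service:

```python
def execute_sql(sql, user_engine, recommend, translate, run):
    """Run on the model-recommended engine; fail over to the user's choice."""
    engine = recommend(sql)
    stmt = translate(sql, engine) if engine != user_engine else sql
    try:
        return run(stmt, engine)
    except Exception:
        # Failover: retry the original statement on the user's own engine.
        return run(sql, user_engine)
```

The broad `except` mirrors the text's guarantee: whatever goes wrong on the recommended engine, the user's original choice is the safety net.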
The benefit of this practice is that the model automatically selects the most suitable execution engine and completes the subsequent statement conversion, with no extra learning required from users.
In addition, the engine recommended by the model basically preserves the original execution efficiency while reducing the failure rate, so the overall user experience improves.
Finally, reduced unnecessary use of high-cost engines plus the lower task failure rate bring down overall resource cost consumption.
From the second part through the fourth, we shared three applications of AI algorithms on the big data platform. One visible characteristic is that the algorithms used are not particularly complicated, yet the effects are very noticeable. This encourages us to proactively look for pain points or room for optimization in the operation of the big data platform. Once an application scenario is identified, we can try different machine learning methods on the problem and thereby realize AI algorithms feeding back into big data.
—
05
Application Prospect of AI Algorithm in Big Data Governance
Finally, we look forward to the application scenario of AI algorithm in big data governance.
The three application scenarios described above focus on the data processing stage. In fact, echoing the relationship between AI and big data in the first section, AI can play a role across the entire data life cycle.
For example, in the data acquisition stage it can judge whether log collection is reasonable; during transmission it can do intrusion detection; during processing it can further cut costs and raise efficiency; during exchange it can help ensure data security; and at destruction it can judge the timing of destruction and the associated impact. AI has many application scenarios on a big data platform, and these examples are only a starting point. We believe the mutually supportive relationship between AI and big data will become even more prominent in the future: AI helps big data platforms collect and process data better, and better data quality helps train better AI models — a virtuous circle.
—
06
Question and answer session
Q1: What kind of rule engine is used? Is it open source?
A1: The parameter-tuning rules here were formulated by our big data colleagues from early manual tuning experience — for example, if a task's execution time exceeds so many minutes, or it processes so much data, recommend so many cores or so much memory. It is a rule set accumulated over a long time that performed well after going online, so we used it to train our parameter recommendation model.
Q2: Is the dependent variable only the adjustment of parameters? Have you considered the influence of the performance instability of the big data platform on the calculation results?
A2: When recommending parameters we do not just chase low cost, otherwise the recommended resources would be too low and the task would fail. The dependent variables are indeed only the parameter adjustments, but to guard against instability we added extra constraints. First, for model features we use averages over a period of time rather than values from an isolated day. Second, we compare the model's recommended parameters with the actual configured values; if the difference is too large we adopt a slow-rise/slow-fall strategy to avoid task failures caused by one-off adjustments that are too aggressive.
Q3: Are regression model and Bayesian model used at the same time?
A3: No. As just mentioned, we used two schemes for parameter recommendation: rule learning uses the regression model, and then the Bayesian optimization framework is used. They are not used at the same time — they were two successive attempts. The advantage of the former is that it quickly exploits accumulated historical experience; the second model can then find a better or even optimal configuration on that basis. The relationship between them is sequential and progressive, not simultaneous.
Q4: Is the introduction of semantic analysis considered from expanding more features?
A4: Yes. As mentioned, the information we currently use for Spark tuning is only a task's execution history; we have not yet looked at the Spark task itself. Spark tasks actually contain a lot of information, including the various operators and stages. Without analyzing their semantics we lose a lot of information, so our next plan is to analyze Spark task semantics and expand the feature set to assist parameter computation.
Q5: Will parameter recommendation be unreasonable, which will lead to abnormal or even failed tasks? Then how to reduce abnormal task error and task fluctuation in such a scenario?
A5: If we relied entirely on the model, it might chase resource utilization as aggressively as possible, and the recommended parameters could be radical — say memory shrinking from 30 GB to 5 GB at once. Therefore, on top of the model's recommendation we add constraints, for example a cap on how many GB any single adjustment may span — the slow-rise/slow-fall strategy.
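That slow-rise/slow-fall constraint amounts to clamping each adjustment toward the recommendation, roughly like this (the step limit is illustrative):

```python
def clamp_step(current: float, recommended: float, max_step: float) -> float:
    """Move toward the recommendation by at most max_step per adjustment,
    so an aggressive suggestion (say 30 GB down to 5 GB) is applied
    gradually over several runs instead of in one jump."""
    delta = recommended - current
    if abs(delta) <= max_step:
        return recommended
    return current + (max_step if delta > 0 else -max_step)
```

Applied once per run, a 30 GB to 5 GB recommendation with an 8 GB cap would step 30 → 22 → 14 → 5, with each intermediate run confirming the task still succeeds.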
Q6: SIGMOD 2022 has some articles related to parameter tuning. Did you reference any of them?
A6: Intelligent task parameter tuning is still a hot research direction, and teams in different fields have adopted different methods and models. Before starting we surveyed many industry approaches, including the SIGMOD 2022 papers you mention. After comparison and practice we settled on the two schemes shared today. We will keep following the latest progress in this direction and try more methods to improve the recommendation effect.
That’s all for today’s sharing. Thank you.
About DataFun | Focused on sharing and exchange around big data and artificial intelligence technology applications. Founded in 2017, it has held more than 100 offline and 100 online salons, forums and summits in Beijing, Shanghai, Shenzhen, Hangzhou and other cities, inviting more than 2,000 experts and scholars to share. Its WeChat official account, DataFunTalk, has accumulated 900+ original articles, 1,000,000+ reads and 160,000+ followers.