We are hiring technical experts
Database Kernel Expert
Job Description:
The Alibaba Cloud ApsaraDB team is well known for its database products, including application databases, hybrid transaction/analytical processing databases, and big data databases. ApsaraDB not only eases the pain of database management, but also focuses on improving open source database engines. The team is looking for self-motivated engineers with an international background who have a deep understanding of database kernels and productivity. If you are a DB geek, why not join us and explore the beauty of DB?
Job requirement:
- Excellent multiprocess and network programming skills in C++/Java/Python/Golang
- Familiarity with the Linux kernel and performance tuning
- Solid knowledge of distributed databases
Minimum qualifications:
- Easy going nature
- English (full professional proficiency)
- Ability to work with the team remotely
Preferred qualifications:
- Experience in the design or core development of high-data-volume, high-load products or systems
- Open source project development experience
- Familiarity with C/C++/Erlang proxy development, such as MaxScale, Spider, or nginx
Contact information:
- Email: sibo.zsb@alibaba-inc.com / LinkedIn:
Invites Research Project Proposals on Apsara Cloud Database
The Alibaba Cloud Database team seeks to establish scientific and technical cooperation with universities and institutes. Research proposals are invited in the following database areas: In-Memory DB, NVM Optimization, Self-Driving DB, Query Optimization Based on Machine Learning, Distributed Transaction, OLAP Computing Engine Optimization, and Time Series/Spatio-Temporal DB. Details of the seven categories are as follows:
Topic 1: In-Memory Database
Background
In-memory databases, such as SAP HANA, provide high performance and high throughput and have been widely deployed in performance-critical environments. However, their deployment cost depends heavily on local memory size, which limits how this kind of database can be used. With the increasing size of host memory and the growing adoption of emerging hardware such as 3D XPoint and AEP, servers with very large memory capacity (more than 1 TB) are becoming more and more popular. In-memory databases are therefore about to enter a period of rapid development.
Currently, most in-memory databases are designed for traditional hardware such as SSD and HDD. For instance, their sub-modules such as logging, indexing, and caching have not been optimized for the access characteristics of memory. How to match access latencies on the order of tens of nanoseconds and how to take advantage of byte addressability remain open problems.
In-memory databases such as Redis, MemSQL, and VoltDB also face many challenges, such as how to efficiently support data compression, data indexing and loading, and data persistence with large memory capacities or hybrid memory systems. Tackling these issues requires innovative designs of data structures and storage mechanisms.
Related Research Topics
- New compression algorithms that improve memory space utilization while supporting efficient database indexing.
- Optimization mechanisms for DRAM/NVM/SSD hybrid memory that target key modules of in-memory databases, such as logging, hybrid storage allocation, and data layout management.
- Low-latency, scalable architectures for non-volatile storage media, and storage engines that support high throughput (ten million QPS or more).
Topic 2: NVM-Based Database Optimization
Background
NVM (non-volatile memory) is persistent and byte-addressable, with high storage density and DRAM-comparable write performance. As a new generation of storage media, NVM is driving a significant revolution in computing architecture. Core database components such as the storage engine, the logging system, and index structures can be optimized by taking advantage of these features.
Related Research Topics
- A DRAM/NVM/SSD hybrid storage engine for traditional relational databases that improves data access performance.
- Efficient NVM-based parallel logging mechanisms that provide higher logging throughput and shorter recovery time after a database crash.
- Index structures designed for various NVM media, such as 3D XPoint or MRAM, and suited to the efficient access patterns of CPUs.
Topic 3: Self-Driving Database
Background
Efficient database operation and maintenance has long been regarded as a competitive system feature, especially today, when the amount of data is growing exponentially. Because databases are deployed at enormous scale, expose a large number of system parameters, and face complicated, ever-changing workloads, manual operation and maintenance becomes more difficult over time. As a result, attention is turning to the self-driving database, whose core idea is to use machine learning and artificial intelligence to enable automatic parameter adjustment, self-diagnosis, and self-optimization. The goal of the self-driving database is not only to reduce the burden on DBAs, but also to deliver improved performance and lower cost.
However, databases have a very large number of parameters. For example, MySQL and PostgreSQL both have hundreds of adjustable options, and Oracle has even more. Even worse, many of these parameters depend on and influence each other. Producing an appropriate configuration from complex combinations of these factors is a significant challenge.
Related Research Topics
- Automatic parameter adjustment. Choosing appropriate database parameters has always been a key task for a DBA. Adjusting parameters automatically using machine learning and artificial intelligence is a vitally important function of the self-driving database.
- Workload prediction, performance diagnosis, and optimization. One of the core tasks of database operation and maintenance is dynamic performance diagnosis and optimization. Effective solutions are needed to predict the workload, diagnose performance problems, and optimize performance intelligently.
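To make the parameter-adjustment problem concrete, the following is a minimal sketch of the simplest possible tuner: random search over a small knob space. It is only a baseline, not a machine learning method, and everything in it is an assumption made for illustration: the knob names, their ranges, and the synthetic `simulated_latency_ms` function, which stands in for running a real benchmark against a database.

```python
import random

# Hypothetical knob ranges. The names are illustrative only and are not
# real MySQL, PostgreSQL, or Oracle options.
KNOBS = {
    "buffer_pool_mb": (128, 8192),
    "worker_threads": (1, 64),
}


def simulated_latency_ms(config):
    """Synthetic stand-in for a benchmark run (an assumption, not a real model)."""
    pool = config["buffer_pool_mb"]
    threads = config["worker_threads"]
    miss_cost = 4000.0 / pool                   # bigger pool, fewer cache misses
    contention = 0.02 * (threads - 16) ** 2     # too few or too many threads hurts
    interaction = 0.5 * threads / (pool / 128)  # knobs influence each other
    return 5.0 + miss_cost + contention + interaction


def random_search(trials, seed=0):
    """Sample random configurations and keep the one with the lowest latency."""
    rng = random.Random(seed)
    best_config, best_latency = None, float("inf")
    for _ in range(trials):
        config = {name: rng.randint(lo, hi) for name, (lo, hi) in KNOBS.items()}
        latency = simulated_latency_ms(config)
        if latency < best_latency:
            best_config, best_latency = config, latency
    return best_config, best_latency


if __name__ == "__main__":
    config, latency = random_search(trials=200)
    print(config, round(latency, 2))
```

The interaction term is what makes tuning one knob at a time insufficient. A learned tuner would replace the blind sampling with a model that predicts latency from the configuration and proposes the next trial accordingly.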
Topic 4: Query Optimization Based on Machine Learning
Background
Query optimization is known to be computationally hard; finding an optimal join order, for example, is NP-hard. Traditional query optimizers are limited by their statistical methods, query cost models, and optimization algorithms. When faced with data skew and highly correlated data, they are often unable to choose the best query plan, which degrades query performance. A key requirement is to use machine learning as a basis and input for query scheduling and resource management, yielding more accurate cost models and better query plans. The aim is to break through the bottlenecks and restrictions of traditional SQL optimization technology and improve database query performance and system resource utilization.
Related Research Topics
- Self-learning query optimizer cost models (including homogeneous and heterogeneous computing) that improve the accuracy of query cost estimation and resource prediction at the level of the query processing operator and the workload.
- More accurate and intelligent query plans that improve the performance of queries previously affected by suboptimal plans, in both practical scenarios and standard benchmarks.
- Performance and resource prediction based on machine learning that improves prediction accuracy in both practical scenarios and standard benchmarks.
Topic 5: Distributed Transaction and Query
Background
In an era when big data drives the development of industries, enterprises often select a variety of database products to support online transactions, report generation, log storage, and offline analysis in order to sustain rapid business growth. HTAP databases were born in this environment. They support both OLTP and OLAP in a hybrid form, meeting the requirements of most enterprise-level applications and solving customers' business problems with a one-stop solution.
For OLTP, most current systems use an independent central node to coordinate distributed transactions, which constrains their performance and scalability. Spanner provides distributed transaction consistency based on special hardware (GPS plus atomic clocks), which few companies can afford. A more practical, high-performance, and scalable distributed transaction solution is needed.
For OLAP, distributed SQL queries are an important means for databases to handle huge amounts of data. The quality of the query plan, which depends on several factors such as host hardware, network throughput, and data layout, can have a great impact on the response latency of a distributed SQL query. Producing an optimized distributed query plan in the face of these dynamically changing factors is a significant challenge.
Related Research Topics
- A decentralized distributed transaction system prototype that delivers high performance, with up to millions of QPS, latency on the order of a hundred microseconds, and near-linear horizontal scalability. It should adapt to transactions of various sizes and recover quickly from crashes in large-scale transaction processing.
- Novel distributed query optimizers that can precisely and effectively account for the resources of the whole distributed database and produce high-quality query plans dynamically and quickly.
Topic 6: OLAP Computing Engine
Background
To cope with current and future rapidly growing data environments, the new generation of OLAP computing engines faces pressing requirements and greater challenges: how to process PB-level and even EB-level data efficiently, how to provide real-time analysis of huge amounts of unstructured data, and how to support computing models more complicated than traditional SQL.
In big data OLAP analysis, columnar storage provides better I/O efficiency and data compression and has become the primary storage model. However, it remains a challenge for databases to provide both a hybrid row-column storage model and hot/cold data separation policies in the same system to meet the requirements of OLAP scenarios.
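As a small illustration of why a columnar layout tends to compress well, the sketch below applies run-length encoding to a single column extracted from row-oriented data. Run-length encoding is chosen here only as one simple, well-known scheme; the sample rows and the function names are assumptions for illustration and are not taken from any ApsaraDB component.

```python
def rle_encode(column):
    """Run-length encode a column into [value, run_length] pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return runs


def rle_decode(runs):
    """Expand [value, run_length] pairs back into the original column."""
    column = []
    for value, count in runs:
        column.extend([value] * count)
    return column


# Hypothetical row-oriented data: (country, order_id).
rows = [("cn", 1), ("cn", 2), ("cn", 3), ("us", 4), ("us", 5)]

# Storing the country values contiguously exposes long runs of equal values.
country_column = [country for country, _ in rows]
encoded = rle_encode(country_column)
print(encoded)  # [['cn', 3], ['us', 2]]
```

In a row layout the repeated country values are interleaved with order IDs, so the same runs never appear contiguously. This is one reason per-column compression is usually more effective than per-row compression.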
Related Research Topics
- Storage mechanisms for OLAP. Exploring techniques and algorithms for column storage compression and hybrid row-column storage.
- Vectorization analysis. To handle unstructured data such as videos, images, and audio stored in databases, the OLAP engine needs to support real-time online indexing of unstructured data, using vectorization analysis and indexing enabled by built-in machine learning algorithms.
- Code generation capability. The OLAP computing engine needs to improve its analysis performance by introducing novel compilation and execution technology.
Topic 7: Time Series/Spatio-Temporal Database
Background
With the rapid development of the Internet, IoT and edge computing, surveying, and social sensing, time series and spatio-temporal data will grow enormously, and the cross-media linkage of information will become increasingly complicated. New types of data such as 3D scene data, spatio-temporal trajectory data, IoT sensing data (time + location + value), spatio-temporal media data, and complex relational network data will be used in various industries.
New businesses such as IoT, e-commerce and new retail, ride sharing, autonomous driving, intelligent logistics, and intelligent transportation will make time series computing, spatio-temporal computing, and graph computing ubiquitous. Therefore, the time series, spatio-temporal, and graph computing power of databases will become a core requirement for supporting these emerging industries and a key driving force for the cloud computing business.
Related Research Topics
- Real-time unstructured data processing. Combined with a stream computing system, a framework is needed for real-time ingestion, efficiently compressed storage, and analysis of large-scale time series, location, and graph data, with write rates of up to tens of millions of sequential data points per second.
- Graph modeling and applications based on spatio-temporal constraints. In conjunction with applications in related fields, a graph model with dynamic temporal and spatial semantic constraints needs to be designed and built, supporting data compression and fast path/relationship search and analysis in large-scale scenarios.
- Hierarchical, multidimensional, efficient indexes. Design and optimize temporal indexes, spatial indexes, graph indexes, and their combinations; building on new time series and graph data models and query features, study and implement pre-aggregated indexes, correlation indexes, and approximate query indexes.
- Hardware and software acceleration for graphics and images. Study and implement graphic and image query processing operators based on hardware acceleration and algorithm optimization, with the goal of improving performance by more than one order of magnitude.
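The compressed-storage requirement for time series data can be illustrated with delta-of-delta encoding, a well-known technique for timestamps that arrive at nearly regular intervals. The sketch below is a minimal version under stated assumptions: timestamps are integers, at least two are present, and the function names and sample values are hypothetical. It shows only the transformation into small integers, not the bit-level packing a real storage engine would apply afterwards.

```python
def delta_of_delta_encode(timestamps):
    """Return (first_timestamp, first_delta, list of second-order deltas)."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    first = timestamps[0]
    first_delta = timestamps[1] - timestamps[0]
    second_order = []
    prev_delta = first_delta
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        second_order.append(delta - prev_delta)  # change in the sampling gap
        prev_delta = delta
    return first, first_delta, second_order


def delta_of_delta_decode(first, first_delta, second_order):
    """Rebuild the original timestamps from the encoded form."""
    timestamps = [first, first + first_delta]
    delta = first_delta
    for change in second_order:
        delta += change
        timestamps.append(timestamps[-1] + delta)
    return timestamps


# Hypothetical sensor readings taken roughly every 10 time units.
samples = [1000, 1010, 1020, 1030, 1041, 1051]
encoded = delta_of_delta_encode(samples)
print(encoded)  # (1000, 10, [0, 0, 1, -1])
```

For regularly sampled data most second-order deltas are zero or close to zero, so they can be stored in far fewer bits than the raw timestamps.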