亚洲国产日韩欧美一区二区三区,精品亚洲国产成人av在线,国产99视频精品免视看7,99国产精品久久久久久久成人热,欧美日韩亚洲国产综合乱

Home Common Problem Breaking down data silos using a unified data warehouse: CDP based on Apache Doris

Breaking down data silos using a unified data warehouse: CDP based on Apache Doris

Mar 20, 2024 pm 01:47 PM
database User registration

As enterprise data sources become increasingly diverse, the problem of data silos has become common. When insurance companies build customer data platforms (CDPs), they face the problem of component-intensive computing layers and scattered data storage caused by data silos. In order to solve these problems, they adopted CDP 2.0 based on Apache Doris, using Doris' unified data warehouse capabilities to break data silos, simplify data processing pipelines, and improve data processing efficiency.

Breaking down data silos using a unified data warehouse: CDP based on Apache Doris

The data silo problem is like arthritis for online businesses because almost everyone encounters it as they age. Businesses interact with customers through websites, mobile apps, HTML5 pages, and end devices. For some reason, integrating data from all these sources is tricky. The data remains in place and cannot be correlated with each other for further analysis. This is how data silos form. The larger your business becomes, the more diverse sources of customer data you have, and the more likely you are to become trapped in data silos.

That’s exactly what happened with the insurance company I’m going to discuss in this article. By 2023, they have served more than 500 million customers and signed 57 billion insurance contracts. When they began building their Customer Data Platform (CDP) to accommodate such massive data scale, they used multiple components.

Data silos in CDP

Like most data platforms, their CDP 1.0 has both batch pipelines and real-time streaming pipelines. Offline data is loaded into Impala via a Spark job, where it is labeled and divided into groups. At the same time, Spark also sends it to NebulaGraph for OneID calculation (more on this later in this article). On the other hand, real-time data is tagged by Flink and then stored in HBase for query.

This results in a component-intensive computing layer in CDP: Impala, Spark, NebulaGraph and HBase.

As a result, offline labels, live labels and graph data are scattered across multiple components. Integrating them to provide further data services is costly due to redundant storage and large data transfers. More importantly, due to storage differences, they had to expand the scale of the CDH cluster and NebulaGraph cluster, increasing resource and maintenance costs.

CDP based on Apache Doris

For CDP 2.0, they decided to introduce a unified solution to clean up the mess. In the computing layer of CDP 2.0, Apache Doris is responsible for real-time and offline data storage and calculation.

In order to ingest offline data, they utilize the stream loading method. Their 30-thread ingest test showed that it can perform over 300,000 update inserts per second. To load real-time data, they used a combination of Flink-Doris-Connector and Stream Load. Additionally, in real-time reporting that requires pulling data from multiple external data sources, they leverage multi-catalog capabilities for federated queries.

The customer analysis workflow on this CDP is as follows. First, they organize customer information and then label each customer. They group customers according to tags for more targeted analysis and actions.

Next, I'll dig into these workloads and show you how Apache Doris accelerates them.

One ID

Have you ever encountered this situation when your products and services have different user registration systems? You could collect User ID A's email from one product page, and then collect User ID B's Social Security number from another product page. You will then discover that UserID A and UserID B actually belong to the same person because they use the same phone number.

This is why OneID emerged as an idea. It is to collect the user registration information of all business lines into a large table in Apache Doris, organize it, and ensure that each user has a unique OneID.

This is how they leverage functionality in Apache Doris to determine which registrations belong to the same user.

Tag Service

This CDP accommodates 500 million customer information, which comes from more than 500 source tables, with a total of more than 2,000 tags attached.

According to timeliness, tags can be divided into real-time tags and offline tags. Real-time tags are computed by Apache Flink and written to flat tables in Apache Doris, while offline tags are computed by Apache Doris as they originate from user attribute tables, business tables, and user behavior tables in Doris. The following are the company’s best practices in data labeling:

1. Offline tags

During the peak period of data writing, due to the large scale of data, full updates are very difficult. It is easy to cause OOM errors. To avoid this, they leveraged Apache Doris' INSERT INTO SELECT functionality and enabled partial column updates. This will significantly reduce memory consumption and maintain system stability during data loading.

set enable_unique_key_partial_update=true;
insert into tb_label_result(one_id, labelxx)
select one_id, label_value as labelxx
from .....

2. Live tags

Partial column updates can also be used for live tags, because even live tags update at different speeds. All that is required is to set partial_columns to true.

curl --location-trusted -u root: -H "partial_columns:true" -H "column_separator:," -H "columns:id,balance,last_access_time" -T /tmp/test.csv http ://127.0.0.1:48037/api/db1/user_profile/_stream_load

3. High concurrency point query

With the current business scale, the company is using Receiving tag query requests at a concurrency level of over 5000 QPS. They use a combination of strategies to ensure high performance. First, they use Prepared Statement to precompile and preexecute SQL. Second, they fine-tune the parameters of the Doris backend and tables to optimize storage and execution. Finally, they enable row caching as a complement to column-oriented Apache Doris.

Fine-tune Doris’ backend parameters be.conf:

disable_storage_row_cache = false
storage_page_cache_limit=40%

Fine-tuning table parameters when creating a table:

enable_unique_key_merge_on_write = true
store_row_column = true
light_schema_change = true

4. Tag calculation (Join)

In practice, many tag services are implemented through multi-table connections in the database. This typically involves more than 10 tables. In order to obtain the best computing performance, they adopted a co-located group policy in Doris.

Customer Grouping

The customer grouping pipeline in CDP 2.0 is like this: Apache Doris receives SQL from the customer service, performs calculations, and sends the result set through SELECT INTO OUTFILE Send to S3 object storage. The company has divided its customers into 1 million groups. A customer grouping task that used to take 50 seconds in Impala now takes only 10 seconds in Doris.

In addition to grouping customers for more fine-grained analysis, sometimes they also perform reverse analysis. That is, for a certain customer, find out which groups he/she belongs to. This helps analysts understand the characteristics of customers and how different customer groups overlap.

In Apache Doris, this is achieved through the BITMAP function: BITMAP_CONTAINS is a quick way to check whether a customer belongs to a certain group, BITMAP_OR, BITMAP_INTERSECT and BITMAP_XOR are the choices for cross analysis.

Conclusion

From CDP 1.0 to CDP 2.0, insurance companies use the unified data warehouse Apache Doris to replace Spark Impala HBase NebulaGraph. Improved data processing efficiency by breaking down data silos and simplifying data processing pipelines. In CDP 3.0, they hope to group customers by combining real-time tags and offline tags for more diverse and flexible analysis. The Apache Doris community and VeloDB team will continue to be support partners during this upgrade.

The above is the detailed content of Breaking down data silos using a unified data warehouse: CDP based on Apache Doris. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undress AI Tool

Undress AI Tool

Undress images for free

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Hot Topics

PHP Tutorial
1488
72
Why can't I register at the Bitget Wallet exchange? Why can't I register at the Bitget Wallet exchange? Sep 06, 2024 pm 03:34 PM

There are various reasons for being unable to register for the BitgetWallet exchange, including account restrictions, unsupported regions, network issues, system maintenance and technical failures. To register for the BitgetWallet exchange, please visit the official website, fill in the information, agree to the terms, complete registration and verify your identity.

Laravel user login function list Laravel user login function list Apr 18, 2025 pm 01:06 PM

Building user login capabilities in Laravel is a crucial task and this article will provide a comprehensive overview covering every critical step from user registration to login verification. We will dive into the power of Laravel’s built-in verification capabilities and guide you through customizing and extending the login process to suit specific needs. By following these step-by-step instructions, you can create a secure and reliable login system that provides a seamless access experience for users of your Laravel application.

In back-end development, how to distinguish the responsibilities of the service layer and the dao layer? In back-end development, how to distinguish the responsibilities of the service layer and the dao layer? Apr 19, 2025 pm 01:51 PM

Discussing the hierarchical architecture in back-end development. In back-end development, hierarchical architecture is a common design pattern, usually including controller, service and dao three layers...

DeepSeek official website entrance and latest promotional activities DeepSeek official website entrance and latest promotional activities Feb 19, 2025 pm 05:15 PM

DeepSeek's official website is now launching multiple discount activities to provide users with a shopping experience. New users sign up to get a $10 coupon, and enjoy a 15% limited time discount for the entire audience. Recommend friends can also earn rewards, and you can accumulate points for redemption of gifts when shopping. The event deadlines are different. For details, please visit the DeepSeek official website for inquiries.

Sesame Open Door Official Website Trading Platform Sesame Open Door Official Website Exchange Registration Entrance Sesame Open Door Official Website Trading Platform Sesame Open Door Official Website Exchange Registration Entrance Feb 28, 2025 am 10:57 AM

Gate.io Sesame Open is the world's leading blockchain digital asset trading platform, including fiat currency trading, currency trading, leveraged trading, perpetual contracts, ETF leveraged tokens, wealth management, Startup initial public offering and other sections, providing users with security, stability, openness and transparency.

Go language user registration: How to improve email sending efficiency? Go language user registration: How to improve email sending efficiency? Apr 02, 2025 am 09:06 AM

Optimization of the efficiency of email sending in the Go language registration function. In the process of learning Go language backend development, when implementing the user registration function, it is often necessary to send a urge...

Gate.io Exchange Newbie Registration and Trading Tutorial Gate.io Exchange Newbie Registration and Trading Tutorial Feb 21, 2025 pm 09:54 PM

Gate.io Exchange is one of the world's leading cryptocurrency trading platforms. This guide provides step-by-step tutorials to help users register and trade with Gate.io. The registration process includes selecting the registration method (phone, email or social account), filling in information, setting a login password, and completing identity authentication. Trading tutorials include accessing trading pages, selecting trading pairs, entering trading information, placing an order, and viewing order status. With the guidance of this article, users can easily start trading cryptocurrency on Gate.io.

Ultimate consistency in distributed systems: how to apply and how to compensate for data inconsistencies? Ultimate consistency in distributed systems: how to apply and how to compensate for data inconsistencies? Apr 19, 2025 pm 02:24 PM

Exploring the application of ultimate consistency in distributed systems Distributed transaction processing has always been a problem in distributed system architecture. To solve the problem...