Reflections on my 2020 Data Projects — Part I

Herman Wu
9 min read · Jan 6, 2021


Lessons I learned and want to keep in mind in 2021+

Image by Syaibatul Hamdi from Pixabay

2020 has come to an end. Looking back over it, I had three projects that needed to process multiple terabytes of data regularly (one of them processes hundreds of terabytes of data every day) and one project focused on applying good MLOps practices on GitHub. Some of these projects have gone into production and are running in data centers on different continents to serve worldwide customers.

In this blog post, I want to summarize the key practices I learned from these projects and want to keep in mind in 2021. I plan to reread them before each new data project kicks off so I can make a better plan.

Note: even for similar workflows, small-, medium-, and large-scale projects require different architectures and different planning. This post focuses on large-scale data projects.

#1 Prepare for Data Schema Change

This is a hard fact, but real-world data always evolves and data schemas change over time. Years ago, it was common practice to spend a lot of time negotiating the data schema/contract with data providers and try to define the best schema that could live forever. But we are in a world where data is growing exponentially, and more and more of it comes from non-traditional OLTP systems. In some projects (especially those receiving more than several gigabytes of data every day), it's not realistic to assume a stable data format/schema over time. Requests to support varying data schemas kept popping up in the system requirements of my 2020 data projects.

There are several techniques that can help handle data schema changes.

Most of these techniques share similar principles and can complement each other in handling data schema change. The key point is to pay some extra attention to it: if someone assumes the schema will stay stable, it's worth discussing and validating that assumption again, because the requirement to handle schema change affects the foundation of the whole system design.
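To make this concrete, here is a minimal sketch of one such technique on Databricks: letting a Delta table evolve when a new batch arrives with extra columns. The paths, column layout, and the use of Delta Lake here are illustrative assumptions, not the exact setup of these projects.

```python
# Minimal sketch: tolerate schema evolution when appending new batches
# into a Delta table on Databricks. Paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a new daily drop whose schema may contain columns we have not seen before.
new_batch = spark.read.json("/mnt/raw/events/2021/01/06/")

# "mergeSchema" lets Delta add the new columns instead of failing the write,
# so downstream queries keep working while the schema evolves.
(new_batch.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/mnt/curated/events"))
```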

#2 Design for Query & Charting

Image by David Schwarzenberg from Pixabay

A good data system should at least fulfill the team's basic query and charting requirements. It is very common that data collectors (such as data engineers, who collect and ingest the data) and data consumers (BI reporters, data analysts, ML researchers, ...) come from different teams and don't speak each other's language in projects. Considering the different skill sets required for these roles and the different business goals of each group, this is normal. However, delivering the promised benefits of the whole data system requires deep engagement from both parties. A couple of times, we saw technologies chosen and built based on one side's requirements while the solution poorly met the other side's needs.

It is usually easier to figure out the data collectors' requirements than the data consumers' needs, because the data collectors' requirements are mostly related to the 3Vs of Big Data: Volume, Variety, and Velocity, which we can quantify with specific numbers in the requirements specification. The data consumers' requirements often depend on the business context, the problem domain, and the size and skill set of the data consumer team; they require some more digging to figure out.

Here are some common factors to consider:

  • Visualization Tools: Some teams have strong Excel skills; they are fluent in using PowerPivot and Excel functions to generate in-depth interactive reports. Some teams like the advanced graphical presentation provided by dashboard/reporting tools such as Power BI and Tableau. Some groups like the powerful programmability of Jupyter Notebook or Shiny. Also, most data platforms provide their own built-in tools to visualize data. For example, Azure Data Explorer provides very handy and powerful charting capabilities for historical log data and time-series analysis.
Residual time series chart in Azure Data Explorer
  • Data Post Processing/Query Tools and Languages: Data consumers will need to further process data to fulfill their analytic requirements. Languages like SQL, R, and Python (ok, you can also add Julia) are very common and powerful ones that data analysts use to further "massage" the data. But each analysis domain has its own unique data processing patterns. Based on their design purposes, different data platforms also provide their own query syntax and query engines to support and simplify data post-processing for specific domains. Understanding each platform's strengths and choosing the best-fit one is critical to the project's success. For example, it's easy to implement complex text filtering and parsing in Elasticsearch, but hard to implement complex data joins and aggregations in it. Another example is Azure Data Explorer, which focuses on historical log data analytics and provides the powerful Kusto Query Language to simplify log parsing, aggregation, data joins, and semi-structured data processing. It is one of my favorite query languages.
  • Performance Goals and Limitations when retrieving insights from "Big Data": Processing large volumes of data is not easy; different solutions apply different strategies to achieve their performance goals. They use different index algorithms, in-memory computing, cluster resource allocation, file storage structures, data compression types, etc. These strategies also come with limitations such as "limited number of concurrent users", "unbalanced computing resource allocation", "huge memory caches required", "no data paging support", etc. Review the pros and cons of different solutions and test them in the project's main data query scenarios (a rough sketch of such a test follows this list).
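For the testing step, a small harness like the one below can time a representative query under a handful of concurrent "users". This is only a sketch: `run_query` is a placeholder for whatever client call the candidate platform provides (Kusto, Spark SQL, Elasticsearch, ...), and the worker/iteration counts are arbitrary.

```python
# Rough sketch of a query benchmark: time one representative query
# under several concurrent "users". run_query() is a placeholder for
# the candidate platform's client call.
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median

def run_query() -> None:
    # Placeholder: replace with a real call to the platform under test.
    time.sleep(0.1)

def timed_run() -> float:
    start = time.perf_counter()
    run_query()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=8) as pool:   # 8 concurrent users
    latencies = list(pool.map(lambda _: timed_run(), range(40)))

print(f"median: {median(latencies):.3f}s, worst: {max(latencies):.3f}s")
```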

Both data collectors and data consumers are critical stakeholders of a data project. In the solution design phase, it's helpful to engage deeply with both teams. Spend a little more time on what the data query patterns look like and on which data analysis tools suit the users, and the system will have a much brighter outlook.

#3 Balance the Workload of Each Resource

Image credit: Rob Leane, Den of Geek UK

When we worked on small or medium-sized data projects, we implemented the solution using one or two key technologies such as Databricks or Azure Data Explorer. These technologies come with default capabilities to ingest, manage, and query data, so we focused on providing the proper configuration for these platforms.

However, when we worked on large-scale projects, instead of relying on built-in capabilities, we found we needed to re-architect the system to further distribute the computation workload across the different components of the system. We needed to spend more time thinking about how to make the data more "swallowable" for the data platform. "Swallowable" can mean things like reducing data fragmentation, making indexes easier to build, removing dirty data, removing duplicated data, etc.

The workload distribution process is also a process of finding the balance between components. Here are some points that might help:

  • Consider the size of your batches and how it impacts latency and memory consumption.
  • Choose data formats deliberately, and consider their impact on serialization/deserialization time and disk I/O (a small comparison sketch follows this list).
  • Decide where to filter, split, and aggregate data, so the next component has less work to do and its management and configuration complexity is reduced.
  • Consider the maximum scale-out/in capability of each component, so that the input and output velocity of every component stays within the bandwidth of the components it connects to.
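For the data-format point, a quick sketch like the following can quantify the trade-off for a representative batch. The synthetic DataFrame and the two candidate formats (CSV vs. Parquet) are illustrative assumptions; pandas and pyarrow are required.

```python
# Quick sketch: compare how a format choice affects serialization time
# and on-disk size for the same (synthetic) batch of data.
import os
import time
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ts": pd.date_range("2021-01-01", periods=1_000_000, freq="S"),
    "value": np.random.rand(1_000_000),
    "source": np.random.choice(["eu", "us", "asia"], 1_000_000),
})

for name, writer in [("batch.csv", df.to_csv), ("batch.parquet", df.to_parquet)]:
    start = time.perf_counter()
    writer(name)                                  # serialize to disk
    elapsed = time.perf_counter() - start
    size_mb = os.path.getsize(name) / 1e6
    print(f"{name}: {elapsed:.1f}s to write, {size_mb:.0f} MB on disk")
```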

For a large-scale project, the right solution is one where the tasks are well distributed among all critical components and the workload is balanced, so the system has room to grow its capacity, handle peak-time traffic, and prepare for future business growth.

#4 Wow, there are duplicated data

This is another common issue when processing a large amount of data. Data can be duplicated for dozens of reasons: the data source may have sent it twice because of a network issue; part of the data processing pipeline may have partially failed and the system is resuming from the latest checkpoint; the underlying message service may only guarantee at-least-once delivery, etc.

Image by Martin Pyško from Pixabay

We need to evaluate the business impact and the cost we are willing to pay to mitigate the issue. Common strategies to handle duplicated data are:

  • Ingestion Time Check: We can add an ingestion-time check by keeping data processing logs or checkpoint files. Every ingestion operation then verifies the log/checkpoint file before processing the data.
  • Settling Basin Filter: In this strategy, we store incoming data in temporary storage or a staging table, then compare it with recently ingested data to make sure it is not duplicated before moving it on.
  • Query Time Check: In some systems, a few duplicated records can be ignored for most use cases (e.g., website login logs used to analyze user demographic/geographical distribution), and only a few use cases require a duplicate check. In such a scenario, we can check for duplicates at query time only where it is required, and reduce the system-wide cost of de-duplication (a minimal sketch follows this list).
  • Under Some Threshold, Ignore It: This is the best solution if the business context allows it. De-duplication usually occupies a certain amount of system resources and impacts system performance, and in a large data processing system it is very costly to implement. It helps a lot if we can simply ignore duplicates when they don't impact the business goal.
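As an illustration of a query-time check, the PySpark sketch below keeps only the latest record per key. The table path and the column names (`event_id`, `ingest_time`) are hypothetical.

```python
# Sketch of a query-time duplicate check in PySpark: keep only the
# latest record per event_id. Table path and column names are illustrative.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.read.format("delta").load("/mnt/curated/events")

w = Window.partitionBy("event_id").orderBy(F.col("ingest_time").desc())
deduped = (events
    .withColumn("rn", F.row_number().over(w))   # rank records within each key
    .filter("rn = 1")                           # keep the most recent one
    .drop("rn"))

# For exact duplicates, events.dropDuplicates(["event_id"]) is the simpler form.
```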

Besides the above strategies, another point is to understand where the duplicated data comes from: which parts of the system or infrastructure services provide at-least-once delivery, and which parts can support exactly-once. If the system uses checkpoints to resume failed operations, try to understand how that works.

#5 Understanding of Core Technologies

Image by Gerd Altmann from Pixabay

This point should go without saying, but I add it here not just because it is fundamental, but also because it can take a considerable amount of time to understand the design philosophy and the available configuration parameters of the different data platforms. Many settings affect each other, so we should plan enough time to learn, test, and verify the best configuration.

For example, for the core technologies I used in these projects, here are some key configurations that needed to be checked, calculated, and tested:

Databricks

  • Number of Cores
  • Amount of Memory
  • Duration of Jobs, Stages, and Tasks
  • Fair Scheduler Pools
  • Structured Streaming: max files / max bytes per trigger
  • Shuffle Partitions
  • Shuffle Spill (Memory)
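A few of the settings above can be expressed directly in a notebook or job; the sketch below shows where they live, with placeholder values rather than tuned recommendations, and a hypothetical Delta source path.

```python
# Sketch: where some of the Spark / Structured Streaming knobs above are set.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Shuffle partitions: sized against cores and the data volume per stage.
spark.conf.set("spark.sql.shuffle.partitions", "400")

# Fair scheduler pool used by this notebook / job.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "ingestion")

# Structured Streaming: cap how much data each micro-batch pulls in.
stream = (spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 100)   # max files processed per trigger
    .load("/mnt/raw/events"))
```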

Azure Data Explorer

[Ingestion]

  • MaximumBatchingTimeSpan
  • MaximumNumberOfItems
  • MaximumRawDataSizeMB

[Capacity Policy]

  • Ingestion capacity
  • Extents merge capacity
  • Extents purge rebuild capacity
  • Export capacity
  • Extents partition capacity
  • Materialized views capacity policy
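For reference, the three ingestion values above are part of a table's (or database's) ingestion batching policy. A hedged sketch of setting it from Python with the azure-kusto-data SDK might look like the following; the cluster URI, database, table, and threshold values are all placeholders.

```python
# Sketch: alter an Azure Data Explorer ingestion batching policy from Python.
# Cluster URI, database, table, and the threshold values are placeholders.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://mycluster.westeurope.kusto.windows.net")
client = KustoClient(kcsb)

command = (
    ".alter table Events policy ingestionbatching "
    "@'{\"MaximumBatchingTimeSpan\": \"00:05:00\", "
    "\"MaximumNumberOfItems\": 500, \"MaximumRawDataSizeMB\": 1024}'"
)
# Control commands go through execute_mgmt rather than execute (query).
client.execute_mgmt("mydatabase", command)
```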

Azure Functions (Python)

  • FUNCTIONS_WORKER_PROCESS_COUNT
  • Max instances when Functions scale out

[Queue Trigger]

  • batchSize
  • newBatchThreshold

[HTTP Trigger]

  • maxConcurrentRequests
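Most of these knobs are not set in the function code itself: the trigger-level settings live in host.json, while the worker process count and the scale-out cap are application settings on the Function App. A minimal sketch, with placeholder values:

```python
# Sketch of where the Azure Functions knobs above are configured.
# host.json holds the trigger-level settings; the worker process count
# and the scale-out cap are application settings (environment variables).
import json

host_json = {
    "version": "2.0",
    "extensions": {
        "queues": {                  # Queue trigger
            "batchSize": 16,         # messages fetched per batch
            "newBatchThreshold": 8,  # fetch the next batch when this many remain
        },
        "http": {                    # HTTP trigger
            "maxConcurrentRequests": 100,
        },
    },
}
print(json.dumps(host_json, indent=2))

# Application settings on the Function App (illustrative values):
#   FUNCTIONS_WORKER_PROCESS_COUNT = 4
#   WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT = 20   # cap on scaled-out instances
```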

» read Part-II
