In this article I want to talk about data design considerations.
When a new process is designed and needs to store and retrieve data from a storage platform, the architect or senior developer designing the solution starts by creating a data model based on the application requirements he or she is modelling.
In many cases, there is no DBA involved at this phase of design.
In many cases, the developers do not start by analyzing how the data will grow over time, and their design makes no provision for this issue at the beginning of the project.
This gap exists because of the way the development process is conducted: developers lack the knowledge and interest to maintain the data over the long term, while DBAs do not get involved in database decisions until the application reaches the release stage, and in many cases much later than that.
I will present three situations I have encountered over time while working for my clients, situations that exemplify the issue described above.
1. Azure Table Storage
2. Business logic inside the middle-tier
3. Dynamically created queries aggregating 10 years of financial data
Azure Table Storage
Azure Table Storage is a cheaper alternative provided in Azure that allows the user to store NoSQL data, data that is not bound to a particular schema. While Azure SQL databases can scale up to only 150 GB, the maximum data size for Azure tables is 200 TB per table.
Microsoft recommends that developers consider Azure Table Storage when they want to store data in the range of multiple terabytes while keeping storage costs down, and when the stored data does not depend on complex server-side joins or other logic.
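As a quick illustration of the schema-less model, here is a minimal sketch assuming the current Azure.Data.Tables SDK; the connection string placeholder, table name and properties are invented for the example, not taken from any system described below.

```csharp
using System;
using Azure.Data.Tables;

var tableClient = new TableClient("<storage-connection-string>", "CustomerOrders");
tableClient.CreateIfNotExists();

// Only PartitionKey and RowKey are mandatory; any other properties can vary
// from one entity to the next, which is what "not bound to a schema" means here.
var entity = new TableEntity("2024-05", Guid.NewGuid().ToString())
{
    ["CustomerId"] = 42,
    ["Amount"]     = 199.99
};
tableClient.AddEntity(entity);
```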
My client started to save data into Azure Table Storage a couple of years ago. At the present time, some of the tables contain between 15 and 20 million records.
No data clean-up has been implemented, nor a mechanism to keep only the active data required by the application.
Following the initial release of the tables, a separate process running as a nightly ETL job was introduced, owned by the DBA department: it reads the Azure tables from storage and aggregates the data for reporting.
This process has become slower and slower over time and currently takes more than 24 hours to complete.
The DBAs recently produced a list of records that became inactive in the system over the last couple of years and asked the development team to remove these records from Azure Table Storage.
The development team implemented a process that reads the inactive records from SQL Server and searches for each of them inside Azure Table Storage, against 20 million records.
The following results were found: the time needed to read and update a particular record in an Azure table depends on the number of records the table contains:
For a table with 10,000 records, the processing time is 7 seconds.
For a table with 11.5 million records, the processing time is 1:33 minutes.
The process implemented by the development team uses parallel programming with a configurable number of threads (the tests were done with 20 threads at a CPU usage of 25 to 40%), processing batches of 1,000 records at a time.
It was found that the bottleneck of identifying a particular record is not helped by parallel processing, since it lies in the performance of the Azure Table Storage lookups themselves.
To provide a working solution to this problem I changed the approach: instead of reading the Azure SQL database's table and searching for each record inside Azure Table Storage (which is very slow), I did the following (sketched in code after this list):
read the records from Azure Table Storage in order, in batches;
check them against the Azure SQL database's table;
copy the matching records into a backup table identical to the source table, then mark them as deleted or physically delete them.
This process handles around 2 million records per hour against a table with 20 million records.
The entire execution of the initial clean-up process still takes 10 hours.
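Here is a minimal sketch of this scan-and-check approach, again assuming the Azure.Data.Tables SDK; the table names, the helper that loads the inactive keys from SQL Server, and the assumption that inactive records can be matched by RowKey are all mine, for illustration only. The variant shown physically deletes the records after copying them.

```csharp
using System;
using System.Collections.Generic;
using Azure.Data.Tables;

// Hypothetical helper: loads the RowKeys of the inactive records from the
// Azure SQL database into memory; that set is far smaller than the 20 million
// rows in the Azure table, so it fits comfortably in a HashSet.
HashSet<string> inactiveKeys = LoadInactiveRowKeysFromSql();

var sourceTable = new TableClient("<storage-connection-string>", "CustomerOrders");
var backupTable = new TableClient("<storage-connection-string>", "CustomerOrdersBackup");
backupTable.CreateIfNotExists();

// One sequential pass over the large table, page by page, instead of one
// lookup per inactive record.
foreach (var page in sourceTable.Query<TableEntity>().AsPages(pageSizeHint: 1000))
{
    foreach (var entity in page.Values)
    {
        if (!inactiveKeys.Contains(entity.RowKey))
            continue;

        // Keep a copy in the backup table, then remove the record from the source.
        backupTable.UpsertEntity(entity);
        sourceTable.DeleteEntity(entity.PartitionKey, entity.RowKey);
    }
}
```

The point is that the cost of a single sequential pass grows with the size of the table only once per run, which is what brings the clean-up down to a predictable number of hours.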
Business logic inside the middle-tier
A bank I worked for implemented a background process that ran overnight: while it performed well for the first few months, after more than a year it kept failing in the Production environment for over a week, with significant database timeouts.
Two senior developers worked together to implement it: one built a .NET library providing the required calculations in the middle tier; the other created the data access stored procedures and functions required by the process.
When I was asked to investigate and fix this Production issue, I discovered that the middle tier was making 2,500 calls to the database, each taking an average of 500 milliseconds, which amounts to roughly 21 minutes spent on database round trips alone.
All the data processed in the middle tier could have been processed at the database level, without the need for this chatty communication; one argument for the middle-tier approach is that methods are easier to test there than in the database. (As a note, I prefer to perform the work in the system where it naturally belongs, in this case the database.)
At the same time, I discovered that each 500-millisecond query joined tables holding many millions of records accumulated over the previous years, while the actual process needed only a very small number of records from the previous night: around 1,000. (The problem was made worse by another process running nightly in the same time frame: it rebuilt the database indexes, so the initial process could not use those indexes for its queries.)
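To make the difference concrete, here is a hedged sketch of the two patterns; the stored procedure names and parameters are hypothetical and not taken from the bank's code.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using Microsoft.Data.SqlClient;

public static class NightlyCalculation
{
    // Chatty pattern (simplified): one database round trip per record.
    // 2,500 calls at roughly 500 ms each add up to about 21 minutes of round trips.
    public static void CalculatePerAccount(IEnumerable<int> accountIds, string connectionString)
    {
        using var connection = new SqlConnection(connectionString);
        connection.Open();
        foreach (var accountId in accountIds)
        {
            using var command = new SqlCommand("dbo.usp_CalculateForAccount", connection)
            {
                CommandType = CommandType.StoredProcedure
            };
            // Hypothetical procedure and parameter names.
            command.Parameters.AddWithValue("@AccountId", accountId);
            command.ExecuteNonQuery();
        }
    }

    // Set-based alternative (simplified): a single call that lets the database
    // engine process last night's whole batch in one set-based statement.
    public static void CalculateForBusinessDate(DateTime businessDate, string connectionString)
    {
        using var connection = new SqlConnection(connectionString);
        connection.Open();
        using var command = new SqlCommand("dbo.usp_CalculateForBusinessDate", connection)
        {
            CommandType = CommandType.StoredProcedure
        };
        // Hypothetical procedure and parameter names.
        command.Parameters.AddWithValue("@BusinessDate", businessDate);
        command.ExecuteNonQuery();
    }
}
```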
Both senior developers were hot-headed and blamed each other, each claiming that the problem resided in the other person's work.
While the process worked well at the beginning, it had a few drawbacks:
it was never performance-tested against the amount of data that accumulates over a longer period of time;
the design choices were not well thought through by either developer;
the nightly index rebuild was not understood at the time the issue was discovered in Production.
Dynamically created queries aggregating 10 years of financial data
A different bank from the one described above provided business reports to its analysts: the data behind the reports was gathered by a stored procedure that generates its queries dynamically, at run time.
A user selected a report that aggregated financial data over the last 10 years for the maximum number of companies allowed by the application (15).
This particular stored procedure took more than 9 minutes to execute.
When I investigated the problem, I discovered that the dynamically generated statement exceeded the defined limit of 8,000 characters, so the stored procedure was failing in Production without anyone being aware of it.
At the same time, the query performed a few cross joins over tables with hundreds of millions of records.
My opinion is that, at the time this stored procedure was designed, it was expected to process data for a very particular use case: a small number of records over a limited period of time. Over time the requirement changed, and the implemented solution was simply not the right one to accommodate the new request.
Big banks have departments well versed in designing and operating data warehouses, and they design their processes around concepts such as dimensions and fact tables so that all the reporting data is available ahead of time; even so, bad decisions are sometimes pursued, and almost always this is the consequence of not allowing enough time to analyze the requirement and design the proper solution.
If this solution had been tested during the design phase against a data set that mimicked the new requirement's conditions, it would have been evident that a new design approach was needed.
What should the designers have done:
1. When designing the storage, calculate the row's size: look at the size of each field's data type and add them together.
2. Discuss with the BAs and the P.M. and gather the expected number of records coming into your system over a period of time. Calculate the data storage for the first 3, 6, 12 and 24 months to understand how the size of your storage evolves over time (a small sketch of this calculation follows this list).
3. Create a POC and measure performance against a test storage you have built that contains a number of records similar to your peak estimate. Measure the performance based on your application's needs: reads versus writes, anticipated long-running joins, searching a large data set for records that need to be deleted, and so on.
4. Think about how much data your process needs for its regular execution: can you keep just the last 3 months of data in your processing tables and move the older data into separate tables? Can you partition the data by date (yearly, quarterly) or by other fields and make the application use the partitions appropriately?
5. If the incoming data is significant, talk to the DBAs about it. At database creation time, the DBAs can set parameters that control how the database grows by allocating new data pages, how much space is reserved for read or write operations inside a data page, and so on. (You should talk to the DBAs about partitioning as well, since it can be set up at table creation time, requires more specific queries to be written, and includes features less known to developers, such as partition switching, that can dramatically improve performance.)
6. Consider how the data will be used once it is in Production.
Are there other processes that may use it and be affected by its size?
Does the data require regular clean-up via a process that needs to be designed?
Is the data used for reporting or for reconciliation with other systems?
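Returning to points 1 and 2, here is a minimal sketch of such an estimate; the column layout and the monthly volume are invented figures, not numbers from any of the projects described above.

```csharp
using System;

// Point 1: approximate row size, adding up the size of each column's data type.
// The column list is invented for illustration.
const int rowSizeBytes =
      4     // Id            (int)
    + 8     // CreatedOn     (datetime)
    + 9     // Amount        (decimal(18,2))
    + 100   // Description   (nvarchar(50), 2 bytes per character, worst case)
    + 16;   // CorrelationId (uniqueidentifier)

// Point 2: expected ingestion rate, gathered from the BAs and the P.M.
// (an assumed figure here).
const long rowsPerMonth = 2_000_000;

foreach (var months in new[] { 3, 6, 12, 24 })
{
    double gigabytes = (double)rowSizeBytes * rowsPerMonth * months / (1024.0 * 1024 * 1024);
    Console.WriteLine($"{months,2} months: ~{gigabytes:F1} GB of data pages, before indexes and overhead");
}
```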
By asking these questions and implementing the appropriate answers, the intended outcome is a process that runs consistently within the same performance parameters regardless of how much time has passed since the first deployment to Production.
As an example, a process I created in the past for one of my clients ran every night and calculated accrued interest for financial accounts. The process regularly backed up the data after a configurable period of time, and its nightly execution never exceeded 15 minutes, even two years after it was released.
Takeaway: always analyze and plan for data growth in the design phase of your system.
Think beyond "just" making a single system run.
Once a system is up and running in Production, the need to investigate an issue reported by some of your clients will arrive very soon. These clients can be managers inside your own organization who are interested in understanding data trends in real time, for example the Marketing managers.
Distributed systems designed nowadays in the cloud make data flow between separate systems while being stored in different formats. In many cases you will have to think about designing a reconciliation process ahead of time in order to have a good understanding of your process.
Let me give an example: one of the big Canadian online retailers saves some of its data into Azure Table Storage. During the server-side processing, messages are posted to Azure queues; later on, these messages are processed by Azure Functions, which call Salesforce Marketing Cloud APIs. Once these calls are received by SMC, emails and notifications to mobile devices are sent to the users who are interested in the company's products.
The system works well, processing millions of messages posted to the queue over a month.
After some time, the Marketing manager discovers a drop in the data sent to Android mobile devices for a given time frame.
The question is then asked: can the development team explain this discrepancy?
The following systems are at play, working together:
the application's APIs, which save data into Azure Table Storage and post messages to the queues;
the Azure Functions, which process the queues and log messages to Azure Application Insights;
the Salesforce views and queries that allow the Salesforce data to be retrieved;
the SQL Server database that periodically produces reports on the data processing.
Each of the above systems was created by different developers and has a different owner.
The different format of the data stored in each system makes it very difficult to reconcile the data. At the same time, there is a good chance that the data logged in one system does not contain all the information needed to connect it with another system and correctly understand the data's flow.
If the designers of these systems did not collaborate well when they designed them (preferably under the supervision of an architect), there is very little one can do when the Marketing manager makes this request.
What should have happened to provide a better outcome:
The project should have been designed as a whole, under the supervision of a P.M. and an architect, with a unitary vision and goals.
One of the main goals of the implementation should have been the ability to understand, at any given time, what was received by each system and what went out of it.
The process should have included a data reconciliation feature and the ability to produce a few reports displaying the data requested by the managers or investigated by the developers (one possible building block for this is sketched below).
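One practical building block for such a reconciliation feature is a correlation identifier that travels with every message, so that each system can log it next to its own records. The sketch below only illustrates the idea and is not taken from the retailer's implementation: the message shape, queue name and SDK choices (Azure.Storage.Queues, System.Text.Json) are my assumptions.

```csharp
using System;
using System.Text.Json;
using Azure.Storage.Queues;

// Hypothetical message envelope: every system in the chain (API, Azure Function,
// Salesforce call, reporting database) logs the same CorrelationId, so any record
// can be traced end to end when a discrepancy is investigated.
public record NotificationMessage(
    Guid CorrelationId,
    string CustomerId,
    string Channel,          // e.g. "AndroidPush" or "Email"
    DateTimeOffset CreatedUtc);

public static class NotificationPublisher
{
    public static void Publish(string connectionString, NotificationMessage message)
    {
        // The queue name is illustrative only.
        var queueClient = new QueueClient(connectionString, "marketing-notifications");
        queueClient.CreateIfNotExists();

        // The CorrelationId travels inside the message body; each downstream
        // system should persist it alongside its own log entries.
        queueClient.SendMessage(JsonSerializer.Serialize(message));
    }
}
```

When the Marketing manager later asks why notifications to Android devices dropped in a given time frame, each system can then be queried by the same CorrelationId values instead of trying to match records by timestamps and partial identifiers.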
Note: it has to be understood that if you do not take steps to build your own mechanism for understanding your process, the higher environments where you do not have enough permissions to properly investigate an issue (Production) will become a black box to you.
There are 2 ways to think about it:
spend more time, resources and money during the design, implementation and testing phases of the project, or
spend even more time and money when the problems are encountered in the running application.
One of the first companies I worked for in Toronto was building an ERP product 20 years ago; it was estimated at the time that a single bug encountered by a client cost the company $60K.