Issue No. 17   2 October 2008   ISSN: 1532-1886

We encourage you to forward this newsletter to any number of colleagues. If you also CC addme@servicelevel.net, we can provide them a free subscription as well.

 

Preventing Duplicates by Carl Wright

Each day you process thousands of batches holding tens of millions of transactions. You deal with interruptions in the delivery of batches from switches and other network elements, with network links that fail, and with operators who make processing mistakes.

You do this while facing constant pressure to make certain that no transactions go un-billed. It’s vital work. Unprocessed transactions are revenue that disappears straight from your profits. So you want as many transactions as possible! But you want NO DUPLICATE TRANSACTIONS.

Duplicate transactions in your billing files are a mess you don't want. If the duplicates get on the bills, we hope that Billing Quality Control (AKA Revenue Assurance) notices them before the bills are sent to customers. If it doesn’t, then you have a customer service nightmare.

Problems with duplicate records can keep you working long after everyone else has gone home. They can drag you back to work after you thought the day was done.

You don’t need the bill-printing delays that come when Revenue Assurance finds duplicates. Having thousands of customers who have been over-billed is not a good career move.

So our goal is to bill for services once only. We want to make certain that every transaction is unique.

Defining Uniqueness

You probably define a telephone call by its origin phone number, its destination phone number, and the date and time the call started. We assume you can't start two phone calls to the same destination at the same time.
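As a minimal sketch, that uniqueness rule can be expressed as a composite key built from those three fields. The field names below are my own illustrative choices, not taken from any particular switch or CDR format.

    # A composite key that makes one call unique.  Field names are illustrative.
    def call_key(record):
        return (record["origin"],        # calling number
                record["destination"],   # called number
                record["start_time"])    # date and time the call started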

The Direct Approach to Preventing Duplicates

The simplest way to prevent duplicates is to insert every record into a table that has a unique index on the key and let the database reject any record whose key is already present.

This approach is simple and easy, but it’s expensive.

 

Figure 1. Insertion of records into a table with unique keys
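Here is a minimal sketch of that direct approach, using SQLite purely as a stand-in for whatever billing database you actually run; the table, column names, and sample values are all illustrative assumptions.

    import sqlite3

    # Direct approach: a unique key on (origin, destination, start_time), and
    # the database rejects any row whose key already exists.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE calls (
            origin      TEXT NOT NULL,
            destination TEXT NOT NULL,
            start_time  TEXT NOT NULL,
            duration_s  INTEGER,
            UNIQUE (origin, destination, start_time)
        )""")

    def insert_call(record):
        """Insert one call record; return False if it is a duplicate."""
        try:
            conn.execute("INSERT INTO calls VALUES (?, ?, ?, ?)", record)
            return True
        except sqlite3.IntegrityError:
            # The unique index already holds this key, so the insert is refused.
            # The index search done on every insert is the cost discussed below.
            return False

    insert_call(("15551234567", "15559876543", "2008-10-02T09:15:00", 180))  # True
    insert_call(("15551234567", "15559876543", "2008-10-02T09:15:00", 180))  # False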

The problem seems pretty straightforward to solve until you realize that it is a problem equal in complexity and processing cost to sorting the data files. And sorting is a task whose cost grows faster than linearly: adding one more item does not add a fixed amount of work, because each new record must be checked against an ever-larger index.

The problem is that the insert process must first search the table’s index for the record before it can insert it. For example, if you are adding a record to a table that already holds twenty-five million rows, you face the worst-case number of comparisons to determine that the record doesn't already exist; with a balanced binary index that is about log2(25,000,000), roughly 25 key comparisons, and the exact number depends on the structure of the indices used. If you know that you won't have duplicates, you can skip the search and load into the database much faster.

The Worst Case

You have to do the worst-case number of comparisons to determine that something is not there. It seems obvious if you think about it. You don't know that your car keys are not in the living room until you have looked absolutely everywhere in the living room. It takes less time if you find them, because you stop looking when you find them.

This is the impact of preventing duplicates in an indexed table. It's an expensive task.

If the key is long or the comparison is complex, you further increase the cost of the comparison.

Batches, Beautiful Batches! It’s All in the Quanta!

Often in telecommunications the transaction data is collected and delivered in bundles that are controlled by another system. For example, the files or tapes from a telephone switch are promised to be free of duplicates. So long as you can trust those systems, you only need to prevent the duplicate processing of these “quanta” to prevent duplicates at the level of transactions.

When you can rely on the systems that deliver these batches, you eliminate the per-transaction cost of preventing duplicates. You can turn off the indexes and insert your records using your relational database's bulk-insert facility.
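A minimal sketch of batch-level duplicate prevention, assuming each delivered file carries a batch identifier you can trust; the tables and names are illustrative, not any particular mediation product.

    import sqlite3

    # Track the batches already processed; within a trusted batch, rows are
    # bulk-inserted with no per-row duplicate search.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE processed_batches (batch_id TEXT PRIMARY KEY);
        CREATE TABLE calls (origin TEXT, destination TEXT,
                            start_time TEXT, duration_s INTEGER);
    """)

    def load_batch(batch_id, records):
        """Load a whole batch, skipping it if this batch_id was seen before."""
        try:
            conn.execute("INSERT INTO processed_batches VALUES (?)", (batch_id,))
        except sqlite3.IntegrityError:
            return 0                   # this whole batch was already processed
        conn.executemany("INSERT INTO calls VALUES (?, ?, ?, ?)", records)
        return len(records)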

If you are in this situation, Hurrah for you!

Trust, but Verify

Mehran Moghaddam of Ace*Comm Corp., an expert in data collection and mediation, described a strategy that he's used for his customers. He uses the first few records of a batch or file to check the database for duplicates. He's encountered the situation where most batches contain unique transactions, but not all of them. So he checks the contents of new batches against the existing data and accepts the batch if a certain number of the first transactions are shown to be unique. If they aren't, then he knows that the batch contents are untrustworthy.
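The sketch below is my own rough illustration of that kind of spot check, not his implementation; the sample size, the table layout, and the rule "reject if any sampled record is already present" are all assumptions. It reuses the calls table from the sketch under Figure 1.

    def batch_looks_new(conn, records, sample_size=10):
        # Check the first few records of a batch against the existing data.
        # If any sampled record is already present, treat the batch as
        # untrustworthy and hold it for closer inspection.
        for origin, destination, start_time, _ in records[:sample_size]:
            row = conn.execute(
                "SELECT 1 FROM calls WHERE origin=? AND destination=? "
                "AND start_time=?", (origin, destination, start_time)).fetchone()
            if row is not None:
                return False           # saw this transaction before
        return True                    # sample is unique; accept the batch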

4 Ways to Reduce the Cost of Duplicate Prevention

#1

My favorite is divide and conquer. Use it by storing your transactions in smaller tables. If you divide the data into tables by day and by billing cycle, you have smaller tables to search for duplicates. You can divide them by other attributes: city, time of day, whatever works for you. You put the tables back together for queries, billing, etc. You reduce the number of comparisons by partitioning the tables.

This is also attractive for the archiving of data. You can easily set aside tables or files when you want to move older data to archival storage.
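Here is a minimal sketch of the divide-and-conquer idea, partitioning by day; the naming scheme and schema are illustrative assumptions, and a real system might use its database's own partitioning features instead of hand-made tables.

    import sqlite3

    # One table per calendar day, so the duplicate check only searches that
    # day's much smaller index.  Archiving a day is just setting its table aside.
    conn = sqlite3.connect(":memory:")

    def table_for(day):                # e.g. day = "2008-10-02"
        name = "calls_" + day.replace("-", "")
        conn.execute(f"""
            CREATE TABLE IF NOT EXISTS {name} (
                origin TEXT, destination TEXT, start_time TEXT, duration_s INTEGER,
                UNIQUE (origin, destination, start_time)
            )""")
        return name

    def insert_call(record):
        day = record[2][:10]           # start_time begins with YYYY-MM-DD
        try:
            conn.execute(f"INSERT INTO {table_for(day)} VALUES (?, ?, ?, ?)",
                         record)
            return True
        except sqlite3.IntegrityError:
            return False               # duplicate within that day's partition

For billing runs and reporting you put the daily tables back together, for example with a UNION ALL view.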

#2

Another technique reduces the amount of work to do a comparison. You do this by reducing the keys to binary values whenever possible. Your computer compares two phone numbers stored as integers much more quickly than two strings of text characters representing the phone numbers.
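As a rough illustration of the point (the normalization rule here is an assumption that would need to match your numbering plan), phone numbers and timestamps can be packed into integers before they are used as keys:

    # Pack a phone number into an integer so key comparisons are single
    # machine-word operations instead of character-by-character string walks.
    # The digits-only normalization is an illustrative assumption.
    def phone_to_int(number):
        return int("".join(ch for ch in number if ch.isdigit()))

    key = (phone_to_int("+1 (555) 123-4567"),   # -> 15551234567
           phone_to_int("555-987-6543"),        # ->  5559876543
           20081002091500)                      # call start as YYYYMMDDhhmmss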

#3

Another technique for speeding the detection of duplicates during the insertion step is to reduce the disk I/O for the duplicate search. Most indexing schemes (database and ISAM) provide techniques to get the index data into memory.

Getting the indices in memory makes a huge difference in the time to insert records. It can easily make it run 100 times faster.
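Databases usually do this with a larger buffer or index cache so that index pages stay in RAM. The sketch below shows the same idea at the application level, as my own illustration rather than a database feature: load the existing keys into memory once, then answer the duplicate question without touching disk. It continues from the calls table and connection in the sketch under Figure 1.

    # Load the existing keys into an in-memory set once, so the duplicate
    # check itself never reads the disk.  The cost is holding every key in RAM.
    seen = set(conn.execute("SELECT origin, destination, start_time FROM calls"))

    def insert_if_new(record):
        key = record[:3]               # (origin, destination, start_time)
        if key in seen:
            return False               # duplicate found without any disk read
        seen.add(key)
        conn.execute("INSERT INTO calls VALUES (?, ?, ?, ?)", record)
        return True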

#4

The index structure can also be optimized to reduce the number of key comparisons from the worst case of a binary search. For example, COBOL's ISAM allows you to have blocks of records with one key entry. This entry tells you that every record from value A to value B is in a certain block. If your blocks hold 100 records, then an index for 100 million records will only have 1 million index keys. You've reduced the size of the key search by a factor of 100.

The block keys themselves are layered on top of the transaction keys, giving n-ary nodes instead of the worst-case binary nodes. For example, the next layer of keys can also be grouped so that a "key block" holds 100 keys. This grouping of keys further reduces the number of disk reads needed to find the right record block. This is technology from the 1970s, so you can be sure that today's relational databases use these techniques as well.
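Here is a rough, single-level sketch of that kind of blocked (sparse) index; a real ISAM or B-tree implementation would also split full blocks and add further key layers, which are omitted here for brevity.

    import bisect

    BLOCK = 100                        # records per block
    blocks = []                        # sorted keys, grouped into blocks
    block_highs = []                   # highest key in each block: the index

    def contains(key):
        # One binary search over the block keys, then a search in one block.
        i = bisect.bisect_left(block_highs, key)
        return i < len(blocks) and key in blocks[i]

    def insert(key):
        if contains(key):
            return False               # duplicate
        i = bisect.bisect_left(block_highs, key)
        if i == len(blocks):           # key is larger than everything so far
            if not blocks or len(blocks[-1]) >= BLOCK:
                blocks.append([])
                block_highs.append(key)
            i = len(blocks) - 1
        bisect.insort(blocks[i], key)
        block_highs[i] = blocks[i][-1]
        return True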

Summary

The process of posting transactions for billing is often a business bottleneck. The requirement to prevent duplicate transactions on the bills is unavoidable. If you are lucky, you can depend on the duplicate prevention of other systems. When you can't use larger quanta (batches, tapes, files, ...) to check for duplicates, you must check each transaction individually.

You can use multiple strategies to reduce the costs of preventing duplicates in your customer bills.

Tell Me What You Want To Hear About

The subjects that I cover in Rating Matters are driven by my personal interests in rating and billing, and they are limited by the breadth of my personal experience. Please let me know what topics you'd like to hear about or see explored further. Send me your requests at .

 

©Copyright 2008 Service Level LLC
Rating Matters is a trademark of Service Level LLC