Thorough and careful preparation is essential to the success of any data mining project. Naturally, the very first step is to determine the project's objective. This will not usually be much of a problem.
Planning the data that will be needed and taking stock of the data currently available will prove much more difficult. Even gaining an overview of the existing data is more often than not considerably impeded by inadequate data record descriptions. This is no exaggeration: those responsible for data management are often unaware of the problem, so it shows itself to be a real and serious impediment only when the data mining project is already in full swing. It is therefore of the utmost importance to clarify the exact state of the data stock as early as possible.
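A first stocktaking can be as simple as counting, for each column of an export, how often it is filled and how many distinct values it holds. The following is a minimal sketch only, assuming the data can be exported as CSV text; the function name `profile_columns` and the sample columns (`customer_id`, `birth_year`, `segment`) are hypothetical and chosen purely for illustration.

```python
import csv
import io

# Hypothetical sample export; real column names will differ.
SAMPLE = """customer_id,birth_year,segment
1,1970,A
2,,A
3,1985,
"""

def profile_columns(csv_text):
    """Return the fill rate and distinct-value count for each CSV column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    stats = {name: {"filled": 0, "distinct": set()} for name in fieldnames}
    total = 0
    for row in reader:
        total += 1
        for name in fieldnames:
            # Treat missing and whitespace-only cells as empty.
            value = (row[name] or "").strip()
            if value:
                stats[name]["filled"] += 1
                stats[name]["distinct"].add(value)
    return {
        name: {
            "fill_rate": s["filled"] / total if total else 0.0,
            "distinct": len(s["distinct"]),
        }
        for name, s in stats.items()
    }

if __name__ == "__main__":
    for column, info in profile_columns(SAMPLE).items():
        print(column, info)
```

Columns with a low fill rate or an unexpected number of distinct values are the ones whose record descriptions are worth checking first.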
The lack of an adequate data warehouse is another problem. True, access to the operational systems is always possible, yet it has a grave disadvantage: as a rule, no customer histories are available there, which restricts most data mining activities to a great extent. Establishing a data warehouse should therefore normally be regarded as a prerequisite for a successful data mining project, even if it is not strictly necessary in every case.
Where records of customer histories are not available, you should start assembling them as early as possible. The easiest way to do this is to search the relevant backups of the operational systems. Where this is not possible, you will at least need to start collecting these data at once, and you will be well advised to do so using a real data warehouse.
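Assembling histories from backups amounts to lining up dated snapshots of the same customer record and keeping every version that differs from the previous one. The sketch below illustrates the idea under stated assumptions: each backup has already been loaded into a dictionary keyed by customer id, snapshot dates are ISO strings (so they sort chronologically), and the name `build_history` is hypothetical.

```python
def build_history(snapshots):
    """Build per-customer histories from dated snapshots.

    `snapshots` is a list of (iso_date, {customer_id: record_dict}) pairs.
    A new history entry is stored only when the record has changed.
    """
    history = {}
    # ISO dates such as "2001-03-31" sort correctly as plain strings.
    for snapshot_date, customers in sorted(snapshots, key=lambda s: s[0]):
        for customer_id, record in customers.items():
            versions = history.setdefault(customer_id, [])
            if not versions or versions[-1][1] != record:
                versions.append((snapshot_date, dict(record)))
    return history

if __name__ == "__main__":
    # Hypothetical backups, deliberately given out of order.
    backups = [
        ("2001-06-30", {1: {"city": "Hamburg"}, 2: {"city": "Bonn"}}),
        ("2001-03-31", {1: {"city": "Berlin"}, 2: {"city": "Bonn"}}),
    ]
    for customer_id, versions in build_history(backups).items():
        print(customer_id, versions)
```

Customers whose records never change keep a single entry, so the history grows only with actual changes rather than with the number of backups searched.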
It is a good idea to examine the correctness of the existing customer data carefully. Faulty data will prove disruptive during the first data mining analyses at the latest, and should therefore be identified and, if possible, removed beforehand.
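Such a check can begin with a handful of plausibility rules run over every record. The following is a minimal sketch, not a complete validation scheme: the rules (non-empty and unique `customer_id`, `birth_year` between 1900 and the current year), the field names and the function name `find_faulty_records` are all assumptions made for illustration.

```python
def find_faulty_records(records, current_year):
    """Return (index, reason) pairs for records that break simple rules."""
    problems = []
    seen_ids = set()
    for index, record in enumerate(records):
        customer_id = record.get("customer_id")
        if not customer_id:
            problems.append((index, "missing customer_id"))
        elif customer_id in seen_ids:
            problems.append((index, "duplicate customer_id"))
        else:
            seen_ids.add(customer_id)

        # A missing birth year is tolerated here; only implausible values are flagged.
        birth_year = record.get("birth_year")
        if birth_year is not None and not (1900 <= birth_year <= current_year):
            problems.append((index, "implausible birth_year"))
    return problems

if __name__ == "__main__":
    sample = [
        {"customer_id": "a", "birth_year": 1970},
        {"customer_id": "a", "birth_year": 1980},
        {"customer_id": "", "birth_year": 2990},
        {"customer_id": "b", "birth_year": None},
    ]
    for index, reason in find_faulty_records(sample, current_year=2024):
        print(index, reason)
```

The function only reports suspect records; whether a flagged record is corrected or removed remains a decision for whoever knows the data.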
The customer data available, ideally in a data warehouse, should be as complete as possible. Sorting out seemingly unnecessary data is not advisable! Which data will in fact be needed for later analyses can be known no sooner than at the time of the analysis itself. A capable analysis tool such as Score™ 4.0 finds all data relevant to the analysis on its own and automatically discards the irrelevant data.