What do we mean by "massive" integration?
--> Processing many thousands of flows (for example, between 1,000 and 10,000 XML files) in a single overall integration process.
Problem
Sometimes we have to integrate huge quantities of flows within limited time windows, for example because the target databases (regular backups, etc.) or the ODI agents are only available for a limited time.
Principle of "classical" integration
This means having a fully separate (dedicated) Load / Control / Transform / Integrate process chain for each flow.

Principle of "optimized" integration
This principle is based on 2 core concepts:
- Favor executing a small number of interfaces over a large number of rows
- Choose the work-table model judiciously

To know which file each block of data belongs to, it is imperative to add an extra field to each work table and each collection table, into which a technical identifier (the identifier of the source file) is inserted.
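As a minimal sketch of this idea, the snippet below loads rows from several files into one shared work table and tags each row with a technical FILE_ID. The table name C$_CUSTOMER, the column names, and the use of SQLite are illustrative assumptions, not the actual ODI-generated schema:

```python
import sqlite3

# Shared work table where every row carries a technical FILE_ID, so rows
# from many files can be processed together yet traced back to their file.
# (C$_CUSTOMER is a hypothetical work-table name for illustration.)
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE C$_CUSTOMER (
        FILE_ID   INTEGER,  -- technical identifier of the source file
        CUST_NAME TEXT
    )
""")

# Rows parsed from several XML files, loaded in one pass instead of one
# Load step per file; the file identifier is assigned as each file is read.
parsed = {1: ["Alice", "Bob"], 2: ["Carol"]}
conn.executemany(
    "INSERT INTO C$_CUSTOMER (FILE_ID, CUST_NAME) VALUES (?, ?)",
    [(fid, name) for fid, names in parsed.items() for name in names],
)

# Downstream Control / Transform / Integrate steps then run once over all
# rows, filtering or grouping by FILE_ID when per-file logic is needed.
counts = dict(conn.execute(
    "SELECT FILE_ID, COUNT(*) FROM C$_CUSTOMER GROUP BY FILE_ID"
))
print(counts)  # → {1: 2, 2: 1}
```

The key design point is that the extra column costs almost nothing at load time but lets every shared downstream step stay file-aware.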
Why minimize the number of interfaces?
When an interface runs, the time spent synchronizing with all the datastores (sometimes on different media and different technologies) is very large compared to the time spent processing the data itself (the purely Oracle work).
Thus, it is better to run 30 interfaces each processing 10,000 rows than 300 interfaces each processing 1,000 rows: that alone saves 270 synchronization phases.
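The trade-off can be sketched with a simple cost model. The figures below (30 s of synchronization per interface run, 1 ms of processing per row) are assumptions chosen only to show the shape of the trade-off, not real ODI measurements:

```python
# Illustrative cost model: total = interfaces * (sync_cost + rows * row_cost).
# 30 s per synchronization and 1 ms per row are assumed figures.
SYNC_COST_MS = 30_000   # synchronization overhead per interface run
ROW_COST_MS = 1         # pure Oracle processing time per row

def total_time_ms(interfaces: int, rows_per_interface: int) -> int:
    return interfaces * (SYNC_COST_MS + rows_per_interface * ROW_COST_MS)

few_big = total_time_ms(30, 10_000)     # 30 runs of 10,000 rows each
many_small = total_time_ms(300, 1_000)  # 300 runs of 1,000 rows each

# Both plans process 300,000 rows; the second pays 270 extra sync phases.
print(few_big, many_small, (many_small - few_big) // SYNC_COST_MS)
# → 1200000 9300000 270
```

Whatever the exact per-row cost, the row-processing term is identical in both plans; only the synchronization term grows with the number of interface runs.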
Why optimize? Is it profitable?
With this simple technique of grouping the Load step and pooling the Control / Transform / Integrate steps, integration time can improve by up to a factor of 10!
For example, if a file takes about 30 seconds to integrate in "classic" mode, it will certainly take fewer than 10 in optimized mode. Over 10,000 files, the time saved can exceed 2 days of integration!
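A back-of-the-envelope check of that claim, using the timings quoted above (about 30 s per file in "classic" mode, under 10 s in optimized mode):

```python
# Time saved over 10,000 files, using the per-file timings from the text.
classic_s = 30     # ~30 s per file in "classic" mode
optimized_s = 10   # under 10 s per file in optimized mode (upper bound)
files = 10_000

gain_days = files * (classic_s - optimized_s) / 86_400  # 86,400 s per day
print(round(gain_days, 1))  # → 2.3
```

Since 10 s is an upper bound for the optimized case, the real saving is at least about 2.3 days.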
For Oracle itself, processing 1,000 or 10,000 rows makes little difference; however, synchronizing the source and target datastores 300 times instead of 30 considerably increases the time required, especially when the source datastores are XML files, since the JDBC driver does not offer the same performance as the Oracle engine.
In which cases is the optimization most interesting?
Technically, the larger the source file is and the more elements (XML tags) it has with cardinality 0..* or 1..*, the more interesting the optimization will be. If, however, the file is small and/or its data do not show much multiplicity, the benefit will be far less obvious.
It is more interesting to have to integrate 10,000 files of 1 KB than 50 files of 200 KB, because the time saving comes primarily from pooling the collection processing across files. Whether 50 or 1,000 files have to be integrated, if 20 interfaces are enough to integrate just one of these files, then the same 20 interfaces will also handle the 1,000 files ...