Friday, November 15, 2019
Ssis Is An In Memory Pipeline Computer Science Essay
Ssis Is An In Memory Pipeline Computer Science Essay Since SSIS is an in-memory pipeline, one has to ensure that transactions occur in the memory for performance benefits. To check if your package is staying within memory limits, one should review the SSIS performance counter Buffers spooled. This has an initial value of 0. Any value above 0 is an indication that the engine has started disk-swapping activities. Capacity planning to understand resource utilization In order to understand resource utilization it is very important to monitor CPU, Memory, I/O and Network utilization of the SSIS package. CPU It is important to understand how much CPU is being utilized by SSIS and how much of CPU is being utilized by overall SQL Server while Integration Services is running. This latter point is very important, especially if you have SSIS and SQL Server on the same box, because if there is resource contention, SQL Server will surely win that will result into disk spilling from Integration Services resulting in slower transformation speed. The performance counter that should be monitored is Process / % Processor Time (Total). One should measure this counter for both sqlservr.exe and dtexec.exe. If SSIS is not close to 100% CPU load, then this indicates: Application contention For e.g. SQL Server takes more processor resources, makes it unavailable for SSIS Hardware contention Probably a suboptimal disk I/O or not enough memory to handled the amount of data to be processed Design limitation The SSIS design is not making use of parallelism, and/or the package has too many single-threaded tasks Network SSIS moves data as fast as your network is able to handle it. Hence, it is important to understand your network topology and ensure that the path between the source and destination have both low latency and high throughput. Following performance counters can help you tune the topology: Network Interface / Current Bandwidth Provides estimate of current bandwidth Network Interface / Bytes Total/Sec The rates at which bytes are sent and received on each network adapter Network Interface / Transfers/Sec How many network transfers per second are occurring. If the number is close to 40,000 IOPs, then get another NIC card and use teaming between the NIC cards Input / Output (I/O) A good SSIS package should hit the disk only when it reads from the sources and writes back to the target. But if the I/O is slow, reading and especially writing can create a bottleneck. So it is very important to understand that the I/O system is not only specified in size (like 1 TB, 2 TB) but also its sustainable speed (like 20,000 IOPs). Memory The key counters to monitor memory for SSIS and SQL Server are as follows: Process / Private Bytes (DTEXEC.EXE) amount of memory currently used by Integration Services that cannot be shared with other processes Process / Working Set (DTEXEC.EXE) amount of allocated memory by Integration Services SQL Server: Memory Manager / Total Server Memory amount of allocated memory for SQL Server. This counter is the best indicator of total memory used by SQL, because SQL Server has another way to allocate memory using the AWE API Memory / Page Reads/sec total memory pressure on the system. If this consistently goes above 500, it is an indication that the system is under memory pressure Baseline Source System Extract Speed It is important to understand the source system and the speed at which data can be extracted from it. Measure the speed of the source system by creating a simple package that reads data from some source with the destination that says Row Count Execute this package from the command line and measure the time it took for it to complete the task. Using Integration Services log output, you can measure the time taken. Formula to be used: Rows/Sec = RowCount / Time Based on the above value, you can judge the maximum number of rows per second that can be read from the source. To increase the Rows/Sec calculation, you can perform one of the following operations: Improve drivers and driver configurations: Ensure you are using the up-to-date driver configurations for the network, data source and disk I/O. Start multiple connections: To overcome limitations of drivers, you can start multiple connections to your data source. If the source is able to handle many concurrent connections, the throughput will increase if you start several extracts at once. If concurrency causes locking or blocking issues, consider partitioning the source having your packages read from different partitions to more evenly distribute the load Use multiple NIC cards: If network is the bottleneck and you have ensured you are using gigabit network cards and routers, then a potential solution is to use multiple NIC cards per server. Optimize SQL data source, Lookup transformations and Destination Here are some optimization tips that you can implement in your SSIS packages: Use NOLOCK or TABLOCK hints to remove locking overhead Refrain from using SELECT * in SQL queries. Mention each column name in the SELECT clause for which data needs to be retrieved If possible, perform datetime conversions at source or target databases In SQL Server 2008 Integration Services, there is a new feature of shared lookup cache. During the use of parallel pipelines, it provides high-speed, shared cache If Integration Services and SQL Server run on the same box, use SQL Server destination instead of OLE DB Commit size 0 is fastest on heap bulk targets. If you cannot use 0, use the highest possible value of commit size to reduce overhead of multiple-batch writing. Commit size = 0 is bad while inserting into BTree because all incoming rows must be sorted at once into the target BTree, and if the memory is limited, there is a likelihood of spill. Batchsize=0 is ideal for inserting into a heap. Please note that a commit size value of 0 might cause the running package to stop responding if the OLE DB destination and another data flow component are updating the same source table. To ensure that the package does not stop, set the maximum insert commit size option to 2147483647 Use a commit size of Heap inserts are typically faster than using a clustered index. This means it is recommended to drop and rebuild all the indexes if there is a large part of the destination table getting changed. Use partitions and partition SWITCH command. In other words load a work table that contains single partition and SWITCH it into the main table after the indexes are build and then put the constraints on Network tuning Packet size is the main property of the network that needs to be monitored / looked at in order to take decisions for Network tuning. By default this value is set to 4,096 bytes. As noted in SqlConnection.PacketSize property in .Net Framework Class Library, when the packet size is increased, it will improve performance because fewer network read and write operations are required to transfer a large data set. If your system is transactional in nature, lowering the value will improve the performance. Another network tuning technique is to use network affinity at the operating system level to increase the performance at high throughputs. Use Data Type wisely Following are some best practices related to usage of data types: Define data types as narrow as possible Do not perform excessing casting of data types. Match your data types to the source or destination and explicitly specify data type casting Take care of precision when using money, float and decimal data types. Money data type is always faster than decimal and has fewer precision considerations than float. Change the design Following are some best practices related to SSIS design: Do not SORT within Integration Services unless absolutely necessary. In order to sort the data Integration Services allocates memory space for the entire data set that needs to be transformed. Preferably, presort the data before hand. Another way to sort the data is by using ORDER BY clause to sort large data in the database. There are times where using Transact-SQL will be faster than processing the data in SSIS. Generally all set-based operations will perform faster in Transact-SQL because the problem can be transformed into a relational algebra formulation that SQL Server is optimized to resolve. Set-based UPDATE statements these are more efficient than row-by-row OLE DB calls Aggregation statements like GROUP BY and SUM are also calculated faster using T-SQL instead of in-memory calculations by a pipeline Delta detection is a technique where you change existing rows in the target table instead of reloading the table. To perform delta detection, one can change detection mechanism such as the new SQL Server 2008 Change Data Capture (CDC) functionality. As a rule of thumb, if the target table has changed > 10 %, it is often faster to simply reload than to perform the delta detection Partition the problem For ETL design, partition source data into smaller chunks of equal size. Here are some more partitioning tips: Use partitioning on your target table. Multiple versions of the same package can be executed in parallel to insert data into different partitions of the same table. The SWITCH statement should be used during partitioning. It not only increases parallel load speed, but also allows efficient transfer of data. As implied above, the package should have a parameter defined that specifies which partition should it work on. Minimize logged operations If possible, used minimal logged operations while inserting data into your target SQL Server database. When data is inserted into a database in fully logged mode, the size of the log grows quickly, because each row that is written in the database is also written to the log. Therefore, consider the following while designing SSIS packages: Try to perform data flows in bulk mode instead of row by row. This will help minimize the number of entries to the log file. This eventually results into less disk I/O hence improving the performance If for any reason you need to delete data, organize the data in such a way that you can use TRUNCATE instead of DELETE. The later places an entry of each row that is deleted into the log file. The former will delete all the data and just put one entry into the log file If for any reason partition need to be move around, use the SWITCH statement. This is a minimally logged operation If you use DML statements along with your INSERT statements, minimum logging is suppressed. Schedule and distribute it correctly Good way to handle execution is to create a priority queue for your package and then execute multiple instances of the same package (with different partition parameter values). This queue can be a simple SQL Server table. A simple loop in the control flow should be a part of each package to: Pick a relevant chunk from the queue Relevant means that is not already been processed and that all chunks it depends on have already executed Exit the package if no item is returned from the queue Perform work required on the chunk Mark the chunk as done in the queue Return to the start of the loop Picking an item from the queue and marking it as done can be implemented as a stored procedure. Once you have the queue in place, you can simple start multiple copies of DTEXEC to increase parallelism. Keep it simple Unnecessary use of components should be avoided. Here is one of the way to avoid it: Step 1: Declare the variable varServerDate Step 2: Use ExecuteSQLTask in the control flow to execute a SQL query to get the server datatime and store it in the variable Step 3: Use the dataflow task and insert/update database with the server datatime from the variable varServerDate This sequence is advisable only in cases where the time difference from Step 2 to Step 3 really matters. If that does not matter, then just use the getdate() command at Step 3 as shown below: Create table #Table1(t_ID int, t_date datetime) Insert into #Table1(t_ID, t_date) values(1, getdate()) Executing a child package multiple times from a parent with different parameter values While executing a child package from a master package, parameters that are passed from the master package should be configured in the child package. Use the Parent Package Configuration option in the child package to implement this feature. But for using this option, you need to specify the name of the Parent Package Variable that is passed to the child package. If there is a need to call the same child package multiple times (each time with a different parameter value), declare the parent package variables (with the same name as given in the child package) with a scope limited to Execute Package Tasks. SSIS allows declaring variables with the same name but the scope limited to different tasks all inside the same package. SQL Job with many atomic steps For the SQL job that calls the SSIS packages, create multiple steps, each performing small tasks rather than one step that performs all the tasks. Creating one big step, the transaction log grows too big and if a rollback takes place, it make take the full processing space of the server. Avoid unnecessary typecasts Avoid unnecessary typecasts. For e.g., flat file connection manager, be default, uses the string [DT-STR] data type for all columns. You will have to manually change it, if there is a need to use the actual data type. It is always a good option to change it at the source-level itself to avoid unnecessary type casting. Transactions Usually, ETL processes handle large volume of data. In such scenarios, do not attempt a transaction on the whole package logic. SSIS does support transactions, and it is advisable to use transactions. Distributed transaction that span across multiple tasks The control flow of an SSIS package threads together various control tasks. In SSIS it is possible to set a transaction that can span into multiple tasks using the same connection. To enable this, set value of the retainsameconnection property of the Connection Manager to true Limit the package name to maximum of 100 characters When a SSIS package with a package name exceeding 100 characters is deployed in SQL Server, it trims the package name to 100 characters, which may cause an execution failure. SELECT * FROM Do not pass any unnecessary columns from the source to the destination. With the OLEDB connection manager source, using the Table or View data access mode is equivalent to SELECT * FROM tablename, which will fetch all the columns. Use SQL Command to fetch only required columns and pass that to the destination. Excel source and 64-bit runtime Excel Source or Excel Connection manager works only with the 32-bit runtime. Whenever a package that uses Excel Source is enabled for 64-bit runtime (by default, this is enabled), it will fail on the production server using the 64-bit runtime. Go to solution property pages debugging and set Run64BitRuntime to FALSE. On failure of a component, stop / continue the execution with the next component When a component fails, the property failParentonFailure can be effectively used either to stop the package execution or continue with the next component execution in the sequence container. The constraint value connecting the components in the sequence should be set to Completion. Also the failParentonFailure property should be set to FALSE. Protection To avoid most of the package deployment error from one system to other, set the package protection level to DontSaveSensitive Copy pasting script component Once you copy-paste a script component and execute the package, it may fail. As a work-around, open the script editor of the pasted script component, save the script and then execute the package. Configuration filter Use as a filter As a best practice use the package name as the configuration filter for all the configuration items that are specific to a package. This is typically useful when there are so many packages with package specific configuration items. Use a generic name for configuration items that are general to many packages. Optimal use of configuration records Avoid using the same configuration item recorded under different filter / object name. For e.g. there should be only one configuration record created if two packages are using the same connection string. This can be achieved by using the same name for the connection manager in both the packages. This is quite useful at the time of porting from one environment to other (like UAT to Prod). Pulling High Volume data Process of pulling high volume is represented in the following flowchart: The recommendation is to consider dropping all indexes from the target tables if possible before inserting data especially when the volume inserts are high. Effect of OLEDB Destination Settings Certain settings with OLEDB destination will impact the performance of the data transfer. Lets look at some of them: Data Access Mode This setting provides fast load option, which internally uses BULK INSERT statement for uploading data into the destination table. Keep Identity By default this setting is unchecked which means the destination table (if it has an identity column) will create identity values on its own. On checking this setting, the dataflow engine will ensure that the source identity values are preserved and same value is inserted into the destination table. Keep NULLs By default this setting is unchecked which means default value will be inserted (if the default constraint is defined on the target column) during INSERT into the destination table if NULL value is coming from the source for that particular column. On checking this option, the default constraint on the destination tables column will be ignored and preserved NULL of the source column will be inserted into the destination column. Table Lock By default this setting is checked and the recommendation is to let it be checked unless the same table is being used by some other process at the same time. Check Constraints By default this setting is checked and recommendation is to have it unchecked if you are sure the incoming data is not going to violate constraints of the destination table. This setting indicates that the dataflow pipeline engine will validate the incoming data against the constraints of target table. Performance of data load can be improved by unchecking this option. Effects of Rows per Batch and Maximum Insert Commit Size settings Rows per batch The default value for this setting is -1 which means all incoming rows will be treated as a single batch. If required you can change this to a positive integer value to break all incoming rows into multiple batches. The positive integer value will represent the total number of rows in a batch Maximum insert commit size Default value for this setting is 2147483647 which means all incoming rows will be committed once on successful completion. If required, you can change this positive integer to any other positive integer number that would represent that the commit will be done for those specified number of records. This might put an overhead on the dataflow engine to commit several times, but on the other side it will release the pressure on the transaction log and save tempdb from growing tremendously especially during high volume data transfers. The above two settings are mainly focused on improving the performance of tempdb and transaction log. Avoid Synchronous/Asynchronous transformations While executing the package, SSIS runtime engine executes every task other than data flow task in defined sequence. On encountering a data flow task the execution of the data flow task is taken over by the data flow pipeline engine. The dataflow pipeline engine then breaks the execution of the data flow task into one ore more execution tree(s). It may also execute these trees in parallel to achieve high performance. To make things a bit clearly, here is what an Execution Tree means. An Execution tree starts at a source or an asynchronous transformation and ends at a destination or first asynchronous transformation in the hierarchy. Each tree has a set of allocated buffer and scope of these buffers is associated to this tree. Also in addition to this every tree is allocated an OS thread (worker-thread) and unlike buffers other execution tree may share this thread. Synchronous transformation gets a record, processes it and passes it to the other transformation or destination in the sequence. The processing of a record does not dependent on the other incoming rows. Since synchronous transformations output the same number of rows as the input, it does not require new buffers to be created and hence is faster in processing. For e.g., in the Derived column transformation, a new column gets added in each incoming row, without adding any additional records to the output. In case of asynchronous transformation, different number of rows can be created than the input requiring new buffers to be created. Since an output is dependent on one or more records it is called blocking transformation. It might be partial or full blocking. For e.g., the Sort Transformation is a fully blocking transformation as it requires all the incoming rows to arrive before processing. Since the asynchronous transformation requires additional buffers it performs slower than synchronous transformations. Hence asynchronous transformations must be avoided wherever possible. For e.g. instead of using Sort Transformation to get sorted results, use ORDER BY clause in the source itself. Implement Parallel Execution in SSIS Parallel execution in allowed by SQL Server Integration Services (SSIS) in two different ways by controlling two properties mentioned below: MaxConcurrentExecutables this property defines how many tasks (executable) can run simultaneously. This property defaults to -1, which is translated to the number of processors plus 2. In case, hyper-threading is turned on in your box, it is the logical processor rather than the physically present processor that is counted.à For e.g. we have a package with 3 Data Flow tasks where every task has 10 flows in the form of OLE DB Source -> SQL Server Destination. To execute all 3 Data Flow Tasks simultaneously, set the value of MaxConcurrentExecutables to 3. The second property named EngineThreads controls whether all 10 flows in each individual Data Flow Task get started concurrently. EngineThreads this property defines how many work threads the schedule will create and run in parallel. The default value for this property is 5. In the above example, if we set the EngineThreads to 10 on all 3 Data Flow Tasks, then all the 30 flows will start at the same time. One thing we want to be clear about EngineThreads is that it governs both source threads (for source components) and work threads (for transformation and destination components). Source and work threads are both engine threads created by the Data Flows scheduler. Looking back at the above example, setting a value of 10 for Engine Threads means up to 10 source and 10 work threads each. In SSIS, we dont affinitize the threads that we create to any of the processors. If the number of threads surpasses the number of available processors, it might hurt the throughput due to an excessive amount of context switches. Package restart without losing pipeline data SSIS has a cool feature called Checkpoint. This feature allows your package to start from the last point of failure on next execution. You can save a lot of time by enabling this feature to start the package execution from the task that failed in the last execution. To enable this feature for your package set values for three properties CheckpointFileName, CheckpointUsage and SaveCheckpoints. Apart from this you should also set FailPackageOnFailure property to TRUE for all tasks that you want to be considered in restarting. By doing this, on failure of that task, the package fails and the information is captured in the checkpoint file and on subsequent execution, the execution starts from that tasks. It is very important to note that you can enable a task to participate in checkpoint including data flow task but it does not apply inside the data flow task. Lets consider a scenario, where you have a data flow task for which you have set FailPackageOnFailure property to TRUE to participate in checkpoint. Lets assume that inside the data flow task there are five transformations in sequence and the execution fails at 5th transformation (assumption is that earlier 4 transformations complete successfully). On the following execution instance, the execution will start from the data flow task and the first 4 transformations will run again before coming to 5th one. It is worth noting below points. For loop and for each loop do not honor Checkpoint. Checkpoint is enabled at only control flow level and not at data level, so regardless of checkpoint the package will execute the control flow/data flow from the start in a case of restart. If package fails, checkpoint file, all server configurations and variables values are stored and also point of failure. So if package restarted, it takes all configuration values from checkpoint file. During failure you cannot change the configuration values. Best practices for logging Integration Services includes logging features that write log entries when run-time events occur and can also write custom messages. Logging, to help you in auditing and troubleshooting a package every time it is run, can capture run-time information about a package. For e.g., name of the operator who ran the package and the time the package began and finished can be captured in the log. Logging (or tracing the execution) is a great way of diagnosing the problem occurring during runtime. This is especially very useful when your code does not work as expected. Not only that, SSIS allows you to choose different events of a package and components of the packages to log as well as the location where the log information is to be written (text files, SQL Server, SQL Server Profiler, Windows Events, or XML files). The logging saves you from several hours of frustration that you might get while finding out the causes of problem if you are not using logging, but the story doesnt end here. Its true, it helps you in identifying the problem and its root cause, but at the same time its an overhead for SSIS that ultimately affects the performance as well, especially if you are excessively using logging. So the recommendation here is to use logging in a case of error (OnError event of package and containers) . Enable logging on other containers only if required, you can dynamically set the value of the LoggingMode property (of a package and its executables) to enable or disable logging without modifying the package. You can create your own custom logging which can be used for troubleshooting, package monitoring, ETL operations performance dashboard creation etc. However the best approach is to use the built-in SSIS logging where appropriate and augment it with your own custom logging. A normal custom logging can provide all the information you need as per requirement. Security audit and data audit is out of scope of this document. To help you understand which bulk load operations will be minimally logged and which will not, the following table lists the possible combinations. Table Indexes Rows in table Hints Without TF 610 With TF 610 Concurrent possible Heap Any TABLOCK Minimal Minimal Yes Heap Any None Full Full Yes Heap + Index Any TABLOCK Full Depends (3) No Cluster Empty TABLOCK, ORDER (1) Minimal Minimal No Cluster Empty None Full Minimal Yes (2) Cluster Any None Full Minimal Yes (2) Cluster Any TABLOCK Full Minimal No Cluster + Index Any None Full Depends (3) Yes (2) Cluster + Index Any TABLOCK Full Depends (3) No (1) It is not necessary to specify the ORDER hint, if you are using the INSERT à ¢Ã¢â ¬Ã ¦ SELECT method, but the rows need to be in the same order as the clustered index. While using BULK INSERT it is necessary to use the ORDER hint. (2) Concurrent loads are only possible under certain conditions. Only rows those are written to newly allocated pages are minimally logged. (3) Based on the plan chosen by the optimizer, the non-clustered index on the table may either be fully- or minimally logged. Best practices for error handling There are two methods of extending the logging capability, Build a custom log provider Use event handlers We can extent SSISs event handler for error logging. We can capture error on OnError event of package and let package handle it gracefully. We can capture actual error using script task and log it in text file or in a SQL server tables. You can capture error details using system variables System::ErrorCode, System::ErrorDescription, System::SourceDescription etc. If you are using custom logging, log the error in same table. In some cases you may wish to ignore it or handle the error at container level or in some cases at task level. Event handlers can be attached to any container in the package and that event handler will catch all events raised by that container and any child containers of that container. Hence, by attaching an event handler to the package (which is parent container) we can catch all events raised of that event type by every container in the package. This is powerful because it saves us from building event handlers for each task in the package. A container has an option to opt out of having its events captured by an event handler. Lets say, you had a sequence container for which you didnt find it important to capture events, you can then simply switch them off using the sequence containers DisableEventHandlers property. If are looking to capture only certain events of that sequence task by an event handler, you could control this using the System::Propogate variable. We recommend you to use se
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.