How is pricing calculated
There are 3 main factors which combine to determine price:
Pipeline orchestration and execution
Data flow execution and debugging
Number of Data Factory operations such as create pipelines and pipeline monitoring
Factors that affect cost
How many pipeline runs you have
How long the integration runtime server is active for executing your pipelines
The run duration of your pipelines
How many records you are reading and writing
Which type of integration runtime you use
How long you spend developing/debugging pipelines, which keeps the debug session alive
Things to Think About
Cost Rounding
For some of the costs with Data Factory, note that the duration is rounded up to the next minute. For example, a pipeline running for 1 min 10 secs would round up to 2 mins from a billing perspective.
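As a quick illustration, here is a minimal sketch of that rounding behaviour (a rough model for reasoning about it, not an official billing formula):

```python
import math

def billed_minutes(run_seconds: int) -> int:
    # Billing rounds the run duration up to the next whole minute.
    return math.ceil(run_seconds / 60)

print(billed_minutes(70))   # 1 min 10 secs -> billed as 2 minutes
print(billed_minutes(60))   # exactly 1 minute -> billed as 1 minute
```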
Monitoring Cost
You can monitor cost at various levels. First off, you can monitor overall resource cost like any Azure resource, but beyond that, within a Data Factory pipeline run you can see the calculations for the overall pipeline run cost and also the activity costs within the pipeline.
When developing a Data Factory solution it's important to review these costs to ensure that you are implementing your pipelines effectively.
Development & Debugging
When developing and debugging you are utilizing the integration runtime, and there is associated vCore usage for this. Keep an eye on it, as you may be executing more runs than normal and you might find these costs creep up on you.
How many Data Factories should you have?
Data Factory is a container for resources which deliver data integration. From the Azure billing perspective, the billing item is at the Data Factory level. By analysing the internals of pipeline runs it is possible to see more granular data at the run and activity level, but this is separate from your billing data.
There is a design question as to whether you would prefer a single Data Factory for everything or smaller, separate data factories. Separation makes charge-back easier to implement, as you have less to worry about in terms of cost splitting and allocation.
If you do separate, however, some costs may increase.
A single Data Factory would benefit from shared resources such as an integration runtime.
Separate Data Factories would allow isolated scaling and cost separation.
There is not necessarily a right or wrong answer here; it depends on a few factors from a cost perspective and also a number of factors from a non-cost perspective.
It's not just how long the pipeline or mapping data flow takes
Your pipeline may not seem to take that long, but your cost may be higher than you expect. You need to think about how many vCores or DIUs are configured for each stage and multiply that by the run duration to work out the cost.
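As a rough sketch of that calculation, the cost of a copy activity scales with the DIUs used multiplied by the run duration; the price per DIU-hour below is a placeholder, so check the pricing page for your region:

```python
def copy_activity_cost(diu: int, duration_minutes: float, price_per_diu_hour: float) -> float:
    # Cost scales with (DIUs used) x (run duration in hours) x (unit price).
    return diu * (duration_minutes / 60) * price_per_diu_hour

# The same 10 minute run is roughly 8x more expensive with 32 DIUs than with 4.
print(copy_activity_cost(diu=4, duration_minutes=10, price_per_diu_hour=0.25))
print(copy_activity_cost(diu=32, duration_minutes=10, price_per_diu_hour=0.25))
```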
Cross Region Data Transfer
If you are transferring data across regions with your pipelines then there will be associated data transfer costs; be aware of these when choosing the location for your Data Factory.
Consider Alternatives for Data Transformation
With complex data transformations it can be a good idea to consider using something like Databricks to perform the transformation; you could execute the job from Data Factory. In some cases, combining with another resource type will lower costs.
Logging Costs
The diagnostics settings for a Data Factory can log data to a Log Analytics workspace for Data Factory log events. This can be useful for monitoring and troubleshooting but it also incurs an additional cost.
Temporary Storage Costs
When performing data integration it is common to import data to temporary storage and perform operations on it in a staging area before moving the data to the destination.
Be cautious about keeping lots of copies of the data around unnecessarily.
Be aware of costs for dependent resources
It's likely your solution will use other resource types and not just Data Factory. If you're using things like storage and Databricks as part of processing the data, then your overall data integration solution cost will be a combination of these. Ensure you are looking to use all resources efficiently.
Common Optimizations
Reduce the amount of data you are processing
If you are processing a large amount of data then you will be consuming a lot of resources. If you only need a subset of the data, then the sooner you can filter out the data you don't need, the lower the resource consumption will be later in the pipeline.
A good example of this would be a pipeline querying SQL as the input source; we can handle this in two ways:
Import all of the data into the pipeline then begin filtering it in the pipeline
Add a where clause in the SQL and filter the data we export from SQL
In this case, option 2 is likely to result in the pipeline costing less, as shown in the sketch below.
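Here is a minimal sketch of the difference between the two options, written as Python dicts that mirror the shape of a copy activity source definition (the table, columns, and filter are hypothetical):

```python
# Option 1: read everything, then filter later in the pipeline
# (more data moved, more DIU time consumed).
read_everything = {
    "type": "AzureSqlSource",
}

# Option 2: push the filter down into the source query so only
# the rows you actually need leave the database.
read_filtered = {
    "type": "AzureSqlSource",
    "sqlReaderQuery": (
        "SELECT OrderId, CustomerId, OrderDate, Amount "
        "FROM dbo.Orders "
        "WHERE OrderDate >= '2024-01-01'"
    ),
}
```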
Only Process Changes if possible
In addition to filtering the data, if we are able to identify only the records which have changed rather than processing a full dataset, then we will process less data, which will result in lower costs for the pipeline.
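A common way to do this is a high-watermark pattern: store the last value processed, query only rows changed since then, and update the watermark after a successful run. A minimal sketch, where the table, column, and watermark storage are all assumptions:

```python
from datetime import datetime, timezone

def build_delta_query(last_watermark: datetime) -> str:
    # Only select rows modified since the last successful run;
    # persist the new watermark (e.g. in a control table) once the run succeeds.
    return (
        "SELECT * FROM dbo.Orders "
        f"WHERE ModifiedDate > '{last_watermark.strftime('%Y-%m-%d %H:%M:%S')}'"
    )

print(build_delta_query(datetime(2024, 1, 1, tzinfo=timezone.utc)))
```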
Use Reservations
Reservations can be used with Azure Data Factory. If you have processing which could benefit from a reservation, then make sure to evaluate whether one would reduce your costs.
SSIS Integration Runtime
If you are using the SSIS Integration runtime then there are some items below you may find useful.
Consider scaling down the VM when it is not in use
Consider turning off the VM hosting SSIS when it is not in use
Consider the Azure Hybrid Benefit SQL Server licensing options
Minimize the number of activity runs
Within the pipeline, certain activities you create will incur costs; for example, importing data from a system will incur DIU costs. The fewer of these activities there are, the lower your pipeline costs will be. Obviously there is a trade-off here: while you might reduce the number of activities, you might increase the complexity of the process. You want your pipeline to strike the right balance of activities to deliver the required functionality at an appropriate cost.
Efficient Data Formats
Certain data formats will perform better than others when processing data within Data Factory
Format | Best Use Cases | Cost Efficiency | Compression | Processing Overhead |
---|---|---|---|---|
Parquet | Analytical workloads, big data | High | Excellent | Low (columnar scanning) |
Avro | Big data, streaming, schema evolution | High | Excellent | Moderate (row-based) |
ORC | Analytics, big data | High | Excellent | Low (columnar scanning) |
Delta | Incremental loads, transactions | High | Excellent | Low for incremental loads |
JSON | API data, semi-structured | Low for large data | None | High (parsing overhead) |
CSV | Small, simple datasets | Low for large data | None | Moderate (row-based parsing) |
Control Flow and Conditional Execution
When building your pipelines, it is worth considering options where you can control the execution so that when you want to develop and test certain parts, you can execute only that part of the pipeline rather than the entire thing. If a single run is expensive, this kind of control flow can save you some money and also allows you to reprocess only specific parts of the pipeline rather than the whole thing.
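One way to achieve this is to drive optional branches from pipeline parameters so an expensive step can be skipped on a given run. Below is a minimal sketch of the idea, expressed as a Python dict that mirrors the shape of an If Condition activity; the parameter and activity names are hypothetical:

```python
# A boolean pipeline parameter decides whether the expensive transform runs at all.
pipeline_parameters = {"runTransform": {"type": "bool", "defaultValue": True}}

if_condition_activity = {
    "name": "MaybeRunTransform",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@pipeline().parameters.runTransform",
            "type": "Expression",
        },
        # Only executed when the expression evaluates to true, so a debug or
        # partial reprocessing run can skip the costly part of the pipeline.
        "ifTrueActivities": [
            {"name": "RunExpensiveDataFlow", "type": "ExecuteDataFlow"},
        ],
    },
}
```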
Appropriate Trigger Scheduling
Pipelines run from triggers, and when you schedule a trigger you indicate how often the pipeline will run.
There can be benefits to running triggers more frequently and benefits to running them less frequently.
If you run a trigger more frequently, it may be that you are processing smaller amounts of data, so each pipeline run completes quicker.
If you run the trigger less frequently then there will be fewer runs which will mean less cost.
As a default position, I would expect that the less frequently you run a trigger, the lower your costs are likely to be.
You will need to trade this off with the functional requirements and also test the assumption above.
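For example, moving a schedule trigger from every 15 minutes to once per hour cuts the number of runs by a factor of four. A minimal sketch of the relevant recurrence settings, shown as Python dicts mirroring a schedule trigger definition (the frequencies are illustrative):

```python
# Roughly 96 runs per day.
every_15_minutes = {"frequency": "Minute", "interval": 15}

# Roughly 24 runs per day: fewer runs, but each one may process more data.
every_hour = {"frequency": "Hour", "interval": 1}
```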
Turn off Debug Mode
Turning off debug mode when you are not actively using it is standard practice to reduce cost; only use it when necessary.
Limit Log Retention
If you are pushing diagnostics data to logs, then the more data you push and the longer you retain it, the more you will pay.
There is a trade-off here between cost and your operational requirements. At a minimum, in non-production environments you probably want to enable logging only when necessary rather than keeping verbose logging on by default all of the time.
Consider other options appropriately
Sometimes you may choose Data Factory to solve a problem because that is the default tool your project is using. Remember that there are often problems that can be solved by multiple tools in different ways. For some problems that you might solve with Data Factory, other viable choices may be:
Functions
Logic Apps
The decision of which to use will depend on multiple factors, of which cost will be one. Don't fall into the trap of treating every problem as a nail because you have a hammer; make sure you use the right tool for the job, which will usually give you the most cost-effective solution anyway.
How can Turbo360 help
With Turbo360 there are a few ways that the product will help you with Data Factory.
Cost Visualization
The cost visualization feature will help you see the costs associated with your data integration solution. This is not limited to Data Factory; you will also see the other resources your solution uses, such as logs, storage, databases, and virtual machines. You can create a scope for your data platform and then sub-scopes for different environments, giving you plenty of insight into where your money is going.
Cost Monitoring
You can set up monitors to track the costs of your data integration solution in various ways, and you will also get anomaly detection for unexpected cost changes.
Workload Optimization via Scheduler
The scheduler feature will allow you to create an automation to pause triggers at certain times. We have seen customers save quite a bit of money by pausing recurring triggers in non-production environments so pipelines stop running when no one is around to test.
Integration Runtime Rightsizing
If you are using a self-hosted runtime then our Virtual Machine rightsizing features will help you ensure that you are choosing the right size for the virtual machine based on its performance and utilization.
Integration Runtime Reservations
If you are using a self-hosted runtime then our Virtual Machine reservation features will help you to ensure you are using reservations effectively.