Pushing data to the Celonis Platform using the Standard Data Ingestion API
The Standard Data Ingestion API lets you push data to the Celonis Platform in real time from your existing IT systems. The API is AWS S3 compatible and event driven: each new file that reaches the API triggers a notification, and the file is then picked up automatically and processed into a data pool.
When using the Data Ingestion API, the following apply:
- File formats: Uncompressed Parquet files only. Other file formats must be converted to Parquet before they are ingested.
- Extractions: The Standard Data Ingestion API always performs delta loads. Make sure to define the primary keys to ensure proper updates and avoid duplicates.
- File uploads: Single files and multiple files simultaneously (using a first in, first out or FIFO method).
- Data types: Flat and nested.
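
Because only uncompressed Parquet is accepted, source data in other formats has to be converted first. The following is a minimal sketch of one way to do that for JSON, assuming the third-party `pyarrow` package is installed. The function names `load_records` and `write_uncompressed_parquet` are hypothetical helpers for this illustration, not part of any Celonis tooling.

```python
import json


def load_records(json_text):
    """Parse JSON holding one object or a list of objects into a list of rows."""
    data = json.loads(json_text)
    return data if isinstance(data, list) else [data]


def write_uncompressed_parquet(records, path):
    """Write the rows to a Parquet file with compression switched off."""
    # pyarrow is a third-party package (pip install pyarrow). It is imported
    # here so that load_records stays usable without it.
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Column types are inferred from the values; nested lists of objects
    # become nested Parquet columns.
    table = pa.Table.from_pylist(records)
    # pyarrow compresses with Snappy by default, so compression is disabled explicitly.
    pq.write_table(table, path, compression="NONE")
```

Calling `write_uncompressed_parquet(load_records(text), "flat_order.parquet")` on the text of a JSON file would then produce a file that meets the format requirement above. Passing `compression="NONE"` is the important part, since most Parquet writers compress by default.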
For a video overview of this method:
Hi, everyone. This is Florian. I'm a product manager at Celonis, responsible for the areas of data pipeline and data ingestion. In this session, I want to introduce you to the new Standard Data Ingestion API. We'll start with some theoretical foundations: why the API is relevant, what it can be used for, and why it's better than the existing Data Push API that some of you might be familiar with. Then we'll jump into a couple of demos where I'll show you how to leverage the API using different clients. We'll have a little Python example, and also an example where we push data via the API using the command line.
Let's start with some foundations. The new API is built on top of cloud-native object storage, which has been developed for massive scale and massive data volumes. The API is S3 compatible. That means you can use the S3 CLI, you get S3-compatible error messages, etcetera. Additionally, the API is built for continuous ingestion. I mentioned the Data Push API before, which is our existing API to push data into Celonis. There you had quite some challenges: you had to check the status of push jobs yourself, and only if the status was successful could you push the next chunk of data, and so on. The new API is way simpler. We take all of this overhead away from you: you basically just push data in via one API call, and the system handles the rest. Additionally, we now support nested data types such as nested JSON and do the unnesting on the fly. And last but not least, it's integrated in the UI as well, so you can create your own data connection and define the tables, columns, primary keys, all the things that you know from native extractors. Alright, let's jump into a demo.
The goal of the demo: I have two JSON files. One is a flat JSON. When I say flat, I mean we don't have any nested arrays, just key-value pairs. As an example, I have an order, with the usual representation of an order: an order ID, order value, order date, etcetera. I also have a nested file. It's nested because, in addition to the order, I also have the order line items, which are represented here as a nested array, because usually one order comes with multiple order lines, and those are part of the same file. In this case, I have two order lines being part of that order. I'll show you two different methods to push that data into Celonis.
First of all, let's go into Celonis. In preparation, I created a new data connection by clicking here on data connection and selecting push data in Celonis, which is a new category we've built for the new API. I configured two tables I want to push. One is flat and one is nested. And I defined the primary key. The primary key is relevant for delta loads when I do updates, say I have an existing order and I update the price. I have an example for that in a second. Then I also have to define the data structure, which is either flat or nested.
Let's first look at the flat example. I created a little Python script, which looks like this. In the first step, I import some libraries. One interesting one is boto3, which is a Python library to communicate with AWS S3, which is the foundation for the new Data Ingestion API. And I need to authenticate. For that, I need credentials: an access key and an access secret. Usually, you get them when creating the connection initially. Let's say I lost them and need to restore them. I can create new credentials, which means I have a new access key and access secret. Let's copy and paste those in here.
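
The script shown on screen is not reproduced in the transcript. A minimal sketch of what such a boto3 upload could look like is given below, under stated assumptions: `make_client` and `push_file` are hypothetical helper names, and `endpoint_url`, `bucket`, and `key` are placeholders, because the transcript does not spell out how the team URL, connection ID, and target table map onto them. Take those values from your own data connection.

```python
def make_client(endpoint_url, access_key, access_secret, region):
    """Create an S3 client that talks to the ingestion endpoint instead of AWS."""
    import boto3  # third-party package: pip install boto3

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,            # placeholder: your team's ingestion endpoint
        aws_access_key_id=access_key,         # access key from the data connection
        aws_secret_access_key=access_secret,  # access secret from the data connection
        region_name=region,
    )


def push_file(client, local_path, bucket, key):
    """Upload one Parquet file; return True on success, False on failure."""
    try:
        client.upload_file(local_path, bucket, key)
    except Exception as exc:  # optional error handling around the single upload call
        print(f"Upload failed: {exc}")
        return False
    print(f"Pushed {local_path} to {bucket}/{key}")
    return True
```

Passing the client into `push_file` keeps the upload step separate from authentication, so the same client can be reused for several files.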
Perfect. Then I have the team URL, dev.us-1, which is basically my team. I also need to specify the data connection ID. I think I've done that before. You can retrieve it from the URL. And, yeah, this one seems right. As well as the target table: in my case I'm pushing to the table called flat. And I'm pushing a file called flat_order.parquet. Although I showed you a JSON file before, we don't support JSON natively yet. This will come for sure in the second version soon. But for now, you need to convert the JSON into Parquet. There are endless libraries out there that you can leverage to do this simple conversion. And then I push this file. Cool. I see a notification saying the file has been pushed.
I want to inspect it now, so let's create a data job quickly, link it to the connection, and add a transformation that I just call t1. Then there should be a new table called flat. Awesome, perfect, it's there. It has one record, that's the JSON example I pushed in.
Now, I mentioned earlier I want to show the delta load behavior: what happens if something changes on the order, for which I have created a second JSON. Here, I have changed the order value from five hundred ninety-seven to six hundred. And the payment method is now cash and not credit card anymore. So let's also push the second file. The expectation is that the existing record gets updated with these new values, because I defined the order ID as the primary key, and if we check here, this is the same order ID. Then let's refresh the schema here and select from the table once again. I can see the payment method has been updated to cash and my order value to six hundred. Perfect. That worked pretty well. So that was the Python example.
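
The delta load behavior from this part of the demo can be illustrated with a small in-memory simulation. This is only a sketch of the semantics, not how the platform implements them: `apply_delta` is a hypothetical function, the order ID value is made up, and replacing the whole row on a key match is an assumption.

```python
def apply_delta(table, incoming, primary_key):
    """Merge incoming rows into table: matching keys are replaced, new keys are added."""
    merged = {row[primary_key]: row for row in table}
    for row in incoming:
        merged[row[primary_key]] = row  # replaces the existing row or adds a new one
    return list(merged.values())


# Hypothetical rows modeled on the two files pushed in the demo.
first_push = [{"order_id": "A-1001", "order_value": 597, "payment_method": "credit card"}]
second_push = [{"order_id": "A-1001", "order_value": 600, "payment_method": "cash"}]

table = apply_delta([], first_push, "order_id")
table = apply_delta(table, second_push, "order_id")
print(table)  # one row, with order_value 600 and payment_method "cash"
```

Without a primary key there would be nothing to match on, and the second push would end up as a second row instead of an update.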
For those of you familiar with the old Data Push API, you can see this is super simple: a single API call using a native S3 Python library. And this try/except you wouldn't even need. It's just something I added. If you leave it out, then without the configuration it's four lines of Python code. Simple as that.
Then let's make it a bit more challenging and look into the nested file. I'm opening the connection because this is an interesting scenario as well. I defined the table to be of structure nested, and I defined the schema in advance already. I simply took this example here and copy-pasted it in. You can see from left to right how each of the key-value pairs translates into a column. We have the order ID, and I select the order ID as the primary key again. We also automatically derive the nested table, which is the line items. It's called nested_line_items because I called the table nested. And here I define the order line ID to be the primary key.
Now, let's not use Python to push the file, but the command line. Two steps are required. First I need to define an AWS profile, and then I call the AWS S3 CLI. The profile is required to authenticate. So what I did in the script with the access key and access secret, I'll now do as part of an AWS profile. Let's do that. I call it AWS profile demo four, then I enter the access key, because it asks me for that, and the access secret. The default region depends a little bit on where the team is based. In my case that's us-1, so the default region is us-east-1. And the file type is again Parquet. Awesome. That's the profile. As a second step, I can push the file via the CLI command.
Let me briefly explain the parts of this one. The first part is aws s3 cp, which is the S3 CLI command for copying data into an S3 bucket. Then comes the file I want to push. It's called nested_order.parquet, and in my case it's sitting on my local machine. Then the endpoint URL, again team dot cluster and then the Celonis-specific parts. And here I have the connection ID again and the target table I want to push into, which I defined as nested. Let's take this command and throw it into the command line. Again, I see a notification saying that the upload has been successful.
What happens in the background now is that the system takes the file and automatically unnests it. I defined this configuration here for the parent table, which is nested, and then the child table, or nested table, which is nested_line_items. Maybe I should have chosen a different table name here, but I think you got the point. Instead of having one target table, we should now have two target tables that Celonis creates for you via the API.
Let's double check whether this is right. I create a second transformation and call it t2 to separate things. I run select star from nested, and this looks good. This is my nested table. And the second one was called, there we go, nested_line_items. You can also see the different column types: here we have varchar, float, and date, and an integer here. I can also query this one and say select star from nested_line_items. One interesting fact: I mentioned earlier I have two order lines, and that's what I see here as well.
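
The two command-line steps can also be scripted. The sketch below builds the AWS CLI calls as argument lists in Python and runs them with `subprocess`. It swaps the interactive profile prompt used in the demo for the non-interactive `aws configure set`. The helper names are hypothetical, and `destination` and `endpoint_url` are placeholders, since the exact S3 URI and endpoint for a connection and table are not spelled out in the transcript.

```python
import subprocess


def build_profile_commands(profile, access_key, access_secret, region):
    """Return the `aws configure set` calls that store credentials in a named profile."""
    settings = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": access_secret,
        "region": region,
    }
    return [
        ["aws", "configure", "set", name, value, "--profile", profile]
        for name, value in settings.items()
    ]


def build_copy_command(local_file, destination, endpoint_url, profile):
    """Return the `aws s3 cp` call that uploads one local file."""
    return [
        "aws", "s3", "cp", local_file, destination,
        "--endpoint-url", endpoint_url,  # placeholder: your team's ingestion endpoint
        "--profile", profile,
    ]


def run_all(commands):
    """Run each command in order; raises CalledProcessError at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Building the commands as lists rather than one shell string avoids quoting problems with secrets that contain special characters.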
One important thing that allows you to link those tables back together, to join them in transformations and link them in a data model, is one column that we create automatically for you, which is the foreign key. We said the order ID is the primary key. That means the system automatically takes this primary key and sets it as a foreign key on the nested table. You can also see that in the configuration. Let me quickly jump back here. You see the order ID is set as nested order ID, as a separate column on the nested table.
So, that's basically it. Just to recap what we've done: we went into some of the theoretical foundations of the new Data Ingestion API, and we looked into demos. We pushed a flat Parquet file via Python using this script here. Then we also used the native AWS CLI, for which you can find the documentation on the AWS website, to push a nested file, going a little bit into the details of how the unnesting and the creation of primary key and foreign key look in this case. Thanks for your attention.
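
The unnesting and foreign key behavior described in the video can be pictured with a short simulation. This is an illustrative sketch only: `unnest` is a hypothetical function, the sample values are made up, and the foreign key column name (parent table name plus primary key, giving `nested_order_id`) is an assumption based on how the column is named in the video.

```python
def unnest(record, parent_table, parent_key, array_field):
    """Split one nested record into a parent row and its child rows."""
    # The parent row keeps every field except the nested array.
    parent_row = {k: v for k, v in record.items() if k != array_field}
    # Assumed naming for the generated foreign key column, e.g. nested_order_id.
    foreign_key = f"{parent_table}_{parent_key}"
    # Each array element becomes a child row that carries the parent's key.
    child_rows = [
        {**item, foreign_key: record[parent_key]}
        for item in record.get(array_field, [])
    ]
    return parent_row, child_rows


# Hypothetical order with two line items, as in the demo.
order = {
    "order_id": "A-1001",
    "order_value": 600,
    "line_items": [
        {"order_line_id": 1, "quantity": 2},
        {"order_line_id": 2, "quantity": 1},
    ],
}
parent, children = unnest(order, "nested", "order_id", "line_items")
print(parent)    # the row for the parent table
print(children)  # two rows for the child table, each with nested_order_id
```

The generated column is what makes it possible to join the child table back to its parent in a transformation or link the two in a data model.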