Launch HN: Castled Data (YC W22) – Open-Source Reverse ETL (github.com/castledio)
87 points by aruntdharan on Jan 25, 2022 | hide | past | favorite | 47 comments
Hi HN, we're Arun, Manish, Abhilash and Franklin from Castled Data (https://castled.io). Castled is an open-source reverse ETL solution. It helps you periodically sync the data in your database/warehouse (Snowflake, BigQuery, Redshift, etc.) into sales, marketing, or support apps (Salesforce, Hubspot, Intercom, etc.), or custom software, without needing an engineering team. Here’s a demo video: https://www.loom.com/share/71bf33acbb4a41cab7c96a3460a84e5f.

On average, mid-scale organizations use around 40 SaaS apps. These are powerful in functionality, but limited by the quality of the product/customer data that is fed into them. The data synced into these tools is often incomplete, suffers from quality issues, and requires unreliable, manual imports (e.g. from CSV).

Manish and I were founding engineers at Hevodata, an ETL company, when it went from 5 customers to around 300 customers. We started seeing the trend of more and more customers wanting to move the data out of their cloud data warehouse to feed their business tools. We built a prototype to solve this for our users, but when we went deep into their use cases, we found that there were a lot of unsolved problems in this space. We also realized that activating warehouse data reliably for operational purposes was emerging as the next big trend for data-driven companies.

We did some research and came across Census/Hightouch, which were early-stage reverse ETL cloud solutions at the time. But from our previous experience working in the ETL space, we believed that any data pipeline solution needs to be open source to cover the long list of connectors that need to be built. So we set out to build our open-source reverse ETL solution.

With Castled, companies can create automated data pipelines to periodically sync the output of a warehouse transformation query or dbt models (in the works) to their sales, marketing, support and notification tools. We fetch only the incremental results by default on every pipeline run, which makes sure that rate limits and other constraints of the destination APIs are not breached. Our users can also set a time schedule to define the frequency of pipeline runs.

The technical challenges in building such a tool include: doing CDC (Change Data Capture) from data warehouses which do not provide a typical write ahead log; handling rate limits on destination APIs; handling deduplication of records on destination objects; failure handling and automatic retries. But the biggest challenge is the sheer number of destination app integrations that need to be supported—we are talking about tens of thousands of connectors.
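Of those challenges, rate-limit handling is the easiest to illustrate. A minimal sketch of the idea (a token bucket gating batched API calls — illustrative only, not Castled's actual implementation; all names are invented):

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter: refills `rate` tokens per second
    up to `capacity`, blocking callers until a token is available."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def acquire(self, n: int = 1) -> None:
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            time.sleep((n - self.tokens) / self.rate)

def sync_records(records, push, bucket, batch_size=100):
    """Push records to a destination API in batches, spending one token
    (i.e. one API call's worth of quota) per batch."""
    for i in range(0, len(records), batch_size):
        bucket.acquire()
        push(records[i:i + batch_size])
```

Destination APIs typically cap both requests per second and records per request, which is why the batch size and the bucket rate are separate knobs here.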

Our major differentiator from Census/Hightouch is that we are open source. Our users can host Castled in their own private cloud and start operationalizing their data for free. We’ve observed that initially customers are inclined towards buying a cloud solution for their data integration needs. But once they scale up, they realize that their cloud vendor is unable to cope with the increasing number of apps getting used in the organization. They soon start building in-house data pipeline solutions or look for an open-source solution to solve their problems. Being open source, we provide the flexibility for our customers to build their own connectors rather than waiting for cloud vendors to fulfill their connector requests.

Compared with open-source alternatives (e.g. Grouparoo), we have built Castled in such a way that our community can build new connectors in a few hours. One example of this is our Castled Form Language (CFL), which helps our users auto generate extremely complex forms on the UI by writing a few Java annotations on the backend. This removes the need for a UI developer to build a new connector.

Our GitHub repo is here: https://github.com/castledio/castled. Most users can spin up the application on their desktop in a few minutes. In case you want a hosted solution, we also have our cloud platform at https://castled.io. The subscription-based cloud solution provides additional security features like single sign-on, authentication, user management, notifications, alerts, etc. You can sign up and try out the product for free, no credit card required.

This is the first time we are trying to build an open-source community around a project and we're excited to hear any thoughts, insights, questions, encouragement and concerns in the comments below! We will be monitoring the thread over the course of today to answer any questions, and feel free to reach out to me by email at arun@castled.io



Looks great, congrats on launching!

I'm curious how you differentiate yourselves from Airbyte, which isn't really designed for reverse ETL but can be used for it. And do you ever see Castled supporting regular ETL?

Right now there is a lot of separation in the market between ETL and reverse ETL, but it seems like a pain to maintain separate tools when you could just do both in one.


The founding team of Castled comprises mostly founding engineers from HevoData, which is an ETL solution. Having built both ETL and reverse ETL solutions from scratch, we have realized that the architectures required to support ETL and reverse ETL need to be drastically different. Your cloud data warehouse is powerful enough to change the entire architecture of the product depending on which side of the pipeline the warehouse is on. So we believe you need different products to support ETL and reverse ETL. But agreed, the same tool can provide both products.

I will have to check how Airbyte supports both. Regarding Castled, regular ETL is on our mid-term roadmap.



How do you handle CDC from a DW like Redshift? If I have a 5-billion-row fact table with an insert or update datetime audit column (but no soft delete tracking!), how do you deliver deltas? Are you keeping out-of-band hashes of PK values or tuple values?

Do you need to know the primary key of the source table to sync?


That's a great question! We don't use updated timestamps to compute deltas, as that's unreliable and can cause data loss depending on your transaction window.

We keep snapshot tables on your data warehouse (in our own custom schema, so that you don't have to give Castled write access to any of your production schemas). The snapshot tables are then used as the baseline to compare the query results against every time the pipeline runs. Frankly, we have not really seen a use case of transferring 5 billion rows in a reverse ETL pipeline. This is mostly because our destination apps are mostly transactional systems and cannot really store that much data. For example, the Salesforce destination can store a maximum of 10GB of data. Because of this, we store the actual tuple values in the snapshot. We have easily scaled our pipelines to compute deltas from queries that return up to 100 million records. To optimize this further, we are also considering keeping hashes of the tuple values instead of the actual values.
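The hash-based variant of that snapshot diff can be sketched roughly like this (a toy in-memory version, assuming an `id` primary key; Castled's real implementation does this with snapshot tables inside the warehouse):

```python
import hashlib

def row_hash(row: dict) -> str:
    """Stable hash of a row's values -- a stand-in for storing hashes of
    tuple values in the snapshot instead of the values themselves."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(payload.encode()).hexdigest()

def compute_delta(snapshot: dict, current_rows: list):
    """Compare current query results against the previous snapshot
    (a mapping of primary key -> row hash); return the rows that are
    new or changed, plus the updated snapshot for the next run."""
    new_snapshot, delta = {}, []
    for row in current_rows:
        pk = row["id"]
        h = row_hash(row)
        new_snapshot[pk] = h
        if snapshot.get(pk) != h:
            delta.append(row)
    return delta, new_snapshot
```

Only the delta is pushed to the destination on each run, which is what keeps the sync within the destination API's rate limits.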

Yes, we need to know the primary key of your query results. This is required to handle failures and to remove the failed records from the snapshot table, so that those can be retried on the next pipeline run.


It looks like you're asking users to write connectors in Java. Have you given some thought to who your user is? I'd imagine the type of person that'd consider using this would be unlikely to prefer or even know Java.


Thanks for the tip. We believe it's still the data engineers who will have to write these reverse connectors as well. We understand that Python is probably their preferred language. We have also seen that a majority of them understand Java as well. But we are willing to support multi-language functionality in the future, if the community demands it. Do you think the majority of data engineers do not understand Java?


It's probably 10 to 1 preference of Python over Java for data engineers. At least 5 to 1.


Noted. We will definitely consider adding support for Python. Thanks for the suggestion.


You might consider checking Jython out as an option. It's JSR-223 compliant and trivial to drop into a Java app and just expose Java objects to the Python scripts to be used as is. I've had a pretty great experience with it.

The only downside is it's stuck on supporting Python2.x so you may end up wanting to properly integrate CPython eventually. Since you're targeting running Python code that doesn't exist yet and the language differences aren't huge though, I doubt most users would mind (I wouldn't). Just an idea to consider, esp for an MVP

(One /upside/ is Jython is a Python2 interpreter fully written in Java, so the concurrency and performance may be better than CPython2 with its GIL)


Just checked it out. Looks interesting. Thanks


I would disagree with this. Most Python developers I know can do Java. Not a big deal.


Since there seem to be differing opinions here, I'll just add my experience: having worked with 3 data teams, everyone knew and used Python, and no one knew or used Java.

Great product and very excited for this, wish I could invest in you and wish this had been around years ago when I was trying to convince Fivetran they should create reverse ETL functionality.


Thanks for letting us know!


"Know" vs "willing to write", esp. OSS for making someone else rich

Nowadays, probably something like python then rust/go, just for community, and especially aligned on apache arrow. OSS async python / HTTP, with arrow dataplane support (fast,typed,standardized), is part of our bar for whether we consider a data proj as a core dep nowadays. A surprising amount of ETL startups are YOLO json for the dataplane, so we've intentionally stayed away due to reliability+perf heart pains. But maybe you can fake it till you make it that way too, and then hire staff to clean it up 2 years later :)


Fantastic! Congratulations on the launch.

Is there a way to version control the sync configurations? Any thoughts on putting that in the roadmap?

I'd love to be able to put my 'Castled config' in the same repo as my dbt project, for example.


You mean the warehouse/app credentials, when you say sync configurations? If so, yes, that seems like a great idea. In fact, I think your warehouse credentials are already in your dbt repo in a specific format. Castled can directly read those credentials from there.


That is not what I meant, but also pretty interesting.

I actually mean the 'definition' of the syncs themselves.

I am picturing JSON or YAML that describes the source fields, their mapping to the destination fields, and any other metadata about the sync: frequency, number of retries, whatever else you could configure in the UI.
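Something hypothetical along these lines (every field name here is invented for illustration, not Castled's actual schema):

```yaml
# hypothetical sync definition -- illustrative field names only
sync:
  name: active_users_to_hubspot
  source:
    warehouse: snowflake
    model: dbt_marts.active_users
    primary_key: user_id
  destination:
    app: hubspot
    object: contact
  mapping:
    user_id: external_id
    email: email
    last_seen_at: last_activity_date
  schedule: "*/30 * * * *"   # every 30 minutes
  retries: 3
```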

So when I go and update my dbt model to modify one of the tables that I am syncing from, I can make the corresponding changes to my Castled settings file, and release it all as one atomic update to my data infrastructure.

It might be a small number of people who would want something like that, but it's definitely something I would have been excited about when I was running a data team.


We can definitely consider that. But I feel it's a lot of config and can be error-prone. For instance, source-destination field mapping configs might be complex and have various issues like data type mismatches, typos in field names, etc., and a user interface is better suited to guide you through the entire process.

But I see value in exporting the config to a github repo after the pipeline is created and thereafter future edits can be done via the github repo. Does that make sense?


Yes 100% -- you could also imagine just syncing from the UI to a repo, rather than trying to make the config human-editable. Toggle into a branch in the UI, make edits, and have those committed to the repo by the tool.

Looks awesome, I am rooting for you guys!


Thanks for the input!


> It might be a small number of people who would want something like that, but it's definitely something I would have been excited about when I was running a data team.

Yeah this one certainly depends on the target customer. For me, any tool that didn't have source control integration for configuration would be a non-starter. But it's quite possible that the target audience for this tool doesn't even understand the term "source control".


Congrats to the Castled team on the launch.

At Grouparoo, this is a primary use case. We have a UI that engineers use locally. This helps gets things right. It outputs a JSON configuration that is checked in. When that is deployed, it does all the syncing.


Cool product guys! One question >

"Being open source, we provide the flexibility for our customers to build their own connectors rather than waiting for cloud vendors to fulfill their connector requests."

Why does it need to be OS? Can't a product just have a devkit that enables you to build your own connectors?


Thanks for the suggestion. Yes, a devkit would work if it's just about building new connectors. But we believe the community would want control over the entire project rather than just the connectors module. We also wanted to provide a usable version of the product for free to the community, which you can self-host and maintain yourselves as well.


Unsupported @gmail.com address, please use official, when trying to register for updates. Really?


Don't worry, it's like many "cute" restrictions -- they didn't do a good job. Just capitalize your email and it'll sail right through. It's also only checked on registration, so you can continue to use the lowercased version to login.

It's without the tiniest sense of irony one will observe the "Signin with Google" button on the login form, too :-/

ed: although I may have ruined it for everyone, since my "team name" is now "GMAIL.COM" :-P

      "team": {
        "id": 29,
        "name": "GMAIL.COM",
        "tier": "Free"
      }
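For what it's worth, the bypass works because the domain check is case-sensitive. A minimal sketch of a normalized check (invented function, just to show the fix):

```python
def is_blocked_domain(email: str, blocked=("gmail.com", "yahoo.com")) -> bool:
    """Case-insensitive check of an email's domain against a blocklist.
    Comparing the raw string would let 'user@GMAIL.COM' slip through."""
    domain = email.rsplit("@", 1)[-1].strip().lower()
    return domain in blocked
```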


Thanks for testing Matthew, I saw you were trying to test the XSS attack as well :) We will consider allowing sign in with personal accounts at a later stage.


Sorry about that! But it's actually the registration for signup and not for updates. That's why we had to block personal emails.


Please consider supporting OSS destinations as well! They share the same values as you and could make for some interesting partnerships. I understand you have to have the big SaaS names, but that’s an opportunity to differentiate from your competitors !

Ps: experience is poor for https://oss-docs.castled.io on mobile, I cannot see a menu to switch pages


Thanks for the suggestion. We started out supporting popular SaaS tools as that increases the chances of people trying it out. We currently support Kafka as a destination. Also, could you give examples of some of the OSS destinations you have in mind?

Sorry about the docs. Haven't done much testing on smaller screens yet.


Congrats on your launch. Great to see more innovation in this space.

How are you thinking of monetizing?


Thanks! We also have a subscription-based hosted solution at https://castled.io


Thanks folks, much needed! Where is the list of destinations today?


Currently we have 14 destinations available: Salesforce, Hubspot, Intercom, Google Ads, Mailchimp, Google Sheets, Sendgrid, Marketo, ActiveCampaign, Kafka, Customer.io, Google Pub/Sub, Mixpanel, and REST API.


Amazing! Congratulations on the launch! Look forward to this.


I’m not sure reverse ETL is a great name. ETL is direction agnostic.


Yes, it might not be a great name. However it gives a decent idea about the product to the folks who are already using ETL/EL(T) to load data in their warehouse. Operational Analytics is another term used by the data community.


Why is the term “Reverse ETL” becoming a thing? Committing data to an OLTP system has been around since the beginning of SQL. I believe one vendor coined this term to marketing success, but this meme needs to stop. Besides, ETL has been giving way to ELT.


A year back, when we started to build Castled, this technology which syncs data from cloud warehouses to your operational tools did not have a name. The term "Reverse ETL" became popular around the beginning of 2021. We used this term since the data community now knows this technology by that name.

But my personal take is that "Reverse ETL" is still a new technology in the sense that it completes the modern data stack, which is built around cloud data warehouses.


Go Castled! Congrats on the launch!


Thanks!


Hi everyone, Manish here (one of the founders of Castled). Thanks for pouring in so many ideas and suggestions in the comments. We have a Discord community, it would be nice if you join and help us build a great open source product – https://discord.gg/ERAjcSNerD


fyi small typo in https://oss-docs.castled.io/deploying-castled/deploy-on-aws-...: "Login to you AWS web console"


Fixed :)


fyi Your purple "Deploy on AWS" link (https://docs.castled.io/deploying-castled/deploy-on-aws-ec2) at the top of your README yields a 404.


just fixed :) Thanks!



