Launch HN: Castled Data (YC W22) – Open-Source Reverse ETL (github.com/castledio)
87 points by aruntdharan on Jan 25, 2022 | hide | past | favorite | 47 comments
Hi HN, we're Arun, Manish, Abhilash and Franklin from Castled Data (https://castled.io). Castled is an open-source reverse ETL solution. It helps you periodically sync the data in your database/warehouse (Snowflake, BigQuery, Redshift, etc.) into sales, marketing, or support apps (Salesforce, Hubspot, Intercom, etc.), or custom software, without needing an engineering team. Here’s a demo video: https://www.loom.com/share/71bf33acbb4a41cab7c96a3460a84e5f.

On average, mid-scale organizations use around 40 SaaS apps. These are powerful in functionality, but limited by the quality of the product/customer data that is fed into them. The data synced into these tools is often incomplete, suffers from quality issues, and requires unreliable, manual imports (e.g. from CSV).

Manish and I were founding engineers at Hevodata, an ETL company, when it went from 5 customers to around 300 customers. We started seeing the trend of more and more customers wanting to move the data out of their cloud data warehouse to feed their business tools. We built a prototype to solve this for our users, but when we went deep into their use cases, we found that there were a lot of unsolved problems in this space. We also realized that activating warehouse data reliably for operational purposes was emerging as the next big trend for data-driven companies.

We did some research and came across Census/Hightouch, which were early-stage reverse ETL cloud solutions at the time. But from our previous experience working in the ETL space, we believed that any data pipeline solution needs to be open source to cover the long list of connectors that need to be built. So we set out to build our open-source reverse ETL solution.

With Castled, companies can create automated data pipelines to periodically sync the output of a warehouse transformation query or dbt models (in the works) to their sales, marketing, support and notification tools. We fetch only the incremental results by default on every pipeline run, which makes sure that rate limits and other constraints of the destination APIs are not breached. Our users can also set a time schedule to define the frequency of pipeline runs.

The technical challenges in building such a tool include: doing CDC (Change Data Capture) from data warehouses which do not provide a typical write ahead log; handling rate limits on destination APIs; handling deduplication of records on destination objects; failure handling and automatic retries. But the biggest challenge is the sheer number of destination app integrations that need to be supported—we are talking about tens of thousands of connectors.
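Of those challenges, rate-limit handling is the easiest to illustrate. A minimal sketch of the idea (a token bucket gating batched API calls — illustrative only, not Castled's actual implementation; all names are invented):

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter: refills `rate` tokens per second
    up to `capacity`, blocking callers until a token is available."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def acquire(self, n: int = 1) -> None:
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            time.sleep((n - self.tokens) / self.rate)

def sync_records(records, push, bucket, batch_size=100):
    """Push records to a destination API in batches, spending one token
    (i.e. one API call's worth of quota) per batch."""
    for i in range(0, len(records), batch_size):
        bucket.acquire()
        push(records[i:i + batch_size])
```

Destination APIs typically cap both requests per second and records per request, which is why the batch size and the bucket rate are separate knobs here.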

Our major differentiator from Census/Hightouch is that we are open source. Our users can host Castled in their own private cloud and start operationalizing their data for free. We’ve observed that initially customers are inclined towards buying a cloud solution for their data integration needs. But once they scale up, they realize that their cloud vendor is unable to cope with the increasing number of apps getting used in the organization. They soon start building in-house data pipeline solutions or look for an open-source solution to solve their problems. Being open source, we provide the flexibility for our customers to build their own connectors rather than waiting for cloud vendors to fulfill their connector requests.

Compared with open-source alternatives (e.g. Grouparoo), we have built Castled in such a way that our community can build new connectors in a few hours. One example of this is our Castled Form Language (CFL), which helps our users auto generate extremely complex forms on the UI by writing a few Java annotations on the backend. This removes the need for a UI developer to build a new connector.

Our GitHub repo is here: https://github.com/castledio/castled. Most users can spin up the application on their desktop in a few minutes. In case you want a hosted solution, we also have our cloud platform at https://castled.io. The subscription-based cloud solution provides additional security features like single sign-on, authentication, user management, notifications, alerts, etc. You can sign up and try out the product for free, no credit card required.

This is the first time we are trying to build an open-source community around a project and we're excited to hear any thoughts, insights, questions, encouragement and concerns in the comments below! We will be monitoring the thread over the course of today to answer any questions, and feel free to reach out to me by email at arun@castled.io



Looks great, congrats on launching!

I'm curious how you differentiate yourselves from Airbyte, which isn't really designed for reverse ETL but can be used for it. And do you ever see Castled supporting regular ETL?

Right now there is a lot of separation in the market between ETL and reverse ETL, but it seems like a pain to maintain separate tools when you could just do both in one.


The founding team of Castled comprises mostly founding engineers from HevoData, which is an ETL solution. Having built both ETL and reverse ETL solutions from scratch, we have realized that the architectures required to support ETL and reverse ETL need to be drastically different. Your cloud data warehouse is powerful enough to change the entire architecture of the product depending on which side of the pipeline the warehouse is on. So we believe you need different products to support ETL and reverse ETL. But agreed, the same tool can provide both products.

I will have to check how Airbyte supports both. Regarding Castled, regular ETL is on our mid-term roadmap.



How do you handle CDC from a DW like Redshift? If I have a 5-billion-row fact table with an insert or update datetime audit column (but no soft delete tracking!), how do you deliver deltas? Are you keeping out-of-band hashes of PK values or tuple values?

Do you need to know the primary key of the source table to sync?


That's a great question! We don't use updated timestamps to compute deltas, as that's unreliable and can cause data loss depending on your transaction window.

We keep snapshot tables on your data warehouse (in our own custom schema, so that you don't have to give Castled write access to any of your production schemas). The snapshot tables are then used as the baseline to compare the query results against every time the pipeline runs. Frankly, we have not really seen a use case of transferring 5 billion rows in a reverse ETL pipeline. This is mostly because our destination apps are mostly transactional systems and cannot really store that much data. For example, the Salesforce destination can store a maximum of 10GB of data. Because of this, we store the actual tuple values in the snapshot. We have easily scaled our pipelines to compute deltas from queries that return up to 100 million records. To optimize this further, we are also considering keeping hashes of the tuple values instead of the actual values.
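The hash-based variant of that snapshot diff can be sketched roughly like this (a toy in-memory version, assuming an `id` primary key; Castled's real implementation does this with snapshot tables inside the warehouse):

```python
import hashlib

def row_hash(row: dict) -> str:
    """Stable hash of a row's values -- a stand-in for storing hashes of
    tuple values in the snapshot instead of the values themselves."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(payload.encode()).hexdigest()

def compute_delta(snapshot: dict, current_rows: list):
    """Compare current query results against the previous snapshot
    (a mapping of primary key -> row hash); return the rows that are
    new or changed, plus the updated snapshot for the next run."""
    new_snapshot, delta = {}, []
    for row in current_rows:
        pk = row["id"]
        h = row_hash(row)
        new_snapshot[pk] = h
        if snapshot.get(pk) != h:
            delta.append(row)
    return delta, new_snapshot
```

Only the delta is pushed to the destination on each run, which is what keeps the sync within the destination API's rate limits.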

Yes, we need to know the primary key of your query results. This is required to handle failures and to remove the failed records from the snapshot table, so that those can be retried on the next pipeline run.


It looks like you're asking users to write connectors in Java. Have you given some thought to who your user is? I'd imagine the type of person that'd consider using this would be unlikely to prefer or even know Java.


Thanks for the tip. We believe it's still the data engineers who will have to write these reverse connectors as well. We understand that Python is probably their preferred language. We have also seen that a majority of them understand Java as well. But we are willing to support multi-language functionality in the future, if the community demands it. Do you think the majority of data engineers do not understand Java?


It's probably 10 to 1 preference of Python over Java for data engineers. At least 5 to 1.


Noted. We will definitely consider adding support for Python. Thanks for the suggestion.


You might consider checking Jython out as an option. It's JSR-223 compliant and trivial to drop into a Java app and just expose Java objects to the Python scripts to be used as is. I've had a pretty great experience with it.

The only downside is it's stuck on supporting Python2.x so you may end up wanting to properly integrate CPython eventually. Since you're targeting running Python code that doesn't exist yet and the language differences aren't huge though, I doubt most users would mind (I wouldn't). Just an idea to consider, esp for an MVP

(One /upside/ is Jython is a Python2 interpreter fully written in Java, so the concurrency and performance may be better than CPython2 with its GIL)


Just checked it out. Looks interesting. Thanks


I would disagree with this. Most Python developers I know can do Java. Not a big deal.


Since there seem to be differing opinions here, I'll just add my experience: having worked with 3 data teams, everyone knew and used Python, and no one knew or used Java.

Great product and very excited for this, wish I could invest in you and wish this had been around years ago when I was trying to convince Fivetran they should create reverse ETL functionality.


Thanks for letting us know!


"Know" vs "willing to write", esp. OSS for making someone else rich

Nowadays, probably something like python then rust/go, just for community, and especially aligned on apache arrow. OSS async python / HTTP, with arrow dataplane support (fast,typed,standardized), is part of our bar for whether we consider a data proj as a core dep nowadays. A surprising amount of ETL startups are YOLO json for the dataplane, so we've intentionally stayed away due to reliability+perf heart pains. But maybe you can fake it till you make it that way too, and then hire staff to clean it up 2 years later :)


Fantastic! Congratulations on the launch.

Is there a way to version control the sync configurations? Any thoughts on putting that in the roadmap?

I'd love to be able to put my 'Castled config' in the same repo as my dbt project, for example.


You mean the warehouse/app credentials, when you say sync configurations? If so, yes, that seems like a great idea. In fact, I think your warehouse credentials are already in your dbt repo in a specific format. Castled can directly read those credentials from there.


That is not what I meant, but also pretty interesting.

I actually mean the 'definition' of the syncs themselves.

I am picturing JSON or YAML that describes the source fields, their mapping to the destination fields, and any other metadata about the sync: frequency, number of retries, whatever else you could configure in the UI.
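Something hypothetical along these lines (every field name here is invented for illustration, not Castled's actual schema):

```yaml
# hypothetical sync definition -- illustrative field names only
sync:
  name: active_users_to_hubspot
  source:
    warehouse: snowflake
    model: dbt_marts.active_users
    primary_key: user_id
  destination:
    app: hubspot
    object: contact
  mapping:
    user_id: external_id
    email: email
    last_seen_at: last_activity_date
  schedule: "*/30 * * * *"   # every 30 minutes
  retries: 3
```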

So when I go and update my dbt model to modify one of the tables that I am syncing from, I can make the corresponding changes to my Castled settings file, and release it all as one atomic update to my data infrastructure.

It might be a small number of people who would want something like that, but it's definitely something I would have been excited about when I was running a data team.


We can definitely consider that. But I feel it's a lot of config and can be error-prone. For instance, source-destination field mapping configs might be complex and have various issues like data type mismatches, typos in field names, etc., and a user interface is better suited to guide you through the entire process.

But I see value in exporting the config to a github repo after the pipeline is created and thereafter future edits can be done via the github repo. Does that make sense?


Yes 100% -- you could also imagine just syncing from the UI to a repo, rather than trying to make the config human-editable. Toggle into a branch in the UI, make edits, and have those committed to the repo by the tool.

Looks awesome, I am rooting for you guys!


Thanks for the input!


> It might be a small number of people who would want something like that, but it's definitely something I would have been excited about when I was running a data team.

Yeah this one certainly depends on the target customer. For me, any tool that didn't have source control integration for configuration would be a non-starter. But it's quite possible that the target audience for this tool doesn't even understand the term "source control".


Congrats to the Castled team on the launch.

At Grouparoo, this is a primary use case. We have a UI that engineers use locally. This helps gets things right. It outputs a JSON configuration that is checked in. When that is deployed, it does all the syncing.


Cool product guys! One question >

"Being open source, we provide the flexibility for our customers to build their own connectors rather than waiting for cloud vendors to fulfill their connector requests."

Why does it need to be OS? Can't a product just have a devkit that enables you to build your own connectors?


Thanks for the suggestion. Yes, a devkit would work if it's just about building new connectors. But we believe the community would want control over the entire project rather than just the connectors module. We also wanted to provide a usable version of the product for free to the community, which you can self-host and maintain yourselves as well.


Unsupported @gmail.com address, please use official, when trying to register for updates. Really?


Don't worry, it's like many "cute" restrictions -- they didn't do a good job. Just capitalize your email and it'll sail right through. It's also only checked on registration, so you can continue to use the lowercased version to login.

It's without the tiniest sense of irony one will observe the "Signin with Google" button on the login form, too :-/

ed: although I may have ruined it for everyone, since my "team name" is now "GMAIL.COM" :-P

      "team": {
        "id": 29,
        "name": "GMAIL.COM",
        "tier": "Free"
      }
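For what it's worth, the bypass works because the domain check is case-sensitive. A minimal sketch of a normalized check (invented function, just to show the fix):

```python
def is_blocked_domain(email: str, blocked=("gmail.com", "yahoo.com")) -> bool:
    """Case-insensitive check of an email's domain against a blocklist.
    Comparing the raw string would let 'user@GMAIL.COM' slip through."""
    domain = email.rsplit("@", 1)[-1].strip().lower()
    return domain in blocked
```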


Thanks for testing Matthew, I saw you were trying to test the XSS attack as well :) We will consider allowing sign in with personal accounts at a later stage.


Sorry about that! But it's actually the registration for signup and not for updates. That's why we had to block personal emails.


Please consider supporting OSS destinations as well! They share the same values as you and could make for some interesting partnerships. I understand you have to have the big SaaS names, but that’s an opportunity to differentiate from your competitors !

Ps: experience is poor for https://oss-docs.castled.io on mobile, I cannot see a menu to switch pages


Thanks for the suggestion. We started out supporting popular SaaS tools as that increases the chances of people trying it out. We currently support Kafka as a destination. Also, could you give examples of some of the OSS destinations you have in mind?

Sorry about the docs. Haven't done much testing on smaller screens yet.


Congrats on your launch. Great to see more innovation in this space.

How are you thinking of monetizing?


Thanks! We also have a subscription-based hosted solution at https://castled.io


Thanks folks, much needed! Where is the list of destinations today?


Currently we have 14 destinations available: Salesforce, Hubspot, Intercom, Google Ads, Mailchimp, Google Sheets, Sendgrid, Marketo, ActiveCampaign, Kafka, Customer.io, Google Pub/Sub, Mixpanel, and REST API.


Amazing! Congratulations on the launch! Look forward to this.


I’m not sure reverse ETL is a great name. ETL is direction agnostic.


Yes, it might not be a great name. However it gives a decent idea about the product to the folks who are already using ETL/EL(T) to load data in their warehouse. Operational Analytics is another term used by the data community.


Why is the term “Reverse ETL” becoming a thing? Committing data to an OLTP system has been around since the beginning of SQL. I believe one vendor coined this term to marketing success, but this meme needs to stop. Besides, ETL has been giving way to ELT.


A year back, when we started to build Castled, this technology which syncs data from cloud warehouses to your operational tools did not have a name. The term "Reverse ETL" became popular around the beginning of 2021. We used this term since the data community now knows this technology by that name.

But my personal take is that "Reverse ETL" is still a new technology in the sense that it completes the modern data stack, which is built around cloud data warehouses.


Go Castled! Congrats on the launch!


Thanks!


Hi everyone, Manish here (one of the founders of Castled). Thanks for pouring in so many ideas and suggestions in the comments. We have a Discord community, it would be nice if you join and help us build a great open source product – https://discord.gg/ERAjcSNerD


fyi small typo in https://oss-docs.castled.io/deploying-castled/deploy-on-aws-...: "Login to you AWS web console"


Fixed :)


fyi Your purple "Deploy on AWS" link (https://docs.castled.io/deploying-castled/deploy-on-aws-ec2) at the top of your README yields a 404.


just fixed :) Thanks!



