
I wrote a prototype of a browser extension that scraped your bookmarks + 1 degree, and indexed everything into an in-memory search index (which gets persisted in localStorage). I took over the new tab page with a simple search UI, with instant type-ahead search.
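The core of that approach could be sketched roughly like this: an inverted index with prefix matching for the type-ahead, serialized to a localStorage-like store. All names here are illustrative, not from the actual extension; `storage` is any string key-value store (localStorage in the browser, a plain object in tests).

```javascript
// Sketch of an in-memory prefix index persisted to a localStorage-like store.
// Illustrative only; the real extension's data model is not known.
class PrefixIndex {
  constructor() {
    this.postings = new Map(); // term -> Set of page URLs
  }

  addPage(url, text) {
    for (const term of text.toLowerCase().split(/\W+/).filter(Boolean)) {
      if (!this.postings.has(term)) this.postings.set(term, new Set());
      this.postings.get(term).add(url);
    }
  }

  // Type-ahead: every indexed term starting with the typed prefix matches.
  search(prefix) {
    const hits = new Set();
    const p = prefix.toLowerCase();
    for (const [term, urls] of this.postings) {
      if (term.startsWith(p)) for (const u of urls) hits.add(u);
    }
    return [...hits];
  }

  save(storage) {
    const obj = {};
    for (const [term, urls] of this.postings) obj[term] = [...urls];
    storage.setItem("searchIndex", JSON.stringify(obj));
  }

  static load(storage) {
    const idx = new PrefixIndex();
    const raw = storage.getItem("searchIndex");
    if (raw) {
      for (const [term, urls] of Object.entries(JSON.parse(raw))) {
        idx.postings.set(term, new Set(urls));
      }
    }
    return idx;
  }
}
```

The single big `JSON.parse` in `load` is exactly what makes the cold start in (b) below slow; sharding the index or moving to IndexedDB avoids parsing everything up front.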

Rough aspects:

a) It requires a _lot_ of browser permissions to install the extension, and I figured the audience who might be interested in their own search index would likely be put off by intrusive perms.

b) Loading the search index from localStorage on browser startup took 10-15s with a moderate number of sites; not great. Maybe it would be a fit for PouchDB or something else that makes IndexedDB tolerable. (Or WASM SQLite, if it's mature enough.)

c) A lot of sites didn't like being scraped (even with rate limiting and back-off), and I ended up being served an annoying number of captchas in my regular everyday browsing.

d) Some walled-garden sites seem completely unscrapable (even in the browser) - e.g. LinkedIn.
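For point (c), the rate limiting + back-off could look something like a per-host throttle with exponential back-off on errors. The exact policy my prototype used is lost to time; the numbers below are made up for illustration.

```javascript
// Illustrative per-host throttle: each host starts at a base delay; an error
// (429, captcha page, etc.) doubles the delay up to a cap, a success resets it.
class HostThrottle {
  constructor(baseMs = 5000, maxMs = 10 * 60 * 1000) {
    this.baseMs = baseMs;
    this.maxMs = maxMs;
    this.state = new Map(); // host -> { delayMs, nextAllowedAt }
  }

  canFetch(host, now) {
    const s = this.state.get(host);
    return !s || now >= s.nextAllowedAt;
  }

  recordSuccess(host, now) {
    this.state.set(host, { delayMs: this.baseMs, nextAllowedAt: now + this.baseMs });
  }

  recordError(host, now) {
    const prev = this.state.get(host)?.delayMs ?? this.baseMs;
    const delayMs = Math.min(prev * 2, this.maxMs); // exponential back-off, capped
    this.state.set(host, { delayMs, nextAllowedAt: now + delayMs });
  }
}
```

Even with something like this, many sites fingerprint the requests themselves rather than the rate, which is why the captchas showed up anyway.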



In my experience building a browser-based scraper, I preferred scraping pages via a direct in-browser visit rather than a fetch request. A direct visit from a real browser is basically undetectable by anti-bot software (unless you try to do something funny like automated deep crawling and scraping). Applied to your use case, it would have to go through every bookmark + 1 degree to index it. Maybe even in an offscreen canvas (I haven't tried that though; it could be detectable).
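The "every bookmark + 1 degree" traversal is just a depth-limited crawl frontier. A rough sketch, with the link extraction pluggable (in the extension it would come from the page actually rendered in a tab or iframe):

```javascript
// Depth-limited crawl plan: visit each bookmark, then every page it links to,
// and stop there (maxDepth = 1). `getLinks` is a placeholder for however the
// extension extracts outlinks from a rendered page.
function crawlPlan(bookmarks, getLinks, maxDepth = 1) {
  const seen = new Set();
  const order = [];
  let frontier = bookmarks.map((url) => ({ url, depth: 0 }));
  while (frontier.length > 0) {
    const next = [];
    for (const { url, depth } of frontier) {
      if (seen.has(url)) continue;
      seen.add(url);
      order.push(url);
      if (depth < maxDepth) {
        for (const link of getLinks(url)) next.push({ url: link, depth: depth + 1 });
      }
    }
    frontier = next;
  }
  return order;
}
```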


>Some walled garden sites seem completely unscrapable

Any examples besides LinkedIn? Tell me which sites you're trying to target and I'll have a look to see what can be done with them. It takes some pretty evil JavaScript obfuscation to block me, and only one site has been able to do that. I doubt the sites you're hitting are anywhere near that evil, lol. I'd appreciate it if you have a good example that I could use in a future article.


It's been ~18 months so I'm fuzzy on the details. I remember Gmail being tricky too.

IIRC I ended up building an iframe-based scraper for sites that didn't yield any content with just a fetch - I think I built a fallback mechanism so that if fetch didn't work, I'd queue the page up in the iframe scraper. The problem with that is that various heavily used security headers prohibit loading a site in an iframe. (The reason for an iframe vs. just loading in a tab and injecting my extension's script is that I wanted it to run "in the background" without being super distracting for the user - a tab changing its favicon every second or two was pretty annoying.)
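The fallback logic was along these lines (reconstructed from memory, names illustrative): try the cheap fetch-based scrape first, and only queue the URL for the slower iframe scraper when fetch yields nothing. In the real extension the iframe path can still be blocked by `X-Frame-Options` or CSP `frame-ancestors`.

```javascript
// Fetch-first scrape with an iframe fallback queue. `fetchScrape` is an async
// function returning page text (or throwing); `iframeQueue` collects URLs for
// the slower in-iframe scraper to process later.
async function scrapeWithFallback(url, fetchScrape, iframeQueue) {
  try {
    const text = await fetchScrape(url);
    if (text && text.trim().length > 0) return { url, text, via: "fetch" };
  } catch (_) {
    // network error or blocked response: fall through to the iframe path
  }
  iframeQueue.push(url);
  return { url, text: null, via: "queued-for-iframe" };
}
```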


How often did it crawl? Once per day shouldn't trigger any blockers.
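A once-per-day policy could be as simple as checking last-crawl timestamps before each pass (sketch only; the 24h default and names are just for illustration):

```javascript
// Given a map of url -> last-crawl timestamp (ms), return the URLs whose last
// crawl is at least `intervalMs` old and are therefore due for a recrawl.
function dueForRecrawl(lastCrawled, now, intervalMs = 24 * 60 * 60 * 1000) {
  const due = [];
  for (const [url, ts] of Object.entries(lastCrawled)) {
    if (now - ts >= intervalMs) due.push(url);
  }
  return due;
}
```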



