
I would go one step further and suggest that people who need structured queries use the Google BigQuery API to query their structured Wikipedia data. Granted, their public dataset is from 2010, so it is somewhat outdated, but you can write SQL against all of the Wikipedia article metadata and then use the MediaWiki API itself to grab only the article text that you're interested in.

The Wikipedia data is hosted here: https://bigquery.cloud.google.com/table/publicdata:samples.w...

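Here is a sample query, searching for articles whose titles start with "Positiv" (the trailing e* in the pattern means "zero or more e", so it matches "Positivism" as well as "Positive"):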

SELECT id,title FROM [publicdata:samples.wikipedia] WHERE (REGEXP_MATCH(title,r'^Positive*')) LIMIT 10

Query complete (2.0s elapsed, 9.13 GB processed)

  1|	464347|	Positive airway pressure	 
  2|	10008223|	Positive behavior support	 
  3|	464347|	Positive airway pressure	 
  4|	1354851|	Positivism in Poland	 
  5|	1023857|	Positive set theory	 
  6|	5154273|	Positivism dispute	 
  7|	2871407|	Positivism	 
  8|	17179765|	Positive psychological capital	 
  9|	9033239|	Positive Action Group	 
  10|	4163012|	Positive K
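
If you're wondering why "Positivism" shows up in those results, here is a small sketch of the pattern's behavior. It uses Python's re module as a stand-in for BigQuery's regex engine (my assumption is that the two agree on a pattern this simple), and the function names and sample titles are just for illustration:

```python
import re

# Same pattern as in the query. The "*" applies only to the final "e",
# so this matches "Positiv" followed by zero or more "e" characters.
LOOSE = re.compile(r'^Positive*')

# Stricter variant (my suggestion, not part of the original query):
# the title must literally begin with "Positive".
STRICT = re.compile(r'^Positive')


def matches_loose(title: str) -> bool:
    """True if the title matches the pattern used in the query."""
    return LOOSE.search(title) is not None


def matches_strict(title: str) -> bool:
    """True only if the title starts with the full word 'Positive'."""
    return STRICT.search(title) is not None


titles = ["Positive K", "Positivism", "Positivism in Poland", "Negative feedback"]
loose = [t for t in titles if matches_loose(t)]
strict = [t for t in titles if matches_strict(t)]
print(loose)
print(strict)
```

Dropping the "*" in the query should give you only the titles that really start with "Positive".
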
Here is the Python API documentation: https://developers.google.com/api-client-library/python/
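
For the second half of the workflow (taking titles from the query results and grabbing only those articles from the MediaWiki API), something like the following might work as a starting point. This is a rough sketch using only the standard library, not the Google client above. The helper names and the User-Agent string are made up, and I'm assuming the English Wikipedia endpoint:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: English Wikipedia's public MediaWiki API endpoint.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_wikitext_url(titles):
    """Build a MediaWiki API URL requesting the current wikitext of the given titles."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": "|".join(titles),  # the API takes multiple titles separated by "|"
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def fetch_wikitext(titles):
    """Fetch and decode the JSON response (needs network access)."""
    request = Request(
        build_wikitext_url(titles),
        headers={"User-Agent": "wikitext-sketch/0.1 (illustrative example)"},
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)


# Titles taken from the query results above.
url = build_wikitext_url(["Positive set theory", "Positive K"])
print(url)
```

Calling fetch_wikitext with the same list would return the parsed JSON, from which you can pull the revision content for each page.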

