David R. MacIver's Blog: I want ONE MEELYUN sentences

6 December 2009

I’ve been planning to do some work on my term extractor to make it a bit smarter. It’s currently a rule-based system on top of various machine learning tools. This is perfectly legitimate, but I’m starting to hit the limitations of that approach. I’d like to experiment with a more intelligent approach using machine learning more directly.

To do this, though, I need a training set. My plan is to build a first pass by running the existing version over some sentence corpus and then editing the output to taste.

Of course, to do this I need a decent sentence corpus. So today I set out to generate one. It was a lot fiddlier than it should have been, but I think in the end I’ve got a decent one.

I’m presumably not the only person to need something like this, so I’m making a largish sample of it available. It’s not hard to generate yourself but it’s something of a pain, so maybe I can save you some effort.

So, here you go. A bzipped list of one million random sentences from Wikipedia.

The format is obvious: plain text, one sentence per line.
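If you want to read the file without decompressing it first, something like the following should do. This is only a minimal sketch: the file name in the usage note is made up, UTF-8 is my assumption about the encoding, and `read_sentences` is just an illustrative helper name.

```python
import bz2
from typing import Iterator


def read_sentences(path: str) -> Iterator[str]:
    """Yield one sentence per line from a bzip2-compressed text file.

    Assumes UTF-8 text (an assumption, not something the file declares).
    Blank lines are skipped.
    """
    # "rt" opens the compressed file in text mode, so lines come back as str.
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            sentence = line.rstrip("\r\n")
            if sentence:
                yield sentence
```

With a hypothetical file called `sentences.txt.bz2`, `for s in read_sentences("sentences.txt.bz2"): ...` streams the sentences one at a time, so you never need to hold the whole million in memory.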

I make no guarantees about the quality of the data (there’s definitely some noise), and I definitely don’t claim this to be a statistically fair sample of Wikipedia. But initial impressions are that it’s a reasonably good list. Certainly it should be good enough for my purposes.

Comments

mitcho on 2010-07-23 18:39:55:

How hard would it be to rerun your script to preserve inline links? That’s something I’d be interested in for my own research, and haven’t gotten around to writing a script yet.

Also, when you say a “random sample”, how did you choose sentences randomly?

Rob on 2011-12-13 20:09:46:

Thank you so much for this! I’m doing some personal NLP studies and this is by far the best free corpus I’ve seen.