There are a few things that surprise students when they first begin to develop natural language processing applications:
- Some things just take a while. A script that, say, preprocesses millions of sentences isn’t necessarily wrong because it takes a half hour.
- You really do have to avoid wasting memory. If you’re processing a big file line-by-line,
  - you really can’t afford to read it all in at once, and
  - you should write out data as soon as you can.
- The OS and program already know how to buffer IO; don’t fight it.
- Whereas so much software works with data in formats that are not human-readable (e.g., wire formats, binary data) or are human-hostile (XML), if you’re processing text files, you can just open the files up and read them to see if they’re roughly what you expected.
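
To make the memory advice concrete, here is a minimal Python sketch of line-by-line processing. The names `preprocess` and `stream_file` are made up for illustration, and the lowercasing step is only a stand-in for whatever preprocessing you actually need.

```python
def preprocess(line: str) -> str:
    # Hypothetical per-sentence step: lowercase and collapse runs of whitespace.
    return " ".join(line.lower().split())


def stream_file(src, dst) -> int:
    """Read src one line at a time, writing each result as soon as it is ready.

    src and dst are any text file objects. Only the current line is held in
    memory, so the input can be far larger than RAM.
    """
    count = 0
    for line in src:  # iterating a file object yields one line at a time
        dst.write(preprocess(line) + "\n")
        count += 1
    return count
```

You would typically call this with two files opened via `open(...)`, or with `sys.stdin` and `sys.stdout`. Note what the sketch avoids: it never calls `src.read()` or `src.readlines()`, which would load the whole file, and it does not flush after every line, since the file objects and the OS already buffer the writes.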