Creating a simple TCP server in Java and using it from Python – to use the Stanford POS tagger
POS tagging
For one of my projects, I needed to use the Stanford POS tagger to parse a large text corpus. Even though there are Python POS taggers, my favourite one is by far the Java-based Stanford implementation. Usually I use it directly from Java, but in this case the input file was a bit tricky to parse and Python handled it very well, so I wanted to do only the POS tagging in Java and everything else in Python.
My first thought was to create a file-based interaction between them, but that wasn't as responsive as I wanted it to be; this type of batch processing wasn't too appealing. Then for a moment I considered more advanced techniques like Apache Thrift or Google Protobuf, but why would I need them if I just need to send a sentence over the wire and receive the POS-tagged version of it?
Creating the Java server
The server part seemed to be the trickiest one, as once in a while I received exceptions I hadn't foreseen. I didn't really care about those, so putting the server routine in a simple try-catch solved all the issues.
This is the simplified version of the server code:
    // Only listen on localhost, no remote connection allowed.
    ServerSocket serverSocket = new ServerSocket(10007, 0, InetAddress.getByName("localhost"));

    while (true) // We never really terminate it. CTRL+C is enough.
    {
        System.out.println("Waiting for connection.....");
        Socket clientSocket = serverSocket.accept();
        System.out.println("Waiting for input.....");

        // Create IO streams.
        PrintWriter out = new PrintWriter(clientSocket.getOutputStream(), true);
        BufferedReader in = new BufferedReader(new InputStreamReader(clientSocket.getInputStream()));

        String inputLine;
        try
        {
            while ((inputLine = in.readLine()) != null)
            {
                String tagged = tagger.tagString(inputLine); // POS tag the input.
                out.println(tagged); // Send back the tagged output.
            }

            out.close(); // The client decided to close the connection, cleanup.
            in.close();
            clientSocket.close();
        }
        catch (Exception ex)
        {
        }
        finally
        {
            System.out.println("Client has disconnected.");
        }
    }

    // serverSocket.close(); // We never reach this.
There are a couple of interesting things I found. One is that the Java server gets upset and throws an exception if the client disappears instead of cleanly closing the socket, so a try-catch-finally was required. The other is that both reading and writing have to be buffered, otherwise the performance would have been really poor. The PrintWriter is buffered and created with auto-flush enabled (the second constructor argument), so each println writes the whole line to the wire at once instead of byte by byte.
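For trying out a client without loading the tagger, the same line-in, line-out protocol can be sketched in Python. This is only a stand-in under stated assumptions, not the Java server: fake_tag, LineTagHandler and make_server are hypothetical names, the "tagging" just appends _X to every token, and UTF-8 is assumed on the wire.

```python
import socketserver


def fake_tag(sentence):
    # Hypothetical placeholder, NOT the Stanford tagger: marks every token with "_X".
    return " ".join(token + "_X" for token in sentence.split())


class LineTagHandler(socketserver.StreamRequestHandler):
    """Handles one client: reads a line, answers with one line, until EOF."""

    def handle(self):
        try:
            # rfile is a buffered reader over the socket; iteration ends
            # when the client closes its side of the connection.
            for raw in self.rfile:
                sentence = raw.decode("utf-8").rstrip("\r\n")
                self.wfile.write((fake_tag(sentence) + "\n").encode("utf-8"))
                self.wfile.flush()
        except ConnectionError:
            # The client vanished without closing cleanly; drop this connection.
            pass
        finally:
            print("Client has disconnected.")


def make_server(host="localhost", port=10007):
    # Binding to localhost only, like the Java version.
    return socketserver.TCPServer((host, port), LineTagHandler)
```

Calling make_server().serve_forever() would then serve one client at a time until interrupted with CTRL+C.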
In my case the input and output were plain English text, but for Unicode text the character encoding would probably have to be set explicitly on both ends.
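Why the encoding matters can be shown without any networking at all. The snippet below is a minimal sketch with a made-up example sentence: it encodes a string the way a client might, then decodes the bytes once with the same charset and once with a different one. UTF-8 and Latin-1 are picked purely for illustration. On the Java side, InputStreamReader accepts a charset as a second constructor argument, which is one place where the server's decoding could be pinned down.

```python
sentence = "A naïve café owner"  # made-up example with non-ASCII characters

# What the client would put on the wire if it encodes as UTF-8.
wire_bytes = sentence.encode("utf-8")

# A receiver that decodes with the same charset gets the sentence back.
same_charset = wire_bytes.decode("utf-8")

# A receiver that assumes a single-byte charset such as Latin-1 gets garbage
# for every non-ASCII character.
wrong_charset = wire_bytes.decode("latin-1")

print(same_charset == sentence)   # True
print(wrong_charset == sentence)  # False
```

For pure ASCII input the two decodings agree, which is why plain English text works without any extra care.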
Connecting from Python
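The Python client was unexpectedly easy; not counting the import and the two constants, it was three lines altogether:
    import telnetlib

    HOST = "localhost"
    PORT = 10007

    tn = telnetlib.Telnet(HOST, PORT)
    # title is the sentence to tag. It has to end with "\n",
    # because the server reads its input line by line.
    tn.write(title)
    response = tn.read_until("\n")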
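The telnetlib module was deprecated in Python 3.11 and removed from the standard library in 3.13, and under Python 3 it expects bytes rather than strings. On a current interpreter the same exchange can be done with the socket module. The following is a minimal sketch, not the original client: TaggerClient is a hypothetical helper name, and UTF-8 is an assumption that only holds if the server decodes with the same charset.

```python
import socket


class TaggerClient:
    """Hypothetical helper: sends one sentence per line, reads one tagged line back."""

    def __init__(self, host="localhost", port=10007, encoding="utf-8"):
        self.encoding = encoding  # assumption: the server uses the same charset
        self.sock = socket.create_connection((host, port))
        # Buffered text reader over the socket, so readline() returns whole lines.
        self.reader = self.sock.makefile("r", encoding=encoding)

    def tag(self, sentence):
        # The server reads line by line, so collapse whitespace runs
        # (including line breaks) and end the request with exactly one newline.
        line = " ".join(sentence.split()) + "\n"
        self.sock.sendall(line.encode(self.encoding))
        reply = self.reader.readline()
        if not reply:
            raise ConnectionError("the server closed the connection")
        return reply.rstrip("\r\n")

    def close(self):
        # Closing both objects releases the socket, so the server sees a
        # normal end of stream rather than a vanished client.
        self.reader.close()
        self.sock.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
```

With the server running, `with TaggerClient() as client: response = client.tag(title)` would replace the three telnetlib lines and close the connection when the block ends.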
Performance and issues
The text corpus I was parsing was quite big (10+ GB), so I quickly realised that parsing all of it wouldn't be extremely fast. I was using an old Linux server (512 MB, Core 2 Duo) to parse the data overnight. The whole process took around 20 hours, with an average of ~100 sentences tagged per second.
The only issue I faced was that the Stanford POS tagger once in a while ran out of heap memory, so I had to increase the maximum heap size; 500 MB seemed to be enough (-mx500m).
Otherwise the process was surprisingly stable and performed well even on that really old machine. The POS tagger did not leak memory at all, so I wouldn't hesitate to run the same setup again next time.
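As a back-of-the-envelope check, the two figures above imply roughly 7.2 million tagged sentences. The sketch below only does that arithmetic; estimated_hours is a hypothetical helper, and the sentence count is derived from the reported rate and duration, not measured separately.

```python
def estimated_hours(sentence_count, sentences_per_second=100):
    """Rough wall-clock estimate for tagging sentence_count sentences."""
    return sentence_count / sentences_per_second / 3600


# 20 hours at ~100 sentences per second:
sentences_tagged = 100 * 20 * 3600
print(sentences_tagged)                   # 7200000
print(estimated_hours(sentences_tagged))  # 20.0
```

The same function gives a quick estimate for other corpus sizes or for a faster machine by changing sentences_per_second.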