Python Tika guide
IMPORTANT NOTE: Thanks to Chris Wilson's work, it seems that a simple pip install git+git://github.com/aptivate/python-tika.git on the command line will do the job. Much better, isn't it? See http://blog.aptivate.org/2012/02/01/content-indexing-in-django-using-apache-tika/ for more info. The following is now clearly deprecated; I keep it here just in case...
This document is a very short guide to building and using Tika (an all-purpose library for extracting content and metadata from documents) through a Python wrapper. The wrapper is built using JCC.
http://lucene.apache.org/tika/
http://lucene.apache.org/pylucene/jcc/index.html
So far, I have only tested the few features I am interested in.
Install
Install JCC: http://lucene.apache.org/pylucene/jcc/documentation/install.html
Install Tika: http://lucene.apache.org/tika/0.7/gettingstarted.html
Don't forget to run mvn install in the Tika directory.
You will need the jar files from tika-parsers/target, tika-core/target and tika-app/target.
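Gathering those jars by hand is easy to get wrong, so here is a minimal sketch that copies them into one place. It assumes the Maven build has already run; the function name collect_tika_jars, the tika_dir argument and the default jar destination directory are my own illustrative choices, not part of Tika or JCC.

```python
import glob
import os
import shutil


def collect_tika_jars(tika_dir, dest_dir="jar"):
    """Copy the built Tika jars into dest_dir and return the copied paths.

    Hypothetical helper: tika_dir is assumed to be the Tika source tree
    in which `mvn install` was run.
    """
    modules = ("tika-parsers", "tika-core", "tika-app")
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    copied = []
    for module in modules:
        # Matches e.g. tika-core/target/tika-core-0.7.jar. Note that it would
        # also match extra artifacts such as a -sources jar if one exists.
        pattern = os.path.join(tika_dir, module, "target", module + "-*.jar")
        for jar in sorted(glob.glob(pattern)):
            shutil.copy(jar, dest_dir)
            copied.append(os.path.join(dest_dir, os.path.basename(jar)))
    return copied
```

Calling collect_tika_jars("/path/to/tika") would leave the three jars in ./jar, which is the layout the JCC command expects.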
Build the Tika Python wrapper with JCC:
> cd jcc/jcc
> sudo python __main__.py --jar jar/tika-parsers-0.7.jar --jar jar/tika-core-0.7.jar java.io.File java.io.FileInputStream java.io.StringBufferInputStream --package org.xml.sax.ContentHandler --package org.xml.sax.SAXException --include jar/tika-app-0.7.jar --python tika --reserved asm --build --install
I have been told that the package option should be "--package org.xml.sax". I don't know whether this is due to a version change, and I haven't tested it, but try it if you have problems with the command as written.
1 Feb 2012: thanks to a fellow Tika user for this input:
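Since the only things that change between builds are the Tika version and the package option, the argument list can be assembled in Python. This is only a sketch: jcc_args and its parameters are names I made up, every flag is copied from the command above, and the default uses the single "--package org.xml.sax" variant instead of the two original package options. It leaves out sudo and does not run anything.

```python
def jcc_args(version, package="org.xml.sax", jar_dir="jar"):
    """Return the JCC build command as a list of arguments.

    Hypothetical helper: it only assembles the flags used in this guide.
    """
    def jar(name):
        return "%s/tika-%s-%s.jar" % (jar_dir, name, version)

    return [
        "python", "__main__.py",
        "--jar", jar("parsers"),
        "--jar", jar("core"),
        "java.io.File", "java.io.FileInputStream",
        "java.io.StringBufferInputStream",
        "--package", package,
        "--include", jar("app"),
        "--python", "tika",
        "--reserved", "asm",
        "--build", "--install",
    ]


# Print the command for Tika 0.7 so it can be checked before running it.
print(" ".join(jcc_args("0.7")))
```

The list can be passed to subprocess.call() from the jcc/jcc directory once it looks right.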
I concur with the need to change the package to "--package org.xml.sax".
Without this, I do not get "errors" during the compilation process,
but jcc silently ignores the all-important AutoDetectParser.parse() method,
and produces a wrapper with no such method in it, because it doesn't recognise the return type.
This causes the example code that you gave to fail because of the missing method.
I also needed to add an OSGI library for Tika 1.0, which I happened to find on my system, so my final command was:
python ../jcc/jcc/__main__.py \
  --include /usr/share/java/org.eclipse.osgi.jar \
  --jar tika-parsers-1.0.jar \
  --jar tika-core-1.0.jar \
  java.io.File java.io.FileInputStream \
  java.io.StringBufferInputStream \
  --package org.xml.sax \
  --include tika-app-1.0.jar \
  --python tika --version 1.0 --reserved asm
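Because JCC can drop a method without reporting an error, it is worth checking the generated wrapper right after the build. Below is a minimal sketch of such a check; missing_methods is a hypothetical helper of mine, and the commented usage assumes the wrapper was built and installed as the tika module.

```python
def missing_methods(cls, names):
    """Return the entries of names that are not callable attributes of cls."""
    return [name for name in names if not callable(getattr(cls, name, None))]


# Usage with the generated wrapper (requires the built tika module):
#   import tika
#   tika.initVM()
#   print(missing_methods(tika.AutoDetectParser, ["parse"]))
# An empty list means parse() made it into the wrapper.
```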
Usage example
In a Python console:
# Setup module and virtual machine
import tika
tika.initVM()
# The all purpose parser from Tika (html, pdf, open documents, etc...)
parser = tika.AutoDetectParser()
# Create input from a small fake html code
# Alternatively you can use: input = tika.FileInputStream(tika.File("/path/to/example"))
input = tika.StringBufferInputStream("<html><title>My title</title><body>My body</body></html>")
# Create handler for content, metadata and context
content = tika.BodyContentHandler()
metadata = tika.Metadata()
context = tika.ParseContext()
# Parse the data and display result
parser.parse(input,content,metadata,context)
content.toString()
> u'My body'
metadata.toString()
> u'title=My title Content-Encoding=UTF-8 Content-Type=text/html '
metadata.get('title')
> u'My title'
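For real files, the same steps can be wrapped in a small function. This is a sketch under assumptions rather than a tested recipe: extract is a hypothetical name, the wrapper module is passed in as an argument so the function does not import it itself, and I assume the generated FileInputStream wrapper exposes Java's close() method.

```python
def extract(tika, path):
    """Return (body_text, metadata) for the file at path.

    tika is the JCC-built module; tika.initVM() must already have been
    called once in this process.
    """
    parser = tika.AutoDetectParser()
    stream = tika.FileInputStream(tika.File(path))
    content = tika.BodyContentHandler()
    metadata = tika.Metadata()
    context = tika.ParseContext()
    try:
        parser.parse(stream, content, metadata, context)
    finally:
        # Assumption: java.io.FileInputStream.close() is available on the wrapper.
        stream.close()
    return content.toString(), metadata


# Usage (requires the built tika module):
#   import tika
#   tika.initVM()
#   text, metadata = extract(tika, "/path/to/example")
#   print(metadata.get("title"))
```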