本文概述
为了提取诸如xls文件之类的Microsoft Office文件, Tika提供了OOXMLParser类。此类用于从Microsoft文件提取内容和元数据。它位于org.apache.tika.parser.microsoft.ooxml包中, 并包含下表中列出的各种构造函数和方法。
Tika OOXMLParser构造函数
Constructor | Description |
---|---|
public OOXMLParser() | 它用于实例化类。 |
Method | Description |
---|---|
公共Set <MediaType> getSupportedTypes(ParseContext上下文) | 它返回此解析器支持的媒体类型集。 |
公共无效解析(InputStream流, ContentHandler处理程序, 元数据元数据, ParseContext上下文)引发IOException, SAXException, TikaException | 它将文档流解析为一系列XHTML SAX事件。 |
OOXMLParser示例
package tikaexample;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class MSOfficeExample {
public static void main(String[] args) throws IOException, SAXException, TikaException {
BodyContentHandler handler = new BodyContentHandler();
OOXMLParser parser = new OOXMLParser();
Metadata metadata = new Metadata();
ParseContext pcontext = new ParseContext();
try (InputStream stream = AutoDetectParseExample.class.getResourceAsStream("srcmini.xls")) {
parser.parse(stream, handler, metadata, pcontext);
System.out.println("Document Content:" + handler.toString());
System.out.println("Document Metadata:");
String[] metadatas = metadata.names();
for(String data : metadatas) {
System.out.println(data + ": " + metadata.get(data));
}
}catch(Exception e) {System.out.println("Exception message: "+ e.getMessage());}
}
}
我们的文件包含以下内容。
输出
Document Content:Sheet1
Employee Manual Punch
In Time Out Time Device Total Minute Total Time Working Minutes
01-Nov-17 8:27:00 AM 01-Nov-17 6:30:00 PM 1 603 540 -63
02-Nov-17 8:09:00 AM 02-Nov-17 6:30:00 PM 1 621 540 -81
03-Nov-17 8:25:00 AM 03-Nov-17 6:30:00 PM 1 605 540 -65
Document Metadata:
date: 2018-05-06T11:20:06Z
cp:revision: 1
custom:DocSecurity: 0
dc:creator: Reception
dcterms:created: 2017-12-03T08:38:57Z
language: en-IN
Last-Modified: 2018-05-06T11:20:06Z
dcterms:modified: 2018-05-06T11:20:06Z
Last-Save-Date: 2018-05-06T11:20:06Z
Template:
protected: false
meta:save-date: 2018-05-06T11:20:06Z
Application-Name: LibreOffice/5.1.6.2$Linux_X86_64 LibreOffice_project/10m0$Build-2
modified: 2018-05-06T11:20:06Z
custom:LinksUpToDate: false
Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
creator: Reception
dc:language: en-IN
meta:author: Reception
meta:creation-date: 2017-12-03T08:38:57Z
extended-properties:Application: LibreOffice/5.1.6.2$Linux_X86_64 LibreOffice_project/10m0$Build-2
custom:ShareDoc: false
custom:ScaleCrop: false
Creation-Date: 2017-12-03T08:38:57Z
custom:HyperlinksChanged: false
Revision-Number: 1
extended-properties:Template:
custom:AppVersion: 12.0000
评论前必须登录!
注册