Dedoc
Dedoc is an open-source library/service that extracts text, tables, attached files and document structure (e.g., titles and list items) from files of various formats.
Dedoc supports DOCX, XLSX, PPTX, EML, HTML, PDF, images and more.
The full list of supported formats can be found in the Dedoc documentation.
Installation and Setup
Dedoc library
You can install Dedoc using pip. In this case, you also need to install its dependencies; see the Dedoc documentation for more information.
pip install dedoc
Dedoc API
If you are going to use the Dedoc API, you don't need to install the dedoc library. Instead, you should run the Dedoc service, e.g. as a Docker container (please see the documentation for more details):
docker pull dedocproject/dedoc
docker run -p 1231:1231 dedocproject/dedoc
Document Loader
- For handling files of any format supported by Dedoc, you can use DedocFileLoader:
  from langchain_community.document_loaders import DedocFileLoader
- For handling PDF files (with or without a textual layer), you can use DedocPDFLoader:
  from langchain_community.document_loaders import DedocPDFLoader
- For handling files of any format without installing the library, you can use the Dedoc API with DedocAPIFileLoader:
  from langchain_community.document_loaders import DedocAPIFileLoader
Please see a usage example for more details.
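The choice between the three loaders can be sketched roughly as follows. This is a minimal illustration, not the official usage: the helper names `pick_loader_name` and `load_documents` are hypothetical, the service address is assumed from the `1231:1231` port mapping of the Docker container, and the `url` keyword of `DedocAPIFileLoader` is an assumption to verify against the loader's API reference.

```python
import importlib
from pathlib import Path
from typing import Any, List

# Assumed address of a locally running Dedoc container (port mapping 1231:1231).
DEDOC_URL = "http://0.0.0.0:1231"


def pick_loader_name(file_path: str, use_api: bool = False) -> str:
    """Return the name of the Dedoc loader class suited to the file (hypothetical helper)."""
    if use_api:
        # No local dedoc installation is needed; the running service does the parsing.
        return "DedocAPIFileLoader"
    if Path(file_path).suffix.lower() == ".pdf":
        # PDF files, with or without a textual layer.
        return "DedocPDFLoader"
    # Any other format supported by Dedoc.
    return "DedocFileLoader"


def load_documents(file_path: str, use_api: bool = False) -> List[Any]:
    """Load a file into LangChain documents with the chosen Dedoc loader."""
    name = pick_loader_name(file_path, use_api)
    # Imported lazily so pick_loader_name works without langchain_community installed.
    module = importlib.import_module("langchain_community.document_loaders")
    loader_cls = getattr(module, name)
    if use_api:
        # Assumption: the service address is passed through the `url` keyword.
        loader = loader_cls(file_path, url=DEDOC_URL)
    else:
        loader = loader_cls(file_path)
    return loader.load()
```

With the dedoc library installed, calling `load_documents("example.pdf")` (a hypothetical file name) would use `DedocPDFLoader`; passing `use_api=True` instead requires only a running Dedoc service.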