huang zuxing blog @ home the quieter you become, the more you are able to hear.

14 Sep 2009

sphinx doc2txt pdf2txt xls2txt ppt2txt

Lately I have been using Sphinx to build an in-house search engine.

On Linux, I use the tools below to convert office files to plain text:

http://wizard.ae.krakow.pl/~jb/xls2txt  converts xls files to text format

http://vitus.wagner.pp.ru/software/catdoc/  converts ppt files to text format

http://www.abisource.com/  converts doc and pdf files to text format

Dealing with Windows files and folders that have spaces in their names:

Sample file name: /a/f/cso/flow doc/2008 version/read start file.ppt

The goal is to scan thousands of these files on a Linux machine and import their content into a MySQL database, which Sphinx then indexes for search.
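As a rough sketch of the scanning step, one way to walk a directory tree without breaking on spaces is to let find hand each path to the converter as a single argument. The helper name for_each_ppt and the demo directory are made up for illustration, and cat stands in for a real converter such as catppt, so treat this as an assumption-laden example rather than the script used here.

```shell
#!/bin/bash
# Sketch only: for_each_ppt is a hypothetical helper, not part of the
# original script. Usage: for_each_ppt ROOT COMMAND [ARGS...]
for_each_ppt() {
    root=$1
    shift
    # find passes each matching path to the command as one argument,
    # so spaces in file or folder names never split a path.
    find "$root" -type f -name '*.ppt' -exec "$@" {} \;
}

# Demo on a throwaway tree that mimics the sample file name above.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/flow doc/2008 version"
printf 'slide text\n' > "$demo_root/flow doc/2008 version/read start file.ppt"

# 'cat' is only a stand-in for a real converter such as catppt.
for_each_ppt "$demo_root" cat
```

Because the path never passes through word splitting, this approach does not depend on the value of IFS.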

Part of the script:

#!/bin/bash

USAGE()
{
cat << EOF
Usage:  $0 file_format file
Sample: $0 pdf abc.pdf
Sample: $0 xls abc.xls
Use $0 to convert a doc, pdf, ppt, or xls file to text format
EOF
}
FORMAT=$1
PPT2TXT=$TOOLS_DIR/catdoc-0.94.2/bin/catppt
#....
#....
# Set IFS to newline (plus backspace) so spaces in file names do not split words
IFS=$(echo -en "\n\b")

#checking file ....

if [ "$FORMAT" = ppt ]
then
# Quote every expansion so a path with spaces stays a single argument
echo "$2" | while read -r FILE_INFO; do "$PPT2TXT" "$FILE_INFO" > "$TEMP_FILE_NAME"; done
fi
#mysql codes ...
#....
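
The script above only shows the ppt branch. A possible way to choose a tool for the other formats is a case statement, sketched below. The helper name pick_converter is hypothetical, and the command names assume each tool listed earlier is installed under its usual name on PATH. The sketch only returns a command name and does not guess at any tool's options.

```shell
#!/bin/bash
# Sketch only: pick_converter is a hypothetical helper, and the command
# names assume each tool is installed under its usual name on PATH.
pick_converter() {
    case $1 in
        ppt)     echo catppt ;;    # shipped with catdoc
        xls)     echo xls2txt ;;
        doc|pdf) echo abiword ;;
        *)       echo "unsupported format: $1" >&2
                 return 1 ;;
    esac
}

pick_converter ppt
pick_converter pdf
```

An unknown format returns a non-zero status, so a caller can print the usage text and exit instead of running the wrong tool.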