Lately I have been using Sphinx to build an in-house search engine.

On Linux I use separate tools for each conversion: one to convert xls files into text format, one to convert ppt files into text format, and one to convert doc and pdf files into text format.

The script has to deal with Windows files and folders that have spaces in their names:

Sample file name: /a/f/cso/flow doc/2008 version/read start file.ppt

The goal is to scan thousands of these files on a Linux machine and import their content into a MySQL database, so that Sphinx can build a search index from it.
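As a rough sketch of the scanning step, one way to walk a directory tree without tripping over spaces is to let find hand the paths to a small inline shell as separate arguments. The function name scan_documents and the tab-separated output are my own assumptions for illustration, not part of the original script.

```shell
# scan_documents ROOT (hypothetical helper)
# Prints "format<TAB>path" for every doc, pdf, ppt and xls file under ROOT.
# find passes each path as one argument, so spaces in names stay intact.
scan_documents() {
    find "$1" -type f \
        \( -name '*.doc' -o -name '*.pdf' -o -name '*.ppt' -o -name '*.xls' \) \
        -exec sh -c '
            for path in "$@"; do
                # The extension doubles as the file_format argument.
                printf "%s\t%s\n" "${path##*.}" "$path"
            done
        ' sh {} +
}
```

Each output line could then be fed to the conversion script as its file_format and file arguments. The match is case-sensitive, so files named like REPORT.PPT would need extra patterns.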

Part of script:


cat << EOF
Usage:  $0 file_format file
Sample: $0 pdf abc.pdf
Sample: $0 xls abc.xls
Use $0 to convert a doc, pdf, ppt, or xls file to text format.
EOF

# Split words on newline and backspace only, so spaces in file names are kept
IFS=$(echo -en "\n\b")

#checking file ....

if [ "$FORMAT" = ppt ]
then
    # read -r keeps the whole line, spaces included, in FILE_INFO
    echo "$2" | while read -r FILE_INFO ; do $PPT2TXT "$FILE_INFO" > "$TEMP_FILE_NAME" ; done
fi
#mysql codes ...
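
The same pattern can be extended to the other formats with a case statement. This is only a sketch: PPT2TXT comes from the script above, while XLS2TXT, DOC2TXT, PDF2TXT and the function name convert_to_text are hypothetical names of mine. It also assumes every converter writes its text to standard output, as the ppt one does.

```shell
# convert_to_text FORMAT FILE OUTPUT (hypothetical helper)
# Picks the converter command for FORMAT and writes the text to OUTPUT.
convert_to_text() {
    format=$1
    file=$2
    output=$3
    case "$format" in
        ppt) converter=${PPT2TXT:-} ;;
        xls) converter=${XLS2TXT:-} ;;   # assumed variable name
        doc) converter=${DOC2TXT:-} ;;   # assumed variable name
        pdf) converter=${PDF2TXT:-} ;;   # assumed variable name
        *)   echo "unsupported format: $format" >&2; return 2 ;;
    esac
    if [ -z "$converter" ]; then
        echo "no converter configured for $format" >&2
        return 3
    fi
    # Quoting "$file" keeps a name with spaces as a single argument.
    "$converter" "$file" > "$output"
}
```

A call would look like: convert_to_text ppt "/a/f/cso/flow doc/2008 version/read start file.ppt" "$TEMP_FILE_NAME"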
