Thursday, August 10, 2017

How to read a Word document (.doc or .docx) in Java?

Question: How to read a Word document (.doc or .docx) in Java?

Reference Documents:
tempdata.doc
tempdata_1.docx

Problem:
1. The user can supply a Word document in either format (.doc or .docx), as listed above.
2. Identify the extension of the Word file.
3. Read the file and print all of its content on the console.
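
For point 2, here is a minimal sketch of how an extension can be pulled out of a path using only the JDK. The class name ExtensionSketch and the method extensionOf are hypothetical names chosen for illustration; the post itself relies on FilenameUtils.getExtension() from Apache Commons IO, and this sketch is not claimed to match that method in every edge case.

```java
import java.util.Locale;

// Hypothetical helper (not part of the original post): extracts the
// extension of a file path with plain JDK string handling.
final class ExtensionSketch {

    /** Returns the lower-cased extension without the dot, or "" if there is none. */
    static String extensionOf(String path) {
        // Look only at the last path segment so a dot in a folder name is ignored.
        int lastSeparator = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        String fileName = path.substring(lastSeparator + 1);

        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return ""; // no dot at all, or the name ends with a dot
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("input_Word/tempdata.doc"));     // doc
        System.out.println(extensionOf("input_Word/tempdata_1.DOCX"));  // docx
        System.out.println(extensionOf("input.dir/README").isEmpty());  // true
    }
}
```

Lower-casing the result means the caller can compare against "doc" and "docx" directly, whatever case the user typed.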

Answer:
Step #1 Create two Word documents: tempdata.doc and tempdata_1.docx.
Step #2 Download the jar files listed in the 'Reference Jar files' section and add them to your Java project.
Step #3 Copy the code below into your class file and run it to observe the output.
Step #4 To identify the extension of the Word file, the code uses the getExtension() method of the FilenameUtils class (from Apache Commons IO):
                               String fileExtension = FilenameUtils.getExtension(filePath);
Step #5 Once we have the file extension, we call the matching read method.
Step #6 For a file with the extension ".docx" we use the XWPFDocument and XWPFParagraph classes (fis is a FileInputStream opened on the file):
                               XWPFDocument doc = new XWPFDocument(fis);
                               List<XWPFParagraph> getDocParagraphs = doc.getParagraphs();
Step #7 For a file with the extension ".doc" we use the HWPFDocument and WordExtractor classes:
                               HWPFDocument doc = new HWPFDocument(fis);
                               WordExtractor extractor = new WordExtractor(doc);
Note: Please change the path of the Word document in the code to match its location on your machine.
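
Steps #5 to #7 boil down to "pick a reader based on the extension". A rough sketch of that decision on its own, with no POI calls, might look like the following. The enum WordFormat and its fromExtension method are hypothetical names introduced here for illustration; the reader descriptions are just labels taken from the steps above.

```java
import java.util.Locale;

// Hypothetical enum (not from the original post): one constant per
// supported Word format, labelled with the POI classes named in the steps.
enum WordFormat {
    DOC("HWPFDocument + WordExtractor"),
    DOCX("XWPFDocument + XWPFParagraph");

    private final String reader;

    WordFormat(String reader) {
        this.reader = reader;
    }

    String reader() {
        return reader;
    }

    /** Maps an extension (any case) to a format, rejecting everything else. */
    static WordFormat fromExtension(String extension) {
        switch (extension.toLowerCase(Locale.ROOT)) {
            case "doc":
                return DOC;
            case "docx":
                return DOCX;
            default:
                throw new IllegalArgumentException("Unsupported Word extension: " + extension);
        }
    }
}

final class DispatchSketch {
    public static void main(String[] args) {
        for (String extension : new String[] {"doc", "DOCX"}) {
            WordFormat format = WordFormat.fromExtension(extension);
            System.out.println(extension + " -> " + format + " (" + format.reader() + ")");
        }
    }
}
```

Throwing on an unknown extension is one possible choice; it makes a wrong file type fail loudly rather than produce no output at all.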

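Reference Jar files:
1. Navigate to the download link: poi-bin-3.16-20170419.tar.gz
2. Click on the first link, poi-bin-3.16-20170419.tar.gz.
3. The archive downloads automatically; extract it to get the jar files.
4. Add the jar files to your project using the 'Configure Build Path' option.
Note: FilenameUtils belongs to Apache Commons IO. If the commons-io jar is not among the extracted jars, download it separately and add it to the build path as well.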

Code:

package word;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.List;
import org.apache.commons.io.FilenameUtils;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public class WordHandling
{
    public static void main(String[] args) throws IOException
    {
        // Location of the Word document file; change it to match your machine.
        String filePath = "input_Word/tempdata_1.docx";
        loadFile(filePath);
    }

    public static void loadFile(String filePath) throws IOException
    {
        File file = new File(filePath);                              // Creating the File object.
        String fileExtension = FilenameUtils.getExtension(filePath); // Getting the extension of the file.
        if (fileExtension.equalsIgnoreCase("docx"))
        {
            readDocxFile(file);
        }
        else if (fileExtension.equalsIgnoreCase("doc"))
        {
            readDocFile(file);
        }
        else
        {
            // Neither .doc nor .docx: report it instead of silently doing nothing.
            System.out.println("Unsupported file extension: " + fileExtension);
        }
    }

    // Reading data from a ".docx" file.
    public static void readDocxFile(File file) throws IOException
    {
        // try-with-resources closes the document and the stream even if reading fails.
        try (FileInputStream fis = new FileInputStream(file);
             XWPFDocument doc = new XWPFDocument(fis))
        {
            // Getting all the paragraphs from the document as a List.
            List<XWPFParagraph> getDocParagraphs = doc.getParagraphs();

            // Getting the total number of paragraphs in the Word document.
            int totalParagraphs = getDocParagraphs.size();
            System.out.println("Total number of paragraphs : " + totalParagraphs);

            // Print the document content on the console.
            for (XWPFParagraph currentParagraph : getDocParagraphs)
            {
                System.out.println(currentParagraph.getText());
            }
        }
    }

    // Reading data from a ".doc" file.
    public static void readDocFile(File file) throws IOException
    {
        // try-with-resources closes the extractor and the stream even if reading fails.
        try (FileInputStream fis = new FileInputStream(file);
             WordExtractor extractor = new WordExtractor(new HWPFDocument(fis)))
        {
            // Getting all the paragraphs from the document as a String array.
            String[] getDocParagraphs = extractor.getParagraphText();

            // Getting the total number of paragraphs in the Word document.
            int totalParagraphs = getDocParagraphs.length;
            System.out.println("Total count of paragraphs : " + totalParagraphs + "\n");

            // Print the document content on the console.
            for (String currentPara : getDocParagraphs)
            {
                System.out.print(currentPara);
            }
        }
    }
}
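
The code above trusts the file name. If a file might carry the wrong extension, one possible extra check is to look at its first few bytes: binary .doc files are OLE2 compound files and .docx files are ZIP archives, and each container starts with a well-known signature. The sketch below is an illustration only, with a hypothetical class name WordSignatureSketch. It tells the two containers apart but cannot confirm the file is really a Word document, because other formats (for example .xls or .xlsx) use the same containers.

```java
import java.util.Arrays;

// Hypothetical helper (not part of the original post): classifies a file
// by the container signature found in its first bytes.
final class WordSignatureSketch {

    // OLE2 compound file header, the container used by binary .doc files.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header ("PK\3\4"), the container used by .docx files.
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};

    /** Returns "ole2", "zip", or "unknown" for the given leading bytes of a file. */
    static String detect(byte[] firstBytes) {
        if (startsWith(firstBytes, OLE2)) {
            return "ole2";
        }
        if (startsWith(firstBytes, ZIP)) {
            return "zip";
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] zipLike = {'P', 'K', 0x03, 0x04, 0x14, 0x00};
        byte[] plainText = {'h', 'e', 'l', 'l', 'o'};
        System.out.println(detect(zipLike));    // zip
        System.out.println(detect(plainText));  // unknown
    }
}
```

In practice the caller would read the first eight bytes of the file into an array and pass them to detect() before choosing which reader to use.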

Please do comment and share the post with your friends and colleagues. For any query or question, you can also email me at ashu.kumar940@gmail.com.

Other Blogs:
https://agilehelpdoc.blogspot.in/
